NumPy API on TensorFlow

Overview

TensorFlow implements a subset of the NumPy API, available as tf.experimental.numpy. This lets you run NumPy code accelerated by TensorFlow while keeping access to all of TensorFlow's APIs.

Setup

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
import timeit

print("Using TensorFlow version %s" % tf.__version__)

Enabling NumPy behavior

To use tnp as NumPy, enable NumPy behavior for TensorFlow:

tnp.experimental_enable_numpy_behavior()

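This call enables type promotion in TensorFlow. It also changes type inference when converting literals to tensors, so that it follows the NumPy standard more strictly.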

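TensorFlow NumPy ND array

An instance of tf.experimental.numpy.ndarray, called an **ND array**, represents a multidimensional dense array of a given dtype placed on a certain device. It is an alias to tf.Tensor. Check out the ND array class for useful methods such as ndarray.T, ndarray.reshape, and ndarray.ravel.

First create an ND array object, and then invoke different methods.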

# Create an ND array and check out different attributes.
ones = tnp.ones([5, 3], dtype=tnp.float32)
print("Created ND array with shape = %s, rank = %s, "
      "dtype = %s on device = %s\n" % (
          ones.shape, ones.ndim, ones.dtype, ones.device))

# `ndarray` is just an alias to `tf.Tensor`.
print("Is `ones` an instance of tf.Tensor: %s\n" % isinstance(ones, tf.Tensor))

# Try commonly used member functions.
print("ndarray.T has shape %s" % str(ones.T.shape))
print("ndarray.reshape(-1) has shape %s" % ones.reshape(-1).shape)

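Type promotion

There are 4 options for type promotion in TensorFlow.

- By default, TensorFlow raises errors instead of promoting types for mixed-type operations.
- Running tnp.experimental_enable_numpy_behavior() switches TensorFlow to use NumPy type promotion rules (described below).
- After TensorFlow 2.15, there are two new options (refer to TF NumPy Type Promotion for details):
  - tnp.experimental_enable_numpy_behavior(dtype_conversion_mode="all") uses JAX type promotion rules.
  - tnp.experimental_enable_numpy_behavior(dtype_conversion_mode="safe") uses JAX type promotion rules, but disallows certain unsafe promotions.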

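NumPy type promotion

TensorFlow NumPy APIs have well-defined semantics for converting literals to ND arrays and for performing type promotion on ND array inputs. See np.result_type for more details.

TensorFlow APIs leave tf.Tensor inputs unchanged and do not perform type promotion on them, while TensorFlow NumPy APIs promote all inputs according to NumPy type promotion rules. In the next example, you will perform type promotion. First, run addition on ND array inputs of different types and note the output types. None of these type promotions would be allowed by TensorFlow APIs.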

print("Type promotion for operations")
values = [tnp.asarray(1, dtype=d) for d in
          (tnp.int32, tnp.int64, tnp.float32, tnp.float64)]
for i, v1 in enumerate(values):
  for v2 in values[i + 1:]:
    print("%s + %s => %s" %
          (v1.dtype.name, v2.dtype.name, (v1 + v2).dtype.name))
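
The promotions printed by the loop above should agree with what plain NumPy reports through np.result_type. The following sketch uses only NumPy (no TensorFlow) and tabulates the promoted type for the same pairs of dtypes; the `promotions` dictionary is just an illustrative name.

```python
import itertools

import numpy as np

dtypes = (np.int32, np.int64, np.float32, np.float64)

# Map each pair of dtype names to the name of the promoted dtype.
promotions = {
    (np.dtype(a).name, np.dtype(b).name): np.result_type(a, b).name
    for a, b in itertools.combinations(dtypes, 2)
}

for (a, b), result in promotions.items():
  print("%s + %s => %s" % (a, b, result))
```

Note that int32 combined with float32 promotes to float64 under NumPy rules, because float32 cannot represent every int32 value exactly.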

Finally, convert literals to ND arrays using tnp.asarray and note the resulting type.

print("Type inference during array creation")
print("tnp.asarray(1).dtype == tnp.%s" % tnp.asarray(1).dtype.name)
print("tnp.asarray(1.).dtype == tnp.%s\n" % tnp.asarray(1.).dtype.name)

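When converting literals to ND arrays, NumPy prefers wide types such as tnp.int64 and tnp.float64. In contrast, tf.convert_to_tensor prefers tf.int32 and tf.float32 when converting constants to tf.Tensor. TensorFlow NumPy APIs follow the NumPy behavior for integers. For floats, the prefer_float32 argument of experimental_enable_numpy_behavior lets you control whether tf.float32 is preferred over tf.float64 (it defaults to False). For example: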

tnp.experimental_enable_numpy_behavior(prefer_float32=True)
print("When prefer_float32 is True:")
print("tnp.asarray(1.).dtype == tnp.%s" % tnp.asarray(1.).dtype.name)
print("tnp.add(1., 2.).dtype == tnp.%s" % tnp.add(1., 2.).dtype.name)

tnp.experimental_enable_numpy_behavior(prefer_float32=False)
print("When prefer_float32 is False:")
print("tnp.asarray(1.).dtype == tnp.%s" % tnp.asarray(1.).dtype.name)
print("tnp.add(1., 2.).dtype == tnp.%s" % tnp.add(1., 2.).dtype.name)

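Broadcasting

Similar to TensorFlow, NumPy defines rich semantics for "broadcasting" values. You can check out the NumPy broadcasting guide for more information and compare it with TensorFlow broadcasting semantics.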

x = tnp.ones([2, 3])
y = tnp.ones([3])
z = tnp.ones([1, 2, 1])
print("Broadcasting shapes %s, %s and %s gives shape %s" % (
    x.shape, y.shape, z.shape, (x + y + z).shape))
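
To make the rule behind that result explicit, here is a minimal pure-Python sketch of how NumPy-style broadcasting derives a result shape: shapes are aligned from the right, missing leading dimensions count as 1, and each position must hold either a single common size or 1. The helper name `broadcast_shape` is hypothetical and not part of TensorFlow or NumPy.

```python
def broadcast_shape(*shapes):
  """Returns the broadcast result shape, or raises ValueError."""
  ndim = max(len(s) for s in shapes)
  result = []
  for axis in range(1, ndim + 1):
    # Sizes at this position, counted from the right. A missing
    # dimension behaves like a dimension of size 1.
    sizes = {s[-axis] if axis <= len(s) else 1 for s in shapes}
    sizes.discard(1)
    if len(sizes) > 1:
      raise ValueError("Incompatible shapes: %s" % (shapes,))
    result.append(sizes.pop() if sizes else 1)
  return tuple(reversed(result))

shape = broadcast_shape((2, 3), (3,), (1, 2, 1))
print(shape)
```

For the three shapes used above, the sketch yields (1, 2, 3), the same shape the addition produces.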

Indexing

NumPy defines very sophisticated indexing rules. See the NumPy Indexing guide. Note the use of ND arrays as indices below.

x = tnp.arange(24).reshape(2, 3, 4)

print("Basic indexing")
print(x[1, tnp.newaxis, 1:3, ...], "\n")

print("Boolean indexing")
print(x[:, (True, False, True)], "\n")

print("Advanced indexing")
print(x[1, (0, 0, 1), tnp.asarray([0, 1, 1])])

# Mutation is currently not supported
try:
  tnp.arange(6)[1] = -1
except TypeError:
  print("Currently, TensorFlow NumPy does not support mutation.")
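
Because ND arrays cannot be mutated, an in-place assignment is usually rewritten as an expression that builds a new array. The following is one possible sketch, assuming eager execution: tnp.where picks the replacement value at the chosen position and keeps the original values elsewhere. The variable names are illustrative.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

x = tnp.arange(6)

# Boolean mask that is True only at index 1.
mask = tnp.equal(tnp.arange(6), 1)

# Build a new array instead of assigning into `x`.
updated = tnp.where(mask, -1, x)

print(updated)
```

The original array `x` is left unchanged; only `updated` carries the new value.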

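Example model

Next, you can see how to create a model and run inference on it. This simple model applies a relu layer followed by a linear projection. Later sections show how to compute gradients for this model using TensorFlow's GradientTape.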

class Model(object):
  """Model with a dense and a linear layer."""

  def __init__(self):
    self.weights = None

  def predict(self, inputs):
    if self.weights is None:
      size = inputs.shape[1]
      # Note that type `tnp.float32` is used for performance.
      stddev = tnp.sqrt(size).astype(tnp.float32)
      w1 = tnp.random.randn(size, 64).astype(tnp.float32) / stddev
      bias = tnp.random.randn(64).astype(tnp.float32)
      w2 = tnp.random.randn(64, 2).astype(tnp.float32) / 8
      self.weights = (w1, bias, w2)
    else:
      w1, bias, w2 = self.weights
    y = tnp.matmul(inputs, w1) + bias
    y = tnp.maximum(y, 0)  # Relu
    return tnp.matmul(y, w2)  # Linear projection

model = Model()
# Create input data and compute predictions.
print(model.predict(tnp.ones([2, 32], dtype=tnp.float32)))

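TensorFlow NumPy and NumPy

TensorFlow NumPy implements a subset of the full NumPy spec. While more symbols will be added over time, some systematic features will not be supported in the near future. These include NumPy C API support, Swig integration, Fortran storage order, views and stride_tricks, and some dtypes (such as np.recarray and np.object). For more details, see the TensorFlow NumPy API documentation.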

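NumPy interoperability

TensorFlow ND arrays can interoperate with NumPy functions. These objects implement the __array__ interface. NumPy uses this interface to convert function arguments to np.ndarray values before processing them.

Similarly, TensorFlow NumPy functions can accept inputs of different types, including np.ndarray. These inputs are converted to ND arrays by calling tnp.asarray on them.

Converting an ND array to and from np.ndarray may trigger actual data copies. See the section on buffer copies for more details.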

# ND array passed into NumPy function.
np_sum = np.sum(tnp.ones([2, 3]))
print("sum = %s. Class: %s" % (float(np_sum), np_sum.__class__))

# `np.ndarray` passed into TensorFlow NumPy function.
tnp_sum = tnp.sum(np.ones([2, 3]))
print("sum = %s. Class: %s" % (float(tnp_sum), tnp_sum.__class__))

# It is easy to plot ND arrays, given the __array__ interface.
labels = 15 + 2 * tnp.random.randn(1, 1000)
_ = plt.hist(labels)
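
To illustrate the __array__ protocol on its own, here is a small NumPy-only sketch. The `Celsius` class is hypothetical and exists only to show how NumPy converts an arbitrary object that implements __array__; ND arrays take part in the same mechanism.

```python
import numpy as np

class Celsius:
  """Hypothetical container that exposes its data through `__array__`."""

  def __init__(self, values):
    self._values = list(values)

  def __array__(self, dtype=None, copy=None):
    # NumPy calls this hook to obtain an `np.ndarray` for the object.
    return np.array(self._values, dtype=dtype)

temps = Celsius([20.5, 22.0, 19.5])
total = np.sum(temps)         # NumPy converts `temps` before reducing.
as_array = np.asarray(temps)  # Explicit conversion uses the same hook.
print(total, type(as_array))
```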

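Buffer copies

Intermixing TensorFlow NumPy with NumPy code may trigger data copies. This is because TensorFlow NumPy has stricter memory alignment requirements than NumPy.

When an np.ndarray is passed to TensorFlow NumPy, it checks the alignment requirements and triggers a copy if needed. When an ND array CPU buffer is passed to NumPy, the buffer generally satisfies the alignment requirements and NumPy does not need to create a copy.

ND arrays can refer to buffers placed on devices other than local CPU memory. In such cases, invoking a NumPy function triggers copies across the network or device as needed.

Given this, intermixing with NumPy API calls should generally be done with caution, and the user should watch out for the overhead of copying data. Interleaving TensorFlow NumPy calls with TensorFlow calls is generally safe and avoids copying data. See the section on TensorFlow interoperability for more details.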

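Operator precedence

TensorFlow NumPy defines an __array_priority__ higher than NumPy's. This means that for operators involving both an ND array and an np.ndarray, the former takes precedence: the np.ndarray input is converted to an ND array and the TensorFlow NumPy implementation of the operator is invoked.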

x = tnp.ones([2]) + np.ones([2])
print("x = %s\nclass = %s" % (x, x.__class__))
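
The deferral mechanism can be seen without TensorFlow. In the NumPy-only sketch below, the hypothetical `Wrapped` class declares a higher __array_priority__ than np.ndarray, so `np.ndarray.__add__` returns NotImplemented and Python falls back to `Wrapped.__radd__`. This is an illustration of the general protocol, not of TensorFlow's internal implementation.

```python
import numpy as np

class Wrapped:
  # np.ndarray has priority 0.0; a larger value asks NumPy to defer.
  __array_priority__ = 100

  def __init__(self, values):
    self.values = np.asarray(values)

  def __add__(self, other):
    if isinstance(other, Wrapped):
      other_values = other.values
    else:
      other_values = np.asarray(other)
    return Wrapped(self.values + other_values)

  # Invoked when the left operand (an np.ndarray here) defers.
  __radd__ = __add__

result = np.ones([2]) + Wrapped([1.0, 2.0])
print(type(result).__name__, result.values)
```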

TF NumPy and TensorFlow

TensorFlow NumPy is built on top of TensorFlow and therefore interoperates seamlessly with it.

`tf.Tensor` and ND array

ND array is an alias to tf.Tensor, so the two can be intermixed without triggering actual data copies.

x = tf.constant([1, 2])
print(x)

# `asarray` and `convert_to_tensor` here are no-ops.
tnp_x = tnp.asarray(x)
print(tnp_x)
print(tf.convert_to_tensor(tnp_x))

# Note that tf.Tensor.numpy() will continue to return `np.ndarray`.
print(x.numpy(), x.numpy().__class__)

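TensorFlow interoperability

An ND array can be passed to TensorFlow APIs, since ND array is just an alias to tf.Tensor. As mentioned earlier, such interoperation does not copy data, even for data placed on accelerators or remote devices.

Conversely, tf.Tensor objects can be passed to tf.experimental.numpy APIs without performing data copies.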

# ND array passed into TensorFlow function.
tf_sum = tf.reduce_sum(tnp.ones([2, 3], tnp.float32))
print("Output = %s" % tf_sum)

# `tf.Tensor` passed into TensorFlow NumPy function.
tnp_sum = tnp.sum(tf.ones([2, 3]))
print("Output = %s" % tnp_sum)

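Gradients and Jacobians: tf.GradientTape

TensorFlow's GradientTape can be used for backpropagation through TensorFlow and TensorFlow NumPy code.

Use the model created in the Example model section, and compute gradients and Jacobians.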

def create_batch(batch_size=32):
  """Creates a batch of input and labels."""
  return (tnp.random.randn(batch_size, 32).astype(tnp.float32),
          tnp.random.randn(batch_size, 2).astype(tnp.float32))

def compute_gradients(model, inputs, labels):
  """Computes gradients of squared loss between model prediction and labels."""
  with tf.GradientTape() as tape:
    assert model.weights is not None
    # Note that `model.weights` need to be explicitly watched since they
    # are not tf.Variables.
    tape.watch(model.weights)
    # Compute prediction and loss
    prediction = model.predict(inputs)
    loss = tnp.sum(tnp.square(prediction - labels))
  # This call computes the gradient through the computation above.
  return tape.gradient(loss, model.weights)

inputs, labels = create_batch()
gradients = compute_gradients(model, inputs, labels)

# Inspect the shapes of returned gradients to verify they match the
# parameter shapes.
print("Parameter shapes:", [w.shape for w in model.weights])
print("Gradient shapes:", [g.shape for g in gradients])
# Verify that gradients are of type ND array.
assert isinstance(gradients[0], tnp.ndarray)

# Computes a batch of jacobians. Each row is the jacobian of an element in the
# batch of outputs w.r.t. the corresponding input batch element.
def prediction_batch_jacobian(inputs):
  with tf.GradientTape() as tape:
    tape.watch(inputs)
    prediction = model.predict(inputs)
  return prediction, tape.batch_jacobian(prediction, inputs)

inp_batch = tnp.ones([16, 32], tnp.float32)
output, batch_jacobian = prediction_batch_jacobian(inp_batch)
# Note how the batch jacobian shape relates to the input and output shapes.
print("Output shape: %s, input shape: %s" % (output.shape, inp_batch.shape))
print("Batch jacobian shape:", batch_jacobian.shape)
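
As a smaller check of the same mechanism, the sketch below differentiates a loss written only with TensorFlow NumPy calls. For loss = sum(x^2) the gradient is 2x, which makes the result easy to verify by hand. The input values are illustrative.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

x = tnp.asarray([1.0, 2.0, 3.0], dtype=tnp.float32)

with tf.GradientTape() as tape:
  # `x` is a plain tensor, not a tf.Variable, so it must be watched.
  tape.watch(x)
  loss = tnp.sum(tnp.square(x))

# d(sum(x^2))/dx = 2 * x
grad = tape.gradient(loss, x)
print(grad)
```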

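Trace compilation: tf.function

TensorFlow's tf.function works by "trace compiling" the code and then optimizing these traces for much faster performance. See the Introduction to Graphs and Functions.

tf.function can be used to optimize TensorFlow NumPy code as well. Here is a simple example to demonstrate the speedups. Note that the body of the tf.function code includes calls to TensorFlow NumPy APIs.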

inputs, labels = create_batch(512)
print("Eager performance")
compute_gradients(model, inputs, labels)
# timeit returns the total seconds for 10 runs; multiplying by 100 gives
# milliseconds per run.
print(timeit.timeit(lambda: compute_gradients(model, inputs, labels),
                    number=10) * 100, "ms")

print("\ntf.function compiled performance")
compiled_compute_gradients = tf.function(compute_gradients)
compiled_compute_gradients(model, inputs, labels)  # warmup
print(timeit.timeit(lambda: compiled_compute_gradients(model, inputs, labels),
                    number=10) * 100, "ms")
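
Compilation is expected to change speed, not results. A minimal sketch of that check, using a hypothetical `softplus` helper written only with TensorFlow NumPy calls, runs the function eagerly and through tf.function and compares the outputs.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

def softplus(x):
  # Written purely with TensorFlow NumPy calls.
  return tnp.log(1.0 + tnp.exp(x))

# Wrapping the Python function traces it into a TensorFlow graph.
compiled_softplus = tf.function(softplus)

x = tnp.asarray([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=tnp.float32)
eager_out = softplus(x)
compiled_out = compiled_softplus(x)

max_diff = float(tnp.max(tnp.abs(eager_out - compiled_out)))
print("Largest difference between eager and compiled:", max_diff)
```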

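Vectorization: tf.vectorized_map

TensorFlow has built-in support for vectorizing parallel loops, which allows speedups of one to two orders of magnitude. These speedups are accessible through the tf.vectorized_map API and apply to TensorFlow NumPy code as well.

It is sometimes useful to compute the gradient of each output in a batch with respect to the corresponding input batch element. Such a computation can be done efficiently using tf.vectorized_map, as shown below.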

@tf.function
def vectorized_per_example_gradients(inputs, labels):
  def single_example_gradient(arg):
    inp, label = arg
    return compute_gradients(model,
                             tnp.expand_dims(inp, 0),
                             tnp.expand_dims(label, 0))
  # Note that a call to `tf.vectorized_map` semantically maps
  # `single_example_gradient` over each row of `inputs` and `labels`.
  # The interface is similar to `tf.map_fn`.
  # The underlying machinery vectorizes away this map loop which gives
  # nice speedups.
  return tf.vectorized_map(single_example_gradient, (inputs, labels))

batch_size = 128
inputs, labels = create_batch(batch_size)

per_example_gradients = vectorized_per_example_gradients(inputs, labels)
for w, p in zip(model.weights, per_example_gradients):
  print("Weight shape: %s, batch size: %s, per example gradient shape: %s " % (
      w.shape, batch_size, p.shape))

# Benchmark the vectorized computation above and compare with
# unvectorized sequential computation using `tf.map_fn`.
@tf.function
def unvectorized_per_example_gradients(inputs, labels):
  def single_example_gradient(arg):
    inp, label = arg
    return compute_gradients(model,
                             tnp.expand_dims(inp, 0),
                             tnp.expand_dims(label, 0))

  return tf.map_fn(single_example_gradient, (inputs, labels),
                   fn_output_signature=(tf.float32, tf.float32, tf.float32))

print("Running vectorized computation")
print(timeit.timeit(lambda: vectorized_per_example_gradients(inputs, labels),
                    number=10) * 100, "ms")

print("\nRunning unvectorized computation")
per_example_gradients = unvectorized_per_example_gradients(inputs, labels)
print(timeit.timeit(lambda: unvectorized_per_example_gradients(inputs, labels),
                    number=10) * 100, "ms")
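
The mapping semantics are easier to see on a tiny input. In this sketch a hypothetical per-example function, `squared_norm`, is applied to each row of a small batch with both tf.vectorized_map and tf.map_fn; the two calls are expected to return the same values, differing only in how the loop is executed.

```python
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

def squared_norm(row):
  # Operates on a single example (one row of the batch).
  return tnp.sum(tnp.square(row))

batch = tnp.asarray([[1.0, 2.0], [3.0, 4.0], [0.0, 5.0]],
                    dtype=tnp.float32)

# Vectorized: the per-row loop is rewritten into batched operations.
vectorized = tf.vectorized_map(squared_norm, batch)
# Sequential: the function is applied to one row at a time.
sequential = tf.map_fn(squared_norm, batch)

print(vectorized)
print(sequential)
```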

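Device placement

TensorFlow NumPy can place operations on CPUs, GPUs, TPUs, and remote devices. It uses standard TensorFlow mechanisms for device placement. The simple example below shows how to list all devices and then place some computation on a particular device.

TensorFlow also has APIs for replicating computation across devices and performing collective reductions, which are not covered here.

List devices

tf.config.list_logical_devices and tf.config.list_physical_devices can be used to find out which devices are available.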

print("All logical devices:", tf.config.list_logical_devices())
print("All physical devices:", tf.config.list_physical_devices())

# Try to get the GPU device. If unavailable, fallback to CPU.
try:
  device = tf.config.list_logical_devices(device_type="GPU")[0]
except IndexError:
  device = "/device:CPU:0"

Placing operations: `tf.device`

Operations can be placed on a device by calling them inside a tf.device scope.

print("Using device: %s" % str(device))
# Run operations in the `tf.device` scope.
# If a GPU is available, these operations execute on the GPU and outputs are
# placed on the GPU memory.
with tf.device(device):
  prediction = model.predict(create_batch(5)[0])

print("prediction is placed on %s" % prediction.device)

Copying ND arrays across devices: `tnp.copy`

A call to tnp.copy inside a device scope copies the data to that device, unless the data is already there.

with tf.device("/device:CPU:0"):
  prediction_cpu = tnp.copy(prediction)
print(prediction.device)
print(prediction_cpu.device)

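Performance comparisons

TensorFlow NumPy uses highly optimized TensorFlow kernels that can be dispatched on CPUs, GPUs, and TPUs. TensorFlow also performs many compiler optimizations, such as operation fusion, which translate to performance and memory improvements. See TensorFlow graph optimization with Grappler to learn more.

However, TensorFlow has higher overheads for dispatching operations than NumPy. For workloads composed of small operations (less than about 10 microseconds), these overheads can dominate the runtime and NumPy may provide better performance. In other cases, TensorFlow should generally provide better performance.

Run the benchmark below to compare NumPy and TensorFlow NumPy performance for different input sizes.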

def benchmark(f, inputs, number=30, force_gpu_sync=False):
  """Utility to benchmark `f` on each value in `inputs`."""
  times = []
  for inp in inputs:
    def _g():
      if force_gpu_sync:
        one = tnp.asarray(1)
      f(inp)
      if force_gpu_sync:
        with tf.device("CPU:0"):
          tnp.copy(one)  # Force a sync for GPU case

    _g()  # warmup
    t = timeit.timeit(_g, number=number)
    times.append(t * 1000. / number)
  return times


def plot(np_times, tnp_times, compiled_tnp_times, has_gpu, tnp_times_gpu):
  """Plot the different runtimes."""
  plt.xlabel("size")
  plt.ylabel("time (ms)")
  plt.title("Sigmoid benchmark: TF NumPy vs NumPy")
  plt.plot(sizes, np_times, label="NumPy")
  plt.plot(sizes, tnp_times, label="TF NumPy (CPU)")
  plt.plot(sizes, compiled_tnp_times, label="Compiled TF NumPy (CPU)")
  if has_gpu:
    plt.plot(sizes, tnp_times_gpu, label="TF NumPy (GPU)")
  plt.legend()

# Define a simple implementation of `sigmoid`, and benchmark it using
# NumPy and TensorFlow NumPy for different input sizes.

def np_sigmoid(y):
  return 1. / (1. + np.exp(-y))

def tnp_sigmoid(y):
  return 1. / (1. + tnp.exp(-y))

@tf.function
def compiled_tnp_sigmoid(y):
  return tnp_sigmoid(y)

sizes = (2 ** 0, 2 ** 5, 2 ** 10, 2 ** 15, 2 ** 20)
np_inputs = [np.random.randn(size).astype(np.float32) for size in sizes]
np_times = benchmark(np_sigmoid, np_inputs)

with tf.device("/device:CPU:0"):
  tnp_inputs = [tnp.random.randn(size).astype(np.float32) for size in sizes]
  tnp_times = benchmark(tnp_sigmoid, tnp_inputs)
  compiled_tnp_times = benchmark(compiled_tnp_sigmoid, tnp_inputs)

has_gpu = len(tf.config.list_logical_devices("GPU")) > 0
if has_gpu:
  with tf.device("/device:GPU:0"):
    tnp_inputs = [tnp.random.randn(size).astype(np.float32) for size in sizes]
    tnp_times_gpu = benchmark(compiled_tnp_sigmoid, tnp_inputs, 100, True)
else:
  tnp_times_gpu = None
plot(np_times, tnp_times, compiled_tnp_times, has_gpu, tnp_times_gpu)