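Create an op

If you'd like to create an op that isn't covered by the existing TensorFlow library, we recommend that you first try writing it in Python as a composition of existing Python ops or functions. If that isn't possible, you can create a custom C++ op. There are several reasons why you might want to create a custom C++ op:

It's not easy or possible to express your operation as a composition of existing ops.
It's not efficient to express your operation as a composition of existing primitives.
You want to hand-fuse a composition of primitives that a future compiler would find difficult to fuse.

For example, imagine you want to implement something like "median pooling", similar to the "MaxPool" operator, but computing medians over sliding windows instead of maximum values. Doing this with a composition of operations may be possible (e.g., using ExtractImagePatches and TopK), but it may not be as performance- or memory-efficient as a native operation, where you can do something more clever in a single, fused operation. As always, it is typically worth first trying to express what you want using operator composition, and only adding a new operation if that proves difficult or inefficient.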
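To make the "median pooling" idea concrete, here is a minimal pure-Python sketch of what such an op would compute in one dimension. The function name `median_pool_1d` and its `window` and `stride` parameters are hypothetical and are not part of TensorFlow; only complete windows are considered.

```python
from statistics import median


def median_pool_1d(values, window, stride=1):
    """Return the median of each complete sliding window over a 1-D list."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    return [
        median(values[start:start + window])
        for start in range(0, len(values) - window + 1, stride)
    ]


print(median_pool_1d([1, 9, 2, 8, 3], window=3))  # [2, 8, 3]
```

A fused native kernel would compute the same values without materializing every window, which is where the potential performance and memory savings come from.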

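To incorporate your custom op, you'll need to:

1. Register the new op in a C++ file. Op registration defines an interface (specification) for the op's functionality, which is independent of the op's implementation. For example, op registration defines the op's name and the op's inputs and outputs. It also defines the shape function that is used for tensor shape inference.
2. Implement the op in C++. The implementation of an op is known as a kernel, and it is the concrete implementation of the specification you registered in Step 1. There can be multiple kernels for different input/output types or architectures (for example, CPUs and GPUs).
3. Create a Python wrapper (optional). This wrapper is the public API that's used to create the op in Python. A default wrapper is generated from the op registration, which can be used directly or added to.
4. Write a function to compute gradients for the op (optional).
5. Test the op. We usually do this in Python for convenience, but you can also test the op in C++. If you define gradients, you can verify them with the Python tf.test.compute_gradient_error. See relu_op_test.py as an example that tests the forward functions of Relu-like operators and their gradients.

Prerequisites

Some familiarity with C++.
You must have installed the TensorFlow binary, or you must have downloaded the TensorFlow source and be able to build it.

Define the op interface

You define the interface of an op by registering it with the TensorFlow system. In the registration, you specify the name of your op, its inputs (types and names) and outputs (types and names), as well as docstrings and any attrs the op might require.

To see how this works, suppose you'd like to create an op that takes a tensor of int32s and outputs a copy of the tensor, with all but the first element set to zero. To do this, create a file named zero_out.cc. Then add a call to the REGISTER_OP macro that defines the interface for your op: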

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

REGISTER_OP("ZeroOut")
    .Input("to_zero: int32")
    .Output("zeroed: int32")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();
    });

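This ZeroOut op takes one tensor to_zero of 32-bit integers as input, and outputs a tensor zeroed of 32-bit integers. The op also uses a shape function to ensure that the output tensor has the same shape as the input tensor. For example, if the input is a tensor of shape [10, 20], then this shape function specifies that the output shape is also [10, 20].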

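Implement the kernel for the op

After you define the interface, provide one or more implementations of the op. To create one of these kernels, create a class that extends OpKernel and overrides the Compute method. The Compute method provides one context argument of type OpKernelContext*, from which you can access useful things like the input and output tensors.

Add your kernel to the file you created above. The kernel might look something like this: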

#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<int32>();

    // Create an output tensor
    Tensor* output_tensor = NULL;
    OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                     &output_tensor));
    auto output_flat = output_tensor->flat<int32>();

    // Set all but the first element of the output tensor to 0.
    const int N = input.size();
    for (int i = 1; i < N; i++) {
      output_flat(i) = 0;
    }

    // Preserve the first input value if possible.
    if (N > 0) output_flat(0) = input(0);
  }
};
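The behavior of the Compute method above can be modeled in a few lines of plain Python, which is handy for deriving expected values when you later test the op. This is only a reference sketch: `zero_out_reference` and `zero_out_reference_2d` are hypothetical helper names, and the model works on Python lists, not on tensors.

```python
def zero_out_reference(values):
    """Model ZeroOut on a flat list: keep element 0, zero the rest."""
    output = [0] * len(values)
    if values:  # Mirrors the kernel's `if (N > 0)` guard for empty input.
        output[0] = values[0]
    return output


def zero_out_reference_2d(matrix):
    """Apply the same rule to a row-major 2-D list, keeping its shape."""
    if not matrix or not matrix[0]:
        return [list(row) for row in matrix]
    width = len(matrix[0])
    # The kernel works on the flattened tensor, so flatten first.
    flat = zero_out_reference([x for row in matrix for x in row])
    return [flat[i:i + width] for i in range(0, len(flat), width)]


print(zero_out_reference_2d([[1, 2], [3, 4]]))  # [[1, 0], [0, 0]]
```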

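After implementing your kernel, you register it with the TensorFlow system. In the registration, you specify different constraints under which this kernel will run. For example, you might have one kernel made for CPUs, and a separate one for GPUs.

To do this for the ZeroOut op, add the following to zero_out.cc: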

REGISTER_KERNEL_BUILDER(Name("ZeroOut").Device(DEVICE_CPU), ZeroOutOp);

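Important: Instances of your OpKernel may be accessed concurrently. Your Compute method must be thread-safe. Guard any access to class members with a mutex. Or better yet, don't share state via class members! Consider using a ResourceMgr to keep track of op state.

Multi-threaded CPU kernels

To write a multi-threaded CPU kernel, you can use the Shard function in work_sharder.h. This function shards a computation function across the threads configured to be used for intra-op threading (see intra_op_parallelism_threads in config.proto).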
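The idea behind sharding can be sketched in plain Python: split the index range [0, total) into contiguous blocks and hand each block to a worker thread. This is an illustration of the concept only, not TensorFlow's implementation. The `shard` helper below is hypothetical, and the real Shard function takes additional arguments (such as a thread pool and a per-unit cost estimate) that are omitted here.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(num_workers, total, work):
    """Split [0, total) into contiguous blocks and run work(start, limit) on each."""
    if total <= 0:
        return
    num_shards = max(1, min(num_workers, total))
    block = -(-total // num_shards)  # Ceiling division.
    ranges = [(start, min(start + block, total))
              for start in range(0, total, block)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda r: work(*r), ranges))


data = list(range(10))
doubled = [0] * len(data)


def double_block(start, limit):
    # Each shard writes a disjoint slice, so no locking is needed.
    for i in range(start, limit):
        doubled[i] = 2 * data[i]


shard(num_workers=4, total=len(data), work=double_block)
print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because every block covers a disjoint index range, the work function needs no synchronization as long as it only writes to its own range.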

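GPU kernels

A GPU kernel is implemented in two parts: the OpKernel, and the CUDA kernel together with its launch code.

Sometimes the OpKernel implementation is common between a CPU and GPU kernel, such as around inspecting inputs and allocating outputs. In that case, a suggested implementation is to:

Define the OpKernel templated on the Device and the primitive type of the tensor.
To do the actual computation of the output, the Compute function calls a templated functor struct.
The specialization of that functor for the CPUDevice is defined in the same file, but the specialization for the GPUDevice is defined in a .cu.cc file, since it will be compiled with the CUDA compiler.

Here is an example implementation.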

// kernel_example.h
#ifndef KERNEL_EXAMPLE_H_
#define KERNEL_EXAMPLE_H_

#include <unsupported/Eigen/CXX11/Tensor>

template <typename Device, typename T>
struct ExampleFunctor {
  void operator()(const Device& d, int size, const T* in, T* out);
};

#if GOOGLE_CUDA
// Partially specialize functor for GpuDevice.
template <typename T>
struct ExampleFunctor<Eigen::GpuDevice, T> {
  void operator()(const Eigen::GpuDevice& d, int size, const T* in, T* out);
};
#endif

#endif  // KERNEL_EXAMPLE_H_

// kernel_example.cc
#include "kernel_example.h"

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

using CPUDevice = Eigen::ThreadPoolDevice;
using GPUDevice = Eigen::GpuDevice;

REGISTER_OP("Example")
    .Attr("T: numbertype")
    .Input("input: T")
    .Output("input_times_two: T")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();
    });

// CPU specialization of actual computation.
template <typename T>
struct ExampleFunctor<CPUDevice, T> {
  void operator()(const CPUDevice& d, int size, const T* in, T* out) {
    for (int i = 0; i < size; ++i) {
      out[i] = 2 * in[i];
    }
  }
};

// OpKernel definition.
// template parameter <T> is the datatype of the tensors.
template <typename Device, typename T>
class ExampleOp : public OpKernel {
 public:
  explicit ExampleOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);

    // Create an output tensor
    Tensor* output_tensor = NULL;
    OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                     &output_tensor));

    // Do the computation.
    OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
                errors::InvalidArgument("Too many elements in tensor"));
    ExampleFunctor<Device, T>()(
        context->eigen_device<Device>(),
        static_cast<int>(input_tensor.NumElements()),
        input_tensor.flat<T>().data(),
        output_tensor->flat<T>().data());
  }
};

// Register the CPU kernels.
#define REGISTER_CPU(T)                                          \
  REGISTER_KERNEL_BUILDER(                                       \
      Name("Example").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
      ExampleOp<CPUDevice, T>);
REGISTER_CPU(float);
REGISTER_CPU(int32);

// Register the GPU kernels.
#ifdef GOOGLE_CUDA
#define REGISTER_GPU(T)                                          \
  /* Declare explicit instantiations in kernel_example.cu.cc. */ \
  extern template class ExampleFunctor<GPUDevice, T>;            \
  REGISTER_KERNEL_BUILDER(                                       \
      Name("Example").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
      ExampleOp<GPUDevice, T>);
REGISTER_GPU(float);
REGISTER_GPU(int32);
#endif  // GOOGLE_CUDA

// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "kernel_example.h"
#include "tensorflow/core/util/gpu_kernel_helper.h"

using namespace tensorflow;

using GPUDevice = Eigen::GpuDevice;

// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * __ldg(in + i);
  }
}

// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  //
  // See core/util/gpu_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}

// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;

#endif  // GOOGLE_CUDA
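The loop in ExampleCudaKernel above is a grid-stride loop: each thread starts at its global index and advances by the total number of launched threads, so any launch configuration covers every element exactly once. The following pure-Python simulation illustrates that property; `simulate_grid_stride` is a hypothetical helper, and the nested loops stand in for blocks and threads that would run in parallel on a GPU.

```python
def simulate_grid_stride(size, block_count, threads_per_block):
    """Return how many times each index in [0, size) is visited."""
    visits = [0] * size
    stride = block_count * threads_per_block  # gridDim.x * blockDim.x
    for block_idx in range(block_count):
        for thread_idx in range(threads_per_block):
            # Same start index as blockIdx.x * blockDim.x + threadIdx.x.
            i = block_idx * threads_per_block + thread_idx
            while i < size:
                visits[i] += 1
                i += stride
    return visits


# Every element is processed exactly once, with fewer threads than elements...
print(set(simulate_grid_stride(size=100, block_count=4, threads_per_block=8)))  # {1}
# ...and also with more threads than elements.
print(set(simulate_grid_stride(size=5, block_count=4, threads_per_block=8)))  # {1}
```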

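Build the op library

Compile the op using your system compiler (TensorFlow binary installation)

You should be able to compile zero_out.cc with a C++ compiler such as g++ or clang available on your system. The binary PIP package installs the header files and the library that you need to compile your op in locations that are system specific. However, the TensorFlow Python library provides the get_include function to get the header directory, and the get_lib function to get the directory that contains a shared object to link against. Here are the outputs of these functions on an Ubuntu machine.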

$ python
>>> import tensorflow as tf
>>> tf.sysconfig.get_include()
'/usr/local/lib/python3.6/site-packages/tensorflow/include'
>>> tf.sysconfig.get_lib()
'/usr/local/lib/python3.6/site-packages/tensorflow'

Assuming you have g++ installed, here is the sequence of commands you can use to compile your op into a dynamic library.

TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
g++ -std=c++14 -shared zero_out.cc -o zero_out.so -fPIC ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} -O2

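On macOS, the additional flag "-undefined dynamic_lookup" is required when building the .so file.

Note on gcc version >=5: gcc uses the new C++ ABI since version 5. TensorFlow 2.8 and earlier were built with gcc4, which uses the older ABI. If you are using these versions of TensorFlow and are trying to compile your op library with gcc>=5, add -D_GLIBCXX_USE_CXX11_ABI=0 to the command line to make the library compatible with the older ABI. TensorFlow 2.9+ packages are compatible with the newer ABI by default.

Compile the op using bazel (TensorFlow source installation)

If you have TensorFlow sources installed, you can make use of TensorFlow's build system to compile your op. Place a BUILD file with the following Bazel build rule in the tensorflow/core/user_ops directory.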

load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")

tf_custom_op_library(
    name = "zero_out.so",
    srcs = ["zero_out.cc"],
)

Run the following command to build zero_out.so.

$ bazel build --config opt //tensorflow/core/user_ops:zero_out.so

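To compile the Example op, which has a CUDA kernel, you need to use the gpu_srcs parameter of tf_custom_op_library. Place a BUILD file with the following Bazel build rule in a new folder inside the tensorflow/core/user_ops directory (e.g. "example_gpu").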

load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")

tf_custom_op_library(
    # kernel_example.cc  kernel_example.cu.cc  kernel_example.h
    name = "kernel_example.so",
    srcs = ["kernel_example.h", "kernel_example.cc"],
    gpu_srcs = ["kernel_example.cu.cc", "kernel_example.h"],
)

Run the following command to build kernel_example.so.

$ bazel build --config opt //tensorflow/core/user_ops/example_gpu:kernel_example.so

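Note: While you can create a shared library (a .so file) with the standard cc_library rule, we strongly recommend that you use the tf_custom_op_library macro. It adds some required dependencies, and performs checks to ensure that the shared library is compatible with TensorFlow's plugin loading mechanism.

Use the op in Python

The TensorFlow Python API provides the tf.load_op_library function to load the dynamic library and register the op with the TensorFlow framework. load_op_library returns a Python module that contains the Python wrappers for the op and the kernel. Thus, once you have built the op, you can do the following to run it from Python: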

import tensorflow as tf
zero_out_module = tf.load_op_library('./zero_out.so')
print(zero_out_module.zero_out([[1, 2], [3, 4]]).numpy())

# Prints
# [[1 0]
#  [0 0]]

Keep in mind that the generated function is given a snake_case name (to comply with PEP8). So, if your op is named ZeroOut in the C++ files, the Python function is called zero_out.
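As a rough illustration of this naming rule, the sketch below converts a CamelCase op name to snake_case with a regular expression. It is a simplified approximation written for this guide, not the code TensorFlow uses to generate wrapper names, so edge cases may differ; `to_snake_case` and the name `MyHTTPFetch` are hypothetical.

```python
import re


def to_snake_case(op_name):
    """Approximate the CamelCase -> snake_case mapping for op names."""
    # Insert "_" before an uppercase letter that follows a lowercase letter
    # or digit, or that starts a capitalized word after an uppercase run.
    with_breaks = re.sub(
        r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "_", op_name)
    return with_breaks.lower()


print(to_snake_case("ZeroOut"))         # zero_out
print(to_snake_case("StringToNumber"))  # string_to_number
print(to_snake_case("MyHTTPFetch"))     # my_http_fetch
```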

To make the op available as a regular function that can be imported from a Python module, it may be useful to put the load_op_library call in a Python source file as follows:

import tensorflow as tf

zero_out_module = tf.load_op_library('./zero_out.so')
zero_out = zero_out_module.zero_out

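Verify that the op works

A good way to verify that you've successfully implemented your op is to write a test for it. Create the file zero_out_op_test.py with the contents: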

import tensorflow as tf

class ZeroOutTest(tf.test.TestCase):
  def testZeroOut(self):
    zero_out_module = tf.load_op_library('./zero_out.so')
    # assertAllEqual evaluates the tensor itself, so this also works in eager mode.
    result = zero_out_module.zero_out([5, 4, 3, 2, 1])
    self.assertAllEqual(result, [5, 0, 0, 0, 0])

if __name__ == "__main__":
  tf.test.main()

Then run your test (assuming you have tensorflow installed):

$ python zero_out_op_test.py

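Build advanced features into your op

Now that you know how to build a basic (and somewhat restricted) op and implementation, we'll look at some of the more complicated things you will typically need to build into your op. This includes:

Conditional checks and validation
Op registration
GPU support
- Compiling the kernel for the GPU device
Implement the gradient in Python
Shape functions in C++

Conditional checks and validation

The example above assumed that the op applied to a tensor of any shape. What if it only applied to vectors? That means adding a check to the above OpKernel implementation.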

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);

    OP_REQUIRES(context, TensorShapeUtils::IsVector(input_tensor.shape()),
                errors::InvalidArgument("ZeroOut expects a 1-D vector."));
    // ...
  }

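This asserts that the input is a vector, and returns having set the InvalidArgument status if it isn't. The OP_REQUIRES macro takes three arguments:

The context, which can either be an OpKernelContext or OpKernelConstruction pointer (see tensorflow/core/framework/op_kernel.h), for its SetStatus() method.
The condition. For example, there are functions for validating the shape of a tensor in tensorflow/core/framework/tensor_shape.h.
The error itself, which is represented by a Status object, see tensorflow/core/platform/status.h. A Status has both a type (frequently InvalidArgument, but see the list of types) and a message. Functions for constructing an error may be found in tensorflow/core/platform/errors.h.

Alternatively, if you want to test whether a Status object returned from some function is an error, and if so return it, use OP_REQUIRES_OK. Both of these macros return from the function on error.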
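The control flow that OP_REQUIRES produces (record the error on the context, then return early) can be modeled in plain Python as below. This is only a sketch of the pattern: `KernelContext` and `compute` are hypothetical stand-ins, and the status is modeled as a string instead of a Status object.

```python
class KernelContext:
    """Hypothetical stand-in for OpKernelContext: it only stores a status."""

    def __init__(self):
        self.status = "OK"

    def set_status(self, status):
        self.status = status


def compute(context, shape):
    # Equivalent of OP_REQUIRES(context, <input is a vector>, <error>):
    # on failure, record the error on the context and return immediately.
    if len(shape) != 1:
        context.set_status("InvalidArgument: ZeroOut expects a 1-D vector.")
        return None
    # Only reached when the check passed.
    return [0] * shape[0]


ok_context = KernelContext()
print(compute(ok_context, [3]), ok_context.status)  # [0, 0, 0] OK

bad_context = KernelContext()
print(compute(bad_context, [2, 2]), bad_context.status)
# None InvalidArgument: ZeroOut expects a 1-D vector.
```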

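Op registration

Attrs

Ops can have attrs, whose values are set when the op is added to a graph. These are used to configure the op, and their values can be accessed both within the kernel implementation and in the types of inputs and outputs in the op registration. Prefer using an input instead of an attr when possible, since inputs are more flexible. This is because attrs are constants and must be defined at graph construction time. In contrast, inputs are tensors whose values can be dynamic; that is, inputs can change every step, be set using a feed, and so on. Attrs are used for things that can't be done with inputs: any configuration that affects the signature (number or type of inputs or outputs) or that can't change from step to step.

You define an attr when you register the op, by specifying its name and type using the Attr method, which expects a spec of the form:

<name>: <attr-type-expr>

where <name> begins with a letter and can be composed of alphanumeric characters and underscores, and <attr-type-expr> is a type expression of the form described below.

For example, if you'd like the ZeroOut op to preserve a user-specified index, instead of only the 0th element, you can register the op like so: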

REGISTER_OP("ZeroOut")
    .Attr("preserve_index: int")
    .Input("to_zero: int32")
    .Output("zeroed: int32");

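(Note that the set of attribute types is different from the tf.DType used for inputs and outputs.)

Your kernel can then access this attr in its constructor via the context parameter: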

class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {
    // Get the index of the value to preserve
    OP_REQUIRES_OK(context,
                   context->GetAttr("preserve_index", &preserve_index_));
    // Check that preserve_index is non-negative
    OP_REQUIRES(context, preserve_index_ >= 0,
                errors::InvalidArgument("Need preserve_index >= 0, got ",
                                        preserve_index_));
  }
  void Compute(OpKernelContext* context) override {
    // ...
  }
 private:
  int preserve_index_;
};

It can then be used in the Compute method:

  void Compute(OpKernelContext* context) override {
    // ...

    // We're using saved attr to validate potentially dynamic input
    // So we check that preserve_index is in range
    OP_REQUIRES(context, preserve_index_ < input.dimension(0),
                errors::InvalidArgument("preserve_index out of range"));

    // Set all the elements of the output tensor to 0
    const int N = input.size();
    for (int i = 0; i < N; i++) {
      output_flat(i) = 0;
    }

    // Preserve the requested input value
    output_flat(preserve_index_) = input(preserve_index_);
  }
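The combined effect of the constructor check and the Compute method above can be summarized by a small pure-Python reference model. `zero_out_preserving` is a hypothetical name; the two ValueError cases correspond to the two OP_REQUIRES checks (non-negative index at construction time, in-range index at compute time).

```python
def zero_out_preserving(values, preserve_index=0):
    """Zero every element of a flat list except values[preserve_index]."""
    # Construction-time check: the attr must be non-negative.
    if preserve_index < 0:
        raise ValueError(f"Need preserve_index >= 0, got {preserve_index}")
    # Compute-time check: the attr must be valid for this particular input.
    if preserve_index >= len(values):
        raise ValueError("preserve_index out of range")
    output = [0] * len(values)
    output[preserve_index] = values[preserve_index]
    return output


print(zero_out_preserving([5, 4, 3, 2, 1], preserve_index=2))  # [0, 0, 3, 0, 0]
```

The range check has to happen at compute time because the input length is only known when the op runs, while the attr value is fixed when the graph is built.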

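Attr types

The following types are supported in an attr:

string: Any sequence of bytes (not required to be UTF8).
int: A signed integer.
float: A floating point number.
bool: True or false.
type: One of the (non-ref) values of DataType.
shape: A TensorShapeProto.
tensor: A TensorProto.
list(<type>): A list of <type>, where <type> is one of the above types. Note that list(list(<type>)) is invalid.

See also: op_def_builder.cc:FinalizeAttr for a definitive list.

Default values and constraints

Attrs may have default values, and some types of attrs can have constraints. To define an attr with constraints, you can use the following <attr-type-expr>s:

{'<string1>', '<string2>'}: The value must be a string that has either the value <string1> or <string2>. The name of the type, string, is implied when you use this syntax. This emulates an enum: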

REGISTER_OP("EnumExample")
    .Attr("e: {'apple', 'orange'}");

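{<type1>, <type2>}: The value is of type type, and must be one of <type1> or <type2>, where <type1> and <type2> are supported tf.DType values. You don't specify that the type of the attr is type. This is implied when you have a list of types in {...}. For example, in this case the attr t is a type that must be an int32, a float, or a bool: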

REGISTER_OP("RestrictedTypeExample")
    .Attr("t: {int32, float, bool}");

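There are shortcuts for common type constraints:

numbertype: Type type restricted to the numeric (non-string and non-bool) types.
realnumbertype: Like numbertype, but without complex types.
quantizedtype: Like numbertype, but just the quantized number types.

The specific lists of types allowed by these are defined by the functions (like NumberTypes()) in tensorflow/core/framework/types.h. In this example the attr t must be one of the numeric types: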

REGISTER_OP("NumberType")
    .Attr("t: numbertype");

For this op:

tf.number_type(t=tf.int32)  # Valid
tf.number_type(t=tf.bool)   # Invalid

Lists can be combined with other lists and single types. The following op allows attr t to be any of the numeric types, or the bool type:

REGISTER_OP("NumberOrBooleanType")
    .Attr("t: {numbertype, bool}");

For this op:

tf.number_or_boolean_type(t=tf.int32)  # Valid
tf.number_or_boolean_type(t=tf.bool)   # Valid
tf.number_or_boolean_type(t=tf.string) # Invalid

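int >= <n>: The value must be an int whose value is greater than or equal to <n>, where <n> is a natural number. For example, the following op registration specifies that the attr a must have a value that is at least 2: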

REGISTER_OP("MinIntExample")
    .Attr("a: int >= 2");

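list(<type>) >= <n>: A list of type <type> whose length is greater than or equal to <n>. For example, the following op registration specifies that the attr a is a list of types (either int32 or float), and that there must be at least 3 of them: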

REGISTER_OP("TypeListExample")
    .Attr("a: list({int32, float}) >= 3");

To set a default value for an attr (making it optional in the generated code), add = <default> to the end, as in:

REGISTER_OP("AttrDefaultExample")
    .Attr("i: int = 0");

Additionally, both a constraint and a default value can be specified:

REGISTER_OP("AttrConstraintAndDefaultExample")
    .Attr("i: int >= 1 = 1");

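The supported syntax of the default value is what would be used in the proto representation of the resulting GraphDef definition.

Here are examples for how to specify a default for all types: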

REGISTER_OP("AttrDefaultExampleForAllTypes")
   .Attr("s: string = 'foo'")
   .Attr("i: int = 0")
   .Attr("f: float = 1.0")
   .Attr("b: bool = true")
   .Attr("ty: type = DT_INT32")
   .Attr("sh: shape = { dim { size: 1 } dim { size: 2 } }")
   .Attr("te: tensor = { dtype: DT_INT32 int_val: 5 }")
   .Attr("l_empty: list(int) = []")
   .Attr("l_int: list(int) = [2, 3, 5, 7]");

Note in particular that the values of type type use tf.DType.
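To show how the pieces of an attr spec fit together, the sketch below parses only the int form of the grammar described above: `<name>: int`, optionally followed by `>= <n>` and `= <default>`. It is a simplified illustration, not TensorFlow's parser. `parse_int_attr` is a hypothetical helper, and rejecting a default that is below the minimum is this sketch's own choice.

```python
import re

# <name>: int [>= <n>] [= <default>]
_INT_ATTR = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z0-9_]*)\s*:\s*int"
    r"(?:\s*>=\s*(?P<minimum>\d+))?"
    r"(?:\s*=\s*(?P<default>-?\d+))?\s*$"
)


def parse_int_attr(spec):
    """Parse an int attr spec into its name, minimum, and default."""
    match = _INT_ATTR.match(spec)
    if match is None:
        raise ValueError(f"unsupported attr spec: {spec!r}")
    minimum = match.group("minimum")
    default = match.group("default")
    parsed = {
        "name": match.group("name"),
        "minimum": int(minimum) if minimum is not None else None,
        "default": int(default) if default is not None else None,
    }
    # Assumption of this sketch: a default must satisfy the constraint.
    if (parsed["minimum"] is not None and parsed["default"] is not None
            and parsed["default"] < parsed["minimum"]):
        raise ValueError("default violates the minimum constraint")
    return parsed


print(parse_int_attr("i: int >= 1 = 1"))
# {'name': 'i', 'minimum': 1, 'default': 1}
```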

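Polymorphism

Type polymorphism

For ops that can take different types as input or produce different output types, you can specify an attr in an input or output type in the op registration. Typically you would then register an OpKernel for each supported type.

For instance, if you'd like the ZeroOut op to work on floats in addition to int32s, your op registration might look like: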

REGISTER_OP("ZeroOut")
    .Attr("T: {float, int32}")
    .Input("to_zero: T")
    .Output("zeroed: T");

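Your op registration now specifies that the input's type must be float or int32, and that its output will be the same type, since both have type T.

Naming

Inputs, outputs, and attrs generally should be given snake_case names. The one exception is attrs that are used as the type of an input or in the type of an output. Those attrs can be inferred when the op is added to the graph and so don't appear in the op's function. For example, this last definition of ZeroOut will generate a Python function that looks like: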

def zero_out(to_zero, name=None):
  """...
  Args:
    to_zero: A `Tensor`. Must be one of the following types:
        `float32`, `int32`.
    name: A name for the operation (optional).

  Returns:
    A `Tensor`. Has the same type as `to_zero`.
  """

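If to_zero is passed an int32 tensor, then T is automatically set to int32 (well, actually DT_INT32). Those inferred attrs are given Capitalized or CamelCase names.

Compare this with an op that has a type attr that determines the output type: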

REGISTER_OP("StringToNumber")
    .Input("string_tensor: string")
    .Output("output: out_type")
    .Attr("out_type: {float, int32} = DT_FLOAT")
    .Doc(R"doc(
Converts each string in the input Tensor to the specified numeric type.
)doc");

In this case, the user has to specify the output type, as in the generated Python:

def string_to_number(string_tensor, out_type=None, name=None):
  """Converts each string in the input Tensor to the specified numeric type.

  Args:
    string_tensor: A `Tensor` of type `string`.
    out_type: An optional `tf.DType` from: `tf.float32, tf.int32`.
      Defaults to `tf.float32`.
    name: A name for the operation (optional).

  Returns:
    A `Tensor` of type `out_type`.
  """

Type polymorphism example

#include "tensorflow/core/framework/op_kernel.h"

class ZeroOutInt32Op : public OpKernel {
  // as before
};

class ZeroOutFloatOp : public OpKernel {
 public:
  explicit ZeroOutFloatOp(OpKernelConstruction* context)
      : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<float>();

    // Create an output tensor
    Tensor* output = NULL;
    OP_REQUIRES_OK(context,
                   context->allocate_output(0, input_tensor.shape(), &output));
    auto output_flat = output->template flat<float>();

    // Set all the elements of the output tensor to 0
    const int N = input.size();
    for (int i = 0; i < N; i++) {
      output_flat(i) = 0;
    }

    // Preserve the first input value
    if (N > 0) output_flat(0) = input(0);
  }
};

// Note that TypeConstraint<int32>("T") means that attr "T" (defined
// in the op registration above) must be "int32" to use this template
// instantiation.
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<int32>("T"),
    ZeroOutInt32Op);
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<float>("T"),
    ZeroOutFloatOp);
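Conceptually, each REGISTER_KERNEL_BUILDER call above adds one entry to a lookup table keyed by op name, device, and the value of the type attr. The plain-Python sketch below models that lookup. It illustrates the idea only and is not how TensorFlow's registry is implemented; `register_kernel`, `find_kernel`, and the string keys are hypothetical.

```python
_KERNELS = {}


def register_kernel(op_name, device, dtype, kernel):
    """Register one kernel for an (op, device, type constraint) combination."""
    key = (op_name, device, dtype)
    if key in _KERNELS:
        raise ValueError(f"duplicate kernel registration for {key}")
    _KERNELS[key] = kernel


def find_kernel(op_name, device, dtype):
    """Return the kernel whose constraints match, or raise LookupError."""
    try:
        return _KERNELS[(op_name, device, dtype)]
    except KeyError:
        raise LookupError(
            f"no kernel for op {op_name!r} on {device} with T={dtype}") from None


def zero_out(values, cast):
    """Shared logic: keep element 0, zero the rest, in the requested type."""
    out = [cast(0)] * len(values)
    if values:
        out[0] = cast(values[0])
    return out


register_kernel("ZeroOut", "CPU", "int32", lambda v: zero_out(v, int))
register_kernel("ZeroOut", "CPU", "float", lambda v: zero_out(v, float))

print(find_kernel("ZeroOut", "CPU", "float")([5, 4, 3]))  # [5.0, 0.0, 0.0]
```

Asking for a combination that was never registered (for example a "double" kernel here) fails the lookup, which mirrors why every supported type needs its own registration.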

To preserve backwards compatibility, you should specify a default value when adding an attr to an existing op:

REGISTER_OP("ZeroOut")
  .Attr("T: {float, int32} = DT_INT32")
  .Input("to_zero: T")
  .Output("zeroed: T");

Let's say you wanted to add more types, say double:

REGISTER_OP("ZeroOut")
    .Attr("T: {float, double, int32}")
    .Input("to_zero: T")
    .Output("zeroed: T");

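Instead of writing another OpKernel with redundant code as above, you can often use a C++ template instead. You will still have one kernel registration (REGISTER_KERNEL_BUILDER call) per overload.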

template <typename T>
class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<T>();

    // Create an output tensor
    Tensor* output = NULL;
    OP_REQUIRES_OK(context,
                   context->allocate_output(0, input_tensor.shape(), &output));
    auto output_flat = output->template flat<T>();

    // Set all the elements of the output tensor to 0
    const int N = input.size();
    for (int i = 0; i < N; i++) {
      output_flat(i) = 0;
    }

    // Preserve the first input value
    if (N > 0) output_flat(0) = input(0);
  }
};

// Note that TypeConstraint<int32>("T") means that attr "T" (defined
// in the op registration above) must be "int32" to use this template
// instantiation.
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<int32>("T"),
    ZeroOutOp<int32>);
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<float>("T"),
    ZeroOutOp<float>);
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<double>("T"),
    ZeroOutOp<double>);

If you have more than a couple of overloads, you can put the registration in a macro.

#include "tensorflow/core/framework/op_kernel.h"

#define REGISTER_KERNEL(type)                                       \
  REGISTER_KERNEL_BUILDER(                                          \
      Name("ZeroOut").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
      ZeroOutOp<type>)

REGISTER_KERNEL(int32);
REGISTER_KERNEL(float);
REGISTER_KERNEL(double);

#undef REGISTER_KERNEL

Depending on the list of types you are registering the kernel for, you may be able to use a macro provided by tensorflow/core/framework/register_types.h:

#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"

REGISTER_OP("ZeroOut")
    .Attr("T: realnumbertype")
    .Input("to_zero: T")
    .Output("zeroed: T");

template <typename T>
class ZeroOutOp : public OpKernel { ... };

#define REGISTER_KERNEL(type)                                       \
  REGISTER_KERNEL_BUILDER(                                          \
      Name("ZeroOut").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
      ZeroOutOp<type>)

TF_CALL_REAL_NUMBER_TYPES(REGISTER_KERNEL);

#undef REGISTER_KERNEL

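List inputs and outputs

In addition to being able to accept or produce different types, ops can consume or produce a variable number of tensors.

In the next example, the attr T holds a list of types, and is used as the type of both the input in and the output out. The input and output are lists of tensors of that type (and the number and types of tensors in the output are the same as the input, since both have type T).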

REGISTER_OP("PolymorphicListExample")
    .Attr("T: list(type)")
    .Input("in: T")
    .Output("out: T");

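You can also place restrictions on what types can be specified in the list. In this next case, the input is a list of float and double tensors. The op accepts, for example, input types (float, double, float), and in that case the output type would also be (float, double, float).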

REGISTER_OP("ListTypeRestrictionExample")
    .Attr("T: list({float, double})")
    .Input("in: T")
    .Output("out: T");

If you want all the tensors in a list to be of the same type, you might do something like:

REGISTER_OP("IntListInputExample")
    .Attr("N: int")
    .Input("in: N * int32")
    .Output("out: int32");

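This accepts a list of int32 tensors, and uses an int attr N to specify the length of the list.

This can be made type polymorphic as well. In the next example, the input is a list of tensors (with length "N") of the same (but unspecified) type ("T"), and the output is a single tensor of matching type: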

REGISTER_OP("SameListInputExample")
    .Attr("N: int")
    .Attr("T: type")
    .Input("in: N * T")
    .Output("out: T");

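By default, tensor lists have a minimum length of 1. You can change that default using a ">=" constraint on the corresponding attr. In this next example, the input is a list of at least 2 int32 tensors: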

REGISTER_OP("MinLengthIntListExample")
    .Attr("N: int >= 2")
    .Input("in: N * int32")
    .Output("out: int32");

The same syntax works with "list(type)" attrs:

REGISTER_OP("MinimumLengthPolymorphicListExample")
    .Attr("T: list(type) >= 3")
    .Input("in: T")
    .Output("out: T");

Inputs and outputs

To summarize the above, an op registration can have multiple inputs and outputs:

REGISTER_OP("MultipleInsAndOuts")
    .Input("y: int32")
    .Input("z: float")
    .Output("a: string")
    .Output("b: int32");

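Each input or output spec is of the form:

<name>: <io-type-expr>

where <name> begins with a letter and can be composed of alphanumeric characters and underscores. <io-type-expr> is one of the following type expressions:

<type>, where <type> is a supported input type (e.g. float, int32, string). This specifies a single tensor of the given type.

See tf.DType.
```
REGISTER_OP("BuiltInTypesExample")
    .Input("integers: int32")
    .Input("complex_numbers: complex64");
```

<attr-type>, where <attr-type> is the name of an attr with type type or list(type) (with a possible type restriction). This syntax allows for polymorphic ops.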

REGISTER_OP("PolymorphicSingleInput")
    .Attr("T: type")
    .Input("in: T");

REGISTER_OP("RestrictedPolymorphicSingleInput")
    .Attr("T: {int32, int64}")
    .Input("in: T");

Referencing an attr of type list(type) allows you to accept a sequence of tensors.

REGISTER_OP("ArbitraryTensorSequenceExample")
    .Attr("T: list(type)")
    .Input("in: T")
    .Output("out: T");

REGISTER_OP("RestrictedTensorSequenceExample")
    .Attr("T: list({int32, int64})")
    .Input("in: T")
    .Output("out: T");

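Note that the number and types of tensors in the output out are the same as in the input in, since both are of type T.

For a sequence of tensors with the same type: <number> * <type>, where <number> is the name of an attr with type int. The <type> can either be a tf.DType, or the name of an attr with type type. As an example of the first, this op accepts a list of int32 tensors: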
```
REGISTER_OP("Int32SequenceExample")
    .Attr("NumTensors: int")
    .Input("in: NumTensors * int32");
```
Whereas this op accepts a list of tensors of any type, as long as they are all the same:
```
REGISTER_OP("SameTypeSequenceExample")
    .Attr("NumTensors: int")
    .Attr("T: type")
    .Input("in: NumTensors * T");
```
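For a reference to a tensor: Ref(<type>), where <type> is one of the previous types.

Any attr used in the type of an input will be inferred. By convention those inferred attrs use capital names (like T or N). Otherwise inputs, outputs, and attrs have names like function parameters (e.g. num_outputs). For more details, see the earlier section on naming.

For more details, see tensorflow/core/framework/op_def_builder.h.

Backwards compatibility

Let's assume you have written a nice, custom op and shared it with others, so you have happy customers using your operation. However, you'd like to make changes to the op in some way.

In general, changes to existing, checked-in specifications must be backwards-compatible: changing the specification of an op must not break prior serialized GraphDef protocol buffers constructed from older specifications. The details of GraphDef compatibility are described in the TensorFlow version compatibility guide.

There are several ways to preserve backwards-compatibility.

Any new attrs added to an operation must have default values defined, and with that default value the op must have the original behavior. To change an operation from not polymorphic to polymorphic, you must give a default value to the new type attr to preserve the original signature by default. For example, if your operation was: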
```
REGISTER_OP("MyGeneralUnaryOp")
    .Input("in: float")
    .Output("out: float");
```
you can make it polymorphic in a backwards-compatible way using:
```
REGISTER_OP("MyGeneralUnaryOp")
    .Input("in: T")
    .Output("out: T")
    .Attr("T: numbertype = DT_FLOAT");
```
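You can safely make a constraint on an attr less restrictive. For example, you can change from {int32, int64} to {int32, int64, float} or type. Or you may change from {"apple", "orange"} to {"apple", "banana", "orange"} or string.
You can change single inputs / outputs into list inputs / outputs, as long as the default for the list type matches the old signature.
You can add a new list input / output, if it defaults to empty.
Namespace any new ops you create, by prefixing the op names with something unique to your project. This avoids having your op collide with any ops that might be included in future versions of TensorFlow.
Plan ahead! Try to anticipate future uses for the op. Some signature changes can't be done in a compatible way (for example, making a list of the same type into a list of varying types).

The full list of safe and unsafe changes can be found in tensorflow/core/framework/op_compatibility_test.cc. If you cannot make your change to an operation backwards compatible, then create a new operation with a new name and the new semantics.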
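The reason a default value keeps old graphs working can be shown with a small plain-Python model: a node serialized before the attr existed simply lacks it, and the default from the current op definition fills the gap. The dictionaries and the `resolve_attrs` helper below are hypothetical simplifications, not the GraphDef format or TensorFlow's loading code.

```python
# Current op definition: the new type attr "T" defaults to the old fixed type.
OP_DEFS = {
    "MyGeneralUnaryOp": {"attr_defaults": {"T": "DT_FLOAT"}},
}


def resolve_attrs(node):
    """Fill in attrs missing from a serialized node using the op's defaults."""
    defaults = OP_DEFS[node["op"]]["attr_defaults"]
    resolved = dict(defaults)
    resolved.update(node.get("attr", {}))  # Explicit values win over defaults.
    return resolved


old_node = {"op": "MyGeneralUnaryOp"}  # Serialized before "T" existed.
new_node = {"op": "MyGeneralUnaryOp", "attr": {"T": "DT_INT32"}}

print(resolve_attrs(old_node))  # {'T': 'DT_FLOAT'}
print(resolve_attrs(new_node))  # {'T': 'DT_INT32'}
```

Without the default, the old node would have no value for T and could no longer be interpreted against the new definition.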

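Also note that while these changes can maintain GraphDef compatibility, the generated Python code may change in a way that isn't compatible with old callers. The Python API may be kept compatible by careful changes in a hand-written Python wrapper, by keeping the old signature except possibly adding new optional arguments to the end. Generally, incompatible changes may only be made when TensorFlow changes major versions, and must conform to the GraphDef version semantics.

GPU support

You can implement different OpKernels and register one for CPU and another for GPU, just like you can register kernels for different types. There are several examples of kernels with GPU support in tensorflow/core/kernels/. Notice that some kernels have a CPU version in a .cc file, a GPU version in a file ending in _gpu.cu.cc, and some code shared in common in a .h file.

For example, tf.pad has everything but the GPU kernel in tensorflow/core/kernels/pad_op.cc. The GPU kernel is in tensorflow/core/kernels/pad_op_gpu.cu.cc, and the shared code is a templated class defined in tensorflow/core/kernels/pad_op.h. We organize the code this way for two reasons: it allows you to share common code among the CPU and GPU implementations, and it puts the GPU implementation into a separate file so that it can be compiled only by the GPU compiler.

One thing to note: even when the GPU kernel version of pad is used, it still needs its "paddings" input in CPU memory. To mark that inputs or outputs are kept on the CPU, add a HostMemory() call to the kernel registration, e.g.: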

#define REGISTER_GPU_KERNEL(T)                         \
  REGISTER_KERNEL_BUILDER(Name("Pad")                  \
                              .Device(DEVICE_GPU)      \
                              .TypeConstraint<T>("T")  \
                              .HostMemory("paddings"), \
                          PadOp<GPUDevice, T>)

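Compiling the kernel for the GPU device

Look at cuda_op_kernel.cu.cc for an example that uses a CUDA kernel to implement an op. The tf_custom_op_library accepts a gpu_srcs argument in which the list of source files containing the CUDA kernels (*.cu.cc files) can be specified. For use with a binary installation of TensorFlow, the CUDA kernels have to be compiled with NVIDIA's nvcc compiler. Here is the sequence of commands you can use to compile cuda_op_kernel.cu.cc and cuda_op_kernel.cc into a single dynamically loadable library: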

nvcc -std=c++14 -c -o cuda_op_kernel.cu.o cuda_op_kernel.cu.cc \
  ${TF_CFLAGS[@]} -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC

g++ -std=c++14 -shared -o cuda_op_kernel.so cuda_op_kernel.cc \
  cuda_op_kernel.cu.o ${TF_CFLAGS[@]} -fPIC -lcudart ${TF_LFLAGS[@]}

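The cuda_op_kernel.so produced above can be loaded as usual in Python, using the tf.load_op_library function.

Note that if your CUDA libraries are not installed in /usr/local/lib64, you'll need to specify the path explicitly in the second (g++) command above. For example, add -L /usr/local/cuda-8.0/lib64/ if your CUDA is installed in /usr/local/cuda-8.0.

Implement the gradient in Python

Given a graph of ops, TensorFlow uses automatic differentiation (backpropagation) to add new ops representing gradients with respect to the existing ops. To make automatic differentiation work for new ops, you must register a gradient function which computes gradients with respect to the op's inputs given gradients with respect to the op's outputs.

Mathematically, if an op computes \(y = f(x)\), the registered gradient op converts gradients \(\partial L/ \partial y\) of loss \(L\) with respect to \(y\) into gradients \(\partial L/ \partial x\) with respect to \(x\) via the chain rule:

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial f}{\partial x}.\]

In the case of ZeroOut, only one entry in the input affects the output, so the gradient with respect to the input is a sparse "one hot" tensor. This is expressed as follows: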

from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import sparse_ops

@ops.RegisterGradient("ZeroOut")
def _zero_out_grad(op, grad):
  """The gradients for `zero_out`.

  Args:
    op: The `zero_out` `Operation` that we are differentiating, which we can use
      to find the inputs and outputs of the original op.
    grad: Gradient with respect to the output of the `zero_out` op.

  Returns:
    Gradients with respect to the input of `zero_out`.
  """
  to_zero = op.inputs[0]
  shape = array_ops.shape(to_zero)
  index = array_ops.zeros_like(shape)
  first_grad = array_ops.reshape(grad, [-1])[0]
  to_zero_grad = sparse_ops.sparse_to_dense([index], shape, first_grad, 0)
  return [to_zero_grad]  # List of one Tensor, since we have one input
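The "one hot" structure of this gradient can be checked numerically without TensorFlow. The sketch below models ZeroOut on flat Python lists, applies the chain rule by hand, and compares the result with central finite differences of a simple weighted-sum loss. All function names are hypothetical and the loss is chosen only for illustration.

```python
def zero_out(x):
    """Forward pass on a flat list of floats: keep x[0], zero the rest."""
    if not x:
        return []
    return [x[0]] + [0.0] * (len(x) - 1)


def zero_out_grad(x, grad_y):
    """dL/dx given dL/dy: only x[0] influences the output."""
    grad_x = [0.0] * len(x)
    if x:
        grad_x[0] = grad_y[0]
    return grad_x


def numeric_grad(x, weights, eps=1e-6):
    """Central differences of L(x) = sum(w_i * zero_out(x)_i)."""
    def loss(v):
        return sum(w * y for w, y in zip(weights, zero_out(v)))

    grads = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grads.append((loss(hi) - loss(lo)) / (2 * eps))
    return grads


x = [3.0, 1.0, 2.0]
weights = [0.5, -1.0, 4.0]  # For this loss, dL/dy is exactly `weights`.
analytic = zero_out_grad(x, weights)
numeric = numeric_grad(x, weights)
print(analytic)  # [0.5, 0.0, 0.0]
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```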

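Details about registering gradient functions with tf.RegisterGradient:

For an op with one output, the gradient function will take a tf.Operation, op, and a tf.Tensor grad and build new ops out of the tensors op.inputs[i], op.outputs[i], and grad. Information about any attrs can be found via tf.Operation.get_attr.
If the op has multiple outputs, the gradient function will take op and grads, where grads is a list of gradients with respect to each output. The result of the gradient function must be a list of Tensor objects representing the gradients with respect to each input.
If there is no well-defined gradient for some input, such as for integer inputs used as indices, the corresponding returned gradient should be None. For example, for an op taking a floating point tensor x and an integer index i, the gradient function would return [x_grad, None].
If there is no meaningful gradient for the op at all, you often will not have to register any gradient, and as long as the op's gradient is never needed, you will be fine. In some cases, an op has no well-defined gradient but can be involved in the computation of the gradient. Here you can use ops.NotDifferentiable to automatically propagate zeros backwards.

Note that at the time the gradient function is called, only the data flow graph of ops is available, not the tensor data itself. Thus, all computation must be performed using other TensorFlow ops, to be run at graph execution time.

Add type hints when registering the custom gradient for an op type to make the code more readable, debuggable, easier to maintain, and more robust through data validation. For example, when taking an op as a parameter in a function, specify that the gradient function will take a tf.Operation as its parameter type.

Shape functions in C++

The TensorFlow API has a feature called "shape inference" that provides information about the shapes of tensors without having to execute the graph. Shape inference is supported by "shape functions" that are registered for each op type in the C++ REGISTER_OP declaration, and perform two roles: asserting that the shapes of the inputs are compatible during graph construction, and specifying the shapes for the outputs.

Shape functions are defined as operations on the shape_inference::InferenceContext class. For example, in the shape function for ZeroOut: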

    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();
    });

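c->set_output(0, c->input(0)); declares that the first output's shape should be set to the first input's shape. If the output is selected by its index as in the above example, the second parameter of set_output should be a ShapeHandle object. You can create an empty ShapeHandle object by its default constructor. The ShapeHandle object for an input with index idx can be obtained by c->input(idx).

There are a number of common shape functions that apply to many ops, such as shape_inference::UnchangedShape, which can be found in common_shape_fns.h and used as follows: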

REGISTER_OP("ZeroOut")
    .Input("to_zero: int32")
    .Output("zeroed: int32")
    .SetShapeFn(::tensorflow::shape_inference::UnchangedShape);

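A shape function can also constrain the shape of an input. For the version of ZeroOut with a vector shape constraint, the shape function would be as follows: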

    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      ::tensorflow::shape_inference::ShapeHandle input;
      TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 1, &input));
      c->set_output(0, input);
      return Status::OK();
    });

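The WithRank call validates that the input shape c->input(0) has a shape with exactly one dimension (or if the input shape is unknown, the output shape will be a vector with one unknown dimension).

If your op is polymorphic with multiple inputs, you can use members of InferenceContext to determine the number of shapes to check, and Merge to validate that the shapes are all compatible (alternatively, access attributes that indicate the lengths, with InferenceContext::GetAttr, which provides access to the attributes of the op).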

    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      ::tensorflow::shape_inference::ShapeHandle input;
      ::tensorflow::shape_inference::ShapeHandle output;
      for (size_t i = 0; i < c->num_inputs(); ++i) {
        TF_RETURN_IF_ERROR(c->WithRank(c->input(i), 2, &input));
        TF_RETURN_IF_ERROR(c->Merge(output, input, &output));
      }
      c->set_output(0, output);
      return Status::OK();
    });

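Since shape inference is an optional feature, and the shapes of tensors may vary dynamically, shape functions must be robust to incomplete shape information for any of the inputs. The Merge method in InferenceContext allows the caller to assert that two shapes are the same, even if either or both of them do not have complete information. Shape functions are defined for all of the core TensorFlow ops and provide many different usage examples.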
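How merging works with incomplete information can be sketched in plain Python, using None for an unknown dimension and None in place of a whole shape for an unknown rank. This models the idea behind WithRank and Merge only; `with_rank`, `merge`, and `infer_output` are hypothetical helpers, not the InferenceContext API.

```python
def with_rank(shape, rank):
    """Require a given rank; an unknown shape becomes `rank` unknown dims."""
    if shape is None:
        return [None] * rank
    if len(shape) != rank:
        raise ValueError(f"Shape must be rank {rank} but is rank {len(shape)}")
    return list(shape)


def merge(a, b):
    """Combine two partially known shapes, failing on a definite mismatch."""
    if a is None:
        return None if b is None else list(b)
    if b is None:
        return list(a)
    if len(a) != len(b):
        raise ValueError("Shapes have different ranks")
    merged = []
    for dim_a, dim_b in zip(a, b):
        if dim_a is not None and dim_b is not None and dim_a != dim_b:
            raise ValueError(f"Dimensions {dim_a} and {dim_b} are not compatible")
        merged.append(dim_a if dim_a is not None else dim_b)
    return merged


def infer_output(input_shapes):
    """Model of the loop above: every input is rank 2 and all must agree."""
    output = None  # Starts fully unknown.
    for shape in input_shapes:
        output = merge(output, with_rank(shape, 2))
    return output


print(merge([None, 3], [2, None]))                # [2, 3]
print(infer_output([[None, 3], [2, None], None])) # [2, 3]
```

Each input contributes whatever it knows, and an error is raised only when two inputs definitely disagree.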

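The InferenceContext class has a number of functions that can be used to define shape function manipulations. For example, you can validate that a particular dimension has a very specific value using InferenceContext::Dim and InferenceContext::WithValue; you can specify that an output dimension is the sum / product of two input dimensions using InferenceContext::Add and InferenceContext::Multiply. See the InferenceContext class for all of the various shape manipulations you can specify. The following example sets the shape of the first output to (n, 3), where the first input has shape (n, ...):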

.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
    c->set_output(0, c->Matrix(c->Dim(c->input(0), 0), 3));
    return Status::OK();
});

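If you have a complicated shape function, you should consider adding a test for validating that various input shape combinations produce the expected output shape combinations. You can see examples of how to write these tests in some of our core ops tests. (The syntax of INFER_OK and INFER_ERROR is a little cryptic, but it tries to be compact in representing input and output shape specifications in tests. For now, see the surrounding comments in those tests to get a sense of the shape string specification.)

Build a pip package for your custom op

To build a pip package for your op, see the tensorflow/custom-op example. This guide shows how to build custom ops from the TensorFlow pip package instead of building TensorFlow from source.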