将 tf.feature_columns 迁移到 Keras 预处理层

tf.feature_columns to Keras preprocessing layers" />
在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本

训练模型通常需要进行一定程度的特征预处理,尤其是在处理结构化数据时。在 TensorFlow 1 中训练 tf.estimator.Estimator 时,通常使用 tf.feature_column API 进行特征预处理。在 TensorFlow 2 中,您可以直接使用 Keras 预处理层来完成此操作。

本迁移指南演示了使用特征列和预处理层进行的常见特征转换,然后使用这两个 API 训练一个完整的模型。

首先,从一些必要的导入开始

import tensorflow as tf
import tensorflow.compat.v1 as tf1
import math
2024-01-17 02:27:02.309609: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-17 02:27:02.309651: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-17 02:27:02.311142: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

现在,添加一个用于调用特征列的实用函数以进行演示

def call_feature_columns(feature_columns, inputs):
  # This is a convenient way to call a `feature_column` outside of an estimator
  # to display its output.
  feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)
  return feature_layer(inputs)

输入处理

要将特征列与估计器一起使用,模型输入始终需要是张量字典

input_dict = {
  'foo': tf.constant([1]),
  'bar': tf.constant([0]),
  'baz': tf.constant([-1])
}

每个特征列都需要使用一个键来索引到源数据中。所有特征列的输出将被连接起来,并由估计器模型使用。

columns = [
  tf1.feature_column.numeric_column('foo'),
  tf1.feature_column.numeric_column('bar'),
  tf1.feature_column.numeric_column('baz'),
]
call_feature_columns(columns, input_dict)
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/3124623333.py:2: numeric_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[ 0., -1.,  1.]], dtype=float32)>

在 Keras 中,模型输入更加灵活。一个 tf.keras.Model 可以处理单个张量输入、张量特征列表或张量特征字典。您可以通过在模型创建时传递 tf.keras.Input 字典来处理字典输入。输入不会自动连接,这使得它们可以以更加灵活的方式使用。它们可以使用 tf.keras.layers.Concatenate 连接。

inputs = {
  'foo': tf.keras.Input(shape=()),
  'bar': tf.keras.Input(shape=()),
  'baz': tf.keras.Input(shape=()),
}
# Inputs are typically transformed by preprocessing layers before concatenation.
outputs = tf.keras.layers.Concatenate()(inputs.values())
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model(input_dict)
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 1.,  0., -1.], dtype=float32)>

将整数 ID 进行独热编码

一种常见的特征转换是将已知范围的整数输入进行独热编码。以下是一个使用特征列的示例

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=3)
indicator_col = tf1.feature_column.indicator_column(categorical_col)
call_feature_columns(indicator_col, {'type': [0, 1, 2]})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/1369923821.py:1: categorical_column_with_identity (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/1369923821.py:3: indicator_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

使用 Keras 预处理层,这些列可以用单个 tf.keras.layers.CategoryEncoding 层替换,并将 output_mode 设置为 'one_hot'

one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=3, output_mode='one_hot')
one_hot_layer([0, 1, 2])
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

规范化数值特征

在使用特征列处理连续的浮点特征时,您需要使用 tf.feature_column.numeric_column。如果输入已经过规范化,则将其转换为 Keras 非常简单。您可以直接将 tf.keras.Input 用到您的模型中,如上所示。

numeric_column 也可以用于规范化输入

def normalize(x):
  mean, variance = (2.0, 1.0)
  return (x - mean) / math.sqrt(variance)
numeric_col = tf1.feature_column.numeric_column('col', normalizer_fn=normalize)
call_feature_columns(numeric_col, {'col': tf.constant([[0.], [1.], [2.]])})
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[-2.],
       [-1.],
       [ 0.]], dtype=float32)>

相反,在 Keras 中,可以使用 tf.keras.layers.Normalization 进行这种规范化。

normalization_layer = tf.keras.layers.Normalization(mean=2.0, variance=1.0)
normalization_layer(tf.constant([[0.], [1.], [2.]]))
<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[-2.],
       [-1.],
       [ 0.]], dtype=float32)>

将数值特征分桶并进行独热编码

对连续的浮点输入进行的另一种常见转换是将其分桶为固定范围的整数。

在特征列中,这可以通过 tf.feature_column.bucketized_column 实现

numeric_col = tf1.feature_column.numeric_column('col')
bucketized_col = tf1.feature_column.bucketized_column(numeric_col, [1, 4, 5])
call_feature_columns(bucketized_col, {'col': tf.constant([1., 2., 3., 4., 5.])})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/3043215186.py:2: bucketized_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

在 Keras 中,这可以用 tf.keras.layers.Discretization 替换

discretization_layer = tf.keras.layers.Discretization(bin_boundaries=[1, 4, 5])
one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=4, output_mode='one_hot')
one_hot_layer(discretization_layer([1., 2., 3., 4., 5.]))
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

使用词汇表对字符串数据进行独热编码

处理字符串特征通常需要进行词汇表查找,以将字符串转换为索引。以下是一个使用特征列查找字符串,然后对索引进行独热编码的示例

vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'sizes',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
indicator_col = tf1.feature_column.indicator_column(vocab_col)
call_feature_columns(indicator_col, {'sizes': ['small', 'medium', 'large']})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/2845961037.py:1: categorical_column_with_vocabulary_list (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

使用 Keras 预处理层,使用 tf.keras.layers.StringLookup 层,并将 output_mode 设置为 'one_hot'

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'],
    num_oov_indices=0,
    output_mode='one_hot')
string_lookup_layer(['small', 'medium', 'large'])
<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)>

使用词汇表嵌入字符串数据

对于较大的词汇表,通常需要嵌入才能获得良好的性能。以下是如何使用特征列嵌入字符串特征的示例。

vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'col',
    vocabulary_list=['small', 'medium', 'large'],
    num_oov_buckets=0)
embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)
call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/999372599.py:5: embedding_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[ 0.19490445, -0.41044798, -0.5603343 ,  0.32616043],
       [-0.03856034,  0.45223498,  0.15508214,  0.02957107],
       [-0.20157112,  0.27702358, -0.079571  , -0.6845111 ]],
      dtype=float32)>

使用 Keras 预处理层,可以通过组合 tf.keras.layers.StringLookup 层和 tf.keras.layers.Embedding 层来实现。 StringLookup 的默认输出将是整数索引,可以直接馈送到嵌入中。

string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)
embedding = tf.keras.layers.Embedding(3, 4)
embedding(string_lookup_layer(['small', 'medium', 'large']))
<tf.Tensor: shape=(3, 4), dtype=float32, numpy=
array([[-0.01309206, -0.01742087,  0.04137394,  0.03209827],
       [ 0.009628  ,  0.03649005,  0.0101964 , -0.02682712],
       [ 0.02947212,  0.04911048,  0.00290644, -0.00840418]],
      dtype=float32)>

对加权分类数据求和

在某些情况下,您需要处理分类数据,其中每个类别的出现都带有关联的权重。在特征列中,这是通过 tf.feature_column.weighted_categorical_column 处理的。当与 indicator_column 配合使用时,这将对每个类别求和权重。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_categorical_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)
call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/3529191023.py:6: weighted_categorical_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4033: sparse_merge (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
No similar op available at this time.
<tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[0. , 0. , 0. , 0. , 0. , 1.2, 0. , 0. , 0. , 0. , 0. , 1.5, 0. ,

        0. , 0. , 0. , 0. , 2. , 0. , 0. ]], dtype=float32)>

在 Keras 中,可以通过将 count_weights 输入传递给 tf.keras.layers.CategoryEncoding 以及 output_mode='count' 来实现。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# Using sparse output is more efficient when `num_tokens` is large.
count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
tf.sparse.to_dense(count_layer(ids, count_weights=weights))
<tf.Tensor: shape=(1, 20), dtype=float32, numpy=
array([[0. , 0. , 0. , 0. , 0. , 1.2, 0. , 0. , 0. , 0. , 0. , 1.5, 0. ,

        0. , 0. , 0. , 0. , 2. , 0. , 0. ]], dtype=float32)>

嵌入加权分类数据

您可能想要嵌入加权分类输入。在特征列中, embedding_column 包含一个 combiner 参数。如果任何样本包含一个类别的多个条目,它们将根据参数设置(默认情况下为 'mean')进行组合。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_categorical_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
embedding_col = tf1.feature_column.embedding_column(
    weighted_categorical_col, 4, combiner='mean')
call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})
<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[ 0.3714184 , -0.01248874, -0.3219157 , -0.39734876]],
      dtype=float32)>

在 Keras 中, tf.keras.layers.Embedding 没有 combiner 选项,但您可以使用 tf.keras.layers.Dense 实现相同的效果。上面的 embedding_column 只是根据类别权重线性组合嵌入向量。虽然乍一看并不明显,但它与将您的分类输入表示为大小为 (num_tokens) 的稀疏权重向量,并将它们乘以形状为 (embedding_size, num_tokens)Dense 内核完全等效。

ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# For `combiner='mean'`, normalize your weights to sum to 1. Removing this line
# would be equivalent to an `embedding_column` with `combiner='sum'`.
weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)

count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
embedding_layer = tf.keras.layers.Dense(4, use_bias=False)
embedding_layer(count_layer(ids, count_weights=weights))
<tf.Tensor: shape=(1, 4), dtype=float32, numpy=
array([[-0.24909994, -0.02287644,  0.06182466, -0.11370523]],
      dtype=float32)>

完整的训练示例

为了展示完整的训练工作流程,首先准备一些具有三种不同类型特征的数据

features = {
    'type': [0, 1, 1],
    'size': ['small', 'small', 'medium'],
    'weight': [2.7, 1.8, 1.6],
}
labels = [1, 1, 0]
predict_features = {'type': [0], 'size': ['foo'], 'weight': [-0.7]}

为 TensorFlow 1 和 TensorFlow 2 工作流程定义一些通用常量

vocab = ['small', 'medium', 'large']
one_hot_dims = 3
embedding_dims = 4
weight_mean = 2.0
weight_variance = 1.0

使用特征列

特征列必须在创建时作为列表传递给估计器,并且将在训练期间隐式调用。

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=one_hot_dims)
# Convert index to one-hot; e.g., [2] -> [0,0,1].
indicator_col = tf1.feature_column.indicator_column(categorical_col)

# Convert strings to indices; e.g., ['small'] -> [1].
vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'size', vocabulary_list=vocab, num_oov_buckets=1)
# Embed the indices.
embedding_col = tf1.feature_column.embedding_column(vocab_col, embedding_dims)

normalizer_fn = lambda x: (x - weight_mean) / math.sqrt(weight_variance)
# Normalize the numeric inputs; e.g., [2.0] -> [0.0].
numeric_col = tf1.feature_column.numeric_column(
    'weight', normalizer_fn=normalizer_fn)

estimator = tf1.estimator.DNNClassifier(
    feature_columns=[indicator_col, embedding_col, numeric_col],
    hidden_units=[1])

def _input_fn():
  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)

estimator.train(_input_fn)
WARNING:tensorflow:From /tmpfs/tmp/ipykernel_19805/1997355744.py:17: DNNClassifier.__init__ (from tensorflow_estimator.python.estimator.canned.dnn) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py:807: Estimator.__init__ (from tensorflow_estimator.python.estimator.estimator) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1844: RunConfig.__init__ (from tensorflow_estimator.python.estimator.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /tmpfs/tmp/tmpm02fidwe
INFO:tensorflow:Using config: {'_model_dir': '/tmpfs/tmp/tmpm02fidwe', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py:446: dnn_logit_fn_builder (from tensorflow_estimator.python.estimator.canned.dnn) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/adagrad.py:138: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/model_fn.py:250: EstimatorSpec.__new__ (from tensorflow_estimator.python.estimator.model_fn) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Done calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1416: NanTensorHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1419: LoggingTensorHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/basic_session_run_hooks.py:232: SecondOrStepTimer.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1456: CheckpointSaverHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Create CheckpointSaverHook.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:579: StepCounterHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:586: SummarySaverHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2024-01-17 02:27:09.322220: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT64
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT32
    }
  }
}

    for Tuple type infernce function 0
    while inferring type of node 'dnn/zero_fraction/cond/output/_18'
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmpfs/tmp/tmpm02fidwe/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1455: SessionRunArgs.__new__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1454: SessionRunContext.__init__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1474: SessionRunValues.__new__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:loss = 0.6289518, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 3...
INFO:tensorflow:Saving checkpoints for 3 into /tmpfs/tmp/tmpm02fidwe/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 3...
INFO:tensorflow:Loss for final step: 0.81661654.
<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifier at 0x7f4edc48e8b0>

特征列还将用于在模型上运行推理时转换输入数据。

def _predict_fn():
  return tf1.data.Dataset.from_tensor_slices(predict_features).batch(1)

next(estimator.predict(_predict_fn))
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:596: ClassificationOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:1307: RegressionOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:1309: PredictOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmpfs/tmp/tmpm02fidwe/model.ckpt-3
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'logits': array([0.57964283], dtype=float32),
 'logistic': array([0.6409852], dtype=float32),
 'probabilities': array([0.35901475, 0.6409852 ], dtype=float32),
 'class_ids': array([1]),
 'classes': array([b'1'], dtype=object),
 'all_class_ids': array([0, 1], dtype=int32),
 'all_classes': array([b'0', b'1'], dtype=object)}

使用 Keras 预处理层

Keras 预处理层在调用位置方面更加灵活。可以将层直接应用于张量,在 tf.data 输入管道内使用,或直接构建到可训练的 Keras 模型中。

在本示例中,您将在 tf.data 输入管道内应用预处理层。为此,您可以定义一个单独的 tf.keras.Model 来预处理您的输入特征。此模型不可训练,但它是一种将预处理层分组的便捷方法。

inputs = {
  'type': tf.keras.Input(shape=(), dtype='int64'),
  'size': tf.keras.Input(shape=(), dtype='string'),
  'weight': tf.keras.Input(shape=(), dtype='float32'),
}
# Convert index to one-hot; e.g., [2] -> [0,0,1].
type_output = tf.keras.layers.CategoryEncoding(
      one_hot_dims, output_mode='one_hot')(inputs['type'])
# Convert size strings to indices; e.g., ['small'] -> [1].
size_output = tf.keras.layers.StringLookup(vocabulary=vocab)(inputs['size'])
# Normalize the numeric inputs; e.g., [2.0] -> [0.0].
weight_output = tf.keras.layers.Normalization(
      axis=None, mean=weight_mean, variance=weight_variance)(inputs['weight'])
outputs = {
  'type': type_output,
  'size': size_output,
  'weight': weight_output,
}
preprocessing_model = tf.keras.Model(inputs, outputs)

您现在可以在对 tf.data.Dataset.map 的调用中应用此模型。请注意,传递给 map 的函数将自动转换为 tf.function,并且编写 tf.function 代码的常见注意事项适用(没有副作用)。

# Apply the preprocessing in tf.data.Dataset.map.
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)
dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),
                      num_parallel_calls=tf.data.AUTOTUNE)
# Display a preprocessed input sample.
next(dataset.take(1).as_numpy_iterator())
({'type': array([[1., 0., 0.]], dtype=float32),
  'size': array([1]),
  'weight': array([0.70000005], dtype=float32)},
 array([1], dtype=int32))

接下来,您可以定义一个单独的 Model,其中包含可训练层。请注意,此模型的输入现在反映了预处理后的特征类型和形状。

inputs = {
  'type': tf.keras.Input(shape=(one_hot_dims,), dtype='float32'),
  'size': tf.keras.Input(shape=(), dtype='int64'),
  'weight': tf.keras.Input(shape=(), dtype='float32'),
}
# Since the embedding is trainable, it needs to be part of the training model.
embedding = tf.keras.layers.Embedding(len(vocab), embedding_dims)
outputs = tf.keras.layers.Concatenate()([
  inputs['type'],
  embedding(inputs['size']),
  tf.expand_dims(inputs['weight'], -1),
])
outputs = tf.keras.layers.Dense(1)(outputs)
training_model = tf.keras.Model(inputs, outputs)

您现在可以使用 tf.keras.Model.fit 训练 training_model

# Train on the preprocessed data.
training_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
training_model.fit(dataset)
3/3 [==============================] - 1s 5ms/step - loss: 0.8194
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1705458433.603835   19973 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
<keras.src.callbacks.History at 0x7f4e700e7a30>

最后,在推理时,将这些独立阶段组合成一个处理原始特征输入的单个模型可能很有用。

inputs = preprocessing_model.input
outputs = training_model(preprocessing_model(inputs))
inference_model = tf.keras.Model(inputs, outputs)

predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)
inference_model.predict(predict_dataset)
1/1 [==============================] - 0s 102ms/step
array([[-0.95717776]], dtype=float32)

此组合模型可以保存为 .keras 文件,以便以后使用。

inference_model.save('model.keras')
restored_model = tf.keras.models.load_model('model.keras')
restored_model.predict(predict_dataset)
1/1 [==============================] - 0s 79ms/step
array([[-0.95717776]], dtype=float32)

特征列等效表

为了参考,以下是特征列和 Keras 预处理层之间的近似对应关系

特征列 Keras 层
tf.feature_column.bucketized_column tf.keras.layers.Discretization
tf.feature_column.categorical_column_with_hash_bucket tf.keras.layers.Hashing
tf.feature_column.categorical_column_with_identity tf.keras.layers.CategoryEncoding
tf.feature_column.categorical_column_with_vocabulary_file tf.keras.layers.StringLookuptf.keras.layers.IntegerLookup
tf.feature_column.categorical_column_with_vocabulary_list tf.keras.layers.StringLookuptf.keras.layers.IntegerLookup
tf.feature_column.crossed_column tf.keras.layers.experimental.preprocessing.HashedCrossing
tf.feature_column.embedding_column tf.keras.layers.Embedding
tf.feature_column.indicator_column output_mode='one_hot'output_mode='multi_hot'*
tf.feature_column.numeric_column tf.keras.layers.Normalization
tf.feature_column.sequence_categorical_column_with_hash_bucket tf.keras.layers.Hashing
tf.feature_column.sequence_categorical_column_with_identity tf.keras.layers.CategoryEncoding
tf.feature_column.sequence_categorical_column_with_vocabulary_file tf.keras.layers.StringLookuptf.keras.layers.IntegerLookuptf.keras.layer.TextVectorization
tf.feature_column.sequence_categorical_column_with_vocabulary_list tf.keras.layers.StringLookuptf.keras.layers.IntegerLookuptf.keras.layer.TextVectorization
tf.feature_column.sequence_numeric_column tf.keras.layers.Normalization
tf.feature_column.weighted_categorical_column tf.keras.layers.CategoryEncoding

* 可以将 output_mode 传递给 tf.keras.layers.CategoryEncodingtf.keras.layers.StringLookuptf.keras.layers.IntegerLookuptf.keras.layers.TextVectorization

tf.keras.layers.TextVectorization 可以直接处理自由格式文本输入(例如,整个句子或段落)。这不是 TensorFlow 1 中分类序列处理的一对一替换,但可以为临时文本预处理提供便捷的替换。

后续步骤