Making predictions

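Welcome to the Prediction Colab for TensorFlow Decision Forests (TF-DF). In this Colab, you will learn several ways to generate predictions with a previously trained TF-DF model using the Python API.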

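Remark: The Python API shown in this Colab is simple to use and well suited for experimentation. However, other APIs, such as TensorFlow Serving and the C++ API, are better suited for production systems because they are faster and more stable. The complete list of Serving APIs is given in the TF-DF documentation.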

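In this Colab, you will:

Use the model.predict() function on a TensorFlow Dataset created with pd_dataframe_to_tf_dataset.
Use the model.predict() function on a TensorFlow Dataset created manually.
Use the model.predict() function on NumPy arrays.
Make predictions with the CLI API.
Benchmark the inference speed of a model with the CLI API.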

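Important remark

The dataset used for predictions should have the same feature names and types as the dataset used for training. Otherwise, an error will likely be raised.

For example, training a model with two features f1 and f2 and then trying to generate predictions on a dataset without f2 will fail. Note that setting some or all feature values to "missing" is fine. Similarly, training a model where f2 is a numerical feature (e.g., float32) and applying it to a dataset where f2 is a text feature (e.g., string) will fail.

Although the Keras API abstracts this away, a model instantiated in Python (e.g., with tfdf.keras.RandomForestModel()) and a model loaded from disk (e.g., with tf_keras.models.load_model()) can behave differently. Notably, a model instantiated in Python automatically applies the necessary type conversions. For example, if a float64 feature is fed to a model expecting a float32 feature, the conversion is performed implicitly. Such a conversion is not possible for models loaded from disk. It is therefore important that the training data and the inference data always have exactly the same types.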
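
The constraints above can be checked before data reaches a model loaded from disk. The sketch below is illustrative only: align_features and expected_dtypes are hypothetical names, not part of TF-DF, and the sketch assumes the features are held as a dictionary of NumPy arrays. It fails early on a missing feature or on a text/numerical mismatch, and it casts numerical features (for example float64 to float32) explicitly.

```python
import numpy as np

NUMERICAL_KINDS = "fiu"  # NumPy dtype kinds: float, signed int, unsigned int.


def align_features(features, expected_dtypes):
    """Casts `features` to the dtypes used at training time.

    Hypothetical helper, not part of TF-DF.
    features: dict mapping feature name -> array-like of values.
    expected_dtypes: dict mapping feature name -> NumPy dtype seen in training.
    """
    missing = sorted(set(expected_dtypes) - set(features))
    if missing:
        raise KeyError(f"Features missing from the dataset: {missing}")

    aligned = {}
    for name, dtype in expected_dtypes.items():
        values = np.asarray(features[name])
        expected = np.dtype(dtype)
        if values.dtype == expected:
            aligned[name] = values
        elif values.dtype.kind in NUMERICAL_KINDS and expected.kind in NUMERICAL_KINDS:
            # Explicit numerical cast, e.g. float64 -> float32.
            aligned[name] = values.astype(expected)
        else:
            raise TypeError(
                f"Feature {name!r} has dtype {values.dtype}, expected {expected}")
    return aligned


expected = {"f1": np.float32, "f2": np.float32}
serving = {"f1": np.random.rand(4), "f2": np.random.rand(4)}  # float64 arrays
aligned = align_features(serving, expected)
print({name: values.dtype for name, values in aligned.items()})
```

Doing the cast explicitly keeps the behavior identical for models instantiated in Python and models loaded from disk.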

Setup

First, we install TensorFlow Decision Forests...

pip install tensorflow_decision_forests

... and import the libraries used in this example.

import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

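The `model.predict(...)` and `pd_dataframe_to_tf_dataset` functions

TensorFlow Decision Forests implements the Keras model API. As such, TF-DF models have a predict function to generate predictions. This function takes a TensorFlow Dataset as input and outputs an array of predictions. The simplest way to create a TensorFlow dataset is to use Pandas and the tfdf.keras.pd_dataframe_to_tf_dataset(...) function.

The following example shows how to create a TensorFlow dataset using pd_dataframe_to_tf_dataset.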

pd_dataset = pd.DataFrame({
    "feature_1": [1,2,3],
    "feature_2": ["a", "b", "c"],
    "label": [0, 1, 0],
})

pd_dataset

tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_dataset, label="label")

for features, label in tf_dataset:
  print("Features:",features)
  print("label:", label)

Features: {'feature_1': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 2, 3])>, 'feature_2': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'a', b'b', b'c'], dtype=object)>}
label: tf.Tensor([0 1 0], shape=(3,), dtype=int64)
2024-04-20 11:14:51.301980: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

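Note: "pd" stands for "pandas" and "tf" stands for "TensorFlow".

A TensorFlow Dataset is a function that outputs a sequence of values. These values can be simple arrays (called Tensors) or arrays organized into a structure (for example, arrays organized in a dictionary).

The following example shows training and inference (using predict) on a toy dataset: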

# Creating a training dataset in Pandas
pd_train_dataset = pd.DataFrame({
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000),
})
pd_train_dataset["label"] = pd_train_dataset["feature_1"] > pd_train_dataset["feature_2"] 

pd_train_dataset

# Creating a serving dataset with Pandas
pd_serving_dataset = pd.DataFrame({
    "feature_1": np.random.rand(500),
    "feature_2": np.random.rand(500),
})

pd_serving_dataset

Let's convert the Pandas dataframes into TensorFlow datasets:

tf_train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_train_dataset, label="label")
tf_serving_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_serving_dataset)

We can now train a model on tf_train_dataset:

model = tfdf.keras.RandomForestModel(verbose=0)
model.fit(tf_train_dataset)

[INFO 24-04-20 11:14:55.1176 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpdosbv775/model/ with prefix 951a85e27c8d4048
[INFO 24-04-20 11:14:55.1550 UTC decision_forest.cc:734] Model loaded with 300 root(s), 12674 node(s), and 2 input feature(s).
[INFO 24-04-20 11:14:55.1551 UTC abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 24-04-20 11:14:55.1551 UTC kernel.cc:1061] Use fast generic engine
<tf_keras.src.callbacks.History at 0x7f96c017a7f0>

And then generate predictions on tf_serving_dataset:

# Print the first 10 predictions.
model.predict(tf_serving_dataset, verbose=0)[:10]

array([[0.57999957],
       [0.13666661],
       [0.68666613],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00333333]], dtype=float32)
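
The model above is a binary classifier, and each prediction is a single value per example. We read it here as the probability of the positive class, which is an assumption about the output layout that is consistent with the values printed above. A minimal sketch of turning such values into class decisions follows; the 0.5 threshold is an arbitrary choice for illustration, and the array simply copies the ten values shown above.

```python
import numpy as np

# The ten values printed above, shape (10, 1).
predictions = np.array(
    [[0.57999957], [0.13666661], [0.68666613], [0.0], [0.0],
     [0.0], [0.0], [0.0], [0.0], [0.00333333]], dtype=np.float32)

threshold = 0.5  # Illustrative choice, not prescribed by TF-DF.
predicted_labels = predictions[:, 0] >= threshold  # Boolean vector, shape (10,).
print(predicted_labels)
```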

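`model.predict(...)` and manual TF datasets

In the previous section, we showed how to create a TF dataset using the pd_dataframe_to_tf_dataset function. This option is simple but poorly suited for large datasets. As an alternative, TensorFlow offers several other ways to create a TensorFlow dataset. The following example shows how to create a dataset using the tf.data.Dataset.from_tensor_slices() function.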

dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4,5])

for value in dataset:
  print("value:", value.numpy())

value: 1
value: 2
value: 3
value: 4
value: 5
2024-04-20 11:14:59.117255: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

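TensorFlow models are trained with mini-batching: instead of being fed one at a time, examples are grouped into "batches". For neural networks, the batch size affects the quality of the model, and the user has to determine the best value during training. For decision forests, the batch size has no effect on the model. However, for compatibility reasons, TensorFlow Decision Forests expects the dataset to be batched. Batching is done with the batch() function.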

dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4,5]).batch(2)

for value in dataset:
  print("value:", value.numpy())

value: [1 2]
value: [3 4]
value: [5]
2024-04-20 11:14:59.134734: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
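
To make the grouping above concrete without TensorFlow, here is a plain-Python sketch of the same idea. The batch function below is a hypothetical stand-in written for illustration; it is not how tf.data implements batching, it only reproduces the grouping seen in the output above.

```python
def batch(values, batch_size):
    """Yields consecutive groups of at most `batch_size` items."""
    current = []
    for value in values:
        current.append(value)
        if len(current) == batch_size:
            yield current
            current = []
    if current:  # The last batch may be smaller, as with [5] above.
        yield current


batches = list(batch([1, 2, 3, 4, 5], batch_size=2))
print(batches)  # [[1, 2], [3, 4], [5]]
```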

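TensorFlow Decision Forests expects the dataset to have one of the following two structures:

features, label
features, label, weights

The features can be a single two-dimensional array (where each column is a feature and each row is an example) or a dictionary of arrays.

The following are examples of datasets compatible with TensorFlow Decision Forests: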

# A dataset with a single 2d array.
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ([[1,2],[3,4],[5,6]], # Features
    [0,1,0], # Label
    )).batch(2)

for features, label in tf_dataset:
  print("features:", features)
  print("label:", label)

features: tf.Tensor(
[[1 2]
 [3 4]], shape=(2, 2), dtype=int32)
label: tf.Tensor([0 1], shape=(2,), dtype=int32)
features: tf.Tensor([[5 6]], shape=(1, 2), dtype=int32)
label: tf.Tensor([0], shape=(1,), dtype=int32)
2024-04-20 11:14:59.152655: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence

# A dataset with a dictionary of features.
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": [1,2,3],
    "feature_2": [4,5,6],
    },
    [0,1,0], # Label
    )).batch(2)

for features, label in tf_dataset:
  print("features:", features)
  print("label:", label)

features: {'feature_1': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, 'feature_2': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([4, 5], dtype=int32)>}
label: tf.Tensor([0 1], shape=(2,), dtype=int32)
features: {'feature_1': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([3], dtype=int32)>, 'feature_2': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([6], dtype=int32)>}
label: tf.Tensor([0], shape=(1,), dtype=int32)
2024-04-20 11:14:59.171912: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
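
The two feature layouts hold the same kind of information. The NumPy sketch below converts one into the other, using the values of the dictionary example above; the column names in feature_names are hypothetical, since a plain 2D array carries no names of its own.

```python
import numpy as np

features_2d = np.array([[1, 4], [2, 5], [3, 6]])  # One row per example.
feature_names = ["feature_1", "feature_2"]  # Hypothetical column names.

# 2D array -> dictionary of arrays: one entry per column.
features_dict = {
    name: features_2d[:, index] for index, name in enumerate(feature_names)
}

# Dictionary of arrays -> 2D array: stack the columns back in a fixed order.
rebuilt = np.stack([features_dict[name] for name in feature_names], axis=1)
print(features_dict)
print(rebuilt)
```

The dictionary layout keeps feature names attached to the data, which makes a mismatch between training and serving features easier to spot.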

Let's train a model using the second option (a dictionary of features).

tf_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    },
    np.random.rand(100) >= 0.5, # Label
    )).batch(2)

model = tfdf.keras.RandomForestModel(verbose=0)
model.fit(tf_dataset)

[INFO 24-04-20 11:14:59.3750 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp208me4tj/model/ with prefix b7fe9aaae54944c5
[INFO 24-04-20 11:14:59.3979 UTC decision_forest.cc:734] Model loaded with 300 root(s), 7574 node(s), and 2 input feature(s).
[INFO 24-04-20 11:14:59.3979 UTC kernel.cc:1061] Use fast generic engine
<tf_keras.src.callbacks.History at 0x7f968c13e8e0>

The predict function can be used directly on the training dataset:

# The first 10 predictions.
model.predict(tf_dataset, verbose=0)[:10]

array([[0.9366659 ],
       [0.42999968],
       [0.9266659 ],
       [0.31999978],
       [0.70999944],
       [0.2133332 ],
       [0.13333328],
       [0.836666  ],
       [0.10666663],
       [0.53333294]], dtype=float32)

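`model.predict(...)` and `model.predict_on_batch()` on dictionaries

In some cases, the predict function can be used with an array (or a dictionary of arrays) instead of a TensorFlow Dataset.

The following example uses the previously trained model with a dictionary of NumPy arrays.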

# The first 10 predictions.
model.predict({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    }, verbose=0)[:10]

array([[0.5366663 ],
       [0.19666655],
       [0.2233332 ],
       [0.99999917],
       [0.3233331 ],
       [0.3866664 ],
       [0.71999943],
       [0.40666637],
       [0.73333275],
       [0.10999996]], dtype=float32)

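In the previous example, the arrays are batched automatically. Alternatively, the predict_on_batch function can be used to make sure that all the examples are run in the same batch.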

# The first 10 predictions.
model.predict_on_batch({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    })[:10]

array([[0.3433331 ],
       [0.42333302],
       [0.9466659 ],
       [0.38333306],
       [0.21666653],
       [0.10999996],
       [0.09333331],
       [0.23999985],
       [0.13999994],
       [0.36999974]], dtype=float32)

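Inference with the YDF format

This example shows how to run a trained TF-DF model with the CLI API (one of the other Serving APIs). We will also use the benchmark tool to measure the inference speed of the model.

Let's start by training and saving a model: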

model = tfdf.keras.GradientBoostedTreesModel(verbose=0)
model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(pd_train_dataset, label="label"))
model.save("my_model")

[WARNING 24-04-20 11:15:00.0298 UTC gradient_boosted_trees.cc:1840] "goss_alpha" set but "sampling_method" not equal to "GOSS".
[WARNING 24-04-20 11:15:00.0299 UTC gradient_boosted_trees.cc:1851] "goss_beta" set but "sampling_method" not equal to "GOSS".
[WARNING 24-04-20 11:15:00.0299 UTC gradient_boosted_trees.cc:1865] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".
[INFO 24-04-20 11:15:00.4645 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp_gpxt9u3/model/ with prefix 307d0dfd7bcd4058
[INFO 24-04-20 11:15:00.4725 UTC quick_scorer_extended.cc:911] The binary was compiled without AVX2 support, but your CPU supports it. Enable it for faster model inference.
[INFO 24-04-20 11:15:00.4729 UTC kernel.cc:1061] Use fast generic engine
INFO:tensorflow:Assets written to: my_model/assets
INFO:tensorflow:Assets written to: my_model/assets

Let's also export the dataset to a CSV file:

pd_serving_dataset.to_csv("dataset.csv")

Let's download and extract the Yggdrasil Decision Forests CLI tools.

wget https://github.com/google/yggdrasil-decision-forests/releases/download/1.0.0/cli_linux.zip
unzip cli_linux.zip

--2024-04-20 11:15:01--  https://github.com/google/yggdrasil-decision-forests/releases/download/1.0.0/cli_linux.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/360444739/bfcd0b9d-5cbc-42a8-be0a-02131875f9a6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240420T111501Z&X-Amz-Expires=300&X-Amz-Signature=01381b3c5a69d831a4be54e2fef635b848ca9b5aaeeac6822698c6acf5f93240&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=360444739&response-content-disposition=attachment%3B%20filename%3Dcli_linux.zip&response-content-type=application%2Foctet-stream [following]
--2024-04-20 11:15:01--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/360444739/bfcd0b9d-5cbc-42a8-be0a-02131875f9a6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240420T111501Z&X-Amz-Expires=300&X-Amz-Signature=01381b3c5a69d831a4be54e2fef635b848ca9b5aaeeac6822698c6acf5f93240&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=360444739&response-content-disposition=attachment%3B%20filename%3Dcli_linux.zip&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31516027 (30M) [application/octet-stream]
Saving to: ‘cli_linux.zip’

cli_linux.zip       100%[===================>]  30.06M   174MB/s    in 0.2s    

2024-04-20 11:15:01 (174 MB/s) - ‘cli_linux.zip’ saved [31516027/31516027]

Archive:  cli_linux.zip
  inflating: README                  
  inflating: cli.txt                 
  inflating: train                   
  inflating: show_model              
  inflating: show_dataspec           
  inflating: predict                 
  inflating: infer_dataspec          
  inflating: evaluate                
  inflating: convert_dataset         
  inflating: benchmark_inference     
  inflating: edit_model              
  inflating: synthetic_dataset       
  inflating: grpc_worker_main        
  inflating: LICENSE                 
  inflating: CHANGELOG.md

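Finally, let's make predictions:

Remarks

TensorFlow Decision Forests (TF-DF) is based on the Yggdrasil Decision Forests (YDF) library, and a TF-DF model always contains a YDF model internally. When a TF-DF model is saved to disk, the TF-DF model directory contains a sub-directory named assets that holds the YDF model. This YDF model can be used with all YDF tools. In the next example, we will use the predict and benchmark_inference tools. See the model format documentation for more details.
YDF tools assume that the type of the dataset is specified with a prefix, e.g. csv:. See the YDF user manual for more details.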

./predict --model=my_model/assets --dataset=csv:dataset.csv --output=csv:predictions.csv

[INFO abstract_model.cc:1296] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO predict.cc:133] Run predictions with semi-fast engine
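
The same command can be assembled from Python, which helps when the paths are computed at run time. This is only a sketch: build_predict_command is a hypothetical helper, and it uses only the three flags that appear in the command above.

```python
import shlex


def build_predict_command(model_dir, dataset_csv, output_csv, binary="./predict"):
    """Builds the argument list for the YDF `predict` tool (hypothetical helper)."""
    return [
        binary,
        f"--model={model_dir}",
        f"--dataset=csv:{dataset_csv}",  # "csv:" is the dataset type prefix.
        f"--output=csv:{output_csv}",
    ]


command = build_predict_command("my_model/assets", "dataset.csv", "predictions.csv")
print(shlex.join(command))
# The list can be passed to subprocess.run(command, check=True) once the
# CLI binaries have been downloaded and extracted as shown earlier.
```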

We can now look at the predictions:

pd.read_csv("predictions.csv")

The inference speed of a model can be measured with the benchmark inference tool.

# Create the empty label column.
pd_serving_dataset["__LABEL"] = 0
pd_serving_dataset.to_csv("dataset.csv")

!./benchmark_inference \
  --model=my_model/assets \
  --dataset=csv:dataset.csv \
  --batch_size=100 \
  --warmup_runs=10 \
  --num_runs=50

[INFO benchmark_inference.cc:245] Loading model
[INFO benchmark_inference.cc:248] The model is of type: GRADIENT_BOOSTED_TREES
[INFO benchmark_inference.cc:250] Loading dataset
[INFO benchmark_inference.cc:259] Found 3 compatible fast engines.
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesGeneric
[INFO decision_forest.cc:639] Model loaded with 49 root(s), 2661 node(s), and 2 input feature(s).
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesQuickScorerExtended
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesOptPred
[INFO decision_forest.cc:639] Model loaded with 49 root(s), 2661 node(s), and 2 input feature(s).
[INFO benchmark_inference.cc:268] Running the slow generic engine
batch_size : 100  num_runs : 50
time/example(us)  time/batch(us)  method
----------------------------------------
         0.44275          44.275  GradientBoostedTreesQuickScorerExtended [virtual interface]
         0.79825          79.825  GradientBoostedTreesOptPred [virtual interface]
           1.877           187.7  GradientBoostedTreesGeneric [virtual interface]
          4.4463          444.62  Generic slow engine
----------------------------------------

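In this benchmark, we see the inference speed of the different inference engines. For example, "time/example(us) = 0.44275" for GradientBoostedTreesQuickScorerExtended in the run above (the value can change between runs) means that inference on one example takes about 0.44 microseconds. That is, the model can process roughly 2.3 million examples per second.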
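
Converting a time per example into a throughput is a simple reciprocal. The sketch below applies it to the values in the "time/example(us)" column of the table above; examples_per_second is a hypothetical helper name, and the figures describe this particular run only.

```python
def examples_per_second(time_per_example_us):
    """Converts a per-example latency in microseconds to a throughput."""
    return 1_000_000 / time_per_example_us


# Values copied from the "time/example(us)" column of the table above.
timings_us = {
    "GradientBoostedTreesQuickScorerExtended": 0.44275,
    "GradientBoostedTreesOptPred": 0.79825,
    "GradientBoostedTreesGeneric": 1.877,
    "Generic slow engine": 4.4463,
}

for engine, time_us in timings_us.items():
    print(f"{engine}: {examples_per_second(time_us):,.0f} examples/s")
```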