Writing your own callbacks

Authors: Rick Chao, Francois Chollet

Introduction

Callbacks are a powerful tool to customize the behavior of a Keras model during training, evaluation, or inference. Examples include tf.keras.callbacks.TensorBoard, which visualizes training progress and results with TensorBoard, or tf.keras.callbacks.ModelCheckpoint, which periodically saves your model during training.

In this guide, you will learn what a Keras callback is, what it can do, and how you can build your own. We provide a few demos of simple callback applications to get you started.

Setup

import tensorflow as tf
from tensorflow import keras

Keras callbacks overview

All callbacks subclass the keras.callbacks.Callback class and override a set of methods that are called at various stages of training, testing, and predicting. Callbacks are useful for getting a view of the model's internal states and statistics during training.

You can pass a list of callbacks (as the keyword argument callbacks) to the following model methods: keras.Model.fit(), keras.Model.evaluate(), and keras.Model.predict().
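
For example, here is a minimal sketch of attaching two built-in callbacks to a training run. The tiny model, the log directory, and the checkpoint path are illustrative placeholders, and x_train/y_train are assumed to be prepared as in the example further below:

# Sketch only: the model and file paths are illustrative, not prescriptive.
model = keras.Sequential([keras.layers.Dense(1, input_dim=784)])
model.compile(optimizer="rmsprop", loss="mean_squared_error")

callbacks = [
    keras.callbacks.TensorBoard(log_dir="./logs"),  # visualize progress in TensorBoard
    keras.callbacks.ModelCheckpoint("ckpt_epoch_{epoch:02d}.h5"),  # save each epoch
]
model.fit(x_train, y_train, batch_size=128, epochs=3, callbacks=callbacks)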

An overview of callback methods

Global methods

on_(train|test|predict)_begin(self, logs=None)

Called at the beginning of fit/evaluate/predict.

on_(train|test|predict)_end(self, logs=None)

Called at the end of fit/evaluate/predict.

Batch-level methods for training/testing/predicting

on_(train|test|predict)_batch_begin(self, batch, logs=None)

Called right before processing a batch during training/testing/predicting.

on_(train|test|predict)_batch_end(self, batch, logs=None)

Called at the end of training/testing/predicting a batch. Within this method, logs is a dict containing the metrics results.

Epoch-level methods (training only)

on_epoch_begin(self, epoch, logs=None)

Called at the beginning of an epoch during training.

on_epoch_end(self, epoch, logs=None)

Called at the end of an epoch during training.

A basic example

Let's take a look at a concrete example. First, let's define a simple Sequential Keras model:

# Define the Keras model to add callbacks to
def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(1, input_dim=784))
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
        loss="mean_squared_error",
        metrics=["mean_absolute_error"],
    )
    return model

Then, load the MNIST data for training and testing from the Keras datasets API:

# Load example MNIST data and pre-process it
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Limit the data to 1000 samples
x_train = x_train[:1000]
y_train = y_train[:1000]
x_test = x_test[:1000]
y_test = y_test[:1000]

Now, define a simple custom callback that logs:

  • When fit/evaluate/predict starts & ends
  • When each epoch starts & ends
  • When each training batch starts & ends
  • When each evaluation (testing) batch starts & ends
  • When each inference (prediction) batch starts & ends

class CustomCallback(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        keys = list(logs.keys())
        print("Starting training; got log keys: {}".format(keys))

    def on_train_end(self, logs=None):
        keys = list(logs.keys())
        print("Stop training; got log keys: {}".format(keys))

    def on_epoch_begin(self, epoch, logs=None):
        keys = list(logs.keys())
        print("Start epoch {} of training; got log keys: {}".format(epoch, keys))

    def on_epoch_end(self, epoch, logs=None):
        keys = list(logs.keys())
        print("End epoch {} of training; got log keys: {}".format(epoch, keys))

    def on_test_begin(self, logs=None):
        keys = list(logs.keys())
        print("Start testing; got log keys: {}".format(keys))

    def on_test_end(self, logs=None):
        keys = list(logs.keys())
        print("Stop testing; got log keys: {}".format(keys))

    def on_predict_begin(self, logs=None):
        keys = list(logs.keys())
        print("Start predicting; got log keys: {}".format(keys))

    def on_predict_end(self, logs=None):
        keys = list(logs.keys())
        print("Stop predicting; got log keys: {}".format(keys))

    def on_train_batch_begin(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Training: start of batch {}; got log keys: {}".format(batch, keys))

    def on_train_batch_end(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Training: end of batch {}; got log keys: {}".format(batch, keys))

    def on_test_batch_begin(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Evaluating: start of batch {}; got log keys: {}".format(batch, keys))

    def on_test_batch_end(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Evaluating: end of batch {}; got log keys: {}".format(batch, keys))

    def on_predict_batch_begin(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Predicting: start of batch {}; got log keys: {}".format(batch, keys))

    def on_predict_batch_end(self, batch, logs=None):
        keys = list(logs.keys())
        print("...Predicting: end of batch {}; got log keys: {}".format(batch, keys))

Let's try it out:

model = get_model()
model.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=1,
    verbose=0,
    validation_split=0.5,
    callbacks=[CustomCallback()],
)

res = model.evaluate(
    x_test, y_test, batch_size=128, verbose=0, callbacks=[CustomCallback()]
)

res = model.predict(x_test, batch_size=128, callbacks=[CustomCallback()])
Starting training; got log keys: []
Start epoch 0 of training; got log keys: []
...Training: start of batch 0; got log keys: []
...Training: end of batch 0; got log keys: ['loss', 'mean_absolute_error']
...Training: start of batch 1; got log keys: []
...Training: end of batch 1; got log keys: ['loss', 'mean_absolute_error']
...Training: start of batch 2; got log keys: []
...Training: end of batch 2; got log keys: ['loss', 'mean_absolute_error']
...Training: start of batch 3; got log keys: []
...Training: end of batch 3; got log keys: ['loss', 'mean_absolute_error']
Start testing; got log keys: []
...Evaluating: start of batch 0; got log keys: []
...Evaluating: end of batch 0; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 1; got log keys: []
...Evaluating: end of batch 1; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 2; got log keys: []
...Evaluating: end of batch 2; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 3; got log keys: []
...Evaluating: end of batch 3; got log keys: ['loss', 'mean_absolute_error']
Stop testing; got log keys: ['loss', 'mean_absolute_error']
End epoch 0 of training; got log keys: ['loss', 'mean_absolute_error', 'val_loss', 'val_mean_absolute_error']
Stop training; got log keys: ['loss', 'mean_absolute_error', 'val_loss', 'val_mean_absolute_error']
Start testing; got log keys: []
...Evaluating: start of batch 0; got log keys: []
...Evaluating: end of batch 0; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 1; got log keys: []
...Evaluating: end of batch 1; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 2; got log keys: []
...Evaluating: end of batch 2; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 3; got log keys: []
...Evaluating: end of batch 3; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 4; got log keys: []
...Evaluating: end of batch 4; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 5; got log keys: []
...Evaluating: end of batch 5; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 6; got log keys: []
...Evaluating: end of batch 6; got log keys: ['loss', 'mean_absolute_error']
...Evaluating: start of batch 7; got log keys: []
...Evaluating: end of batch 7; got log keys: ['loss', 'mean_absolute_error']
Stop testing; got log keys: ['loss', 'mean_absolute_error']
Start predicting; got log keys: []
...Predicting: start of batch 0; got log keys: []
...Predicting: end of batch 0; got log keys: ['outputs']
...Predicting: start of batch 1; got log keys: []
...Predicting: end of batch 1; got log keys: ['outputs']
...Predicting: start of batch 2; got log keys: []
...Predicting: end of batch 2; got log keys: ['outputs']
...Predicting: start of batch 3; got log keys: []
...Predicting: end of batch 3; got log keys: ['outputs']
...Predicting: start of batch 4; got log keys: []
...Predicting: end of batch 4; got log keys: ['outputs']
...Predicting: start of batch 5; got log keys: []
...Predicting: end of batch 5; got log keys: ['outputs']
...Predicting: start of batch 6; got log keys: []
...Predicting: end of batch 6; got log keys: ['outputs']
...Predicting: start of batch 7; got log keys: []
...Predicting: end of batch 7; got log keys: ['outputs']
Stop predicting; got log keys: []

Usage of the logs dict

The logs dict contains the loss value, and all the metrics at the end of a batch or epoch. In this example, that includes the loss and the mean absolute error.

class LossAndErrorPrintingCallback(keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        print(
            "Up to batch {}, the average loss is {:7.2f}.".format(batch, logs["loss"])
        )

    def on_test_batch_end(self, batch, logs=None):
        print(
            "Up to batch {}, the average loss is {:7.2f}.".format(batch, logs["loss"])
        )

    def on_epoch_end(self, epoch, logs=None):
        print(
            "The average loss for epoch {} is {:7.2f} "
            "and mean absolute error is {:7.2f}.".format(
                epoch, logs["loss"], logs["mean_absolute_error"]
            )
        )


model = get_model()
model.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=2,
    verbose=0,
    callbacks=[LossAndErrorPrintingCallback()],
)

res = model.evaluate(
    x_test,
    y_test,
    batch_size=128,
    verbose=0,
    callbacks=[LossAndErrorPrintingCallback()],
)
Up to batch 0, the average loss is   33.08.
Up to batch 1, the average loss is  429.70.
Up to batch 2, the average loss is  293.82.
Up to batch 3, the average loss is  222.52.
Up to batch 4, the average loss is  179.47.
Up to batch 5, the average loss is  150.49.
Up to batch 6, the average loss is  129.87.
Up to batch 7, the average loss is  116.92.
The average loss for epoch 0 is  116.92 and mean absolute error is    5.88.
Up to batch 0, the average loss is    5.29.
Up to batch 1, the average loss is    4.86.
Up to batch 2, the average loss is    4.66.
Up to batch 3, the average loss is    4.54.
Up to batch 4, the average loss is    4.50.
Up to batch 5, the average loss is    4.38.
Up to batch 6, the average loss is    4.39.
Up to batch 7, the average loss is    4.33.
The average loss for epoch 1 is    4.33 and mean absolute error is    1.68.
Up to batch 0, the average loss is    5.21.
Up to batch 1, the average loss is    4.73.
Up to batch 2, the average loss is    4.68.
Up to batch 3, the average loss is    4.57.
Up to batch 4, the average loss is    4.70.
Up to batch 5, the average loss is    4.71.
Up to batch 6, the average loss is    4.63.
Up to batch 7, the average loss is    4.56.

Usage of the self.model attribute

In addition to receiving log information when one of their methods is called, callbacks have access to the model associated with the current round of training/evaluation/inference: self.model.

Here are a few of the things you can do with self.model in a callback:

  • Set self.model.stop_training = True to immediately interrupt training.
  • Mutate hyperparameters of the optimizer (available as self.model.optimizer), such as self.model.optimizer.learning_rate.
  • Save the model at regular intervals (a sketch follows this list).
  • Record the output of model.predict() on a few test samples at the end of each epoch, to use as a sanity check during training.
  • Extract visualizations of intermediate features at the end of each epoch, to monitor what the model is learning over time.
  • etc.
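
As a sketch of the model-saving use case above, a callback along these lines could save the model every few epochs. The interval and path template are arbitrary, illustrative choices:

class PeriodicSaver(keras.callbacks.Callback):
    """Illustrative sketch: save the full model every `every` epochs."""

    def __init__(self, every=5, path_template="model_at_epoch_{epoch}.h5"):
        super().__init__()
        self.every = every
        self.path_template = path_template

    def on_epoch_end(self, epoch, logs=None):
        # `self.model` is the model currently being trained.
        if (epoch + 1) % self.every == 0:
            self.model.save(self.path_template.format(epoch=epoch + 1))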

Let's see this in action in a couple of examples.

Examples of Keras callback applications

Early stopping at minimum loss

This first example shows the creation of a Callback that stops training when the minimum of the loss has been reached, by setting the attribute self.model.stop_training (boolean). Optionally, you can provide an argument patience to specify how many epochs we should wait before stopping after having reached a local minimum.

tf.keras.callbacks.EarlyStopping provides a more complete and general implementation.
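
For comparison, roughly equivalent behavior with the built-in callback can be configured like this (the monitored quantity and patience value here are illustrative):

# Built-in equivalent (illustrative settings): stop when the training loss
# stops improving, wait `patience` epochs, and restore the best weights.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="loss", patience=2, restore_best_weights=True
)
# model.fit(..., callbacks=[early_stopping])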

import numpy as np


class EarlyStoppingAtMinLoss(keras.callbacks.Callback):
    """Stop training when the loss is at its min, i.e. the loss stops decreasing.

    Arguments:
        patience: Number of epochs to wait after min has been hit. After this
            number of epochs with no improvement, training stops.
    """

    def __init__(self, patience=0):
        super().__init__()
        self.patience = patience
        # best_weights to store the weights at which the minimum loss occurs.
        self.best_weights = None

    def on_train_begin(self, logs=None):
        # The number of epochs it has waited while the loss is no longer minimum.
        self.wait = 0
        # The epoch the training stops at.
        self.stopped_epoch = 0
        # Initialize the best as infinity.
        self.best = np.inf

    def on_epoch_end(self, epoch, logs=None):
        current = logs.get("loss")
        if np.less(current, self.best):
            self.best = current
            self.wait = 0
            # Record the best weights if the current result is better (lower loss).
            self.best_weights = self.model.get_weights()
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.stopped_epoch = epoch
                self.model.stop_training = True
                print("Restoring model weights from the end of the best epoch.")
                self.model.set_weights(self.best_weights)

    def on_train_end(self, logs=None):
        if self.stopped_epoch > 0:
            print("Epoch %05d: early stopping" % (self.stopped_epoch + 1))


model = get_model()
model.fit(
    x_train,
    y_train,
    batch_size=64,
    steps_per_epoch=5,
    epochs=30,
    verbose=0,
    callbacks=[LossAndErrorPrintingCallback(), EarlyStoppingAtMinLoss()],
)
Up to batch 0, the average loss is   23.53.
Up to batch 1, the average loss is  480.92.
Up to batch 2, the average loss is  328.49.
Up to batch 3, the average loss is  248.52.
Up to batch 4, the average loss is  200.53.
The average loss for epoch 0 is  200.53 and mean absolute error is    8.30.
Up to batch 0, the average loss is    5.02.
Up to batch 1, the average loss is    5.80.
Up to batch 2, the average loss is    5.51.
Up to batch 3, the average loss is    5.38.
Up to batch 4, the average loss is    5.42.
The average loss for epoch 1 is    5.42 and mean absolute error is    1.90.
Up to batch 0, the average loss is    5.80.
Up to batch 1, the average loss is    6.89.
Up to batch 2, the average loss is    6.68.
Up to batch 3, the average loss is    6.35.
Up to batch 4, the average loss is    6.57.
The average loss for epoch 2 is    6.57 and mean absolute error is    2.07.
Restoring model weights from the end of the best epoch.
Epoch 00003: early stopping
<keras.src.callbacks.History at 0x7fd3802cbb80>

Learning rate scheduling

In this example, we show how a custom Callback can be used to dynamically change the learning rate of the optimizer during the course of training.

See callbacks.LearningRateScheduler for a more general implementation.
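
For reference, the built-in callback accepts the same kind of schedule function: one that takes the epoch index and the current learning rate, and returns the new learning rate. The halving schedule below is just an illustration:

# Sketch: a schedule function for the built-in LearningRateScheduler.
def halve_every_10_epochs(epoch, lr):
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

# model.fit(..., callbacks=[keras.callbacks.LearningRateScheduler(halve_every_10_epochs)])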

class CustomLearningRateScheduler(keras.callbacks.Callback):
    """Learning rate scheduler which sets the learning rate according to schedule.

    Arguments:
        schedule: a function that takes an epoch index
            (integer, indexed from 0) and current learning rate
            as inputs and returns a new learning rate as output (float).
    """

    def __init__(self, schedule):
        super().__init__()
        self.schedule = schedule

    def on_epoch_begin(self, epoch, logs=None):
        if not hasattr(self.model.optimizer, "learning_rate"):
            raise ValueError('Optimizer must have a "learning_rate" attribute.')
        # Get the current learning rate from the model's optimizer.
        lr = float(tf.keras.backend.get_value(self.model.optimizer.learning_rate))
        # Call the schedule function to get the scheduled learning rate.
        scheduled_lr = self.schedule(epoch, lr)
        # Set the value back to the optimizer before this epoch starts.
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, scheduled_lr)
        print("\nEpoch %05d: Learning rate is %6.4f." % (epoch, scheduled_lr))


LR_SCHEDULE = [
    # (epoch to start, learning rate) tuples
    (3, 0.05),
    (6, 0.01),
    (9, 0.005),
    (12, 0.001),
]


def lr_schedule(epoch, lr):
    """Helper function to retrieve the scheduled learning rate based on epoch."""
    if epoch < LR_SCHEDULE[0][0] or epoch > LR_SCHEDULE[-1][0]:
        return lr
    for i in range(len(LR_SCHEDULE)):
        if epoch == LR_SCHEDULE[i][0]:
            return LR_SCHEDULE[i][1]
    return lr


model = get_model()
model.fit(
    x_train,
    y_train,
    batch_size=64,
    steps_per_epoch=5,
    epochs=15,
    verbose=0,
    callbacks=[
        LossAndErrorPrintingCallback(),
        CustomLearningRateScheduler(lr_schedule),
    ],
)
Epoch 00000: Learning rate is 0.1000.
Up to batch 0, the average loss is   25.33.
Up to batch 1, the average loss is  434.31.
Up to batch 2, the average loss is  298.47.
Up to batch 3, the average loss is  226.43.
Up to batch 4, the average loss is  182.22.
The average loss for epoch 0 is  182.22 and mean absolute error is    8.09.

Epoch 00001: Learning rate is 0.1000.
Up to batch 0, the average loss is    5.55.
Up to batch 1, the average loss is    5.56.
Up to batch 2, the average loss is    6.20.
Up to batch 3, the average loss is    6.24.
Up to batch 4, the average loss is    6.34.
The average loss for epoch 1 is    6.34 and mean absolute error is    2.09.

Epoch 00002: Learning rate is 0.1000.
Up to batch 0, the average loss is    7.28.
Up to batch 1, the average loss is    7.82.
Up to batch 2, the average loss is    7.52.
Up to batch 3, the average loss is    7.33.
Up to batch 4, the average loss is    7.52.
The average loss for epoch 2 is    7.52 and mean absolute error is    2.27.

Epoch 00003: Learning rate is 0.0500.
Up to batch 0, the average loss is   10.56.
Up to batch 1, the average loss is    7.01.
Up to batch 2, the average loss is    6.36.
Up to batch 3, the average loss is    6.18.
Up to batch 4, the average loss is    5.55.
The average loss for epoch 3 is    5.55 and mean absolute error is    1.90.

Epoch 00004: Learning rate is 0.0500.
Up to batch 0, the average loss is    3.26.
Up to batch 1, the average loss is    3.70.
Up to batch 2, the average loss is    3.75.
Up to batch 3, the average loss is    3.73.
Up to batch 4, the average loss is    3.79.
The average loss for epoch 4 is    3.79 and mean absolute error is    1.56.

Epoch 00005: Learning rate is 0.0500.
Up to batch 0, the average loss is    5.90.
Up to batch 1, the average loss is    5.09.
Up to batch 2, the average loss is    4.59.
Up to batch 3, the average loss is    4.39.
Up to batch 4, the average loss is    4.50.
The average loss for epoch 5 is    4.50 and mean absolute error is    1.66.

Epoch 00006: Learning rate is 0.0100.
Up to batch 0, the average loss is    6.34.
Up to batch 1, the average loss is    6.46.
Up to batch 2, the average loss is    5.29.
Up to batch 3, the average loss is    4.89.
Up to batch 4, the average loss is    4.68.
The average loss for epoch 6 is    4.68 and mean absolute error is    1.74.

Epoch 00007: Learning rate is 0.0100.
Up to batch 0, the average loss is    3.67.
Up to batch 1, the average loss is    3.06.
Up to batch 2, the average loss is    3.25.
Up to batch 3, the average loss is    3.45.
Up to batch 4, the average loss is    3.34.
The average loss for epoch 7 is    3.34 and mean absolute error is    1.43.

Epoch 00008: Learning rate is 0.0100.
Up to batch 0, the average loss is    3.35.
Up to batch 1, the average loss is    3.74.
Up to batch 2, the average loss is    3.50.
Up to batch 3, the average loss is    3.38.
Up to batch 4, the average loss is    3.58.
The average loss for epoch 8 is    3.58 and mean absolute error is    1.52.

Epoch 00009: Learning rate is 0.0050.
Up to batch 0, the average loss is    2.08.
Up to batch 1, the average loss is    2.52.
Up to batch 2, the average loss is    2.76.
Up to batch 3, the average loss is    2.72.
Up to batch 4, the average loss is    2.85.
The average loss for epoch 9 is    2.85 and mean absolute error is    1.31.

Epoch 00010: Learning rate is 0.0050.
Up to batch 0, the average loss is    3.64.
Up to batch 1, the average loss is    3.39.
Up to batch 2, the average loss is    3.42.
Up to batch 3, the average loss is    3.83.
Up to batch 4, the average loss is    3.85.
The average loss for epoch 10 is    3.85 and mean absolute error is    1.56.

Epoch 00011: Learning rate is 0.0050.
Up to batch 0, the average loss is    3.33.
Up to batch 1, the average loss is    3.18.
Up to batch 2, the average loss is    2.98.
Up to batch 3, the average loss is    3.02.
Up to batch 4, the average loss is    2.85.
The average loss for epoch 11 is    2.85 and mean absolute error is    1.31.

Epoch 00012: Learning rate is 0.0010.
Up to batch 0, the average loss is    3.58.
Up to batch 1, the average loss is    3.22.
Up to batch 2, the average loss is    3.27.
Up to batch 3, the average loss is    3.24.
Up to batch 4, the average loss is    3.02.
The average loss for epoch 12 is    3.02 and mean absolute error is    1.32.

Epoch 00013: Learning rate is 0.0010.
Up to batch 0, the average loss is    3.37.
Up to batch 1, the average loss is    3.55.
Up to batch 2, the average loss is    3.31.
Up to batch 3, the average loss is    3.28.
Up to batch 4, the average loss is    3.27.
The average loss for epoch 13 is    3.27 and mean absolute error is    1.43.

Epoch 00014: Learning rate is 0.0010.
Up to batch 0, the average loss is    2.02.
Up to batch 1, the average loss is    2.66.
Up to batch 2, the average loss is    2.61.
Up to batch 3, the average loss is    2.56.
Up to batch 4, the average loss is    2.82.
The average loss for epoch 14 is    2.82 and mean absolute error is    1.27.
<keras.src.callbacks.History at 0x7fd3801da790>

Built-in Keras callbacks

Be sure to check out the existing Keras callbacks by reading the API docs. Applications include logging to CSV, saving the model, visualizing metrics in TensorBoard, and a lot more!
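
As a rough sketch, a run that combines a few of these built-ins might look like the following; the file paths and settings are illustrative placeholders:

# Illustrative combination of built-in callbacks; paths are placeholders.
callbacks = [
    keras.callbacks.CSVLogger("training_log.csv"),  # append per-epoch metrics to a CSV file
    keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),  # keep only the best model
    keras.callbacks.TensorBoard(log_dir="./logs"),  # write logs for TensorBoard
]
# model.fit(x_train, y_train, epochs=10, validation_split=0.2, callbacks=callbacks)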