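This tutorial shows you how to train a machine learning model with a custom training loop to classify penguins by species. In this notebook, you use TensorFlow to accomplish the following:
- Import a dataset
- Build a simple linear model
- Train the model
- Evaluate the model's effectiveness
- Use the trained model to make predictions
TensorFlow programming
This tutorial demonstrates the following TensorFlow programming tasks:
- Importing data with the TensorFlow Datasets API
- Building models and layers with the Keras API
Penguin classification problem
Imagine you are an ornithologist seeking an automated way to categorize each penguin you find. Machine learning provides many algorithms to classify penguins statistically. For instance, a sophisticated machine learning program could classify penguins based on photographs. The model you build in this tutorial is a little simpler. It classifies penguins based on their body mass, flipper length, and beaks, specifically the length and depth measurements of the culmen (the upper ridge of the beak).
There are 18 species of penguins, but in this tutorial you will only attempt to classify the following three:
- Chinstrap penguins
- Gentoo penguins
- Adélie penguins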
Figure 1. Chinstrap, Gentoo, and Adélie penguins (artwork by @allison_horst, CC BY-SA 2.0).
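Fortunately, a research team has already created and shared a dataset of 334 penguins with body mass, flipper length, beak measurements, and other data. This dataset is also conveniently available as the penguins TensorFlow Dataset.
Setup
Install the tfds-nightly package for the penguins dataset. The tfds-nightly package is the nightly released version of TensorFlow Datasets (TFDS). For more information on TFDS, see the TensorFlow Datasets overview.
pip install -q tfds-nightly
Then select Runtime > Restart Runtime from the Colab menu to restart the Colab runtime.
Do not proceed with the rest of this tutorial without first restarting the runtime.
Import TensorFlow and the other required Python modules.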
import os
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
print("TensorFlow version: {}".format(tf.__version__))
print("TensorFlow Datasets version: ",tfds.__version__)
TensorFlow version: 2.17.0
TensorFlow Datasets version: 4.9.3+nightly
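Import the dataset
The default penguins/processed TensorFlow Dataset is already cleaned, normalized, and ready for building a model. Before you download the processed data, preview a simplified version to get familiar with the original penguin survey data.
Preview the data
Download the simplified version of the penguins dataset (penguins/simple) using the TensorFlow Datasets tfds.load method. There are 344 data records in this dataset. Extract the first five records into a DataFrame object to inspect a sample of the values in this dataset: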
ds_preview, info = tfds.load('penguins/simple', split='train', with_info=True)
df = tfds.as_dataframe(ds_preview.take(5), info)
print(df)
print(info.features)
body_mass_g culmen_depth_mm culmen_length_mm flipper_length_mm island \
0 4200.0 13.9 45.500000 210.0 0
1 4650.0 13.7 40.900002 214.0 0
2 5300.0 14.2 51.299999 218.0 0
3 5650.0 15.0 47.799999 215.0 0
4 5050.0 15.8 46.299999 215.0 0
sex species
0 0 2
1 0 2
2 1 2
3 1 2
4 1 2
FeaturesDict({
'body_mass_g': float32,
'culmen_depth_mm': float32,
'culmen_length_mm': float32,
'flipper_length_mm': float32,
'island': ClassLabel(shape=(), dtype=int64, num_classes=3),
'sex': ClassLabel(shape=(), dtype=int64, num_classes=3),
'species': ClassLabel(shape=(), dtype=int64, num_classes=3),
})
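The numbered rows are data records, one example per row, where:
- The first six fields are features: these are the characteristics of an example. Here, the fields hold numbers representing penguin measurements, plus the encoded island and sex.
- The last column is the label: this is the value you want to predict. For this dataset, it is an integer value of 0, 1, or 2 that corresponds to a penguin species name.
In the dataset, the label for the penguin species is represented as a number to make it easier to work with in the model you are building. These numbers correspond to the following penguin species:
- 0: Adélie penguin
- 1: Chinstrap penguin
- 2: Gentoo penguin
Create a list containing the penguin species names in this order. You will use this list to interpret the output of the classification model: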
class_names = ['Adélie', 'Chinstrap', 'Gentoo']
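For more information about features and labels, refer to the ML Terminology section of the Machine Learning Crash Course.
Download the preprocessed dataset
Now, download the preprocessed penguins dataset (penguins/processed) with the tfds.load method, which returns a list of tf.data.Dataset objects. Note that the penguins/processed dataset doesn't come with its own test set, so use an 80:20 split to slice the full dataset into training and test sets: in the code below, the first 20% of the data becomes the test set and the remaining 80% becomes the training set. You will use the test dataset later to verify your model.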
ds_split, info = tfds.load("penguins/processed", split=['train[:20%]', 'train[20%:]'], as_supervised=True, with_info=True)
ds_test = ds_split[0]
ds_train = ds_split[1]
assert isinstance(ds_test, tf.data.Dataset)
print(info.features)
df_test = tfds.as_dataframe(ds_test.take(5), info)
print("Test dataset sample: ")
print(df_test)
df_train = tfds.as_dataframe(ds_train.take(5), info)
print("Train dataset sample: ")
print(df_train)
ds_train_batch = ds_train.batch(32)
FeaturesDict({
'features': Tensor(shape=(4,), dtype=float32),
'species': ClassLabel(shape=(), dtype=int64, num_classes=3),
})
Test dataset sample:
features species
0 [0.6545454, 0.22619048, 0.89830506, 0.6388889] 2
1 [0.36, 0.04761905, 0.6440678, 0.4027778] 2
2 [0.68, 0.30952382, 0.91525424, 0.6944444] 2
3 [0.6181818, 0.20238096, 0.8135593, 0.6805556] 2
4 [0.5527273, 0.26190478, 0.84745765, 0.7083333] 2
Train dataset sample:
features species
0 [0.49818182, 0.6904762, 0.42372882, 0.4027778] 0
1 [0.48, 0.071428575, 0.6440678, 0.44444445] 2
2 [0.7236364, 0.9047619, 0.6440678, 0.5833333] 1
3 [0.34545454, 0.5833333, 0.33898306, 0.3472222] 0
4 [0.10909091, 0.75, 0.3559322, 0.41666666] 0
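Notice that this version of the dataset has been processed by reducing the data down to four normalized features and a species label. In this format, the data can be used to train a model without further processing.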
features, labels = next(iter(ds_train_batch))
print(features)
print(labels)
tf.Tensor(
[[0.49818182 0.6904762  0.42372882 0.4027778 ]
 [0.48       0.07142857 0.6440678  0.44444445]
 [0.7236364  0.9047619  0.6440678  0.5833333 ]
 [0.34545454 0.5833333  0.33898306 0.3472222 ]
 [0.10909091 0.75       0.3559322  0.41666666]
 [0.6690909  0.63095236 0.47457626 0.19444445]
 [0.8036364  0.9166667  0.4915254  0.44444445]
 [0.4909091  0.75       0.37288135 0.22916667]
 [0.33454546 0.85714287 0.37288135 0.2361111 ]
 [0.32       0.41666666 0.2542373  0.1388889 ]
 [0.41454545 0.5952381  0.5084746  0.19444445]
 [0.14909092 0.48809522 0.2542373  0.125     ]
 [0.23636363 0.4642857  0.27118644 0.05555556]
 [0.22181818 0.5952381  0.22033899 0.3472222 ]
 [0.24727273 0.5595238  0.15254237 0.25694445]
 [0.63272727 0.35714287 0.88135594 0.8194444 ]
 [0.47272727 0.15476191 0.6440678  0.4722222 ]
 [0.6036364  0.23809524 0.84745765 0.7361111 ]
 [0.26909092 0.5595238  0.27118644 0.16666667]
 [0.28       0.71428573 0.20338982 0.5416667 ]
 [0.10545454 0.5714286  0.33898306 0.2847222 ]
 [0.18545455 0.5952381  0.10169491 0.33333334]
 [0.47272727 0.16666667 0.7288136  0.6388889 ]
 [0.45090908 0.1904762  0.7118644  0.5972222 ]
 [0.49454546 0.5        0.3559322  0.25      ]
 [0.6363636  0.22619048 0.7457627  0.5694444 ]
 [0.08727273 0.5952381  0.2542373  0.05555556]
 [0.52       0.22619048 0.7457627  0.5555556 ]
 [0.5090909  0.23809524 0.7288136  0.6666667 ]
 [0.56       0.22619048 0.779661   0.625     ]
 [0.6363636  0.3452381  0.89830506 0.8333333 ]
 [0.15636364 0.47619048 0.20338982 0.04166667]], shape=(32, 4), dtype=float32)
tf.Tensor([0 2 1 0 0 1 1 1 0 1 1 0 0 0 0 2 2 2 0 0 0 0 2 2 1 2 0 2 2 2 2 0], shape=(32,), dtype=int64)
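The 32 rows printed above are one batch produced by ds_train.batch(32). As a rough, library-free sketch of what batching does, consecutive examples are grouped into fixed-size chunks, and the final chunk may be smaller. The helper name batch and the 100-item list are hypothetical stand-ins chosen for illustration; they are not part of TensorFlow.

```python
def batch(items, batch_size):
    """Yield consecutive batches of items; the last batch may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in for a dataset of 100 examples.
examples = list(range(100))

batch_sizes = [len(b) for b in batch(examples, 32)]
print(batch_sizes)  # [32, 32, 32, 4]
```

The smaller final batch is why per-batch statistics are usually averaged with care: a batch of 4 carries less information than a batch of 32.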
You can visualize some clusters by plotting a few features from the batch:
plt.scatter(features[:,0],
            features[:,2],
            c=labels,
            cmap='viridis')
plt.xlabel("Body Mass")
plt.ylabel("Culmen Length")
plt.show()

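Build a simple linear model
Why model?
A model is a relationship between features and the label. For the penguin classification problem, the model defines the relationship between the body mass, flipper, and culmen measurements and the predicted penguin species. Some simple models can be described with a few lines of algebra, but complex machine learning models have a large number of parameters that are difficult to summarize.
Could you determine the relationship between the four features and the penguin species without using machine learning? That is, could you use traditional programming techniques (for example, a lot of conditional statements) to create a model? Perhaps, if you analyzed the dataset long enough to determine how body mass and culmen measurements relate to a particular species. But this becomes difficult, maybe impossible, on more complicated datasets. A good machine learning approach determines the model for you. If you feed enough representative examples into the right machine learning model type, the program figures out the relationships for you.
Select the model
Next you need to select the kind of model to train. There are many types of models, and picking a good one takes experience. This tutorial uses a neural network to solve the penguin classification problem. Neural networks can find complex relationships between features and the label. A neural network is a highly structured graph, organized into one or more hidden layers. Each hidden layer consists of one or more neurons. There are several categories of neural networks, and this program uses a dense, or fully connected, neural network: the neurons in one layer receive input connections from every neuron in the previous layer. For example, Figure 2 illustrates a dense neural network consisting of an input layer, two hidden layers, and an output layer: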
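Figure 2. A neural network with features, hidden layers, and predictions.
When you train the model from Figure 2 and feed it an unlabeled example, it yields three predictions: the likelihood that the penguin belongs to each of the three species. This prediction is called inference. For this example, the output predictions sum to 1.0. In Figure 2, this prediction breaks down as 0.02 for Adélie, 0.95 for Chinstrap, and 0.03 for Gentoo. This means that the model predicts, with 95% probability, that the unlabeled example is a Chinstrap penguin.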
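Create a model using Keras
The TensorFlow tf.keras API is the preferred way to create models and layers. It makes it easy to build models and experiment, while Keras handles the complexity of connecting everything together.
The tf.keras.Sequential model is a linear stack of layers. Its constructor takes a list of layer instances, in this case two tf.keras.layers.Dense layers with 10 nodes each and an output layer with 3 nodes representing your label predictions. The first layer's input_shape parameter corresponds to the number of features in the dataset and is required. (Recent Keras releases emit a warning for this argument and suggest an Input(shape) object as the first layer instead, as the output below shows.)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(4,)),  # input shape required
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(3)
])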
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(activity_regularizer=activity_regularizer, **kwargs)
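The activation function determines the output of each node in the layer. These non-linearities are important: without them, the stacked layers would be equivalent to a single layer. There are many tf.keras.activations, but ReLU is common for hidden layers.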
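To see why a stack of layers without activations collapses into a single layer, consider this minimal pure-Python sketch. It assumes two bias-free 2x2 layers with made-up weights (W1, W2, and the input x are hypothetical values, not taken from the tutorial's model): applying both layers in sequence gives the same result as applying the single matrix W2·W1, while inserting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, vi) for vi in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]   # hypothetical first-layer weights
W2 = [[2.0, 1.0], [-1.0, 3.0]]   # hypothetical second-layer weights
x = [1.0, 2.0]                   # hypothetical input

two_linear_layers = matvec(W2, matvec(W1, x))   # layer 2 applied after layer 1
one_merged_layer = matvec(matmul(W2, W1), x)    # single layer with weights W2·W1
with_relu = matvec(W2, relu(matvec(W1, x)))     # ReLU between the two layers

print(two_linear_layers, one_merged_layer, with_relu)
```

The first two results are identical, so the extra linear layer adds no expressive power; only the version with the ReLU computes something a single linear layer cannot.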
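The ideal number of hidden layers and neurons depends on the problem and the dataset. Like many aspects of machine learning, picking the best shape of the neural network requires a mixture of knowledge and experimentation. As a rule of thumb, increasing the number of hidden layers and neurons typically creates a more powerful model, which requires more data to train effectively.
Use the model
Let's have a quick look at what this model does to a batch of features: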
predictions = model(features)
predictions[:5]
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[-0.01486555, -0.2774356 , -0.4446883 ],
[-0.01989127, -0.18148412, -0.23701203],
[-0.01520463, -0.38729626, -0.6065093 ],
[-0.01831748, -0.22101548, -0.35036355],
[-0.05447623, -0.24796928, -0.36515424]], dtype=float32)>
Here, each example returns a logit for each class.
To convert these logits to a probability for each class, use the softmax function:
tf.nn.softmax(predictions[:5])
<tf.Tensor: shape=(5, 3), dtype=float32, numpy=
array([[0.41327488, 0.31783837, 0.26888674],
[0.37655985, 0.32037243, 0.3030677 ],
[0.44585222, 0.30732197, 0.24682581],
[0.39463627, 0.32223028, 0.2831335 ],
[0.39107943, 0.322279 , 0.2866416 ]], dtype=float32)>
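As a sanity check on what softmax computes, here is a small pure-Python sketch (the function name softmax is a local helper, not the TensorFlow op). It exponentiates each logit and divides by the sum of the exponentials; subtracting the maximum first is a common numerical-stability trick and does not change the result. The input is the first row of logits printed earlier.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the logits printed by model(features) above.
logits = [-0.01486555, -0.2774356, -0.4446883]
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

The rounded values should agree with the first row of the tf.nn.softmax output above to about four decimal places.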
Taking the tf.math.argmax across classes gives the predicted class index. But the model hasn't been trained yet, so these aren't good predictions:
print("Prediction: {}".format(tf.math.argmax(predictions, axis=1)))
print(" Labels: {}".format(labels))
Prediction: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Labels: [0 2 1 0 0 1 1 1 0 1 1 0 0 0 0 2 2 2 0 0 0 0 2 2 1 2 0 2 2 2 2 0]
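For intuition, argmax simply returns the position of the largest value in each row. A minimal pure-Python sketch follows; argmax here is a local helper rather than tf.math.argmax, and the rows are the first three logit rows printed earlier.

```python
def argmax(values):
    """Return the index of the largest element in values."""
    return max(range(len(values)), key=lambda i: values[i])


logit_rows = [
    [-0.01486555, -0.2774356, -0.4446883],
    [-0.01989127, -0.18148412, -0.23701203],
    [-0.01520463, -0.38729626, -0.6065093],
]

predicted = [argmax(row) for row in logit_rows]
print(predicted)  # [0, 0, 0]
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits or of the probabilities gives the same class index.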
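Train the model
Training is the stage of machine learning when the model is gradually optimized, that is, when the model learns the dataset. The goal is to learn enough about the structure of the training dataset to make predictions about unseen data. If the model learns too much about the training dataset, its predictions only work for the data it has seen and will not generalize. This problem is called overfitting: it is like memorizing the answers instead of understanding how to solve a problem.
The penguin classification problem is an example of supervised machine learning: the model is trained from examples that contain labels. In unsupervised machine learning, the examples don't contain labels. Instead, the model typically finds patterns among the features.
Define the loss and gradients function
Both the training and evaluation stages need to calculate the model's loss. This measures how far the model's predictions are from the desired labels, in other words, how badly the model is performing. You want to minimize, or optimize, this value.
Your model will calculate its loss using the tf.keras.losses.SparseCategoricalCrossentropy function, which takes the model's predictions and the desired labels and returns the average loss across the examples. Because the loss object below is created with from_logits=True, the predictions it receives are the raw logits rather than class probabilities.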
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
def loss(model, x, y, training):
    # training=training is needed only if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    y_ = model(x, training=training)
    return loss_object(y_true=y, y_pred=y_)
l = loss(model, features, labels, training=False)
print("Loss test: {}".format(l))
Loss test: 1.0923010110855103
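To make the loss value less opaque, here is a pure-Python sketch of sparse categorical cross-entropy computed from logits. It is an illustration under the assumption that the loss is the mean of -log(softmax(logits)[label]) over the batch; the function name and the toy inputs are hypothetical, and this is not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(logits_batch, labels):
    """Mean cross-entropy over a batch, computed from raw logits."""
    losses = []
    for logits, label in zip(logits_batch, labels):
        shift = max(logits)  # stabilize the log-sum-exp
        log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
        losses.append(log_sum_exp - logits[label])  # equals -log softmax(logits)[label]
    return sum(losses) / len(losses)


# Equal logits mean the model has no preference among the 3 classes.
uniform_loss = sparse_categorical_crossentropy([[0.0, 0.0, 0.0]], [1])
# A large logit on the correct class drives the loss toward zero.
confident_loss = sparse_categorical_crossentropy([[0.0, 10.0, 0.0]], [1])

print(uniform_loss, confident_loss)
```

With three classes, a model that assigns equal probability to each class has a loss of ln 3, roughly 1.0986. The untrained model's loss of about 1.092 is close to that value, which is consistent with its nearly uniform softmax outputs shown earlier.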
Use the tf.GradientTape context to calculate the gradients used to optimize your model:
def grad(model, inputs, targets):
    with tf.GradientTape() as tape:
        loss_value = loss(model, inputs, targets, training=True)
    return loss_value, tape.gradient(loss_value, model.trainable_variables)
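A gradient is the rate at which the loss changes as each parameter changes. tf.GradientTape obtains it by automatic differentiation; the sketch below uses a different and much cruder technique, a central finite difference, purely to illustrate what the number means. The one-parameter loss_fn and the helper numerical_gradient are hypothetical examples.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of df/dw at w (not automatic differentiation)."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


def loss_fn(w):
    """Hypothetical one-parameter loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


# The analytic gradient at w = 5 is 2 * (5 - 3) = 4.
g = numerical_gradient(loss_fn, 5.0)
print(g)
```

The positive gradient says that increasing w increases the loss, so an optimizer would move w in the opposite direction, toward 3.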
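Create an optimizer
An optimizer applies the computed gradients to the model's parameters to minimize the loss function. You can think of the loss function as a curved surface (refer to Figure 3) on which you want to find the lowest point by walking around. The gradients point in the direction of steepest ascent, so you travel the opposite way and move down the hill. By iteratively calculating the loss and gradient for each batch, you adjust the model during training. Gradually, the model finds the best combination of weights and biases to minimize the loss. The lower the loss, the better the model's predictions.
Figure 3. Optimization algorithms visualized over time in 3D space. (Source: Stanford class CS231n, MIT License, Image credit: Alec Radford)
TensorFlow has many optimization algorithms available for training. In this tutorial, you will use tf.keras.optimizers.SGD, which implements the stochastic gradient descent (SGD) algorithm. The learning_rate parameter sets the step size to take for each iteration down the hill. This rate is a hyperparameter that you commonly adjust to achieve better results.
Instantiate the optimizer with a learning rate of 0.01, a scalar value that is multiplied by the gradient at each iteration of the training: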
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
Then use this object to calculate a single optimization step:
loss_value, grads = grad(model, features, labels)
print("Step: {}, Initial Loss: {}".format(optimizer.iterations.numpy(),
                                          loss_value.numpy()))
optimizer.apply_gradients(zip(grads, model.trainable_variables))
print("Step: {}, Loss: {}".format(optimizer.iterations.numpy(),
                                  loss(model, features, labels, training=True).numpy()))
Step: 0, Initial Loss: 1.0923010110855103
Step: 1, Loss: 1.0911823511123657
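The update performed by a plain gradient descent step can be sketched in a few lines of pure Python: each parameter moves against its gradient, scaled by the learning rate. This is a simplified illustration on a hypothetical one-parameter loss (the names sgd_step, loss_fn, and grad_fn are made up for this sketch), not the internals of tf.keras.optimizers.SGD.

```python
def sgd_step(w, gradient, learning_rate):
    """One gradient descent update: move against the gradient."""
    return w - learning_rate * gradient


def loss_fn(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_fn(w):
    """Analytic gradient of loss_fn."""
    return 2.0 * (w - 3.0)


w = 0.0
history = [loss_fn(w)]
for _ in range(50):
    w = sgd_step(w, grad_fn(w), learning_rate=0.1)
    history.append(loss_fn(w))

print("final w: {:.4f}, final loss: {:.2e}".format(w, history[-1]))
```

Each step shrinks the loss a little, just as the single step above lowered the model's loss from about 1.0923 to 1.0912. A learning rate that is too large can overshoot the minimum, which is one reason it is treated as a hyperparameter.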
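Training loop
With all the pieces in place, the model is ready for training! A training loop feeds the dataset examples into the model to help it make better predictions. The following code block sets up these training steps:
- Iterate over each epoch. An epoch is one pass through the dataset.
- Within an epoch, iterate over each example in the training Dataset, grabbing its features (x) and label (y).
- Using the example's features, make a prediction and compare it with the label. Measure the inaccuracy of the prediction and use that to calculate the model's loss and gradients.
- Use an optimizer to update the model's parameters.
- Keep track of some statistics for visualization.
- Repeat for each epoch.
The num_epochs variable is the number of times to loop over the dataset. In the code below, num_epochs is set to 201, which means this training loop will run 201 times. Counter-intuitively, training a model longer does not guarantee a better model. num_epochs is a hyperparameter that you can tune. Choosing the right number usually requires both experience and experimentation: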
## Note: Rerunning this cell uses the same model parameters
# Keep results for plotting
train_loss_results = []
train_accuracy_results = []
num_epochs = 201
for epoch in range(num_epochs):
    epoch_loss_avg = tf.keras.metrics.Mean()
    epoch_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
    # Training loop - using batches of 32
    for x, y in ds_train_batch:
        # Optimize the model
        loss_value, grads = grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        # Track progress
        epoch_loss_avg.update_state(loss_value)  # Add current batch loss
        # Compare predicted label to actual label
        # training=True is needed only if there are layers with different
        # behavior during training versus inference (e.g. Dropout).
        epoch_accuracy.update_state(y, model(x, training=True))
    # End epoch
    train_loss_results.append(epoch_loss_avg.result())
    train_accuracy_results.append(epoch_accuracy.result())
    if epoch % 50 == 0:
        print("Epoch {:03d}: Loss: {:.3f}, Accuracy: {:.3%}".format(epoch,
                                                                    epoch_loss_avg.result(),
                                                                    epoch_accuracy.result()))
Epoch 000: Loss: 1.082, Accuracy: 44.944%
Epoch 050: Loss: 0.738, Accuracy: 80.524%
Epoch 100: Loss: 0.377, Accuracy: 85.393%
Epoch 150: Loss: 0.236, Accuracy: 95.506%
Epoch 200: Loss: 0.164, Accuracy: 96.255%
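The same epoch-and-batch structure can be shown without TensorFlow. The sketch below is a deliberately tiny stand-in, not the tutorial's model: it fits a line y = w·x + b to hypothetical noise-free data with mini-batch gradient descent on mean squared error, and records the average batch loss per epoch in the same way epoch_loss_avg does above. All names and hyperparameter values here are assumptions chosen for illustration.

```python
# Hypothetical training data generated from y = 2x + 1.
xs = [i / 10 for i in range(40)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0                    # model parameters
learning_rate = 0.05
batch_size = 8
num_epochs = 200
epoch_losses = []                  # average batch loss per epoch, for plotting

for epoch in range(num_epochs):
    batch_losses = []
    # Inner loop: one pass over the dataset in batches.
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Forward pass: prediction error for each example in the batch.
        errors = [(w * x + b) - y for x, y in batch]
        loss = sum(e * e for e in errors) / len(batch)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = sum(2 * e for e in errors) / len(batch)
        # Optimizer step: move each parameter against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        batch_losses.append(loss)
    # End of epoch: record the mean of the batch losses.
    epoch_losses.append(sum(batch_losses) / len(batch_losses))

print("w = {:.3f}, b = {:.3f}".format(w, b))
```

As in the tutorial's loop, the per-epoch loss should fall over time, and the learned parameters should approach the values used to generate the data.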
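Alternatively, you could use the built-in Keras Model.fit(ds_train_batch) method to train your model, after first compiling the model with an optimizer and a loss.
Visualize the loss function over time
While it's helpful to print out the model's training progress, you can also visualize that progress with TensorBoard, a visualization and metrics tool that is packaged with TensorFlow. For this simple example, you will create basic charts using the matplotlib module.
Interpreting these charts takes some experience, but in general you want to see the loss go down and the accuracy go up: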
fig, axes = plt.subplots(2, sharex=True, figsize=(12, 8))
fig.suptitle('Training Metrics')
axes[0].set_ylabel("Loss", fontsize=14)
axes[0].plot(train_loss_results)
axes[1].set_ylabel("Accuracy", fontsize=14)
axes[1].set_xlabel("Epoch", fontsize=14)
axes[1].plot(train_accuracy_results)
plt.show()

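Evaluate the model's effectiveness
Now that the model is trained, you can get some statistics on its performance.
Evaluating means determining how effectively the model makes predictions. To determine the model's effectiveness at penguin classification, pass some measurements to the model and ask the model to predict what penguin species they represent. Then compare the model's predictions against the actual labels. For example, a model that picked the correct species on half the input examples has an accuracy of 0.5. Figure 4 shows a slightly more effective model, getting 4 out of 5 predictions correct at 80% accuracy: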
| Example features | | | | Label | Model prediction |
|---|---|---|---|---|---|
| 5.9 | 3.0 | 4.3 | 1.5 | 1 | 1 |
| 6.9 | 3.1 | 5.4 | 2.1 | 2 | 2 |
| 5.1 | 3.3 | 1.7 | 0.5 | 0 | 0 |
| 6.0 | 3.4 | 4.5 | 1.6 | 1 | 2 |
| 5.5 | 2.5 | 4.0 | 1.3 | 1 | 1 |

Figure 4. A penguin classifier that is 80% accurate.
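The accuracy in Figure 4 can be reproduced by hand: count the rows where the prediction equals the label and divide by the number of rows. A short pure-Python sketch using the label and prediction columns from the table:

```python
# Label and model-prediction columns from Figure 4.
labels = [1, 2, 0, 1, 1]
predictions = [1, 2, 0, 2, 1]

correct = sum(int(y == p) for y, p in zip(labels, predictions))
accuracy = correct / len(labels)
print("Accuracy: {:.0%}".format(accuracy))  # Accuracy: 80%
```

Only the fourth row is misclassified (label 1, prediction 2), which gives 4 correct out of 5.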
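Set up the test set
Evaluating the model is similar to training the model. The biggest difference is that the examples come from a separate test set rather than the training set. To fairly assess a model's effectiveness, the examples used to evaluate a model must be different from the examples used to train the model.
The penguin dataset doesn't have a separate test dataset, so in the earlier "Download the preprocessed dataset" section you split the original dataset into test and training sets. Use the ds_test_batch dataset for the evaluation.
Evaluate the model on the test dataset
Unlike the training stage, the model only evaluates a single epoch of the test data. The following code iterates over each example in the test set and compares the model's prediction against the actual label. This comparison is used to measure the model's accuracy across the entire test set: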
test_accuracy = tf.keras.metrics.Accuracy()
ds_test_batch = ds_test.batch(10)
for (x, y) in ds_test_batch:
    # training=False is needed only if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    logits = model(x, training=False)
    prediction = tf.math.argmax(logits, axis=1, output_type=tf.int64)
    test_accuracy(prediction, y)
print("Test set accuracy: {:.3%}".format(test_accuracy.result()))
Test set accuracy: 97.015%
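You can also use the Keras function model.evaluate(ds_test_batch, return_dict=True) to get accuracy information on your test dataset, provided the model has first been compiled with a loss and an accuracy metric.
By inspecting the last batch, for example, you can observe that the model's predictions are usually correct: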
tf.stack([y,prediction],axis=1)
<tf.Tensor: shape=(7, 2), dtype=int64, numpy=
array([[1, 1],
[0, 0],
[2, 2],
[0, 0],
[1, 1],
[2, 2],
[0, 0]])>
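Use the trained model to make predictions
You've trained a model and "proven" that it's good, but not perfect, at classifying penguin species. Now let's use the trained model to make some predictions on unlabeled examples; that is, on examples that contain features but not labels.
In real life, the unlabeled examples could come from lots of different sources, including apps, CSV files, and data feeds. For this tutorial, manually provide three unlabeled examples to predict their labels. Recall that the label numbers map to the following species names:
- 0: Adélie penguin
- 1: Chinstrap penguin
- 2: Gentoo penguin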
predict_dataset = tf.convert_to_tensor([
    [0.3, 0.8, 0.4, 0.5,],
    [0.4, 0.1, 0.8, 0.5,],
    [0.7, 0.9, 0.8, 0.4]
])
# training=False is needed only if there are layers with different
# behavior during training versus inference (e.g. Dropout).
predictions = model(predict_dataset, training=False)
for i, logits in enumerate(predictions):
    class_idx = tf.math.argmax(logits).numpy()
    p = tf.nn.softmax(logits)[class_idx]
    name = class_names[class_idx]
    print("Example {} prediction: {} ({:4.1f}%)".format(i, name, 100*p))
Example 0 prediction: Adélie (90.4%)
Example 1 prediction: Gentoo (96.0%)
Example 2 prediction: Chinstrap (84.3%)