Classify text with BERT


This tutorial contains complete code to fine-tune BERT to perform sentiment analysis on a dataset of plain-text IMDB movie reviews. In addition to training a model, you will learn how to preprocess text into an appropriate format.

In this notebook, you will:

  • Load the IMDB dataset
  • Load a BERT model from TensorFlow Hub
  • Build your own model by combining BERT with a classifier
  • Train your own model, fine-tuning BERT as part of that
  • Save your model and use it to classify sentences

If you're new to working with the IMDB dataset, please see Basic text classification for more details.

About BERT

BERT and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after, hence the name: Bidirectional Encoder Representations from Transformers.

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.

Setup

# A dependency of the preprocessing for BERT inputs
pip install -U "tensorflow-text==2.13.*"

You will use the AdamW optimizer from tensorflow/models.

pip install "tf-models-official==2.13.*"
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization  # to create AdamW optimizer

import matplotlib.pyplot as plt

tf.get_logger().setLevel('ERROR')

Sentiment analysis

This notebook trains a sentiment analysis model to classify movie reviews as positive or negative, based on the text of the review.

You'll use the Large Movie Review Dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database.

Download the IMDB dataset

Let's download and extract the dataset, then explore the directory structure.

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

train_dir = os.path.join(dataset_dir, 'train')

# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 [==============================] - 6s 0us/step

Next, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset.

The IMDB dataset has already been divided into train and test sets, but it lacks a validation set. Let's create a validation set from an 80:20 split of the training data by using the validation_split argument below.

AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size)

test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
2023-11-17 13:38:17.555894: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://tensorflowcn.cn/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.

Let's take a look at a few reviews.

for text_batch, label_batch in train_ds.take(1):
  for i in range(3):
    print(f'Review: {text_batch.numpy()[i]}')
    label = label_batch.numpy()[i]
    print(f'Label : {label} ({class_names[label]})')
Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label : 0 (neg)
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into complicated situations, and so does the perspective of the viewer.<br /><br />So is 'Homicide' which from the title tries to set the mind of the viewer to the usual crime drama. The principal characters are two cops, one Jewish and one Irish who deal with a racially charged area. The murder of an old Jewish shop owner who proves to be an ancient veteran of the Israeli Independence war triggers the Jewish identity in the mind and heart of the Jewish detective.<br /><br />This is were the flaws of the film are the more obvious. The process of awakening is theatrical and hard to believe, the group of Jewish militants is operatic, and the way the detective eventually walks to the final violent confrontation is pathetic. The end of the film itself is Mamet-like smart, but disappoints from a human emotional perspective.<br /><br />Joe Mantegna and William Macy give strong performances, but the flaws of the story are too evident to be easily compensated."
Label : 0 (neg)
Review: b'Great documentary about the lives of NY firefighters during the worst terrorist attack of all time.. That reason alone is why this should be a must see collectors item.. What shocked me was not only the attacks, but the"High Fat Diet" and physical appearance of some of these firefighters. I think a lot of Doctors would agree with me that,in the physical shape they were in, some of these firefighters would NOT of made it to the 79th floor carrying over 60 lbs of gear. Having said that i now have a greater respect for firefighters and i realize becoming a firefighter is a life altering job. The French have a history of making great documentary\'s and that is what this is, a Great Documentary.....'
Label : 1 (pos)
2023-11-17 13:38:20.000995: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.

Loading models from TensorFlow Hub

Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available.

  • BERT-Base, Uncased and seven more models with trained weights released by the original BERT authors.
  • Small BERTs have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size, and quality.
  • ALBERT: four different sizes of "A Lite BERT" that reduces model size (but not computation time) by sharing parameters between layers.
  • BERT Experts: eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
  • Electra has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
  • BERT with Talking-Heads Attention and Gated GELU [base, large] has two improvements to the core of the Transformer architecture.

The model documentation on TensorFlow Hub has more details and references to the research literature. Follow the links above, or click on the tfhub.dev URL printed after the next cell execution.

The suggestion is to start with a Small BERT (with fewer parameters) since they are faster to fine-tune. If you like a small model but want higher accuracy, ALBERT might be your next option. If you want even better accuracy, choose one of the classic BERT sizes or their recent refinements like Electra, Talking Heads, or a BERT Expert.

Aside from the models available below, there are multiple versions of the models that are larger and can yield even better accuracy, but they are too big to be fine-tuned on a single GPU. You will be able to do that on the Solve GLUE tasks using BERT on a TPU colab.

You'll see in the code below that switching the tfhub.dev URL is enough to try any of these models, because all the differences between them are encapsulated in the SavedModels from TF Hub.

Choose a BERT model to fine-tune
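
The dropdown form cell from the original notebook, which sets the two TF Hub handles used throughout the rest of this tutorial, is not reproduced here. Below is a minimal sketch of the equivalent assignment, assuming the Small BERT encoder and matching preprocessing model shown in the printed output:

# Sketch (assumption): the full notebook selects these handles from a dropdown form.
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')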

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3

The preprocessing model

Text inputs need to be transformed to numeric token ids and arranged in several Tensors before being input to BERT. TensorFlow Hub provides a matching preprocessing model for each of the BERT models discussed above, which implements this transformation using TF ops from the TF.text library. It is not necessary to run pure Python code outside your TensorFlow model to preprocess text.

The preprocessing model must be the one referenced by the documentation of the BERT model, which you can read at the URL printed above. For BERT models from the drop-down above, the preprocessing model is selected automatically.

bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

Let's try the preprocessing model on some text and look at the output:

text_test = ['this is such an amazing movie!']
text_preprocessed = bert_preprocess_model(text_test)

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')
Keys       : ['input_mask', 'input_word_ids', 'input_type_ids']
Shape      : (1, 128)
Word Ids   : [ 101 2023 2003 2107 2019 6429 3185  999  102    0    0    0]
Input Mask : [1 1 1 1 1 1 1 1 1 0 0 0]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]

As you can see, you now have the 3 outputs from the preprocessing that a BERT model would use (input_word_ids, input_mask, and input_type_ids).

Some other important points:

  • The input is truncated to 128 tokens. The number of tokens can be customized, and you can see more details on the Solve GLUE tasks using BERT on a TPU colab.
  • The input_type_ids only have one value (0) because this is a single sentence input. For a multiple sentence input, it would have one number for each input.

Since this text preprocessor is a TensorFlow model, it can be included in your model directly.
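
For example, if you want a sequence length other than the default 128, you can wrap the preprocessing SavedModel in a small Keras model of your own. This is a minimal sketch, assuming the tokenize and bert_pack_inputs sub-objects documented for this preprocessing model on TF Hub; seq_length=64 is an illustrative choice, not a value used elsewhere in this tutorial.

# Sketch (assumption): build a preprocessing model with a custom sequence length.
preprocessor = hub.load(tfhub_handle_preprocess)

def make_preprocess_model(seq_length=64):
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  # Tokenize the raw strings, then pack them into fixed-length BERT inputs.
  tokenize = hub.KerasLayer(preprocessor.tokenize, name='tokenizer')
  pack = hub.KerasLayer(preprocessor.bert_pack_inputs,
                        arguments=dict(seq_length=seq_length), name='packer')
  encoder_inputs = pack([tokenize(text_input)])
  return tf.keras.Model(text_input, encoder_inputs)

custom_preprocess_model = make_preprocess_model(seq_length=64)
print(custom_preprocess_model(tf.constant(text_test))['input_word_ids'].shape)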

Using the BERT model

Before putting BERT into your own model, let's take a look at its outputs. You will load it from TF Hub and see the returned values.

bert_model = hub.KerasLayer(tfhub_handle_encoder)
bert_results = bert_model(text_preprocessed)

print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape:{bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values:{bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape:{bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values:{bert_results["sequence_output"][0, :12]}')
Loaded BERT: https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Pooled Outputs Shape:(1, 512)
Pooled Outputs Values:[ 0.762629    0.99280983 -0.18611868  0.36673862  0.15233733  0.6550447
  0.9681154  -0.9486271   0.00216128 -0.9877732   0.06842692 -0.97630584]
Sequence Outputs Shape:(1, 128, 512)
Sequence Outputs Values:[[-0.28946346  0.3432128   0.33231518 ...  0.21300825  0.7102068
  -0.05771117]
 [-0.28742072  0.31981036 -0.23018576 ...  0.58455    -0.21329743
   0.72692114]
 [-0.66157067  0.68876773 -0.8743301  ...  0.1087725  -0.26173177
   0.47855407]
 ...
 [-0.2256118  -0.2892561  -0.0706445  ...  0.47566038  0.83277136
   0.40025333]
 [-0.2982428  -0.27473134 -0.05450517 ...  0.48849747  1.0955354
   0.18163396]
 [-0.44378242  0.00930811  0.07223688 ...  0.1729009   1.1833243
   0.07898017]]

The BERT model returns a map with 3 important keys: pooled_output, sequence_output, and encoder_outputs:

  • pooled_output represents each input sequence as a whole. The shape is [batch_size, H]. You can think of this as an embedding for the entire movie review.
  • sequence_output represents each input token in context. The shape is [batch_size, seq_length, H]. You can think of this as a contextual embedding for every token in the movie review.
  • encoder_outputs are the intermediate activations of the L Transformer blocks. outputs["encoder_outputs"][i] is a Tensor of shape [batch_size, seq_length, H] with the outputs of the i-th Transformer block, for 0 <= i < L. The last value of the list is equal to sequence_output.

For fine-tuning, you are going to use the pooled_output array.
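
As a quick sanity check of that last point, you can compare the final entry of encoder_outputs with sequence_output. A small sketch using the bert_results computed above; the printed difference should be (numerically) zero:

# The output of the last Transformer block should equal sequence_output.
last_encoder_output = bert_results['encoder_outputs'][-1]
print(tf.reduce_max(tf.abs(last_encoder_output - bert_results['sequence_output'])).numpy())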

Define your model

You will create a very simple fine-tuned model, with the preprocessing model, the selected BERT model, one Dense layer, and a Dropout layer.

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)

Let's check that the model runs with the output of the preprocessing model.

classifier_model = build_classifier_model()
bert_raw_result = classifier_model(tf.constant(text_test))
print(tf.sigmoid(bert_raw_result))
tf.Tensor([[0.21878408]], shape=(1, 1), dtype=float32)

The output is meaningless, of course, because the model has not been trained yet.

Let's take a look at the model's structure.

tf.keras.utils.plot_model(classifier_model)

(Model architecture diagram: text input → preprocessing → BERT_encoder → dropout → classifier)
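
If the optional pydot and graphviz dependencies required by plot_model are not available in your environment, classifier_model.summary() prints a textual view of the same layer stack:

# Textual alternative to the diagram above.
classifier_model.summary()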

Model training

You now have all the pieces to train a model, including the preprocessing module, BERT encoder, data, and classifier.

Loss function

Since this is a binary classification problem and the model outputs a single logit (a one-unit Dense layer with no activation), you'll use the losses.BinaryCrossentropy loss function with from_logits=True.

loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()

Optimizer

For fine-tuning, let's use the same optimizer that BERT was originally trained with: "Adaptive Moments" (Adam). This optimizer minimizes the prediction loss and does regularization by weight decay (not using moments), which is also known as AdamW.

For the learning rate (init_lr), you will use the same schedule as BERT pre-training: linear decay of a notional initial learning rate, prefixed with a linear warm-up phase over the first 10% of training steps (num_warmup_steps). In line with the BERT paper, the initial learning rate is smaller for fine-tuning (best of 5e-5, 3e-5, 2e-5).

epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')

Loading the BERT model and training

Using the classifier_model you created earlier, you can compile the model with the loss, metric, and optimizer.

classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)
Training model with https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1
Epoch 1/5
625/625 [==============================] - 703s 1s/step - loss: 0.4885 - binary_accuracy: 0.7435 - val_loss: 0.3795 - val_binary_accuracy: 0.8350
Epoch 2/5
625/625 [==============================] - 688s 1s/step - loss: 0.3300 - binary_accuracy: 0.8536 - val_loss: 0.3708 - val_binary_accuracy: 0.8424
Epoch 3/5
625/625 [==============================] - 687s 1s/step - loss: 0.2522 - binary_accuracy: 0.8935 - val_loss: 0.3942 - val_binary_accuracy: 0.8434
Epoch 4/5
625/625 [==============================] - 687s 1s/step - loss: 0.1991 - binary_accuracy: 0.9208 - val_loss: 0.4291 - val_binary_accuracy: 0.8514
Epoch 5/5
625/625 [==============================] - 686s 1s/step - loss: 0.1573 - binary_accuracy: 0.9410 - val_loss: 0.4724 - val_binary_accuracy: 0.8536

Evaluate the model

Let's see how the model performs. Two values will be returned: loss (a number that represents the error; lower values are better) and accuracy.

loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')
782/782 [==============================] - 221s 282ms/step - loss: 0.4534 - binary_accuracy: 0.8548
Loss: 0.4534495174884796
Accuracy: 0.8547999858856201

Plot the accuracy and loss over time

Based on the History object returned by model.fit(), you can plot the training and validation loss for comparison, as well as the training and validation accuracy:

history_dict = history.history
print(history_dict.keys())

acc = history_dict['binary_accuracy']
val_acc = history_dict['val_binary_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)
fig = plt.figure(figsize=(10, 6))
fig.tight_layout()

plt.subplot(2, 1, 1)
# r is for "solid red line"
plt.plot(epochs, loss, 'r', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
# plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(epochs, acc, 'r', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
dict_keys(['loss', 'binary_accuracy', 'val_loss', 'val_binary_accuracy'])
<matplotlib.legend.Legend at 0x7fa410141c10>

(Plots: training and validation loss, and training and validation accuracy, per epoch)

In this plot, the red lines represent the training loss and accuracy, and the blue lines are the validation loss and accuracy.

Export for inference

Now you just save your fine-tuned model for later use.

dataset_name = 'imdb'
saved_model_path = './{}_bert'.format(dataset_name.replace('/', '_'))

classifier_model.save(saved_model_path, include_optimizer=False)

Let's reload the model, so you can try it side by side with the model that is still in memory.

reloaded_model = tf.saved_model.load(saved_model_path)

Here you can test your model on any sentence you want; just add it to the examples variable below.

def print_my_examples(inputs, results):
  result_for_printing = \
    [f'input: {inputs[i]:<30} : score: {results[i][0]:.6f}'
                         for i in range(len(inputs))]
  print(*result_for_printing, sep='\n')
  print()


examples = [
    'this is such an amazing movie!',  # this is the same sentence tried earlier
    'The movie was great!',
    'The movie was meh.',
    'The movie was okish.',
    'The movie was terrible...'
]

reloaded_results = tf.sigmoid(reloaded_model(tf.constant(examples)))
original_results = tf.sigmoid(classifier_model(tf.constant(examples)))

print('Results from the saved model:')
print_my_examples(examples, reloaded_results)
print('Results from the model in memory:')
print_my_examples(examples, original_results)
Results from the saved model:
input: this is such an amazing movie! : score: 0.997565
input: The movie was great!           : score: 0.983252
input: The movie was meh.             : score: 0.986901
input: The movie was okish.           : score: 0.206568
input: The movie was terrible...      : score: 0.001724

Results from the model in memory:
input: this is such an amazing movie! : score: 0.997565
input: The movie was great!           : score: 0.983252
input: The movie was meh.             : score: 0.986901
input: The movie was okish.           : score: 0.206568
input: The movie was terrible...      : score: 0.001724

If you want to use your model on TF Serving, remember that it will call your SavedModel through one of its named signatures. In Python, you can test them as follows:

serving_results = reloaded_model \
            .signatures['serving_default'](tf.constant(examples))

serving_results = tf.sigmoid(serving_results['classifier'])

print_my_examples(examples, serving_results)
input: this is such an amazing movie! : score: 0.997565
input: The movie was great!           : score: 0.983252
input: The movie was meh.             : score: 0.986901
input: The movie was okish.           : score: 0.206568
input: The movie was terrible...      : score: 0.001724

Next steps

As a next step, you can try the Solve GLUE tasks using BERT on a TPU tutorial, which runs on a TPU and shows you how to work with multiple inputs.