Customizing a Transformer Encoder

Learning objectives

The TensorFlow Models NLP library is a collection of tools for building and training modern high-performance natural language models.

tfm.nlp.networks.EncoderScaffold is the core of this library, and lots of new network architectures have been proposed to improve the encoder. In this Colab notebook, we will learn how to customize the encoder to employ new network architectures.

Install and import

Install the TensorFlow Model Garden pip package

  • tf-models-official is the stable Model Garden package. Note that it may not include the latest changes in the tensorflow_models GitHub repo. To include the latest changes, you may install tf-models-nightly, which is the nightly Model Garden package created daily and automatically.
  • pip will install all models and dependencies automatically.
pip install -q opencv-python
pip install -q tf-models-official

Import TensorFlow and other libraries

import numpy as np
import tensorflow as tf

import tensorflow_models as tfm
nlp = tfm.nlp
2023-12-14 12:09:32.926415: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-14 12:09:32.926462: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-14 12:09:32.927992: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

Canonical BERT encoder

Before learning how to customize the encoder, let's first create a canonical BERT encoder and use it to instantiate an nlp.models.BertClassifier for a classification task.

cfg = {
    "vocab_size": 100,
    "hidden_size": 32,
    "num_layers": 3,
    "num_attention_heads": 4,
    "intermediate_size": 64,
    "activation": tfm.utils.activations.gelu,
    "dropout_rate": 0.1,
    "attention_dropout_rate": 0.1,
    "max_sequence_length": 16,
    "type_vocab_size": 2,
    "initializer": tf.keras.initializers.TruncatedNormal(stddev=0.02),
}
bert_encoder = nlp.networks.BertEncoder(**cfg)

def build_classifier(bert_encoder):
  return nlp.models.BertClassifier(bert_encoder, num_classes=2)

canonical_classifier_model = build_classifier(bert_encoder)

canonical_classifier_model can be trained with training data. For details on how to train the model, please see the Fine tuning BERT notebook. We skip the code that trains the model here.
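
For completeness, here is a minimal sketch of what such a training step could look like, using a standard Keras compile/fit loop on randomly generated placeholder data. The placeholder data and hyperparameters are illustrative assumptions rather than part of the tutorial, and the cell is not executed here, so the prediction outputs shown below are unaffected.

# Illustrative only: random placeholder data standing in for a real dataset.
train_word_ids = np.random.randint(
    cfg["vocab_size"], size=(8, cfg["max_sequence_length"]))
train_mask = np.random.randint(2, size=(8, cfg["max_sequence_length"]))
train_type_ids = np.random.randint(
    cfg["type_vocab_size"], size=(8, cfg["max_sequence_length"]))
train_labels = np.random.randint(2, size=(8,))

# The classifier outputs logits over `num_classes`, hence `from_logits=True`.
canonical_classifier_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
canonical_classifier_model.fit(
    x=[train_word_ids, train_mask, train_type_ids], y=train_labels, epochs=1)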

After training, we can apply the model to make predictions.

def predict(model):
  batch_size = 3
  np.random.seed(0)
  word_ids = np.random.randint(
      cfg["vocab_size"], size=(batch_size, cfg["max_sequence_length"]))
  mask = np.random.randint(2, size=(batch_size, cfg["max_sequence_length"]))
  type_ids = np.random.randint(
      cfg["type_vocab_size"], size=(batch_size, cfg["max_sequence_length"]))
  print(model([word_ids, mask, type_ids], training=False))

predict(canonical_classifier_model)
tf.Tensor(
[[ 0.03545166  0.30729884]
 [ 0.00677404  0.17251147]
 [-0.07276718  0.17345032]], shape=(3, 2), dtype=float32)

Customize BERT encoder

A BERT encoder consists of an embedding network and multiple Transformer blocks, and each Transformer block contains an attention layer and a feedforward layer.
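
One quick way to see this structure is to list the sub-layers of the canonical encoder built above. This small helper relies only on the standard Keras layers attribute and is just for inspection:

# Print each sub-layer's name and class to reveal the embedding and
# Transformer components of the canonical encoder.
for layer in bert_encoder.layers:
  print(layer.name, type(layer).__name__)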

We provide easy ways to customize each of those components via (1) EncoderScaffold and (2) TransformerScaffold.

Use EncoderScaffold

networks.EncoderScaffold allows users to provide a custom embedding subnetwork (which will replace the standard embedding logic) and/or a custom hidden layer class (which will replace the Transformer instantiation in the encoder).

Without customization

Without any customization, networks.EncoderScaffold behaves the same as the canonical networks.BertEncoder.

As shown in the following example, networks.EncoderScaffold can load networks.BertEncoder's weights and output the same values:

default_hidden_cfg = dict(
    num_attention_heads=cfg["num_attention_heads"],
    intermediate_size=cfg["intermediate_size"],
    intermediate_activation=cfg["activation"],
    dropout_rate=cfg["dropout_rate"],
    attention_dropout_rate=cfg["attention_dropout_rate"],
    kernel_initializer=cfg["initializer"],
)
default_embedding_cfg = dict(
    vocab_size=cfg["vocab_size"],
    type_vocab_size=cfg["type_vocab_size"],
    hidden_size=cfg["hidden_size"],
    initializer=cfg["initializer"],
    dropout_rate=cfg["dropout_rate"],
    max_seq_length=cfg["max_sequence_length"]
)
default_kwargs = dict(
    hidden_cfg=default_hidden_cfg,
    embedding_cfg=default_embedding_cfg,
    num_hidden_instances=cfg["num_layers"],
    pooled_output_dim=cfg["hidden_size"],
    return_all_layer_outputs=True,
    pooler_layer_initializer=cfg["initializer"],
)

encoder_scaffold = nlp.networks.EncoderScaffold(**default_kwargs)
classifier_model_from_encoder_scaffold = build_classifier(encoder_scaffold)
classifier_model_from_encoder_scaffold.set_weights(
    canonical_classifier_model.get_weights())
predict(classifier_model_from_encoder_scaffold)
WARNING:absl:The `Transformer` layer is deprecated. Please directly use `TransformerEncoderBlock`.
WARNING:absl:The `Transformer` layer is deprecated. Please directly use `TransformerEncoderBlock`.
WARNING:absl:The `Transformer` layer is deprecated. Please directly use `TransformerEncoderBlock`.
tf.Tensor(
[[ 0.03545166  0.30729884]
 [ 0.00677404  0.17251147]
 [-0.07276718  0.17345032]], shape=(3, 2), dtype=float32)

Customizing embedding

Next, we show how to use a customized embedding network.

We first build an embedding network that will replace the default one. It will have 2 inputs (mask and word_ids) instead of 3, and won't use positional embeddings.

word_ids = tf.keras.layers.Input(
    shape=(cfg['max_sequence_length'],), dtype=tf.int32, name="input_word_ids")
mask = tf.keras.layers.Input(
    shape=(cfg['max_sequence_length'],), dtype=tf.int32, name="input_mask")
embedding_layer = nlp.layers.OnDeviceEmbedding(
    vocab_size=cfg['vocab_size'],
    embedding_width=cfg['hidden_size'],
    initializer=cfg["initializer"],
    name="word_embeddings")
word_embeddings = embedding_layer(word_ids)
attention_mask = nlp.layers.SelfAttentionMask()([word_embeddings, mask])
new_embedding_network = tf.keras.Model([word_ids, mask],
                                       [word_embeddings, attention_mask])
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
  warnings.warn(

Inspecting the new_embedding_network, we can see it takes two inputs: input_word_ids and input_mask.

tf.keras.utils.plot_model(new_embedding_network, show_shapes=True, dpi=48)

[Model plot: new_embedding_network]

We can then build a new encoder using the above new_embedding_network.

kwargs = dict(default_kwargs)

# Use new embedding network.
kwargs['embedding_cls'] = new_embedding_network
kwargs['embedding_data'] = embedding_layer.embeddings

encoder_with_customized_embedding = nlp.networks.EncoderScaffold(**kwargs)
classifier_model = build_classifier(encoder_with_customized_embedding)
# ... Train the model ...
print(classifier_model.inputs)

# Assert that there are only two inputs.
assert len(classifier_model.inputs) == 2
WARNING:absl:The `Transformer` layer is deprecated. Please directly use `TransformerEncoderBlock`.
WARNING:absl:The `Transformer` layer is deprecated. Please directly use `TransformerEncoderBlock`.
WARNING:absl:The `Transformer` layer is deprecated. Please directly use `TransformerEncoderBlock`.
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
  warnings.warn(
[<KerasTensor: shape=(None, 16) dtype=int32 (created by layer 'input_word_ids')>, <KerasTensor: shape=(None, 16) dtype=int32 (created by layer 'input_mask')>]

Customized Transformer

Users can also override the hidden_cls argument in the networks.EncoderScaffold constructor to employ a customized Transformer layer.

See the source of nlp.layers.ReZeroTransformer for how to implement a customized Transformer layer.
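
For orientation, below is a minimal sketch of the interface such a class needs, under the assumption that EncoderScaffold instantiates hidden_cls with the keyword arguments from hidden_cfg and calls each instance on [hidden_states, attention_mask]. The class simply wraps the standard nlp.layers.TransformerEncoderBlock, and the name MyTransformerBlock is a placeholder for your own implementation.

class MyTransformerBlock(tf.keras.layers.Layer):
  """Illustrative hidden layer: wraps a standard encoder block."""

  def __init__(self, num_attention_heads, intermediate_size,
               intermediate_activation, dropout_rate=0.0,
               attention_dropout_rate=0.0,
               kernel_initializer="glorot_uniform", **kwargs):
    super().__init__(**kwargs)
    # Replace this block with your own attention/feedforward logic.
    self._block = nlp.layers.TransformerEncoderBlock(
        num_attention_heads=num_attention_heads,
        inner_dim=intermediate_size,
        inner_activation=intermediate_activation,
        output_dropout=dropout_rate,
        attention_dropout=attention_dropout_rate,
        kernel_initializer=kernel_initializer)

  def call(self, inputs):
    # `inputs` is expected to be [hidden_states, attention_mask].
    return self._block(inputs)

# It could then be plugged in the same way as ReZeroTransformer below, e.g.
# kwargs['hidden_cls'] = MyTransformerBlock.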

The following is an example of using nlp.layers.ReZeroTransformer:

kwargs = dict(default_kwargs)

# Use ReZeroTransformer.
kwargs['hidden_cls'] = nlp.layers.ReZeroTransformer

encoder_with_rezero_transformer = nlp.networks.EncoderScaffold(**kwargs)
classifier_model = build_classifier(encoder_with_rezero_transformer)
# ... Train the model ...
predict(classifier_model)

# Assert that the variable `rezero_alpha` from ReZeroTransformer exists.
assert 'rezero_alpha' in ''.join([x.name for x in classifier_model.trainable_weights])
tf.Tensor(
[[-0.08663296  0.09281035]
 [-0.07291833  0.36477187]
 [-0.08730186  0.1503254 ]], shape=(3, 2), dtype=float32)

Use nlp.layers.TransformerScaffold

The above approach to customizing the model requires rewriting the whole nlp.layers.Transformer layer, while sometimes you may only want to customize either the attention layer or the feedforward block. In that case, nlp.layers.TransformerScaffold can be used.

Customize attention

Users can also override the attention_cls argument in the layers.TransformerScaffold constructor to employ a customized attention layer.

See the source of nlp.layers.TalkingHeadsAttention for how to implement a customized attention layer.
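
As a rough sketch of what a custom attention class can look like, the example below subclasses tf.keras.layers.MultiHeadAttention (which TalkingHeadsAttention also extends) and merely scales the attention output by a constant. It assumes TransformerScaffold constructs attention_cls with MultiHeadAttention-style keyword arguments and calls it with query, value and attention_mask; the class name and the 0.5 factor are purely illustrative.

class ScaledSelfAttention(tf.keras.layers.MultiHeadAttention):
  """Illustrative attention variant: scales the attention output by 0.5."""

  def call(self, query, value, **kwargs):
    # Assumes the default single-tensor return (no attention scores requested).
    return 0.5 * super().call(query, value, **kwargs)

# It would be used just like TalkingHeadsAttention below, e.g.
# hidden_cfg['attention_cls'] = ScaledSelfAttention.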

The following is an example of using nlp.layers.TalkingHeadsAttention:

# Use TalkingHeadsAttention
hidden_cfg = dict(default_hidden_cfg)
hidden_cfg['attention_cls'] = nlp.layers.TalkingHeadsAttention

kwargs = dict(default_kwargs)
kwargs['hidden_cls'] = nlp.layers.TransformerScaffold
kwargs['hidden_cfg'] = hidden_cfg

encoder = nlp.networks.EncoderScaffold(**kwargs)
classifier_model = build_classifier(encoder)
# ... Train the model ...
predict(classifier_model)

# Assert that the variable `pre_softmax_weight` from TalkingHeadsAttention exists.
assert 'pre_softmax_weight' in ''.join([x.name for x in classifier_model.trainable_weights])
tf.Tensor(
[[-0.20591784  0.09203205]
 [-0.0056177  -0.10278902]
 [-0.21681327 -0.12282   ]], shape=(3, 2), dtype=float32)
tf.keras.utils.plot_model(encoder_with_rezero_transformer, show_shapes=True, dpi=48)

[Model plot: encoder_with_rezero_transformer]

Customize feedforward

Similarly, you can also customize the feedforward layer.

See the source of nlp.layers.GatedFeedforward for how to implement a customized feedforward layer.

The following is an example of using nlp.layers.GatedFeedforward:

# Use GatedFeedforward
hidden_cfg = dict(default_hidden_cfg)
hidden_cfg['feedforward_cls'] = nlp.layers.GatedFeedforward

kwargs = dict(default_kwargs)
kwargs['hidden_cls'] = nlp.layers.TransformerScaffold
kwargs['hidden_cfg'] = hidden_cfg

encoder_with_gated_feedforward = nlp.networks.EncoderScaffold(**kwargs)
classifier_model = build_classifier(encoder_with_gated_feedforward)
# ... Train the model ...
predict(classifier_model)

# Assert that the variable `gate` from GatedFeedforward exists.
assert 'gate' in ''.join([x.name for x in classifier_model.trainable_weights])
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer TruncatedNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
  warnings.warn(
tf.Tensor(
[[-0.10270456 -0.10999684]
 [-0.03512481  0.15430304]
 [-0.23601504 -0.18162844]], shape=(3, 2), dtype=float32)

Build a new encoder

Finally, you can also build a new encoder using building blocks from the modeling library.

See the source of nlp.networks.AlbertEncoder as an example of how to do this.
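
Before turning to AlbertEncoder, here is a minimal sketch of what assembling a small encoder from the library's building blocks can look like. It combines OnDeviceEmbedding, PositionEmbedding, SelfAttentionMask and TransformerEncoderBlock; the two-input design, the pooling choice and the tiny_encoder name are illustrative assumptions, not a prescribed recipe.

seq_len = cfg["max_sequence_length"]
word_ids = tf.keras.layers.Input(
    shape=(seq_len,), dtype=tf.int32, name="input_word_ids")
mask = tf.keras.layers.Input(
    shape=(seq_len,), dtype=tf.int32, name="input_mask")

# Token embeddings plus learned position embeddings.
embeddings = nlp.layers.OnDeviceEmbedding(
    vocab_size=cfg["vocab_size"], embedding_width=cfg["hidden_size"])(word_ids)
position_embeddings = nlp.layers.PositionEmbedding(max_length=seq_len)(embeddings)
data = tf.keras.layers.Add()([embeddings, position_embeddings])

# Self-attention mask and a stack of Transformer blocks.
attention_mask = nlp.layers.SelfAttentionMask()([data, mask])
for _ in range(cfg["num_layers"]):
  data = nlp.layers.TransformerEncoderBlock(
      num_attention_heads=cfg["num_attention_heads"],
      inner_dim=cfg["intermediate_size"],
      inner_activation=cfg["activation"])([data, attention_mask])

# Pool the first ([CLS]) position into a fixed-size vector.
cls_output = tf.keras.layers.Lambda(lambda x: x[:, 0, :])(data)
pooled_output = tf.keras.layers.Dense(
    cfg["hidden_size"], activation="tanh")(cls_output)

tiny_encoder = tf.keras.Model(
    inputs=[word_ids, mask],
    outputs=dict(sequence_output=data, pooled_output=pooled_output))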

The following is an example of using nlp.networks.AlbertEncoder:

albert_encoder = nlp.networks.AlbertEncoder(**cfg)
classifier_model = build_classifier(albert_encoder)
# ... Train the model ...
predict(classifier_model)
tf.Tensor(
[[-0.00369881 -0.2540995 ]
 [ 0.1235221  -0.2959229 ]
 [-0.08698564 -0.17653546]], shape=(3, 2), dtype=float32)

Inspecting the albert_encoder, we see that it stacks the same Transformer layer multiple times (note the loop-back in the "Transformer" block in the plot below).

tf.keras.utils.plot_model(albert_encoder, show_shapes=True, dpi=48)

[Model plot: albert_encoder]