This tutorial contains an introduction to word embeddings. You will train your own word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the Embedding Projector (shown in the image below).
Representing text as numbers
Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.
One-hot encodings
As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, you create a zero vector with length equal to the vocabulary, then place a 1 in the index that corresponds to the word. This approach is shown in the following diagram.
To create a vector that contains the encoding of the sentence, you could then concatenate the one-hot vectors for each word.
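To make the idea concrete, here is a minimal sketch (not part of the original notebook) that builds these one-hot vectors by hand for the example sentence:
vocab = ['cat', 'mat', 'on', 'sat', 'the']
sentence = 'the cat sat on the mat'.split()
# Build one one-hot vector per word; concatenating them encodes the sentence.
one_hot = []
for word in sentence:
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    one_hot.append(vec)
print(one_hot[0])  # 'the' -> [0, 0, 0, 0, 1]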
Encode each word with a unique number
A second approach you might try is to encode each word using a unique number. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. This approach is efficient: instead of a sparse vector, you now have a dense one (where all elements are full).
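As an illustration only (not part of the original notebook), a hand-built word-to-index mapping produces exactly this kind of dense encoding:
# Hypothetical word-to-index assignments, as described above.
word_index = {'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
sentence = 'the cat sat on the mat'.split()
encoded = [word_index[word] for word in sentence]
print(encoded)  # [5, 1, 4, 3, 5, 2]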
There are two downsides to this approach, however:
The integer encoding is arbitrary (it does not capture any relationship between words).
An integer encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
Word embeddings
Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), and up to 1024-dimensional when working with large datasets. A higher dimensional embedding can capture finer-grained relationships between words, but takes more data to learn.
Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as a "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.
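The lookup-table view can be sketched in a few lines (this snippet is illustrative only; the 4-dimensional vectors are made-up values, not learned weights):
embedding_table = {
    'cat': [1.2, -0.1, 4.3, 3.2],
    'mat': [0.4, 2.5, -0.9, 1.1],
    'on':  [-1.7, 0.3, 0.8, 0.1],
    'sat': [2.2, 1.9, -0.4, 0.6],
    'the': [0.0, -1.3, 0.2, 2.4],
}
# Encoding a word is just a lookup of its dense vector in the table.
print(embedding_table['cat'])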
Setup
import io
import os
import re
import shutil
import string
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization
Download the IMDb dataset
You will use the Large Movie Review Dataset throughout this tutorial. You will train a sentiment classifier model on this dataset and, in the process, learn embeddings from scratch. To read more about loading a dataset from scratch, see the Loading text tutorial.
Download the dataset using the Keras file utility and take a look at the directories.
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
untar=True, cache_dir='.',
cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz 84125825/84125825 [==============================] - 3s 0us/step ['test', 'imdb.vocab', 'README', 'train', 'imdbEr.txt']
Take a look at the train/ directory. It contains pos and neg folders with movie reviews labelled as positive and negative respectively. You will use reviews from the pos and neg folders to train a binary classification model.
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
['pos', 'neg', 'unsupBow.feat', 'urls_pos.txt', 'unsup', 'urls_neg.txt', 'urls_unsup.txt', 'labeledBow.feat']
The train directory also contains additional folders, which should be removed before creating the training dataset.
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
Next, create a tf.data.Dataset using tf.keras.utils.text_dataset_from_directory. You can read more about using this utility in this text classification tutorial.
Use the train directory to create both the training and validation sets, with a split of 20% for validation.
batch_size = 1024
seed = 123
train_ds = tf.keras.utils.text_dataset_from_directory(
'aclImdb/train', batch_size=batch_size, validation_split=0.2,
subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(
'aclImdb/train', batch_size=batch_size, validation_split=0.2,
subset='validation', seed=seed)
Found 25000 files belonging to 2 classes. Using 20000 files for training. Found 25000 files belonging to 2 classes. Using 5000 files for validation.
Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the training set.
for text_batch, label_batch in train_ds.take(1):
for i in range(5):
print(label_batch[i].numpy(), text_batch.numpy()[i])
0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe" 1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without all the annoying songs). The songs that are sung are likable; you might even find yourself singing these songs once the movie is through. This musical ranks number two in musicals to me (second next to the blues brothers). But please, do not think of it as a musical per say; seeing as how the songs are so likable, it is hard to tell a carefully choreographed scene is taking place. I think of this movie as more of a comedy with undertones of romance. You will be reminded of what it was like to be a rebellious teenager; needless to say, you will be reminiscing of your old high school days after seeing this film. Highly recommended for both the family (since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience.' 0 b"Alex D. Linz replaces Macaulay Culkin as the central figure in the third movie in the Home Alone empire. Four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car. Because of baggage confusion, grouchy Mrs. Hess (Marian Seldes) gets the car. She gives it to her neighbor, Alex (Linz), just before the spies turn up. The spies rent a house in order to burglarize each house in the neighborhood until they locate the car. Home alone with the chicken pox, Alex calls 911 each time he spots a theft in progress, but the spies always manage to elude the police while Alex is accused of making prank calls. The spies finally turn their attentions toward Alex, unaware that he has rigged devices to cleverly booby-trap his entire house. Home Alone 3 wasn't horrible, but probably shouldn't have been made, you can't just replace Macauley Culkin, Joe Pesci, or Daniel Stern. Home Alone 3 had some funny parts, but I don't like when characters are changed in a movie series, view at own risk." 0 b"There's a good movie lurking here, but this isn't it. The basic idea is good: to explore the moral issues that would face a group of young survivors of the apocalypse. But the logic is so muddled that it's impossible to get involved.<br /><br />For example, our four heroes are (understandably) paranoid about catching the mysterious airborne contagion that's wiped out virtually all of mankind. Yet they wear surgical masks some times, not others. Some times they're fanatical about wiping down with bleach any area touched by an infected person. 
Other times, they seem completely unconcerned.<br /><br />Worse, after apparently surviving some weeks or months in this new kill-or-be-killed world, these people constantly behave like total newbs. They don't bother accumulating proper equipment, or food. They're forever running out of fuel in the middle of nowhere. They don't take elementary precautions when meeting strangers. And after wading through the rotting corpses of the entire human race, they're as squeamish as sheltered debutantes. You have to constantly wonder how they could have survived this long... and even if they did, why anyone would want to make a movie about them.<br /><br />So when these dweebs stop to agonize over the moral dimensions of their actions, it's impossible to take their soul-searching seriously. Their actions would first have to make some kind of minimal sense.<br /><br />On top of all this, we must contend with the dubious acting abilities of Chris Pine. His portrayal of an arrogant young James T Kirk might have seemed shrewd, when viewed in isolation. But in Carriers he plays on exactly that same note: arrogant and boneheaded. It's impossible not to suspect that this constitutes his entire dramatic range.<br /><br />On the positive side, the film *looks* excellent. It's got an over-sharp, saturated look that really suits the southwestern US locale. But that can't save the truly feeble writing nor the paper-thin (and annoying) characters. Even if you're a fan of the end-of-the-world genre, you should save yourself the agony of watching Carriers." 0 b'I saw this movie at an actual movie theater (probably the \\(2.00 one) with my cousin and uncle. We were around 11 and 12, I guess, and really into scary movies. I remember being so excited to see it because my cool uncle let us pick the movie (and we probably never got to do that again!) and sooo disappointed afterwards!! Just boring and not scary. The only redeeming thing I can remember was Corky Pigeon from Silver Spoons, and that wasn\'t all that great, just someone I recognized. I\'ve seen bad movies before and this one has always stuck out in my mind as the worst. This was from what I can recall, one of the most boring, non-scary, waste of our collective \\)6, and a waste of film. I have read some of the reviews that say it is worth a watch and I say, "Too each his own", but I wouldn\'t even bother. Not even so bad it\'s good.'
Configure the dataset for performance
These are two important methods you should use when loading data to make sure that I/O does not become blocking.
.cache() keeps data in memory after it is loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
.prefetch() overlaps data preprocessing and model execution while training.
You can learn more about both methods, as well as how to cache data to disk, in the data performance guide.
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
Using the Embedding layer
Keras makes it easy to use word embeddings. Take a look at the Embedding layer.
The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.
# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)
When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).
If you pass an integer tensor to an embedding layer, the result replaces each integer with the corresponding vector from the embedding table:
result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()
array([[ 0.03135875, 0.03640932, -0.00031054, 0.04873694, -0.03376802], [ 0.00243857, -0.02919209, -0.01841091, -0.03684188, 0.02765827], [-0.01245669, -0.01057661, -0.04422194, -0.0317696 , -0.00031216]], dtype=float32)
For text or sequence problems, the Embedding layer takes a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable length. You could feed the embedding layer above batches with shapes (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15).
The returned tensor has one more axis than the input, and the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N).
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
TensorShape([2, 3, 5])
When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor of shape (samples, sequence_length, embedding_dimensionality). There are a variety of standard approaches to convert from this sequence of variable length to a fixed representation: you could use an RNN, Attention, or a pooling layer before passing it to a Dense layer. This tutorial uses pooling because it is the simplest. The Text classification with an RNN tutorial is a good next step.
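As a quick illustration (not part of the original notebook), GlobalAveragePooling1D collapses the sequence axis of the embedding output above into one fixed-length vector per example:
pooling_layer = tf.keras.layers.GlobalAveragePooling1D()
embedded = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))  # shape (2, 3, 5)
pooled = pooling_layer(embedded)  # average over the sequence axis
print(pooled.shape)  # (2, 5) -- one 5-dimensional vector per example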
Text preprocessing
Next, define the text preprocessing steps required for your sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more about using this layer in the Text classification tutorial.
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
lowercase = tf.strings.lower(input_data)
stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
return tf.strings.regex_replace(stripped_html,
'[%s]' % re.escape(string.punctuation), '')
# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100
# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set the maximum sequence length, since the samples are not all the same length.
vectorize_layer = TextVectorization(
standardize=custom_standardization,
max_tokens=vocab_size,
output_mode='int',
output_sequence_length=sequence_length)
# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
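As a quick sanity check (not part of the original notebook), you can apply the adapted layer to a made-up review to see how strings are standardized and mapped to a padded sequence of token indices:
sample_review = tf.constant(["This movie was fantastic!<br />Loved every minute."])
# The output is a batch of integer sequences padded/truncated to sequence_length.
print(vectorize_layer(sample_review)[0, :10].numpy())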
Create a classification model
Use the Keras Sequential API to define the sentiment classification model. In this case it is a "Continuous bag of words" style model.
The TextVectorization layer transforms strings into vocabulary indices. You have already initialized vectorize_layer as a TextVectorization layer and built its vocabulary by calling adapt on text_ds. Now vectorize_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.
The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. The vectors add a dimension to the output array; the resulting dimensions are (batch, sequence, embedding).
The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length in the simplest way possible.
The fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
The last layer is densely connected with a single output node.
embedding_dim=16
model = Sequential([
vectorize_layer,
Embedding(vocab_size, embedding_dim, name="embedding"),
GlobalAveragePooling1D(),
Dense(16, activation='relu'),
Dense(1)
])
Compile and train the model
You will use TensorBoard to visualize metrics including loss and accuracy. Create a tf.keras.callbacks.TensorBoard callback.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
Compile and train the model using the Adam optimizer and BinaryCrossentropy loss.
model.compile(optimizer='adam',
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
model.fit(
train_ds,
validation_data=val_ds,
epochs=15,
callbacks=[tensorboard_callback])
Epoch 1/15 20/20 [==============================] - 7s 206ms/step - loss: 0.6920 - accuracy: 0.5028 - val_loss: 0.6904 - val_accuracy: 0.4886 Epoch 2/15 20/20 [==============================] - 1s 50ms/step - loss: 0.6879 - accuracy: 0.5028 - val_loss: 0.6856 - val_accuracy: 0.4886 Epoch 3/15 20/20 [==============================] - 1s 48ms/step - loss: 0.6815 - accuracy: 0.5028 - val_loss: 0.6781 - val_accuracy: 0.4886 Epoch 4/15 20/20 [==============================] - 1s 51ms/step - loss: 0.6713 - accuracy: 0.5028 - val_loss: 0.6663 - val_accuracy: 0.4886 Epoch 5/15 20/20 [==============================] - 1s 49ms/step - loss: 0.6566 - accuracy: 0.5028 - val_loss: 0.6506 - val_accuracy: 0.4886 Epoch 6/15 20/20 [==============================] - 1s 49ms/step - loss: 0.6377 - accuracy: 0.5028 - val_loss: 0.6313 - val_accuracy: 0.4886 Epoch 7/15 20/20 [==============================] - 1s 49ms/step - loss: 0.6148 - accuracy: 0.5057 - val_loss: 0.6090 - val_accuracy: 0.5068 Epoch 8/15 20/20 [==============================] - 1s 48ms/step - loss: 0.5886 - accuracy: 0.5724 - val_loss: 0.5846 - val_accuracy: 0.5864 Epoch 9/15 20/20 [==============================] - 1s 48ms/step - loss: 0.5604 - accuracy: 0.6427 - val_loss: 0.5596 - val_accuracy: 0.6368 Epoch 10/15 20/20 [==============================] - 1s 49ms/step - loss: 0.5316 - accuracy: 0.6967 - val_loss: 0.5351 - val_accuracy: 0.6758 Epoch 11/15 20/20 [==============================] - 1s 50ms/step - loss: 0.5032 - accuracy: 0.7372 - val_loss: 0.5121 - val_accuracy: 0.7102 Epoch 12/15 20/20 [==============================] - 1s 48ms/step - loss: 0.4764 - accuracy: 0.7646 - val_loss: 0.4912 - val_accuracy: 0.7344 Epoch 13/15 20/20 [==============================] - 1s 48ms/step - loss: 0.4516 - accuracy: 0.7858 - val_loss: 0.4727 - val_accuracy: 0.7492 Epoch 14/15 20/20 [==============================] - 1s 48ms/step - loss: 0.4290 - accuracy: 0.8029 - val_loss: 0.4567 - val_accuracy: 0.7584 Epoch 15/15 20/20 [==============================] - 1s 49ms/step - loss: 0.4085 - accuracy: 0.8163 - val_loss: 0.4429 - val_accuracy: 0.7674 <keras.callbacks.History at 0x7f5095954190>
With this approach the model reaches a validation accuracy of around 78% (note that the model is overfitting, since the training accuracy is higher).
You can look into the model summary to learn more about each layer of the model.
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= text_vectorization (TextVec (None, 100) 0 torization) embedding (Embedding) (None, 100, 16) 160000 global_average_pooling1d (G (None, 16) 0 lobalAveragePooling1D) dense (Dense) (None, 16) 272 dense_1 (Dense) (None, 1) 17 ================================================================= Total params: 160,289 Trainable params: 160,289 Non-trainable params: 0 _________________________________________________________________
Visualize the model metrics in TensorBoard.
#docs_infra: no_execute
%load_ext tensorboard
%tensorboard --logdir logs
Retrieve the trained word embeddings and save them to disk
Next, retrieve the word embeddings learned during training. The embeddings are the weights of the Embedding layer in the model. The weights matrix has shape (vocab_size, embedding_dimension).
Obtain the weights from the model using get_layer() and get_weights(). The get_vocabulary() function provides the vocabulary, which you will use to build a metadata file with one token per line.
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()
Write the weights to disk. To use the Embedding Projector, you will upload two files in tab-separated format: a file of vectors (containing the embeddings) and a file of metadata (containing the words).
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')
for index, word in enumerate(vocab):
if index == 0:
continue # skip 0, it's padding.
vec = weights[index]
out_v.write('\t'.join([str(x) for x in vec]) + "\n")
out_m.write(word + "\n")
out_v.close()
out_m.close()
If you are running this tutorial in Colaboratory, you can use the following snippet to download these files to your local machine (or use the file browser: View -> Table of contents -> File browser).
try:
from google.colab import files
files.download('vectors.tsv')
files.download('metadata.tsv')
except Exception:
pass
Visualize the embeddings
To visualize the embeddings, upload them to the Embedding Projector.
Open the Embedding Projector (this can also run in a local TensorBoard instance).
Click on "Load data".
Upload the two files you created above: vectors.tsv and metadata.tsv.
The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful"; you may see neighbors like "wonderful".
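If you prefer to inspect neighbors programmatically instead, here is a minimal sketch (not part of the original notebook) that ranks words by cosine similarity using the weights and vocab retrieved above:
import numpy as np

def nearest_neighbors(word, k=5):
    # Normalize the embedding matrix so dot products equal cosine similarities.
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    normalized = weights / (norms + 1e-9)
    query = normalized[vocab.index(word)]
    scores = normalized @ query
    top = np.argsort(-scores)[1:k + 1]  # skip the first hit (the word itself)
    return [vocab[i] for i in top]

print(nearest_neighbors('beautiful'))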
Next steps
This tutorial has shown you how to train and visualize word embeddings from scratch on a small dataset.
To train word embeddings using the Word2Vec algorithm, try the Word2Vec tutorial.
To learn more about advanced text processing, read the Transformer model for language understanding.