Tokenizing with TF Text


Overview

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, and/or punctuation. The tensorflow_text package provides a number of tokenizers for preprocessing the text required by your text-based models. By performing the tokenization in the TensorFlow graph, you do not need to worry about differences between the training and inference workflows or about managing preprocessing scripts.

This guide discusses the many tokenization options provided by TensorFlow Text, when you might want to use one option over another, and how these tokenizers are called from within your model.

Setup

pip install -q "tensorflow-text==2.11.*"
import requests
import tensorflow as tf
import tensorflow_text as tf_text

Tokenizer API

The main interfaces are Splitter and SplitterWithOffsets, which have the single methods split and split_with_offsets respectively. The SplitterWithOffsets variant (which extends Splitter) includes an option for getting byte offsets. This allows the caller to know which bytes in the original string a created token came from.

Tokenizer and TokenizerWithOffsets are specialized versions of Splitter that provide the convenience methods tokenize and tokenize_with_offsets respectively.

Generally, for any N-dimensional input, the returned tokens are in an N+1-dimensional RaggedTensor, with the innermost dimension of tokens mapping to the original individual strings. (A short example of this shape behavior follows the interface listing below.)

class Splitter {
  @abstractmethod
  def split(self, input)
}

class SplitterWithOffsets(Splitter) {
  @abstractmethod
  def split_with_offsets(self, input)
}
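
For example, here is a minimal sketch (not part of the original guide) of the shape behavior described above, using the WhitespaceTokenizer introduced later in this guide: a rank-1 batch of strings yields a rank-2 RaggedTensor of tokens.

# A minimal sketch of the N -> N+1 shape behavior. `WhitespaceTokenizer`
# is described in the "Whole word tokenizers" section below.
tokenizer = tf_text.WhitespaceTokenizer()
docs = tf.constant(["a b c", "d e"])  # rank-1 input, shape [2]
tokens = tokenizer.tokenize(docs)     # rank-2 RaggedTensor of tokens
print(tokens.shape)                   # (2, None)
print(tokens.to_list())               # [[b'a', b'b', b'c'], [b'd', b'e']]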

There is also a Detokenizer interface. Any tokenizer implementing this interface can accept an N-dimensional RaggedTensor of tokens and normally returns an N-1-dimensional tensor or RaggedTensor that has the given tokens assembled together.

class Detokenizer {
  @abstractmethod
  def detokenize(self, input)
}

Tokenizers

Below is the suite of tokenizers provided by TensorFlow Text. String inputs are assumed to be UTF-8. Please check the Unicode guide for converting strings to UTF-8.
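
If your input text is not already UTF-8, you can convert it before tokenizing. As a minimal, hedged sketch (the Unicode guide covers this in more detail), tf.strings.unicode_transcode can transcode another encoding into UTF-8:

# Hedged sketch: transcode UTF-16-BE bytes into UTF-8 before tokenizing.
utf16_docs = tf.constant([b'\x00h\x00i'])  # "hi" encoded as UTF-16-BE
utf8_docs = tf.strings.unicode_transcode(
    utf16_docs, input_encoding='UTF-16-BE', output_encoding='UTF-8')
print(utf8_docs.numpy())  # [b'hi']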

Whole word tokenizers

These tokenizers attempt to split a string by words, which is the most intuitive way to split text.

WhitespaceTokenizer

text.WhitespaceTokenizer is the most basic tokenizer, which splits strings on ICU-defined whitespace characters (e.g. space, tab, newline). This is often good for quickly building prototype models.

tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]

You may notice a shortcoming of this tokenizer: punctuation is included with the word to make up a token. To split the words and punctuation into separate tokens, the UnicodeScriptTokenizer should be used.

UnicodeScriptTokenizer

The UnicodeScriptTokenizer splits strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values. See: http://icu-project.org/apiref/icu4c/uscript_8h.html

In practice, this is similar to the WhitespaceTokenizer, with the most apparent difference being that it will split punctuation (USCRIPT_COMMON) from language texts (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC, etc.) while also separating language texts from each other. Note that this will also split contraction words into separate tokens.

tokenizer = tf_text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b'can', b"'", b't', b'explain', b',', b'but', b'you', b'feel', b'it', b'.']]
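
As a rough, assumed illustration (not from the original guide) of the script-boundary behavior: text that switches scripts with no intervening whitespace is still split at the script change.

# Assumed illustration: Latin followed directly by Cyrillic is split at the
# script boundary even though there is no whitespace between the two words.
tokens = tokenizer.tokenize(["helloпривет"])
print(tokens.to_list())  # expect two tokens: the Latin part and the Cyrillic part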

Subword tokenizers

Subword tokenizers can be used with a smaller vocabulary, and allow the model to have some information about novel words from the subwords that make them up.

We briefly discuss the subword tokenization options below, but the Subword tokenization tutorial goes more in depth and also explains how to generate the vocab files.
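
As a rough sketch of what that tutorial covers (assuming the bert_vocab_from_dataset helper shipped under tensorflow_text.tools, as used in that tutorial), generating a WordPiece vocabulary from a text dataset looks roughly like this:

# Hedged sketch of WordPiece vocabulary generation; see the subword
# tokenization tutorial for the full, authoritative walkthrough.
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

dataset = tf.data.Dataset.from_tensor_slices(
    ["What you know you can't explain, but you feel it."])
vocab = bert_vocab.bert_vocab_from_dataset(
    dataset.batch(1000).prefetch(2),
    vocab_size=8000,
    reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"],
    bert_tokenizer_params=dict(lower_case=True),
    learn_params={},
)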

WordpieceTokenizer

WordPiece tokenization is a data-driven tokenization scheme which generates a set of sub-tokens. These sub-tokens may correspond to linguistic morphemes, but this is often not the case.

The WordpieceTokenizer expects the input to already be split into tokens. Because of this prerequisite, you will often want to split using the WhitespaceTokenizer or UnicodeScriptTokenizer beforehand.

tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]

After the string is split into tokens, the WordpieceTokenizer can be used to split it into subtokens.

url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_wp_en_vocab.txt?raw=true"
r = requests.get(url)
filepath = "vocab.txt"
open(filepath, 'wb').write(r.content)
52382
subtokenizer = tf_text.WordpieceTokenizer(filepath)
subtokens = subtokenizer.tokenize(tokens)
print(subtokens.to_list())

BertTokenizer

The BertTokenizer mirrors the original implementation of tokenization from the BERT paper. It is backed by the WordpieceTokenizer, but also performs additional tasks such as normalization and tokenizing to words first.

tokenizer = tf_text.BertTokenizer(filepath, token_out_type=tf.string, lower_case=True)
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[[b'what'], [b'you'], [b'know'], [b'you'], [b'can'], [b"'"], [b't'], [b'explain'], [b','], [b'but'], [b'you'], [b'feel'], [b'it'], [b'.']]]
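
In practice a model usually consumes vocabulary ids rather than strings. As a hedged variant of the example above, passing token_out_type=tf.int64 returns the indices of the subtokens in the vocab file instead of the byte strings:

# Hedged sketch: the same tokenizer configured to emit vocab ids (int64)
# instead of string subtokens.
id_tokenizer = tf_text.BertTokenizer(filepath, token_out_type=tf.int64, lower_case=True)
id_tokens = id_tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(id_tokens.to_list())  # nested lists of vocab ids rather than byte strings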

SentencepieceTokenizer

The SentencepieceTokenizer is a highly configurable sub-token tokenizer. It is backed by the Sentencepiece library. Like the BertTokenizer, it can include normalization and token splitting before splitting into sub-tokens.

url = "https://github.com/tensorflow/text/blob/master/tensorflow_text/python/ops/test_data/test_oss_model.model?raw=true"
sp_model = requests.get(url).content
tokenizer = tf_text.SentencepieceTokenizer(sp_model, out_type=tf.string)
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'\xe2\x96\x81What', b'\xe2\x96\x81you', b'\xe2\x96\x81know', b'\xe2\x96\x81you', b'\xe2\x96\x81can', b"'", b't', b'\xe2\x96\x81explain', b',', b'\xe2\x96\x81but', b'\xe2\x96\x81you', b'\xe2\x96\x81feel', b'\xe2\x96\x81it', b'.']]
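
SentencepieceTokenizer also implements the Detokenizer interface (see the Detokenization section below). As a short sketch, tokenizing to integer ids with the same model and detokenizing round-trips the text:

# Sketch: tokenize to integer ids and detokenize them back into strings
# with the same Sentencepiece model.
sp_id_tokenizer = tf_text.SentencepieceTokenizer(sp_model, out_type=tf.int32)
ids = sp_id_tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(sp_id_tokenizer.detokenize(ids).numpy())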

Other tokenizers

UnicodeCharTokenizer

This tokenizer splits a string into UTF-8 characters. It is useful for CJK languages that do not have spaces between words.

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 102, 101, 101, 108, 32, 105, 116, 46]]

The output is Unicode codepoints. This can also be useful for creating character ngrams, such as bigrams. To convert back into UTF-8 characters, use tf.strings.unicode_encode as in the example below.

characters = tf.strings.unicode_encode(tf.expand_dims(tokens, -1), "UTF-8")
bigrams = tf_text.ngrams(characters, 2, reduction_type=tf_text.Reduction.STRING_JOIN, string_separator='')
print(bigrams.to_list())
[[b'Wh', b'ha', b'at', b't ', b' y', b'yo', b'ou', b'u ', b' k', b'kn', b'no', b'ow', b'w ', b' y', b'yo', b'ou', b'u ', b' c', b'ca', b'an', b"n'", b"'t", b't ', b' e', b'ex', b'xp', b'pl', b'la', b'ai', b'in', b'n,', b', ', b' b', b'bu', b'ut', b't ', b' y', b'yo', b'ou', b'u ', b' f', b'fe', b'ee', b'el', b'l ', b' i', b'it', b't.']]

HubModuleTokenizer

This is a wrapper around models deployed to TF Hub to make the calls easier, since TF Hub currently does not support ragged tensors. Having a model perform tokenization is particularly useful for CJK languages when you want to split text into words but do not have spaces to provide a heuristic guide. At this time, we have a single segmentation model for Chinese.

MODEL_HANDLE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = tf_text.HubModuleTokenizer(MODEL_HANDLE)
tokens = segmenter.tokenize(["新华社北京"])
print(tokens.to_list())
[[b'\xe6\x96\xb0\xe5\x8d\x8e\xe7\xa4\xbe', b'\xe5\x8c\x97\xe4\xba\xac']]

It may be difficult to view the results of UTF-8 encoded byte strings. Decode the list values to make viewing easier.

def decode_list(x):
  if type(x) is list:
    return list(map(decode_list, x))
  return x.decode("UTF-8")

def decode_utf8_tensor(x):
  return list(map(decode_list, x.to_list()))

print(decode_utf8_tensor(tokens))
[['新华社', '北京']]

SplitMergeTokenizer

SplitMergeTokenizerSplitMergeFromLogitsTokenizer 的目标是根据提供的指示字符串应在何处拆分的数值来拆分字符串。这在构建您自己的分割模型(如之前的分割示例)时非常有用。

For the SplitMergeTokenizer, a value of 0 is used to indicate the start of a new string, and a value of 1 indicates the character is part of the current string.

strings = ["新华社北京"]
labels = [[0, 1, 1, 0, 1]]
tokenizer = tf_text.SplitMergeTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]

The SplitMergeFromLogitsTokenizer is similar, but instead accepts logit value pairs from a neural network that predict whether each character should be split into a new string or merged into the current one.

strings = ["新华社北京"]
labels = [[[5.0, -3.2], [0.2, 12.0], [0.0, 11.0], [2.2, -1.0], [-3.0, 3.0]]]
tokenizer = tf_text.SplitMergeFromLogitsTokenizer()
tokens = tokenizer.tokenize(strings, labels)
print(decode_utf8_tensor(tokens))
[['新华社', '北京']]

RegexSplitter

RegexSplitter is able to segment strings at arbitrary breakpoints defined by a provided regular expression.

splitter = tf_text.RegexSplitter(r"\s")
tokens = splitter.split(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
[[b'What', b'you', b'know', b'you', b"can't", b'explain,', b'but', b'you', b'feel', b'it.']]
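
Because the breakpoints are arbitrary, the same splitter can be configured for other units. An assumed illustration (not from the original guide): splitting on a period followed by optional whitespace yields rough sentence-like pieces instead of words.

# Assumed illustration: split on a period plus trailing whitespace to get
# sentence-like pieces rather than words.
sentence_splitter = tf_text.RegexSplitter(r"\.\s*")
pieces = sentence_splitter.split(["Never tell me the odds. It's a trap!"])
print(pieces.to_list())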

Offsets

When tokenizing strings, it is often desired to know where in the original string the token originated from. For this reason, each tokenizer which implements TokenizerWithOffsets has a tokenize_with_offsets method that will return the byte offsets along with the tokens. start_offsets lists the byte in the original string where each token starts, and end_offsets lists the byte immediately after the point where each token ends. In other words, the start offsets are inclusive and the end offsets are exclusive.

tokenizer = tf_text.UnicodeScriptTokenizer()
(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(['Everything not saved will be lost.'])
print(tokens.to_list())
print(start_offsets.to_list())
print(end_offsets.to_list())
[[b'Everything', b'not', b'saved', b'will', b'be', b'lost', b'.']]
[[0, 11, 15, 21, 26, 29, 33]]
[[10, 14, 20, 25, 28, 33, 34]]
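
A quick sketch (not from the original guide) of how the offsets can be used: slicing the original byte string with each (start, end) pair recovers the corresponding token.

# Sketch: recover each token by slicing the original byte string with the
# inclusive start offset and the exclusive end offset.
sentence = b'Everything not saved will be lost.'
for start, end in zip(start_offsets.to_list()[0], end_offsets.to_list()[0]):
  print(sentence[start:end])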

Detokenization

Tokenizers which implement the Detokenizer interface provide a detokenize method which attempts to combine the strings. This has the chance of being lossy, so the detokenized string may not always match exactly the original, pre-tokenized string.

tokenizer = tf_text.UnicodeCharTokenizer()
tokens = tokenizer.tokenize(["What you know you can't explain, but you feel it."])
print(tokens.to_list())
strings = tokenizer.detokenize(tokens)
print(strings.numpy())
[[87, 104, 97, 116, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 121, 111, 117, 32, 99, 97, 110, 39, 116, 32, 101, 120, 112, 108, 97, 105, 110, 44, 32, 98, 117, 116, 32, 121, 111, 117, 32, 102, 101, 101, 108, 32, 105, 116, 46]]
[b"What you know you can't explain, but you feel it."]

TF Data

TF Data is a powerful API for creating an input pipeline for training models. Tokenizers work as expected with the API.

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = tf_text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())
[[b'Never', b'tell', b'me', b'the', b'odds.']]
[[b"It's", b'a', b'trap!']]