Multilingual Universal Sentence Encoder Q&A Retrieval


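This is a demo of using the Universal Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library and used for question-answer retrieval.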

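On retrieval, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is then used to query the simpleneighbors index, which returns a list of nearest neighbors in semantic space.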

More models

You can find all currently hosted text embedding models here, and all models that have been trained on SQuAD here.

Setup

Set up the environment

Set up common imports and functions

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

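Run the following code block to download and extract the SQuAD dataset into:

  • sentences is a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences using the nltk library, and the sentence and paragraph text together form the (text, context) tuple.
  • questions is a list of (question, answer) tuples.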
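
The two lists described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: it assumes the SQuAD v1.1 JSON layout (data, paragraphs, context, qas, answers), works on a tiny inline sample instead of the downloaded file, and replaces the nltk sentence tokenizer with a naive regex so that it runs without extra packages. The names sample_squad, naive_sentence_split and extract_sentences_and_questions are hypothetical.

```python
import re

# Tiny inline sample mimicking the SQuAD v1.1 JSON layout (illustrative data only).
sample_squad = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Oxygen is a gas. It is element number eight.",
                    "qas": [
                        {
                            "question": "What is oxygen?",
                            "answers": [{"text": "a gas"}],
                        }
                    ],
                }
            ]
        }
    ]
}


def naive_sentence_split(text):
    """Stand-in for the nltk tokenizer: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extract_sentences_and_questions(squad):
    """Return (sentences, questions) as lists of (text, context) and (question, answer)."""
    sentences = []
    questions = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            # Pair every sentence with the full paragraph it came from.
            for sentence in naive_sentence_split(context):
                sentences.append((sentence, context))
            # Pair every question with each of its reference answers.
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    questions.append((qa["question"], answer["text"]))
    # Drop exact duplicates while keeping the original order.
    return list(dict.fromkeys(sentences)), list(dict.fromkeys(questions))


sentences, questions = extract_sentences_and_questions(sample_squad)
print(len(sentences), "sentences,", len(questions), "questions")
```

Keeping the whole paragraph next to each sentence is what lets the response encoder see both the candidate answer and its surrounding text.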

Download and extract SQuAD data

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('Oxygen gas is increasingly obtained by these non-cryogenic technologies (see '
 'also the related vacuum swing adsorption).')

context:

('The other major method of producing O\n'
 '2 gas involves passing a stream of clean, dry air through one bed of a pair '
 'of identical zeolite molecular sieves, which absorbs the nitrogen and '
 'delivers a gas stream that is 90% to 93% O\n'
 '2. Simultaneously, nitrogen gas is released from the other '
 'nitrogen-saturated zeolite bed, by reducing the chamber operating pressure '
 'and diverting part of the oxygen gas from the producer bed through it, in '
 'the reverse direction of flow. After a set cycle time the operation of the '
 'two beds is interchanged, thereby allowing for a continuous supply of '
 'gaseous oxygen to be pumped through a pipeline. This is known as pressure '
 'swing adsorption. Oxygen gas is increasingly obtained by these non-cryogenic '
 'technologies (see also the related vacuum swing adsorption).')

The following code block sets up the tensorflow graph g and session with the Universal Encoder Multilingual Q&A model's question_encoder and response_encoder signatures.

Load model from tensorflow hub


The following code block computes the embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.
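
The batching pattern behind this step can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: stub_response_encoder is a hypothetical stand-in that returns made-up 2-dimensional vectors instead of calling the real model, a plain Python list takes the place of the simpleneighbors index, and the batch size is arbitrary.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, including a final partial batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def stub_response_encoder(texts, contexts):
    """Hypothetical stand-in for response_encoder: one toy 2-d vector per input pair."""
    return [[float(len(t)), float(len(c))] for t, c in zip(texts, contexts)]


def embed_all(sentences, batch_size):
    """Encode (text, context) tuples batch by batch and collect (text, vector) pairs."""
    index_items = []
    for batch in batched(sentences, batch_size):
        texts = [text for text, _ in batch]
        contexts = [context for _, context in batch]
        vectors = stub_response_encoder(texts, contexts)
        # In the real workflow each (text, vector) pair would be added to the index here.
        index_items.extend(zip(texts, vectors))
    return index_items


toy_sentences = [("ab", "ab cd"), ("cd", "ab cd"), ("xyz", "xyz")]
index_items = embed_all(toy_sentences, batch_size=2)
print(len(index_items), "embeddings computed")
```

Encoding in batches keeps memory use bounded while still amortizing the per-call overhead of the model. This sketch also encodes the final partial batch so that no sentence is left out of the index.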

Compute embeddings and build simpleneighbors index

Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.

On retrieval, the question is encoded using the question_encoder, and the question embedding is used to query the simpleneighbors index.
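
What such a query amounts to can be sketched with a brute-force search. This is a simplified stand-in, not how simpleneighbors works internally: the vectors are hand-written toy values rather than model output, ranking by cosine similarity is an assumption made for illustration, and the names toy_index, cosine_similarity and nearest are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(index_items, query_vector, n=2):
    """Return the texts of the n stored items most similar to the query vector."""
    ranked = sorted(
        index_items,
        key=lambda item: cosine_similarity(item[1], query_vector),
        reverse=True,
    )
    return [text for text, _ in ranked[:n]]


# Toy (text, vector) pairs standing in for sentence embeddings.
toy_index = [
    ("sentence about oxygen", [1.0, 0.0]),
    ("sentence about football", [0.0, 1.0]),
    ("sentence about gases", [0.9, 0.1]),
]

# Toy vector standing in for an encoded question.
toy_query = [1.0, 0.0]
print(nearest(toy_index, toy_query, n=2))
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of items. A dedicated index exists to make the same lookup fast for thousands of sentences.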

Retrieve nearest neighbors for a random question from SQuAD