Multilingual Universal Sentence Encoder Q&A Retrieval

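This is a demo of text question-answer retrieval with the Universal Sentence Encoder Multilingual Q&A model, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built with the simpleneighbors library for question-answer retrieval.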

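On retrieval, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. That embedding is then used to query the simpleneighbors index, which returns a list of approximate nearest neighbors in semantic space.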
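To make "nearest neighbors in semantic space" concrete, here is a minimal brute-force sketch in pure Python. It is not how simpleneighbors works internally (that library delegates to an approximate index); it only ranks a handful of made-up 3-dimensional vectors by cosine similarity. The names `cosine_similarity`, `nearest`, `toy_index` and `toy_query`, and all of the vector values, are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, corpus, n=2):
    """Return the n corpus keys most similar to the query, best first."""
    ranked = sorted(
        corpus,
        key=lambda key: cosine_similarity(query, corpus[key]),
        reverse=True)
    return ranked[:n]

# Hypothetical embeddings; the real model produces far more dimensions.
toy_index = {
    "sentence about oxygen": [0.9, 0.1, 0.0],
    "sentence about football": [0.0, 0.2, 0.9],
    "sentence about nitrogen": [0.7, 0.3, 0.1],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for an encoded question

top = nearest(toy_query, toy_index, n=2)
print(top)
```

A brute-force scan like this compares the query against every stored vector, which is why an index is preferable for a corpus of thousands of sentences.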

Setup

Setup Environment

%%capture
# Install tensorflow-text, pinned to the 2.11 series.
!pip install -q "tensorflow-text==2.11.*"
!pip install -q "simpleneighbors[annoy]"
!pip install -q nltk
!pip install -q tqdm

Setup common imports and functions

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request  # download_squad() calls urllib.request.urlopen
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
# Imported for its side effect: registers the SentencePiece ops the model uses.
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  # Uses the globals `model`, `index` and `num_results`, which later cells define.
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences:</p>
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

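Run the following code block to download the SQuAD dataset and extract it into two lists:

sentences is a list of (text, context) tuples. Each paragraph in the SQuAD dataset is split into sentences with the nltk library, and each sentence is paired with the full paragraph text to form a (text, context) tuple.
questions is a list of (question, answer) tuples.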
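The shape of these two lists can be illustrated with a tiny hand-written dictionary that mimics only the SQuAD fields the extraction functions read. This is a sketch under assumptions: `toy_squad` is invented data, and `naive_split` is a hypothetical regex stand-in for `nltk.tokenize.sent_tokenize` so that the example runs without downloading the punkt model.

```python
import re

# Invented data with only the keys the extraction code touches.
toy_squad = {
    "data": [{
        "paragraphs": [{
            "context": "Oxygen is a gas. It supports combustion.",
            "qas": [{
                "question": "What does oxygen support?",
                "answers": [{"text": "combustion"}],
            }],
        }],
    }],
}

def naive_split(text):
    """Hypothetical splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

toy_sentences = []
toy_questions = []
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        # Every sentence is paired with the whole paragraph it came from.
        toy_sentences.extend((s, context) for s in naive_split(context))
        for qa in paragraph["qas"]:
            if qa["answers"]:
                # Keep only the first listed answer, as the notebook does.
                toy_questions.append((qa["question"], qa["answers"][0]["text"]))

print(toy_sentences)
print(toy_questions)
```

Note that every tuple from the same paragraph carries an identical context string, so the response encoder sees the sentence together with its surrounding text.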

Download and extract SQuAD data

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('Oxygen gas is increasingly obtained by these non-cryogenic technologies (see '
 'also the related vacuum swing adsorption).')

context:

('The other major method of producing O\n'
 '2 gas involves passing a stream of clean, dry air through one bed of a pair '
 'of identical zeolite molecular sieves, which absorbs the nitrogen and '
 'delivers a gas stream that is 90% to 93% O\n'
 '2. Simultaneously, nitrogen gas is released from the other '
 'nitrogen-saturated zeolite bed, by reducing the chamber operating pressure '
 'and diverting part of the oxygen gas from the producer bed through it, in '
 'the reverse direction of flow. After a set cycle time the operation of the '
 'two beds is interchanged, thereby allowing for a continuous supply of '
 'gaseous oxygen to be pumped through a pipeline. This is known as pressure '
 'swing adsorption. Oxygen gas is increasingly obtained by these non-cryogenic '
 'technologies (see also the related vacuum swing adsorption).')

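The following code block loads the Universal Sentence Encoder Multilingual Q&A model from TensorFlow Hub. The loaded model exposes the question_encoder and response_encoder signatures used in the rest of this notebook.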

Load model from TensorFlow Hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

The following code block computes embeddings for all (text, context) tuples with the response_encoder and stores them in a simpleneighbors index.

Compute embeddings and build simpleneighbors index

batch_size = 100

# Encode a single example first to find out the embedding dimensionality.
encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
# Slice by start offset so that the final, smaller batch is encoded too.
for start in tqdm(range(0, len(sentences), batch_size)):
  batch = sentences[start:start + batch_size]
  response_batch = [r for r, c in batch]
  context_batch = [c for r, c in batch]
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  for batch_index, response in enumerate(response_batch):
    index.add_one(response, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
simpleneighbors index for 10455 sentences built.
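One detail worth checking when batching a list is whether a final batch smaller than the batch size is kept. The sketch below compares two common ways to batch, using a toy list of seven integers; `slice_batches` is a hypothetical helper name, not part of any library used here.

```python
def slice_batches(items, batch_size):
    """Yield consecutive slices; the last one may be shorter than batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_items = list(range(7))

# Slicing keeps the leftover element in a final short batch.
sliced = list(slice_batches(toy_items, 3))

# zip over one shared iterator stops as soon as a full group cannot be
# filled, so the leftover element is silently dropped.
zipped = [list(group) for group in zip(*(iter(toy_items),) * 3)]

print(sliced)  # [[0, 1, 2], [3, 4, 5], [6]]
print(zipped)  # [[0, 1, 2], [3, 4, 5]]
```

With 10455 sentences and a batch size of 100, the difference amounts to the last 55 sentences, which only the slicing approach would add to the index.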

On retrieval, the question is encoded with the question_encoder, and the resulting question embedding is used to query the simpleneighbors index.

Retrieve nearest neighbors for a random question from SQuAD

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])