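This notebook illustrates how to access the Multilingual Universal Sentence Encoder module and use it for sentence similarity across multiple languages. This module is an extension of the original Universal Encoder module.
The notebook is divided as follows:
- The first section shows a visualization of sentences between pairs of languages. This is a more academic exercise.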
- In the second section, we show how to build a semantic search engine from a sample of the News Commentary corpus in multiple languages.
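Citation
Research papers that make use of the models explored in this Colab should cite:
Multilingual Universal Sentence Encoder for Semantic Retrieval
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. arXiv preprint arXiv:1907.04307
Setup
This section sets up the environment for access to the Multilingual Universal Sentence Encoder module and prepares a set of English sentences and their translations. In the following sections, the multilingual module is used to compute similarity across languages.
Setup Environment
Setup common imports and functions
This is additional boilerplate code where we import the pre-trained ML model that is used to encode text throughout this notebook.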
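import tensorflow_hub as hub
import tensorflow_text  # Not used directly; importing it registers the text ops the module needs.

# The 16-language multilingual module is the default but feel free
# to pick others from the list and compare the results.
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3'
model = hub.load(module_url)

def embed_text(input):
  # Encode a string or a list of strings into sentence embeddings.
  return model(input)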
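Visualize Text Similarity Between Languages
With the sentence embeddings now in hand, we can visualize semantic similarity across different languages.
Computing Text Embeddings
We first define a set of sentences translated to various languages in parallel. Then, we precompute the embeddings for all of our sentences.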
# Some texts of different lengths in different languages.
arabic_sentences = ['كلب', 'الجراء لطيفة.', 'أستمتع بالمشي لمسافات طويلة على طول الشاطئ مع كلبي.']
chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']
english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']
french_sentences = ['chien', 'Les chiots sont gentils.', 'J\'aime faire de longues promenades sur la plage avec mon chien.']
german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']
italian_sentences = ['cane', 'I cuccioli sono carini.', 'Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.']
japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']
korean_sentences = ['개', '강아지가 좋다.', '나는 나의 개와 해변을 따라 길게 산책하는 것을 즐긴다.']
russian_sentences = ['собака', 'Милые щенки.', 'Мне нравится подолгу гулять по пляжу со своей собакой.']
spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']
# Multilingual example
multilingual_example = ["Willkommen zu einfachen, aber", "verrassend krachtige", "multilingüe", "compréhension du langage naturel", "модели.", "大家是什么意思" , "보다 중요한", ".اللغة التي يتحدثونها"]
multilingual_example_in_en = ["Welcome to simple yet", "surprisingly powerful", "multilingual", "natural language understanding", "models.", "What people mean", "matters more than", "the language they speak."]
# Compute embeddings.
ar_result = embed_text(arabic_sentences)
en_result = embed_text(english_sentences)
es_result = embed_text(spanish_sentences)
de_result = embed_text(german_sentences)
fr_result = embed_text(french_sentences)
it_result = embed_text(italian_sentences)
ja_result = embed_text(japanese_sentences)
ko_result = embed_text(korean_sentences)
ru_result = embed_text(russian_sentences)
zh_result = embed_text(chinese_sentences)
multilingual_result = embed_text(multilingual_example)
multilingual_in_en_result = embed_text(multilingual_example_in_en)
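Visualizing Similarity
With the text embeddings in hand, we can take their dot product to visualize how similar sentences are between languages. A darker color indicates that the embeddings are semantically more similar.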
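As a rough illustration of the dot-product step described above, the following sketch computes a pairwise similarity matrix. It is not the notebook's `visualize_similarity` helper, whose definition is not shown here; `dot`, `similarity_matrix`, and the toy vectors are hypothetical names and data, and plain Python lists stand in for the real embeddings so that the sketch runs without any dependencies.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(embeddings_1, embeddings_2):
    """Pairwise dot products: result[i][j] = embeddings_1[i] . embeddings_2[j]."""
    return [[dot(u, v) for v in embeddings_2] for u in embeddings_1]


# Hypothetical 3-dimensional unit vectors standing in for sentence embeddings.
toy_en = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_es = [[0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]

# Each row compares one "English" vector with every "Spanish" vector.
for row in similarity_matrix(toy_en, toy_es):
    print(row)
```

A heatmap of such a matrix, with one cell per sentence pair, is the kind of picture the sections below produce; larger entries correspond to darker cells.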
Multilingual Similarity
visualize_similarity(multilingual_in_en_result, multilingual_result,
                     multilingual_example_in_en, multilingual_example,
                     "Multilingual Universal Sentence Encoder for Semantic Retrieval (Yang et al., 2019)")
English-Arabic Similarity
visualize_similarity(en_result, ar_result, english_sentences, arabic_sentences, 'English-Arabic Similarity')
English-Russian Similarity
visualize_similarity(en_result, ru_result, english_sentences, russian_sentences, 'English-Russian Similarity')
English-Spanish Similarity
visualize_similarity(en_result, es_result, english_sentences, spanish_sentences, 'English-Spanish Similarity')
English-Italian Similarity
visualize_similarity(en_result, it_result, english_sentences, italian_sentences, 'English-Italian Similarity')
Italian-Spanish Similarity
visualize_similarity(it_result, es_result, italian_sentences, spanish_sentences, 'Italian-Spanish Similarity')
English-Chinese Similarity
visualize_similarity(en_result, zh_result, english_sentences, chinese_sentences, 'English-Chinese Similarity')
English-Korean Similarity
visualize_similarity(en_result, ko_result, english_sentences, korean_sentences, 'English-Korean Similarity')
Chinese-Korean Similarity
visualize_similarity(zh_result, ko_result, chinese_sentences, korean_sentences, 'Chinese-Korean Similarity')
And more...
The above examples can be extended to any language pair from English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai and Turkish. Happy coding!
Creating a Multilingual Semantic-Similarity Search Engine
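Whereas in the previous example we visualized a handful of sentences, in this section we build a semantic-search index over sentences from the News Commentary corpus: 1,000 sentences in each of five languages (Arabic, Chinese, English, Russian, and Spanish), to demonstrate the multilingual capabilities of the Universal Sentence Encoder.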
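Download Data to Index
First, we download news sentences in multiple languages from the News Commentary corpus [1]. Without loss of generality, this approach should also work for indexing the rest of the supported languages.
To speed up the demo, we limit the index to 1,000 sentences per language.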
import os
import pandas as pd
import tensorflow as tf

# (language code, zip archive, file inside the archive, display name)
corpus_metadata = [
    ('ar', 'ar-en.txt.zip', 'News-Commentary.ar-en.ar', 'Arabic'),
    ('zh', 'en-zh.txt.zip', 'News-Commentary.en-zh.zh', 'Chinese'),
    ('en', 'en-es.txt.zip', 'News-Commentary.en-es.en', 'English'),
    ('ru', 'en-ru.txt.zip', 'News-Commentary.en-ru.ru', 'Russian'),
    ('es', 'en-es.txt.zip', 'News-Commentary.en-es.es', 'Spanish'),
]

language_to_sentences = {}
language_to_news_path = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  zip_path = tf.keras.utils.get_file(
      fname=zip_file,
      origin='http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/' + zip_file,
      extract=True)
  news_path = os.path.join(os.path.dirname(zip_path), news_file)
  # Keep only the first 1,000 sentences of each language.
  language_to_sentences[language_code] = pd.read_csv(news_path, sep='\t', header=None)[0][:1000]
  language_to_news_path[language_code] = news_path

  print('{:,} {} sentences'.format(len(language_to_sentences[language_code]), language_name))
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/ar-en.txt.zip
24715264/24714354 [==============================] - 2s 0us/step
1,000 Arabic sentences
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-zh.txt.zip
18104320/18101984 [==============================] - 2s 0us/step
1,000 Chinese sentences
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-es.txt.zip
28106752/28106064 [==============================] - 2s 0us/step
1,000 English sentences
Downloading data from http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/en-ru.txt.zip
24854528/24849511 [==============================] - 2s 0us/step
1,000 Russian sentences
1,000 Spanish sentences
Using a pre-trained model to transform sentences into vectors
We compute embeddings in batches so that they fit in the GPU's RAM.
import pandas as pd
from tqdm import tqdm

# Takes about 3 minutes

batch_size = 2048
language_to_embeddings = {}
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nComputing {} embeddings'.format(language_name))
  # Note: the file is re-read in full, so every sentence in it is embedded,
  # not only the first 1,000 that are indexed later.
  with tqdm(total=len(language_to_sentences[language_code])) as pbar:
    for batch in pd.read_csv(language_to_news_path[language_code], sep='\t', header=None, chunksize=batch_size):
      language_to_embeddings.setdefault(language_code, []).extend(embed_text(batch[0]))
      pbar.update(len(batch))
Computing Arabic embeddings
83178it [00:30, 2768.60it/s]
Computing Chinese embeddings
69206it [00:18, 3664.60it/s]
Computing English embeddings
238853it [00:37, 6319.00it/s]
Computing Russian embeddings
190092it [00:34, 5589.16it/s]
Computing Spanish embeddings
238819it [00:41, 5754.02it/s]
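The batching pattern used in this step can be reduced to a few lines. The sketch below is only an illustration under stated assumptions: `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for the real encoder so that the example runs without TensorFlow.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(sentences, embed_fn, batch_size):
    """Apply `embed_fn` to one batch at a time and concatenate the results."""
    embeddings = []
    for batch in iter_batches(sentences, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch):
    """Stand-in encoder: maps each sentence to a 1-dimensional 'embedding'."""
    return [[float(len(sentence))] for sentence in batch]


sentences = ["dog", "Puppies are nice.", "perro", "chien", "Hund"]
vectors = embed_in_batches(sentences, fake_embed, batch_size=2)
print(len(vectors))  # one vector per input sentence
```

Only one batch is handed to the encoder at a time, so peak memory depends on `batch_size` rather than on the size of the corpus.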
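Building an index of semantic vectors
We use the SimpleNeighbors library, which is a wrapper for the Annoy library, to efficiently look up results from the corpus.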
%%time
from simpleneighbors import SimpleNeighbors
from tqdm import trange

# Takes about 8 minutes

num_index_trees = 40
language_name_to_index = {}
embedding_dimensions = len(list(language_to_embeddings.values())[0][0])
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('\nAdding {} embeddings to index'.format(language_name))
  index = SimpleNeighbors(embedding_dimensions, metric='dot')

  # Add each sentence together with its embedding.
  for i in trange(len(language_to_sentences[language_code])):
    index.add_one(language_to_sentences[language_code][i], language_to_embeddings[language_code][i])

  print('Building {} index with {} trees...'.format(language_name, num_index_trees))
  index.build(n=num_index_trees)
  language_name_to_index[language_name] = index
Adding Arabic embeddings to index
100%|██████████| 1000/1000 [02:06<00:00, 7.90it/s]
Building Arabic index with 40 trees...
Adding Chinese embeddings to index
100%|██████████| 1000/1000 [02:05<00:00, 7.99it/s]
Building Chinese index with 40 trees...
Adding English embeddings to index
100%|██████████| 1000/1000 [02:07<00:00, 7.86it/s]
Building English index with 40 trees...
Adding Russian embeddings to index
100%|██████████| 1000/1000 [02:06<00:00, 7.91it/s]
Building Russian index with 40 trees...
Adding Spanish embeddings to index
100%|██████████| 1000/1000 [02:07<00:00, 7.84it/s]
Building Spanish index with 40 trees...
CPU times: user 11min 21s, sys: 2min 14s, total: 13min 35s
Wall time: 10min 33s
%%time
from simpleneighbors import SimpleNeighbors
from tqdm import trange

# Takes about 13 minutes

num_index_trees = 60
print('Computing mixed-language index')
combined_index = SimpleNeighbors(embedding_dimensions, metric='dot')
for language_code, zip_file, news_file, language_name in corpus_metadata:
  print('Adding {} embeddings to mixed-language index'.format(language_name))
  for i in trange(len(language_to_sentences[language_code])):
    # Prefix each sentence with its language so results can be told apart.
    annotated_sentence = '({}) {}'.format(language_name, language_to_sentences[language_code][i])
    combined_index.add_one(annotated_sentence, language_to_embeddings[language_code][i])

print('Building mixed-language index with {} trees...'.format(num_index_trees))
combined_index.build(n=num_index_trees)
Computing mixed-language index
Adding Arabic embeddings to mixed-language index
100%|██████████| 1000/1000 [02:06<00:00, 7.92it/s]
Adding Chinese embeddings to mixed-language index
100%|██████████| 1000/1000 [02:05<00:00, 7.95it/s]
Adding English embeddings to mixed-language index
100%|██████████| 1000/1000 [02:06<00:00, 7.88it/s]
Adding Russian embeddings to mixed-language index
100%|██████████| 1000/1000 [02:04<00:00, 8.03it/s]
Adding Spanish embeddings to mixed-language index
100%|██████████| 1000/1000 [02:06<00:00, 7.90it/s]
Building mixed-language index with 60 trees...
CPU times: user 11min 18s, sys: 2min 13s, total: 13min 32s
Wall time: 10min 30s
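Verify that the semantic-similarity search engine works
In this section we demonstrate:
- Semantic-search capabilities: retrieving sentences from the corpus that are semantically similar to the given query.
- Multilingual capabilities: doing so in multiple languages when the query language and the index language match.
- Cross-lingual capabilities: issuing queries in a language different from that of the indexed corpus.
- Mixed-language corpus: all of the above on a single index containing entries from all languages.
Semantic-search cross-lingual capabilities
In this section we show how to retrieve sentences related to a set of sample English sentences. Things to try:
- Try a few different sample sentences.
- Try changing the number of returned results (they are returned in order of similarity).
- Try cross-lingual capabilities by returning results in different languages (you might want to use Google Translate on some results to translate them into your native language for a sanity check).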
English sentences similar to: "The stock market fell four points."
['Nobel laureate Amartya Sen attributed the European crisis to four failures – political, economic, social, and intellectual.',
 'Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.',
 'His ratings have dipped below 50% for the first time.',
 'As a result, markets were deregulated, making it easier to trade assets that were perceived to be safe, but were in fact not.',
 'Consider the advanced economies.',
 'But the agreement has three major flaws.',
 'This “predetermined equilibrium” thinking – reflected in the view that markets always self-correct – led to policy paralysis until the Great Depression, when John Maynard Keynes’s argument for government intervention to address unemployment and output gaps gained traction.',
 'Officials underestimated tail risks.',
 'Consider a couple of notorious examples.',
 'Stalin was content to settle for an empire in Eastern Europe.']
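The lookup behind a result list like the one above can be sketched as a brute-force search: embed the query, score every indexed sentence by dot product, and return the best matches first. This is only an illustration of the idea, not the approximate search that the Annoy-backed index performs; `brute_force_nearest`, `toy_indexes`, and the 2-dimensional vectors are hypothetical.

```python
import heapq


def brute_force_nearest(index, query_embedding, n=10):
    """Return the n sentences with the largest dot product to the query, best first."""
    def score(entry):
        _, embedding = entry
        return sum(q * e for q, e in zip(query_embedding, embedding))

    best = heapq.nlargest(n, index, key=score)
    return [sentence for sentence, _ in best]


# Hypothetical per-language indexes: lists of (sentence, embedding) pairs.
toy_indexes = {
    "English": [
        ("Markets fell sharply.", [0.9, 0.1]),
        ("The beach was crowded.", [0.1, 0.9]),
        ("Investors sold shares.", [0.8, 0.3]),
    ],
    "Spanish": [
        ("Los mercados cayeron.", [0.9, 0.2]),
        ("Me gusta la playa.", [0.2, 0.9]),
    ],
}

query_embedding = [1.0, 0.0]  # stands in for the embedding of an English query
index_language = "Spanish"    # need not match the query language
print(brute_force_nearest(toy_indexes[index_language], query_embedding, n=1))
```

Because all languages share one embedding space, switching `index_language` is all it takes to turn a monolingual lookup into a cross-lingual one.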
Mixed-corpus capabilities
We now issue a query in English, but the results can come from any of the indexed languages.
English sentences similar to: "The stock market fell four points."
['Nobel laureate Amartya Sen attributed the European crisis to four failures – political, economic, social, and intellectual.',
 'It was part of the 1945 consensus.',
 'The end of the East-West ideological divide and the end of absolute faith in markets are historical turning points.',
 'Just last December, fellow economists Martin Feldstein and Nouriel Roubini each penned op-eds bravely questioning bullish market sentiment, sensibly pointing out gold’s risks.',
 'His ratings have dipped below 50% for the first time.',
 'As a result, markets were deregulated, making it easier to trade assets that were perceived to be safe, but were in fact not.',
 'Consider the advanced economies.',
 'Since their articles appeared, the price of gold has moved up still further.',
 'But the agreement has three major flaws.',
 'Gold prices even hit a record-high $1,300 recently.',
 'This “predetermined equilibrium” thinking – reflected in the view that markets always self-correct – led to policy paralysis until the Great Depression, when John Maynard Keynes’s argument for government intervention to address unemployment and output gaps gained traction.',
 'What Failed in 2008?',
 'Officials underestimated tail risks.',
 'Consider a couple of notorious examples.',
 'One of these species, orange roughy, has been caught commercially for only around a quarter-century, but already is being fished to the point of collapse.',
 'Meanwhile, policymakers were lulled into complacency by the widespread acceptance of economic theories such as the “efficient-market hypothesis,” which assumes that investors act rationally and use all available information when making their decisions.',
 'Stalin was content to settle for an empire in Eastern Europe.',
 'Intelligence assets have been redirected.',
 'A new wave of what the economist Joseph Schumpeter famously called “creative destruction” is under way: even as central banks struggle to maintain stability by flooding markets with liquidity, credit to business and households is shrinking.',
 'It all came about in a number of ways.',
 'The UN, like the dream of European unity, was also part of the 1945 consensus.',
 'The End of 1945',
 'The Global Economy’s New Path',
 'But this scenario failed to materialize.',
 'Gold prices are extremely sensitive to global interest-rate movements.',
 'Fukushima has presented the world with a far-reaching, fundamental choice.',
 'It was Japan, the high-tech country par excellence (not the latter-day Soviet Union) that proved unable to take adequate precautions to avert disaster in four reactor blocks.',
 'Some European academics tried to argue that there was no need for US-like fiscal transfers, because any desired degree of risk sharing can, in theory, be achieved through financial markets.',
 '$10,000 Gold?',
 'One answer, of course, is a complete collapse of the US dollar.',
 '1929 or 1989?',
 'The goods we made were what economists call “rival" and “excludible" commodities.',
 'This dream quickly faded when the Cold War divided the world into two hostile blocs. But in some ways the 1945 consensus, in the West, was strengthened by Cold War politics.',
 'The first flaw is that the spending reductions are badly timed: coming as they do when the US economy is weak, they risk triggering another recession.',
 'One successful gold investor recently explained to me that stock prices languished for a more than a decade before the Dow Jones index crossed the 1,000 mark in the early 1980’s.',
 'Eichengreen traces our tepid response to the crisis to the triumph of monetarist economists, the disciples of Milton Friedman, over their Keynesian and Minskyite peers – at least when it comes to interpretations of the causes and consequences of the Great Depression.',
 "However, America's unilateral options are limited.",
 'Once it was dark, a screen was set up and Mark showed home videos from space.',
 'These aspirations were often voiced in the United Nations, founded in 1945.',
 'Then I got distracted for about 40 years.']
Try your own queries:
English sentences similar to: "The stock market fell four points."
['(Chinese) 新兴市场的号角',
 '(English) It was part of the 1945 consensus.',
 '(Russian) Брюссель. Цунами, пронёсшееся по финансовым рынкам, является глобальной катастрофой.',
 '(Arabic) هناك أربعة شروط مسبقة لتحقيق النجاح الأوروبي في أفغانستان:',
 '(Spanish) Su índice de popularidad ha caído por primera vez por debajo del 50 por ciento.',
 '(English) His ratings have dipped below 50% for the first time.',
 '(Russian) Впервые его рейтинг опустился ниже 50%.',
 '(English) As a result, markets were deregulated, making it easier to trade assets that were perceived to be safe, but were in fact not.',
 '(Arabic) وكانت التطورات التي شهدتها سوق العمل أكثر تشجيعا، فهي على النقيض من أسواق الأصول تعكس النتائج وليس التوقعات. وهنا أيضاً كانت الأخبار طيبة. فقد أصبحت سوق العمل أكثر إحكاما، حيث ظلت البطالة عند مستوى 3.5% وكانت نسبة الوظائف إلى الطلبات المقدمة فوق مستوى التعادل.',
 '(Russian) Это было частью консенсуса 1945 года.',
 '(English) Consider the advanced economies.',
 '(English) Since their articles appeared, the price of gold has moved up still further.',
 '(Russian) Тогда они не только смогут накормить свои семьи, но и начать получать рыночную прибыль и откладывать деньги на будущее.',
 '(English) Gold prices even hit a record-high $1,300 recently.',
 '(Chinese) 另一种金融危机',
 '(Russian) Европейская мечта находится в кризисе.',
 '(English) What Failed in 2008?',
 '(Spanish) Pero el acuerdo alcanzado tiene tres grandes defectos.',
 '(English) Officials underestimated tail risks.',
 '(English) Consider a couple of notorious examples.',
 '(Spanish) Los mercados financieros pueden ser frágiles y ofrecen muy poca capacidad de compartir los riesgos relacionados con el ingreso de los trabajadores, que constituye la mayor parte de la renta de cualquier economía avanzada.',
 '(Chinese) 2008年败在何处?',
 '(Spanish) Consideremos las economías avanzadas.',
 '(Spanish) Los bienes producidos se caracterizaron por ser, como señalaron algunos economistas, mercancías “rivales” y “excluyentes”.',
 '(Arabic) إغلاق الفجوة الاستراتيجية في أوروبا',
 '(English) Stalin was content to settle for an empire in Eastern Europe.',
 '(English) Intelligence assets have been redirected.',
 '(Spanish) Hoy, envalentonados por la apreciación continua, algunos están sugiriendo que el oro podría llegar incluso a superar esa cifra.',
 '(Russian) Цены на золото чрезвычайно чувствительны к мировым движениям процентных ставок.',
 '(Russian) Однако у достигнутой договоренности есть три основных недостатка.']
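When inspecting results from the mixed-language index, it can help to group them by the "(Language)" prefix that was attached to each sentence when the index was built. The following is a minimal sketch under that assumption; `ANNOTATION` and `group_by_language` are hypothetical helpers, and the sample strings are taken from the output above.

```python
import re
from collections import defaultdict

# Matches labels of the form '(Language) sentence'.
ANNOTATION = re.compile(r"^\((?P<language>[^)]+)\)\s(?P<sentence>.*)$", re.DOTALL)


def group_by_language(annotated_results):
    """Map language name -> sentences, preserving the ranking within each language."""
    grouped = defaultdict(list)
    for item in annotated_results:
        match = ANNOTATION.match(item)
        if match is None:
            raise ValueError("missing language annotation: {!r}".format(item))
        grouped[match.group("language")].append(match.group("sentence"))
    return dict(grouped)


results = [
    "(English) What Failed in 2008?",
    "(Chinese) 2008年败在何处?",
    "(English) Officials underestimated tail risks.",
]
for language, sentences in group_by_language(results).items():
    print(language, len(sentences))
```

Grouping this way makes it easy to see how many of the top results each language contributes for a given query.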
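Further topics
Multilingual
Finally, we encourage you to try queries in any of the supported languages: English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai and Turkish.
Also, even though we only indexed a subset of the languages, you can index content in any of the supported languages.
Model variations
We offer variations of the Universal Encoder models optimized for various things such as memory, latency and/or quality. Please feel free to experiment with them to find a suitable one.
Nearest neighbor libraries
We used Annoy to efficiently look up nearest neighbors. See the tradeoffs section to read about the number of trees (memory-dependent) and the number of items to search (latency-dependent). SimpleNeighbors only allows control over the number of trees, but refactoring the code to use Annoy directly should be simple; we just wanted to keep this code as simple as possible for the general user.
If Annoy does not scale for your application, please also check out FAISS.
All the best building your multilingual semantic applications!
[1] J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)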