Overview
Dataset collections provide a simple way to group together an arbitrary number of existing TFDS datasets, and to perform simple operations over them.
They can be useful, for example, to group together different datasets related to the same task, or for easy benchmarking of models over a fixed number of different tasks.
Setup
To get started, install a few packages:
# Use tfds-nightly to ensure access to the latest features.
pip install -q tfds-nightly tensorflow
pip install -U conllu
Import TensorFlow and the TensorFlow Datasets package into your development environment:
import pprint
import tensorflow as tf
import tensorflow_datasets as tfds
Find available dataset collections
All dataset collection builders are subclasses of tfds.core.dataset_collection_builder.DatasetCollection.
To get the list of available builders, use tfds.list_dataset_collections().
tfds.list_dataset_collections()
['longt5', 'xtreme']
Load and inspect a dataset collection
The easiest way to load a dataset collection is to instantiate a DatasetCollectionLoader object with the tfds.dataset_collection command:
collection_loader = tfds.dataset_collection('xtreme')
Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:
collection_loader = tfds.dataset_collection('xtreme:1.0.0')
A dataset collection loader can display information about the collection:
collection_loader.print_info()
Dataset collection: xtreme
Version: 1.0.0
Description: # Xtreme Benchmark

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models. It covers 40 typologically diverse languages (spanning 12 language families) and includes nine tasks that collectively require reasoning about different levels of syntax and semantics. The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data.

Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.

For a full description of the benchmark, see the [paper](https://arxiv.org/abs/2003.11080).

Citation:
@article{hu2020xtreme,
  author  = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham Neubig and Orhan Firat and Melvin Johnson},
  title   = {XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization},
  journal = {CoRR},
  volume  = {abs/2003.11080},
  year    = {2020},
  archivePrefix = {arXiv},
  eprint  = {2003.11080}
}
The dataset loader can also display information about the datasets contained in the collection:
collection_loader.print_datasets()
The dataset collection xtreme (version: 1.0.0) contains the datasets:
- xnli: DatasetReference(dataset_name='xtreme_xnli', namespace=None, config=None, version='1.1.0', data_dir=None, split_mapping=None)
- pawsx: DatasetReference(dataset_name='xtreme_pawsx', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
- pos: DatasetReference(dataset_name='xtreme_pos', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
- ner: DatasetReference(dataset_name='wikiann', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
- xquad: DatasetReference(dataset_name='xquad', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)
- mlqa: DatasetReference(dataset_name='mlqa', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
- tydiqa: DatasetReference(dataset_name='tydi_qa', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)
- bucc: DatasetReference(dataset_name='bucc', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
- tatoeba: DatasetReference(dataset_name='tatoeba', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
Load datasets from a dataset collection
The easiest way to load one dataset from a collection is to use a DatasetCollectionLoader object's load_dataset method, which loads the required dataset by calling tfds.load.
This call returns a dictionary of split names and the corresponding tf.data.Datasets:
splits = collection_loader.load_dataset("ner")
pprint.pprint(splits)
{'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>}
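The values of the returned dictionary are ordinary tf.data.Dataset objects, so the usual tf.data API applies to them. A minimal sketch: peek at a single training example of the "ner" dataset (the field names follow the element_spec shown above):

# Inspect one example from the "ner" train split.
for example in splits["train"].take(1):
  print(example["tokens"])  # variable-length vector of token strings
  print(example["tags"])    # matching integer tag ids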
load_dataset accepts the following optional parameters:
- split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, it loads all splits for the given dataset.
- loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
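For example, to load only selected splits of a dataset (a minimal sketch using the parameters listed above):

# Load only the train and test splits of the "ner" dataset.
splits = collection_loader.load_dataset("ner", split=["train", "test"])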
Load multiple datasets from a dataset collection
The easiest way to load multiple datasets from a collection is to use the DatasetCollectionLoader object's load_datasets method, which loads the required datasets by calling tfds.load.
It returns a dictionary of dataset names, each of which is associated with a dictionary of split names and the corresponding tf.data.Datasets, as in the following example:
datasets = collection_loader.load_datasets(['xnli', 'bucc'])
pprint.pprint(datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)}}>}}
The load_all_datasets method loads all available datasets for a given collection:
all_datasets = collection_loader.load_all_datasets()
pprint.pprint(all_datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'mlqa': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'ner': {'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
         'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
         'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>},
 'pawsx': {'train': <_PrefetchDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'sentence1': TensorSpec(shape=(), dtype=tf.string, name=None), 'sentence2': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'pos': {'dev': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,
         'test': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,
         'train': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>},
 'tatoeba': {'train': <_PrefetchDataset element_spec={'source_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'tydiqa': {'train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-en': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)}}>},
 'xquad': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-dev': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>}}
load_datasets accepts the following optional parameters:
- split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, it loads all splits for the given datasets.
- loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
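For example, to forward a tfds.load option to every dataset being loaded at once (a minimal sketch; batch_size is a standard tfds.load argument):

# Load all splits of both datasets, batched in groups of 16 examples.
datasets = collection_loader.load_datasets(
    ['xnli', 'bucc'], loader_kwargs=dict(batch_size=16))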
Specifying loader_kwargs
loader_kwargs are optional keyword arguments to be passed to the tfds.load function. They can be specified in three ways:
- When initializing the DatasetCollectionLoader class:
collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
- Using DatasetCollectionLoader's set_loader_kwargs method:
collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))
- As optional parameters to the load_dataset, load_datasets, and load_all_datasets methods:
dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
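These three mechanisms can be combined. As a working assumption (worth confirming against the DatasetCollectionLoader documentation), kwargs supplied directly to a load call apply only to that call, on top of any defaults set earlier:

# Set a default for all subsequent loads...
collection_loader.set_loader_kwargs(dict(try_gcs=False))
# ...and narrow a single call to one split (hypothetical combination).
dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train'))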
Feedback
We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating dataset collections? Was there a part that was confusing, repetitive, or not working the first time around? Please share your feedback on GitHub.