Dataset collections


Overview

Dataset collections provide a simple way to group together an arbitrary number of existing TFDS datasets, and to perform simple operations over them.

They can be useful, for example, to group together different datasets related to the same task, or to easily benchmark models over a fixed number of different tasks.

Setup

To get started, install a few packages:

# Use tfds-nightly to ensure access to the latest features.
pip install -q tfds-nightly tensorflow
pip install -U conllu

Import TensorFlow and the TensorFlow Datasets package into your development environment:

import pprint

import tensorflow as tf
import tensorflow_datasets as tfds

Find available dataset collections

All dataset collection builders are subclasses of tfds.core.dataset_collection_builder.DatasetCollection.

To get the list of available builders, use tfds.list_dataset_collections():

tfds.list_dataset_collections()
['longt5', 'xtreme']

Load and inspect a dataset collection

The easiest way to load a dataset collection is to instantiate a DatasetCollectionLoader object using the tfds.dataset_collection command:

collection_loader = tfds.dataset_collection('xtreme')

Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:

collection_loader = tfds.dataset_collection('xtreme:1.0.0')

A dataset collection loader can display information about the collection:

collection_loader.print_info()
Dataset collection: xtreme
Version: 1.0.0
Description: # Xtreme Benchmark

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME)
benchmark is a benchmark for the evaluation of the cross-lingual generalization
ability of pre-trained multilingual models. It covers 40 typologically diverse
languages (spanning 12 language families) and includes nine tasks that
collectively require reasoning about different levels of syntax and semantics.
The languages in XTREME are selected to maximize language diversity, coverage
in existing tasks, and availability of training data. Among these are many
under-studied languages, such as the Dravidian languages Tamil (spoken in
southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken
mainly in southern India), and the Niger-Congo languages Swahili and Yoruba,
spoken in Africa.

For a full description of the benchmark,
see the [paper](https://arxiv.org/abs/2003.11080).

Citation:
@article{hu2020xtreme,
    author    = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham
                 Neubig and Orhan Firat and Melvin Johnson},
    title     = {XTREME: A Massively Multilingual Multi-task Benchmark for
                 Evaluating Cross-lingual Generalization},
    journal   = {CoRR},
    volume    = {abs/2003.11080},
    year      = {2020},
    archivePrefix = {arXiv},
    eprint    = {2003.11080}
}

The dataset loader can also display information about the datasets contained in the collection:

collection_loader.print_datasets()
The dataset collection xtreme (version: 1.0.0) contains the datasets:

 - xnli: DatasetReference(dataset_name='xtreme_xnli', namespace=None, config=None, version='1.1.0', data_dir=None, split_mapping=None)
 - pawsx: DatasetReference(dataset_name='xtreme_pawsx', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - pos: DatasetReference(dataset_name='xtreme_pos', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - ner: DatasetReference(dataset_name='wikiann', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - xquad: DatasetReference(dataset_name='xquad', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)
 - mlqa: DatasetReference(dataset_name='mlqa', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - tydiqa: DatasetReference(dataset_name='tydi_qa', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)
 - bucc: DatasetReference(dataset_name='bucc', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)
 - tatoeba: DatasetReference(dataset_name='tatoeba', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)

Loading datasets from a dataset collection

The easiest way to load a single dataset from a collection is to use a DatasetCollectionLoader object's load_dataset method, which loads the required dataset by calling tfds.load.

This call returns a dictionary of split names and the corresponding tf.data.Datasets:

splits = collection_loader.load_dataset("ner")

pprint.pprint(splits)
{'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>}
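
Each value in the returned dictionary is a regular tf.data.Dataset, so it can be consumed as usual. A minimal sketch (using the tokens field name shown in the element_spec above) that inspects the first example of the train split:

for example in splits["train"].take(1):
  # Each example is a dictionary of tensors; 'tokens' is a
  # variable-length string tensor holding the example's tokens.
  print(example["tokens"])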

load_dataset accepts the following optional parameters:

  • split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]), as in the sketch after this list. If not specified, it will load all splits of the given dataset.
  • loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
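
For example, a minimal sketch (reusing the collection_loader and the ner dataset from above) that loads only the train and validation splits:

splits = collection_loader.load_dataset("ner", split=["train", "validation"])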

Loading multiple datasets from a dataset collection

The easiest way to load multiple datasets from a collection is to use the DatasetCollectionLoader object's load_datasets method, which loads the required datasets by calling tfds.load.

It returns a dictionary of dataset names, each of which is associated with a dictionary of split names and the corresponding tf.data.Datasets, as in the following example:

datasets = collection_loader.load_datasets(['xnli', 'bucc'])

pprint.pprint(datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)} }>} }
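
The nested dictionary can then be indexed first by dataset name and then by split name, for example (a usage sketch based on the datasets loaded above):

bucc_test = datasets["bucc"]["test"]
print(bucc_test.element_spec)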

The load_all_datasets method loads all available datasets for a given collection:

all_datasets = collection_loader.load_all_datasets()

pprint.pprint(all_datasets)
{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'mlqa': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
          'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'ner': {'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
         'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,
         'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>},
 'pawsx': {'train': <_PrefetchDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'sentence1': TensorSpec(shape=(), dtype=tf.string, name=None), 'sentence2': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'pos': {'dev': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,
         'test': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,
         'train': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>},
 'tatoeba': {'train': <_PrefetchDataset element_spec={'source_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'tydiqa': {'train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'translate-train-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-en': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
            'validation-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},
 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)} }>},
 'xquad': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-dev': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
           'translate-train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>} }

The load_datasets method accepts the following optional parameters:

  • split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]), as in the sketch after this list. If not specified, it will load all splits of the given datasets.
  • loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
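
For example, a sketch (reusing the collection_loader from above) that loads only the train splits of two of the collection's datasets:

datasets = collection_loader.load_datasets(["xnli", "tatoeba"], split="train")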

Specifying loader_kwargs

The loader_kwargs are optional keyword arguments to be passed to the tfds.load function. They can be specified in three ways:

  1. When initializing the DatasetCollectionLoader class:
collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
  2. Using the DatasetCollectionLoader's set_loader_kwargs method:
collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))
  3. As optional parameters to the load_dataset, load_datasets, and load_all_datasets methods:
dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))

Feedback

We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating a dataset collection? Was there a part that was confusing, boilerplate, or not working the first time? Please share your feedback on GitHub.