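Adding a new dataset collection

Follow this guide to create a new dataset collection (either in TFDS or in your own repository).

Overview

To add a new dataset collection my_collection to TFDS, create a my_collection folder containing the following files: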

my_collection/
  __init__.py
  my_collection.py # Dataset collection definition
  my_collection_test.py # (Optional) test
  description.md # (Optional) collection description (if not included in my_collection.py)
  citations.bib # (Optional) collection citations (if not included in my_collection.py)

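By convention, new dataset collections should be added to the tensorflow_datasets/dataset_collections/ folder in the TFDS repository.

Writing your dataset collection

All dataset collections are subclasses of tfds.core.dataset_collection_builder.DatasetCollection.

Here is a minimal example of a dataset collection builder, defined in the file my_collection.py: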

import collections
from typing import Mapping
from tensorflow_datasets.core import dataset_collection_builder
from tensorflow_datasets.core import naming

class MyCollection(dataset_collection_builder.DatasetCollection):
  """Dataset collection builder my_dataset_collection."""

  @property
  def info(self) -> dataset_collection_builder.DatasetCollectionInfo:
    return dataset_collection_builder.DatasetCollectionInfo.from_cls(
        dataset_collection_class=self.__class__,
        description="my_dataset_collection description.",
        release_notes={
            "1.0.0": "Initial release",
        },
    )

  @property
  def datasets(
      self,
  ) -> Mapping[str, Mapping[str, naming.DatasetReference]]:
    return collections.OrderedDict({
        "1.0.0":
            naming.references_for({
                "dataset_1": "natural_questions/default:0.0.2",
                "dataset_2": "media_sum:1.0.0",
            }),
        "1.1.0":
            naming.references_for({
                "dataset_1": "natural_questions/longt5:0.1.0",
                "dataset_2": "media_sum:1.0.0",
                "dataset_3": "squad:3.0.0"
            })
    })

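The next sections describe the 2 abstract methods to override.

`info`: dataset collection metadata

The info method returns a dataset_collection_builder.DatasetCollectionInfo containing the collection's metadata.

The dataset collection info contains four fields:

Name: the name of the dataset collection.
Description: a Markdown-formatted description of the dataset collection. There are two ways to define it: (1) as a (multi-line) string directly in the collection's my_collection.py file, similarly to how it is already done for TFDS datasets; (2) in a description.md file, which must be placed in the dataset collection folder.
Release notes: a mapping from each version of the dataset collection to the corresponding release notes.
Citation: an optional list of BibTeX citations for the dataset collection. There are two ways to define it: (1) as a (multi-line) string directly in the collection's my_collection.py file, similarly to how it is already done for TFDS datasets; (2) in a citations.bib file, which must be placed in the dataset collection folder.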
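The two ways of supplying a description can be pictured with a small standard-library sketch. It is not TFDS code: resolve_description is a hypothetical helper, and the precedence it applies (an inline string wins over the file) is an assumption made for illustration, not documented behavior.

```python
import pathlib
import tempfile
from typing import Optional


def resolve_description(
    collection_dir: pathlib.Path,
    inline_description: Optional[str] = None,
) -> str:
  """Hypothetical helper (not part of TFDS).

  Returns the inline string when one is given; otherwise reads
  description.md from the collection folder. The precedence is an assumption.
  """
  if inline_description is not None:
    return inline_description
  return (collection_dir / "description.md").read_text(encoding="utf-8")


# Build a throwaway collection folder that only contains description.md.
with tempfile.TemporaryDirectory() as tmp:
  folder = pathlib.Path(tmp)
  (folder / "description.md").write_text(
      "Described in a file.", encoding="utf-8")
  from_file = resolve_description(folder)
  from_string = resolve_description(folder, "Described inline.")
```

The same pattern would apply to citations, with citations.bib in place of description.md.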

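`datasets`: define the datasets in the collection

The datasets method returns the TFDS datasets in the collection.

It is defined as a dictionary of versions, which describes the evolution of the dataset collection.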
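To make "evolution" concrete, the sketch below copies the two versions from the minimal example into plain dictionaries of strings and compares them. It does not import TFDS; version_key, latest_version and diff_versions are hypothetical helpers written for this illustration only.

```python
from typing import Dict, Mapping, Set, Tuple

# Plain-string stand-in for the `datasets` mapping of the minimal example.
DATASETS: Dict[str, Dict[str, str]] = {
    "1.0.0": {
        "dataset_1": "natural_questions/default:0.0.2",
        "dataset_2": "media_sum:1.0.0",
    },
    "1.1.0": {
        "dataset_1": "natural_questions/longt5:0.1.0",
        "dataset_2": "media_sum:1.0.0",
        "dataset_3": "squad:3.0.0",
    },
}


def version_key(version: str) -> Tuple[int, ...]:
  """Turns "1.10.0" into (1, 10, 0) so versions compare numerically."""
  return tuple(int(part) for part in version.split("."))


def latest_version(datasets: Mapping[str, Mapping[str, str]]) -> str:
  """Returns the highest version key of the collection."""
  return max(datasets, key=version_key)


def diff_versions(
    datasets: Mapping[str, Mapping[str, str]], old: str, new: str
) -> Tuple[Set[str], Set[str]]:
  """Returns (added, changed) dataset keys between two collection versions."""
  old_refs, new_refs = datasets[old], datasets[new]
  added = set(new_refs) - set(old_refs)
  changed = {
      name for name in set(old_refs) & set(new_refs)
      if old_refs[name] != new_refs[name]
  }
  return added, changed


latest = latest_version(DATASETS)
added, changed = diff_versions(DATASETS, "1.0.0", "1.1.0")
```

Here version 1.1.0 adds dataset_3 and points dataset_1 at a different config and version, while dataset_2 is unchanged.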

For each version, the included TFDS datasets are stored as a dictionary from dataset names to naming.DatasetReference. For example:

class MyCollection(dataset_collection_builder.DatasetCollection):
  ...
  @property
  def datasets(self):
    return {
        "1.0.0": {
            "yes_no":
                naming.DatasetReference(
                    dataset_name="yes_no", version="1.0.0"),
            "sst2":
                naming.DatasetReference(
                    dataset_name="glue", config="sst2", version="2.0.0"),
            "assin2":
                naming.DatasetReference(
                    dataset_name="assin2", version="1.0.0"),
        },
        ...
    }

The naming.references_for method provides a more compact way to express the same as above:

class MyCollection(dataset_collection_builder.DatasetCollection):
  ...
  @property
  def datasets(self):
    return {
        "1.0.0":
            naming.references_for({
                "yes_no": "yes_no:1.0.0",
                "sst2": "glue/sst2:2.0.0",
                "assin2": "assin2:1.0.0",
            }),
        ...
    }
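The compact strings follow a name/config:version shape, where the config part is optional. As a rough illustration of how such a string maps onto the dataset_name, config and version fields used earlier, here is a standard-library sketch. Ref and parse_reference are hypothetical stand-ins, not the TFDS implementation, and the sketch assumes the simple shape shown in this guide (for example, no namespace prefix).

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class Ref:
  """Hypothetical stand-in mirroring the fields of naming.DatasetReference."""
  dataset_name: str
  config: Optional[str]
  version: Optional[str]


def parse_reference(spec: str) -> Ref:
  """Splits "name[/config][:version]" into its parts (illustration only)."""
  name_and_config, _, version = spec.partition(":")
  dataset_name, _, config = name_and_config.partition("/")
  # Empty strings mean the optional part was absent.
  return Ref(dataset_name, config or None, version or None)


sst2 = parse_reference("glue/sst2:2.0.0")
yes_no = parse_reference("yes_no:1.0.0")
```

Under these assumptions, "glue/sst2:2.0.0" corresponds to dataset_name="glue", config="sst2", version="2.0.0", matching the explicit DatasetReference form.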

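Unit-test your dataset collection

DatasetCollectionTestBase is a base test class for dataset collections. It provides a number of simple checks to guarantee that the dataset collection is correctly registered and that its datasets exist in TFDS.

The only class attribute that must be set is DATASET_COLLECTION_CLASS, which specifies the class object of the dataset collection to test.

Additionally, users can set the following class attributes:

VERSION: the version of the dataset collection used to run the test (defaults to the latest version).
DATASETS_TO_TEST: a list of the datasets whose existence in TFDS is tested (defaults to all datasets in the collection).
CHECK_DATASETS_VERSION: whether to check for the existence of the versioned datasets in the dataset collection, or only for their default versions (defaults to true).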
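The effect of DATASETS_TO_TEST and CHECK_DATASETS_VERSION can be pictured with a small sketch of an existence check. This is not how DatasetCollectionTestBase is implemented: REGISTRY, COLLECTION and find_missing are hypothetical, and the registry contents are made up purely so that one entry fails the versioned check.

```python
from typing import Dict, List, Mapping, Optional, Sequence, Set, Tuple

# Hypothetical registry of available datasets: name -> published versions.
REGISTRY: Dict[str, Set[str]] = {
    "yes_no": {"1.0.0"},
    "assin2": {"1.0.0"},
    "glue": {"1.0.0"},
}

# Collection contents as (dataset_name, version), keyed by collection key.
COLLECTION: Dict[str, Tuple[str, str]] = {
    "yes_no": ("yes_no", "1.0.0"),
    "sst2": ("glue", "2.0.0"),
    "assin2": ("assin2", "1.0.0"),
}


def find_missing(
    collection: Mapping[str, Tuple[str, str]],
    registry: Mapping[str, Set[str]],
    datasets_to_test: Optional[Sequence[str]] = None,
    check_datasets_version: bool = True,
) -> List[str]:
  """Returns the collection keys whose dataset cannot be found."""
  keys = list(collection) if datasets_to_test is None else list(datasets_to_test)
  missing = []
  for key in keys:
    name, version = collection[key]
    if name not in registry:
      missing.append(key)  # The dataset does not exist at all.
    elif check_datasets_version and version not in registry[name]:
      missing.append(key)  # The dataset exists, but not at this version.
  return missing


strict = find_missing(COLLECTION, REGISTRY)
lenient = find_missing(COLLECTION, REGISTRY, check_datasets_version=False)
subset = find_missing(COLLECTION, REGISTRY, datasets_to_test=["yes_no"])
```

In this made-up registry, the strict check flags sst2 because version 2.0.0 of glue is absent, while the lenient check and the restricted subset both pass.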

The simplest valid test for a dataset collection looks like this:

from tensorflow_datasets.testing.dataset_collection_builder_testing import DatasetCollectionTestBase
from . import my_collection

class TestMyCollection(DatasetCollectionTestBase):
  DATASET_COLLECTION_CLASS = my_collection.MyCollection

Run the following command to test the dataset collection.

python my_collection_test.py

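Feedback

We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset collection? Was there a part that was confusing, or that did not work the first time?

Please share your feedback on GitHub.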
