The TFDS CLI is a command-line tool that provides various commands to make it easier to work with TensorFlow Datasets.
Disable TF logs on import
%%capture
%env TF_CPP_MIN_LOG_LEVEL=1 # Disable logs on TF import
Installation
The CLI tool is installed with tensorflow-datasets (or tfds-nightly).
pip install -q tfds-nightly apache-beam
tfds --version
To list all CLI commands:
tfds --help
usage: tfds [-h] [--helpfull] [--version] {build,new} ...
Tensorflow Datasets CLI tool
optional arguments:
-h, --help show this help message and exit
--helpfull show full help message and exit
--version show program's version number and exit
command:
{build,new}
build Commands for downloading and preparing datasets.
new Creates a new dataset directory from the template.
tfds new: Implementing a new dataset
This command helps you kick-start writing a new Python dataset by creating a <dataset_name>/ directory containing the default implementation files.
Usage
tfds new my_dataset
Dataset generated at /tmpfs/src/temp/docs/my_dataset You can start searching `TODO(my_dataset)` to complete the implementation. Please check https://tensorflowcn.cn/datasets/add_dataset for additional details.
tfds new my_dataset creates:
ls -1 my_dataset/
CITATIONS.bib README.md TAGS.txt __init__.py checksums.tsv dummy_data/ my_dataset_dataset_builder.py my_dataset_dataset_builder_test.py
The optional --data_format flag can be used to generate format-specific dataset builders (e.g., conll). If no data format is given, it generates a template for a standard tfds.core.GeneratorBasedBuilder. Refer to the documentation for details on the available format-specific dataset builders.
For more information, see our guide on writing datasets.
Available options
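As a sketch of the command in practice, the flags described above can be combined like this (the dataset names are illustrative; conll is one of the --data_format values listed in the help output below):

```shell
# Generate a template for a CoNLL-formatted dataset.
tfds new my_conll_dataset --data_format=conll

# Create the dataset directory under datasets/ instead of the current directory.
tfds new my_dataset --dir=datasets/
```

Both invocations produce the same set of template files shown above; only the builder template contents and the target directory differ.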
tfds new --help
usage: tfds new [-h] [--helpfull] [--data_format {standard,conll,conllu}]
[--dir DIR]
dataset_name
positional arguments:
dataset_name Name of the dataset to be created (in snake_case)
optional arguments:
-h, --help show this help message and exit
--helpfull show full help message and exit
--data_format {standard,conll,conllu}
Optional format of the input data, which is used to
generate a format-specific template.
--dir DIR Path where the dataset directory will be created.
Defaults to current directory.
tfds build: Download and prepare a dataset
Use tfds build <my_dataset> to generate a new dataset. <my_dataset> can be:
A path to a dataset/ folder or dataset.py file (empty for the current directory):
tfds build datasets/my_dataset/
cd datasets/my_dataset/ && tfds build
cd datasets/my_dataset/ && tfds build my_dataset
cd datasets/my_dataset/ && tfds build my_dataset.py
A registered dataset:
tfds build mnist
tfds build my_dataset --imports my_project.datasets
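Putting the two forms together, a typical invocation might look like the following (paths, config, and dataset names are illustrative; the flags are documented in the help output below):

```shell
# Build a registered dataset into a custom data directory.
tfds build mnist --data_dir=/tmp/tensorflow_datasets/

# Build a single config of a multi-config dataset from its source folder.
tfds build datasets/my_dataset/ --config=my_config
```

If neither --data_dir nor the TFDS_DATA_DIR environment variable is set, the dataset is placed under ~/tensorflow_datasets/.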
Available options
tfds build --help
usage: tfds build [-h] [--helpfull]
[--datasets DATASETS_KEYWORD [DATASETS_KEYWORD ...]]
[--overwrite] [--fail_if_exists]
[--max_examples_per_split [MAX_EXAMPLES_PER_SPLIT]]
[--data_dir DATA_DIR] [--download_dir DOWNLOAD_DIR]
[--extract_dir EXTRACT_DIR] [--manual_dir MANUAL_DIR]
[--add_name_to_manual_dir] [--download_only]
[--config CONFIG] [--config_idx CONFIG_IDX]
[--update_metadata_only] [--download_config DOWNLOAD_CONFIG]
[--imports IMPORTS] [--register_checksums]
[--force_checksums_validation]
[--noforce_checksums_validation]
[--beam_pipeline_options BEAM_PIPELINE_OPTIONS]
[--file_format FILE_FORMAT]
[--max_shard_size_mb MAX_SHARD_SIZE_MB]
[--num-processes NUM_PROCESSES] [--publish_dir PUBLISH_DIR]
[--skip_if_published] [--exclude_datasets EXCLUDE_DATASETS]
[--experimental_latest_version]
[datasets ...]
positional arguments:
datasets Name(s) of the dataset(s) to build. Default to current
dir. See https://tensorflowcn.cn/datasets/cli for
accepted values.
optional arguments:
-h, --help show this help message and exit
--helpfull show full help message and exit
--datasets DATASETS_KEYWORD [DATASETS_KEYWORD ...]
Datasets can also be provided as keyword argument.
Debug & tests:
--pdb Enter post-mortem debugging mode if an exception is raised.
--overwrite Delete pre-existing dataset if it exists.
--fail_if_exists Fails the program if there is a pre-existing dataset.
--max_examples_per_split [MAX_EXAMPLES_PER_SPLIT]
When set, only generate the first X examples (default
to 1), rather than the full dataset. If set to 0, only
execute the `_split_generators` (which download the
original data), but skip `_generate_examples`
Paths:
--data_dir DATA_DIR Where to place datasets. Default to
`~/tensorflow_datasets/` or `TFDS_DATA_DIR`
environment variable.
--download_dir DOWNLOAD_DIR
Where to place downloads. Default to
`<data_dir>/downloads/`.
--extract_dir EXTRACT_DIR
Where to extract files. Default to
`<download_dir>/extracted/`.
--manual_dir MANUAL_DIR
Where to manually download data (required for some
datasets). Default to `<download_dir>/manual/`.
--add_name_to_manual_dir
If true, append the dataset name to the `manual_dir`
(e.g. `<download_dir>/manual/<dataset_name>/`. Useful
to avoid collisions if many datasets are generated.
Generation:
--download_only If True, download all files but do not prepare the
dataset. Uses the checksum.tsv to find out what to
download. Therefore, this does not work in combination
with --register_checksums.
--config CONFIG, -c CONFIG
Config name to build. Build all configs if not set.
Can also be a json of the kwargs forwarded to the
config `__init__` (for custom configs).
--config_idx CONFIG_IDX
Config id to build
(`builder_cls.BUILDER_CONFIGS[config_idx]`). Mutually
exclusive with `--config`.
--update_metadata_only
If True, existing dataset_info.json is updated with
metadata defined in Builder class(es). Datasets must
already have been prepared.
--download_config DOWNLOAD_CONFIG
A json of the kwargs forwarded to the config
`__init__` (for custom DownloadConfigs).
--imports IMPORTS, -i IMPORTS
Comma separated list of module to import to register
datasets.
--register_checksums If True, store size and checksum of downloaded files.
--force_checksums_validation
If True, raise an error if the checksums are not
found.
--noforce_checksums_validation
If specified, bypass the checks on the checksums.
--beam_pipeline_options BEAM_PIPELINE_OPTIONS
A (comma-separated) list of flags to pass to
`PipelineOptions` when preparing with Apache Beam.
(see:
https://tensorflowcn.cn/datasets/beam_datasets).
Example: `--beam_pipeline_options=job_name=my-
job,project=my-project`
--file_format FILE_FORMAT
File format to which generate the tf-examples.
Available values: ['tfrecord', 'riegeli',
'array_record'] (see `tfds.core.FileFormat`).
--max_shard_size_mb MAX_SHARD_SIZE_MB
The max shard size in megabytes.
--num-processes NUM_PROCESSES
Number of parallel build processes.
Publishing:
Options for publishing successfully created datasets.
--publish_dir PUBLISH_DIR
Where to optionally publish the dataset after it has
been generated successfully. Should be the root data
dir under which datasets are stored. If unspecified,
dataset will not be published.
--skip_if_published If the dataset with the same version and config is
already published, then it will not be regenerated.
Automation:
Used by automated scripts.
--exclude_datasets EXCLUDE_DATASETS
If set, generate all datasets except the one defined
here. Comma separated list of datasets to exclude.
--experimental_latest_version
Build the latest Version(experiments=...) available
rather than default version.
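As a sketch of a common workflow when developing a new dataset, the debugging and checksum flags above might be combined like this (the dataset name is illustrative):

```shell
# First pass: record download sizes/checksums while generating
# only a handful of examples per split, to validate the builder quickly.
tfds build my_dataset --register_checksums --max_examples_per_split=5

# Once the builder works, generate the full dataset,
# replacing the partial one from the first pass.
tfds build my_dataset --overwrite
```

Note that --download_only cannot be combined with --register_checksums, since it relies on an existing checksums file to know what to download.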