在本基于笔记本的教程中,我们将创建和运行 TFX 管道以验证输入数据并创建 ML 模型。本笔记本基于我们在 简单 TFX 管道教程 中构建的 TFX 管道。如果您尚未阅读该教程,请在继续本笔记本之前阅读它。
任何数据科学或 ML 项目中的第一个任务是了解和清理数据,其中包括
- 了解每个特征的数据类型、分布和其他信息(例如,平均值或唯一值数量)
- 生成描述数据的初步模式
- 根据给定的模式识别数据中的异常值和缺失值
在本教程中,我们将创建两个 TFX 管道。
首先,我们将创建一个管道来分析数据集并生成给定数据集的初步模式。此管道将包括两个新组件,StatisticsGen
和 SchemaGen
。
获得适当的数据模式后,我们将创建一个管道来训练基于先前教程中管道创建的 ML 分类模型。在此管道中,我们将使用第一个管道中的模式和一个新组件 ExampleValidator
来验证输入数据。
三个新组件 StatisticsGen、SchemaGen 和 ExampleValidator 是用于数据分析和验证的 TFX 组件,它们使用 TensorFlow 数据验证 库实现。
请参阅 了解 TFX 管道,以详细了解 TFX 中的各种概念。
设置
首先,我们需要安装 TFX Python 包并下载我们将用于模型的数据集。
升级 Pip
为了避免在本地运行时升级系统中的 Pip,请检查我们是否在 Colab 中运行。本地系统当然可以单独升级。
try:
import colab
!pip install --upgrade pip
except:
pass
安装 TFX
pip install -U tfx
您是否重新启动了运行时?
如果您使用的是 Google Colab,则第一次运行上面的单元格时,必须通过点击上面的“重新启动运行时”按钮或使用“运行时 > 重新启动运行时...”菜单来重新启动运行时。这是因为 Colab 加载包的方式。
检查 TensorFlow 和 TFX 版本。
import tensorflow as tf
print('TensorFlow version: {}'.format(tf.__version__))
from tfx import v1 as tfx
print('TFX version: {}'.format(tfx.__version__))
2024-05-08 09:36:04.670322: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-05-08 09:36:04.670389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-05-08 09:36:04.671916: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered TensorFlow version: 2.15.1 TFX version: 1.15.0
设置变量
有一些变量用于定义管道。您可以根据需要自定义这些变量。默认情况下,管道的所有输出都将在当前目录下生成。
import os
# We will create two pipelines. One for schema generation and one for training.
SCHEMA_PIPELINE_NAME = "penguin-tfdv-schema"
PIPELINE_NAME = "penguin-tfdv"
# Output directory to store artifacts generated from the pipeline.
SCHEMA_PIPELINE_ROOT = os.path.join('pipelines', SCHEMA_PIPELINE_NAME)
PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)
# Path to a SQLite DB file to use as an MLMD storage.
SCHEMA_METADATA_PATH = os.path.join('metadata', SCHEMA_PIPELINE_NAME,
'metadata.db')
METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')
# Output directory where created models from the pipeline will be exported.
SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)
from absl import logging
logging.set_verbosity(logging.INFO) # Set default logging level.
准备示例数据
我们将下载示例数据集以用于我们的 TFX 管道。我们使用的数据集是 Palmer Penguins 数据集,它也用于其他 TFX 示例。
此数据集中有四个数值特征
- culmen_length_mm
- culmen_depth_mm
- flipper_length_mm
- body_mass_g
所有特征都已归一化到 [0,1] 范围内。我们将构建一个分类模型,用于预测企鹅的 species
。
由于 TFX ExampleGen 组件从目录读取输入,因此我们需要创建一个目录并将数据集复制到其中。
import urllib.request
import tempfile
DATA_ROOT = tempfile.mkdtemp(prefix='tfx-data') # Create a temporary directory.
_data_url = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/labelled/penguins_processed.csv'
_data_filepath = os.path.join(DATA_ROOT, "data.csv")
urllib.request.urlretrieve(_data_url, _data_filepath)
('/tmpfs/tmp/tfx-dataj_6ovg52/data.csv', <http.client.HTTPMessage at 0x7ff39cac30a0>)
快速查看 CSV 文件。
head {_data_filepath}
species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g 0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667 0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556 0,0.29818181818181805,0.5833333333333334,0.3898305084745763,0.1527777777777778 0,0.16727272727272732,0.7380952380952381,0.3559322033898305,0.20833333333333334 0,0.26181818181818167,0.892857142857143,0.3050847457627119,0.2638888888888889 0,0.24727272727272717,0.5595238095238096,0.15254237288135594,0.2569444444444444 0,0.25818181818181823,0.773809523809524,0.3898305084745763,0.5486111111111112 0,0.32727272727272727,0.5357142857142859,0.1694915254237288,0.1388888888888889 0,0.23636363636363636,0.9642857142857142,0.3220338983050847,0.3055555555555556
您应该能够看到五个特征列。 species
是 0、1 或 2 之一,所有其他特征的值应在 0 到 1 之间。我们将创建一个 TFX 管道来分析此数据集。
生成初步模式
TFX 管道使用 Python API 定义。我们将创建一个管道,从输入示例自动生成模式。此模式可以由人工审查并根据需要进行调整。模式最终确定后,可用于后续任务中的训练和示例验证。
除了 简单 TFX 管道教程 中使用的 CsvExampleGen
外,我们还将使用 StatisticsGen
和 SchemaGen
- StatisticsGen 计算数据集的统计信息。
- SchemaGen 检查统计信息并创建初始数据模式。
请参阅每个组件的指南或 TFX 组件教程,以详细了解这些组件。
编写管道定义
我们定义一个函数来创建 TFX 管道。一个 Pipeline
对象表示一个 TFX 管道,可以使用 TFX 支持的管道编排系统之一运行。
def _create_schema_pipeline(pipeline_name: str,
pipeline_root: str,
data_root: str,
metadata_path: str) -> tfx.dsl.Pipeline:
"""Creates a pipeline for schema generation."""
# Brings data into the pipeline.
example_gen = tfx.components.CsvExampleGen(input_base=data_root)
# NEW: Computes statistics over data for visualization and schema generation.
statistics_gen = tfx.components.StatisticsGen(
examples=example_gen.outputs['examples'])
# NEW: Generates schema based on the generated statistics.
schema_gen = tfx.components.SchemaGen(
statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)
components = [
example_gen,
statistics_gen,
schema_gen,
]
return tfx.dsl.Pipeline(
pipeline_name=pipeline_name,
pipeline_root=pipeline_root,
metadata_connection_config=tfx.orchestration.metadata
.sqlite_metadata_connection_config(metadata_path),
components=components)
运行管道
我们将像之前的教程一样使用 LocalDagRunner
。
tfx.orchestration.LocalDagRunner().run(
_create_schema_pipeline(
pipeline_name=SCHEMA_PIPELINE_NAME,
pipeline_root=SCHEMA_PIPELINE_ROOT,
data_root=DATA_ROOT,
metadata_path=SCHEMA_METADATA_PATH))
INFO:absl:Excluding no splits because exclude_splits is not set. INFO:absl:Excluding no splits because exclude_splits is not set. INFO:absl:Using deployment config: executor_specs { key: "CsvExampleGen" value { beam_executable_spec { python_executor_spec { class_path: "tfx.components.example_gen.csv_example_gen.executor.Executor" } } } } executor_specs { key: "SchemaGen" value { python_class_executable_spec { class_path: "tfx.components.schema_gen.executor.Executor" } } } executor_specs { key: "StatisticsGen" value { beam_executable_spec { python_executor_spec { class_path: "tfx.components.statistics_gen.executor.Executor" } } } } custom_driver_specs { key: "CsvExampleGen" value { python_class_executable_spec { class_path: "tfx.components.example_gen.driver.FileBasedDriver" } } } metadata_connection_config { database_connection_config { sqlite { filename_uri: "metadata/penguin-tfdv-schema/metadata.db" connection_mode: READWRITE_OPENCREATE } } } INFO:absl:Using connection config: sqlite { filename_uri: "metadata/penguin-tfdv-schema/metadata.db" connection_mode: READWRITE_OPENCREATE } INFO:absl:Component CsvExampleGen is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.example_gen.csv_example_gen.component.CsvExampleGen" } id: "CsvExampleGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.CsvExampleGen" } } } } outputs { outputs { key: "examples" value { artifact_spec { type { name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET } } } } } parameters { parameters { key: "input_base" value { field_value { string_value: "/tmpfs/tmp/tfx-dataj_6ovg52" } } } parameters { key: "input_config" value { field_value { string_value: "{\n \"splits\": [\n {\n \"name\": \"single_split\",\n \"pattern\": \"*\"\n }\n ]\n}" } } } parameters { key: "output_config" value { field_value { string_value: "{\n \"split_config\": {\n \"splits\": [\n {\n \"hash_buckets\": 2,\n \"name\": \"train\"\n },\n {\n \"hash_buckets\": 1,\n \"name\": \"eval\"\n }\n ]\n }\n}" } } } parameters { key: "output_data_format" value { field_value { int_value: 6 } } } parameters { key: "output_file_format" value { field_value { int_value: 5 } } } } downstream_nodes: "StatisticsGen" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized INFO:absl:[CsvExampleGen] Resolved inputs: ({},) INFO:absl:select span and version = (0, None) INFO:absl:latest span and version = (0, None) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 1 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=1, input_dict={}, output_dict=defaultdict(<class 'list'>, {'examples': [Artifact(artifact: uri: "pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1" custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "span" value { int_value: 0 } } , artifact_type: name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}), exec_properties={'output_data_format': 6, 'output_file_format': 5, 'input_config': '{\n "splits": [\n {\n "name": "single_split",\n "pattern": "*"\n }\n ]\n}', 'output_config': '{\n "split_config": {\n "splits": [\n {\n "hash_buckets": 2,\n "name": "train"\n },\n {\n "hash_buckets": 1,\n "name": "eval"\n }\n ]\n }\n}', 'input_base': '/tmpfs/tmp/tfx-dataj_6ovg52', 'span': 0, 'version': None, 'input_fingerprint': 'split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970'}, execution_output_uri='pipelines/penguin-tfdv-schema/CsvExampleGen/.system/executor_execution/1/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv-schema/CsvExampleGen/.system/stateful_working_dir/d65151e8-a6c7-4b12-8076-f56938dd89f4', tmp_dir='pipelines/penguin-tfdv-schema/CsvExampleGen/.system/executor_execution/1/.temp/', pipeline_node=node_info { type { name: "tfx.components.example_gen.csv_example_gen.component.CsvExampleGen" } id: "CsvExampleGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.CsvExampleGen" } } } } outputs { outputs { key: "examples" value { artifact_spec { type { name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET } } } } } parameters { parameters { key: "input_base" value { field_value { string_value: "/tmpfs/tmp/tfx-dataj_6ovg52" } } } parameters { key: "input_config" value { field_value { string_value: "{\n \"splits\": [\n {\n \"name\": \"single_split\",\n \"pattern\": \"*\"\n }\n ]\n}" } } } parameters { key: "output_config" value { field_value { string_value: "{\n \"split_config\": {\n \"splits\": [\n {\n \"hash_buckets\": 2,\n \"name\": \"train\"\n },\n {\n \"hash_buckets\": 1,\n \"name\": \"eval\"\n }\n ]\n }\n}" } } } parameters { key: "output_data_format" value { field_value { int_value: 6 } } } parameters { key: "output_file_format" value { field_value { int_value: 5 } } } } downstream_nodes: "StatisticsGen" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv-schema" , pipeline_run_id='2024-05-08T09:36:10.555564', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Generating examples. WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features. INFO:absl:Processing input csv data /tmpfs/tmp/tfx-dataj_6ovg52/* to TFExample. WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be. INFO:absl:Examples generated. INFO:absl:Value type <class 'NoneType'> of key version in exec_properties is not supported, going to drop it INFO:absl:Value type <class 'list'> of key _beam_pipeline_args in exec_properties is not supported, going to drop it INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 1 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv-schema/CsvExampleGen/.system/stateful_working_dir/d65151e8-a6c7-4b12-8076-f56938dd89f4 INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'examples': [Artifact(artifact: uri: "pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1" custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "span" value { int_value: 0 } } , artifact_type: name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}) for execution 1 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component CsvExampleGen is finished. INFO:absl:Component StatisticsGen is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.statistics_gen.component.StatisticsGen" base_type: PROCESS } id: "StatisticsGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.StatisticsGen" } } } } inputs { inputs { key: "examples" value { channels { producer_node_query { id: "CsvExampleGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.CsvExampleGen" } } } artifact_query { type { name: "Examples" base_type: DATASET } } output_key: "examples" } min_count: 1 } } } outputs { outputs { key: "statistics" value { artifact_spec { type { name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } } upstream_nodes: "CsvExampleGen" downstream_nodes: "SchemaGen" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized WARNING:absl:ArtifactQuery.property_predicate is not supported. INFO:absl:[StatisticsGen] Resolved inputs: ({'examples': [Artifact(artifact: id: 1 type_id: 15 uri: "pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1" properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "file_format" value { string_value: "tfrecords_gzip" } } custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "payload_format" value { string_value: "FORMAT_TF_EXAMPLE" } } custom_properties { key: "span" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Examples" create_time_since_epoch: 1715160971690 last_update_time_since_epoch: 1715160971690 , artifact_type: id: 15 name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]},) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 2 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=2, input_dict={'examples': [Artifact(artifact: id: 1 type_id: 15 uri: "pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1" properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "file_format" value { string_value: "tfrecords_gzip" } } custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "payload_format" value { string_value: "FORMAT_TF_EXAMPLE" } } custom_properties { key: "span" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Examples" create_time_since_epoch: 1715160971690 last_update_time_since_epoch: 1715160971690 , artifact_type: id: 15 name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}, output_dict=defaultdict(<class 'list'>, {'statistics': [Artifact(artifact: uri: "pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2" , artifact_type: name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]}), exec_properties={'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv-schema/StatisticsGen/.system/executor_execution/2/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv-schema/StatisticsGen/.system/stateful_working_dir/3dc1ed50-c155-41f6-8457-b71a0b0ebe51', tmp_dir='pipelines/penguin-tfdv-schema/StatisticsGen/.system/executor_execution/2/.temp/', pipeline_node=node_info { type { name: "tfx.components.statistics_gen.component.StatisticsGen" base_type: PROCESS } id: "StatisticsGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.StatisticsGen" } } } } inputs { inputs { key: "examples" value { channels { producer_node_query { id: "CsvExampleGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.CsvExampleGen" } } } artifact_query { type { name: "Examples" base_type: DATASET } } output_key: "examples" } min_count: 1 } } } outputs { outputs { key: "statistics" value { artifact_spec { type { name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } } upstream_nodes: "CsvExampleGen" downstream_nodes: "SchemaGen" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv-schema" , pipeline_run_id='2024-05-08T09:36:10.555564', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Generating statistics for split train. INFO:absl:Statistics for split train written to pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2/Split-train. INFO:absl:Generating statistics for split eval. INFO:absl:Statistics for split eval written to pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2/Split-eval. INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 2 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv-schema/StatisticsGen/.system/stateful_working_dir/3dc1ed50-c155-41f6-8457-b71a0b0ebe51 INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'statistics': [Artifact(artifact: uri: "pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2" , artifact_type: name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]}) for execution 2 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component StatisticsGen is finished. INFO:absl:Component SchemaGen is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.schema_gen.component.SchemaGen" base_type: PROCESS } id: "SchemaGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.SchemaGen" } } } } inputs { inputs { key: "statistics" value { channels { producer_node_query { id: "StatisticsGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.StatisticsGen" } } } artifact_query { type { name: "ExampleStatistics" base_type: STATISTICS } } output_key: "statistics" } min_count: 1 } } } outputs { outputs { key: "schema" value { artifact_spec { type { name: "Schema" } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } parameters { key: "infer_feature_shape" value { field_value { int_value: 1 } } } } upstream_nodes: "StatisticsGen" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized WARNING:absl:ArtifactQuery.property_predicate is not supported. INFO:absl:[SchemaGen] Resolved inputs: ({'statistics': [Artifact(artifact: id: 2 type_id: 17 uri: "pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2" properties { key: "span" value { int_value: 0 } } properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "stats_dashboard_link" value { string_value: "" } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "ExampleStatistics" create_time_since_epoch: 1715160975131 last_update_time_since_epoch: 1715160975131 , artifact_type: id: 17 name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]},) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 3 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=3, input_dict={'statistics': [Artifact(artifact: id: 2 type_id: 17 uri: "pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2" properties { key: "span" value { int_value: 0 } } properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "stats_dashboard_link" value { string_value: "" } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "ExampleStatistics" create_time_since_epoch: 1715160975131 last_update_time_since_epoch: 1715160975131 , artifact_type: id: 17 name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]}, output_dict=defaultdict(<class 'list'>, {'schema': [Artifact(artifact: uri: "pipelines/penguin-tfdv-schema/SchemaGen/schema/3" , artifact_type: name: "Schema" )]}), exec_properties={'infer_feature_shape': 1, 'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv-schema/SchemaGen/.system/executor_execution/3/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv-schema/SchemaGen/.system/stateful_working_dir/9cb63ad1-17a3-4aaa-a3d3-5059f958bf6f', tmp_dir='pipelines/penguin-tfdv-schema/SchemaGen/.system/executor_execution/3/.temp/', pipeline_node=node_info { type { name: "tfx.components.schema_gen.component.SchemaGen" base_type: PROCESS } id: "SchemaGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.SchemaGen" } } } } inputs { inputs { key: "statistics" value { channels { producer_node_query { id: "StatisticsGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv-schema" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:10.555564" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv-schema.StatisticsGen" } } } artifact_query { type { name: "ExampleStatistics" base_type: STATISTICS } } output_key: "statistics" } min_count: 1 } } } outputs { outputs { key: "schema" value { artifact_spec { type { name: "Schema" } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } parameters { key: "infer_feature_shape" value { field_value { int_value: 1 } } } } upstream_nodes: "StatisticsGen" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv-schema" , pipeline_run_id='2024-05-08T09:36:10.555564', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Processing schema from statistics for split train. INFO:absl:Processing schema from statistics for split eval. INFO:absl:Schema written to pipelines/penguin-tfdv-schema/SchemaGen/schema/3/schema.pbtxt. INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 3 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv-schema/SchemaGen/.system/stateful_working_dir/9cb63ad1-17a3-4aaa-a3d3-5059f958bf6f INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'schema': [Artifact(artifact: uri: "pipelines/penguin-tfdv-schema/SchemaGen/schema/3" , artifact_type: name: "Schema" )]}) for execution 3 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component SchemaGen is finished.
如果管道成功完成,您应该看到“INFO:absl:Component SchemaGen is finished.”。
我们将检查管道的输出,以了解我们的数据集。
查看管道的输出
如之前的教程中所述,TFX 管道会生成两种输出,即工件和 元数据数据库 (MLMD),其中包含工件和管道执行的元数据。我们在上面的单元格中定义了这些输出的位置。默认情况下,工件存储在 pipelines
目录下,元数据存储为 metadata
目录下的 sqlite 数据库。
您可以使用 MLMD API 以编程方式定位这些输出。首先,我们将定义一些实用程序函数来搜索刚刚生成的输出工件。
from ml_metadata.proto import metadata_store_pb2
# Non-public APIs, just for showcase.
from tfx.orchestration.portable.mlmd import execution_lib
# TODO(b/171447278): Move these functions into the TFX library.
def get_latest_artifacts(metadata, pipeline_name, component_id):
"""Output artifacts of the latest run of the component."""
context = metadata.store.get_context_by_type_and_name(
'node', f'{pipeline_name}.{component_id}')
executions = metadata.store.get_executions_by_context(context.id)
latest_execution = max(executions,
key=lambda e:e.last_update_time_since_epoch)
return execution_lib.get_output_artifacts(metadata, latest_execution.id)
# Non-public APIs, just for showcase.
from tfx.orchestration.experimental.interactive import visualizations
def visualize_artifacts(artifacts):
"""Visualizes artifacts using standard visualization modules."""
for artifact in artifacts:
visualization = visualizations.get_registry().get_visualization(
artifact.type_name)
if visualization:
visualization.display(artifact)
from tfx.orchestration.experimental.interactive import standard_visualizations
standard_visualizations.register_standard_visualizations()
现在我们可以检查管道执行的输出。
# Non-public APIs, just for showcase.
from tfx.orchestration.metadata import Metadata
from tfx.types import standard_component_specs
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
SCHEMA_METADATA_PATH)
with Metadata(metadata_connection_config) as metadata_handler:
# Find output artifacts from MLMD.
stat_gen_output = get_latest_artifacts(metadata_handler, SCHEMA_PIPELINE_NAME,
'StatisticsGen')
stats_artifacts = stat_gen_output[standard_component_specs.STATISTICS_KEY]
schema_gen_output = get_latest_artifacts(metadata_handler,
SCHEMA_PIPELINE_NAME, 'SchemaGen')
schema_artifacts = schema_gen_output[standard_component_specs.SCHEMA_KEY]
INFO:absl:MetadataStore with DB connection initialized
现在该检查每个组件的输出。如上所述,Tensorflow 数据验证 (TFDV) 用于 StatisticsGen
和 SchemaGen
,TFDV 还提供这些组件输出的可视化。
在本教程中,我们将使用 TFX 中的辅助可视化方法(在内部使用 TFDV)来显示可视化。
检查 StatisticsGen 的输出
# docs-infra: no-execute
visualize_artifacts(stats_artifacts)
您可以看到输入数据的各种统计信息。这些统计信息被提供给 SchemaGen
,以自动构建数据的初始模式。
检查 SchemaGen 的输出
visualize_artifacts(schema_artifacts)
此模式是根据 StatisticsGen 的输出自动推断的。您应该能够看到 4 个 FLOAT 特征和 1 个 INT 特征。
导出模式以供将来使用
我们需要审查和细化生成的模式。审查后的模式需要持久化,以便在后续管道中用于 ML 模型训练。换句话说,您可能希望将模式文件添加到您的版本控制系统中以供实际使用。在本教程中,为了简单起见,我们将模式复制到预定义的文件系统路径。
import shutil
_schema_filename = 'schema.pbtxt'
SCHEMA_PATH = 'schema'
os.makedirs(SCHEMA_PATH, exist_ok=True)
_generated_path = os.path.join(schema_artifacts[0].uri, _schema_filename)
# Copy the 'schema.pbtxt' file from the artifact uri to a predefined path.
shutil.copy(_generated_path, SCHEMA_PATH)
'schema/schema.pbtxt'
模式文件使用 Protocol Buffer 文本格式 和 TensorFlow 元数据模式协议 的实例。
print(f'Schema at {SCHEMA_PATH}-----')
!cat {SCHEMA_PATH}/*
Schema at schema----- feature { name: "body_mass_g" type: FLOAT presence { min_fraction: 1.0 min_count: 1 } shape { dim { size: 1 } } } feature { name: "culmen_depth_mm" type: FLOAT presence { min_fraction: 1.0 min_count: 1 } shape { dim { size: 1 } } } feature { name: "culmen_length_mm" type: FLOAT presence { min_fraction: 1.0 min_count: 1 } shape { dim { size: 1 } } } feature { name: "flipper_length_mm" type: FLOAT presence { min_fraction: 1.0 min_count: 1 } shape { dim { size: 1 } } } feature { name: "species" type: INT presence { min_fraction: 1.0 min_count: 1 } shape { dim { size: 1 } } }
您应该确保根据需要审查并可能编辑模式定义。在本教程中,我们将直接使用生成的模式,不做任何更改。
验证输入示例并训练 ML 模型
我们将回到我们在 简单 TFX 管道教程 中创建的管道,训练 ML 模型并使用生成的模式编写模型训练代码。
我们还将添加一个 ExampleValidator 组件,该组件将根据模式查找传入数据集中的异常和缺失值。
编写模型训练代码
我们需要像在 简单 TFX 管道教程 中一样编写模型代码。
模型本身与之前的教程相同,但这次我们将使用从上一个管道生成的模式,而不是手动指定特征。大多数代码没有更改。唯一的区别是,我们不需要在此文件中指定特征的名称和类型。相反,我们从 schema 文件中读取它们。
_trainer_module_file = 'penguin_trainer.py'
%%writefile {_trainer_module_file}
from typing import List
from absl import logging
import tensorflow as tf
from tensorflow import keras
from tensorflow_transform.tf_metadata import schema_utils
from tfx import v1 as tfx
from tfx_bsl.public import tfxio
from tensorflow_metadata.proto.v0 import schema_pb2
# We don't need to specify _FEATURE_KEYS and _FEATURE_SPEC any more.
# Those information can be read from the given schema file.
_LABEL_KEY = 'species'
_TRAIN_BATCH_SIZE = 20
_EVAL_BATCH_SIZE = 10
def _input_fn(file_pattern: List[str],
data_accessor: tfx.components.DataAccessor,
schema: schema_pb2.Schema,
batch_size: int = 200) -> tf.data.Dataset:
"""Generates features and label for training.
Args:
file_pattern: List of paths or patterns of input tfrecord files.
data_accessor: DataAccessor for converting input to RecordBatch.
schema: schema of the input data.
batch_size: representing the number of consecutive elements of returned
dataset to combine in a single batch
Returns:
A dataset that contains (features, indices) tuple where features is a
dictionary of Tensors, and indices is a single Tensor of label indices.
"""
return data_accessor.tf_dataset_factory(
file_pattern,
tfxio.TensorFlowDatasetOptions(
batch_size=batch_size, label_key=_LABEL_KEY),
schema=schema).repeat()
def _build_keras_model(schema: schema_pb2.Schema) -> tf.keras.Model:
"""Creates a DNN Keras model for classifying penguin data.
Returns:
A Keras Model.
"""
# The model below is built with Functional API, please refer to
# https://tensorflowcn.cn/guide/keras/overview for all API options.
# ++ Changed code: Uses all features in the schema except the label.
feature_keys = [f.name for f in schema.feature if f.name != _LABEL_KEY]
inputs = [keras.layers.Input(shape=(1,), name=f) for f in feature_keys]
# ++ End of the changed code.
d = keras.layers.concatenate(inputs)
for _ in range(2):
d = keras.layers.Dense(8, activation='relu')(d)
outputs = keras.layers.Dense(3)(d)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(
optimizer=keras.optimizers.Adam(1e-2),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=[keras.metrics.SparseCategoricalAccuracy()])
model.summary(print_fn=logging.info)
return model
# TFX Trainer will call this function.
def run_fn(fn_args: tfx.components.FnArgs):
"""Train the model based on given args.
Args:
fn_args: Holds args used to train the model as name/value pairs.
"""
# ++ Changed code: Reads in schema file passed to the Trainer component.
schema = tfx.utils.parse_pbtxt_file(fn_args.schema_path, schema_pb2.Schema())
# ++ End of the changed code.
train_dataset = _input_fn(
fn_args.train_files,
fn_args.data_accessor,
schema,
batch_size=_TRAIN_BATCH_SIZE)
eval_dataset = _input_fn(
fn_args.eval_files,
fn_args.data_accessor,
schema,
batch_size=_EVAL_BATCH_SIZE)
model = _build_keras_model(schema)
model.fit(
train_dataset,
steps_per_epoch=fn_args.train_steps,
validation_data=eval_dataset,
validation_steps=fn_args.eval_steps)
# The result of the training should be saved in `fn_args.serving_model_dir`
# directory.
model.save(fn_args.serving_model_dir, save_format='tf')
Writing penguin_trainer.py
现在您已完成构建用于模型训练的 TFX 管道的所有准备步骤。
编写管道定义
我们将添加两个新组件,Importer
和 ExampleValidator
。Importer 将外部文件导入 TFX 管道。在本例中,它是一个包含模式定义的文件。ExampleValidator 将检查输入数据并验证所有输入数据是否符合我们提供的模式。
def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,
schema_path: str, module_file: str, serving_model_dir: str,
metadata_path: str) -> tfx.dsl.Pipeline:
"""Creates a pipeline using predefined schema with TFX."""
# Brings data into the pipeline.
example_gen = tfx.components.CsvExampleGen(input_base=data_root)
# Computes statistics over data for visualization and example validation.
statistics_gen = tfx.components.StatisticsGen(
examples=example_gen.outputs['examples'])
# NEW: Import the schema.
schema_importer = tfx.dsl.Importer(
source_uri=schema_path,
artifact_type=tfx.types.standard_artifacts.Schema).with_id(
'schema_importer')
# NEW: Performs anomaly detection based on statistics and data schema.
example_validator = tfx.components.ExampleValidator(
statistics=statistics_gen.outputs['statistics'],
schema=schema_importer.outputs['result'])
# Uses user-provided Python function that trains a model.
trainer = tfx.components.Trainer(
module_file=module_file,
examples=example_gen.outputs['examples'],
schema=schema_importer.outputs['result'], # Pass the imported schema.
train_args=tfx.proto.TrainArgs(num_steps=100),
eval_args=tfx.proto.EvalArgs(num_steps=5))
# Pushes the model to a filesystem destination.
pusher = tfx.components.Pusher(
model=trainer.outputs['model'],
push_destination=tfx.proto.PushDestination(
filesystem=tfx.proto.PushDestination.Filesystem(
base_directory=serving_model_dir)))
components = [
example_gen,
# NEW: Following three components were added to the pipeline.
statistics_gen,
schema_importer,
example_validator,
trainer,
pusher,
]
return tfx.dsl.Pipeline(
pipeline_name=pipeline_name,
pipeline_root=pipeline_root,
metadata_connection_config=tfx.orchestration.metadata
.sqlite_metadata_connection_config(metadata_path),
components=components)
运行管道
tfx.orchestration.LocalDagRunner().run(
_create_pipeline(
pipeline_name=PIPELINE_NAME,
pipeline_root=PIPELINE_ROOT,
data_root=DATA_ROOT,
schema_path=SCHEMA_PATH,
module_file=_trainer_module_file,
serving_model_dir=SERVING_MODEL_DIR,
metadata_path=METADATA_PATH))
INFO:absl:Excluding no splits because exclude_splits is not set. INFO:absl:Excluding no splits because exclude_splits is not set. INFO:absl:Generating ephemeral wheel package for '/tmpfs/src/temp/docs/tutorials/tfx/penguin_trainer.py' (including modules: ['penguin_trainer']). INFO:absl:User module package has hash fingerprint version 000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2. INFO:absl:Executing: ['/tmpfs/src/tf_docs_env/bin/python', '/tmpfs/tmp/tmpw96a2pj7/_tfx_generated_setup.py', 'bdist_wheel', '--bdist-dir', '/tmpfs/tmp/tmp42iap5mu', '--dist-dir', '/tmpfs/tmp/tmpx8p04zcg'] /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated. !! ******************************************************************************** Please avoid running ``setup.py`` directly. Instead, use pypa/build, pypa/installer or other standards-based tools. See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details. ******************************************************************************** !! self.initialize_options() INFO:absl:Successfully built user code wheel distribution at 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'; target user module is 'penguin_trainer'. INFO:absl:Full user module path is 'penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl' INFO:absl:Using deployment config: executor_specs { key: "CsvExampleGen" value { beam_executable_spec { python_executor_spec { class_path: "tfx.components.example_gen.csv_example_gen.executor.Executor" } } } } executor_specs { key: "ExampleValidator" value { python_class_executable_spec { class_path: "tfx.components.example_validator.executor.Executor" } } } executor_specs { key: "Pusher" value { python_class_executable_spec { class_path: "tfx.components.pusher.executor.Executor" } } } executor_specs { key: "StatisticsGen" value { beam_executable_spec { python_executor_spec { class_path: "tfx.components.statistics_gen.executor.Executor" } } } } executor_specs { key: "Trainer" value { python_class_executable_spec { class_path: "tfx.components.trainer.executor.GenericExecutor" } } } custom_driver_specs { key: "CsvExampleGen" value { python_class_executable_spec { class_path: "tfx.components.example_gen.driver.FileBasedDriver" } } } metadata_connection_config { database_connection_config { sqlite { filename_uri: "metadata/penguin-tfdv/metadata.db" connection_mode: READWRITE_OPENCREATE } } } INFO:absl:Using connection config: sqlite { filename_uri: "metadata/penguin-tfdv/metadata.db" connection_mode: READWRITE_OPENCREATE } INFO:absl:Component CsvExampleGen is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.example_gen.csv_example_gen.component.CsvExampleGen" } id: "CsvExampleGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.CsvExampleGen" } } } } outputs { outputs { key: "examples" value { artifact_spec { type { name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET } } } } } parameters { parameters { key: "input_base" value { field_value { string_value: "/tmpfs/tmp/tfx-dataj_6ovg52" } } } parameters { key: "input_config" value { field_value { string_value: "{\n \"splits\": [\n {\n \"name\": \"single_split\",\n \"pattern\": \"*\"\n }\n ]\n}" } } } parameters { key: "output_config" value { field_value { string_value: "{\n \"split_config\": {\n \"splits\": [\n {\n \"hash_buckets\": 2,\n \"name\": \"train\"\n },\n {\n \"hash_buckets\": 1,\n \"name\": \"eval\"\n }\n ]\n }\n}" } } } parameters { key: "output_data_format" value { field_value { int_value: 6 } } } parameters { key: "output_file_format" value { field_value { int_value: 5 } } } } downstream_nodes: "StatisticsGen" downstream_nodes: "Trainer" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized INFO:absl:[CsvExampleGen] Resolved inputs: ({},) running bdist_wheel running build running build_py creating build creating build/lib copying penguin_trainer.py -> build/lib installing to /tmpfs/tmp/tmp42iap5mu running install running install_lib copying build/lib/penguin_trainer.py -> /tmpfs/tmp/tmp42iap5mu running install_egg_info running egg_info creating tfx_user_code_Trainer.egg-info writing tfx_user_code_Trainer.egg-info/PKG-INFO writing dependency_links to tfx_user_code_Trainer.egg-info/dependency_links.txt writing top-level names to tfx_user_code_Trainer.egg-info/top_level.txt writing manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt' reading manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt' writing manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt' Copying tfx_user_code_Trainer.egg-info to /tmpfs/tmp/tmp42iap5mu/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3.9.egg-info running install_scripts creating /tmpfs/tmp/tmp42iap5mu/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/WHEEL creating '/tmpfs/tmp/tmpx8p04zcg/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl' and adding '/tmpfs/tmp/tmp42iap5mu' to it adding 'penguin_trainer.py' adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/METADATA' adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/WHEEL' adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/top_level.txt' adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/RECORD' removing /tmpfs/tmp/tmp42iap5mu INFO:absl:select span and version = (0, None) INFO:absl:latest span and version = (0, None) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 1 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=1, input_dict={}, output_dict=defaultdict(<class 'list'>, {'examples': [Artifact(artifact: uri: "pipelines/penguin-tfdv/CsvExampleGen/examples/1" custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "span" value { int_value: 0 } } , artifact_type: name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}), exec_properties={'output_file_format': 5, 'input_config': '{\n "splits": [\n {\n "name": "single_split",\n "pattern": "*"\n }\n ]\n}', 'output_data_format': 6, 'output_config': '{\n "split_config": {\n "splits": [\n {\n "hash_buckets": 2,\n "name": "train"\n },\n {\n "hash_buckets": 1,\n "name": "eval"\n }\n ]\n }\n}', 'input_base': '/tmpfs/tmp/tfx-dataj_6ovg52', 'span': 0, 'version': None, 'input_fingerprint': 'split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970'}, execution_output_uri='pipelines/penguin-tfdv/CsvExampleGen/.system/executor_execution/1/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/CsvExampleGen/.system/stateful_working_dir/0858d568-1a97-401a-afc6-a9932ff9a1e3', tmp_dir='pipelines/penguin-tfdv/CsvExampleGen/.system/executor_execution/1/.temp/', pipeline_node=node_info { type { name: "tfx.components.example_gen.csv_example_gen.component.CsvExampleGen" } id: "CsvExampleGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.CsvExampleGen" } } } } outputs { outputs { key: "examples" value { artifact_spec { type { name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET } } } } } parameters { parameters { key: "input_base" value { field_value { string_value: "/tmpfs/tmp/tfx-dataj_6ovg52" } } } parameters { key: "input_config" value { field_value { string_value: "{\n \"splits\": [\n {\n \"name\": \"single_split\",\n \"pattern\": \"*\"\n }\n ]\n}" } } } parameters { key: "output_config" value { field_value { string_value: "{\n \"split_config\": {\n \"splits\": [\n {\n \"hash_buckets\": 2,\n \"name\": \"train\"\n },\n {\n \"hash_buckets\": 1,\n \"name\": \"eval\"\n }\n ]\n }\n}" } } } parameters { key: "output_data_format" value { field_value { int_value: 6 } } } parameters { key: "output_file_format" value { field_value { int_value: 5 } } } } downstream_nodes: "StatisticsGen" downstream_nodes: "Trainer" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv" , pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Generating examples. INFO:absl:Processing input csv data /tmpfs/tmp/tfx-dataj_6ovg52/* to TFExample. INFO:absl:Examples generated. INFO:absl:Value type <class 'NoneType'> of key version in exec_properties is not supported, going to drop it INFO:absl:Value type <class 'list'> of key _beam_pipeline_args in exec_properties is not supported, going to drop it INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 1 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/CsvExampleGen/.system/stateful_working_dir/0858d568-1a97-401a-afc6-a9932ff9a1e3 INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'examples': [Artifact(artifact: uri: "pipelines/penguin-tfdv/CsvExampleGen/examples/1" custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "span" value { int_value: 0 } } , artifact_type: name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}) for execution 1 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component CsvExampleGen is finished. INFO:absl:Component schema_importer is running. INFO:absl:Running launcher for node_info { type { name: "tfx.dsl.components.common.importer.Importer" } id: "schema_importer" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.schema_importer" } } } } outputs { outputs { key: "result" value { artifact_spec { type { name: "Schema" } } } } } parameters { parameters { key: "artifact_uri" value { field_value { string_value: "schema" } } } parameters { key: "output_key" value { field_value { string_value: "result" } } } parameters { key: "reimport" value { field_value { int_value: 0 } } } } downstream_nodes: "ExampleValidator" downstream_nodes: "Trainer" execution_options { caching_options { } } INFO:absl:Running as an importer node. INFO:absl:MetadataStore with DB connection initialized INFO:absl:Processing source uri: schema, properties: {}, custom_properties: {} INFO:absl:Component schema_importer is finished. INFO:absl:Component StatisticsGen is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.statistics_gen.component.StatisticsGen" base_type: PROCESS } id: "StatisticsGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.StatisticsGen" } } } } inputs { inputs { key: "examples" value { channels { producer_node_query { id: "CsvExampleGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.CsvExampleGen" } } } artifact_query { type { name: "Examples" base_type: DATASET } } output_key: "examples" } min_count: 1 } } } outputs { outputs { key: "statistics" value { artifact_spec { type { name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } } upstream_nodes: "CsvExampleGen" downstream_nodes: "ExampleValidator" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized WARNING:absl:ArtifactQuery.property_predicate is not supported. INFO:absl:[StatisticsGen] Resolved inputs: ({'examples': [Artifact(artifact: id: 1 type_id: 15 uri: "pipelines/penguin-tfdv/CsvExampleGen/examples/1" properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "file_format" value { string_value: "tfrecords_gzip" } } custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "payload_format" value { string_value: "FORMAT_TF_EXAMPLE" } } custom_properties { key: "span" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Examples" create_time_since_epoch: 1715160976759 last_update_time_since_epoch: 1715160976759 , artifact_type: id: 15 name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]},) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 3 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=3, input_dict={'examples': [Artifact(artifact: id: 1 type_id: 15 uri: "pipelines/penguin-tfdv/CsvExampleGen/examples/1" properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "file_format" value { string_value: "tfrecords_gzip" } } custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "payload_format" value { string_value: "FORMAT_TF_EXAMPLE" } } custom_properties { key: "span" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Examples" create_time_since_epoch: 1715160976759 last_update_time_since_epoch: 1715160976759 , artifact_type: id: 15 name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}, output_dict=defaultdict(<class 'list'>, {'statistics': [Artifact(artifact: uri: "pipelines/penguin-tfdv/StatisticsGen/statistics/3" , artifact_type: name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]}), exec_properties={'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv/StatisticsGen/.system/executor_execution/3/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/StatisticsGen/.system/stateful_working_dir/110d150d-d7b8-4a54-9e4b-de96d2f275fe', tmp_dir='pipelines/penguin-tfdv/StatisticsGen/.system/executor_execution/3/.temp/', pipeline_node=node_info { type { name: "tfx.components.statistics_gen.component.StatisticsGen" base_type: PROCESS } id: "StatisticsGen" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.StatisticsGen" } } } } inputs { inputs { key: "examples" value { channels { producer_node_query { id: "CsvExampleGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.CsvExampleGen" } } } artifact_query { type { name: "Examples" base_type: DATASET } } output_key: "examples" } min_count: 1 } } } outputs { outputs { key: "statistics" value { artifact_spec { type { name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } } upstream_nodes: "CsvExampleGen" downstream_nodes: "ExampleValidator" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv" , pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Generating statistics for split train. INFO:absl:Statistics for split train written to pipelines/penguin-tfdv/StatisticsGen/statistics/3/Split-train. INFO:absl:Generating statistics for split eval. INFO:absl:Statistics for split eval written to pipelines/penguin-tfdv/StatisticsGen/statistics/3/Split-eval. INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 3 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/StatisticsGen/.system/stateful_working_dir/110d150d-d7b8-4a54-9e4b-de96d2f275fe INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'statistics': [Artifact(artifact: uri: "pipelines/penguin-tfdv/StatisticsGen/statistics/3" , artifact_type: name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]}) for execution 3 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component StatisticsGen is finished. INFO:absl:Component Trainer is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.trainer.component.Trainer" base_type: TRAIN } id: "Trainer" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.Trainer" } } } } inputs { inputs { key: "examples" value { channels { producer_node_query { id: "CsvExampleGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.CsvExampleGen" } } } artifact_query { type { name: "Examples" base_type: DATASET } } output_key: "examples" } min_count: 1 } } inputs { key: "schema" value { channels { producer_node_query { id: "schema_importer" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.schema_importer" } } } artifact_query { type { name: "Schema" } } output_key: "result" } } } } outputs { outputs { key: "model" value { artifact_spec { type { name: "Model" base_type: MODEL } } } } outputs { key: "model_run" value { artifact_spec { type { name: "ModelRun" } } } } } parameters { parameters { key: "custom_config" value { field_value { string_value: "null" } } } parameters { key: "eval_args" value { field_value { string_value: "{\n \"num_steps\": 5\n}" } } } parameters { key: "module_path" value { field_value { string_value: "penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl" } } } parameters { key: "train_args" value { field_value { string_value: "{\n \"num_steps\": 100\n}" } } } } upstream_nodes: "CsvExampleGen" upstream_nodes: "schema_importer" downstream_nodes: "Pusher" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized WARNING:absl:ArtifactQuery.property_predicate is not supported. WARNING:absl:ArtifactQuery.property_predicate is not supported. INFO:absl:[Trainer] Resolved inputs: ({'schema': [Artifact(artifact: id: 2 type_id: 17 uri: "schema" custom_properties { key: "is_external" value { int_value: 1 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Schema" create_time_since_epoch: 1715160976782 last_update_time_since_epoch: 1715160976782 , artifact_type: id: 17 name: "Schema" )], 'examples': [Artifact(artifact: id: 1 type_id: 15 uri: "pipelines/penguin-tfdv/CsvExampleGen/examples/1" properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "file_format" value { string_value: "tfrecords_gzip" } } custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "payload_format" value { string_value: "FORMAT_TF_EXAMPLE" } } custom_properties { key: "span" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Examples" create_time_since_epoch: 1715160976759 last_update_time_since_epoch: 1715160976759 , artifact_type: id: 15 name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]},) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 4 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=4, input_dict={'schema': [Artifact(artifact: id: 2 type_id: 17 uri: "schema" custom_properties { key: "is_external" value { int_value: 1 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Schema" create_time_since_epoch: 1715160976782 last_update_time_since_epoch: 1715160976782 , artifact_type: id: 17 name: "Schema" )], 'examples': [Artifact(artifact: id: 1 type_id: 15 uri: "pipelines/penguin-tfdv/CsvExampleGen/examples/1" properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "file_format" value { string_value: "tfrecords_gzip" } } custom_properties { key: "input_fingerprint" value { string_value: "split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "payload_format" value { string_value: "FORMAT_TF_EXAMPLE" } } custom_properties { key: "span" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Examples" create_time_since_epoch: 1715160976759 last_update_time_since_epoch: 1715160976759 , artifact_type: id: 15 name: "Examples" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } properties { key: "version" value: INT } base_type: DATASET )]}, output_dict=defaultdict(<class 'list'>, {'model': [Artifact(artifact: uri: "pipelines/penguin-tfdv/Trainer/model/4" , artifact_type: name: "Model" base_type: MODEL )], 'model_run': [Artifact(artifact: uri: "pipelines/penguin-tfdv/Trainer/model_run/4" , artifact_type: name: "ModelRun" )]}), exec_properties={'custom_config': 'null', 'train_args': '{\n "num_steps": 100\n}', 'eval_args': '{\n "num_steps": 5\n}', 'module_path': 'penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'}, execution_output_uri='pipelines/penguin-tfdv/Trainer/.system/executor_execution/4/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/Trainer/.system/stateful_working_dir/d3fcfa17-7c4e-4e63-a48c-deb3cc064ab5', tmp_dir='pipelines/penguin-tfdv/Trainer/.system/executor_execution/4/.temp/', pipeline_node=node_info { type { name: "tfx.components.trainer.component.Trainer" base_type: TRAIN } id: "Trainer" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.Trainer" } } } } inputs { inputs { key: "examples" value { channels { producer_node_query { id: "CsvExampleGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.CsvExampleGen" } } } artifact_query { type { name: "Examples" base_type: DATASET } } output_key: "examples" } min_count: 1 } } inputs { key: "schema" value { channels { producer_node_query { id: "schema_importer" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.schema_importer" } } } artifact_query { type { name: "Schema" } } output_key: "result" } } } } outputs { outputs { key: "model" value { artifact_spec { type { name: "Model" base_type: MODEL } } } } outputs { key: "model_run" value { artifact_spec { type { name: "ModelRun" } } } } } parameters { parameters { key: "custom_config" value { field_value { string_value: "null" } } } parameters { key: "eval_args" value { field_value { string_value: "{\n \"num_steps\": 5\n}" } } } parameters { key: "module_path" value { field_value { string_value: "penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl" } } } parameters { key: "train_args" value { field_value { string_value: "{\n \"num_steps\": 100\n}" } } } } upstream_nodes: "CsvExampleGen" upstream_nodes: "schema_importer" downstream_nodes: "Pusher" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv" , pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Train on the 'train' split when train_args.splits is not set. INFO:absl:Evaluate on the 'eval' split when eval_args.splits is not set. INFO:absl:udf_utils.get_fn {'custom_config': 'null', 'train_args': '{\n "num_steps": 100\n}', 'eval_args': '{\n "num_steps": 5\n}', 'module_path': 'penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'} 'run_fn' INFO:absl:Installing 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl' to a temporary directory. INFO:absl:Executing: ['/tmpfs/src/tf_docs_env/bin/python', '-m', 'pip', 'install', '--target', '/tmpfs/tmp/tmp8javf_ly', 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'] Processing ./pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl INFO:absl:Successfully installed 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'. INFO:absl:Training model. INFO:absl:Feature body_mass_g has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_depth_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature flipper_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature species has a shape dim { size: 1 } . Setting to DenseTensor. Installing collected packages: tfx-user-code-Trainer Successfully installed tfx-user-code-Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2 WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx_bsl/tfxio/tf_example_record.py:343: parse_example_dataset (from tensorflow.python.data.experimental.ops.parsing_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.map(tf.io.parse_example(...))` instead. WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx_bsl/tfxio/tf_example_record.py:343: parse_example_dataset (from tensorflow.python.data.experimental.ops.parsing_ops) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.data.Dataset.map(tf.io.parse_example(...))` instead. INFO:absl:Feature body_mass_g has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_depth_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature flipper_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature species has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature body_mass_g has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_depth_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature flipper_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature species has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature body_mass_g has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_depth_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature culmen_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature flipper_length_mm has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Feature species has a shape dim { size: 1 } . Setting to DenseTensor. INFO:absl:Model: "model" INFO:absl:__________________________________________________________________________________________________ INFO:absl: Layer (type) Output Shape Param # Connected to INFO:absl:================================================================================================== INFO:absl: body_mass_g (InputLayer) [(None, 1)] 0 [] INFO:absl: INFO:absl: culmen_depth_mm (InputLaye [(None, 1)] 0 [] INFO:absl: r) INFO:absl: INFO:absl: culmen_length_mm (InputLay [(None, 1)] 0 [] INFO:absl: er) INFO:absl: INFO:absl: flipper_length_mm (InputLa [(None, 1)] 0 [] INFO:absl: yer) INFO:absl: INFO:absl: concatenate (Concatenate) (None, 4) 0 ['body_mass_g[0][0]', INFO:absl: 'culmen_depth_mm[0][0]', INFO:absl: 'culmen_length_mm[0][0]', INFO:absl: 'flipper_length_mm[0][0]'] INFO:absl: INFO:absl: dense (Dense) (None, 8) 40 ['concatenate[0][0]'] INFO:absl: INFO:absl: dense_1 (Dense) (None, 8) 72 ['dense[0][0]'] INFO:absl: INFO:absl: dense_2 (Dense) (None, 3) 27 ['dense_1[0][0]'] INFO:absl: INFO:absl:================================================================================================== INFO:absl:Total params: 139 (556.00 Byte) INFO:absl:Trainable params: 139 (556.00 Byte) INFO:absl:Non-trainable params: 0 (0.00 Byte) INFO:absl:__________________________________________________________________________________________________ WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1715160986.539652 29188 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process. 100/100 [==============================] - 2s 5ms/step - loss: 0.5183 - sparse_categorical_accuracy: 0.8225 - val_loss: 0.1883 - val_sparse_categorical_accuracy: 0.9000 INFO:absl:Function `_wrapped_model` contains input name(s) resource with unsupported characters which will be renamed to model_dense_2_biasadd_readvariableop_resource in the SavedModel. INFO:tensorflow:Assets written to: pipelines/penguin-tfdv/Trainer/model/4/Format-Serving/assets INFO:tensorflow:Assets written to: pipelines/penguin-tfdv/Trainer/model/4/Format-Serving/assets INFO:absl:Writing fingerprint to pipelines/penguin-tfdv/Trainer/model/4/Format-Serving/fingerprint.pb INFO:absl:Training complete. Model written to pipelines/penguin-tfdv/Trainer/model/4/Format-Serving. ModelRun written to pipelines/penguin-tfdv/Trainer/model_run/4 INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 4 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/Trainer/.system/stateful_working_dir/d3fcfa17-7c4e-4e63-a48c-deb3cc064ab5 INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'model': [Artifact(artifact: uri: "pipelines/penguin-tfdv/Trainer/model/4" , artifact_type: name: "Model" base_type: MODEL )], 'model_run': [Artifact(artifact: uri: "pipelines/penguin-tfdv/Trainer/model_run/4" , artifact_type: name: "ModelRun" )]}) for execution 4 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component Trainer is finished. INFO:absl:Component ExampleValidator is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.example_validator.component.ExampleValidator" } id: "ExampleValidator" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.ExampleValidator" } } } } inputs { inputs { key: "schema" value { channels { producer_node_query { id: "schema_importer" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.schema_importer" } } } artifact_query { type { name: "Schema" } } output_key: "result" } min_count: 1 } } inputs { key: "statistics" value { channels { producer_node_query { id: "StatisticsGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.StatisticsGen" } } } artifact_query { type { name: "ExampleStatistics" base_type: STATISTICS } } output_key: "statistics" } min_count: 1 } } } outputs { outputs { key: "anomalies" value { artifact_spec { type { name: "ExampleAnomalies" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } } upstream_nodes: "StatisticsGen" upstream_nodes: "schema_importer" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized WARNING:absl:ArtifactQuery.property_predicate is not supported. WARNING:absl:ArtifactQuery.property_predicate is not supported. INFO:absl:[ExampleValidator] Resolved inputs: ({'schema': [Artifact(artifact: id: 2 type_id: 17 uri: "schema" custom_properties { key: "is_external" value { int_value: 1 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Schema" create_time_since_epoch: 1715160976782 last_update_time_since_epoch: 1715160976782 , artifact_type: id: 17 name: "Schema" )], 'statistics': [Artifact(artifact: id: 3 type_id: 19 uri: "pipelines/penguin-tfdv/StatisticsGen/statistics/3" properties { key: "span" value { int_value: 0 } } properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "stats_dashboard_link" value { string_value: "" } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "ExampleStatistics" create_time_since_epoch: 1715160979570 last_update_time_since_epoch: 1715160979570 , artifact_type: id: 19 name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]},) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 5 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=5, input_dict={'schema': [Artifact(artifact: id: 2 type_id: 17 uri: "schema" custom_properties { key: "is_external" value { int_value: 1 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Schema" create_time_since_epoch: 1715160976782 last_update_time_since_epoch: 1715160976782 , artifact_type: id: 17 name: "Schema" )], 'statistics': [Artifact(artifact: id: 3 type_id: 19 uri: "pipelines/penguin-tfdv/StatisticsGen/statistics/3" properties { key: "span" value { int_value: 0 } } properties { key: "split_names" value { string_value: "[\"train\", \"eval\"]" } } custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "stats_dashboard_link" value { string_value: "" } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "ExampleStatistics" create_time_since_epoch: 1715160979570 last_update_time_since_epoch: 1715160979570 , artifact_type: id: 19 name: "ExampleStatistics" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } base_type: STATISTICS )]}, output_dict=defaultdict(<class 'list'>, {'anomalies': [Artifact(artifact: uri: "pipelines/penguin-tfdv/ExampleValidator/anomalies/5" , artifact_type: name: "ExampleAnomalies" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } )]}), exec_properties={'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv/ExampleValidator/.system/executor_execution/5/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/ExampleValidator/.system/stateful_working_dir/37093a55-ba7a-42c7-a4fc-388b5f69b7d8', tmp_dir='pipelines/penguin-tfdv/ExampleValidator/.system/executor_execution/5/.temp/', pipeline_node=node_info { type { name: "tfx.components.example_validator.component.ExampleValidator" } id: "ExampleValidator" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.ExampleValidator" } } } } inputs { inputs { key: "schema" value { channels { producer_node_query { id: "schema_importer" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.schema_importer" } } } artifact_query { type { name: "Schema" } } output_key: "result" } min_count: 1 } } inputs { key: "statistics" value { channels { producer_node_query { id: "StatisticsGen" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.StatisticsGen" } } } artifact_query { type { name: "ExampleStatistics" base_type: STATISTICS } } output_key: "statistics" } min_count: 1 } } } outputs { outputs { key: "anomalies" value { artifact_spec { type { name: "ExampleAnomalies" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } } } } } } parameters { parameters { key: "exclude_splits" value { field_value { string_value: "[]" } } } } upstream_nodes: "StatisticsGen" upstream_nodes: "schema_importer" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv" , pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None) INFO:absl:Validating schema against the computed statistics for split train. INFO:absl:Anomalies alerts created for split train. INFO:absl:Validation complete for split train. Anomalies written to pipelines/penguin-tfdv/ExampleValidator/anomalies/5/Split-train. INFO:absl:Validating schema against the computed statistics for split eval. INFO:absl:Anomalies alerts created for split eval. INFO:absl:Validation complete for split eval. Anomalies written to pipelines/penguin-tfdv/ExampleValidator/anomalies/5/Split-eval. INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 5 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/ExampleValidator/.system/stateful_working_dir/37093a55-ba7a-42c7-a4fc-388b5f69b7d8 INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'anomalies': [Artifact(artifact: uri: "pipelines/penguin-tfdv/ExampleValidator/anomalies/5" , artifact_type: name: "ExampleAnomalies" properties { key: "span" value: INT } properties { key: "split_names" value: STRING } )]}) for execution 5 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component ExampleValidator is finished. INFO:absl:Component Pusher is running. INFO:absl:Running launcher for node_info { type { name: "tfx.components.pusher.component.Pusher" base_type: DEPLOY } id: "Pusher" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.Pusher" } } } } inputs { inputs { key: "model" value { channels { producer_node_query { id: "Trainer" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.Trainer" } } } artifact_query { type { name: "Model" base_type: MODEL } } output_key: "model" } } } } outputs { outputs { key: "pushed_model" value { artifact_spec { type { name: "PushedModel" base_type: MODEL } } } } } parameters { parameters { key: "custom_config" value { field_value { string_value: "null" } } } parameters { key: "push_destination" value { field_value { string_value: "{\n \"filesystem\": {\n \"base_directory\": \"serving_model/penguin-tfdv\"\n }\n}" } } } } upstream_nodes: "Trainer" execution_options { caching_options { } } INFO:absl:MetadataStore with DB connection initialized WARNING:absl:ArtifactQuery.property_predicate is not supported. INFO:absl:[Pusher] Resolved inputs: ({'model': [Artifact(artifact: id: 4 type_id: 21 uri: "pipelines/penguin-tfdv/Trainer/model/4" custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Model" create_time_since_epoch: 1715160988205 last_update_time_since_epoch: 1715160988205 , artifact_type: id: 21 name: "Model" base_type: MODEL )]},) INFO:absl:MetadataStore with DB connection initialized INFO:absl:Going to run a new execution 6 INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=6, input_dict={'model': [Artifact(artifact: id: 4 type_id: 21 uri: "pipelines/penguin-tfdv/Trainer/model/4" custom_properties { key: "is_external" value { int_value: 0 } } custom_properties { key: "tfx_version" value { string_value: "1.15.0" } } state: LIVE type: "Model" create_time_since_epoch: 1715160988205 last_update_time_since_epoch: 1715160988205 , artifact_type: id: 21 name: "Model" base_type: MODEL )]}, output_dict=defaultdict(<class 'list'>, {'pushed_model': [Artifact(artifact: uri: "pipelines/penguin-tfdv/Pusher/pushed_model/6" , artifact_type: name: "PushedModel" base_type: MODEL )]}), exec_properties={'custom_config': 'null', 'push_destination': '{\n "filesystem": {\n "base_directory": "serving_model/penguin-tfdv"\n }\n}'}, execution_output_uri='pipelines/penguin-tfdv/Pusher/.system/executor_execution/6/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/Pusher/.system/stateful_working_dir/5146eed9-4c4d-4bef-b849-ce87e44956ad', tmp_dir='pipelines/penguin-tfdv/Pusher/.system/executor_execution/6/.temp/', pipeline_node=node_info { type { name: "tfx.components.pusher.component.Pusher" base_type: DEPLOY } id: "Pusher" } contexts { contexts { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } contexts { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } contexts { type { name: "node" } name { field_value { string_value: "penguin-tfdv.Pusher" } } } } inputs { inputs { key: "model" value { channels { producer_node_query { id: "Trainer" } context_queries { type { name: "pipeline" } name { field_value { string_value: "penguin-tfdv" } } } context_queries { type { name: "pipeline_run" } name { field_value { string_value: "2024-05-08T09:36:15.816321" } } } context_queries { type { name: "node" } name { field_value { string_value: "penguin-tfdv.Trainer" } } } artifact_query { type { name: "Model" base_type: MODEL } } output_key: "model" } } } } outputs { outputs { key: "pushed_model" value { artifact_spec { type { name: "PushedModel" base_type: MODEL } } } } } parameters { parameters { key: "custom_config" value { field_value { string_value: "null" } } } parameters { key: "push_destination" value { field_value { string_value: "{\n \"filesystem\": {\n \"base_directory\": \"serving_model/penguin-tfdv\"\n }\n}" } } } } upstream_nodes: "Trainer" execution_options { caching_options { } } , pipeline_info=id: "penguin-tfdv" , pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None) WARNING:absl:Pusher is going to push the model without validation. Consider using Evaluator or InfraValidator in your pipeline. INFO:absl:Model version: 1715160988 INFO:absl:Model written to serving path serving_model/penguin-tfdv/1715160988. INFO:absl:Model pushed to pipelines/penguin-tfdv/Pusher/pushed_model/6. INFO:absl:Cleaning up stateless execution info. INFO:absl:Execution 6 succeeded. INFO:absl:Cleaning up stateful execution info. INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/Pusher/.system/stateful_working_dir/5146eed9-4c4d-4bef-b849-ce87e44956ad INFO:absl:Publishing output artifacts defaultdict(<class 'list'>, {'pushed_model': [Artifact(artifact: uri: "pipelines/penguin-tfdv/Pusher/pushed_model/6" , artifact_type: name: "PushedModel" base_type: MODEL )]}) for execution 6 INFO:absl:MetadataStore with DB connection initialized INFO:absl:Component Pusher is finished.
如果管道成功完成,您应该看到“INFO:absl:Component Pusher is finished.”。
检查管道的输出
我们已经训练了企鹅的分类模型,并且还在 ExampleValidator 组件中验证了输入示例。我们可以像分析上一个管道一样分析 ExampleValidator 的输出。
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
METADATA_PATH)
with Metadata(metadata_connection_config) as metadata_handler:
ev_output = get_latest_artifacts(metadata_handler, PIPELINE_NAME,
'ExampleValidator')
anomalies_artifacts = ev_output[standard_component_specs.ANOMALIES_KEY]
INFO:absl:MetadataStore with DB connection initialized
ExampleValidator 中的 ExampleAnomalies 也可以可视化。
visualize_artifacts(anomalies_artifacts)
您应该看到每个示例拆分的“未发现异常”。因为我们使用了与在此管道中用于模式生成相同的数据,所以这里预计不会出现异常。如果您使用新的传入数据重复运行此管道,ExampleValidator 应该能够找到新数据与现有模式之间的任何差异。
如果发现任何异常,您可以查看您的数据,以检查是否有任何示例不符合您的假设。来自其他组件(如 StatisticsGen)的输出可能会有用。但是,发现的任何异常都不会阻止进一步的管道执行。
后续步骤
您可以在 https://tensorflowcn.cn/tfx/tutorials 上找到更多资源
请参阅 了解 TFX 管道,以详细了解 TFX 中的各种概念。