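TensorFlow Model Analysis Metrics and Plots

Overview

TFMA supports the following metrics and plots:

Standard Keras metrics (tf.keras.metrics.*)
- Note that you do not need a Keras model to use Keras metrics. Metrics are computed outside of the graph in Beam, using the metrics classes directly.
Standard TFMA metrics and plots (tfma.metrics.*)
Custom Keras metrics (metrics derived from tf.keras.metrics.Metric)
Custom TFMA metrics (metrics derived from tfma.metrics.Metric, using custom Beam combiners or metrics derived from other metrics).

TFMA also provides built-in support for converting binary classification metrics for use with multi-class/multi-label problems:

Binarization based on class ID, top K, etc.
Aggregated metrics based on micro averaging, macro averaging, etc.

TFMA also provides built-in support for query/ranking based metrics, where the examples are grouped by a query key automatically in the pipeline.

Combined, there are over 50 standard metrics and plots available for a variety of problems, including regression, binary classification, multi-class/multi-label classification, ranking, etc.

Configuration

There are two ways to configure metrics in TFMA: (1) using tfma.MetricsSpec, or (2) by creating instances of tf.keras.metrics.* and/or tfma.metrics.* classes in Python and using tfma.metrics.specs_from_metrics to convert them to a list of tfma.MetricsSpec.

The following sections describe example configurations for different types of machine learning problems.

Regression Metrics

The following is an example configuration setup for a regression problem. Consult the tf.keras.metrics.* and tfma.metrics.* modules for other metrics that may be supported.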

import tensorflow_model_analysis as tfma
from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "MeanSquaredError" }
    metrics { class_name: "Accuracy" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics {
      class_name: "CalibrationPlot"
      config: '"min_value": 0, "max_value": 10'
    }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

import tensorflow as tf
import tensorflow_model_analysis as tfma

metrics = [
    tfma.metrics.ExampleCount(name='example_count'),
    tf.keras.metrics.MeanSquaredError(name='mse'),
    tf.keras.metrics.Accuracy(name='accuracy'),
    tfma.metrics.MeanLabel(name='mean_label'),
    tfma.metrics.MeanPrediction(name='mean_prediction'),
    tfma.metrics.Calibration(name='calibration'),
    tfma.metrics.CalibrationPlot(
        name='calibration', min_value=0, max_value=10)
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)

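Note that this setup is also available by calling tfma.metrics.default_regression_specs.

Binary Classification Metrics

The following is an example configuration setup for a binary classification problem. Consult the tf.keras.metrics.* and tfma.metrics.* modules for other metrics that may be supported.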

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "BinaryCrossentropy" }
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "AUC" }
    metrics { class_name: "AUCPrecisionRecall" }
    metrics { class_name: "MeanLabel" }
    metrics { class_name: "MeanPrediction" }
    metrics { class_name: "Calibration" }
    metrics { class_name: "ConfusionMatrixPlot" }
    metrics { class_name: "CalibrationPlot" }
  }
""", tfma.EvalConfig()).metrics_specs

A similar setup, which additionally computes Precision and Recall, can be created using the following Python code:

metrics = [
    tfma.metrics.ExampleCount(name='example_count'),
    tf.keras.metrics.BinaryCrossentropy(name='binary_crossentropy'),
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    tf.keras.metrics.AUC(
        name='auc_precision_recall', curve='PR', num_thresholds=10000),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
    tfma.metrics.MeanLabel(name='mean_label'),
    tfma.metrics.MeanPrediction(name='mean_prediction'),
    tfma.metrics.Calibration(name='calibration'),
    tfma.metrics.ConfusionMatrixPlot(name='confusion_matrix_plot'),
    tfma.metrics.CalibrationPlot(name='calibration_plot')
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)

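Note that this setup is also available by calling tfma.metrics.default_binary_classification_specs.

Multi-class/Multi-label Classification Metrics

The following is an example configuration setup for a multi-class classification problem. Consult the tf.keras.metrics.* and tfma.metrics.* modules for other metrics that may be supported.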

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "SparseCategoricalCrossentropy" }
    metrics { class_name: "SparseCategoricalAccuracy" }
    metrics { class_name: "Precision" config: '"top_k": 1' }
    metrics { class_name: "Precision" config: '"top_k": 3' }
    metrics { class_name: "Recall" config: '"top_k": 1' }
    metrics { class_name: "Recall" config: '"top_k": 3' }
    metrics { class_name: "MultiClassConfusionMatrixPlot" }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    tfma.metrics.ExampleCount(name='example_count'),
    tf.keras.metrics.SparseCategoricalCrossentropy(
        name='sparse_categorical_crossentropy'),
    tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision', top_k=1),
    tf.keras.metrics.Precision(name='precision', top_k=3),
    tf.keras.metrics.Recall(name='recall', top_k=1),
    tf.keras.metrics.Recall(name='recall', top_k=3),
    tfma.metrics.MultiClassConfusionMatrixPlot(
        name='multi_class_confusion_matrix_plot'),
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)

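Note that this setup is also available by calling tfma.metrics.default_multi_class_classification_specs.

Multi-class/Multi-label Binarized Metrics

Multi-class/multi-label metrics can be binarized to produce metrics per class, per top_k, etc. using tfma.BinarizationOptions. For example: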

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    binarize: { class_ids: { values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] } }
    # Metrics to binarize
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    # Metrics to binarize
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, binarize=tfma.BinarizationOptions(
        class_ids={'values': [0,1,2,3,4,5,6,7,8,9]}))
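
To make the effect of binarization more concrete, here is a minimal pure-Python sketch of the idea. It does not use TFMA and is not how TFMA implements it: for each class ID, the multi-class labels and per-class scores are reduced to a binary problem that a binary metric such as AUC could then consume. The helper name binarize_by_class_id and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def binarize_by_class_id(
    labels: Sequence[int],
    scores: Sequence[Sequence[float]],
    class_ids: Sequence[int],
) -> Dict[int, List[Tuple[int, float]]]:
  """Returns, per class ID, a list of (binary_label, score) pairs."""
  result = {}
  for class_id in class_ids:
    result[class_id] = [
        # The label becomes 1 only when the example belongs to this class.
        (1 if label == class_id else 0, example_scores[class_id])
        for label, example_scores in zip(labels, scores)
    ]
  return result


# Hypothetical data: three examples, three classes.
labels = [0, 2, 1]
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
binarized = binarize_by_class_id(labels, scores, class_ids=[0, 1, 2])
for class_id, pairs in binarized.items():
  print(class_id, pairs)
```

Each entry of the result is an ordinary binary classification dataset, which is why one metric definition can yield one value per class.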

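Multi-class/Multi-label Aggregate Metrics

Multi-class/multi-label metrics can be aggregated to produce a single aggregated value for a binary classification metric by using tfma.AggregationOptions.

Note that aggregation settings are independent of binarization settings, so you can use both tfma.AggregationOptions and tfma.BinarizationOptions at the same time.

Micro Average

Micro averaging can be performed by using the micro_average option within tfma.AggregationOptions. For example: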

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: { micro_average: true }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, aggregate=tfma.AggregationOptions(micro_average=True))
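
As a conceptual illustration of micro averaging (not TFMA's implementation), the sketch below pools the decisions for every (example, class) pair before computing a single precision value. The function name, the 0.5 threshold, and the sample data are assumptions chosen for the example.

```python
from typing import Sequence


def micro_average_precision(
    labels: Sequence[int],
    scores: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> float:
  """Pools true/false positives over every (example, class) pair."""
  true_positives = 0
  false_positives = 0
  for label, example_scores in zip(labels, scores):
    for class_id, score in enumerate(example_scores):
      if score >= threshold:
        if class_id == label:
          true_positives += 1
        else:
          false_positives += 1
  predicted_positives = true_positives + false_positives
  return true_positives / predicted_positives if predicted_positives else 0.0


# Hypothetical data: four examples, three classes.
labels = [0, 2, 1, 1]
scores = [
    [0.7, 0.2, 0.1],  # class 0 passes the threshold, correct
    [0.1, 0.3, 0.6],  # class 2 passes the threshold, correct
    [0.6, 0.3, 0.1],  # class 0 passes the threshold, wrong
    [0.2, 0.5, 0.3],  # class 1 passes the threshold, correct
]
micro_precision = micro_average_precision(labels, scores)
print(micro_precision)
```

Because the counts are pooled first, classes with more examples contribute more to the result, which is the main difference from macro averaging.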

Micro averaging also supports setting top_k, where only the top k values are used in the computation. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: {
      micro_average: true
      top_k_list: { values: [1, 3] }
    }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics,
    aggregate=tfma.AggregationOptions(micro_average=True,
                                      top_k_list={'values': [1, 3]}))

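Macro / Weighted Macro Average

Macro averaging can be performed by using the macro_average or weighted_macro_average options within tfma.AggregationOptions. Unless top_k settings are used, macro averaging requires setting class_weights in order to know which classes to compute the average for. If a class_weight is not provided, then 0.0 is assumed. For example: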

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: {
      macro_average: true
      class_weights: { key: 0 value: 1.0 }
      class_weights: { key: 1 value: 1.0 }
      class_weights: { key: 2 value: 1.0 }
      class_weights: { key: 3 value: 1.0 }
      class_weights: { key: 4 value: 1.0 }
      class_weights: { key: 5 value: 1.0 }
      class_weights: { key: 6 value: 1.0 }
      class_weights: { key: 7 value: 1.0 }
      class_weights: { key: 8 value: 1.0 }
      class_weights: { key: 9 value: 1.0 }
    }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics,
    aggregate=tfma.AggregationOptions(
        macro_average=True, class_weights={i: 1.0 for i in range(10)}))
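
The role of class_weights can be sketched in a few lines of plain Python. This is a conceptual illustration under stated assumptions, not TFMA's code: a metric is first computed per class, and the per-class values are then combined with the given weights, where a class without a weight counts as 0.0. The function name and the per-class AUC values are hypothetical.

```python
from typing import Dict


def macro_average(per_class_values: Dict[int, float],
                  class_weights: Dict[int, float]) -> float:
  """Weighted mean of per-class values; missing weights count as 0.0."""
  total_weight = 0.0
  weighted_sum = 0.0
  for class_id, value in per_class_values.items():
    weight = class_weights.get(class_id, 0.0)
    weighted_sum += weight * value
    total_weight += weight
  return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical per-class AUC values.
per_class_auc = {0: 0.9, 1: 0.7, 2: 0.5}

# Equal weights give a plain average over the listed classes.
all_classes = macro_average(per_class_auc, {0: 1.0, 1: 1.0, 2: 1.0})
# Class 2 has no weight, so it is treated as 0.0 and drops out.
two_classes = macro_average(per_class_auc, {0: 1.0, 1: 1.0})
print(all_classes, two_classes)
```

This also shows why omitting class_weights entirely is a problem when top_k is not used: every class would have weight 0.0 and there would be nothing to average.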

Like micro averaging, macro averaging also supports setting top_k, where only the top k values are used in the computation. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    aggregate: {
      macro_average: true
      top_k_list: { values: [1, 3] }
    }
    # Metrics to aggregate
    metrics { class_name: "AUC" }
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    # Metrics to aggregate
    tf.keras.metrics.AUC(name='auc', num_thresholds=10000),
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics,
    aggregate=tfma.AggregationOptions(macro_average=True,
                                      top_k_list={'values': [1, 3]}))

Query / Ranking Based Metrics

Query/ranking based metrics are enabled by specifying the query_key option in the metrics specs. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    query_key: "doc_id"
    metrics {
      class_name: "NDCG"
      config: '"gain_key": "gain", "top_k_list": [1, 2]'
    }
    metrics { class_name: "MinLabelPosition" }
  }
""", tfma.EvalConfig()).metrics_specs

The same setup can be created using the following Python code:

metrics = [
    tfma.metrics.NDCG(name='ndcg', gain_key='gain', top_k_list=[1, 2]),
    tfma.metrics.MinLabelPosition(name='min_label_position')
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics, query_key='doc_id')
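
The grouping step behind query-based metrics can be illustrated without TFMA. The sketch below groups rows by a query key and computes the textbook NDCG@k (log2 discount) per query. It is an assumption that this matches TFMA's NDCG in every detail, so treat it as an illustration of the idea only. The row layout and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def dcg_at_k(gains_in_ranked_order: Sequence[float], k: int) -> float:
  """Discounted cumulative gain over the first k positions."""
  return sum(
      gain / math.log2(position + 2)
      for position, gain in enumerate(gains_in_ranked_order[:k]))


def ndcg_at_k(examples: Sequence[Tuple[float, float]], k: int) -> float:
  """examples holds (prediction, gain) pairs that belong to one query."""
  by_prediction = [g for _, g in sorted(examples, key=lambda e: -e[0])]
  ideal = sorted((g for _, g in examples), reverse=True)
  ideal_dcg = dcg_at_k(ideal, k)
  return dcg_at_k(by_prediction, k) / ideal_dcg if ideal_dcg else 0.0


# Hypothetical rows: (query key, prediction, gain).
rows = [
    ('q1', 0.9, 0.0), ('q1', 0.5, 1.0),
    ('q2', 0.8, 1.0), ('q2', 0.1, 0.0),
]

# Group the examples by query key before computing the metric.
grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
for query, prediction, gain in rows:
  grouped[query].append((prediction, gain))

per_query = {q: ndcg_at_k(examples, k=1) for q, examples in grouped.items()}
print(per_query)
```

Query q1 ranks its only relevant item second, so its NDCG@1 is 0, while q2 ranks its relevant item first.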

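Multi-model Evaluation Metrics

TFMA supports evaluating multiple models at the same time. When multi-model evaluation is performed, metrics will be calculated for each model. For example: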

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    # no model_names means all models
    ...
  }
""", tfma.EvalConfig()).metrics_specs

If metrics need to be computed for a subset of models, set model_names in the metrics_specs. For example:

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    model_names: ["my-model1"]
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The specs_from_metrics API also supports passing model names:

metrics = [
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, model_names=['my-model1'])

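Model Comparison Metrics

TFMA supports evaluating comparison metrics for a candidate model against a baseline model. A simple way to set up the candidate and baseline model pair is to pass along eval_shared_model entries with the proper model names (tfma.BASELINE_KEY and tfma.CANDIDATE_KEY):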


from google.protobuf import text_format

eval_config = text_format.Parse("""
  model_specs {
    # ... model_spec without names ...
  }
  metrics_specs {
    # ... metrics ...
  }
""", tfma.EvalConfig())

eval_shared_models = [
  tfma.default_eval_shared_model(
      model_name=tfma.CANDIDATE_KEY,
      eval_saved_model_path='/path/to/saved/candidate/model',
      eval_config=eval_config),
  tfma.default_eval_shared_model(
      model_name=tfma.BASELINE_KEY,
      eval_saved_model_path='/path/to/saved/baseline/model',
      eval_config=eval_config),
]

eval_result = tfma.run_model_analysis(
    eval_shared_models,
    eval_config=eval_config,
    # This assumes your data is a TFRecords file containing records in the
    # tf.train.Example format.
    data_location="/path/to/file/containing/tfrecords",
    output_path="/path/for/output")

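Comparison metrics will be computed automatically for all of the diff-able metrics (currently only scalar-valued metrics such as accuracy and AUC).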
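
The restriction to scalar-valued metrics can be illustrated with a short pure-Python sketch. It assumes that a comparison metric is the candidate value minus the baseline value, and it uses a hypothetical "_diff" suffix for the result names. Neither the sign convention nor the naming is taken from TFMA, so treat both as assumptions.

```python
from numbers import Real
from typing import Any, Dict


def diff_scalar_metrics(baseline: Dict[str, Any],
                        candidate: Dict[str, Any]) -> Dict[str, float]:
  """Returns candidate - baseline for metrics that are scalars in both."""
  diffs = {}
  for name, candidate_value in candidate.items():
    baseline_value = baseline.get(name)
    if isinstance(candidate_value, Real) and isinstance(baseline_value, Real):
      diffs[name + '_diff'] = candidate_value - baseline_value
  return diffs


# Hypothetical results for the two models.
baseline = {'accuracy': 0.80, 'auc': 0.90, 'confusion_matrix': [[5, 1], [2, 4]]}
candidate = {'accuracy': 0.85, 'auc': 0.88, 'confusion_matrix': [[6, 0], [2, 4]]}

diffs = diff_scalar_metrics(baseline, candidate)
print(diffs)
```

The confusion matrix is skipped because it is not a scalar, which mirrors the limitation described above.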

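Multi-output Model Metrics

TFMA supports evaluating metrics on models that have different outputs. Multi-output models store their output predictions in the form of a dict keyed by output name. When multi-output models are used, the names of the outputs associated with a set of metrics must be specified in the output_names section of the MetricsSpec. For example: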

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    output_names: ["my-output"]
    ...
  }
""", tfma.EvalConfig()).metrics_specs

The specs_from_metrics API also supports passing output names:

metrics = [
    ...
]
metrics_specs = tfma.metrics.specs_from_metrics(
    metrics, output_names=['my-output'])

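Customizing Metric Settings

TFMA allows customizing the settings that are used with different metrics. For example, you might want to change the name, set thresholds, etc. This is done by adding a config section to the metric config. The config is specified using the JSON string version of the parameters that would be passed to the metric's __init__ method (for ease of use, the leading and trailing '{' and '}' brackets may be omitted). For example: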

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics {
      class_name: "ConfusionMatrixAtThresholds"
      config: '"thresholds": [0.3, 0.5, 0.8]'
    }
  }
""", tfma.EvalConfig()).metrics_specs

This customization is of course also supported directly:

metrics = [
    tfma.metrics.ConfusionMatrixAtThresholds(thresholds=[0.3, 0.5, 0.8]),
]
metrics_specs = tfma.metrics.specs_from_metrics(metrics)
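
The convention that the outer braces of the config string may be omitted can be sketched as follows. This is an illustrative helper, not TFMA's parser. The function name parse_metric_config is hypothetical, and the idea is simply to restore the braces when they are missing and then decode the JSON into keyword arguments.

```python
import json
from typing import Any, Dict


def parse_metric_config(config: str) -> Dict[str, Any]:
  """Parses a config string whose outer braces may have been omitted."""
  text = config.strip()
  if not text:
    return {}
  if not text.startswith('{'):
    # Restore the braces that were left out for convenience.
    text = '{' + text + '}'
  return json.loads(text)


kwargs = parse_metric_config('"thresholds": [0.3, 0.5, 0.8]')
print(kwargs)
```

The resulting dict holds the same keyword arguments that would be passed to the metric's __init__ method when constructing it directly in Python.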

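Outputs

The output of a metric evaluation is a series of metric keys/values and/or plot keys/values based on the configuration used.

Metric Keys

MetricKeys are defined using a structured key type. This key uniquely identifies each of the following aspects of a metric:

Metric name (auc, mean_label, etc.)
Model name (only used in multi-model evaluation)
Output name (only used when multi-output models are evaluated)
Sub key (e.g. class ID if a multi-class model is binarized)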
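
A structured key of this kind can be pictured with a small stand-in type. The class below is hypothetical and much simpler than the real key type (the sub key is reduced to a single class ID). It only shows why metrics that share a name remain distinct once a model name, output name, or sub key is attached.

```python
from typing import NamedTuple, Optional


class SketchMetricKey(NamedTuple):
  """Hypothetical stand-in for a structured metric key."""
  name: str
  model_name: str = ''
  output_name: str = ''
  class_id: Optional[int] = None  # Simplified sub key.


# Hypothetical metric values keyed by the structured key.
results = {
    SketchMetricKey(name='auc', class_id=0): 0.91,
    SketchMetricKey(name='auc', class_id=1): 0.84,
    SketchMetricKey(name='mean_label'): 0.37,
}

# Keys that share a name stay distinct because the sub key differs.
auc_by_class = {
    key.class_id: value for key, value in results.items() if key.name == 'auc'
}
print(auc_by_class)
```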

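Metric Values

MetricValues are defined using a proto that encapsulates the different value types supported by the different metrics (e.g. double, ConfusionMatrixAtThresholds, etc.).

Below are the supported metric value types:

double_value - A wrapper for a double type.
bytes_value - A bytes value.
bounded_value - Represents a real value which could be a pointwise estimate, optionally with approximate bounds of some sort. Has properties value, lower_bound, and upper_bound.
value_at_cutoffs - Value at cutoffs (e.g. precision@K, recall@K). Has property values, each of which has properties cutoff and value.
confusion_matrix_at_thresholds - Confusion matrix at thresholds. Has property matrices, each of which has properties for threshold, precision, recall, and confusion matrix values such as false_negatives.
array_value - For metrics which return an array of values.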
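
To give a feel for what a confusion_matrix_at_thresholds value contains, the sketch below computes one record per threshold in plain Python. It is not the TFMA implementation. The dict field names follow the properties listed above, while the strict "greater than" comparison against the threshold and the sample data are assumptions.

```python
from typing import Dict, List, Sequence


def confusion_matrix_at_thresholds(
    labels: Sequence[int],
    predictions: Sequence[float],
    thresholds: Sequence[float],
) -> List[Dict[str, float]]:
  """Builds one confusion-matrix record per threshold."""
  matrices = []
  for threshold in thresholds:
    tp = fp = tn = fn = 0
    for label, prediction in zip(labels, predictions):
      predicted_positive = prediction > threshold  # Assumed convention.
      if predicted_positive and label == 1:
        tp += 1
      elif predicted_positive:
        fp += 1
      elif label == 1:
        fn += 1
      else:
        tn += 1
    matrices.append({
        'threshold': threshold,
        'true_positives': tp,
        'false_positives': fp,
        'true_negatives': tn,
        'false_negatives': fn,
        'precision': tp / (tp + fp) if tp + fp else float('nan'),
        'recall': tp / (tp + fn) if tp + fn else float('nan'),
    })
  return matrices


# Hypothetical binary labels and predictions.
labels = [1, 0, 1, 1, 0]
predictions = [0.9, 0.4, 0.6, 0.2, 0.7]
matrices = confusion_matrix_at_thresholds(labels, predictions, [0.3, 0.5, 0.8])
for matrix in matrices:
  print(matrix)
```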

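Plot Keys

PlotKeys are similar to metric keys except that, for historical reasons, all the plot values are stored in a single proto, so the plot key does not have a name.

Plot Values

All the supported plots are stored in a single proto called PlotData.

EvalResult

The return value from an evaluation run is a tfma.EvalResult. This record contains slicing_metrics, which encodes the metric key as a multi-level dict where the levels correspond to output name, class ID, metric name, and metric value respectively. This is intended to be used for UI display in a Jupyter notebook. If access to the underlying data is needed, the metrics result file should be used instead (see metrics_for_slice.proto).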
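
A multi-level dict of this shape can be walked with three nested loops. The sample below is hypothetical data built to follow the level order described above (output name, class ID, metric name, metric value). The exact key spellings, such as 'classId:1' and 'doubleValue', are placeholders and should not be read as the guaranteed layout of a real tfma.EvalResult.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical data: output name -> class ID -> metric name -> metric value.
nested_metrics: Dict[str, Dict[str, Dict[str, Dict[str, Any]]]] = {
    '': {
        '': {
            'auc': {'doubleValue': 0.91},
            'example_count': {'doubleValue': 1000.0},
        },
        'classId:1': {
            'auc': {'doubleValue': 0.84},
        },
    },
}


def flatten(nested) -> Iterator[Tuple[str, str, str, Any]]:
  """Yields (output_name, sub_key, metric_name, value) for every leaf."""
  for output_name, by_sub_key in nested.items():
    for sub_key, by_metric in by_sub_key.items():
      for metric_name, value in by_metric.items():
        yield output_name, sub_key, metric_name, value


rows = list(flatten(nested_metrics))
for row in rows:
  print(row)
```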

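Customization

In addition to custom metrics that are added as part of a saved Keras model (or legacy EvalSavedModel), there are two ways to customize metrics in TFMA post saving: (1) by defining a custom Keras metric class, and (2) by defining a custom TFMA metrics class backed by a Beam combiner.

In both cases, the metrics are configured by specifying the name of the metric class and the associated module. For example: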

from google.protobuf import text_format

metrics_specs = text_format.Parse("""
  metrics_specs {
    metrics { class_name: "MyMetric" module: "my.module"}
  }
""", tfma.EvalConfig()).metrics_specs

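Custom Keras Metrics

To create a custom Keras metric, users need to extend tf.keras.metrics.Metric with their implementation and then make sure the metric's module is available at evaluation time.

Note that for metrics added post model save, TFMA only supports metrics that take the label (i.e. y_true), prediction (y_pred), and example weight (sample_weight) as parameters to the update_state method.

Keras Metric Example

The following is an example of a custom Keras metric: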

class MyMetric(tf.keras.metrics.Mean):

  def __init__(self, name='my_metric', dtype=None):
    super(MyMetric, self).__init__(name=name, dtype=dtype)

  def update_state(self, y_true, y_pred, sample_weight=None):
    return super(MyMetric, self).update_state(
        y_pred, sample_weight=sample_weight)
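
For readers who want to see what such a metric accumulates without installing TensorFlow, here is a pure-Python stand-in. It is a sketch under the assumption that the metric is a weighted mean of the predictions, i.e. sum(prediction * weight) / sum(weight), with y_true accepted but ignored. The class name MeanPredictionSketch is hypothetical.

```python
from typing import Optional, Sequence


class MeanPredictionSketch:
  """Pure-Python stand-in for a weighted mean over predictions."""

  def __init__(self):
    self._total = 0.0
    self._count = 0.0

  def update_state(self,
                   y_true: Sequence[float],
                   y_pred: Sequence[float],
                   sample_weight: Optional[Sequence[float]] = None) -> None:
    # y_true is accepted to match the required signature but is not used.
    if sample_weight is None:
      sample_weight = [1.0] * len(y_pred)
    for prediction, weight in zip(y_pred, sample_weight):
      self._total += prediction * weight
      self._count += weight

  def result(self) -> float:
    return self._total / self._count if self._count else 0.0


metric = MeanPredictionSketch()
metric.update_state([1, 0], [0.2, 0.4])                  # Two unweighted examples.
metric.update_state([1], [0.9], sample_weight=[2.0])     # One example with weight 2.
print(metric.result())
```

The update_state signature (y_true, y_pred, sample_weight) is the part that matters for TFMA, since those are the only arguments supported for metrics added after the model is saved.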

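Custom TFMA Metrics

To create a custom TFMA metric, users need to extend tfma.metrics.Metric with their implementation and then make sure the metric's module is available at evaluation time.

Metric

A tfma.metrics.Metric implementation is made up of a set of kwargs that define the metric's configuration along with a function for creating the computations (possibly multiple) needed to calculate the metric's value. There are two main computation types that can be used: tfma.metrics.MetricComputation and tfma.metrics.DerivedMetricComputation, which are described in the sections below. The function that creates these computations will be passed the following parameters as input:

eval_config: tfma.EvalConfig
- The eval config passed to the evaluator (useful for looking up model spec settings such as the prediction key to use, etc.).
model_names: List[Text]
- List of model names to compute metrics for (None if single-model).
output_names: List[Text]
- List of output names to compute metrics for (None if single-output).
sub_keys: List[tfma.SubKey]
- List of sub keys (class ID, top K, etc.) to compute metrics for (or None).
aggregation_type: tfma.AggregationType
- Type of aggregation if computing an aggregation metric.
class_weights: Dict[int, float]
- Class weights to use if computing an aggregation metric.
query_key: Text
- Query key used if computing a query/ranking based metric.

If a metric is not associated with one or more of these settings, then it may leave those parameters out of its signature definition.

If a metric is computed the same way for each model, output, and sub key, then the utility tfma.metrics.merge_per_key_computations can be used to perform the same computations for each of these inputs separately.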
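
The idea of repeating one computation for every model, output, and sub key can be sketched with itertools.product. This is not the implementation of tfma.metrics.merge_per_key_computations. The helper per_key, the describe function, and the string sub keys are hypothetical and only illustrate the fan-out.

```python
import itertools
from typing import Callable, List, Optional, Sequence


def per_key(create_fn: Callable[[str, str, Optional[str]], str],
            model_names: Optional[Sequence[str]] = None,
            output_names: Optional[Sequence[str]] = None,
            sub_keys: Optional[Sequence[Optional[str]]] = None) -> List[str]:
  """Calls create_fn once for each (model, output, sub key) combination."""
  combos = itertools.product(model_names or [''], output_names or [''],
                             sub_keys or [None])
  return [create_fn(m, o, s) for m, o, s in combos]


def describe(model_name: str, output_name: str, sub_key: Optional[str]) -> str:
  """Stands in for a function that would create one computation."""
  return f'model={model_name!r} output={output_name!r} sub_key={sub_key!r}'


computations = per_key(
    describe,
    model_names=['candidate', 'baseline'],
    sub_keys=['classId:0', 'classId:1'])
for computation in computations:
  print(computation)
```

With two models and two sub keys the single definition fans out into four computations, while calling per_key(describe) with no names yields exactly one.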

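MetricComputation

A MetricComputation is made up of a combination of preprocessors and a combiner. The preprocessors field is a list of preprocessors, each of which is a beam.DoFn that takes extracts as its input and outputs the initial state that will be used by the combiner (see architecture for more information on what extracts are). All preprocessors are executed sequentially in the order of the list. If preprocessors is empty, then the combiner will be passed StandardMetricInputs (standard metric inputs contain labels, predictions, and example weights). The combiner is a beam.CombineFn that takes a tuple of (slice key, preprocessor output) as its input and outputs a tuple of (slice key, metric results dict) as its result.

Note that slicing happens between the preprocessors and the combiner.

Note that if a metric computation wants to make use of the standard metric inputs but augment them with a few of the features from the features extracts, then the special FeaturePreprocessor can be used. It merges the requested features from multiple combiners into a single shared StandardMetricInputs value that is passed to all the combiners (the combiners are responsible for reading the features they are interested in and ignoring the rest).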
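
The order of operations described above (preprocess, then slice, then combine) can be simulated in plain Python without Beam. In the sketch below the preprocess function and the CountCombiner class are stand-ins that borrow the usual CombineFn method names. The 'slice_key' field in the extracts is a simplification introduced for illustration, not how TFMA represents slices.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def preprocess(extracts: Dict[str, Any]) -> Iterable[int]:
  """Stands in for a preprocessor: emits the initial state per example."""
  yield 1


class CountCombiner:
  """Stands in for a combiner, using the usual CombineFn method names."""

  def create_accumulator(self) -> int:
    return 0

  def add_input(self, accumulator: int, state: int) -> int:
    return accumulator + state

  def extract_output(self, accumulator: int) -> Dict[str, int]:
    return {'example_count': accumulator}


# Hypothetical extracts; 'slice_key' is a simplification of real slicing.
extracts_list = [
    {'slice_key': 'country:us', 'labels': 1, 'predictions': 0.8},
    {'slice_key': 'country:us', 'labels': 0, 'predictions': 0.3},
    {'slice_key': 'country:ca', 'labels': 1, 'predictions': 0.6},
]

# Steps 1 and 2: preprocess, then key each preprocessor output by its slice.
sliced: Dict[str, List[int]] = defaultdict(list)
for extracts in extracts_list:
  for state in preprocess(extracts):
    sliced[extracts['slice_key']].append(state)

# Step 3: combine per slice key into (slice key, metric results dict) tuples.
combiner = CountCombiner()
results: List[Tuple[str, Dict[str, int]]] = []
for slice_key, states in sorted(sliced.items()):
  accumulator = combiner.create_accumulator()
  for state in states:
    accumulator = combiner.add_input(accumulator, state)
  results.append((slice_key, combiner.extract_output(accumulator)))
print(results)
```

In a real pipeline Beam would additionally merge partial accumulators produced on different workers, which this single-process sketch leaves out.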

Example

The following is a very simple example of a TFMA metric definition for computing the ExampleCount:

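from typing import Dict, Iterable, Text

import apache_beam as beam
import tensorflow_model_analysis as tfma


class ExampleCount(tfma.metrics.Metric):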

  def __init__(self, name: Text = 'example_count'):
    super(ExampleCount, self).__init__(_example_count, name=name)


def _example_count(
    name: Text = 'example_count') -> tfma.metrics.MetricComputations:
  key = tfma.metrics.MetricKey(name=name)
  return [
      tfma.metrics.MetricComputation(
          keys=[key],
          preprocessors=[_ExampleCountPreprocessor()],
          combiner=_ExampleCountCombiner(key))
  ]


class _ExampleCountPreprocessor(beam.DoFn):

  def process(self, extracts: tfma.Extracts) -> Iterable[int]:
    yield 1


class _ExampleCountCombiner(beam.CombineFn):

  def __init__(self, metric_key: tfma.metrics.MetricKey):
    self._metric_key = metric_key

  def create_accumulator(self) -> int:
    return 0

  def add_input(self, accumulator: int, state: int) -> int:
    return accumulator + state

  def merge_accumulators(self, accumulators: Iterable[int]) -> int:
    accumulators = iter(accumulators)
    result = next(accumulators)
    for accumulator in accumulators:
      result += accumulator
    return result

  def extract_output(self,
                     accumulator: int) -> Dict[tfma.metrics.MetricKey, int]:
    return {self._metric_key: accumulator}

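DerivedMetricComputation

A DerivedMetricComputation is made up of a result function that is used to calculate metric values based on the output of other metric computations. The result function takes a dict of computed values as its input and outputs a dict of additional metric results.

Note that it is acceptable (recommended) to include the computations that a derived computation depends on in the list of computations created by a metric. This avoids having to pre-create and pass computations that are shared between multiple metrics. The evaluator will automatically de-dup computations that have the same definition, so only one computation is actually run.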
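
The shape of a result function can be shown with a small pure-Python sketch. It uses plain string keys instead of structured metric keys, and the F1 example, the function names, and the input values are assumptions chosen for illustration rather than anything defined by TFMA.

```python
from typing import Callable, Dict, List

MetricsDict = Dict[str, float]


def f1_result(metrics: MetricsDict) -> MetricsDict:
  """Result function: derives F1 from already-computed precision and recall."""
  precision = metrics['precision']
  recall = metrics['recall']
  denominator = precision + recall
  f1 = 2 * precision * recall / denominator if denominator else 0.0
  return {'f1': f1}


def run(base: MetricsDict,
        derived: List[Callable[[MetricsDict], MetricsDict]]) -> MetricsDict:
  """Applies each result function in order and merges its output."""
  results = dict(base)
  for result_fn in derived:
    results.update(result_fn(results))
  return results


# Hypothetical outputs of the non-derived computations.
computed = {'precision': 0.5, 'recall': 1.0}
final = run(computed, [f1_result])
print(final)
```

The derived value is added next to the values it depends on, so precision and recall are computed once and reused rather than recomputed for F1.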

Example

The TJUR metrics provide a good example of derived metrics.