Wiki Talk comments toxicity prediction


In this example, we consider the task of predicting whether a discussion comment posted on a Wiki Talk page contains toxic content (i.e. contains content that is "rude, disrespectful or unreasonable"). We use a public dataset released by the Conversation AI project, which contains over 100k comments from the English Wikipedia that are annotated by crowd workers (see the paper for the labeling methodology).

One of the challenges with this dataset is that only a very small fraction of the comments cover sensitive topics such as sexuality or religion. As such, training a neural network model on this dataset leads to disparate performance on the smaller sensitive topics. This can mean that innocuous statements about those topics might get incorrectly flagged as "toxic" at higher rates, causing speech to be unfairly censored.

By imposing constraints during training, we can train a fairer model that performs more equitably across the different topic groups.

We will use the TFCO library to optimize for our fairness goal during training.

Installation

Let us first install and import the relevant libraries. Note that you may have to restart your colab once after running the first cell because of out-of-date packages in the runtime. After doing so, there should be no further issues with imports.

pip installs

Note that depending on when you run the cell below, you may receive a warning about the default version of TensorFlow in Colab switching to TensorFlow 2.X soon. You can safely ignore that warning, as this notebook was designed to be compatible with TensorFlow 1.X and 2.X.

Import Modules

While TFCO is compatible with eager and graph execution, this notebook assumes that eager execution is enabled by default. To ensure that nothing breaks, eager execution will be enabled in the cell below.

Enable Eager Execution and Print Versions
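A minimal sketch of what this cell might contain, assuming nothing beyond TensorFlow itself being installed:

```python
import tensorflow as tf

# Eager execution is the default in TF 2.X; in TF 1.X it must be enabled
# explicitly (and before any graph operations are created).
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()

print("TensorFlow version:", tf.__version__)
print("Eager execution enabled:", tf.executing_eagerly())
```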

Hyper-parameters

First, we set some hyper-parameters needed for the data preprocessing and model training.

hparams = {
    "batch_size": 128,
    "cnn_filter_sizes": [128, 128, 128],
    "cnn_kernel_sizes": [5, 5, 5],
    "cnn_pooling_sizes": [5, 5, 40],
    "constraint_learning_rate": 0.01,
    "embedding_dim": 100,
    "embedding_trainable": False,
    "learning_rate": 0.005,
    "max_num_words": 10000,
    "max_sequence_length": 250
}

Load and pre-process dataset

Next, we download the dataset and preprocess it. The train, test and validation sets are provided as separate CSV files.

toxicity_data_url = ("https://github.com/conversationai/unintended-ml-bias-analysis/"
                     "raw/e02b9f12b63a39235e57ba6d3d62d8139ca5572c/data/")

data_train = pd.read_csv(toxicity_data_url + "wiki_train.csv")
data_test = pd.read_csv(toxicity_data_url + "wiki_test.csv")
data_vali = pd.read_csv(toxicity_data_url + "wiki_dev.csv")

data_train.head()

The comment column contains the discussion comments and the is_toxic column indicates whether or not a comment is annotated as toxic.

In the following, we:

  1. Separate out the labels
  2. Tokenize the text comments
  3. Identify comments that contain sensitive topic terms

First, we separate the labels from the train, test and validation sets. The labels are all binary (0 or 1).

labels_train = data_train["is_toxic"].values.reshape(-1, 1) * 1.0
labels_test = data_test["is_toxic"].values.reshape(-1, 1) * 1.0
labels_vali = data_vali["is_toxic"].values.reshape(-1, 1) * 1.0

Next, we tokenize the textual comments using the Tokenizer provided by Keras. We use the training set comments alone to build a vocabulary of tokens, and use it to convert all the comments into (padded) sequences of tokens of the same length.

tokenizer = text.Tokenizer(num_words=hparams["max_num_words"])
tokenizer.fit_on_texts(data_train["comment"])

def prep_text(texts, tokenizer, max_sequence_length):
    # Turns text into padded sequences of token ids.
    text_sequences = tokenizer.texts_to_sequences(texts)
    return sequence.pad_sequences(text_sequences, maxlen=max_sequence_length)

text_train = prep_text(data_train["comment"], tokenizer, hparams["max_sequence_length"])
text_test = prep_text(data_test["comment"], tokenizer, hparams["max_sequence_length"])
text_vali = prep_text(data_vali["comment"], tokenizer, hparams["max_sequence_length"])

Finally, we identify comments related to certain sensitive topic groups. We consider a subset of the identity terms provided with the dataset and group them into four broad topic groups: sexuality, gender identity, religion and race.

terms = {
    'sexuality': ['gay', 'lesbian', 'bisexual', 'homosexual', 'straight', 'heterosexual'], 
    'gender identity': ['trans', 'transgender', 'cis', 'nonbinary'],
    'religion': ['christian', 'muslim', 'jewish', 'buddhist', 'catholic', 'protestant', 'sikh', 'taoist'],
    'race': ['african', 'african american', 'black', 'white', 'european', 'hispanic', 'latino', 'latina', 
             'latinx', 'mexican', 'canadian', 'american', 'asian', 'indian', 'middle eastern', 'chinese', 
             'japanese']}

group_names = list(terms.keys())
num_groups = len(group_names)

We then create separate group membership matrices for the train, test and validation sets, where the rows correspond to comments, the columns correspond to the four sensitive groups, and each entry is a boolean indicating whether the comment contains a term from the topic group.

def get_groups(text):
    # Returns an (n, k) NumPy array, where n is the number of comments and
    # k is the number of groups. Entry (i, j) is 1.0 if the i-th comment
    # contains a term from the j-th group, and 0.0 otherwise.
    groups = np.zeros((text.shape[0], num_groups))
    for ii in range(num_groups):
        groups[:, ii] = text.str.contains('|'.join(terms[group_names[ii]]), case=False)
    return groups

groups_train = get_groups(data_train["comment"])
groups_test = get_groups(data_test["comment"])
groups_vali = get_groups(data_vali["comment"])

As shown below, all four topic groups constitute only a small fraction of the overall dataset, and have varying proportions of toxic comments.

print("Overall label proportion = %.1f%%" % (labels_train.mean() * 100))

group_stats = []
for ii in range(num_groups):
    group_proportion = groups_train[:, ii].mean()
    group_pos_proportion = labels_train[groups_train[:, ii] == 1].mean()
    group_stats.append([group_names[ii],
                        "%.2f%%" % (group_proportion * 100), 
                        "%.1f%%" % (group_pos_proportion * 100)])
group_stats = pd.DataFrame(group_stats, 
                           columns=["Topic group", "Group proportion", "Label proportion"])
group_stats

We see that only 1.3% of the dataset contains comments related to sexuality. Among them, 37% of the comments are annotated as being toxic. Note that this is significantly larger than the overall proportion of comments annotated as toxic. This could be because the few comments that used those identity terms did so in pejorative contexts. As mentioned above, this could cause our model to disproportionately misclassify comments as toxic when they include those terms. Since this is our concern, we will make sure to look at the false positive rate when we evaluate the model's performance.
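To make the metric concrete, here is a minimal NumPy sketch (with made-up toy data, not the Fairness Indicators evaluation used later) of how a per-group false positive rate is computed:

```python
import numpy as np

def false_positive_rate(labels, predictions, mask):
    # FPR = fraction of negatively-labeled examples that are predicted
    # positive, restricted to the rows selected by the boolean mask.
    negatives = (labels == 0) & mask
    return ((predictions == 1) & negatives).sum() / negatives.sum()

# Toy data: 6 comments, binary labels, thresholded predictions, and
# membership in a single hypothetical topic group.
labels = np.array([0, 0, 1, 0, 1, 0])
preds = np.array([1, 0, 1, 1, 0, 0])
group = np.array([True, True, False, True, False, False])

print(false_positive_rate(labels, preds, np.ones(6, dtype=bool)))  # 0.5
print(false_positive_rate(labels, preds, group))                   # ≈ 0.667
```

A model can thus have a low overall false positive rate while a small group, with most of the false positives concentrated in it, has a much higher one.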

Build CNN toxicity prediction model

Having prepared the dataset, we now build a Keras model for predicting toxicity. The model we use is a convolutional neural network (CNN) with the same architecture used by the Conversation AI project for their debiasing analysis. We adapt the code they provide to construct the model layers.

The model uses an embedding layer to convert the text tokens to fixed-length vectors. This layer converts the input text sequence into a sequence of vectors, and passes them through several layers of convolution and pooling operations, followed by a final fully-connected layer.

We make use of pre-trained GloVe word vector embeddings, which we download below. This may take a few minutes to complete.

zip_file_url = "http://nlp.stanford.edu/data/glove.6B.zip"
zip_file = urllib.request.urlopen(zip_file_url)
archive = zipfile.ZipFile(io.BytesIO(zip_file.read()))

We use the downloaded GloVe embeddings to create an embedding matrix, where the rows contain the word vector embeddings for the tokens in the Tokenizer's vocabulary.

embeddings_index = {}
glove_file = "glove.6B.100d.txt"

with archive.open(glove_file) as f:
    for line in f:
        values = line.split()
        word = values[0].decode("utf-8") 
        coefs = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = coefs

embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, hparams["embedding_dim"]))
num_words_in_embedding = 0
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        num_words_in_embedding += 1
        embedding_matrix[i] = embedding_vector

We are now ready to specify the Keras layers. We write a function to create a new model, which we will invoke whenever we wish to train a new model.

def create_model():
    model = keras.Sequential()

    # Embedding layer.
    embedding_layer = layers.Embedding(
        embedding_matrix.shape[0],
        embedding_matrix.shape[1],
        weights=[embedding_matrix],
        input_length=hparams["max_sequence_length"],
        trainable=hparams['embedding_trainable'])
    model.add(embedding_layer)

    # Convolution layers.
    for filter_size, kernel_size, pool_size in zip(
        hparams['cnn_filter_sizes'], hparams['cnn_kernel_sizes'],
        hparams['cnn_pooling_sizes']):

        conv_layer = layers.Conv1D(
            filter_size, kernel_size, activation='relu', padding='same')
        model.add(conv_layer)

        pooled_layer = layers.MaxPooling1D(pool_size, padding='same')
        model.add(pooled_layer)

    # Add a flatten layer, a fully-connected layer and an output layer.
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dense(1))

    return model

We also define a method to set random seeds. This is done to ensure reproducible results.

def set_seeds():
  np.random.seed(121212)
  tf.compat.v1.set_random_seed(212121)

Fairness indicators

We also write functions to plot fairness indicators.

def create_examples(labels, predictions, groups, group_names):
  # Returns tf.examples with given labels, predictions, and group information.  
  examples = []
  sigmoid = lambda x: 1/(1 + np.exp(-x)) 
  for ii in range(labels.shape[0]):
    example = tf.train.Example()
    example.features.feature['toxicity'].float_list.value.append(
        labels[ii][0])
    example.features.feature['prediction'].float_list.value.append(
        sigmoid(predictions[ii][0]))  # predictions need to be in [0, 1].
    for jj in range(groups.shape[1]):
      example.features.feature[group_names[jj]].bytes_list.value.append(
          b'Yes' if groups[ii, jj] else b'No')
    examples.append(example)
  return examples
def evaluate_results(labels, predictions, groups, group_names):
  # Evaluates fairness indicators for given labels, predictions and group
  # membership info.
  examples = create_examples(labels, predictions, groups, group_names)

  # Create feature map for labels, predictions and each group.
  feature_map = {
      'prediction': tf.io.FixedLenFeature([], tf.float32),
      'toxicity': tf.io.FixedLenFeature([], tf.float32),
  }
  for group in group_names:
    feature_map[group] = tf.io.FixedLenFeature([], tf.string)

  # Serialize the examples.
  serialized_examples = [e.SerializeToString() for e in examples]

  BASE_DIR = tempfile.gettempdir()
  OUTPUT_DIR = os.path.join(BASE_DIR, 'output')

  with beam.Pipeline() as pipeline:
    model_agnostic_config = agnostic_predict.ModelAgnosticConfig(
              label_keys=['toxicity'],
              prediction_keys=['prediction'],
              feature_spec=feature_map)

    slices = [tfma.slicer.SingleSliceSpec()]
    for group in group_names:
      slices.append(
          tfma.slicer.SingleSliceSpec(columns=[group]))

    extractors = [
            model_agnostic_extractor.ModelAgnosticExtractor(
                model_agnostic_config=model_agnostic_config),
            tfma.extractors.slice_key_extractor.SliceKeyExtractor(slices)
        ]

    metrics_callbacks = [
      tfma.post_export_metrics.fairness_indicators(
          thresholds=[0.5],
          target_prediction_keys=['prediction'],
          labels_key='toxicity'),
      tfma.post_export_metrics.example_count()]

    # Create a model agnostic aggregator.
    eval_shared_model = tfma.types.EvalSharedModel(
        add_metrics_callbacks=metrics_callbacks,
        construct_fn=model_agnostic_evaluate_graph.make_construct_fn(
            add_metrics_callbacks=metrics_callbacks,
            config=model_agnostic_config))

    # Run Model Agnostic Eval.
    _ = (
        pipeline
        | beam.Create(serialized_examples)
        | 'ExtractEvaluateAndWriteResults' >>
          tfma.ExtractEvaluateAndWriteResults(
              eval_shared_model=eval_shared_model,
              output_path=OUTPUT_DIR,
              extractors=extractors,
              compute_confidence_intervals=True
          )
    )

  fairness_ind_result = tfma.load_eval_result(output_path=OUTPUT_DIR)

  # Also evaluate accuracy of the model.
  accuracy = np.mean(labels == (predictions > 0.0))

  return fairness_ind_result, accuracy
def plot_fairness_indicators(eval_result, title):
  fairness_ind_result, accuracy = eval_result
  display(HTML("<center><h2>" + title + 
               " (Accuracy = %.2f%%)" % (accuracy * 100) + "</h2></center>"))
  widget_view.render_fairness_indicator(fairness_ind_result)
def plot_multi_fairness_indicators(multi_eval_results):

  multi_results = {}
  multi_accuracy = {}
  for title, (fairness_ind_result, accuracy) in multi_eval_results.items():
    multi_results[title] = fairness_ind_result
    multi_accuracy[title] = accuracy

  title_str = "<center><h2>"
  for title in multi_eval_results.keys():
      title_str += title + " (Accuracy = %.2f%%)" % (multi_accuracy[title] * 100) + "; "
  title_str = title_str[:-2] + "</h2></center>"
  display(HTML(title_str))
  widget_view.render_fairness_indicator(multi_eval_results=multi_results)

Train unconstrained model

For the first model we train, we optimize a simple cross-entropy loss without any constraints.

# Set random seed for reproducible results.
set_seeds()
# Optimizer and loss.
optimizer = tf.keras.optimizers.Adam(learning_rate=hparams["learning_rate"])
loss = lambda y_true, y_pred: tf.keras.losses.binary_crossentropy(
    y_true, y_pred, from_logits=True)

# Create, compile and fit model.
model_unconstrained = create_model()
model_unconstrained.compile(optimizer=optimizer, loss=loss)

model_unconstrained.fit(
    x=text_train, y=labels_train, batch_size=hparams["batch_size"], epochs=2)

Having trained the unconstrained model, we plot various evaluation metrics for the model on the test set.

scores_unconstrained_test = model_unconstrained.predict(text_test)
eval_result_unconstrained = evaluate_results(
    labels_test, scores_unconstrained_test, groups_test, group_names)

As explained above, we are concentrating on the false positive rate. In its current version (0.1.2), Fairness Indicators selects the false negative rate by default. After running the line below, please deselect false_negative_rate and select false_positive_rate to look at the metric we are interested in.

plot_fairness_indicators(eval_result_unconstrained, "Unconstrained")

While the overall false positive rate is less than 2%, the false positive rate on the sexuality-related comments is significantly higher. This is because the sexuality group is very small in size and has a disproportionately higher fraction of comments annotated as toxic. Hence, training a model without constraints results in the model believing that sexuality-related terms are a strong indicator of toxicity.

Train with constraints on false positive rates

To avoid large differences in false positive rates across different groups, we next train a model by constraining the false positive rate for each group to be within a desired limit. In this case, we will optimize the error rate of the model subject to the per-group false positive rates being less than or equal to 2%.

Training on minibatches with per-group constraints can be challenging for this dataset, as the groups we wish to constrain are all small in size, and it is likely that each individual minibatch contains very few examples from each group. Hence, the gradients we compute during training will be noisy, causing the model to converge very slowly.

To mitigate this problem, we recommend using two streams of minibatches, with the first stream formed as before from the entire training set, and the second stream formed solely from the sensitive group examples. We will compute the objective using minibatches from the first stream and the per-group constraints using minibatches from the second stream. Because the batches from the second stream are likely to contain a larger number of examples from each group, we expect our updates to be less noisy.
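The index arithmetic behind the two streams can be sketched on its own: each stream walks through its own example pool and wraps around with a modulo, so the smaller sensitive-group pool is simply revisited more often. A toy sketch (the pool sizes here are made up, not taken from the dataset):

```python
import numpy as np

def stream_indices(step, batch_size, pool):
    # Returns the indices of the step-th minibatch, cycling through `pool`
    # with wrap-around so that small pools are revisited more often.
    start = step * batch_size
    return [int(pool[i % len(pool)]) for i in range(start, start + batch_size)]

full_pool = np.arange(1000)               # toy: all training examples
sensitive_pool = np.arange(0, 1000, 25)   # toy: 40 sensitive-group examples

# The sensitive stream has already wrapped around by step 5,
# while the full stream is still on its first pass.
print(stream_indices(3, 8, sensitive_pool))  # [600, 625, 650, 675, 700, 725, 750, 775]
print(stream_indices(5, 8, sensitive_pool)[0])  # 0
```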

We create separate features, labels and groups tensors to hold the minibatches from the two streams.

# Set random seed.
set_seeds()

# Features tensors.
batch_shape = (hparams["batch_size"], hparams['max_sequence_length'])
features_tensor = tf.Variable(np.zeros(batch_shape, dtype='int32'), name='x')
features_tensor_sen = tf.Variable(np.zeros(batch_shape, dtype='int32'), name='x_sen')

# Labels tensors.
batch_shape = (hparams["batch_size"], 1)
labels_tensor = tf.Variable(np.zeros(batch_shape, dtype='float32'), name='labels')
labels_tensor_sen = tf.Variable(np.zeros(batch_shape, dtype='float32'), name='labels_sen')

# Groups tensors.
batch_shape = (hparams["batch_size"], num_groups)
groups_tensor_sen = tf.Variable(np.zeros(batch_shape, dtype='float32'), name='groups_sen')

We instantiate a new model, and compute predictions for minibatches from the two streams.

# Create model, and separate prediction functions for the two streams. 
# For the predictions, we use a nullary function returning a Tensor to support eager mode.
model_constrained = create_model()

def predictions():
  return model_constrained(features_tensor)

def predictions_sen():
  return model_constrained(features_tensor_sen)

We then set up a constrained optimization problem with the error rate as the objective and with constraints on the per-group false positive rates.

epsilon = 0.02  # Desired false-positive rate threshold.

# Set up separate contexts for the two minibatch streams.
context = tfco.rate_context(predictions, lambda: labels_tensor)
context_sen = tfco.rate_context(predictions_sen, lambda: labels_tensor_sen)

# Compute the objective using the first stream.
objective = tfco.error_rate(context)

# Compute the constraint using the second stream.
# Subset the examples belonging to the "sexuality" group from the second stream 
# and add a constraint on the group's false positive rate.
context_sen_subset = context_sen.subset(lambda: groups_tensor_sen[:, 0] > 0)
constraint = [tfco.false_positive_rate(context_sen_subset) <= epsilon]

# Create a rate minimization problem.
problem = tfco.RateMinimizationProblem(objective, constraint)

# Set up a constrained optimizer.
optimizer = tfco.ProxyLagrangianOptimizerV2(
    optimizer=tf.keras.optimizers.Adam(learning_rate=hparams["learning_rate"]),
    num_constraints=problem.num_constraints)

# List of variables to optimize include the model weights, 
# and the trainable variables from the rate minimization problem and 
# the constrained optimizer.
var_list = (model_constrained.trainable_weights + list(problem.trainable_variables) +
            optimizer.trainable_variables())

We are now ready to train our model. We maintain a separate counter for the two minibatch streams. Every time we perform a gradient update, we copy the minibatch contents from the first stream to the tensors features_tensor and labels_tensor, and the minibatch contents from the second stream to the tensors features_tensor_sen, labels_tensor_sen and groups_tensor_sen.

# Indices of sensitive group members.
protected_group_indices = np.nonzero(groups_train.sum(axis=1))[0]

num_examples = text_train.shape[0]
num_examples_sen = protected_group_indices.shape[0]
batch_size = hparams["batch_size"]

# Number of steps needed for one epoch over the training sample.
num_steps = int(num_examples / batch_size)

start_time = time.time()

# Loop over minibatches.
for batch_index in range(num_steps):
    # Indices for current minibatch in the first stream.
    batch_indices = np.arange(
        batch_index * batch_size, (batch_index + 1) * batch_size)
    batch_indices = [ind % num_examples for ind in batch_indices]

    # Indices for current minibatch in the second stream.
    batch_indices_sen = np.arange(
        batch_index * batch_size, (batch_index + 1) * batch_size)
    batch_indices_sen = [protected_group_indices[ind % num_examples_sen]
                         for ind in batch_indices_sen]

    # Assign features, labels, groups from the minibatches to the respective tensors.
    features_tensor.assign(text_train[batch_indices, :])
    labels_tensor.assign(labels_train[batch_indices])

    features_tensor_sen.assign(text_train[batch_indices_sen, :])
    labels_tensor_sen.assign(labels_train[batch_indices_sen])
    groups_tensor_sen.assign(groups_train[batch_indices_sen, :])

    # Gradient update.
    optimizer.minimize(problem, var_list=var_list)

    # Record and print batch training stats every 10 steps.
    if (batch_index + 1) % 10 == 0 or batch_index in (0, num_steps - 1):
      hinge_loss = problem.objective()
      max_violation = max(problem.constraints())

      elapsed_time = time.time() - start_time
      sys.stdout.write(
          "\rStep %d / %d: Elapsed time = %ds, Loss = %.3f, Violation = %.3f" % 
          (batch_index + 1, num_steps, elapsed_time, hinge_loss, max_violation))

Having trained the constrained model, we plot various evaluation metrics for the model on the test set.

scores_constrained_test = model_constrained.predict(text_test)
eval_result_constrained = evaluate_results(
    labels_test, scores_constrained_test, groups_test, group_names)

As last time, remember to select false_positive_rate.

plot_fairness_indicators(eval_result_constrained, "Constrained")
multi_results = {
    'constrained': eval_result_constrained,
    'unconstrained': eval_result_unconstrained,
}
plot_multi_fairness_indicators(multi_eval_results=multi_results)

As we can see from the fairness indicators, compared to the unconstrained model, the constrained model yields significantly lower false positive rates for the sexuality-related comments, with only a slight dip in overall accuracy.