Recommending movies: ranking


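Real-world recommender systems are often composed of two stages:

  1. The retrieval stage selects an initial set of a few hundred candidates from all possible candidates. The main objective of this model is to efficiently weed out every candidate the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.
  2. The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.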
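The two stages above can be sketched in a few lines of plain Python. This is a toy illustration, not the TFRS API: the function name `recommend`, the score tables `cheap` and `precise`, and the item names are all hypothetical.

```python
def recommend(candidates, retrieval_score, ranking_score, num_retrieved, num_final):
    """Two-stage recommendation: a cheap filter, then a more careful ordering."""
    # Stage 1 (retrieval): keep the num_retrieved best items under a cheap score.
    retrieved = sorted(candidates, key=retrieval_score, reverse=True)[:num_retrieved]
    # Stage 2 (ranking): re-score only the survivors with the costlier model.
    ranked = sorted(retrieved, key=ranking_score, reverse=True)
    return ranked[:num_final]


# Hypothetical scores for five items under each model.
cheap = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1, "e": 0.0}
precise = {"a": 3.1, "b": 4.6, "c": 3.9, "d": 4.9, "e": 1.0}

top = recommend(list(cheap), cheap.get, precise.get, num_retrieved=3, num_final=2)
print(top)  # ['b', 'c']
```

Item "d" has the highest ranking score but never reaches the ranking stage, because retrieval already dropped it. The ranking model can only reorder what retrieval hands over.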

We're going to focus on the second stage, ranking. If you are interested in the retrieval stage, have a look at our retrieval tutorial.

In this tutorial, we're going to:

  1. Get our data and split it into a training set and a test set.
  2. Implement a ranking model.
  3. Fit and evaluate it.

Imports

Let's first get our imports out of the way.

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

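Preparing the dataset

We're going to use the same data as the retrieval tutorial. This time, we're also going to keep the ratings: these are the targets we are trying to predict.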

ratings = tfds.load("movielens/100k-ratings", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]
})

As before, we'll split the data by putting 80% of the ratings in the training set and 20% in the test set.

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
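One detail in the split is worth a closer look: `take` and `skip` each iterate the shuffled dataset separately, so the split is only clean if both see the same order. The sketch below imitates this with the standard library rather than `tf.data`. The helper `one_pass` is hypothetical, and reading `reshuffle_each_iteration=False` as "every pass sees the same permutation" is my interpretation of why the flag is set.

```python
import random


def one_pass(items, seed):
    """Return one shuffled pass over items, like iterating a shuffled dataset once."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order


ratings = list(range(10))

# Fixed order (analogous to reshuffle_each_iteration=False): the pass feeding
# `take` and the pass feeding `skip` see the same permutation.
train = one_pass(ratings, seed=42)[:8]
test = one_pass(ratings, seed=42)[8:]
print(set(train) & set(test))  # set()

# Different order per pass: the two slices come from different permutations,
# so nothing guarantees that they are disjoint.
leaky_train = one_pass(ratings, seed=1)[:8]
leaky_test = one_pass(ratings, seed=2)[8:]
print(set(leaky_train) & set(leaky_test))  # may be non-empty
```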

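Let's also figure out the unique user IDs and movie titles present in the data.

This is important because we need to be able to map the raw values of our categorical features to embedding vectors in our models. To do that, we need a vocabulary that maps each raw feature value to an integer in a contiguous range: this allows us to look up the corresponding embedding in our embedding tables.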

movie_titles = ratings.batch(1_000_000).map(lambda x: x["movie_title"])
user_ids = ratings.batch(1_000_000).map(lambda x: x["user_id"])

unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))
unique_user_ids = np.unique(np.concatenate(list(user_ids)))
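To make the vocabulary idea concrete, here is a minimal NumPy sketch of what a string lookup followed by an embedding table does. It is an illustration, not the Keras implementation. The helper `lookup` and the sample IDs are hypothetical, and reserving index 0 for out-of-vocabulary values reflects the default `StringLookup` behaviour with `mask_token=None` as I understand it, which is why the embedding layers later in this tutorial use `len(vocabulary) + 1` rows.

```python
import numpy as np

# Sorted unique raw values, as np.unique returns them (lexicographic for strings).
vocabulary = np.unique(np.array(["42", "7", "42", "138"]))

# Index 0 is reserved for out-of-vocabulary values, so known IDs start at 1.
index_of = {str(value): i + 1 for i, value in enumerate(vocabulary)}


def lookup(raw_values):
    """Map raw strings to integer indices; unknown strings map to 0."""
    return np.array([index_of.get(value, 0) for value in raw_values])


embedding_dimension = 4
rng = np.random.default_rng(0)
# One row per vocabulary entry, plus one row for the out-of-vocabulary index.
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))

indices = lookup(["42", "999"])
vectors = embedding_table[indices]
print(indices)        # [2 0]
print(vectors.shape)  # (2, 4)
```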

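Implementing a model

Architecture

Ranking models do not face the same efficiency constraints as retrieval models, so we have more freedom in our choice of architecture.

A model composed of multiple stacked dense layers is a common architecture for ranking tasks. We can implement it as follows: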

class RankingModel(tf.keras.Model):

  def __init__(self):
    super().__init__()
    embedding_dimension = 32

    # Compute embeddings for users.
    self.user_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_user_ids, mask_token=None),
      tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
    ])

    # Compute embeddings for movies.
    self.movie_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, embedding_dimension)
    ])

    # Compute predictions.
    self.ratings = tf.keras.Sequential([
      # Learn multiple dense layers.
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dense(64, activation="relu"),
      # Make rating predictions in the final layer.
      tf.keras.layers.Dense(1)
    ])

  def call(self, inputs):

    user_id, movie_title = inputs

    user_embedding = self.user_embeddings(user_id)
    movie_embedding = self.movie_embeddings(movie_title)

    return self.ratings(tf.concat([user_embedding, movie_embedding], axis=1))
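It can help to trace the tensor shapes through this architecture. The NumPy sketch below mirrors the layer sizes of `RankingModel` (32-dimensional embeddings, then dense layers of 256, 64 and 1 units) with randomly initialised, untrained weights. The `dense` helper is a hypothetical stand-in for `tf.keras.layers.Dense`, and the output values carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dimension = 32


def dense(x, units, activation=None):
    """A randomly initialised dense layer: x @ W + b, optionally followed by ReLU."""
    w = rng.normal(scale=0.05, size=(x.shape[1], units))
    b = np.zeros(units)
    y = x @ w + b
    return np.maximum(y, 0.0) if activation == "relu" else y


batch_size = 3
user_embedding = rng.normal(size=(batch_size, embedding_dimension))
movie_embedding = rng.normal(size=(batch_size, embedding_dimension))

# Concatenate along the feature axis, then apply the stacked dense layers.
x = np.concatenate([user_embedding, movie_embedding], axis=1)
print(x.shape)        # (3, 64)
x = dense(x, 256, "relu")
x = dense(x, 64, "relu")
ratings = dense(x, 1)
print(ratings.shape)  # (3, 1)
```

Each (user, movie) pair in the batch yields one scalar, which is why a single-example call returns a tensor of shape (1, 1).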

This model takes user IDs and movie titles, and outputs a predicted rating:

RankingModel()((["42"], ["One Flew Over the Cuckoo's Nest (1975)"]))
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[-0.01534399]], dtype=float32)>

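Loss and metrics

The next component is the loss used to train our model. TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the Ranking task object: a convenience wrapper that bundles together the loss function and metric computation.

We'll use it together with the MeanSquaredError Keras loss in order to predict the ratings.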

task = tfrs.tasks.Ranking(
  loss = tf.keras.losses.MeanSquaredError(),
  metrics=[tf.keras.metrics.RootMeanSquaredError()]
)

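The task itself is a Keras layer that takes the true labels and the predictions as arguments, and returns the computed loss. We'll use that to implement the model's training loop.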
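As a worked example of the two quantities the task reports, the sketch below computes the mean squared error (the loss) and its square root (the RMSE metric) for one small batch in plain Python. The labels and predictions are made-up numbers, and `mean_squared_error` is a hypothetical helper rather than the Keras class.

```python
import math


def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


labels = [4.0, 3.0, 5.0, 1.0]
predictions = [3.5, 3.0, 4.0, 2.0]

# Squared errors are 0.25, 0.0, 1.0 and 1.0, so the mean is 2.25 / 4.
loss = mean_squared_error(labels, predictions)
rmse = math.sqrt(loss)
print(loss)  # 0.5625
print(rmse)  # 0.75
```

RMSE is expressed in the same units as the ratings, so a value of 0.75 means predictions are off by roughly three quarters of a star in the root-mean-square sense.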

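The full model

We can now put it all together into a model. TFRS exposes a base model class (tfrs.models.Model) which streamlines building models: all we need to do is set up the components in the __init__ method and implement the compute_loss method, which takes in the raw features and returns a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.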

class MovielensModel(tfrs.models.Model):

  def __init__(self):
    super().__init__()
    self.ranking_model: tf.keras.Model = RankingModel()
    self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
      loss = tf.keras.losses.MeanSquaredError(),
      metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )

  def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
    return self.ranking_model(
        (features["user_id"], features["movie_title"]))

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    labels = features.pop("user_rating")

    rating_predictions = self(features)

    # The task computes the loss and the metrics.
    return self.task(labels=labels, predictions=rating_predictions)

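Fitting and evaluating

After defining the model, we can use the standard Keras fitting and evaluation routines to fit and evaluate it.

Let's first instantiate the model.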

model = MovielensModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training and evaluation data.

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
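The batch sizes determine how many steps each pass takes. A quick calculation (the helper `num_batches` is hypothetical) reproduces the `10/10` and `5/5` step counters that appear in the training and evaluation logs:

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


print(num_batches(80_000, 8192))  # 10
print(num_batches(20_000, 4096))  # 5

# Size of the final, partial batch in each split.
print(80_000 - 9 * 8192)  # 6272
print(20_000 - 4 * 4096)  # 3616
```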

Then train the model:

model.fit(cached_train, epochs=3)
Epoch 1/3
10/10 [==============================] - 4s 166ms/step - root_mean_squared_error: 2.0902 - loss: 4.0368 - regularization_loss: 0.0000e+00 - total_loss: 4.0368
Epoch 2/3
10/10 [==============================] - 0s 4ms/step - root_mean_squared_error: 1.1613 - loss: 1.3426 - regularization_loss: 0.0000e+00 - total_loss: 1.3426
Epoch 3/3
10/10 [==============================] - 0s 4ms/step - root_mean_squared_error: 1.1140 - loss: 1.2414 - regularization_loss: 0.0000e+00 - total_loss: 1.2414
<keras.callbacks.History at 0x7fd31445d490>

As the model trains, the loss falls and the RMSE metric improves.

Finally, we can evaluate our model on the test set:

model.evaluate(cached_test, return_dict=True)
5/5 [==============================] - 2s 9ms/step - root_mean_squared_error: 1.1009 - loss: 1.2072 - regularization_loss: 0.0000e+00 - total_loss: 1.2072
{'root_mean_squared_error': 1.100862741470337,
 'loss': 1.1866925954818726,
 'regularization_loss': 0,
 'total_loss': 1.1866925954818726}

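The lower the RMSE metric, the more accurate our model is at predicting ratings.

Testing the ranking model

Now we can test the ranking model by computing predictions for a set of movies and then ranking these movies by their predicted ratings: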

test_ratings = {}
test_movie_titles = ["M*A*S*H (1970)", "Dances with Wolves (1990)", "Speed (1994)"]
for movie_title in test_movie_titles:
  test_ratings[movie_title] = model({
      "user_id": np.array(["42"]),
      "movie_title": np.array([movie_title])
  })

print("Ratings:")
for title, score in sorted(test_ratings.items(), key=lambda x: x[1], reverse=True):
  print(f"{title}: {score}")
Ratings:
Dances with Wolves (1990): [[3.539769]]
M*A*S*H (1970): [[3.5356772]]
Speed (1994): [[3.4501984]]

Exporting for serving

The model can be easily exported for serving:

tf.saved_model.save(model, "export")
INFO:tensorflow:Assets written to: export/assets

We can now load it back and perform predictions:

loaded = tf.saved_model.load("export")

loaded({"user_id": np.array(["42"]), "movie_title": ["Speed (1994)"]}).numpy()
array([[3.4501984]], dtype=float32)

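Converting the model to TensorFlow Lite

Although TensorFlow Recommenders is primarily built for server-side recommendation, you can still convert the trained ranking model to TensorFlow Lite and run it on-device (for better user privacy and lower latency).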

converter = tf.lite.TFLiteConverter.from_saved_model("export")
tflite_model = converter.convert()
open("converted_model.tflite", "wb").write(tflite_model)
544480

Once the model is converted, you can run it like any regular TensorFlow Lite model. Please check out the TensorFlow Lite documentation to learn more.

interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Test the model.
if input_details[0]["name"] == "serving_default_movie_title:0":
  interpreter.set_tensor(input_details[0]["index"], np.array(["Speed (1994)"]))
  interpreter.set_tensor(input_details[1]["index"], np.array(["42"]))
else:
  interpreter.set_tensor(input_details[0]["index"], np.array(["42"]))
  interpreter.set_tensor(input_details[1]["index"], np.array(["Speed (1994)"]))

interpreter.invoke()

rating = interpreter.get_tensor(output_details[0]['index'])
print(rating)
[[3.450199]]
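The `if`/`else` above exists because the order of the converted model's inputs is not fixed. One way to avoid branching on position is to key the inputs by name. The sketch below runs on a hand-written stand-in for `interpreter.get_input_details()`, so it needs no model file. The helper `input_indices_by_name` is hypothetical, and the name `serving_default_user_id:0` is an assumption made by analogy with the `serving_default_movie_title:0` name checked in the code above.

```python
def input_indices_by_name(input_details):
    """Map each input tensor's name to its tensor index."""
    return {detail["name"]: detail["index"] for detail in input_details}


# Hypothetical stand-in for interpreter.get_input_details(); the real entries
# carry more keys, but "name" and "index" are the two used here.
fake_input_details = [
    {"name": "serving_default_movie_title:0", "index": 0},
    {"name": "serving_default_user_id:0", "index": 1},
]

indices = input_indices_by_name(fake_input_details)

features = {
    "serving_default_user_id:0": ["42"],
    "serving_default_movie_title:0": ["Speed (1994)"],
}

# With a real interpreter, each pair would be passed to
# interpreter.set_tensor(tensor_index, np.array(value)).
assignments = {indices[name]: value for name, value in features.items()}
print(assignments)  # {1: ['42'], 0: ['Speed (1994)']}
```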

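Next steps

The model above gives us a decent start towards building a ranking system.

Of course, making a practical ranking system requires much more effort.

In most cases, a ranking model can be substantially improved by using more features rather than just user and candidate identifiers. To see how to do that, have a look at our side features tutorial.

A careful understanding of the objectives worth optimizing is also necessary. To get started on building a recommender that optimizes multiple objectives, have a look at our multitask tutorial.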