利用上下文特征

在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本

特征化教程 中,我们将多个特征(不仅仅是用户和电影标识符)纳入我们的模型,但我们尚未探索这些特征是否提高了模型精度。

许多因素会影响在推荐器模型中使用除 ID 之外的特征是否有效

  1. 上下文的重要性:如果用户偏好在不同上下文和时间跨度内相对稳定,上下文特征可能不会带来太多好处。但是,如果用户的偏好高度依赖于上下文,添加上下文将显著提高模型精度。例如,星期几可能是决定是否推荐短片或电影的重要特征:用户可能只有在工作日才有时间观看短片,但在周末可以放松身心,观看完整电影。同样,查询时间戳可能在建模流行度动态方面发挥重要作用:一部电影可能在上映时非常受欢迎,但之后很快就会衰退。相反,其他电影可能是常青树,人们乐于一遍又一遍地观看。
  2. 数据稀疏性:如果数据稀疏,使用非 ID 特征可能至关重要。如果对给定用户或项目的观测结果很少,模型可能难以估计出良好的用户或项目表示。为了构建准确的模型,必须使用其他特征(例如项目类别、描述和图像)来帮助模型泛化到训练数据之外。这在 冷启动 情况下尤其重要,在这种情况下,某些项目或用户可用的数据相对较少。

在本教程中,我们将尝试在 MovieLens 模型中使用除电影标题和用户 ID 之外的特征。

准备工作

我们首先导入必要的包。

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs
2022-12-14 12:11:12.860829: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-12-14 12:11:12.860920: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-12-14 12:11:12.860930: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

我们遵循 特征化教程,保留用户 ID、时间戳和电影标题特征。

ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089

我们还进行一些整理工作,以准备特征词汇表。

timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))

max_timestamp = timestamps.max()
min_timestamp = timestamps.min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)

unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_id"]))))

模型定义

查询模型

我们从 特征化教程 中定义的用户模型开始,作为我们模型的第一层,负责将原始输入示例转换为特征嵌入。但是,我们对其进行了轻微更改,以便我们可以打开或关闭时间戳特征。这将使我们能够更轻松地展示时间戳特征对模型的影响。在下面的代码中,use_timestamps 参数使我们能够控制是否使用时间戳特征。

class UserModel(tf.keras.Model):

  def __init__(self, use_timestamps):
    super().__init__()

    self._use_timestamps = use_timestamps

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])

    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
          tf.keras.layers.Discretization(timestamp_buckets.tolist()),
          tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
      ])
      self.normalized_timestamp = tf.keras.layers.Normalization(
          axis=None
      )

      self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    if not self._use_timestamps:
      return self.user_embedding(inputs["user_id"])

    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
    ], axis=1)

请注意,在本教程中,我们对时间戳特征的使用与我们选择训练-测试拆分的方式之间存在不希望有的交互。因为我们随机拆分数据而不是按时间顺序拆分(以确保测试数据集中的事件发生在训练集中的事件之后),我们的模型实际上可以从未来学习。这是不现实的:毕竟,我们无法在今天使用明天的数据训练模型。

这意味着将时间特征添加到模型中,可以让它学习未来的交互模式。我们这样做仅仅是为了说明目的:MovieLens 数据集本身非常密集,与许多现实世界的数据集不同,它不会从除用户 ID 和电影标题之外的特征中获益良多。

除了这个警告之外,现实世界中的模型可能会从其他基于时间的功能中获益,例如一天中的时间或一周中的某一天,尤其是在数据具有很强的季节性模式的情况下。

候选模型

为简单起见,我们将候选模型保持不变。同样,我们从 特征化 教程中复制它

class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
          vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
      self.title_vectorizer,
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)

组合模型

定义了 UserModelMovieModel 后,我们可以将它们组合成一个组合模型,并实现我们的损失和指标逻辑。

这里我们正在构建一个检索模型。有关其工作原理的回顾,请参阅 基本检索 教程。

请注意,我们还需要确保查询模型和候选模型输出大小兼容的嵌入。因为我们将通过添加更多特征来改变它们的大小,所以最简单的方法是在每个模型之后使用一个密集投影层

class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
      UserModel(use_timestamps),
      tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
      MovieModel(),
      tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # We only pass the user id and timestamp features into the query model. This
    # is to ensure that the training inputs would have the same keys as the
    # query inputs. Otherwise the discrepancy in input structure would cause an
    # error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "user_id": features["user_id"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(query_embeddings, movie_embeddings)

实验

准备数据

我们首先将数据拆分为训练集和测试集。

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()

基线:没有时间戳特征

我们准备尝试第一个模型:让我们从不使用时间戳特征开始,以建立我们的基线。

model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 13s 215ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0097 - factorized_top_k/top_5_categorical_accuracy: 0.0208 - factorized_top_k/top_10_categorical_accuracy: 0.0310 - factorized_top_k/top_50_categorical_accuracy: 0.0937 - factorized_top_k/top_100_categorical_accuracy: 0.1620 - loss: 14575.4544 - regularization_loss: 0.0000e+00 - total_loss: 14575.4544
Epoch 2/3
40/40 [==============================] - 9s 194ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0027 - factorized_top_k/top_5_categorical_accuracy: 0.0145 - factorized_top_k/top_10_categorical_accuracy: 0.0274 - factorized_top_k/top_50_categorical_accuracy: 0.1229 - factorized_top_k/top_100_categorical_accuracy: 0.2267 - loss: 14112.4198 - regularization_loss: 0.0000e+00 - total_loss: 14112.4198
Epoch 3/3
40/40 [==============================] - 9s 180ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0024 - factorized_top_k/top_5_categorical_accuracy: 0.0159 - factorized_top_k/top_10_categorical_accuracy: 0.0319 - factorized_top_k/top_50_categorical_accuracy: 0.1421 - factorized_top_k/top_100_categorical_accuracy: 0.2579 - loss: 13928.9752 - regularization_loss: 0.0000e+00 - total_loss: 13928.9752
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 8s 155ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0035 - factorized_top_k/top_5_categorical_accuracy: 0.0226 - factorized_top_k/top_10_categorical_accuracy: 0.0427 - factorized_top_k/top_50_categorical_accuracy: 0.1742 - factorized_top_k/top_100_categorical_accuracy: 0.2950 - loss: 13707.4737 - regularization_loss: 0.0000e+00 - total_loss: 13707.4737
5/5 [==============================] - 3s 212ms/step - factorized_top_k/top_1_categorical_accuracy: 5.5000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0071 - factorized_top_k/top_10_categorical_accuracy: 0.0166 - factorized_top_k/top_50_categorical_accuracy: 0.1054 - factorized_top_k/top_100_categorical_accuracy: 0.2069 - loss: 31034.1432 - regularization_loss: 0.0000e+00 - total_loss: 31034.1432
Top-100 accuracy (train): 0.30.
Top-100 accuracy (test): 0.21.

这使我们获得了大约 0.2 的基线前 100 精度。

使用时间特征捕捉时间动态

如果我们添加时间特征,结果会改变吗?

model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 12s 227ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0142 - factorized_top_k/top_5_categorical_accuracy: 0.0285 - factorized_top_k/top_10_categorical_accuracy: 0.0393 - factorized_top_k/top_50_categorical_accuracy: 0.1045 - factorized_top_k/top_100_categorical_accuracy: 0.1739 - loss: 14540.3001 - regularization_loss: 0.0000e+00 - total_loss: 14540.3001
Epoch 2/3
40/40 [==============================] - 8s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0037 - factorized_top_k/top_5_categorical_accuracy: 0.0178 - factorized_top_k/top_10_categorical_accuracy: 0.0340 - factorized_top_k/top_50_categorical_accuracy: 0.1443 - factorized_top_k/top_100_categorical_accuracy: 0.2600 - loss: 13946.6094 - regularization_loss: 0.0000e+00 - total_loss: 13946.6094
Epoch 3/3
40/40 [==============================] - 9s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0026 - factorized_top_k/top_5_categorical_accuracy: 0.0200 - factorized_top_k/top_10_categorical_accuracy: 0.0412 - factorized_top_k/top_50_categorical_accuracy: 0.1784 - factorized_top_k/top_100_categorical_accuracy: 0.3091 - loss: 13681.3415 - regularization_loss: 0.0000e+00 - total_loss: 13681.3415
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs={'user_id': <tf.Tensor 'IteratorGetNext:2' shape=(None,) dtype=string>, 'timestamp': <tf.Tensor 'IteratorGetNext:1' shape=(None,) dtype=int64>}. Consider rewriting this model with the Functional API.
40/40 [==============================] - 8s 157ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0057 - factorized_top_k/top_5_categorical_accuracy: 0.0344 - factorized_top_k/top_10_categorical_accuracy: 0.0638 - factorized_top_k/top_50_categorical_accuracy: 0.2331 - factorized_top_k/top_100_categorical_accuracy: 0.3749 - loss: 13357.4951 - regularization_loss: 0.0000e+00 - total_loss: 13357.4951
5/5 [==============================] - 1s 225ms/step - factorized_top_k/top_1_categorical_accuracy: 9.0000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0093 - factorized_top_k/top_10_categorical_accuracy: 0.0228 - factorized_top_k/top_50_categorical_accuracy: 0.1311 - factorized_top_k/top_100_categorical_accuracy: 0.2531 - loss: 30674.3815 - regularization_loss: 0.0000e+00 - total_loss: 30674.3815
Top-100 accuracy (train): 0.37.
Top-100 accuracy (test): 0.25.

这好多了:不仅训练精度高得多,而且测试精度也得到了显著提高。

后续步骤

本教程表明,即使是简单的模型,在合并更多特征后也能变得更加准确。但是,要充分利用您的特征,通常需要构建更大、更深的模型。查看 深度检索教程,以更详细地探索这一点。