In the featurization tutorial we incorporated multiple features beyond just user and movie identifiers into our models, but we haven't yet explored whether those features actually improve model accuracy.
Many factors affect whether features beyond IDs are useful in a recommender model:
- Importance of context: if user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, user preferences are highly contextual, adding context will improve the model significantly. For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie over the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release, but decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
- Data sparsity: using non-ID features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle to estimate a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in cold-start situations, where relatively little data is available for some items or users.
In this tutorial, we'll experiment with using features beyond movie titles and user IDs in our MovieLens models.
Preliminaries
We first import the necessary packages.
pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import tempfile
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
We follow the featurization tutorial and keep the user ID, timestamp, and movie title features.
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")
ratings = ratings.map(lambda x: {
"movie_title": x["movie_title"],
"user_id": x["user_id"],
"timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])
We also do some housekeeping to prepare feature vocabularies.
timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))
max_timestamp = timestamps.max()
min_timestamp = timestamps.min()
timestamp_buckets = np.linspace(
min_timestamp, max_timestamp, num=1000,
)
unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
lambda x: x["user_id"]))))
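As a quick sanity check (our own addition, not part of the original tutorial), we can inspect the vocabulary sizes and see which bucket a sample timestamp falls into. Using side="right" in np.searchsorted is intended to mirror the bucketing performed by the Discretization layer below.
# Inspect the prepared vocabularies and bucket boundaries.
print(f"Unique movie titles: {len(unique_movie_titles)}")
print(f"Unique user ids: {len(unique_user_ids)}")
print(f"Timestamp range: {min_timestamp} to {max_timestamp}")
# Roughly which of the 1000 buckets does the first timestamp fall into?
print(f"Bucket index of the first timestamp: {np.searchsorted(timestamp_buckets, timestamps[0], side='right')}")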
Model definition
Query model
We start with the user model defined in the featurization tutorial as the first layer of our model, tasked with converting raw input examples into feature embeddings. However, we change it slightly so that we can turn timestamp features on or off. This will let us more easily demonstrate the effect that timestamp features have on the model. In the code below, the use_timestamps parameter controls whether we use timestamp features.
class UserModel(tf.keras.Model):
def __init__(self, use_timestamps):
super().__init__()
self._use_timestamps = use_timestamps
self.user_embedding = tf.keras.Sequential([
tf.keras.layers.StringLookup(
vocabulary=unique_user_ids, mask_token=None),
tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
])
if use_timestamps:
self.timestamp_embedding = tf.keras.Sequential([
tf.keras.layers.Discretization(timestamp_buckets.tolist()),
tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
])
self.normalized_timestamp = tf.keras.layers.Normalization(
axis=None
)
self.normalized_timestamp.adapt(timestamps)
def call(self, inputs):
if not self._use_timestamps:
return self.user_embedding(inputs["user_id"])
return tf.concat([
self.user_embedding(inputs["user_id"]),
self.timestamp_embedding(inputs["timestamp"]),
tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
], axis=1)
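To see what the flag changes, the small check below (our own addition) runs one batch of ratings through the user model with and without timestamps. With timestamps off, the output is the 32-dimensional user embedding; with timestamps on, it is the concatenation of the 32-dimensional user embedding, the 32-dimensional timestamp-bucket embedding, and the normalized timestamp, for 65 dimensions in total.
for example_batch in ratings.batch(3).take(1):
  # ID-only user representation.
  print(UserModel(use_timestamps=False)(example_batch).shape)  # (3, 32)
  # User representation with timestamp features concatenated.
  print(UserModel(use_timestamps=True)(example_batch).shape)   # (3, 65)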
Note that our use of timestamps in this tutorial interacts with our choice of training-test split in an undesirable way. Because we split our data randomly rather than chronologically (which would ensure that events in the test dataset happen later than those in the training set), our model can effectively learn from the future. This is unrealistic: after all, we cannot train a model today on data from tomorrow.
This means that adding time features allows the model to learn future interaction patterns. We do this for illustration purposes only: the MovieLens dataset itself is very dense, and unlike many real-world datasets it does not benefit greatly from features beyond user IDs and movie titles.
This caveat aside, real-world models may well benefit from other time-based features such as time of day or day of the week, especially if the data has strong seasonal patterns. One way such a feature could be wired in is sketched below.
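The following sketch is our own assumption about how a day-of-week signal could be derived directly from the Unix timestamp and given its own embedding; it is not part of the model trained in this tutorial.
# Map Unix seconds to a day-of-week index in UTC (0 = Monday, ..., 6 = Sunday).
# The Unix epoch (1970-01-01) was a Thursday, hence the +3 offset. The embedding
# width of 8 is an arbitrary choice for illustration.
day_of_week_embedding = tf.keras.Sequential([
  tf.keras.layers.Lambda(lambda ts: (ts // 86400 + 3) % 7),
  tf.keras.layers.Embedding(7, 8),
])
Its output could then be concatenated with the other feature embeddings inside UserModel.call.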
Candidate model
For simplicity, we'll keep the candidate model fixed. Again, we copy it from the featurization tutorial:
class MovieModel(tf.keras.Model):
def __init__(self):
super().__init__()
max_tokens = 10_000
self.title_embedding = tf.keras.Sequential([
tf.keras.layers.StringLookup(
vocabulary=unique_movie_titles, mask_token=None),
tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
])
self.title_vectorizer = tf.keras.layers.TextVectorization(
max_tokens=max_tokens)
self.title_text_embedding = tf.keras.Sequential([
self.title_vectorizer,
tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
tf.keras.layers.GlobalAveragePooling1D(),
])
self.title_vectorizer.adapt(movies)
def call(self, titles):
return tf.concat([
self.title_embedding(titles),
self.title_text_embedding(titles),
], axis=1)
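Analogously, a quick check (again our own addition) confirms that the candidate model concatenates the 32-dimensional title embedding with the 32-dimensional averaged text embedding into a single 64-dimensional vector:
for title_batch in movies.batch(3).take(1):
  # Title-ID embedding (32) + bag-of-words title embedding (32) = 64 dimensions.
  print(MovieModel()(title_batch).shape)  # (3, 64)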
Combined model
With both UserModel and MovieModel defined, we can put together a combined model and implement our loss and metrics logic.
Here we're building a retrieval model. For a refresher on how these work, see the basic retrieval tutorial.
Note that we also need to make sure that the query model and candidate model output embeddings of compatible size. Because we'll be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each model:
class MovielensModel(tfrs.models.Model):
def __init__(self, use_timestamps):
super().__init__()
self.query_model = tf.keras.Sequential([
UserModel(use_timestamps),
tf.keras.layers.Dense(32)
])
self.candidate_model = tf.keras.Sequential([
MovieModel(),
tf.keras.layers.Dense(32)
])
self.task = tfrs.tasks.Retrieval(
metrics=tfrs.metrics.FactorizedTopK(
candidates=movies.batch(128).map(self.candidate_model),
),
)
def compute_loss(self, features, training=False):
# We only pass the user id and timestamp features into the query model. This
# is to ensure that the training inputs would have the same keys as the
# query inputs. Otherwise the discrepancy in input structure would cause an
# error when loading the query model after saving it.
query_embeddings = self.query_model({
"user_id": features["user_id"],
"timestamp": features["timestamp"],
})
movie_embeddings = self.candidate_model(features["movie_title"])
return self.task(query_embeddings, movie_embeddings)
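Before training, we can verify (with a check of our own) that the dense projection layers do their job: both towers now emit 32-dimensional embeddings, so the retrieval task can take their dot products regardless of how many features feed into each side.
sanity_model = MovielensModel(use_timestamps=True)
for example_batch in ratings.batch(4).take(1):
  # Query tower: user + timestamp features, projected down to 32 dimensions.
  print(sanity_model.query_model({
      "user_id": example_batch["user_id"],
      "timestamp": example_batch["timestamp"],
  }).shape)                                                                # (4, 32)
  # Candidate tower: title features, also projected down to 32 dimensions.
  print(sanity_model.candidate_model(example_batch["movie_title"]).shape)  # (4, 32)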
Experiments
Preparing the data
We first split the data into a training set and a testing set.
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)
cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
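As noted earlier, this random split lets the timestamp feature peek into the future. If we wanted a leakage-free evaluation instead, one option (a sketch under our own assumptions, not used in the rest of this tutorial) is to split chronologically on a timestamp cutoff; the 80th percentile below roughly mirrors the 80/20 random split:
# Chronological split: everything up to the cutoff is training data,
# everything after it is test data.
cutoff = int(np.percentile(timestamps, 80))
chronological_train = ratings.filter(lambda x: x["timestamp"] <= cutoff)
chronological_test = ratings.filter(lambda x: x["timestamp"] > cutoff)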
Baseline: no timestamp features
We're ready to try our first model: let's start by not using timestamp features to establish our baseline.
model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
train_accuracy = model.evaluate(
cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
40/40 [==============================] - 13s 215ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0097 - factorized_top_k/top_5_categorical_accuracy: 0.0208 - factorized_top_k/top_10_categorical_accuracy: 0.0310 - factorized_top_k/top_50_categorical_accuracy: 0.0937 - factorized_top_k/top_100_categorical_accuracy: 0.1620 - loss: 14575.4544 - regularization_loss: 0.0000e+00 - total_loss: 14575.4544
Epoch 2/3
40/40 [==============================] - 9s 194ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0027 - factorized_top_k/top_5_categorical_accuracy: 0.0145 - factorized_top_k/top_10_categorical_accuracy: 0.0274 - factorized_top_k/top_50_categorical_accuracy: 0.1229 - factorized_top_k/top_100_categorical_accuracy: 0.2267 - loss: 14112.4198 - regularization_loss: 0.0000e+00 - total_loss: 14112.4198
Epoch 3/3
40/40 [==============================] - 9s 180ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0024 - factorized_top_k/top_5_categorical_accuracy: 0.0159 - factorized_top_k/top_10_categorical_accuracy: 0.0319 - factorized_top_k/top_50_categorical_accuracy: 0.1421 - factorized_top_k/top_100_categorical_accuracy: 0.2579 - loss: 13928.9752 - regularization_loss: 0.0000e+00 - total_loss: 13928.9752
40/40 [==============================] - 8s 155ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0035 - factorized_top_k/top_5_categorical_accuracy: 0.0226 - factorized_top_k/top_10_categorical_accuracy: 0.0427 - factorized_top_k/top_50_categorical_accuracy: 0.1742 - factorized_top_k/top_100_categorical_accuracy: 0.2950 - loss: 13707.4737 - regularization_loss: 0.0000e+00 - total_loss: 13707.4737
5/5 [==============================] - 3s 212ms/step - factorized_top_k/top_1_categorical_accuracy: 5.5000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0071 - factorized_top_k/top_10_categorical_accuracy: 0.0166 - factorized_top_k/top_50_categorical_accuracy: 0.1054 - factorized_top_k/top_100_categorical_accuracy: 0.2069 - loss: 31034.1432 - regularization_loss: 0.0000e+00 - total_loss: 31034.1432
Top-100 accuracy (train): 0.30.
Top-100 accuracy (test): 0.21.
This gives us a baseline top-100 accuracy of around 0.2.
Capturing time dynamics with time features
Do our results change if we add time features?
model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))
model.fit(cached_train, epochs=3)
train_accuracy = model.evaluate(
cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")
Epoch 1/3
40/40 [==============================] - 12s 227ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0142 - factorized_top_k/top_5_categorical_accuracy: 0.0285 - factorized_top_k/top_10_categorical_accuracy: 0.0393 - factorized_top_k/top_50_categorical_accuracy: 0.1045 - factorized_top_k/top_100_categorical_accuracy: 0.1739 - loss: 14540.3001 - regularization_loss: 0.0000e+00 - total_loss: 14540.3001
Epoch 2/3
40/40 [==============================] - 8s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0037 - factorized_top_k/top_5_categorical_accuracy: 0.0178 - factorized_top_k/top_10_categorical_accuracy: 0.0340 - factorized_top_k/top_50_categorical_accuracy: 0.1443 - factorized_top_k/top_100_categorical_accuracy: 0.2600 - loss: 13946.6094 - regularization_loss: 0.0000e+00 - total_loss: 13946.6094
Epoch 3/3
40/40 [==============================] - 9s 172ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0026 - factorized_top_k/top_5_categorical_accuracy: 0.0200 - factorized_top_k/top_10_categorical_accuracy: 0.0412 - factorized_top_k/top_50_categorical_accuracy: 0.1784 - factorized_top_k/top_100_categorical_accuracy: 0.3091 - loss: 13681.3415 - regularization_loss: 0.0000e+00 - total_loss: 13681.3415
40/40 [==============================] - 8s 157ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0057 - factorized_top_k/top_5_categorical_accuracy: 0.0344 - factorized_top_k/top_10_categorical_accuracy: 0.0638 - factorized_top_k/top_50_categorical_accuracy: 0.2331 - factorized_top_k/top_100_categorical_accuracy: 0.3749 - loss: 13357.4951 - regularization_loss: 0.0000e+00 - total_loss: 13357.4951
5/5 [==============================] - 1s 225ms/step - factorized_top_k/top_1_categorical_accuracy: 9.0000e-04 - factorized_top_k/top_5_categorical_accuracy: 0.0093 - factorized_top_k/top_10_categorical_accuracy: 0.0228 - factorized_top_k/top_50_categorical_accuracy: 0.1311 - factorized_top_k/top_100_categorical_accuracy: 0.2531 - loss: 30674.3815 - regularization_loss: 0.0000e+00 - total_loss: 30674.3815
Top-100 accuracy (train): 0.37.
Top-100 accuracy (test): 0.25.
This is quite a bit better: not only is the training accuracy much higher, but the test accuracy is also substantially improved.
Next steps
This tutorial shows that even simple models can become more accurate when incorporating more features. However, to get the most out of your features it is often necessary to build larger, deeper models. Have a look at the deep retrieval tutorial to explore this in more detail.
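As a taste of what that tutorial covers, a deeper query tower can be sketched (with arbitrary, untuned layer sizes of our own choosing) by stacking additional non-linear Dense layers on top of UserModel before the final 32-dimensional projection:
deep_query_model = tf.keras.Sequential([
  UserModel(use_timestamps=True),
  # Hidden layer added for extra capacity; the width is illustrative only.
  tf.keras.layers.Dense(64, activation="relu"),
  tf.keras.layers.Dense(32),
])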