Recommending movies: retrieval using a sequential model

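In this tutorial, we build a sequential retrieval model. Sequential recommendation is a popular approach that looks at the sequence of items a user has previously interacted with and predicts the next item. The order of the items within each sequence matters, so we use a recurrent neural network to model the sequential relationship. For more details, see the GRU4Rec paper.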

Imports

First, let's get our dependencies and imports out of the way.

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

Preparing the dataset

Next, we need to prepare our dataset. We use the data generation utility from the TensorFlow Lite on-device recommendation reference app.

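The MovieLens 1M data contains ratings.dat (columns: UserID, MovieID, Rating, Timestamp) and movies.dat (columns: MovieID, Title, Genres). The example generation script downloads the 1M dataset, reads both files, keeps only ratings that meet the --min_rating=2 threshold, forms a movie interaction timeline for each user, samples activities as labels, and uses the 10 previous user activities as the context for prediction.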

wget -nc https://raw.githubusercontent.com/tensorflow/examples/master/lite/examples/recommendation/ml/data/example_generation_movielens.py
python -m example_generation_movielens  --data_dir=data/raw  --output_dir=data/examples  --min_timeline_length=3  --max_context_length=10  --max_context_movie_genre_length=10  --min_rating=2  --train_data_fraction=0.9  --build_vocabs=False
I1214 12:39:51.542600 139789676263232 example_generation_movielens.py:460] Downloading and extracting data.
Downloading data from https://files.grouplens.org/datasets/movielens/ml-1m.zip
I1214 12:39:52.073689 139789676263232 example_generation_movielens.py:406] Reading data to dataframes.
I1214 12:39:56.795858 139789676263232 example_generation_movielens.py:408] Generating movie rating user timelines.
I1214 12:39:59.942978 139789676263232 example_generation_movielens.py:410] Generating train and test examples.
I1214 12:41:38.911691 139789676263232 example_generation_movielens.py:473] Generated dataset: {'train_size': 844195, 'test_size': 93799, 'train_file': 'data/examples/train_movielens_1m.tfrecord', 'test_file': 'data/examples/test_movielens_1m.tfrecord'}
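
The timeline-to-example step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the script's actual code: the function name `build_examples`, right-padding, and the padding ID 0 are hypothetical choices made here for clarity.

```python
from typing import Dict, List, Tuple


def build_examples(
    timeline: List[Tuple[int, int]],  # (timestamp, movie_id) pairs, any order
    max_context_length: int = 10,
    min_timeline_length: int = 3,
    pad_id: int = 0,  # assumption: 0 marks "no movie"
) -> List[Dict[str, List[int]]]:
    """Turns one user's timeline into (context, label) examples."""
    # Sort by timestamp so the context really is "what came before".
    ordered = [movie_id for _, movie_id in sorted(timeline)]
    if len(ordered) < min_timeline_length:
        return []
    examples = []
    for label_idx in range(1, len(ordered)):
        # Up to `max_context_length` movies seen right before the label.
        start = max(0, label_idx - max_context_length)
        context = ordered[start:label_idx]
        # Pad short contexts so every example has the same length.
        context = context + [pad_id] * (max_context_length - len(context))
        examples.append({
            "context_movie_id": context,
            "label_movie_id": [ordered[label_idx]],
        })
    return examples


timeline = [(30, 593), (10, 908), (20, 1086), (40, 247)]
examples = build_examples(timeline, max_context_length=3)
for example in examples:
    print(example)
```

Each position in the sorted timeline after the first becomes a label, and the movies immediately before it become its context. A fixed context length is what lets the examples be parsed later with a fixed feature shape.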

Here is a sample of the generated dataset.

0 : {
  features: {
    feature: {
      key  : "context_movie_id"
      value: { int64_list: { value: [ 1124, 2240, 3251, ..., 1268 ] } }
    }
    feature: {
      key  : "context_movie_rating"
      value: { float_list: {value: [ 3.0, 3.0, 4.0, ..., 3.0 ] } }
    }
    feature: {
      key  : "context_movie_year"
      value: { int64_list: { value: [ 1981, 1980, 1985, ..., 1990 ] } }
    }
    feature: {
      key  : "context_movie_genre"
      value: { bytes_list: { value: [ "Drama", "Drama", "Mystery", ..., "UNK" ] } }
    }
    feature: {
      key  : "label_movie_id"
      value: { int64_list: { value: [ 3252 ] }  }
    }
  }
}

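You can see that it includes a sequence of context movie IDs and a label movie ID (the next movie), plus context features such as movie year, rating, and genre.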

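In this example we use only the sequence of context movie IDs and the label movie ID. See the Leveraging context features tutorial to learn more about adding other context features.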

train_filename = "./data/examples/train_movielens_1m.tfrecord"
train = tf.data.TFRecordDataset(train_filename)

test_filename = "./data/examples/test_movielens_1m.tfrecord"
test = tf.data.TFRecordDataset(test_filename)

feature_description = {
    'context_movie_id': tf.io.FixedLenFeature([10], tf.int64, default_value=np.repeat(0, 10)),
    'context_movie_rating': tf.io.FixedLenFeature([10], tf.float32, default_value=np.repeat(0, 10)),
    'context_movie_year': tf.io.FixedLenFeature([10], tf.int64, default_value=np.repeat(1980, 10)),
    'context_movie_genre': tf.io.FixedLenFeature([10], tf.string, default_value=np.repeat("Drama", 10)),
    'label_movie_id': tf.io.FixedLenFeature([1], tf.int64, default_value=0),
}

def _parse_function(example_proto):
  # Decode one serialized tf.Example into a dict of fixed-length tensors.
  return tf.io.parse_single_example(example_proto, feature_description)

def _select_id_features(x):
  # Keep only the ID features and cast them to strings for StringLookup.
  return {
      "context_movie_id": tf.strings.as_string(x["context_movie_id"]),
      "label_movie_id": tf.strings.as_string(x["label_movie_id"]),
  }

train_ds = train.map(_parse_function).map(_select_id_features)
test_ds = test.map(_parse_function).map(_select_id_features)

for x in train_ds.take(1).as_numpy_iterator():
  pprint.pprint(x)
{'context_movie_id': array([b'908', b'1086', b'1252', b'2871', b'3551', b'593', b'247', b'608',
       b'1358', b'866'], dtype=object),
 'label_movie_id': array([b'190'], dtype=object)}

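Our train and test datasets now contain only a sequence of historical movie IDs and the label for the next movie ID. Note that we use [10] as the feature shape during tf.Example parsing because we specified 10 as the context length in the example generation step.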

We need one more thing before we can start building the model: the vocabulary of movie IDs.

movies = tfds.load("movielens/1m-movies", split='train')
movies = movies.map(lambda x: x["movie_id"])
movie_ids = movies.batch(1_000)
unique_movie_ids = np.unique(np.concatenate(list(movie_ids)))
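
To show what the vocabulary is used for, here is a plain-Python stand-in for the lookup that a StringLookup layer performs when it is built with mask_token=None and its default single out-of-vocabulary (OOV) bucket: unknown tokens map to index 0 and vocabulary tokens map to 1..N. The helper names `build_lookup` and `lookup` and the three-movie vocabulary are hypothetical; this is a sketch of the mapping, not the Keras implementation.

```python
from typing import Dict, List


def build_lookup(vocabulary: List[str]) -> Dict[str, int]:
    """Maps each vocabulary token to an index starting at 1 (0 is for OOV)."""
    return {token: index for index, token in enumerate(vocabulary, start=1)}


def lookup(table: Dict[str, int], tokens: List[str],
           oov_index: int = 0) -> List[int]:
    """Looks up each token, sending unknown tokens to the OOV index."""
    return [table.get(token, oov_index) for token in tokens]


vocabulary = sorted({"908", "1086", "593"})  # stand-in for unique_movie_ids
table = build_lookup(vocabulary)

# "0" plays the role of a padded context slot that is not a real movie ID.
indices = lookup(table, ["593", "908", "0", "1086"])

# One embedding row per vocabulary token, plus one row for the OOV index.
num_embedding_rows = len(vocabulary) + 1
print(indices, num_embedding_rows)
```

Because index 0 is reserved for OOV tokens, an embedding table that sits after the lookup needs one more row than the vocabulary has entries.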

Implementing a sequential model

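In our basic retrieval tutorial, we use one query tower for the user and one candidate tower for the candidate movie. However, the two-tower architecture is general and is not limited to <user, item> pairs. You can also use it for item-to-item recommendation, as noted in the basic retrieval tutorial.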

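Here we still use the two-tower architecture. Specifically, we use a query tower with a Gated Recurrent Unit (GRU) layer to encode the sequence of historical movies, and keep the same candidate tower to represent the candidate movie.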

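embedding_dimension = 32

# Query tower: map each movie ID in the watch history to an embedding,
# then let a GRU summarize the ordered sequence into a single vector.
query_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_movie_ids, mask_token=None),
    # +1 row for the out-of-vocabulary index added by StringLookup.
    tf.keras.layers.Embedding(len(unique_movie_ids) + 1, embedding_dimension),
    tf.keras.layers.GRU(embedding_dimension),
])

# Candidate tower: a plain embedding lookup for the candidate movie ID.
candidate_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
        vocabulary=unique_movie_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_movie_ids) + 1, embedding_dimension),
])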
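
To make the role of the GRU concrete, the following dependency-free sketch runs a textbook GRU cell over a short sequence and shows that the final state depends on the order of the inputs. It is a simplified illustration, not the Keras layer: biases are omitted, the weights are random, and the class name `TinyGRU` and all dimensions are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Textbook GRU cell without biases; for illustration only."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One input matrix (W) and one recurrent matrix (U) for each of the
        # update gate (z), reset gate (r), and candidate state (h).
        self.W: Dict[str, Matrix] = {
            g: random_matrix(hidden_dim, input_dim, rng) for g in "zrh"}
        self.U: Dict[str, Matrix] = {
            g: random_matrix(hidden_dim, hidden_dim, rng) for g in "zrh"}

    def _pre(self, gate: str, x: Vector, h: Vector) -> Vector:
        # Pre-activation: W[gate] @ x + U[gate] @ h.
        return [a + b for a, b in
                zip(matvec(self.W[gate], x), matvec(self.U[gate], h))]

    def step(self, x: Vector, h: Vector) -> Vector:
        z = [sigmoid(v) for v in self._pre("z", x, h)]
        r = [sigmoid(v) for v in self._pre("r", x, h)]
        # The reset gate scales the old state before it feeds the candidate.
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in self._pre("h", x, gated_h)]
        # The update gate blends the old state with the candidate state.
        return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]

    def encode(self, sequence: List[Vector]) -> Vector:
        h = [0.0] * self.hidden_dim
        for x in sequence:
            h = self.step(x, h)
        return h  # final state = one vector summarizing the sequence


gru = TinyGRU(input_dim=3, hidden_dim=4)
history = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
forward = gru.encode(history)
backward = gru.encode(history[::-1])
print(len(forward), forward != backward)
```

The query tower relies on the same property: a sequence of per-movie embeddings goes in, and a single fixed-size vector that reflects their order comes out, ready to be compared with candidate embeddings.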

The metrics, task, and full model are defined in much the same way as in the basic retrieval model.

metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(candidate_model)
)

task = tfrs.tasks.Retrieval(
  metrics=metrics
)

class Model(tfrs.Model):

    def __init__(self, query_model, candidate_model):
        super().__init__()
        self._query_model = query_model
        self._candidate_model = candidate_model

        self._task = task

    def compute_loss(self, features, training=False):
        watch_history = features["context_movie_id"]
        watch_next_label = features["label_movie_id"]

        query_embedding = self._query_model(watch_history)       
        candidate_embedding = self._candidate_model(watch_next_label)

        return self._task(query_embedding, candidate_embedding, compute_metrics=not training)
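
The kind of loss a retrieval task optimizes can be sketched without any framework. The version below assumes a softmax cross-entropy over in-batch dot-product scores, summed over the batch, where the candidate at the same batch position is the positive and all other candidates in the batch act as negatives. That default is an assumption here, and the function name `in_batch_softmax_loss` is hypothetical; the library's task also offers options (for example temperature scaling) that this sketch omits.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_batch_softmax_loss(queries: List[Vector],
                          candidates: List[Vector]) -> float:
    """Sums cross-entropy over the batch; row i's positive is candidate i."""
    total = 0.0
    for i, query in enumerate(queries):
        scores = [dot(query, candidate) for candidate in candidates]
        # Numerically stable log-sum-exp over all in-batch candidates.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the matching candidate.
        total += log_norm - scores[i]
    return total


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]   # each query matches its own candidate
swapped = [[0.0, 2.0], [2.0, 0.0]]   # each query matches the wrong candidate

aligned_loss = in_batch_softmax_loss(queries, aligned)
swapped_loss = in_batch_softmax_loss(queries, swapped)
print(aligned_loss, swapped_loss)
```

The loss is small when each query embedding scores its own candidate highest and large otherwise. With a summed loss, a batch of B examples whose scores are all equal gives B·ln B, so large raw loss values are expected at large batch sizes.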

Fitting and evaluating

We can now compile, train, and evaluate our sequential retrieval model.

model = Model(query_model, candidate_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
cached_train = train_ds.shuffle(10_000).batch(12800).cache()
cached_test = test_ds.batch(2560).cache()
model.fit(cached_train, epochs=3)
Epoch 1/3
67/67 [==============================] - 18s 220ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 108359.6299 - regularization_loss: 0.0000e+00 - total_loss: 108359.6299
Epoch 2/3
67/67 [==============================] - 3s 38ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 101734.1007 - regularization_loss: 0.0000e+00 - total_loss: 101734.1007
Epoch 3/3
67/67 [==============================] - 3s 38ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 99763.0675 - regularization_loss: 0.0000e+00 - total_loss: 99763.0675
<keras.callbacks.History at 0x7fe89c26edf0>
model.evaluate(cached_test, return_dict=True)
37/37 [==============================] - 10s 221ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0144 - factorized_top_k/top_5_categorical_accuracy: 0.0775 - factorized_top_k/top_10_categorical_accuracy: 0.1347 - factorized_top_k/top_50_categorical_accuracy: 0.3712 - factorized_top_k/top_100_categorical_accuracy: 0.5030 - loss: 15530.8248 - regularization_loss: 0.0000e+00 - total_loss: 15530.8248
{'factorized_top_k/top_1_categorical_accuracy': 0.014403138309717178,
 'factorized_top_k/top_5_categorical_accuracy': 0.07749549299478531,
 'factorized_top_k/top_10_categorical_accuracy': 0.13472424447536469,
 'factorized_top_k/top_50_categorical_accuracy': 0.37120863795280457,
 'factorized_top_k/top_100_categorical_accuracy': 0.5029690861701965,
 'loss': 9413.7470703125,
 'regularization_loss': 0,
 'total_loss': 9413.7470703125}
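
The top-k accuracy figures reported above measure how often the true next movie is among the k highest-scoring candidates for a query. A minimal sketch of that computation, using a brute-force ranking over a tiny hypothetical candidate set (the function name `top_k_accuracy` and all vectors are made up for illustration; the library computes the metric over the full candidate dataset):

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_accuracy(queries: List[Vector], candidates: List[Vector],
                   labels: List[int], k: int) -> float:
    """Fraction of queries whose true candidate index ranks in the top k."""
    hits = 0
    for query, label in zip(queries, labels):
        scores = [dot(query, candidate) for candidate in candidates]
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(candidates)),
                        key=lambda j: scores[j], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [-0.5, 0.1]]
labels = [2, 1, 0]  # index of the true next item for each query

top1 = top_k_accuracy(queries, candidates, labels, k=1)
top2 = top_k_accuracy(queries, candidates, labels, k=2)
top4 = top_k_accuracy(queries, candidates, labels, k=4)
print(top1, top2, top4)
```

Accuracy can only stay the same or grow as k increases, which matches the pattern in the evaluation output, where the top-100 figure is the largest.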

This concludes the sequential retrieval tutorial.