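RecLearn (Recommender Learning) consolidates and reorganizes the content of the master branch of Recommender System with TF2.0. It is a recommendation learning framework built on Python and TensorFlow 2.x, intended for students and beginners doing research. If you are more comfortable with the master branch and want to modify or update its content, you can clone the whole package and use it directly. The implemented recommendation algorithms are grouped by the two application stages used in industry: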
- matching recommendation stage
- ranking recommendation stage
23/04/2022: updated all matching (recall) models.
RecLearn is published on PyPI and can be installed with pip:
pip install reclearn
Required environment:
- Python 3.8+
- TensorFlow 2.5+ (GPU or CPU build)
- scikit-learn 0.23+
You can also clone Reclearn to your local machine:
git clone -b reclearn git@github.com:ZiyaoGeng/RecLearn.git
The example directory contains a demo for every recommendation model.
1. Split the dataset
Set the path of the dataset:
file_path = 'data/ml-1m/ratings.dat'
Split the dataset into training, validation, and test sets. If you use the movielens-1m, Amazon-Beauty, Amazon-Games, or STEAM dataset, you can call the methods in Reclearn's data/datasets/* directly to do the split:
train_path, val_path, test_path, meta_path = ml.split_seq_data(file_path=file_path)
Here meta_path is the path of the meta file, which stores the maximum user index and the maximum item index.
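As a rough illustration of what a sequential split can look like, the sketch below applies a leave-one-out rule per user: the last interaction goes to the test set, the second-to-last to the validation set, and the rest to the training set. It also records the maximum user and item index, which is the kind of information the meta file holds. The helper name leave_one_out_split and the rule itself are assumptions made for illustration; ml.split_seq_data may split the data differently.

```python
from collections import defaultdict


def leave_one_out_split(interactions):
    """Split (user, item, timestamp) tuples per user, ordered by time.

    Hypothetical helper for illustration, not part of Reclearn.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, val, test = {}, {}, {}
    max_user, max_item = 0, 0
    for user, events in by_user.items():
        # Sort by timestamp so the most recent items are held out.
        items = [item for _, item in sorted(events)]
        max_user = max(max_user, user)
        max_item = max(max_item, max(items))
        if len(items) < 3:
            # Too short to hold anything out: keep everything for training.
            train[user] = items
            continue
        train[user] = items[:-2]
        val[user] = items[-2]
        test[user] = items[-1]
    meta = {'max_user': max_user, 'max_item': max_item}
    return train, val, test, meta


ratings = [(1, 10, 100), (1, 12, 300), (1, 11, 200), (1, 13, 400), (2, 7, 50)]
train, val, test, meta = leave_one_out_split(ratings)
```

In this toy input, user 1 keeps items 10 and 11 for training, while user 2 has a single interaction and therefore appears only in the training set.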
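2. Load the data
Read the training, validation, and test sets, and generate several negative samples for each positive sample by random sampling. The data is stored as a dictionary: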
data = {'pos_item':, 'neg_item': , ['user': , 'click_seq': ,...]}
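If you are building a sequential recommendation model, you also need to include the click sequence. For the four datasets above, Reclearn provides methods for loading the data: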
# general recommendation model
train_data = ml.load_data(train_path, neg_num, max_item_num)
# sequence recommendation model, and use the user feature.
train_data = ml.load_seq_data(train_path, "train", seq_len, neg_num, max_item_num, contain_user=True)
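To make the negative-sampling step more concrete, here is a minimal sketch that draws neg_num random item ids for each positive item and returns a dictionary with the pos_item and neg_item keys described above. The helper sample_negatives, the seen_items argument, and the assumption that item ids lie in [1, max_item_num] are all illustrative; ml.load_data may sample differently.

```python
import random


def sample_negatives(pos_items, seen_items, neg_num, max_item_num, seed=0):
    """For each positive item, draw neg_num item ids that are not in seen_items.

    Hypothetical helper. Assumes item ids lie in [1, max_item_num] and that
    seen_items does not cover every id (otherwise the loop never ends).
    Negatives are drawn with replacement, so duplicates are possible.
    """
    rng = random.Random(seed)
    neg_items = []
    for _ in pos_items:
        negs = []
        while len(negs) < neg_num:
            candidate = rng.randint(1, max_item_num)
            if candidate not in seen_items:
                negs.append(candidate)
        neg_items.append(negs)
    return {'pos_item': list(pos_items), 'neg_item': neg_items}


data = sample_negatives(pos_items=[3, 5], seen_items={3, 5}, neg_num=4, max_item_num=20)
```

Each entry of data['neg_item'] lines up with the positive item at the same position, so a model can score one positive against its own negatives.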
3. Set the hyperparameters
Each model needs its hyperparameters to be specified. Taking the BPR model as an example:
model_params = {
    'user_num': max_user_num + 1,
    'item_num': max_item_num + 1,
    'embed_dim': FLAGS.embed_dim,
    'use_l2norm': FLAGS.use_l2norm,
    'embed_reg': FLAGS.embed_reg
}
4. Build and compile the model
Choose or build the model you need and compile it. Taking BPR as an example:
model = BPR(**model_params)
model.compile(optimizer=Adam(learning_rate=FLAGS.learning_rate))
If you have questions about the model structure, you can call the summary method after compiling to print it:
model.summary()
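For readers new to BPR, the pairwise objective it optimizes can be sketched in a few lines of plain Python: the score of a positive item should exceed the score of a sampled negative item for the same user. The functions dot and bpr_loss below are illustrative names only, and the sketch leaves out the embedding lookup, L2 normalization, and regularization that the hyperparameters above control. It is not Reclearn's implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(user_vec, pos_vec, neg_vec):
    """Pairwise BPR loss: -log(sigmoid(score_pos - score_neg))."""
    diff = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Equal scores: the model cannot tell the items apart.
loss_tie = bpr_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
# Positive item scored higher than the negative one: smaller loss.
loss_good = bpr_loss([1.0, 0.0], [2.0, 0.0], [0.0, 1.0])
```

The loss shrinks as the positive item's score pulls ahead of the negative item's score, which is what training pushes the embeddings toward.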
5. Train and predict
for epoch in range(1, epochs + 1):
    t1 = time()
    model.fit(
        x=train_data,
        epochs=1,
        validation_data=val_data,
        batch_size=batch_size
    )
    t2 = time()
    eval_dict = eval_pos_neg(model, test_data, ['hr', 'mrr', 'ndcg'], k, batch_size)
    print('Iteration %d Fit [%.1f s], Evaluate [%.1f s]: HR = %.4f, MRR = %.4f, NDCG = %.4f'
          % (epoch, t2 - t1, time() - t2, eval_dict['hr'], eval_dict['mrr'], eval_dict['ndcg']))
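The three metrics printed in the loop above can be illustrated for a single test case in which one positive item is ranked against its sampled negatives. The sketch below uses the standard definitions of HR@k, MRR@k, and NDCG@k for one relevant item. The helper names are assumptions, and eval_pos_neg may handle ties and averaging over users differently.

```python
import math


def rank_of_positive(pos_score, neg_scores):
    """0-based rank of the positive item among its sampled negatives."""
    return sum(1 for s in neg_scores if s > pos_score)


def metrics_at_k(pos_score, neg_scores, k):
    """HR@k, MRR@k and NDCG@k for one positive item (hypothetical helper)."""
    rank = rank_of_positive(pos_score, neg_scores)
    if rank >= k:
        # The positive item fell outside the top k.
        return {'hr': 0.0, 'mrr': 0.0, 'ndcg': 0.0}
    return {
        'hr': 1.0,
        'mrr': 1.0 / (rank + 1),
        'ndcg': 1.0 / math.log2(rank + 2),
    }


# One negative scores higher than the positive, so the positive is ranked second.
result = metrics_at_k(0.7, [0.9, 0.2, 0.4], k=10)
missed = metrics_at_k(0.1, [0.9, 0.8], k=1)
```

Averaging these per-case values over all test users gives dataset-level numbers of the kind reported in a results table.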
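For the Criteo dataset, two data-processing approaches are used: loading part of the data to train the model, or splitting the dataset so that all of the data can be used for training. For the first approach, see example/train_small_criteo_demo.py. For the second approach, see example/r_deepfm_demo.py; the steps are as follows: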
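1. Split the dataset
Calling reclearn.data.datasets.criteo.get_split_file_path(parent_path, dataset_path, sample_num) splits the dataset. sample_num sets the number of samples in each subset, and all subsets are saved under the path of the dataset. If the split was completed earlier and the sub-dataset path has not changed, the subsets can be read directly, or you can pass parent_path instead.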
sample_num = 4600000
split_file_list = get_split_file_path(dataset_path=file, sample_num=sample_num)
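The idea of cutting one large file into subsets of sample_num rows each can be sketched with the standard library alone. The helper split_file, the part_%d.txt naming, and the temporary demo file are assumptions for illustration; get_split_file_path may name and store its output differently.

```python
import os
import tempfile


def split_file(dataset_path, out_dir, sample_num):
    """Write consecutive chunks of at most sample_num lines; return their paths.

    Hypothetical helper for illustration, not part of Reclearn.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(dataset_path, 'r', encoding='utf-8') as src:
        for i, line in enumerate(src):
            if i % sample_num == 0:
                # Start a new chunk every sample_num lines.
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, 'part_%d.txt' % (i // sample_num))
                paths.append(path)
                out = open(path, 'w', encoding='utf-8')
            out.write(line)
    if out is not None:
        out.close()
    return paths


def count_lines(path):
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)


# Demo on a tiny temporary file: 5 rows split into chunks of 2.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'train.txt')
    with open(source, 'w', encoding='utf-8') as f:
        f.write(''.join('row%d\n' % i for i in range(5)))
    parts = split_file(source, os.path.join(tmp, 'split'), sample_num=2)
    part_sizes = [count_lines(p) for p in parts]
```

Reading line by line keeps memory use flat, which matters for a file too large to load at once.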
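2. Build the feature map
After splitting, map all features across the entire dataset (a static Embedding layer needs a fixed size), and bucketize the dense features to turn them into discrete features. Call get_fea_map(fea_map_path, split_file_list); the resulting map is saved as fea_map.pkl. If this step was completed earlier, you can pass the fea_map_path parameter instead.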
# If you want to make feature map.
fea_map = get_fea_map(split_file_list=split_file_list)
# Or if you want to load feature map.
# fea_map = get_fea_map(fea_map_path='data/criteo/split/fea_map.pkl')
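Both parts of this step, mapping categorical values to contiguous ids and bucketizing dense values, can be sketched as follows. The helpers build_sparse_map and bucketize and the bucket boundaries are illustrative assumptions; get_fea_map may choose its ids and buckets in another way.

```python
import bisect


def build_sparse_map(values):
    """Map each distinct category to a contiguous integer id, in first-seen order."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into.

    boundaries must be sorted in ascending order. Values below boundaries[0]
    map to bucket 0, so there are len(boundaries) + 1 buckets in total.
    """
    return bisect.bisect_right(boundaries, value)


cat_map = build_sparse_map(['a', 'b', 'a', 'c'])
bucket_ids = [bucketize(x, [1.0, 10.0, 100.0]) for x in [0.5, 1.0, 50.0, 1000.0]]
```

Building the map over the whole dataset fixes the vocabulary size up front: here an embedding table for the categorical feature would need len(cat_map) rows and the bucketized feature would need four.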
3. Load the test set
Use the last sub-dataset as the test set.
feature_columns, test_data = create_criteo_dataset(split_file_list[-1], fea_map)
4. Build the model
model = FM(feature_columns=feature_columns, **model_params)
model.summary()
model.compile(loss=binary_crossentropy, optimizer=Adam(learning_rate=learning_rate),
              metrics=[AUC()])
5. Train iteratively and validate
for file in split_file_list[:-1]:
    print("load %s" % file)
    _, train_data = create_criteo_dataset(file, fea_map)
    # TODO: Fit
    model.fit(
        x=train_data[0],
        y=train_data[1],
        epochs=1,
        batch_size=batch_size,
        validation_split=0.1
    )
    # TODO: Test
    print('test AUC: %f' % model.evaluate(x=test_data[0], y=test_data[1], batch_size=batch_size)[1])
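For reference, log loss and AUC, the two metrics used for ranking models, can be computed by hand on a small batch as sketched below. This is a plain-Python illustration with assumed helper names; the Keras AUC metric used above approximates the curve with thresholds, so its value can differ slightly from this exact pairwise computation.

```python
import math


def log_loss(labels, probs, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def auc(labels, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
batch_auc = auc(labels, probs)
batch_log_loss = log_loss(labels, probs)
```

In this batch three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.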
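The experimental setup used by Reclearn differs from that of some papers, so the results may deviate somewhat from the published ones; see Experiment for details.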
Model | ml-1m HR@10 | ml-1m MRR@10 | ml-1m NDCG@10 | Beauty HR@10 | Beauty MRR@10 | Beauty NDCG@10 | STEAM HR@10 | STEAM MRR@10 | STEAM NDCG@10 |
---|---|---|---|---|---|---|---|---|---|
BPR | 0.5768 | 0.2392 | 0.3016 | 0.3708 | 0.2108 | 0.2485 | 0.7728 | 0.4220 | 0.5054 |
NCF | 0.5834 | 0.2219 | 0.3060 | 0.5448 | 0.2831 | 0.3451 | 0.7768 | 0.4273 | 0.5103 |
DSSM | 0.5498 | 0.2148 | 0.2929 | - | - | - | - | - | - |
YoutubeDNN | 0.6737 | 0.3414 | 0.4201 | - | - | - | - | - | - |
GRU4Rec | 0.7969 | 0.4698 | 0.5483 | 0.5211 | 0.2724 | 0.3312 | 0.8501 | 0.5486 | 0.6209 |
Caser | 0.7916 | 0.4450 | 0.5280 | 0.5487 | 0.2884 | 0.3501 | 0.8275 | 0.5064 | 0.5832 |
SASRec | 0.8103 | 0.4812 | 0.5605 | 0.5230 | 0.2781 | 0.3355 | 0.8606 | 0.5669 | 0.6374 |
AttRec | 0.7873 | 0.4578 | 0.5363 | 0.4995 | 0.2695 | 0.3229 | - | - | - |
FISSA | 0.8106 | 0.4953 | 0.5713 | 0.5431 | 0.2851 | 0.3462 | 0.8635 | 0.5682 | 0.6391 |
Model | Criteo (5M samples) Log Loss | Criteo (5M samples) AUC | Criteo Log Loss | Criteo AUC |
---|---|---|---|---|
FM | 0.4765 | 0.7783 | 0.4762 | 0.7875 |
FFM | - | - | - | - |
WDL | 0.4684 | 0.7822 | 0.4692 | 0.7930 |
Deep Crossing | 0.4670 | 0.7826 | 0.4693 | 0.7935 |
PNN | - | 0.7847 | - | - |
DCN | - | 0.7823 | 0.4691 | 0.7929 |
NFM | 0.4773 | 0.7762 | 0.4723 | 0.7889 |
AFM | 0.4819 | 0.7808 | 0.4692 | 0.7871 |
DeepFM | - | 0.7828 | 0.4650 | 0.8007 |
xDeepFM | 0.4690 | 0.7839 | 0.4696 | 0.7919 |
Paper | Model | Published | Author |
---|---|---|---|
BPR: Bayesian Personalized Ranking from Implicit Feedback | MF-BPR | UAI, 2009 | Steffen Rendle |
Neural Collaborative Filtering | NCF | WWW, 2017 | Xiangnan He |
Learning Deep Structured Semantic Models for Web Search using Clickthrough Data | DSSM | CIKM, 2013 | Po-Sen Huang |
Deep Neural Networks for YouTube Recommendations | YoutubeDNN | RecSys, 2016 | Paul Covington |
Session-based Recommendations with Recurrent Neural Networks | GRU4Rec | ICLR, 2016 | Balázs Hidasi |
Self-Attentive Sequential Recommendation | SASRec | ICDM, 2018 | UCSD |
Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding | Caser | WSDM, 2018 | Jiaxi Tang |
Next Item Recommendation with Self-Attentive Metric Learning | AttRec | AAAI, 2019 | Shuai Zhang |
FISSA: Fusing Item Similarity Models with Self-Attention Networks for Sequential Recommendation | FISSA | RecSys, 2020 | Jing Lin |
Paper | Model | Published | Author |
---|---|---|---|
Factorization Machines | FM | ICDM, 2010 | Steffen Rendle |
Field-aware Factorization Machines for CTR Prediction | FFM | RecSys, 2016 | Criteo Research |
Wide & Deep Learning for Recommender Systems | WDL | DLRS, 2016 | Google Inc. |
Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features | Deep Crossing | KDD, 2016 | Microsoft Research |
Product-based Neural Networks for User Response Prediction | PNN | ICDM, 2016 | Shanghai Jiao Tong University |
Deep & Cross Network for Ad Click Predictions | DCN | ADKDD, 2017 | Stanford University / Google Inc. |
Neural Factorization Machines for Sparse Predictive Analytics | NFM | SIGIR, 2017 | Xiangnan He |
Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks | AFM | IJCAI, 2017 | Zhejiang University / National University of Singapore |
DeepFM: A Factorization-Machine based Neural Network for CTR Prediction | DeepFM | IJCAI, 2017 | Harbin Institute of Technology / Noah’s Ark Research Lab, Huawei |
xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems | xDeepFM | KDD, 2018 | University of Science and Technology of China |
Deep Interest Network for Click-Through Rate Prediction | DIN | KDD, 2018 | Alibaba Group |
If you have any suggestions or questions about the project, please leave a message in Issue.