Code and solution for the Kaggle competition Avito Duplicate Ads Detection (team luoq)
Please read solution.md
A slide deck discussing this solution
The base environment is Linux with Anaconda3.
Many extra libraries are needed to run this code. An incomplete list:
- Python libraries
- opencv3
- imagehash
- gensim
- nltk
- pystemmer
- python-Levenshtein
- datatrek: some self-made utility code
- xgboost with the Python interface
- mxnet with the Python interface
A GPU is highly recommended to run mxnet. It takes about 5 days to generate the features.
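Before starting a multi-day feature run, it may help to confirm that the libraries listed above can be found. The following is a minimal sketch, not part of the repo. It assumes the usual import names for each package (`cv2` for opencv3, `Stemmer` for pystemmer, `Levenshtein` for python-Levenshtein), and the import name `datatrek` is a guess. `REQUIRED` and `missing_packages` are hypothetical names introduced here.

```python
import importlib.util

# Package name from the list above -> module name used at import time.
# Assumption: "datatrek" is imported under its own name; the other entries
# are the usual import names for these packages.
REQUIRED = {
    "opencv3": "cv2",
    "imagehash": "imagehash",
    "gensim": "gensim",
    "nltk": "nltk",
    "pystemmer": "Stemmer",
    "python-Levenshtein": "Levenshtein",
    "datatrek": "datatrek",
    "xgboost": "xgboost",
    "mxnet": "mxnet",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be located."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed packages were found.")
```

The check only looks up each module without importing it, so it does not confirm that a GPU build of mxnet is installed.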
- extract the data (except images) to data/data_files
- copy config.example.json to config.json, then change the config to match the data dir
- change working dir to root of this repo
- run prepare_data.sh to generate features
- run leaderboad_solution.py to generate final solution
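The steps above could be scripted roughly as follows. This is a sketch under assumptions, not a script shipped with the repo: it assumes `prepare_data.sh` is run with `bash`, that `leaderboad_solution.py` is run with the current Python interpreter, and that both take no arguments. The helper names `prepare_config`, `pipeline_commands`, and `run_pipeline` are hypothetical. Extracting the data and editing `config.json` to match the data dir still have to be done by hand.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def prepare_config(repo_root):
    """Copy config.example.json to config.json unless config.json exists."""
    root = Path(repo_root)
    target = root / "config.json"
    if not target.exists():
        shutil.copy(root / "config.example.json", target)
    return target


def pipeline_commands():
    """Commands for the last two steps, in the order they should run."""
    return [
        # Assumption: the shell script is run with bash and no arguments.
        ["bash", "prepare_data.sh"],
        # Assumption: the script is run with the current interpreter.
        [sys.executable, "leaderboad_solution.py"],
    ]


def run_pipeline(repo_root):
    """Run each step from the repo root and stop at the first failure."""
    for command in pipeline_commands():
        subprocess.run(command, cwd=repo_root, check=True)
```

Calling `prepare_config(".")`, editing `config.json`, and then calling `run_pipeline(".")` from the root of the repo would mirror the manual steps. An existing `config.json` is never overwritten, so local edits survive a rerun.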