Link_prediction

Link prediction using proximity-based methods

This project was completed for COMP90051 (Statistical Machine Learning), taken in Semester 2, 2020, at the University of Melbourne.

Ranked 14th out of 132 teams. https://www.kaggle.com/c/comp90051-2020-sem2-proj1/leaderboard

Features

We tried several approaches; this README describes our final one. We implemented the following features:

Jaccard distance
Cosine distance
Adamic-Adar index
Preferential attachment
Resource allocation
Other features: followers/followees of the source and of the sink, and their common followers/followees

[feature importance]

Implementation:

We referred to some existing implementations on GitHub, but most features were easy to implement directly from their formulas using Python dictionaries. We mainly used two dictionaries: 1) one that maps each node to the nodes it follows, and 2) one that maps each node to the nodes that follow it.
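
As an illustration of the dictionary-based approach described above, here is a minimal sketch, not the project's actual code. The function names (build_dicts, neighbours, proximity_features) are hypothetical, and two choices are assumptions: the scores treat the graph as undirected by merging followers and followees, and the Jaccard and cosine values are computed as similarity coefficients (a distance would be 1 minus the value).

```python
import math
from collections import defaultdict


def build_dicts(edges):
    """Build the two lookup dictionaries from (source, sink) pairs."""
    followees = defaultdict(set)  # node -> nodes it follows
    followers = defaultdict(set)  # node -> nodes that follow it
    for src, dst in edges:
        followees[src].add(dst)
        followers[dst].add(src)
    return followees, followers


def neighbours(node, followees, followers):
    # Assumption: ignore edge direction when scoring proximity.
    return followees.get(node, set()) | followers.get(node, set())


def proximity_features(u, v, followees, followers):
    """Return the five proximity scores for the candidate edge (u, v)."""
    nu = neighbours(u, followees, followers)
    nv = neighbours(v, followees, followers)
    common = nu & nv
    union = nu | nv
    # Degree of each common neighbour, used by the two weighted scores.
    degrees = [len(neighbours(z, followees, followers)) for z in common]
    return {
        "jaccard": len(common) / len(union) if union else 0.0,
        "cosine": len(common) / math.sqrt(len(nu) * len(nv)) if nu and nv else 0.0,
        # Adamic-Adar: common neighbours weighted by 1 / log(degree).
        "adamic_adar": sum(1.0 / math.log(d) for d in degrees if d > 1),
        # Resource allocation: common neighbours weighted by 1 / degree.
        "resource_allocation": sum(1.0 / d for d in degrees),
        # Preferential attachment: product of the two neighbourhood sizes.
        "preferential_attachment": len(nu) * len(nv),
    }


edges = [(1, 2), (1, 3), (4, 2), (4, 3), (2, 5)]
followees, followers = build_dicts(edges)
features = proximity_features(1, 4, followees, followers)
```

In this toy graph, nodes 1 and 4 both follow nodes 2 and 3, so their neighbourhoods coincide and the Jaccard and cosine scores are both 1.0.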

Model:

XGBoost: powerful for classification problems. With the objective set to 'binary:logistic', it directly outputs the probability of the positive label.
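
The model setup could look roughly like the sketch below, which uses the scikit-learn style XGBClassifier from the xgboost package. Only the 'binary:logistic' objective comes from this README; the other hyperparameter values and the names PARAMS and train_and_score are hypothetical placeholders, not the settings we used.

```python
# Only the objective is taken from the README; the rest are placeholder values.
PARAMS = {
    "objective": "binary:logistic",
    "n_estimators": 200,
    "max_depth": 6,
    "learning_rate": 0.1,
}


def train_and_score(X_train, y_train, X_test, params=PARAMS):
    """Fit a classifier and return P(edge exists) for each row of X_test."""
    # Imported here so the snippet can be loaded without xgboost installed.
    from xgboost import XGBClassifier

    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    # Column 1 of predict_proba holds the probability of the positive label.
    return model.predict_proba(X_test)[:, 1]
```

Each row of X_train would hold the feature values for one candidate (source, sink) pair, with y_train set to 1 for real edges and 0 for sampled non-edges.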

Sampling data:

Random sampling of 50k positive and 50k negative examples
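
One way to draw such a balanced sample is sketched below. It assumes that positives are edges present in the training graph and negatives are random ordered node pairs with no edge between them; the function name sample_pairs and the seed argument are hypothetical and do not appear in the project code.

```python
import random


def sample_pairs(edges, n_pos, n_neg, seed=0):
    """Return (pairs, labels) with n_pos real edges and n_neg non-edges."""
    rng = random.Random(seed)
    edge_set = set(edges)
    nodes = sorted({node for edge in edge_set for node in edge})

    # Positives: a random subset of the existing edges.
    positives = rng.sample(sorted(edge_set), n_pos)

    # Negatives: random ordered pairs that are not edges (rejection sampling).
    # Note: this loops forever if n_neg exceeds the number of available non-edges.
    negatives = set()
    while len(negatives) < n_neg:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and (u, v) not in edge_set:
            negatives.add((u, v))

    pairs = positives + sorted(negatives)
    labels = [1] * n_pos + [0] * n_neg
    return pairs, labels


toy_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
pairs, labels = sample_pairs(toy_edges, n_pos=3, n_neg=3)
```

For the full-size run the call would use 50000 for both counts; rejection sampling stays cheap as long as the graph is sparse.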

RUN

To run quickly, change the parameters in def get_trained() to smaller sizes:

e.g. get_trainset(500, 500) // instead of (50000, 50000)

Result:

Final (Private leaderboard) score: 0.89480
