Quora is a place to gain and share knowledge - about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
Credits: Kaggle
Identify which questions asked on Quora are duplicates of questions that have already been asked.
- The cost of a misclassification can be very high.
- We want the probability that a pair of questions are duplicates, so that we can choose any threshold of our choice.
- There are no strict latency concerns.
- Interpretability is partially important.
- Log loss
- Binary confusion matrix (both metrics are sketched below)
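Both metrics are straightforward to compute with scikit-learn. A minimal sketch; the `y_true`/`y_prob` arrays here are placeholders, not the project's data:

```python
import numpy as np
from sklearn.metrics import log_loss, confusion_matrix

y_true = np.array([0, 1, 1, 0])           # placeholder ground-truth labels
y_prob = np.array([0.1, 0.8, 0.6, 0.3])   # placeholder P(is_duplicate) predictions

print(log_loss(y_true, y_prob))                               # penalizes confident mistakes heavily
print(confusion_matrix(y_true, (y_prob >= 0.5).astype(int)))  # binary matrix at a 0.5 threshold
```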
train.csv contains 5 columns: qid1, qid2, question1, question2, is_duplicate. In total there are 404,290 entries; I worked on 50k of them and split the data into train and test sets at 70% and 30%.
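A minimal sketch of the sampling and split, assuming train.csv from the Kaggle competition; the random seed and stratification are my choices, not stated above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Work on a 50k sample of the 404,290 rows, as described above.
df = pd.read_csv("train.csv").sample(n=50_000, random_state=42)

# 70/30 split, stratified on the label so both sides keep the class balance.
train_df, test_df = train_test_split(
    df, test_size=0.30, stratify=df["is_duplicate"], random_state=42
)
print(train_df.shape, test_df.shape)
```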
I derived some features from the questions, such as number_of_common_words, word_share, and some distances between the questions computed with word vectors (one way to compute those distances is sketched below). The features are discussed below. You can check my total work here.
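The write-up does not name a word-vector library, so the spaCy model below is an assumption; it averages token vectors into one vector per question, and the distances come from SciPy:

```python
import spacy
from scipy.spatial.distance import cosine, cityblock

nlp = spacy.load("en_core_web_md")  # assumed model; the md model ships with word vectors

def vector_distances(q1: str, q2: str) -> dict:
    # spaCy's doc.vector is the average of the token vectors.
    v1, v2 = nlp(q1).vector, nlp(q2).vector
    return {"cosine": cosine(v1, v2), "cityblock": cityblock(v1, v2)}

print(vector_distances("How do I learn Python?",
                       "What is the best way to learn Python?"))
```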
Basic feature extraction (before cleaning); a sketch follows this list:
- freq_qid1 = Frequency of qid1 (how many times the first question's id occurs in the dataset)
- freq_qid2 = Frequency of qid2
- q1len = Length of q1
- q2len = Length of q2
- q1_n_words = Number of words in Question 1
- q2_n_words = Number of words in Question 2
- word_Common = (Number of common unique words in Question 1 and Question 2)
- word_Total = (Total number of words in Question 1 + Total number of words in Question 2)
- word_share = (word_Common)/(word_Total)
- freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
- freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
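A sketch of the basic features, assuming a DataFrame with the train.csv columns; the groupby/transform counts give each question id's frequency across the dataset:

```python
import pandas as pd

df = pd.read_csv("train.csv")
df["question1"] = df["question1"].fillna("")
df["question2"] = df["question2"].fillna("")

df["freq_qid1"] = df.groupby("qid1")["qid1"].transform("count")
df["freq_qid2"] = df.groupby("qid2")["qid2"].transform("count")
df["q1len"] = df["question1"].str.len()
df["q2len"] = df["question2"].str.len()
df["q1_n_words"] = df["question1"].str.split().str.len()
df["q2_n_words"] = df["question2"].str.split().str.len()

def common_unique_words(row) -> int:
    # Number of unique words appearing in both questions.
    return len(set(row["question1"].lower().split()) &
               set(row["question2"].lower().split()))

df["word_Common"] = df.apply(common_unique_words, axis=1)
df["word_Total"] = df["q1_n_words"] + df["q2_n_words"]
df["word_share"] = df["word_Common"] / df["word_Total"]
df["freq_q1+freq_q2"] = df["freq_qid1"] + df["freq_qid2"]
df["freq_q1-freq_q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()
```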
Advanced feature extraction (token and fuzzy features); here q1_tokens/q2_tokens are the words obtained by splitting a question, q1_words/q2_words are the non-stopword tokens, and q1_stops/q2_stops are the stopwords. A sketch follows this list:
- cwc_min = common_word_count / min(len(q1_words), len(q2_words))
- cwc_max = common_word_count / max(len(q1_words), len(q2_words))
- csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
- csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
- ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
- ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
- last_word_eq = Check whether the last words of both questions are equal (int(q1_tokens[-1] == q2_tokens[-1]))
- first_word_eq = Check whether the first words of both questions are equal (int(q1_tokens[0] == q2_tokens[0]))
- abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
- mean_len = (len(q1_tokens) + len(q2_tokens))/2
- fuzz_ratio = similarity score (0 to 100) between the two strings, based on edit distance.
- fuzz_partial_ratio = when the two strings differ noticeably in length, the score of the best-matching substring of the shorter string against the longer one.
- token_sort_ratio = fuzz_ratio computed after sorting the tokens in each string.
- longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
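A sketch of the advanced features, using fuzzywuzzy for the fuzz scores and difflib for the longest common substring; the SAFE_DIV guard against division by zero is my addition:

```python
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz            # pip install fuzzywuzzy
from nltk.corpus import stopwords      # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))
SAFE_DIV = 1e-4  # avoids division by zero for empty word/stopword sets

def advanced_features(q1: str, q2: str) -> dict:
    q1_tokens, q2_tokens = q1.lower().split(), q2.lower().split()
    if not q1_tokens or not q2_tokens:
        return {}
    q1_words = {t for t in q1_tokens if t not in STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in STOP_WORDS}

    common_word = len(q1_words & q2_words)
    common_stop = len(q1_stops & q2_stops)
    common_token = len(set(q1_tokens) & set(q2_tokens))
    match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))

    return {
        "cwc_min": common_word / (min(len(q1_words), len(q2_words)) + SAFE_DIV),
        "cwc_max": common_word / (max(len(q1_words), len(q2_words)) + SAFE_DIV),
        "csc_min": common_stop / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "csc_max": common_stop / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "ctc_min": common_token / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
        "ctc_max": common_token / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "abs_len_diff": abs(len(q1_tokens) - len(q2_tokens)),
        "mean_len": (len(q1_tokens) + len(q2_tokens)) / 2,
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "longest_substr_ratio": match.size / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
    }
```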
- Trained a random model to establish the worst-case log loss; it scored 0.8826.
- Trained several models and tuned their hyperparameters; a sketch follows the results table below.
| Model | Log Loss |
|---|---|
| Logistic Regression | 0.4829 |
| Linear SVM | 0.5071 |
| XGBoost | 0.4050 |
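A sketch of the modeling step. X_train/y_train/X_test/y_test are placeholders for feature matrices built from the features above, the hyperparameter values are illustrative rather than the tuned values behind the table, and the write-up's exact random model may also differ:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# X_train, y_train, X_test, y_test: hypothetical feature matrices and labels.

# Random baseline: uniform random probabilities as a worst-case reference.
rng = np.random.default_rng(42)
print("random:", log_loss(y_test, rng.random(len(y_test))))

lr = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
print("logistic:", log_loss(y_test, lr.predict_proba(X_test)[:, 1]))

# LinearSVC has no predict_proba, so calibrate it to obtain probabilities.
svm = CalibratedClassifierCV(LinearSVC(C=1.0)).fit(X_train, y_train)
print("linear svm:", log_loss(y_test, svm.predict_proba(X_test)[:, 1]))

xgb = XGBClassifier(n_estimators=400, max_depth=5, learning_rate=0.1)
xgb.fit(X_train, y_train)
print("xgboost:", log_loss(y_test, xgb.predict_proba(X_test)[:, 1]))
```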
- Source: https://www.kaggle.com/c/quora-question-pairs
- Discussions: https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb/comments
- Kaggle winning solution and other approaches: https://www.dropbox.com/sh/93968nfnrzh8bp5/AACZdtsApc1QSTQc7X0H3QZ5a?dl=0
- Blog 1: https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning
- Blog 2: https://towardsdatascience.com/identifying-duplicate-questions-on-quora-top-12-on-kaggle-4c1cf93f1c30