Code Classifier App

Description

The aim is to develop an application that uses machine learning to detect the programming language of a code snippet. It would add value if the app also offered some interpretation, such as a confidence score and the key features that led to the classification.

Initial Thoughts and Ideas

All programming languages follow certain structures and rules, so one possible approach is to write an algorithm or heuristics that detect those rules. Writing such rules for 30+ languages, however, is not feasible. Detecting languages based on keywords or file extensions is also ruled out, since somebody might pass Python code with a .txt extension.
Data credits - We got the data thanks to similar work by @andreasjansson (refer link).

Experiments Details

Refer to the IPython notebook for details.

We started with relatively small snippets (250-500 characters long) to keep the data balanced and to tune parameters quickly. After that, we fed the whole training data for all languages (around 10000 snippets) to the selected classifier.
We used TF-IDF bi-gram vectors, with the number of features capped at 120000. While fitting the TF-IDF vectors, we combined the training snippets into one document per language instead of keeping every snippet as a separate document.
Since symbols and punctuation can be important features in code, we wrote a custom tokenizer that tokenizes words and symbols and also keeps runs of a repeated symbol, like === or ++, together as a single token.
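
The preprocessing described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the repository: the regular expression and the names tokenize, bigrams and group_by_language are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed token pattern (not taken from the repository):
#   identifiers/keywords, digit runs, or a symbol followed by repeats of itself.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|([^\w\s])\1*")


def tokenize(code):
    """Split code into words and symbols, keeping runs like === or ++ whole."""
    return [match.group(0) for match in TOKEN_RE.finditer(code)]


def bigrams(tokens):
    """Join each pair of adjacent tokens into one bi-gram string."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def group_by_language(snippets):
    """Merge (language, code) pairs into one document per language."""
    grouped = defaultdict(list)
    for language, code in snippets:
        grouped[language].append(code)
    return {language: "\n".join(parts) for language, parts in grouped.items()}


# Hypothetical sample data, for illustration only.
sample = [("javascript", "if (a === b) { i++; }"), ("python", "x += 1")]
docs = group_by_language(sample)
js_tokens = tokenize(docs["javascript"])
print(js_tokens)
print(bigrams(js_tokens)[:3])
```

Note that only runs of the same symbol are merged, so += stays as two tokens in this sketch. A tokenizer like this could be passed to scikit-learn's TfidfVectorizer through its tokenizer parameter, with ngram_range and max_features set to match the description above; the exact settings used in the notebook are not restated here.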

Results

Classifier Name	Mean Fit Time(s)	Mean Test Time(s)	Mean Train Score	Mean CV Score	Best Test Score
Logistic Regression	44.128	0.102	0.875	0.738	0.824
KNeighbors Classifier	0.037	21.517	0.908	0.788	0.798
SVC Classifier	856.900	244.200	0.585	0.481	0.814
SGD Classifier	6.500	0.200	0.742	0.677	0.829
MultinomialNB Classifier	0.400	0.100	0.892	0.801	0.810
XGB Classifier**	1895.000	21.600	0.938	0.785	0.797

Because of its low fit and test times and its reasonably good test score, we selected the good old SGD Classifier.
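
One way to make this selection explicit is to filter the table by a time budget and then pick the highest best test score. The sketch below copies the numbers from the results table; the 10-second budget (MAX_TOTAL_SECONDS) is a hypothetical threshold chosen for illustration, not a criterion stated by the authors.

```python
# (name, mean fit time in s, mean test time in s, best test score),
# copied from the results table.
results = [
    ("Logistic Regression", 44.128, 0.102, 0.824),
    ("KNeighbors Classifier", 0.037, 21.517, 0.798),
    ("SVC Classifier", 856.900, 244.200, 0.814),
    ("SGD Classifier", 6.500, 0.200, 0.829),
    ("MultinomialNB Classifier", 0.400, 0.100, 0.810),
    ("XGB Classifier", 1895.000, 21.600, 0.797),
]

# Hypothetical budget for fit time plus test time (an assumption for this sketch).
MAX_TOTAL_SECONDS = 10.0

# Keep only classifiers that fit and predict within the budget.
fast = [row for row in results if row[1] + row[2] <= MAX_TOTAL_SECONDS]

# Among the fast classifiers, pick the one with the highest best test score.
best = max(fast, key=lambda row: row[3])
print(best[0], best[3])
```

Under this budget only the SGD and MultinomialNB classifiers remain, and SGD has the higher best test score of the two.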

Close analysis of the confusion matrix shows that the model performs poorly on closely related programming languages such as CPP and C, or Java and Csharp. If the import statements are removed and only a few lines of code are kept, it is also non-trivial to quickly tell such close languages apart. Snippets in the training data that embed Javascript or CSS in HTML are another cause of misclassification.
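
A quick way to surface such pairs is to count the off-diagonal entries of the confusion matrix directly from the true and predicted labels. The sketch below is a minimal standard-library illustration; the function name most_confused_pairs and the sample labels are hypothetical and are not results from the notebook.

```python
from collections import Counter


def most_confused_pairs(true_labels, predicted_labels, top_n=3):
    """Return the most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (true, predicted)
        for true, predicted in zip(true_labels, predicted_labels)
        if true != predicted
    )
    return errors.most_common(top_n)


# Hypothetical labels, for illustration only.
y_true = ["c", "c", "cpp", "java", "csharp", "c"]
y_pred = ["cpp", "cpp", "c", "java", "java", "c"]

for (true, predicted), count in most_confused_pairs(y_true, y_pred):
    print(f"{true} predicted as {predicted}: {count}")
```

Applied to held-out predictions, a listing like this would show whether errors concentrate on pairs such as CPP and C, as the confusion matrix suggests.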

Usage

Use conda and install the packages listed in the requirements.txt file; they will pull in all other required packages.
Activate the conda environment, clone the repo, and cd into the directory.
Extract the file .\data\snippets.zip to get the pickled data. Run python CodeClassifierTrain.py if you want to re-train the classifier locally; otherwise, use the existing model as it is.
Run python CodeClassifier.py to test whether the classifier and models are working.
Run python mainApp.py to launch the Python GUI application.

App Preview

TODO

Considering the current results as a baseline, explore deep learning approaches
Better UI/UX and added functionality in the app

Please feel free to add and contribute
