-
Introduction
LeetCode is a popular online platform that offers a wide range of coding problems. One of its best features is a forum where users can discuss their solutions and learn from each other: they can share code, ask questions, and give feedback on one another's solutions. Overall, LeetCode is a great site for anyone who wants to improve their coding skills and prepare for coding interviews. However, the algorithm classification assigned to each problem is sometimes wrong. To address this, I propose to predict the algorithms for each problem from its title and description.
-
Related Work
BERT (Bidirectional Encoder Representations from Transformers) is a language representation model that pre-trains deep bidirectional representations from unlabeled text by conditioning on both left and right context at every layer. A pre-trained BERT model can be fine-tuned with just one additional output layer, without significant task-specific architectural modifications, to produce state-of-the-art models for many tasks; BERT achieved top results on eleven natural language processing tasks. BERT also works well for text classification, and there are already many text classification examples. However, there is no existing project on predicting the algorithm of a LeetCode problem. A large language model such as ChatGPT could also be applied, but it is not specialized for algorithm prediction and requires far more computing power. This project is not ordinary classification but multi-label classification, because one problem can be solved by several different algorithms. In this project, I fine-tune a pre-trained BERT-style model to predict, for each label, the probability that the problem belongs to it. Because no ready-made dataset exists for this task, I had to crawl the data from the web using techniques covered in the lecture, and managing the collected data is also related to the lecture material.
-
System Description
3.1. Data Crawling
On the LeetCode website, the problems are not present in the static HTML but are generated dynamically, so Selenium was used. Two browser windows were opened: one to find the list of problems and the other to crawl each problem's data. The collected data consists of the problem title, the problem description, and the algorithm classes.
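To make the crawling step concrete, a minimal Selenium sketch along these lines could drive the two windows; the URL and CSS selectors below are illustrative assumptions, since LeetCode's page structure changes over time and the exact selectors used in this project are not reproduced here.

# Sketch of the two-window crawl described above (URL and selectors are assumptions).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Window 1: load the problem list and collect links to individual problems.
driver.get("https://leetcode.com/problemset/")
problem_links = [a.get_attribute("href")
                 for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='/problems/']")]

# Window 2: visit each problem page and read title, description, and topic tags.
driver.execute_script("window.open('about:blank');")
driver.switch_to.window(driver.window_handles[1])

records = []
for link in problem_links:
    driver.get(link)
    title = driver.find_element(By.CSS_SELECTOR, "div.text-title-large").text  # assumed selector
    description = driver.find_element(By.CSS_SELECTOR, "div.elfjS").text       # assumed selector
    tags = [t.text for t in driver.find_elements(By.CSS_SELECTOR, "a[href*='/tag/']")]
    records.append({"title": title, "description": description, "labels": tags})

driver.quit()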
3.2. Model Selection
BERT does not consume raw text directly; the input is first converted by a tokenizer into a list of tokens. For example, the sentence "Here is the sentence I want embeddings for." is split into "here, is, the, sentence, I, want, em, ##bed, ##ding, ##s, for, .". The standard BERT model accepts only 512 tokens as input, and when a longer sequence arrives, the tail is usually truncated. To predict the algorithm well, the model has to read the problem to the end, so a model that can handle long text is required. Therefore, a Longformer model, which can accept up to 1024 tokens as input, was used. The code in Figure 1 downloads the yikuan8 Clinical-Longformer model.
Figure 1 Code for downloading the pretrained model
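The actual download code is given in Figure 1; a minimal sketch of that step, assuming the Hugging Face transformers library and the yikuan8/Clinical-Longformer checkpoint with a 12-label multi-label head, might look as follows.

# Sketch of downloading the pretrained model and tokenizer (assumed to match Figure 1).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "yikuan8/Clinical-Longformer"
NUM_LABELS = 12  # algorithm labels kept after label selection (Section 3.3)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid output + BCE loss per label
)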
3.3. Label Selection
As the left graph in Figure 2 shows, many labels had only a few problems. Since each label needs a certain amount of data for learning, only labels with more than 150 examples were kept for training and evaluation. The resulting proportions of the selected labels are shown in the right graph in Figure 2.
Figure 2 Label counts before and after label selection
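As a small illustration of the filtering rule, a pandas sketch could look as follows; the file and column names are assumptions.

# Keep only labels that occur in more than 150 problems (file and column names assumed).
import pandas as pd

df = pd.read_csv("leetcode_problems.csv")   # columns: title, description, labels (comma-separated)
label_counts = df["labels"].str.split(",").explode().str.strip().value_counts()
kept_labels = set(label_counts[label_counts > 150].index)

df["labels"] = df["labels"].apply(
    lambda s: [l.strip() for l in s.split(",") if l.strip() in kept_labels])
df = df[df["labels"].map(len) > 0]           # drop problems left with no kept label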
3.4. Data Augmentation
As the left graph in Figure 3 shows, there were far more problems with the 'Array' label than with any other label. This bias can adversely affect learning. In addition, although there are 12 labels, the dataset contains only 2001 problems, which is very small; more data is needed for successful training. Data augmentation was applied to address both issues. Augmentation changes only the description, because the title carries important meaning. The chosen augmentation method is token replacement: one token is replaced with the mask token, and the mask is then filled by the pre-trained Longformer model. In this way, modified sentences can be obtained without damaging the context of the original sentence. To relieve the 'Array' imbalance, augmentation is performed only on problems that do not have the 'Array' label. I wanted to enlarge the dataset with more diverse augmentation methods, but due to the performance and time limits of Colab I had to settle for this single method. As the right graph in Figure 3 shows, after augmentation the label imbalance is relieved and the amount of data is increased.
Figure 3 Label counts before and after data augmentation
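A minimal sketch of the mask-and-fill replacement described above, assuming the same Clinical-Longformer checkpoint with a masked-language-model head, is shown below.

# Replace one random token of the description with the model's top fill-mask prediction.
import random
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "yikuan8/Clinical-Longformer"   # assumed, as in Section 3.2
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

def augment_description(description, max_length=1024):
    enc = tokenizer(description, truncation=True, max_length=max_length, return_tensors="pt")
    input_ids = enc["input_ids"][0].clone()
    if input_ids.size(0) < 4:                          # too short to replace anything safely
        return description
    pos = random.randint(1, input_ids.size(0) - 2)     # skip special tokens at both ends
    input_ids[pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = mlm(input_ids=input_ids.unsqueeze(0),
                     attention_mask=enc["attention_mask"]).logits
    input_ids[pos] = logits[0, pos].argmax()           # most likely replacement token
    return tokenizer.decode(input_ids, skip_special_tokens=True)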
3.5. Training
With the augmented data, the pretrained Longformer model was fine-tuned for only 3 epochs; the BERT paper recommends 2 to 4 epochs for fine-tuning. Various hyperparameter settings were tried to obtain high accuracy. As a result, the model reached 40% accuracy on the test dataset. Considering that there are 12 classes, 40% accuracy is reasonably high.
Table 1 Training process
Figure 4 Results on the test dataset
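The exact training configuration is given in Table 1; as a rough sketch under assumed hyperparameters, fine-tuning with the Hugging Face Trainer could look as follows (the dataset objects and the exact-match accuracy metric are assumptions).

# Illustrative fine-tuning sketch; hyperparameters and dataset objects are assumptions.
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits))                          # per-label sigmoid
    preds = (probs >= 0.5).astype(int)
    return {"accuracy": (preds == labels).all(axis=1).mean()}  # all labels must match

args = TrainingArguments(
    output_dir="leetcode-longformer",
    num_train_epochs=3,                 # 3 epochs, as described above
    per_device_train_batch_size=4,      # assumed; limited by Colab memory
    learning_rate=2e-5,                 # assumed
    evaluation_strategy="epoch",
)

# model comes from Section 3.2; each example's "labels" field is assumed to be
# a 12-dimensional float multi-hot vector.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,        # augmented training set (assumed)
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()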
-
User Interaction
The user enters the title and description of a problem as a string. The string is tokenized with the same pre-trained tokenizer used during training. When the tokenized input is fed to the fine-tuned model, it outputs, for each algorithm label, the probability that the problem has that label. If the probability is at least 50%, the problem is judged to have the label, and the model prints the predicted labels.
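A hedged sketch of this interaction, reusing the tokenizer and fine-tuned model from Section 3 (the label list here is a placeholder for the 12 labels kept in Section 3.3), is shown below.

# Predict which algorithm labels apply to a user-supplied problem.
import torch

LABELS = [f"label_{i}" for i in range(12)]   # placeholder; real names come from Section 3.3

def predict_labels(title, description, threshold=0.5, max_length=1024):
    text = title + " " + description
    enc = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.sigmoid(logits)[0]          # probability for each label
    return [LABELS[i] for i, p in enumerate(probs) if p >= threshold]

title = input("Problem title: ")
description = input("Problem description: ")
print(predict_labels(title, description))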
-
Discussion
There are three factors that lower the model's performance. First, there is too little data. Most practice datasets have at least 3000 examples per label, whereas here 2001 problems are divided among 12 classes, i.e. fewer than 200 examples per class. These are most of the problems on LeetCode, so another source is needed to obtain more data; Baekjoon Online Judge and Codeforces could be used. Second, solving algorithm problems is complex. Reading a problem and guessing its algorithm requires not only understanding the sentences but also understanding algorithms, which means more data is needed than for typical practice problems. Lastly, the available computing power was too low. For the augmentation step, I wanted to apply various methods such as back translation and summarization, but these methods require too much time and computing resources, and because Colab has time and compute limits, I could not apply them.
-
Conclusion
In conclusion, LeetCode is an excellent platform for practicing coding skills and preparing for coding interviews. However, the algorithm classifications of its problems are not always accurate, and each problem must be analyzed carefully to determine the appropriate classification. To address this, I presented a BERT-based model that classifies a problem's algorithm from its title and description. More accurate algorithm classification would improve our problem-solving skills and increase our chances of success in coding interviews.