Internet Traffic Classification (ITC) is a vital research area in an era of rapidly growing online services. However, the evolution of internet protocols and encryption methods makes encrypted Internet traffic harder to classify. One of the main challenges is the lack of open-source datasets and the shortcomings of the existing ones. This thesis tackles this challenge by proposing three data augmentation techniques: LSTM, Average, and MTU, each with its own advantages and drawbacks.
This repository consists of four directories:
- code - contains the scripts for the project
- data - provides sample data from open-source datasets
- data converter - includes code to transform the data to the required format
- images - stores the images used in the README file
To run the code provided in this project, run the Main.py file with the following arguments:
Required arguments:
- data_dir: The directory containing the training data.
- augmentation: The augmentation method to use. Supported values are lstm, average, and mtu.
Optional arguments:
- --test_split: The fraction of the data to use for the test set. Default is 0.2.
- --val_split: The fraction of the data to use for the validation set. Default is 0.2.
- --batch_size: The batch size to use for training. Default is 32.
- --split: The split of the flow. Default is 16.
- --max_len: The maximum length of a flow. Default is 32.
- --data_max_size: The maximum number of data points to use. Default is -1 (use all data points).
- --avg_n: The number of data points to average for the Average augmentation. Default is 2.
- --th_min: The minimum threshold for the MTU augmentation. Default is 750.
- --th_max: The maximum threshold for the MTU augmentation. Default is 1200.
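The arguments listed above map naturally onto Python's argparse module. The following is a minimal sketch of such a parser, not the actual contents of Main.py: the function name build_parser and the argument types (for example, integers for the MTU thresholds) are assumptions made for illustration, while the argument names and defaults are taken from the lists above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Train a model with data augmentation.")
    # Required positional arguments
    parser.add_argument("data_dir", help="Directory containing the training data.")
    parser.add_argument("augmentation", choices=["lstm", "average", "mtu"],
                        help="Augmentation method to use.")
    # Optional arguments with the documented defaults (types are assumed)
    parser.add_argument("--test_split", type=float, default=0.2)
    parser.add_argument("--val_split", type=float, default=0.2)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--split", type=int, default=16)
    parser.add_argument("--max_len", type=int, default=32)
    parser.add_argument("--data_max_size", type=int, default=-1)
    parser.add_argument("--avg_n", type=int, default=2)
    parser.add_argument("--th_min", type=int, default=750)
    parser.add_argument("--th_max", type=int, default=1200)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(["../data/", "lstm", "--batch_size", "64"])
print(args.augmentation, args.batch_size, args.max_len)
```

Restricting augmentation with choices makes argparse reject any value other than lstm, average, or mtu before training starts.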
To train a model using the LSTM augmentation on the provided data, run the following command:
python Main.py ../data/ lstm
This will train a model using the default batch size (32) and data_max_size (-1). To use different values, specify them as command line arguments. For example, to train a model using a batch size of 64 and data_max_size of 100, run the following command:
python Main.py ../data/ lstm --batch_size 64 --data_max_size 100
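To give a feel for what the --avg_n option controls, here is a rough sketch of averaging-based augmentation. It is not the implementation used in this repository: the function name average_augment, the n_new and seed parameters, and the assumption that the averaged flows belong to the same class and have equal length are all hypothetical.

```python
import random
from typing import List, Sequence


def average_augment(flows: Sequence[Sequence[float]], avg_n: int = 2,
                    n_new: int = 1, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic flows, each the element-wise mean of avg_n flows.

    Assumption: all flows share one class label and have the same length.
    """
    if avg_n < 1 or avg_n > len(flows):
        raise ValueError("avg_n must be between 1 and the number of flows")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        chosen = rng.sample(list(flows), avg_n)  # avg_n distinct flows
        # zip(*chosen) walks the chosen flows position by position
        synthetic.append([sum(values) / avg_n for values in zip(*chosen)])
    return synthetic


# Two toy flows of packet sizes; averaging both gives their midpoint
flows = [[100.0, 200.0, 300.0], [300.0, 400.0, 500.0]]
print(average_augment(flows, avg_n=2, n_new=1))  # [[200.0, 300.0, 400.0]]
```

A larger avg_n blends more flows into each synthetic sample, which smooths out the values at each position.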
To run the tests provided in this project, run the Tests.py file with the following arguments:
Required arguments:
- data_dir: The directory containing the test data.
- augmentation: The augmentation method to test. Supported values are lstm, average, and mtu.
Optional arguments:
- --split: The split of the flow. Default is 16.
- --max_len: The maximum length of a flow. Default is 32.
- --avg_n: The number of data points to average for the Average augmentation. Default is 2.
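A fixed maximum flow length such as --max_len usually means that longer flows are cut and shorter ones are padded so every sample has the same shape. The sketch below illustrates that idea only; the function name fit_to_max_len, the zero padding value, and right-side padding are assumptions, and the repository's own preprocessing may differ.

```python
from typing import List, Sequence


def fit_to_max_len(flow: Sequence[int], max_len: int = 32, pad_value: int = 0) -> List[int]:
    """Truncate a flow to max_len values, or right-pad it with pad_value."""
    trimmed = list(flow[:max_len])
    return trimmed + [pad_value] * (max_len - len(trimmed))


# A short flow is padded up to the requested length
print(fit_to_max_len([1500, 60, 60], max_len=5))  # [1500, 60, 60, 0, 0]
# A long flow is cut down to the default of 32 values
print(len(fit_to_max_len(list(range(40)))))  # 32
```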
To test the model generated with the LSTM augmentation on the provided data, run the following command:
python Tests.py ../data/ lstm
In the paper we tested our data augmentation on three different datasets.
- First dataset:
  - Extracted from: Data
  - The data was generated by: Tong, V., Tran, H. A., Souihi, S., & Mellouk, A. (2018, December). A novel QUIC traffic classifier based on convolutional neural networks. In 2018 IEEE Global Communications Conference (GLOBECOM) (pp. 1-6). IEEE. Article
- Second dataset:
  - Extracted from: Data
  - The data was generated by: Rezaei, S., & Liu, X. (2018). How to achieve high classification accuracy with just a few labels: A semi-supervised approach using sampled packets. arXiv preprint arXiv:1812.09761. Article
- Third dataset:
  - Consists of real-world data captured in 2023; it is a commercial dataset.
  - To request this dataset, please email me at [email protected].
This code converts data into the format required by the project's scripts. The code was originally implemented by FlowPic and has been modified to fit our specific needs.
For more details on the code, please visit the provided link.
- keras>=2.13.1
- numpy>=1.24.3
- scikit_learn>=1.3.0
- tensorflow>=2.13.0
To install the required modules using pip:
pip install -r requirements.txt
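After installing, it can be handy to confirm that the environment meets the minimum versions listed above. The following is a small optional sketch using the standard library's importlib.metadata; the helper names as_tuple and find_problems are hypothetical, and the version comparison is deliberately simple (the packaging library handles the full version grammar if that is needed).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Minimum versions copied from the requirements list above.
# pip treats scikit_learn and scikit-learn as the same project name.
REQUIREMENTS = {
    "keras": "2.13.1",
    "numpy": "1.24.3",
    "scikit-learn": "1.3.0",
    "tensorflow": "2.13.0",
}


def as_tuple(ver: str) -> Tuple[int, ...]:
    """Turn '2.13.1' into (2, 13, 1); stop at the first non-numeric part."""
    parts = []
    for piece in ver.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def find_problems(requirements: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return packages that are missing (None) or older than required."""
    problems: Dict[str, Optional[str]] = {}
    for name, minimum in requirements.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = None  # not installed at all
            continue
        if as_tuple(installed) < as_tuple(minimum):
            problems[name] = installed  # installed but too old
    return problems


print(find_problems(REQUIREMENTS) or "All requirements satisfied")
```

Any package reported with None is missing, and any package reported with a version string is older than the listed minimum.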