We provide a Docker image to support the artifact evaluation (AE) of our PLDI 2021 paper (COMFORT) on JavaScript conformance testing.
Our Docker image contains reduced-size datasets for evaluating our GPT-2-based test program generator, test case mutation and reduction, and differential tester. The full dataset is quite large (>100 GB uncompressed), and we are working on a method for sharing it with the community. This directory therefore contains minimal working examples that can be evaluated in a reasonable amount of time. All of our code and data will be open-sourced upon publication and have been developed with extensibility as a primary goal.
Disclaimer: Although we have worked hard to ensure our AE scripts are robust, our tool remains a research prototype. It can still have glitches when used in complex, real-life settings. If you discover any bugs, please raise an issue, describing how you ran the program and what problem you encountered. We will get back to you ASAP. Thank you.
The main result of the paper is a list of bugs exposed by COMFORT-generated test cases and by other competing methods.
We package our artifact within a Docker image so that it runs "out of the box". A reduced-size Docker image can be downloaded from here. Our Docker image was tested on host machines running Ubuntu 18.04 and Windows 10.
The Docker image contains the following scripts for evaluation.
- 01_evaluate_generator.py: A demonstration of training a GPT-2 model to generate JS programs.
- 02_evaluate_mutator.py: A demonstration of test program mutation. This demonstrates how we mutate the test JS programs generated by the previous demonstration.
- 03_evaluate_harness.py: A demonstration of our differential testing approach. The results from the prior steps are used to perform differential testing on one JS testbed (engine).
- 04_coverage_calculate.py: A demonstration of the code coverage statistics, showing the quality of the test cases generated by different fuzzers.
- 05_testcase_reducing.py: A demonstration of test case reduction.
If you wish to use an NVIDIA GPU on the host machine (running Ubuntu 18.04) to execute the AE, please follow the instructions below to set up the GPU execution environment:
Copy this bash script and run the following command in the host environment with sudo permission:
bash nvidia-container-runtime-script.sh
Note that this step may break the existing GPU and Docker setup of the host machine.
Next, test if the GPU running environment is successfully configured:
docker run --help | grep -i gpus
If the environment is configured correctly, the output should list the `--gpus` option.
Please note that the above steps for configuring the GPU environment were only tested on a host machine running Ubuntu 18.04. They may throw exceptions or errors on other Linux distributions. If you have difficulty setting up the GPU, you can opt to use the CPU for AE testing or use the pre-configured, live server given in the getting started guide to go through the steps.
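If you want an additional sanity check from inside the container, the short Python sketch below (not part of the artifact's scripts; the file name `check_gpu.py` is ours) reports whether a GPU is visible by invoking `nvidia-smi`, which is only available when the container was started with `--gpus all` and the host-side setup succeeded.

```python
# check_gpu.py -- optional sanity check, not part of the artifact's scripts.
# Reports whether a GPU is visible from inside the container by invoking
# `nvidia-smi`.
import shutil
import subprocess

def gpu_visible() -> bool:
    if shutil.which("nvidia-smi") is None:
        return False
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

if __name__ == "__main__":
    if gpu_visible():
        print("GPU visible: the AE scripts can be run with --multi_gpu=1.")
    else:
        print("No GPU detected: fall back to the CPU (--multi_gpu=0).")
```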
Follow the instructions below to use our AE evaluation scripts.
After downloading the Docker image, use the following commands to load it (~30 minutes on a laptop for the reduced-size image) on the host machine:
unzip 53.zip
cd 53
docker load -i 53.tar
Then, choose one of the following options, depending on whether you have set up the NVIDIA GPU execution environment.
- Using CPU: Run the following command to create the Docker container and use the CPU for testing:
docker run -it --name comfort pldi2021:comfort /bin/bash
- Using GPU: Run the following command to create the Docker container with GPU support (make sure you have set up the GPU environment - see here):
docker run -it --name comfort --gpus all pldi2021:comfort /bin/bash
After creating the Docker container and entering its bash shell, run the following command to set up the environment variables before using any of the AE scripts:
source /root/.bash_profile
This script also starts a MySQL database daemon needed for program mutation and differential testing.
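If you want to verify that the MySQL daemon is up before running the later steps, the optional sketch below (not one of the AE scripts) attempts a connection with `pymysql`; the credentials shown are placeholders, not the artifact's actual configuration.

```python
# check_mysql.py -- optional sanity check, not one of the AE scripts.
# Verifies that the MySQL daemon started by /root/.bash_profile is reachable.
# The credentials are placeholders; use whatever the artifact's configuration expects.
import pymysql

try:
    conn = pymysql.connect(host="localhost", user="root", password="<password>")
    print("MySQL daemon is up, server version:", conn.get_server_info())
    conn.close()
except pymysql.err.OperationalError as exc:
    print("MySQL is not reachable; did you run `source /root/.bash_profile`?")
    print(exc)
```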
(approximate runtime: 20 minutes for using a GPU, ~4 hours when using a CPU)
We provide the pre-trained GPT-2 JS program generator used in our paper for test program generation.
Use the following command to generate about 512 test programs (defined by `nsamples`) on the CPU (set `--multi_gpu=1` to run on the GPU):
python /root/src/01_evaluate_generator.py --mode=generate --use_nisl_model=1 --multi_gpu=0 --nsamples=512
This step takes around 4 hours to generate 512 test programs on a laptop CPU, and the model loading stage alone may take around 30 minutes. The `--nsamples` parameter controls how many test programs to generate; it must be a multiple of the default batch size of 16 (e.g., 16, 32, 64, 128, 512). We suggest setting `nsamples` to at least 64 for effective differential testing.
All generated test cases are written to the directory `/root/data/generated_data/complete_testcases/`. Note that the number of test cases produced can vary, depending on how many of the generated programs are syntactically valid and contain a JS API.
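For illustration only, the sketch below shows the kind of post-generation filtering described above (syntactic validity plus the presence of a JS API). It is not the artifact's actual filter; both the `esprima` package and the short API list are assumptions we introduce for the example.

```python
# filter_sketch.py -- illustration only, not the artifact's filtering code.
# Keeps a generated program only if it parses as valid JavaScript and
# mentions at least one API of interest. Both the esprima package
# (pip install esprima) and the tiny API list are assumptions.
import pathlib
import esprima

API_KEYWORDS = ["Array", "String", "Math", "JSON", "RegExp"]  # illustrative only

def keep(source: str) -> bool:
    try:
        esprima.parseScript(source)          # syntactic validity check
    except Exception:                        # esprima raises on parse errors
        return False
    return any(api in source for api in API_KEYWORDS)  # crude API-presence check

generated = pathlib.Path("/root/data/generated_data/complete_testcases/")
kept = [p.name for p in generated.glob("*.js") if keep(p.read_text(errors="ignore"))]
print(f"{len(kept)} generated programs pass the sketch filter")
```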
This option involves two steps: (1) fine-tune a GPT-2 model locally, and then (2) use the trained model for test program generation.
(approximate runtime: ~1 hour using a GPU, 5 hours using a CPU)
- Evaluate the GPT-2 program synthesizer by running the following command (set `--multi_gpu=1` to use a GPU for training):
python /root/src/01_evaluate_generator.py --mode=finetune --multi_gpu=0
The script uses a small corpus of 10,000 JS programs randomly selected from our full training corpus to refine a scaled-down, pre-trained GPT-2 model (originally trained on natural language text) on the JS corpus.
We have reduced the size of the corpus so that training takes around 5 hours on a multi-core CPU (~1 hour on a GPU). For our paper, we trained our model on more data (140,000 JS programs rather than 10,000) for longer (~150,000 iterations rather than 1,000). As such, the model trained here produces lower-quality output: it is likely to generate shorter programs and fewer syntactically correct ones.
Training the model can be interrupted after the first training iteration (by pressing Ctrl + C). Once trained, the model does not need to be re-trained.
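The artifact performs fine-tuning through `01_evaluate_generator.py`. Purely as an illustration of what fine-tuning a pre-trained GPT-2 model on a JS corpus involves, here is an independent sketch using the HuggingFace `transformers` library; the corpus file `js_corpus.txt` and all hyperparameters are placeholders and do not reflect the settings used in the paper.

```python
# finetune_sketch.py -- an independent illustration of fine-tuning GPT-2 on a
# JavaScript corpus with HuggingFace transformers. This is NOT the artifact's
# training pipeline; file paths and hyperparameters are placeholders.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments, TextDataset,
                          DataCollatorForLanguageModeling)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # small model, pre-trained on natural language

# js_corpus.txt is a placeholder corpus file of concatenated JS programs.
dataset = TextDataset(tokenizer=tokenizer, file_path="js_corpus.txt", block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-js", num_train_epochs=1,
                         per_device_train_batch_size=2, save_steps=500)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
model.save_pretrained("gpt2-js")
```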
(approximate runtime: 20 minutes for using a GPU, ~4 hours when using a CPU)
[!Important]: To run this script, make sure you have trained a model using `01_evaluate_generator.py` as described in the previous step.
- To use the trained model to generate test programs, run the following command (set `--multi_gpu=1` to use a GPU for inference):
python /root/src/01_evaluate_generator.py --mode=generate --use_nisl_model=0 --multi_gpu=0 --nsamples=512
The `--nsamples` parameter controls how many test programs to generate; it must be a multiple of the default batch size of 16 (e.g., 16, 32, 64).
Note that it takes around 30 minutes to load the model using a laptop CPU.
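Similarly, the sketch below illustrates batched sampling from a fine-tuned GPT-2 model with HuggingFace `transformers`. It is not how `01_evaluate_generator.py` is implemented; the model directory, prompt, and sampling parameters are placeholders chosen only to show how an `nsamples`-style parameter relates to the batch size.

```python
# generate_sketch.py -- illustration only, not 01_evaluate_generator.py.
# Samples JS programs in batches from a (hypothetically) fine-tuned GPT-2
# model; the model directory, prompt and sampling parameters are placeholders.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # tokenizer is unchanged by fine-tuning
model = GPT2LMHeadModel.from_pretrained("gpt2-js")     # placeholder fine-tuned model dir

nsamples, batch_size = 64, 16            # nsamples is a multiple of the batch size
prompt = tokenizer("function ", return_tensors="pt")
for _ in range(nsamples // batch_size):
    samples = model.generate(**prompt, do_sample=True, top_k=40, max_length=256,
                             num_return_sequences=batch_size,
                             pad_token_id=tokenizer.eos_token_id)
    for sample in samples:
        print(tokenizer.decode(sample, skip_special_tokens=True))
```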
(approximate runtime: 40 minutes per fuzzer)
- Use the following command to compute the percentage of generated test programs that pass JSHint (a static JS syntax checker) and the coverage reported by Istanbul for Comfort. This script runs on 1,000 test programs randomly chosen from our full test dataset.
python /root/src/04_coverage_calculate.py --fuzzer=comfort --reporter_dir=/root/data/codeCoverage/coverageReporters
Other fuzzers: Change the value of the `--fuzzer` parameter to `codealchemist`, `deepsmith`, `die`, `fuzzilli`, or `montage` to calculate the passing rate and code coverage of the other fuzzers.
This data corresponds to Figure 8 in our paper. Note that since the test programs are randomly sampled from a smaller dataset, the numbers may differ from those reported in the paper for all fuzzers.
You can use the following command to test the code coverage on 10,000 randomly chosen Comfort-generated test programs for a longer run (12+ hours per fuzzer):
python /root/src/04_coverage_calculate.py --coverage_files=/root/data/codeCoverage/totalFiles/comfort_generate --reporter_dir=/root/data/codeCoverage/coverageReporters
Other fuzzers: Replace the `comfort_generate` directory in the option `--coverage_files=/root/data/codeCoverage/totalFiles/comfort_generate` with `codealchemist_generate`, `deepsmith_generate`, `die_generate`, `fuzzilli_generate`, or `montage_generate` to compute the coverage on the 10,000 test programs generated by the other fuzzers in Figure 8. Once again, the numbers may differ due to random sampling.
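To give a feel for the JSHint passing-rate part of this measurement, the sketch below (not `04_coverage_calculate.py`) runs the `jshint` CLI over a directory of generated programs and counts how many are accepted. It assumes `jshint` is installed and does not compute Istanbul coverage.

```python
# jshint_rate_sketch.py -- illustration only, not 04_coverage_calculate.py.
# Measures the fraction of generated programs accepted by JSHint. Assumes the
# jshint CLI is installed (e.g. `npm install -g jshint`); Istanbul coverage is
# not computed here.
import pathlib
import subprocess

testcases = list(pathlib.Path("/root/data/generated_data/complete_testcases/").glob("*.js"))
passed = sum(
    subprocess.run(["jshint", str(tc)], capture_output=True).returncode == 0
    for tc in testcases
)
print(f"JSHint passing rate: {passed}/{len(testcases)}")
```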
(approximate runtime: 20 minutes)
- Evaluate our ECMAScript-guided test data generator by running the following command:
python /root/src/02_evaluate_mutator.py --input_path=/root/data/generated_data/complete_testcases --save_path=/root/data/mutation_result
Note that our tool can only mutate test programs that contain a JS API. If a test program does not contain a JS API, the script yields the warning "This test case fails to be mutated as it does not contain any API."
When all test cases have been processed, you can see the number of test cases that were mutated successfully. Note that most of the GPT-2-generated test programs will not be mutated, but they are still used in our fuzzing tests and hence are not wasted.
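As a toy illustration only (COMFORT's real mutator is guided by the ECMAScript specification and is considerably more sophisticated), the sketch below derives several variants of a test program by substituting boundary values into the argument list of an API call; the regular expression and value list are ours, not the artifact's.

```python
# mutate_toy.py -- a toy illustration only; COMFORT's real mutator is guided
# by the ECMAScript specification and is far more sophisticated. This sketch
# just replaces the argument list of the first API call with boundary values.
import re

BOUNDARY_VALUES = ["undefined", "null", "NaN", "-0", "2**53", "''"]

def toy_mutations(js_source: str):
    """Yield variants of the first call expression found in the source."""
    call = re.search(r"(\w+(?:\.\w+)*\()([^()]*)(\))", js_source)
    if call is None:
        return  # no API call -- mirrors the "fails to be mutated" warning above
    for value in BOUNDARY_VALUES:
        yield js_source[:call.start(2)] + value + js_source[call.end(2):]

for variant in toy_mutations("Math.max(1, 2);"):
    print(variant)
```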
(approximate runtime: 1 min per 120 test cases)
[!Important]: This step must run after the program mutation step described in the previous section.
- Evaluate our differential fuzzer on ten JS testbeds by running the following command (in our paper, we tested 102 testbeds on a much larger dataset for 200 hours):
python /root/src/03_evaluate_harness.py --testsuite=/root/data/mutation_result/ --clear_classifier=False
During differential testing, inconsistent testing outcomes are printed on the screen. Since we generate multiple test cases from the same test program through mutation, you are likely to get identical inconsistent results across test cases. Our filtering script removes these identical testing outcomes.
Once executed, if buggy behaviour is detected, you will see output similar to the example given below.
.......
The number of deviated test cases that were filtered out by our filtring scheme is: 4
The number of test cases required manual analysis is: 1
Summary of test cases required manual inspect:
=================================================
Test Cases JS Testbed
1.js Jerryscript-7df87b7
=================================================
All differential testing results are saved to `/root/data/mutation_result/log/*.log`. Check the log files for the deviated execution outputs.
Since we only test a relatively small number of test cases (`nsamples` <= 512), it is likely that none of the test cases will trigger a potential bug.
Known Issue: If you get a MySQL connection exception (e.g., `pymysql.err.OperationalError`), make sure you have run the setup script:
source /root/.bash_profile
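For readers who want a concrete picture of the differential-testing loop, the simplified sketch below (not `03_evaluate_harness.py`) runs each mutated test case on several JS engine shells, flags cases where the outputs disagree, and groups identical deviation signatures so duplicates can be filtered. The engine binary paths are placeholders for the shells installed in the container.

```python
# difftest_sketch.py -- a simplified illustration, not 03_evaluate_harness.py.
# Runs each mutated test case on several JS engine shells, flags cases whose
# outputs disagree, and groups identical deviation signatures so duplicates
# can be filtered. The engine binary paths are placeholders.
import pathlib
import subprocess
from collections import defaultdict

ENGINES = {                              # placeholder shell paths
    "jerryscript": "/usr/local/bin/jerry",
    "quickjs": "/usr/local/bin/qjs",
    "v8": "/usr/local/bin/d8",
}

def run(shell: str, testcase: pathlib.Path) -> str:
    try:
        result = subprocess.run([shell, str(testcase)], capture_output=True,
                                text=True, timeout=10)
        return (result.stdout + result.stderr).strip()
    except subprocess.TimeoutExpired:
        return "<timeout>"

groups = defaultdict(list)               # deviation signature -> test cases
for testcase in pathlib.Path("/root/data/mutation_result/").glob("*.js"):
    outputs = {name: run(shell, testcase) for name, shell in ENGINES.items()}
    if len(set(outputs.values())) > 1:   # the engines disagree on this input
        groups[tuple(sorted(outputs.items()))].append(testcase.name)

for signature, cases in groups.items():
    print(f"{len(cases)} test case(s) share one deviation signature, e.g. {cases[0]}")
```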
[!Important]: This step must be run after the differential testing stage.
Run the following command in the Docker container to evaluate our test case reducer:
python /root/src/05_testcase_reducing.py --file_dir=/root/data/interesting_testcases
Note that the test cases stored in the `interesting_testcases` folder are renamed with indices starting from 1.
Known issue: To demonstrate test case reduction, we always include a randomly chosen test case (in addition to the bug-exposing ones, if any) in the `interesting_testcases` folder. This prevents the test case reducer from having nothing to run on if no test case triggered buggy behaviour during differential testing. If the randomly chosen test case does not trigger buggy behaviour, the reducer will return an empty result for it. This workaround is used for the AE only.
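As a rough illustration of what test case reduction does (this is not `05_testcase_reducing.py`), the sketch below greedily drops lines from a test case while a placeholder `still_interesting` predicate continues to hold; in practice, that predicate would re-run differential testing and check that the same deviation is still triggered.

```python
# reduce_sketch.py -- a simplified, line-granularity test-case reducer, not
# the artifact's 05_testcase_reducing.py. `still_interesting` is a placeholder
# predicate: in practice it would re-run differential testing and check that
# the reduced program still triggers the same deviation.
def reduce_lines(lines, still_interesting):
    """Greedily drop lines while the reduced test case stays interesting."""
    reduced = list(lines)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            candidate = reduced[:i] + reduced[i + 1:]
            if candidate and still_interesting(candidate):
                reduced = candidate
                changed = True
                break
    return reduced

# Toy usage: keep only the lines needed for the predicate to hold.
program = ["var a = 1;", "var b = [];", "b.length = 2**32;", "print(a);"]
minimal = reduce_lines(program, lambda ls: any("b.length" in l for l in ls))
print("\n".join(minimal))
```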
(~20+ hours)
For convenience, we have provided the test data generated by other fuzzers. You can still check this document for instructions on using other fuzzers (CodeAlchemist, DeepSmith, Fuzzilli, Montage, DIE) for test program generation and differential testing. This evaluation requires a larger Docker image (80+ GB uncompressed) that can be downloaded from here. The process takes 20+ hours on a laptop/PC.
The Docker image provides a small-scale experiment to showcase how our approach works. Our main results (which required much longer runs: 200 hours per JS testbed on a larger test dataset) can be found in the Bug List section.
In Figure 6 of the submitted manuscript, we attribute the discovered bugs to the general components of JS engines. This grouping is subjective as JS engine implementations do not follow the same structure, and we refer the reviewers to the live server to check the results.
Notes on how to reuse our AE can be found in this document.