Automated Conformance Testing for JavaScript Engines via Deep Compiler Fuzzing: Artifact

We provide a Docker image to support the artifact evaluation of our PLDI 2021 paper on JavaScript conformance testing (COMFORT).

Our Docker image contains reduced-size data sets for evaluating our GPT-2-based test program generator, test case mutation and reduction, and differential tester. The full dataset is quite large (>100 GB uncompressed), and we are working on a method for sharing it with the community. This directory therefore contains minimal working examples that can be evaluated in a reasonable amount of time. All of our code and data will be open-sourced upon publication and have been developed with extensibility as a primary goal.

Step-by-Step Instructions

Disclaimer: Although we have worked hard to ensure our AE scripts are robust, our tool remains a research prototype. It can still have glitches when used in complex, real-life settings. If you discover any bugs, please raise an issue, describing how you ran the program and what problem you encountered. We will get back to you ASAP. Thank you.

★ Main Results

The main results of the paper are the lists of bugs exposed by COMFORT-generated test cases and by other competing methods.

★ Docker Image

We package our artifact in a Docker image so that it runs "out of the box". A reduced-size Docker image can be downloaded from here. Our Docker image was tested on host machines running Ubuntu 18.04 and Windows 10.

★ Artifact Contents

The Docker image contains the following scripts for evaluation.

  • 01_evaluate_generator.py: A demonstration of training a GPT-2 model to generate JS programs.
  • 02_evaluate_mutator.py: A demonstration of test program mutation. This demonstrates how we mutate the test JS programs generated by the previous demonstration.
  • 03_evaluate_harness.py: A demonstration of our differential testing approach. The results from the prior steps are used to perform differential testing on one JS testbed (engine).
  • 04_coverage_calculate.py: A demonstration of the code coverage statistics, showing the quality of the test cases generated by different fuzzers.
  • 05_testcase_reducing.py: A demonstration of test case reduction.

P1 - Preliminary: Configure the GPU Running Environment on the Host Machine (Optional)

If you wish to use an NVIDIA GPU on the host machine (running Ubuntu 18.04) to execute the AE, please follow the instructions below to set up the GPU execution environment:

  • Copy this bash script and run the following command in the host environment with sudo permission:

    bash nvidia-container-runtime-script.sh

    Note that this step may break the existing GPU and docker setup of the host machine.

  • Next, test if the GPU running environment is successfully configured:

    docker run --help | grep -i gpus

    You should be able to see the GPU information if it is successfully configured; an optional further check is sketched below.
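    As an additional, optional sanity check (a sketch only: it assumes the NVIDIA container runtime was installed by the script above and that a CUDA base image is available to pull; replace the image tag with any CUDA image you have access to), you can try running nvidia-smi inside a container:

    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

    If the GPU setup is correct, this should print the GPUs visible inside the container.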

Please note that the above steps for configuring the GPU environment were only tested on a host machine running Ubuntu 18.04. They may throw exceptions or errors on other Linux distributions. If you have difficulties in setting up the GPU, you can opt to use the CPU for AE testing or use the pre-configured, live server given in the getting started guide to go through the steps.

AE Evaluation Steps

Follow the instructions below to use our AE evaluation scripts.

1. Setup

1.1 Load the Docker Image

After downloading the Docker image, use the following commands to load it on the host machine (~30 minutes on a laptop for the reduced-size image):

unzip 53.zip
cd 53
docker load -i 53.tar

Then, choose one of the following options depending on whether you have set up the NVIDIA GPU execution environment.

  • Using CPU: Use the following command to start the Docker container and use the CPU for testing:

    docker run -it --name comfort pldi2021:comfort /bin/bash

  • Using GPU: Run the following command to start the Docker container with GPU support (make sure you have set up the GPU environment - see here):

    docker run -it --name comfort --gpus all pldi2021:comfort /bin/bash

1.2 Set up environment variables

After importing the docker container and getting into bash in the container, make sure you run the command below to set up the environment variables before using any of the AE scripts:

source /root/.bash_profile

This script also starts a MySQL database daemon needed for program mutation and differential testing.
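As a quick, optional sanity check (a sketch only: we assume the daemon's process name is mysqld, which may differ in the container), you can confirm that the database daemon is running before moving on:

# list any running MySQL server process (process name assumed to be mysqld)
pgrep -a mysqld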

2. Evaluation of Our JS Program Generator

2.1 Program generation using our pre-trained model

(approximate runtime: ~20 minutes on a GPU, ~4 hours on a CPU)

We provide a pre-trained GPT-2 JS program generator used by our paper for test program generation.

Use the following command to generate about 512 (defined by nsamples) test programs on the CPU (set --multi_gpu=1 to run on the GPU):

python /root/src/01_evaluate_generator.py --mode=generate --use_nisl_model=1 --multi_gpu=0 --nsamples=512

This step takes around 4 hours to generate 512 test programs using a laptop CPU, and the model loading stage may take around 30 minutes. The --nsamples parameter controls how many test programs to generate; it must be a multiple of the default batch size of 16 (e.g., 16, 32, 64, 128, 512, etc.). We suggest setting nsamples to at least 64 for effective differential testing.
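For example, to generate the suggested minimum of 64 test programs on a GPU instead (this simply flips the --multi_gpu flag and lowers --nsamples; it assumes the container was started with --gpus all as described in Section 1.1):

python /root/src/01_evaluate_generator.py --mode=generate --use_nisl_model=1 --multi_gpu=1 --nsamples=64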

All generated test cases are written to the directory /root/data/generated_data/complete_testcases/. Note that the number of test cases generated can vary depending on how many generated programs are syntactically valid and whether they contain a JS API.
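To check how many test cases were actually produced (a simple count of the entries in the output directory; the file layout inside it is not described here), you can run:

ls /root/data/generated_data/complete_testcases/ | wc -l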

2.2 (Optional) Program generation using a locally trained model

This option involves two steps: (1) first fine-tune a GPT-2 model locally and then (2) use the trained model for test program generation.

2.2.1 Train the JS Program Generator

(approximate runtime: ~1 hour on a GPU, ~5 hours on a CPU)

  • Evaluate the GPT-2 program synthesizer by running the following command (set --multi_gpu=1 to use a GPU for training):

    python /root/src/01_evaluate_generator.py --mode=finetune --multi_gpu=0

The program uses a small JS corpus of 10,000 JS programs randomly selected from our entire training corpus to fine-tune a scaled-down, pre-trained GPT-2 model (which was trained on natural language texts) on the JS corpus.

We have reduced the size of the corpus so that training takes around 5 hours on a multi-core CPU (~1 hour on a GPU). For our paper, we trained our model on more data (140,000 JS programs rather than 10,000) and for longer (~150,000 iterations rather than 1,000). As a result, the output quality of this model is lower: it is likely to produce shorter programs and fewer syntactically correct ones.

Training the model can be interrupted after the first training iteration (by pressing Ctrl + C). Once trained, the model does not need to be re-trained.

2.2.2 Program generation

(approximate runtime: ~20 minutes on a GPU, ~4 hours on a CPU)

!Important: To run this script, make sure you have trained a model using 01_evaluate_generator.py as described in the previous step.

  • To use the trained model to generate the test programs, run the following command (set --multi_gpu=1 to use a GPU for inference):

    python /root/src/01_evaluate_generator.py --mode=generate --use_nisl_model=0 --multi_gpu=0 --nsamples=512

The --nsamples parameter controls how many test programs to generate; it must be a multiple of the default batch size of 16 (e.g., 16, 32, 64, etc.).
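For a quick smoke test of the locally trained model, you can generate a single batch (16 programs, the smallest value --nsamples accepts) on the CPU:

python /root/src/01_evaluate_generator.py --mode=generate --use_nisl_model=0 --multi_gpu=0 --nsamples=16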

Note that it takes around 30 minutes to load the model using a laptop CPU.

3. Evaluation of Test Program Coverage

(approximate runtime: 40 minutes per fuzzer)

  • Use the following command to compute the percentage of generated test programs that pass JSHint (a static JS syntax checker) and the code coverage reported by Istanbul for Comfort. This script runs on 1,000 randomly chosen test programs from our full test dataset.

    python /root/src/04_coverage_calculate.py --fuzzer=comfort --reporter_dir=/root/data/codeCoverage/coverageReporters

Other fuzzers: Change the value of the --fuzzer parameter to codealchemist, deepsmith, die, fuzzilli or montage to calculate the passing rate and code coverage of the other fuzzers, as shown below.
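For example, to compute the passing rate and code coverage for DeepSmith:

python /root/src/04_coverage_calculate.py --fuzzer=deepsmith --reporter_dir=/root/data/codeCoverage/coverageReporters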

This data corresponds to Figure 8 in our paper. Note that since the test programs are randomly sampled from a smaller dataset, the numbers could differ from those reported in the paper for all fuzzers.

Optional: Code coverage on larger datasets

You can use the following command to measure the code coverage on 10,000 randomly chosen Comfort-generated test programs for a longer run (12+ hours per fuzzer):

python /root/src/04_coverage_calculate.py --coverage_files=/root/data/codeCoverage/totalFiles/comfort_generate --reporter_dir=/root/data/codeCoverage/coverageReporters

Other fuzzers: Replace the comfort_generate directory in the option --coverage_files=/root/data/codeCoverage/totalFiles/comfort_generate with codealchemist_generate, deepsmith_generate, die_generate, fuzzilli_generate, or montage_generate to compute the coverage on the 10,000 test programs generated by the other fuzzers in Figure 8 (see the example below). Once again, the numbers may differ due to random sampling.
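For example, to compute the coverage of the Fuzzilli-generated test programs:

python /root/src/04_coverage_calculate.py --coverage_files=/root/data/codeCoverage/totalFiles/fuzzilli_generate --reporter_dir=/root/data/codeCoverage/coverageReporters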

4. Demonstration of Test Program Mutation

(approximate runtime: 20 minutes)

  • Evaluate our ECMAScript-guided test data generator by running the following command:

python /root/src/02_evaluate_mutator.py --input_path=/root/data/generated_data/complete_testcases --save_path=/root/data/mutation_result

Note that our tool can only mutate test programs that use a JS API. If a test program does not contain a JS API, the script will print the warning This test case fails to be mutated as it does not contain any API. When all test cases have been processed, you can see the number of test cases that were mutated successfully. Note that most of the GPT-2-generated test programs will not be mutated, but they are still used in our fuzzing tests and hence are not wasted.
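As a simple sanity check after mutation (a sketch only: we just count the entries in the output directory given by --save_path above; its internal layout is not documented here), you can run:

ls /root/data/mutation_result/ | wc -l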

5. Demonstration of Differential Testing

(approximate runtime: 1 min per 120 test cases)

[!Important]: This step must be run after the program mutation step described in the previous section.

  • Evaluate our differential fuzzer on ten JS testbeds by running the following command (in our paper, we tested 102 testbeds on a much larger dataset for 200 hours):

    python /root/src/03_evaluate_harness.py --testsuite=/root/data/mutation_result/ --clear_classifier=False

During differential testing, inconsistent testing outcomes are printed on the screen. Since we generate multiple test cases from the same test program through mutation, you are likely to get identical inconsistent results across test cases. We apply a filtering script to filter out these identical testing outcomes.

Once executed and if buggy behaviour is detected, you will see an output similar to the one given below.

.......

The number of deviated test cases that were filtered out by our filtring scheme is: 4

The number of test cases required manual analysis is: 1

Summary of test cases required manual inspect:

=================================================

Test Cases     JS Testbed

1.js    Jerryscript-7df87b7

=================================================


All differential testing results are saved to /root/data/mutation_result/log/*.log. Check the log file for deviated execution outputs


Since we only test a relatively small number of test cases (nsamples <= 512), it is likely that none of the test cases triggers a potential bug.
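To inspect the deviated execution outputs recorded by this step (assuming at least one log file was produced in the directory mentioned above), you can list and open the log files:

ls /root/data/mutation_result/log/
# open a specific log file, e.g.:
less /root/data/mutation_result/log/<name>.log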

Known Issue: If you get a MySQL connection error (e.g., pymysql.err.OperationalError), make sure you have run the setup script:

source /root/.bash_profile

6. Demonstration of Test Case Reduction (Optional)

[!important]: This step must be run after the differential testing stage.

Run the following command in our Docker container to evaluate our test case reducer:

python /root/src/05_testcase_reducing.py --file_dir=/root/data/interesting_testcases

Note that the test cases stored in the interesting_testcases folder are renamed with indices starting from 1.
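To see which test cases will be reduced (they are simply the renamed files in the folder passed via --file_dir above), you can list the directory:

ls /root/data/interesting_testcases/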

Known issue: To demonstrate test case reduction, we always include a randomly chosen test case (in addition to the bug-exposing ones, if any) in the interesting_testcases folder. This avoids being unable to run the test case reducer when no test case triggered buggy behaviour during differential testing. If the randomly chosen test case does not trigger buggy behaviour, the test case reducer will return an empty result for it. This workaround is used for AE only.

7. Testing Other Fuzzers (Optional)

(~20+ hours)

For convenience, we have provided the test data generated by other fuzzers. You can still check this document for how to use the other fuzzers (CodeAlchemist, DeepSmith, Fuzzilli, Montage, DIE) for test program generation and differential testing. This evaluation requires a larger Docker image (80+ GB uncompressed) that can be downloaded from here. The process takes 20+ hours on a laptop/PC.

★ Remarks

The Docker image provides a small-scale experiment to showcase how our work operates. Our main results (which take much longer to reproduce: 200 hours per JS testbed on a larger test dataset) can be found in the Bug List section.

In Figure 6 of the submitted manuscript, we attribute the discovered bugs to the general components of JS engines. This grouping is subjective as JS engine implementations do not follow the same structure, and we refer the reviewers to the live server to check the results.

Reusing Our AE

Notes on how to reuse our AE can be found in this document.