-
A Learning-Based Approach to Static Program Slicing, (OOPSLA2024)
- Abstract: Traditional program slicing techniques are crucial for early bug detection and manual/automated debugging of online code snippets. Nevertheless, their inability to handle incomplete code hinders their real-world applicability in such scenarios. To overcome these challenges, we present NS-Slicer, a novel learning-based approach that predicts static program slices for both complete and partial code Our tool leverages a pre-trained language model to exploit its understanding of fine-grained variabl...
- Labels: static analysis, data-flow analysis, code model, code model training, source code model
-
LLMs: Understanding Code Syntax and Semantics for Code Analysis, (arXiv2023)
- Abstract: Large language models~(LLMs) demonstrate significant potential to revolutionize software engineering (SE) by exhibiting outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk control requirements in software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities of LLMs and their limitations for code analysis in SE. We break down the abilities neede...
- Labels: static analysis, data-flow analysis, call graph analysis, data-flow analysis, code model, code model training, source code model, empirical study
-
LLMs: Understanding Code Syntax and Semantics for Code Analysis, (arXiv2023)
- Abstract: Large language models~(LLMs) demonstrate significant potential to revolutionize software engineering (SE) by exhibiting outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk control requirements in software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities of LLMs and their limitations for code analysis in SE. We break down the abilities neede...
- Labels: static analysis, data-flow analysis, call graph analysis, data-flow analysis, code model, code model training, source code model, empirical study
-
Program Slicing in the Era of Large Language Models, (arXiv2024)
- Abstract: Program slicing is a critical technique in software engineering, enabling developers to isolate relevant portions of code for tasks such as bug detection, code comprehension, and debugging. In this study, we investigate the application of large language models (LLMs) to both static and dynamic program slicing, with a focus on Java programs. We evaluate the performance of four state-of-the-art LLMs- GPT-4o, GPT-3.5 Turbo, Llama-2, and Gemma-7B leveraging advanced prompting techniques, including f...
- Labels: static analysis, data-flow analysis
-
Programl: A graph-based program representation for data flow analysis and compiler optimizations, (ICML2021)
- Abstract: Machine learning (ML) is increasingly seen as a viable approach for building compiler optimization heuristics, but many ML methods cannot replicate even the simplest of the data flow analyses that are critical to making good optimization decisions. We posit that if ML cannot do that, then it is insufficiently able to reason about programs. We formulate data flow analyses as supervised learning tasks and introduce a large open dataset of programs and their corresponding labels from several analys...
- Labels: static analysis, data-flow analysis, program optimization, code model, code model training, IR code model
-
Revealing the Unseen: AI Chain on LLMs for Predicting Implicit Dataflows to Generate Dataflow Graphs in Dynamically Typed Code, (TOSEM2024)
- Abstract: Dataflow graphs (DFGs) capture definitions (defs) and uses across program blocks, which is a fundamental program representation for program analysis, testing and maintenance. However, dynamically typed programming languages like Python present implicit dataflow issues that make it challenging to determine def-use flow information at compile time. Static analysis methods like Soot and WALA are inadequate for handling these issues, and manually enumerating comprehensive heuristic rules is impracti...
- Labels: static analysis, data-flow analysis
-
Sanitizing Large Language Models in Bug Detection with Data-Flow, (EMNLP2024)
- Abstract: Large language models (LLMs) show potential in code reasoning tasks, facilitating the customization of detecting bugs in software development. However, the hallucination effect can significantly compromise the reliability of bug reports. This work formulates a new schema of bug detection and presents a novel sanitization technique that detects false positives for hallucination mitigation. Our key idea is to enforce LLMs to emit data-flow paths in few-shot chain-of-thought prompting and validate ...
- Labels: static analysis, bug detection, data-flow analysis
-
Unveiling Code Pre-Trained Models: Investigating Syntax and Semantics Capacities, (TOSEM2024)
- Abstract: Code models have made significant advancements in code intelligence by encoding knowledge about programming languages. While previous studies have explored the capabilities of these models in learning code syntax, there has been limited investigation on their ability to understand code semantics. Additionally, existing analyses assume that the number of edges between nodes at the abstract syntax tree (AST) is related to syntax distance, and also often require transforming the high-dimension...
- Labels: static analysis, pointer analysis, data-flow analysis, empirical study