In this tutorial, we will develop a non-trivial plugin that will verify that in a program a certain sequence of calls doesn't happen. We will develop the analysis in both OCaml and Python. You can choose either path or even both paths.
After finishing this tutorial you will learn how to use BAP basic capabilities and how to extend BAP using our plugin system. You will learn how to examine programs, by looking into their intermediate representations (IR) or disassembly. You will also learn how to run binaries in the emulated environment.
- how to build and install plugins
- how to use command-line arguments in a plugin
- how to work with IR and different graph representations.
- how to interact with bap from the Python universe
- how to use IR visitors
- how to create graphs
To complete the OCaml path of the tutorial you will need an OCaml
development environment, bap, and all the necessary libraries. The easiest option is to install BAP from opam. But if you don't want to spend time on that, don't worry. We have prepared a Vagrantfile, that will set up your virtual machine using Vagrant. If you have vagrant
installed on your machine the following will set up and prepare the necessary development environment.
# if you don't want to install bap from opam yourself
wget http://tiny.cc/Vagrantfile
vagrant up
vagrant ssh
The last command will ssh you into the installed virtual machine, that has BAP and all the necessary libraries readily available. It also has Emacs preinstalled and configured to work with OCaml.
If you're going to take only the Python part of the tutorial and do not want to install any virtual machines, then you can use BAP's binary distribution. If you're using some Debian derivative, then the following will install BAP on your system.
wget https://github.com/BinaryAnalysisPlatform/bap/releases/download/v2.3.0/{bap,libbap,libbap-dev}_2.3.0.deb
sudo dpkg -i {bap,libbap,libbap-dev}_2.3.0.deb
sudo install pip
sudo pip install bap networkx
For OCaml, create a new folder somewhere in your home folder, and enter into it, e.g.,
mkdir check_path
cd check_path
Then use Emacs to create an empty file
emacs check_path.ml
Note: it is mandatory to develop each BAP plugin in a separate folder.
After Emacs has started, use M-x merlin-use <RET> bap
command to tell
Merlin that you're going to use the bap
library. See our wiki for more
instructions about Merlin.
If you're developing in Python then do the same but create a file with
the '.py
extension, e.g.,
emacs check_path.py
We will need some binary to play with. Especially for this
tutorial, we created a binary that is small, but still non-trivial. You can get it using wget
wget tiny.cc/bap-echo -O exe
If you're running on an x86 machine you can run the binary by
chmod a+x exe
./exe
The binary will parse and print some verses from Shakespeare. It will also echo all its arguments.
You can also run the binary using bap
. The binary will not be
run on an actual CPU, but emulated using the Primus Framework.
To output the executable in the default assembler use -dasm
option, e.g.,
bap ./exe -dasm
ProTip: if you prefer the intel syntax use --llvm-x86-syntax=intel
.
To get the IR just use the -d
option, e.g.,
bap ./exe -d
You can also use -dcfg
to get the control-flow graphs, and
-dcallgraph
to dump the program call graph. To get the full list of
supported formats for the project
data structure use the
bap list formats -f project
command, e.g.,
$ bap list formats -f project
Bap.Std.Project:
symbols (1.0.0) print symbol table
ogre (1.0.0) print the file specification in the OGRE format
marshal (2.0.0) OCaml standard marshaling format
knowledge (1.0.0) dumps the knowledge base
graph (1.0.0) print unlabeled CFG for each procedure
cfg (1.0.0) print rich CFG for each procedure
callgraph (1.0.0) print program callgraph in DOT format
bir (1.0.0) print program in IR
bil.sexp (1.0.0) print BIL instructions in Sexp format
bil.adt (1.0.0) print BIL instructions in ADT format
bil (1.0.0) print BIL instructions
asm.sexp (1.0.0) print assembly instructions as it was decoded
asm.decoded (1.0.0) print assembly instructions as it was decoded
asm.adt (1.0.0) print assembly instruction endcoded in ADT format
asm (1.0.0) print assembly instructions
adt (1.0.0) print program IR in ADT format
For more options see bap --help
or man bap
. To get information about a plugin
named <PLUGIN
use the bap --<PLUGIN>-help>
option. e.g., for the optimization
plugin, bap --optimization-help
.
ProTip: if you have a GUI setup, then you can use xdot
for interactive
visualization of control flow graphs:
bap exe -dcfg --print-symbol=main | xdot -
The purpose of our analysis is to verify that for all execution paths
in a program, a call to the function f
is not followed by a call to
the function g
, where f
and g
are specified by a user. We can
express this property concisely using Linear Temporal Logic:
f -> G not(g)
where f
and g
are propositions that are true if a call to a
the corresponding function occurred, and G x
is the LTL operator that
requires the predicate x
to be true on the entire subsequent path.
We will write a policy checker, that will verify that this property holds for all execution paths, or will provide a counter-example, proving the opposite. The checker will be static and sound (wrt to the soundness of the CFG).
Now, let's figure out how we will prove this property:
-
If there are no calls to
f
or tog
or to both, then the verification property is trivially true, i.e.,not f \/ not g
This property has a strong "must" modality. Since the call indeed can't happen, so the propertymust
hold for all paths. -
If
f
callsg
, either directly or by calling a function that callsg
, thenf
occurs beforeg
. The call tog
still may not occur, as we ignore path constraints, so this gives us the may modality. We will express this ascalls(f,g)
. -
If there is a subroutine that has two calls that lead to
f
andg
correspondingly, and these calls are reachable in the control flow graph of the subroutine, theng
may be called afterf
with the may modality. (Again, we do not consider the feasibility of the path constraint). More formally, this property can be expressed as:sites(p,q) := calls(p,f) /\ calls(q,g) /\ reaches(p,q)
where
reaches(x,y)
is true if there exists a control flow graph with a path betweenx
andy
.
To summarize, to prove that f -> G not(g)
holds for the static model of our program, we need to prove that the following doesn't hold
f /\ g /\ (calls(f,g) \/ ∃p∃q, sites(p,q))
We will start with several open
and include
statements that will
prepare our development context. Add the following to the top of the
check_path.ml
file.
open Core_kernel
open Poly
open Bap.Std
open Graphlib.Std
open Format
include Self()
After you saved the file with C-x C-s
, the (No errors)
message
should appear in the mini-buffer. If it's not true, then make sure
that you have set up Merlin and didn't make any errors in your code.
We will create a Cmdline
module with the definition of our command line interface. It is not required to put the code in a separate
module, but it looks cleaner and makes it easier for a reader to
identify that this particular section of code deals with the command
line interface
let main src dst proj = ()
module Cmdline = struct
open Config
let src = param string "src" ~doc:"Name of the source function"
let dst = param string "dst" ~doc:"Name of the sink function"
let () = when_ready (fun {get=(!!)} ->
Project.register_pass' (main !!src !!dst))
let () = manpage [
`S "DESCRIPTION";
`P
"Checks whether there is an execution path that contains a call
to the $(v,src) function before a call to the $(v,dst)
parameter. Outputs a possible (shortest) path that can
lead to the policy violation."
]
end
The code above will specify that our plugin has two command line parameters,
one is named src
and another dst
, both having type
string
. It will also register the main
function as a pass in the
analysis framework. The following stanza probably needs some
explanations:
let () = when_ready (fun {get=(!!)} ->
Project.register_pass' (main !!src !!dst))
The when_ready
function takes one argument, that is a function that
will be called as soon as system configuration is processed and
command-line arguments are parsed. The function will take a record,
that has only one field get
, that is a function on itself, that will
take a value of type 'a param
and will extract a value of type 'a
from it. The {get=(!!)}
syntax, binds the get
function to the
(!!)
unary operator that can be used nicely to extract values from
parameters.
To summarize, a user specifies the command-line interface using
the param
function. Then the Config.when_ready
function is used to
bind the get
function, which is a key that will unlock the parameters,
and extract the values that are written there by the command-line
parser. The interface is described thoroughly in the reference
documentation.
The following three commands will build, install, and run our analysis:
bapbuild check_path.plugin
bapbundle install check_path.plugin
bap ./exe --check-path-src=malloc --check-path-dst=free --pass=check-path
The bapbuild check_path.plugin
command will compile the check_path.plugin
from the check_path.ml
. The bapbundle install check_path.plugin
command will install the plugin into the place, where BAP will
automatically load (but not run) it.
The last command will run bap
on a binary and instruct it to process
the binary with a pass named check-path
. By default, a pass name has
the same name as the plugin name, except that all underscores are
substituted with dashes.
The --check-path-src=malloc
will pass malloc
as an argument to the
src
parameter. Note, that to pass an argument to a plugin
parameter you need to prefix the parameter with the plugin name.
When developing, it is convenient to run all three commands at once, with one keystroke. Fortunately, Emacs allows us to do this. To compile for the first time, type
M-x compile <RET> bapbuild check_path.plugin && bapbundle install check_path.plugin && bap ./exe --check-path-src=malloc --check-path-dst=free --pass=check-path
This will run all three commands. You can also create a Makefile, or store this command in some build script.
Whenever you need to recompile, just hit C-c C-c
and Emacs will
recompile using the last compilation command.
See the Emacs manual for more information about compilation.
We will first check the trivial proof, i.e., when one of the functions, or both, are not present. For that we will write a function, that will translate a name of a function, as specified by a user, to the term identifier of this function.
let find_sub prog name =
Term.enum sub_t prog |>
Seq.find_map ~f:(fun s ->
Option.some_if (Sub.name s = name) (Term.tid s))
Now we can extend the main
function:
let main src dst proj =
let prog = Project.program proj in
match Option.both (find_sub prog src) (find_sub prog dst) with
| None -> printf "satisfied (trivially)\n"
| Some (src,dst) -> printf "not implemented\n"
We will now start writing a verify
function, that will try to prove
that the relation doesn't hold by providing a concrete
counter-example. The verify
function will take the program in the
Intermediate Representation (IR) and two term identifiers, one for the source function
and another for the destination function.
The function will try to find a path in the program call graph, that leads from the source function to the destination function. If there is no such path then for each control flow graph we will try to find a path between a callsite that reaches the source function and a callsite that reaches the destination function.
We will work with two different graph representations, one for the call graph, and another for the control flow graph, let's introduce the following abbreviations:
module CG = Graphs.Callgraph
module CFG = Graphs.Tid
Our verify
function return type will be proof option
, i.e., it
will return Some proof
if the proof was found, otherwise an absence of
a proof is the proof that the relation will hold unless there is a not yet
discovered indirect call.
Let's define the proof
type as a variant type with two branches that
correspond to the two cases in our proof strategy:
type proof =
| Calls of CG.edge path
| Sites of CFG.edge path
We will use the path
data structure from the Graphlib library, as a
particular representation of the proof term. Since we would like to output
these paths, we will start with the implementation of two pretty-printing functions for both
variants. Since they share the same polymorphic data
structure, we will first write a polymorphic function that will print
all types of paths, and then we will specialize this function to our
two concrete types:
let pp_path get_src get_dst ppf path =
let print_node n = fprintf ppf "%s" (Tid.name n) in
print_node (get_src (Path.start path));
Seq.iter (Path.edges path) ~f:(fun e ->
pp_print_string ppf " -> ";
print_node (get_dst e))
let pp_calls = pp_path CG.Edge.src CG.Edge.dst
let pp_sites = pp_path CFG.Edge.src CFG.Edge.dst
The pp_path
function is parametrized with the get_src
and
get_dst
functions that will extract, correspondingly, the source node and
the destination node of an edge. Then it will print the source of the first
edge, and iterate over all edges and print their
destinations. Recall, that an edge consists of two endpoints and a
label (we will ignore the label for now). Thus a path is a
sequence of m
edges, connecting m+1
nodes, e.g.,
(s, n0); (n0, n1); ...; (sm, d)
/\ /\ /\
|| || ||
e0 e1 em
Since we want our output to be concise, we will print only m+1
nodes, i.e., in our case s,n0,n1,...,d
.
Now, let's update the main
function:
let main src dst proj =
let prog = Project.program proj in
match Option.both (find_sub prog src) (find_sub prog dst) with
| None -> printf "satisfied (trivially)@\n"
| Some (src,dst) -> match verify src dst prog with
| None -> printf "satisfied (no counter-example was found)@\n"
| Some p ->
printf "unsatisfied by ";
match p with
| Calls p -> printf "calls via %a@\n" pp_calls p
| Sites p -> printf "callsites via %a@\n" pp_sites p
We will also provide a stub for the verify
function, so that we can
compile and run our program (and keep Merlin happy):
let verify src dst prog : proof option = None
First of all, let's try to prove, that there is a path in the program call graph from the source function to the destination function. It would be even easier than the trivial proof:
let verify src dst prog : proof option =
let cg = Program.to_graph prog in
match Graphlib.shortest_path (module CG) cg src dst with
| Some path -> Some (Calls path)
| None -> None
Proving that for all control flow graphs there is no callsite to the destination function that is reachable from a callsite to the source, would be a little bit harder. We need to enumerate all subroutines, obtain their control flow graphs, and then check the connectivity of each pair of calls of the form: call to the destination, call to the source function.
We will address these problems using the bottom-up approach. We will
start with the supporting code, that will enumerate all interesting
callsites from a subroutine term. For that, we need to identify whether
a callsite, that is a call term, may lead to an invocation of a target
function. That would be the reaches
predicate:
let reaches cg callee target =
Graphlib.is_reachable (module CG) cg callee target
Now we are ready to write the callsites
function, which will take the
call graph cg
, the target
function, and the subroutine term sub
,
and return a sequence of calls that has a destination function, that
reaches target
in the call graph.
let callsites cg target sub =
Term.enum blk_t sub |>
Seq.concat_map ~f:(fun blk ->
Term.enum jmp_t blk |> Seq.filter_map ~f:(fun j ->
match Jmp.kind j with
| Goto _ | Ret _ | Int (_,_) -> None
| Call dst -> match Call.target dst with
| Direct tid when reaches cg tid target ->
Some (Term.tid blk)
| _ -> None))
In our implementation, we are ignoring local jumps, as well as return statements and CPU interrupts. We also ignore indirect jumps and calls, expecting that it is the task of another analysis to make all possible control flow explicit in the graph.
Now we have all the necessary tools to finish our analysis, we will
update the verify
function:
let verify src dst prog : proof option =
let cg = Program.to_graph prog in
match Graphlib.shortest_path (module CG) cg src dst with
| Some path -> Some (Calls path)
| None ->
Term.enum sub_t prog |> Seq.find_map ~f:(fun sub ->
let g = Sub.to_graph sub in
Seq.find_map (callsites cg src sub) ~f:(fun sc ->
Seq.find_map (callsites cg dst sub) ~f:(fun dc ->
if Tid.equal sc dc then None
else Graphlib.shortest_path (module CFG) g sc dc))) |>
Option.map ~f:(fun p -> Sites p)
That's it, for each subroutine in the program, we generate the control flow graph, then we iterate over all callsites that reach the source function, and over all callsites that reach the destination function, and check that if these two callsites are not equal, then there should not be a path between them, and if there is, then it is the proof of the policy violation. We added the equality test, because if the same callsite (and we denote a callsite by the basic block that hosts the call) reaches both the destination and the source, then it means that we have two distinct calls in the basic block that are mutually exclusive by the invariant of a control flow graph, or that we have one call where the destination function is invoked strictly before the source function. The only other case is ruled out by the first phase of our analysis, that proved that there is no path between the source function and the destination function in the call graph.
The Python implementation will follow closely the OCaml implementation, so it is recommended to read the latter, to get some insights. The main difference is that in Python we need to build graphs manually. It is quite possible, that we will introduce the code for graph building in our next release of Python bindings, but so far (as of version 1.3) we need to build them manually. Anyway, building a graph is an excellent showcase of how to work with the complex IR structures. But we will start from the easy stuff first.
Unfortunately, currently, we do not provide an interface for writing
real plugins in Python and so far, the only way to use BAP from Python
is to run the bap
executable and to parse its output. Fortunately,
we provide some supporting code for this, so running bap
is as easy
as:
import bap
proj = bap.run('./echo')
We recommend running bap in some interactive REPL, such as IPython, that makes it easy to explore the data structures that are returned from the bap module functions.
We will use the standard argparse
module for argument parsing, and
the Networkx library to work with graphs. So let's start with writing
some boilerplate code:
import bap
import networkx as nx
import argparse
def verify(prog, src_name, dst_name):
pass
description = """
Verifies the safety property that the DST function is never called
after the SRC function is called
"""
def main():
parser = argparse.ArgumentParser(description=description)
parser.add_argument('filename', help='target filename')
parser.add_argument('-s', '--src', required=True, help='the source function')
parser.add_argument('-d', '--dst', required=True, help='the sink function')
args = parser.parse_args()
proj = bap.run(args.filename)
result = verify(proj.program, args.src, args.dst)
if result:
print('unsatisfied ({} are reachable via {})'.format(str(result[0]), str(result[1])))
else:
print('satisfied')
if __name__ == '__main__':
main()
We specify one positional argument for the target binary, as well as
two mandatory arguments, the names of the source and destination
functions. Our verification function is expected to return None
, if
no counter-example was found, or a tuple if we found a proof. The
first element of the tuple represents the kind of the proof, i.e.,
whether we found a relation in a call graph or a control flow
graph, and the second element is the path itself.
BAP represents graphs using [Algebraic Data Types][6]. It is easier
to work with ADT using the Visitor pattern. The idea is simple. A
complex ADT, like the IR in our case, is a tree with many different
kinds of branches and leaves. The base visitor class will traverse the
tree, and every time an element of type Xxx
is entered, a method
named enter_Xxx
will be called (if it exists). For example, when a
basic block is entered, the enter_Blk
method will be invoked with
the particular instance of the block.
We will implement GraphsBuilder
that will descent
through the program IR and build two kinds of graphs simultaneously -
the program callgraph, and a list of control flow graphs (for each
subroutine). Our builder state (other than graphs themselves) will include
the current subroutine and the current basic block:
class GraphsBuilder(bap.adt.Visitor):
def __init__(self):
self.callgraph = nx.DiGraph()
self.cfgs = []
self.sub = None
self.blk = None
Every time we enter into a new subroutine term we create an empty digraph and append it to the list of control flow graphs. We also add a node to the call graph and label this node with the reference to the newly created control flow graph - this will make it easier for us to get the CFG of a function, the functionality that we will need later. We also update the currently processed subroutine (we do not really store the subroutine number, as we only interested in the term identifier of the subroutine).
def enter_Sub(self, sub):
cfg = nx.DiGraph()
self.callgraph.add_node(sub.id.number, name=sub.name, sub=sub, cfg=cfg)
self.sub = sub.id.number
self.cfgs.append(cfg)
When we enter a basic block, we push its term identifier as a node to the currently built control flow graph and update the current block number:
def enter_Blk(self, blk):
self.cfgs[-1].add_node(blk.id.number, blk=blk)
self.blk = blk.id.number
When we see a call we need to add an edge to the call graph and an edge to the current control flow graph to represent the fallthrough edge (if the call is not marked as a non-return call).
def enter_Call(self, jmp):
callee = direct(jmp.target[0])
if callee:
fall = direct(jmp.target[1]) if len(jmp.target) == 2 else None
if fall:
self.cfgs[-1].add_edge(self.blk, fall.number, jmp=jmp)
Note, that we are ignoring indirect calls, and using the following helper function to extract the destination of a direct jump:
def direct(jmp):
return jmp.arg if jmp is not None and jmp.constr == 'Direct' else None
Whenever we enter the Goto
statement, that represents a transfer of
the control flow between two basic blocks in a function, we just add a
new edge, if this transfer is explicit (i.e., direct)
def enter_Goto(self, jmp):
dst = direct(jmp.target)
if dst:
self.cfgs[-1].add_edge(self.blk, dst.number, jmp=jmp)
Finally, when a CPU exception occurs, we just add a fallthrough edge to the currently built control flow graph
def enter_Exn(self, exn):
fall = exn.target[1]
if fall:
self.cfgs[-1].add_edge(self.blk, fall.number, exn=exn)
As in the OCaml version, we will collect all callsites that reach targets in which we are interested. The implementation is rather straight-forward, whenever we see a call, we look at the call destination, and if it is known and reaches either the source function, or the destination function, we add it to the corresponding list.
class CallsitesCollector(bap.adt.Visitor):
def __init__(self, callgraph, src, dst):
self.callgraph = callgraph
self.caller = None
self.src = src
self.dst = dst
self.srcs = []
self.dsts = []
def clear(self):
self.srcs = []
self.dsts = []
def enter_Blk(self, blk):
self.caller = blk.id.number
def enter_Call(self,jmp):
callee = direct(jmp.target[0])
if callee:
if nx.has_path(self.callgraph, callee.number, self.src):
self.srcs.append(self.caller)
if nx.has_path(self.callgraph, callee.number, self.dst):
self.dsts.append(self.caller)
We also added the clear
method that will drop both lists, as we are
planning to reuse the same collector for all subroutines.
Now we have a full arsenal of tools to implement the verification
procedure. It is rather straightforward, we will start by finding
target functions, using the find method that is provided by all
instances of the Seq
class. If one of the targets is not present,
then the verification condition is satisfied trivially and we bail out
of the function. Otherwise, we instantiate GraphsBuilder
and run it
to build the graphs. We then look at the call graph to see if there
is a path between the source and destination functions. If there is no
such path, then we are looking at each CFG for a pair of
callsites to the target functions that have a path. And if there is one,
then it is returned.
def verify(prog, src_name, dst_name):
src = prog.subs.find(src_name)
dst = prog.subs.find(dst_name)
if src is None or dst is None:
return None
graphs = GraphsBuilder()
graphs.run(prog)
cg = graphs.callgraph
if nx.has_path(cg, src.id.number, dst.id.number):
return ('calls', nx.shortest_path(cg, src.id.number, dst.id.number))
calls = CallsitesCollector(graphs.callgraph, src.id.number, dst.id.number)
for sub in prog.subs:
calls.run(sub)
cfg = graphs.callgraph.nodes[sub.id.number]['cfg']
for src in calls.srcs:
for dst in calls.dsts:
if src != dst and nx.has_path(cfg, src, dst):
return ('sites', nx.shortest_path(cfg, src, dst))
calls.clear()
return None