The DARPA Cyber Grand Challenge program developed this visualization tool, which is driven by data generated by the CGC Monitor. The Unity3D-based package consumes trace data and illustrates program execution flow as a trace line that moves between geometric forms representing software functions, whose vertices represent basic blocks in the machine code. Memory accesses and data I/O are represented by blocks of data moving between software functions and a memory plane and I/O pipe.
The analysis subsystem creates several artifacts that may be of use in preparing visualizations of the behavior of a given challenge binary while consuming a given PoV. These include:
- Execution trace having entries for each executed instruction and each memory reference (addresses, not data).
- A log of all system calls made by the program, including all parameters and returned values (including data buffers).
- Static program structure as a list of functions and an enumeration of all basic blocks (blocks.txt) within each function (in terms of addresses). Note programs will typically be stripped of symbol information, so the function names are simply addresses.
- Summary of interesting events, e.g., SEGV, and pointer overwrites that led to the event.
Trace files contain one line per traced event as follows:
type: [cycles] <address> info
Type is either "inst:" (instruction) or "data:" (memory access). Ignore "exce:" entries.
For "inst" types, the address is the EIP and info is the opcode followed by the assembly statement. Example:
inst: [c60f1956fd] <0x0000000008048081> 89 e5 mov ebp,esp
For "data" types, the address is the memory being accessed, and info reflects whether it is a read or a write, the number of bytes and the value to be read or written.
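The trace line layout described above can be parsed with a short regular expression. The following is a minimal sketch, not part of the tool chain: the names TraceEvent and parse_trace_line are hypothetical, and it assumes that the cycle counter is hexadecimal (as the example suggests) and that the address always carries a 0x prefix. Because the exact layout of the info field for "data" entries is not specified here, the sketch keeps info as an unparsed string.

```python
import re
from typing import NamedTuple, Optional

# Layout taken from the text: type: [cycles] <address> info
# Assumption: cycles are hex digits and the address has a 0x prefix.
TRACE_RE = re.compile(
    r"^(inst|data|exce):\s+\[([0-9a-fA-F]+)\]\s+<(0x[0-9a-fA-F]+)>\s*(.*)$"
)


class TraceEvent(NamedTuple):
    kind: str      # "inst" or "data"
    cycles: int    # cycle counter at the time of the event
    address: int   # EIP for "inst", accessed memory address for "data"
    info: str      # remainder of the line, left unparsed


def parse_trace_line(line: str) -> Optional[TraceEvent]:
    """Return a TraceEvent, or None for "exce" entries and unrecognized lines."""
    match = TRACE_RE.match(line.strip())
    if match is None or match.group(1) == "exce":
        return None
    kind, cycles, address, info = match.groups()
    return TraceEvent(kind, int(cycles, 16), int(address, 16), info)
```

Returning None for both "exce" entries and malformed lines lets a caller filter a whole trace with a single check.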
A log of system calls is generated as XML, with one entry per system call, identified by the type of call (e.g., read, mmap). Example:
<mmap>
<cycle>2db3995dd8f6</cycle>
<eip>b7ff85a3</eip>
<address>b7fda000</address>
<size>4c8b</size>
</mmap>
<munmap>
<cycle>2db3995e716b</cycle>
<eip>b7ff85e1</eip>
<address>b7fda000</address>
<size>4c8b</size>
</munmap>
<read>
<cycle>2db39960f743</cycle>
<eip>b7fe1424</eip>
<fd>3</fd>
<buf>bfffef38</buf>
<num_bytes>32</num_bytes>
<count>32</count>
<read_data>603f05325b66c670edd1709689c4b2cf7ef8e778e1d49294cd13fae2415a96b4</read_data>
</read>
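Entries like these can be read with Python's standard xml.etree.ElementTree module. The sketch below is illustrative only: parse_syscalls is a hypothetical helper, and it assumes the entries arrive as a fragment without a root element (as in the example), so it wraps them in a placeholder root before parsing. It converts only the cycle and eip fields, which are assumed to be hexadecimal, and leaves every other field as a string, because the example appears to mix hexadecimal values (size) with decimal ones (num_bytes).

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def parse_syscalls(fragment: str) -> List[Dict[str, Any]]:
    """Parse a sequence of system call entries into dictionaries.

    Assumption: the fragment has no root element, so a placeholder
    root named "syscalls" is added before parsing.
    """
    root = ET.fromstring(f"<syscalls>{fragment}</syscalls>")
    calls = []
    for elem in root:
        fields = {child.tag: (child.text or "").strip() for child in elem}
        calls.append({
            "name": elem.tag,                        # e.g. "mmap", "read"
            "cycle": int(fields.pop("cycle"), 16),   # assumed hexadecimal
            "eip": int(fields.pop("eip"), 16),       # assumed hexadecimal
            "fields": fields,                        # call-specific, unparsed
        })
    return calls
```

Data buffers such as read_data can then be decoded with bytes.fromhex(call["fields"]["read_data"]).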
The results of static analysis on a binary are organized as one line for each function within the binary. Each line starts with the address of the start of the function, followed by the function name (stripped binaries typically yield names that are little more than the function address) and a list of basic block addresses within the function. Example:
804beb0 deregister_tm_clones 804beb0 804bebf 804bec1 804beca
804bee0 register_tm_clones 804bee0 804bef8 804befa 804bf03
804bf20 __do_global_dtors_aux 804bf20 804bf29 804bf3c
804bf40 frame_dummy 804bf40 804bf49 804bf52 804bf67 804bee0
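A file in this layout can be loaded with plain whitespace splitting. The following is a minimal sketch; Function and parse_blocks are hypothetical names, and it assumes that all addresses are hexadecimal without a 0x prefix, as in the example lines.

```python
from typing import Dict, Iterable, List, NamedTuple


class Function(NamedTuple):
    start: int          # address of the start of the function
    name: str           # symbol name, or an address-like name if stripped
    blocks: List[int]   # basic block addresses within the function


def parse_blocks(lines: Iterable[str]) -> Dict[int, Function]:
    """Map each function start address to its name and basic blocks."""
    functions = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        start = int(parts[0], 16)
        blocks = [int(addr, 16) for addr in parts[2:]]
        functions[start] = Function(start, parts[1], blocks)
    return functions
```

Keying the result by start address makes it straightforward to look up the function that a call target in the trace refers to.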
Event summaries are captured as XML files, most of which can be ignored. The "event" tag entries reflect interesting events within a CB, with the "event_type" tag identifying the specific event as follows:
1. Execution of non-executable address (e.g., on the stack)
2. Return instruction that does not correspond to a call
3. SIGSEGV
4. SIGILL
5. SIGBUS
The event types (3), (4) and (5) indicate a proof of vulnerability. The "descrip" tag includes the EIP at which the event occurred. Example:
<replay_log>
<replay_entry>
<replay_name>POV_CBdf9df201_ATH_000000</replay_name>
<time_start>2014-12-11 10:50:41</time_start>
<cb_entry><cb_name>CBdf9df201_01</cb_name><cb_sys_calls>26</cb_sys_calls><cb_cycles>3000386</cb_cycles>
<cb_user_cycles>31895</cb_user_cycles><cb_faults>1</cb_faults><cb_wallclock_duration>0.26</cb_wallclock_duration>
</cb_entry><event><source><kind>CB</kind><pid>2810</pid><comm>CBdf9df201_01</comm>
</source><descrip>Signal 11 at eip: 8048b12 </descrip><event_type>3</event_type></event>
<replay_sys_calls>307</replay_sys_calls><replay_faults>5</replay_faults><time_end>2014-12-11 10:50:59</time_end>
<duration>17.99</duration><drone>10.20.200.115_10</drone>
</replay_entry>
</replay_log>
Note that event summaries do not reflect application-level events, such as whether a service poll passes. The SIGKILL that occurs when a CB fails a service poll is not recorded because it is not a reliable indicator: a CB that fails a poll could have exited before the SIGKILL.
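Pulling the events out of a replay log might look like the sketch below. It is illustrative only: extract_events is a hypothetical helper, the numbering in EVENT_NAMES assumes the event types are numbered 1 through 5 in the order listed above (consistent with the example, where Signal 11 carries event type 3), and the EIP is recovered from the "descrip" text with a regular expression that assumes the "eip: <hex>" wording shown in the example.

```python
import re
import xml.etree.ElementTree as ET
from typing import Any, Dict, List

# Assumption: event types are numbered in the order they are listed.
EVENT_NAMES = {
    1: "execution of non-executable address",
    2: "return without matching call",
    3: "SIGSEGV",
    4: "SIGILL",
    5: "SIGBUS",
}
POV_EVENT_TYPES = {3, 4, 5}  # types that indicate a proof of vulnerability

# Assumption: descrip contains "eip: <hex digits>" as in the example.
EIP_RE = re.compile(r"eip:\s*([0-9a-fA-F]+)")


def extract_events(xml_text: str) -> List[Dict[str, Any]]:
    """Return one dictionary per <event> element in a replay log."""
    root = ET.fromstring(xml_text)
    events = []
    for event in root.iter("event"):
        event_type = int(event.findtext("event_type", default="0"))
        match = EIP_RE.search(event.findtext("descrip", default=""))
        events.append({
            "type": event_type,
            "name": EVENT_NAMES.get(event_type, "unknown"),
            "eip": int(match.group(1), 16) if match else None,
            "is_pov": event_type in POV_EVENT_TYPES,
        })
    return events
```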
Artifacts are captured as files in a "cgcArtifacts" directory hierarchy, organized at the highest level by Challenge Set Identifier (CSID). The CSID always starts with the letters "CB", followed by six random hex characters, and ends with two hex characters that reflect the number of binaries within the challenge set.
Beneath each CSID directory are "author" and "competitor" subdirectories,
reflecting the origin of the CB instance. The next level of directory is
organized by "common name" of the instance of the challenge binary, i.e.,
the CSID augmented by information about the CB instance. For example, the
"MG" suffix indicates a patched (mitigated) binary from the CB author.
Beneath each common name is a directory for each replay (PoV or service poll)
that ran against that instance. Also, an "ida" subdirectory contains the
static analysis artifact in a file named "blocks.txt". Beneath each replay
directory are the system call logs (in files with a .xml.gz extension) and
the traces (in files with a .txt.gz extension).
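The hierarchy described above can be walked with pathlib and gzip from the standard library. This is a sketch under stated assumptions: binary_count, iter_replay_files and read_lines are hypothetical helpers, the four directory levels (CSID, origin, common name, replay) are taken from the description above, and the files are assumed to be ordinary gzip-compressed text.

```python
import gzip
import re
from pathlib import Path
from typing import Iterator, Optional, Tuple

# "CB", six hex characters, then two hex characters for the binary count.
CSID_RE = re.compile(r"^CB[0-9a-fA-F]{6}([0-9a-fA-F]{2})$")


def binary_count(csid: str) -> Optional[int]:
    """Return the number of binaries encoded in a CSID, or None if malformed."""
    match = CSID_RE.match(csid)
    return int(match.group(1), 16) if match else None


def iter_replay_files(root: Path) -> Iterator[Tuple[str, str, str, str, Path]]:
    """Yield (csid, origin, common_name, replay, path) for each .gz artifact."""
    for path in sorted(root.glob("*/*/*/*/*.gz")):
        csid, origin, common_name, replay = path.relative_to(root).parts[:4]
        # Skip the "ida" directory, which holds blocks.txt rather than replays.
        if origin in ("author", "competitor") and replay != "ida":
            yield csid, origin, common_name, replay, path


def read_lines(path: Path) -> Iterator[str]:
    """Stream the lines of a gzip-compressed text artifact."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Streaming the lines avoids loading a whole trace into memory, which matters because a trace has an entry for every executed instruction.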
Use the Simics-based CGC forensics monitor to create the trace file and the log of system calls. [Extend it to include a file-based record of events, including pointer overwrites.]
Use the IDA script functionBlocks.py to create a blocks database file containing a list of functions and their blocks.
extractMoves.py will consume the Simics trace and create the moveData.txt file representing all the data moves.
controlFlow.py will consume the Simics trace and the blocks database file and will create an operation.txt file containing each call, return and goto statement.
functionUse.py will consume operation.txt and blocks.txt to create a functionList.txt containing each function, its place in the hierarchy and the number of basic blocks that it contains.
combineDataSets.py consumes the above generated files plus the callLog.txt from the forensics system and creates the combined.txt and ranges.txt files. The combined.txt file is ordered by CPU cycle. The ranges.txt file simply notes the highest and lowest data addresses within a hacked range, for use by the visualization memory representation.
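The two ideas behind the combined output, ordering by cycle and tracking address bounds, can be illustrated as follows. This is not the combineDataSets.py implementation and does not use its file formats, which are not described here. It assumes, purely for illustration, that each input has already been reduced to a cycle-sorted list of (cycle, record) tuples; combine_by_cycle and address_range are hypothetical names.

```python
import heapq
from typing import Iterable, List, Optional, Tuple

Event = Tuple[int, str]  # (cycle, record); the record format is hypothetical


def combine_by_cycle(*streams: Iterable[Event]) -> List[Event]:
    """Merge streams that are each already sorted by cycle into one list."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


def address_range(addresses: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return (lowest, highest) address, or None if there are no addresses."""
    seen = list(addresses)
    return (min(seen), max(seen)) if seen else None
```

heapq.merge relies on each input already being sorted, so it never needs to re-sort the combined data.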
See the README-HELP.txt file for a discussion of how the Unity visualization tool uses these files and how to navigate within that tool.