Following are the notes I created while studying compilers, primarily from two books:
Why study compilers?
A higher level programming language, say C, is compiled into machine code by a compiler.
A compiler is written in C, a higher level programming language.
So, how was the compiler… compiled???
How does a compiler know C?
How was C created?
A program which translates a program (source) from a higher level programming language into another language.
- Many of the design aspects of compilers have application to other fields, such as:
  - language interpretation
  - parsing techniques
  - transformation of one language into another
- We haven't really cracked a way to generate machine code which can fully take advantage of parallel architectures
  - That is apparent in video games, where most cannot take full advantage of parallelisation and thus load slower
- A look into, and understanding of, how processors work
- Opportunity to learn programming data structures and algorithms
- Understand why a programming language is designed the way it is
"one of the best ways of learning a programming language is to write a compiler for that language, preferably writing it in its own language"
- Consists of binary instructions
- Programmed by humans explicitly in the early days of computers
- Replaced writing machine code by hand
- A set of short instructions which translate into machine code
- The translation from assembly to machine code is done by assemblers
- Assembler: a specialised program which translates assembly code into machine code
The speed or efficiency of a higher level programming language lies in the compilation of source code into assembly and the generation of efficient machine code by the assembler. If the assembler produces unoptimized code, the program will execute more slowly. That's why some programming languages are slower than others.
- Additional tools have been developed which generate parts of the compiler
  - Lexical analysers and syntax analysers (two early stages of the compilation)
  - Are language specific
  - Take as input the source language specification
  - Produce an output for the compiler
  - Reduce compiler workload
- Compiler modularization into multiple tools and/or stages, helped reduce their complexity
- Some compilers feature several parts: frontend and backend
  - Frontend:
    - generates an intermediate representation (IR) language
    - can be machine independent
  - Intermediate representation (IR) language:
    - interface between frontend and backend
  - Backend:
    - consumes IR and produces machine code
    - can be source language independent
- Frontends and backends can be mixed and matched (since they're independent from each other)
- All programming language implementations are somewhat interpreted.
There are 3 forms of interpretation:
1. Source level interpretation:
- on consecutive runs, the interpreter re-analyses the source program, translates each command into machine instructions and runs them
- low runtime efficiency
- interpreter and source code in memory
2. Intermediate code interpretation:
- the analysis of the program is performed once and produces an intermediate form
- on consecutive runs, the interpreter executes the intermediate form, which is faster
- the intermediate form is a form of compilation
- interpreter and intermediate form code in memory
3. Target code interpretation:
- full compilation
- no need for an interpreter
- on consecutive runs the hardware performs the commands directly since they're already compiled
- initial compilation cost
- only machine code in memory
- The compiler generates code for a virtual machine
- The interpreter reads the virtual machine code and emulates the execution of the virtual machine, instruction by instruction
Semantics assign computational meaning to valid strings in a programming language syntax.
- Attribute grammars (complex)
- Reference implementation (error prone)
- Specify semantics in a natural language (English) (common, easiest)
- Operators: Mathematical symbols:
+-*/
- Operands: Numbers/Variables:
1,2,x,y
- Defines the order in which operations are executed when no parentheses are present in a mathematical expression
Examples:
1+2*3 evaluates as 1+(2*3) = 7, because * has higher precedence than +
- languages used to describe the syntax of other higher level languages
- BNF (Backus Naur Form)
- EBNF (Extended BNF)
<expr> ::= <term> | <expr> + <term> | <expr> - <term>
<term> ::= <factor> | <term> * <factor> | <term> / <factor>
<factor> ::= <integer> | (<expr>)
<integer> ::= <digit> | <integer> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
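As a worked illustration (my own, not from the books), here is a leftmost derivation of 1+2*3 under this grammar, collapsing <factor> → <integer> → <digit> → literal into a single step:

```
<expr> => <expr> + <term>
       => <term> + <term>
       => <factor> + <term>
       => 1 + <term>
       => 1 + <term> * <factor>
       => 1 + <factor> * <factor>
       => 1 + 2 * <factor>
       => 1 + 2 * 3
```

Note how the grammar forces 2*3 to be grouped under a single <term>, which is how it encodes the precedence of * over +.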
- Extended BNF
- Has the following additional symbols.

| Usage | Notation |
|---|---|
| definition | = |
| concatenation | , |
| termination | ; |
| alternation | \| |
| optional | [ ... ] |
| repetition | { ... } (zero or more times) |
| grouping | ( ... ) |
| terminal string | " ... " |
| terminal string | ' ... ' |
| comment | (* ... *) |
| special sequence | ? ... ? |
| exception | - |
| becomes | ::= |
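As an illustration (my own sketch, not from the books), the expression grammar above could be written in EBNF using these symbols:

```
expr    = term, { ( "+" | "-" ), term } ;
term    = factor, { ( "*" | "/" ), factor } ;
factor  = integer | "(", expr, ")" ;
integer = digit, { digit } ;
digit   = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
```

The { ... } repetition replaces the left-recursive BNF rules, while alternation and grouping keep the operator choices on one line.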
- A defined syntax: defined with the use of metalanguages
- Defined semantics: defined by reference sheets
G = (N, T, S, P)
N: non-terminal symbols
T: terminal symbols
S: starting symbol
P: production rules (α → β)
U = N ∪ T (union of non-terminal and terminal symbols)
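For example (my own mapping, for illustration), the BNF expression grammar above fits this definition as:

```
N = { <expr>, <term>, <factor>, <integer>, <digit> }
T = { +, -, *, /, (, ), 0, 1, ..., 9 }
S = <expr>
P = the productions listed earlier, e.g. <expr> → <expr> + <term>
```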
- Defined a hierarchical classification of formal grammars
- Which provides a framework for the definition of programming languages
- And for the analysis of programs written in these languages
- Essentially: it defines productions of the form (x → y)
- 4 types:
  - Chomsky type 0: free grammar or unrestricted grammar
    - simple productions of the form α → β
    - no constraints
  - Chomsky type 1: context-sensitive grammar
    - productions of the form αAβ → αγβ, where α, β, γ ∈ U*, γ is non-null and A is a single non-terminal symbol
  - Chomsky type 2: context-free grammar
    - productions (A → γ) correspond directly to BNF
    - if a language can be defined in BNF, the language is no more complex than Chomsky type 2
    - "Programming languages are generally defined using type 2 grammars and these grammars are used directly in the production of code for program analysis."
  - Chomsky type 3: regular grammar
    - Used in the lexical analysis stage of the compiler
    - Languages defined by these grammars can be recognised by FSMs (Finite State Machines)
    - Grammar productions: A → a or A → aB
- Ref: https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy
- Is one of the key tasks performed by a compiler.
- Performs reduction steps:
  - Source program => BNF, checking syntax correctness
  - The reverse process of going from BNF to valid program expressions
  - Much harder
  - Different reduction steps can lead to a non-starting symbol
  - Algorithms have been developed to tackle this issue.
- The parsing is successful when the transformation (program => BNF) ends with the starting symbol.
- Production: BNF => source program
- Reduction: source program => BNF (reverse process)
- Top-down (start with the root)
- Bottom-up (start with the leaves)
- Lexical analysis
  - Groups characters into lexical tokens
  - Examples of lexical tokens: numbers, identifiers, punctuation, operators, strings, etc.
  - Chomsky type 3
  - OUTPUT: stream of lexical tokens
- Syntax analysis
  - INPUT: stream of lexical tokens
  - Performs reductions on the lexical tokens, reducing them to BNF productions
  - If a reduction fails, recover and report a message
  - Constructs the program's syntax tree
  - Chomsky type 2
  - OUTPUT: Abstract Syntax Tree
- Semantic analysis
  - INPUT: Abstract Syntax Tree
  - Traverses the tree and inserts: types, operator overloads, scopes, etc.
  - Flattens the parse tree to produce some form of intermediate code
  - OUTPUT: Intermediate Representation
- Machine-independent optimization (optional)
  - Applies code optimization techniques on the intermediate code
  - E.g. loop unrolling, function expansion, etc.
- Code generation
  - Depending on the target, generates:
    - hardware (machine) code
    - or assembly code
    - or virtual machine code
- Machine-dependent optimization (optional)
  - Optimizes the generated machine code
- Examine 2 approaches to designing a lexical analyser:
  1. Direct implementation: a hand-crafted lexical analyser (we write the code).
  2. Syntactic specification of lexical tokens (the formal path): use regexes to identify lexical tokens.
- Break the program into a sequence of tokens.
- All the lexical tokens should be definable in terms of regular expressions (Chomsky type 3).
- Nested structures cannot be expressed with regexes and are left for the next stage.
- Does not check for syntax correctness, i.e. "The syntax correctness of the while loop".
- In general, comments can be ignored.
- Some whitespace cannot be ignored (e.g. in `int a;` the space between `int` and `a` is significant)
- Source line numbers can be included in the output.
  - Easier to debug later in case of errors or warnings.
- INPUT: Source program
- Output: Lexical tokens
Example
// Input
while (i <= 100) {
tot += a[i]; /* form vector total */
i++;
}
// Output
while (reserved word), (, i (identifier), <=, 100 (integer constant), ), {, tot (identifier),
+=, a (identifier), [, i (identifier), ], ;, i (identifier), ++, ;, }
- Special kind of data type defined by the user.
- Consists of integer constants that are given names by the user.
- The use of enum in C to name the integer values makes the entire program easier to read.
- The name enum is short for enumeration; its named members play a role similar to the named constant 'C' in integration:
d/dx f(x) = f'(x)
∫f'(x) dx = g(x) + C
And g(x) + C = f(x).
So the integral ∫f'(x) dx equals g(x) plus a constant 'C', which can be any constant.
Example
enum flag {const1, const2, ..., constN};
// By default const1=0, const2=1, ...
// OR, custom values
enum days {
Monday = 1,
Tuesday = 2,
Wednesday = 3,
Thursday = 4,
};
- Ref: https://www.cuemath.com/calculus/constant-of-integration/
- Ref: https://www.programiz.com/c-programming/c-enumeration
- Used to give an existing type a new name (an alias).
- Often combined with enum or struct to name a newly defined type.
Example
// Define a custom type named BYTE
typedef unsigned char BYTE;
// Use BYTE to create variables b1 and b2
BYTE b1, b2;
typedef enum {
Monday,
Tuesday,
Wednesday,
Thursday,
} days;
days day_of_week();
switch (expression)
{
case constant1:
// statements
break;
case constant2:
// statements
break;
.
.
.
default:
// default statements
}
- Type of data structure which can store different data types in the same memory location, under a single name.
...
union Data {
int i;
float f;
char str[20];
};
int main( ) {
union Data data;
...
- Defined in /usr/include/limits.h

#define INT_MAX 2147483647

i.e. 2^31 - 1
- Define a function lex().
- Takes no arguments.
- Returns a lextokens result.
typedef enum {a,b,c,d} lextokens;
lextokens lex();
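A minimal hand-crafted sketch of what such a lex() could look like (the token names IDENTIFIER, CONSTANT, OPERATOR and ENDOFINPUT are illustrative, not from the books):

```c
#include <stdio.h>
#include <ctype.h>

typedef enum { IDENTIFIER, CONSTANT, OPERATOR, ENDOFINPUT } lextokens;

lextokens lex(void) {
    int ch = getchar();

    while (isspace(ch)) ch = getchar();      /* skip whitespace */
    if (ch == EOF) return ENDOFINPUT;

    if (isalpha(ch)) {                       /* identifier: letter (letter|digit)* */
        while (isalnum(ch)) ch = getchar();
        ungetc(ch, stdin);                   /* give back the first non-matching char */
        return IDENTIFIER;
    }
    if (isdigit(ch)) {                       /* integer constant: digit+ */
        while (isdigit(ch)) ch = getchar();
        ungetc(ch, stdin);
        return CONSTANT;
    }
    return OPERATOR;                         /* anything else: +, -, =, ; ... */
}

int main(void) {
    lextokens t;
    while ((t = lex()) != ENDOFINPUT)
        printf("token %d\n", t);
    return 0;
}
```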
- Ideally, you want the lexical analyser to communicate the errors it finds to the syntax analyser.
- Then the syntax analyser should know how to handle these errors.
- Until the communication (lexical analyser <=> syntax analyser) is established, the former should print the errors to stderr.
- The lexical analyser has to be sufficiently tested.
- Tests are automated: a main program feeds the lexical analyser with code and expects the correct results back.
- Edge cases must be taken into account.
- Writing tests for the lexical analyser can prevent later issues with the syntax analyser.
- A regex is made up of symbols of the language plus operators
- Operators:
  - concatenation: specified by symbol adjacency
  - alternation: specified with '|' (one or the other)
  - repetition: specified with '*', applied to the expression on its left (zero or more repetitions)
  - parentheses: denote grouping
- Precedence:
  1. repetition
  2. concatenation
  3. alternation
- abc denotes the set of strings with the single member {abc}.
- a|b|c denotes the set {a, b, c}.
- a* denotes {ε, a, aa, aaa, ...}. ε is the empty string.
- ab* denotes the infinite set {a, ab, abb, abbb, ...}.
- (a|b)* denotes the set of strings made up of zero or more a's or b's.
- a(bc|d)*e denotes the set of strings including {ae, abce, abcde, ade, ...}.
- A specification in regex is not necessarily unique: a(b|c) == ab|ac
- `+`: one or more repetitions
- `.`: the wildcard character (matches any single character)
- `?`: zero or one occurrence of the preceding one-character regular expression (match as little as possible): https://stackoverflow.com/questions/8575281/regex-plus-vs-star-difference
- `\`: escapes a special character so it is matched literally
- `^`: https://stackoverflow.com/questions/16944357/carets-in-regular-expressions
  - when at the start and inside `[]` it means "not the following", so `[^...]`
  - when inside `[]` but not at the start, it means the literal ^ character
  - Examples:
    1. `[^abc]` -> not a, b or c
    2. `[ab^cd]` -> a, b, ^ (character), c or d
    3. `\^` -> the ^ character
    4. Anywhere else -> **start** of string/line
- `[ ]`: character class, which creates a matching list matching any of the characters in the list, so `[abc]` == any of a, b, or c
- `$`: matches only the end of a line, so `o$` == only words ending in o
- `{}`: braces
  - If braces contain one or two numbers, they indicate the minimum and maximum number of times the previous pattern can match. Example: `A{1,3}` matches one to three occurrences of A
  - If the braces contain a name, they refer to a named pattern by that name, see the `wc` program
- `()`: groups a series of regexes into a new regex. Example: `(01)` matches exactly 01. `a(bc|de)` matches abc or ade.
- `&`: ???
Resources
- Online regex tester: https://regexr.com/
- Ref: https://users.cs.cf.ac.uk/Dave.Marshall/Internet/NEWS/regexp.html
- Online regex exercises: https://regexone.com/lesson/capturing_groups
- Transition diagram:
  - Each node is called a state.
  - There is a starting state.
  - There is an ending/accepting state (double circles).
  - Edges are labeled.
- Transition tables can be used to describe machines as well.
- Deterministic finite-state machines
  - For each state there is at most one possible next state for each input symbol.
  - "The entries in the transition table are either empty or contain a single next state."
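A minimal sketch (mine, not from the books) of a deterministic machine in C that accepts identifiers of the form letter (letter|digit)*; the transition function plays the role of the transition table:

```c
#include <stdio.h>
#include <ctype.h>

enum { START, IDENT, DEAD };                 /* IDENT is the accepting state */

static int step(int state, int ch) {         /* exactly one next state per input symbol */
    switch (state) {
    case START: return isalpha(ch) ? IDENT : DEAD;
    case IDENT: return isalnum(ch) ? IDENT : DEAD;
    default:    return DEAD;
    }
}

static int accepts(const char *s) {
    int state = START;
    for (; *s; s++)
        state = step(state, (unsigned char)*s);
    return state == IDENT;                   /* accept iff we end in the accepting state */
}

int main(void) {
    printf("%d\n", accepts("count1"));       /* 1: valid identifier */
    printf("%d\n", accepts("1count"));       /* 0: cannot start with a digit */
    return 0;
}
```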
- Non-deterministic finite-state machines
  - An input symbol can trigger the machine into more than one next state simultaneously.
  - Multiple starting states are allowed.
- Is a hard and time-consuming process.
  - Simplifying transition tables is hard.
- Instead, we use software which generates lexers from a language specification.
- Unix tool: lex (see the man page or `info lex`)
  - Lesk ME (1975) Lex – a lexical analyser generator. AT&T Bell Laboratories, Murray Hill. Computing Science Technical Report 39
- Flex (see the man page or `info flex`)
  - Levine J (2009) Flex & bison. O'Reilly Media, Sebastopol
- More out there...
- Easy to use and easy to generate lexers with.
- Generate very efficient lexical analysers.
- Popular tool for generating lexers.
- INPUT (file: lex.l):

definitions
%%
rules
%%
user code

- OUTPUT: a C program, lex.yy.c
  - yylex(): returns the next token and executes its code from the rules.
- Example

letter [a-zA-Z]
digit [0-9]
letter_or_digit [a-zA-Z0-9]
white_space [ \t\n]
other .
 /* "other" (the pattern ".") matches any single character except newline */
%%
{white_space}+ ; /* "+": one or more repetitions; the null C statement ";" ignores the match */
{letter}{letter_or_digit}* return 1;
{digit}+ return 2;
{other} return 3;
%%
int main() {
int lextoken;
while (lextoken = yylex())
printf("%d - %s\n",lextoken, yytext);
}
int yywrap()
{
return 1;
}
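A usage sketch (assuming the listing above is saved as lex.l): running `flex lex.l` produces `lex.yy.c`, which can be compiled directly with `cc lex.yy.c -o lexer` since this example supplies its own main and yywrap. Feeding it `count1 = count1 + 10` should print one line per token (e.g. `1 - count1`, `3 - =`, `2 - 10`), because identifiers return 1, integers 2 and everything else 3.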
- The rules are defined in 2 columns separated by whitespace:
  - The left column contains regexes to identify expressions.
  - The right column contains actions to be executed when the expressions are matched.
  - The columns are delimited by whitespace.
Example
integer printf("found keyword INT");
// Looks for the "integer" string and prints "found keyword INT"
- Generates lexers in C and C++
- Can be connected with scripting languages via SWIG: https://en.wikipedia.org/wiki/SWIG
- yytext: contains the text that matched the regular expression pattern in the rule
  - When called outside of a `while` loop, it only returns a single character.
  - When it reads multiple characters, it evaluates an action multiple times.
- The yywrap function:
  - Used to determine whether the lexer has reached the end of a file.
  - Automatically called from yylex.
  - If yywrap returns 1: no more characters to analyze (EOF).
  - If yywrap returns 0: yywrap opened a new file for processing.
  - It's a mechanism which allows processing of multiple files.
  - If omitted, `gcc` will complain; it can be skipped with `%option noyywrap` at the very top of the lexer source file.
- There is no correct answer here.
- There are several ways to generate lexers; you don't need to memorize or know them all.
  - Knowing one way can lead to understanding of the rest.
- A combination of both is also possible:
  - Write flex rules + custom code in the same file to recognize symbols.
- Is the process of going from (E)BNF to the sequence of tokens
- Two types:
  - **Left-most derivation**: expand the left-most non-terminal first (corresponds to top-down parsing)
  - **Right-most derivation**: expand the right-most non-terminal first (its reverse corresponds to bottom-up parsing)
- Choice matters, as seen in Chapter 2
- The process of parsing is equivalent to performing derivation in reverse.
  - source => BNF
- Parsing becomes difficult for uncertain productions, such as P → Q|R (which production is correct?)
- Lookahead solves that issue: used for non-simple grammars (alternations)
  - Alternative to backtracking (which is difficult to implement)
  - Looks ahead of the production to see if it produces the correct result based on what has already been produced.
  - One lookahead is usually enough for most languages.
- Associate a function with each non-terminal symbol.
  - Its task is to recognise an instance of that non-terminal.
  - Functions call each other according to the syntax rules of the grammar, matching terminal tokens from the input as they go.
  - This is a recursive descent parser.
- Factoring is a transformation to avoid backtracking or needing multiple lookaheads.
  - A complicated alternation is replaced with a simpler one in which a single lookahead can predict the correct production.

Example

A → αβ|αγ # Which way is correct?
# Instead, we factor to the following, where a single lookahead is enough.
A → αA1
A1 → β|γ

- "There are of course grammars that are best dealt with by increasing the parser lookahead, but this comes at a cost of increased complexity."
- Top-down parsers: left-most derivation
- Bottom-up parsers: right-most derivation in reverse
  - Rarely written by hand.
  - Usually generated by software.
  - Are more powerful.
  - Have added complexity.
- LL(k): read Left to right, Leftmost derivation, at most k lookahead tokens.
- LR(k): read Left to right, Rightmost derivation in reverse, at most k lookahead tokens.
- Most programming languages: LL(1).
- LL(k) grammars are also LR(k).
- LR(k): harder to implement, more powerful than LL(k).
- Rightmost derivation in reverse.
- Given an expression: x+y*z
- Produce its BNF reduction, i.e. reduce it to <expr>
Example
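A hedged walk-through (mine, not from the book) of the bottom-up reduction of x+y*z, using the expression grammar from the BNF section and treating x, y and z as <factor>s:

```
x + y * z
<factor> + y * z              shift x, reduce to <factor>
<term> + y * z                reduce to <term>
<expr> + y * z                reduce to <expr>
<expr> + <factor> * z         shift +, shift y, reduce to <factor>
<expr> + <term> * z           reduce to <term>
<expr> + <term> * <factor>    shift *, shift z, reduce to <factor>
<expr> + <term>               reduce <term> * <factor> to <term>
<expr>                        reduce <expr> + <term> to <expr>  (starting symbol: success)
```

Note that after y is reduced to <term> the parser must shift * rather than reduce further; choosing correctly at such points is exactly what the lookahead and the handle are about.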
- Standardized algorithms exist.
- Some lookahead is necessary.
- The handle:
  - The substring matching the right-hand side of a production.
  - The key problem in bottom-up parsing is identification of the handle.
Example
In x+y*z, which substring should be used to produce a BNF form, e.g. <expr>+<term>?
- Each node has to contain the data and some meta-data about its content (if statement, integer, constant, etc..).
- The form of data depends on the content.
- #WIP Great further reading about dynamic parsers.
- There are many ways of writing a parser and their complexity varies.
- You don't need to know all of them, just a single technique from which you can derive the rest.
- Powerful tools for generating bottom-up parsers: yacc, bison or CUP
- Single lookahead stage.
- The lookahead variable ch should already contain the first token of the non-terminal being recognised.
- Lookahead consistency is a common source of errors in this type of parser. This is all best illustrated by an example (from Sect. 4.2.3).
S → Az|z
A → x A|B
B → y
Here is a complete C program to recognise strings of this language.
#include <stdio.h>
#include <stdlib.h>
int ch;
void error(char *msg) {
printf("Error - found character %c - %s\n",ch,msg);
exit(1);
}
void b() {
if (ch == 'y') ch = getchar();
else error("y expected");
}
void a() {
if (ch == 'x') {
ch = getchar();
a();
}
else b();
}
void s() {
if (ch == 'z') ch = getchar();
else {
a();
if (ch != 'z') error("z expected");
else ch = getchar();
}
printf("Success!\n");
}
int main(int argc, char *argv[])
{ ch = getchar();
s();
return 0;
}
Examples
$ ./simpletopdown
xyz
Success!
$ ./simpletopdown
xxxxyz
Success!
$ ./simpletopdown
xxxxz
Error - found character z - y expected
$ ./simpletopdown
z
Success!
- The choices of the parser have been made without justification or explanation.
- Programming the choice logic is:
  - Difficult to implement by hand.
  - Relies on 2- or 3-dimensional arrays.
  - Gets complicated.
  - Simplified by parser generators: bison
- Precedence parsing:
  - Consists of 2 indexes:
    1. an index for the top of the stack.
    2. an index for the next input character (shift).
  - Based on these 2 tokens, the parser can decide whether to reduce, shift or report an error.
Shift-reduce conflicts appear when the same expression can have more than one syntax tree. In other words, when the parser is reading an expression, it can arrive at a point where it can choose either to shift the next token or to reduce the tokens at the top of the stack.
Reduce-reduce conflicts appear when the symbols on the right-hand side can be reduced to more than one non-terminal on the left-hand side.
Example
A -> a
B -> Ab
C -> a
We can see that a can either be reduced into A or C, which leads to an ambiguity.
- LALR(k): Look-Ahead LR
  - L: read the input left to right
  - R: uses the rightmost derivation in reverse
  - k: number of lookahead tokens
- yacc: generates LALR(1) parsers in C
- bison: GNU Project version of yacc
  - Can generate table-driven LALR(1) parsers in C
  - Includes support for other bottom-up parsing methods.
- bison and flex are used in tandem.
- Similar input format to flex.
  - Instead of regexes, it accepts grammar production rules.
- Produces parsers in C.
- Is combined with flex:
  - The generated parser calls functions from the generated lexer/scanner.
- yyparse(): the bison-generated parser function (just like yylex() for flex).
  - A return value of 0 means a successful parse.
S → Az|z
A → x A|B
B → y
- normal: S → Az → xAz → xBz → xyz
- reverse: read the normal derivation from right to left.
  - Start with the terminal symbol.
definitions
%%
rules
%%
user code
- Similar structure to flex, with 2 columns: patterns and instructions
  <patterns> <instructions>
- The patterns are grammar reduction rules
- A colon is used to separate the non-terminal from its definition
Example
%{
#include <stdio.h>
void yyerror(char*);
int yylex(void);
%}
%%
S:
A 'z' { printf("S->Az, done\n"); }
| 'z' { printf("S->z, done\n"); }
;
A:
'x' A { printf("A->xA\n"); }
| B { printf("A->B\n"); }
;
B:
'y' { printf("B->y\n"); }
;
%%
void yyerror(char *s)
{
printf("***%s\n",s);
}
int yylex() {
int ch;
ch=getchar();
while (ch=='\n') ch=getchar();
return ch;
}
int main()
{
if (yyparse() == 0) printf("Parsing successful\n");
else printf("Parsing failure\n");
return 0;
}
- "Because A is not enclosed in quote marks, it is taken as a non-terminal and bison expects its definition to appear in due course"
- "In this grammar, x, y and z are all terminal symbols because they are enclosed by single quote marks."
- "And because S is the first non-terminal to be defined, it is taken as the starting symbol."
- " Each rule consists of a pattern and a corresponding action. The idea is simple. When the bison-generated parser runs, it matches these patterns with the input, controlled by the parsing algorithm, and if a pattern matches, the corresponding action is executed. Here, the actions are used for tracing the execution of the parsing process. We will worry about more complex actions later, specifically the generation of the parse tree. "
- " The final section of user code defines additional functions required by the parser. Bison simply copies this section straight to its output. The yyerror function is called by the parser when an error has been detected and here we just output a message. We will tackle the problem of error reporting and recovery later in this chapter. The yylex function is the lexical analyser. Here, it gets the next character from the input, ignoring newline characters. Although the original BNF grammar says nothing about ignoring newlines they are ignored here to result in a slightly cleaner user interface, removing the potential confusion caused by the need for a newline to send an input buffer to the running program when running interactively. Finally the main function is defined. This calls the yyparse function which is the bison-generated parser. If yyparse returns the value zero, the parse has been successful. "
- yylex() is hand-written here; it should be generated by flex.
- A shift pushes the current symbol from the input onto the stack.
- A reduce matches the right hand side of a rule of the grammar against the top of the stack and replaces it by the left-hand side.
- If we have to make a decision between shift and reduce this is called a shift-reduce conflict.
- Can occur when both reduce/reduce and shift/reduce are possible.
  - "Can occur when two or more rules can be applied to perform a reduction on the same input sequence."
- Solution: grammar modification
  - Sometimes difficult to understand from bison error outputs.
Grammar definition

<calculation> ::= <expr> \n
<expr> ::= <term> | <expr> + <term> | <expr> - <term>
<term> ::= <factor> | <term> * <factor> | <term> / <factor>
<factor> ::= CONSTANT | (<expr>)
%{
#include <stdio.h>
#include <ctype.h>
void yyerror(char*);
int yylex(void);
int ival;
%}
%token CONSTANT
%%
calculation:
expr '\n' { printf("%d\n", $1); }
;
expr:
term { $$ = $1; }
| expr '+' term { $$ = $1 + $3; }
| expr '-' term { $$ = $1 - $3; }
;
term:
factor { $$ = $1; }
| term '*' factor { $$ = $1 * $3; }
| term '/' factor { $$ = $1 / $3; }
;
factor:
CONSTANT { $$ = ival; }
| '(' expr ')' { $$ = $2; }
;
%%
void yyerror(char *s)
{
printf("***%s\n",s);
}
int yylex() {
int ch;
ch=getchar();
while (ch==' ') ch=getchar();
if (isdigit(ch)) {
ival=0;
while (isdigit(ch)) {
ival=ival*10+(int)ch-(int)'0'; /* ignore overflow */
ch=getchar();
}
ungetc(ch,stdin);
return CONSTANT;
}
else return ch;
}
int main()
{
return yyparse();
}
- $$: the result (value of the left-hand side)
- $1: the first (left) token's value
- $2: the second (middle) token's value
- $3: the third (right) token's value
- ival: holds the value of the constant just read by the lexer
- CONSTANT: token used to recognize a constant (returned by yylex)
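A hedged usage sketch (assuming the listing above is saved as calc.y): `bison calc.y` generates `calc.tab.c`, which can be compiled with `cc calc.tab.c -o calc`. Typing `2+3*4` followed by a newline should print `14`: the grammar reduces 3*4 inside a term before the addition, and the action on the calculation rule prints the result as soon as that rule is reduced.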
- There is no perfect option here.
- It depends on the use case and the requirements of the language.
See DL_compiler_frontend.md
- Check out the grammar of known languages and try to implement it.
  - The C programming language; see the K&R book.
- Check out Pyparsing: https://pypi.org/project/pyparsing/
  - https://en.wikipedia.org/wiki/Parsing_expression_grammar
- Concerned with type checking.
- Type casting.
- Usually, static typing is more helpful in compiler design.
- Uses the symbol table to record types of variables and track complex/user defined types.
example: struct, typedef, union
typedef struct tnode {
int asttype;
int astdata1,astdata2;
astptr p1,p2,p3;
} astnode;
union utype {
char xch;
int xint;
float xfloat;
} uval;
- Updates/annotates the syntax tree by adding types to its nodes (recursive post-order traversal).
  - A node's type is derived from the types of its children.
  - Example: expression: expression '+' term { $$ = $1 + $3; }
- Deals with variable scope:
  - Using a stack.
  - A stack pointer points to local variables within a context.
  - Once out of a context, the stack pointer is restored to the outer context.
  - Global variables are accessed via a different pointer (gp).
Defined as arr[i][j] where:
- `i` is the row
- `j` is the column
int arr[3][4];
Any element e = arr[i][j] is accessible via its memory address:
addr(arr[i][j]) = base_address(arr) + (i*4 + j) * sizeof(int)
(where 4 is the number of columns in a row)
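A small C sketch (mine) showing that this formula matches the row-major layout the compiler actually uses:

```c
#include <stdio.h>

int main(void) {
    int arr[3][4];
    int i = 2, j = 3;

    /* row-major layout: arr[i][j] lives at base + (i*4 + j) * sizeof(int),
       where 4 is the number of columns in a row */
    size_t offset = (i * 4 + j) * sizeof(int);

    printf("computed offset: %zu bytes\n", offset);
    printf("matches the real address? %d\n",
           (char *)&arr[i][j] == (char *)arr + offset);   /* prints 1 */
    return 0;
}
```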
- Complements the Chomsky type 2 (context-free) grammar.
- Describes the types of the nodes (semantic information) in the AST (Abstract Syntax Tree).
- Semantic information is stored in attributes associated with terminal and non-terminal symbols.
- Interface between the frontend and backend of a compiler.
- Multiple IRs can be used to generate machine code.
- "It should not be regarded as an interface with which the user of the compiler is particularly concerned, but ensuring that there is a way for the compiler writer or interested user to inspect a human-readable representation of the IR is important."
- Should be easily translatable into machine code.
- Should permit aggressive code optimization.
- Graph-based IRs and linear IRs.
  - Each type has its tradeoffs.
  - There is no standard.
- Is regarded as machine code for a virtual machine.
- We can use higher level programming languages (e.g. C or Rust) as IRs.
  - Generate C as the IR.
  - Compile the C IR using GCC to produce machine code.
- Stack-based IRs were implemented in the early days of compilers.
  - Targeted stack-based hardware.
  - Don't work well with today's powerful optimization algorithms.
  - An example is the JVM.
- Three-address code is more modern and powerful.
  - A linear IR; each instruction contains an operator, up to 2 arguments and a result.
  - More popular.
- Representation in a 3 address code:
Example
x = y*3 + z;
is
t1 = y ∗ 3
x = t1 + z
- There are similarities in RISC commands and 3-address code instructions.
- Optimization is difficult.
- Common sub-expressions in the AST.
- With shared common sub-expressions, an AST becomes a DAG (Directed Acyclic Graph).
Example
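A hedged illustration (mine, not from the books): in x = (a+b) + (a+b)*c the sub-expression a+b appears twice. The AST holds two separate + subtrees, while the DAG shares a single + node, which the three-address form can then compute once:

```
r1 = a + b      /* the shared DAG node: computed once */
r2 = r1 * c
r3 = r1 + r2
x  = r3
```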
- Is a directed graph.
- Nodes represent instructions.
- Edges represent possible flows.
- Show the control flow of a program.
Example
x = a + b;
y = a + 2;
z = x * b;
- Here the value of z depends on the value of x.
  - The operation computing x must come first.
  - So, in the DDG, the node for x is connected with the node for z.
- The node for y is independent; that's why this representation is complementary and not complete.
- Local variables: vl0, vl1, vl2, ..., vln
- Global variables: vg0, vg1, vg2, ..., vgn
- Registers: r0, r1, r2, ..., rn
- None of the instructions can use more than three addresses.
- Instructions:
  - Assignments with two arguments and a destination (a = b op c), with a single argument for a unary operator and a destination (a = op b), and simple copy assignments (a = b).
  - Unconditional jumps (goto label).
  - Conditional jumps (if a relop b goto label).
  - Array access (a = x[i] and x[i] = a).
  - Function call (call function; arguments are passed by preceding the call with an appropriate number of param instructions of the form param arg).
  - Function return (return).
  - Input and output (read a and write a).
Example
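A hedged sketch (mine, not from the book) of how a small fragment maps onto these instructions, assuming a, b, c and i are the globals vg0..vg3:

```
/* source */
if (a < b)
    c[i] = a;

/* three-address code (condition inverted so the jump skips the body) */
if (vg0 >= vg1) goto l1
vg2[vg3] = vg0
l1:
```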
- Check out the dragon book for three-address IRs.
- There's no definitive way to create an IR.
- An example could be a function named cg() which traverses the tree and generates IR.
  - The decision on what to generate is performed with a lengthy switch statement.
Example
typedef enum {
R_REG, R_GLOBAL, R_LOCAL, R_CONST, R_NONE} regtags;
/* r5, vg5, vl5, 5, nothing returned */
/* Structure returned by the cg function */
typedef struct regstruct {
regtags regtype; /* R_REG, R_GLOBAL, R_LOCAL, R_CONST, R_NONE */
int regvalue; /* which register or integer value of constant */
} a3token;
a3token cg(astptr p) {
int nodetype;
a3token noresult = {R_NONE, -1};
astptr left,right;
if (p==NULL) return noresult;
nodetype = p->asttype;
switch (nodetype) {
.
.
.
case N_PLUS:
case N_MINUS:
case N_MUL:
case N_DIV:
left=p->p1;
right=p->p2;
t=genreg();
t2=cg(left);
t3=cg(right);
opreg(t);
printf("="); opreg(t2); outop(nodetype); opreg(t3); // CODEGEN
printf("\n");
return t;
/*
outop(): Output the arithmetic instruction for the node.
opreg(): Outputs an argument for the machine instruction, either a temporary register, a register storing
a variable or an integer constant.
genreg(): Returns the identity (as an a3token) of the next unused temporary register.
*/
case N_IF:
l1=genlab();
cnode=p->p1; /* point to the cond node */
t2=cg(cnode->p1); t3=cg(cnode->p2);
printf("if ("); opreg(t2); invcondition(cnode); opreg(t3); // CODEGEN
printf(") goto l%d\n",l1); /* jump to else part or end of statement*/
t=cg(p->p2); /* cg then part */
if (p->p3 != NULL) { /* else present */
l2=genlab();
printf("goto l%d\n",l2);
printf("l%d:\n",l1);
t=cg(p->p3); /* cg else part */
printf("l%d:\n",l2);
}
else /* no else present */
printf("l%d:\n",l1);
return noresult;
/*
invcondition(): inverts the conditions to make them work with goto expressions
*/
case N_WHILE:
l1=genlab();
l2=genlab();
cnode=p->p1; /* point to the cond node */
t2=cg(cnode->p1); t3=cg(cnode->p2);
printf("l%d:\nif (",l1); opreg(t2);
invcondition(cnode); opreg(t3);
printf(") goto l%d\n",l2); /* jump out of while */
t=cg(p->p2); /* cg do part */
printf("goto l%d\n",l1);
printf("l%d:\n",l2);
return noresult;
}
}
Example: using the above code, the expression 1*2+3+4 will generate the following three-address code.
r3 = 1 * 2
r2 = r3 + 3
r1 = r2 + 4
Example: using the above code, the expression if (a+b > a*b) c = 1 else c = 2 will generate the following three-address code.
r1 = vg0 + vg1
r2 = vg0 * vg1
if (r1<=r2) goto l1
vg2 = 1
goto l2
l1:
vg2 = 2
l2:
Example: using the above code, the expression while (i<=10) i = i + 1 will generate the following three-address code.
l1:
if (vg0>10) goto l2
r1 = vg0 + 1
vg0 = r1
goto l1
l2:
So, what this generates also determines which registers store which intermediate result.
It's important to remember that at this point we're not concerned with optimization; we don't want to get into premature optimization. So the IR can be as simple as necessary to achieve the goal of translation.
- If quality code is not required, writing a code generator is fairly easy
- Target machine processor manuals should be at hand.
- Code generation happens with or without prior optimization.
- No standard way or universal solution for generating code.
- No formal techniques exist to represent information about the target machine and generate code for it.
- Classification of processors into either reduced instruction set computers (RISC) or complex instruction set computers (CISC).
- Both types of processor are used today.
- Simple instruction set.
- Smaller instructions.
- Each instruction runs fast because smaller commands require fewer CPU cycles.
- Uniform instructions.
- Complex instructions.
- Require more CPU cycles.
- Complex addressing modes.
- Another approach towards code generation is to generate code which can be executed on a VM.
  - This is an interpreter.
- The target language can be another high-level language, like C.
  - The generated language is then compiled using an existing compiler, like gcc.
- "The process of instruction selection forms the core of code generation."
- There are infinite ways to generate target code.
- A widely used approach is: - Transform the IR into a tree representation. - Find patters in the tree which can be re-writen into machine code instructions.
- The proceess of finding these tree patterns is called "tiling".
- One successful algorithm used for tiling is the
maximal munch
(consume the largest input possible to generate an output).
- Happens when the hardware registers have to be shared, so intermediate values are stored (spilled) to memory.
- Should be avoided as much as possible.
- Is time consuming (load, store, load, store,....).
Is the process of finding code blocks in which a single register can be used to carry values for multiple variables without overlapping.
- Live ranges in straight-line code are easy to find.
- Complex control logic makes finding live ranges difficult.
Example
.
x = 1; ⎤
. ⎥ <= live range 1
. ⎥
... = x; ⎦
- Algorithm used to find whether we can allocate a given number of physical registers without spilling.
  - If that's not possible, we have to spill.
- If registers have to be shared (spilled), then ideally we spill the infrequently used values.
- #WIP (create the graph colouring algorithm in C — see the sketch below)
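A minimal sketch of the idea in C (mine, not the books' algorithm): greedily colour an interference graph, where node i and node j interfere when their live ranges overlap and every colour stands for a physical register. If more colours are needed than physical registers exist, the extra values would have to be spilled.

```c
#include <stdio.h>

#define NODES 6

/* made-up interference graph: adjacent[i][j] != 0 means the live ranges overlap */
int adjacent[NODES][NODES] = {
    {0,1,1,0,0,0},
    {1,0,1,1,0,0},
    {1,1,0,1,0,0},
    {0,1,1,0,1,0},
    {0,0,0,1,0,1},
    {0,0,0,0,1,0},
};

int colour[NODES];

int main(void) {
    for (int i = 0; i < NODES; i++) {
        int used[NODES] = {0};
        /* mark colours already taken by interfering, already-coloured nodes */
        for (int j = 0; j < i; j++)
            if (adjacent[i][j])
                used[colour[j]] = 1;
        /* pick the lowest free colour */
        int c = 0;
        while (used[c]) c++;
        colour[i] = c;
    }
    for (int i = 0; i < NODES; i++)
        printf("virtual register v%d -> physical register r%d\n", i, colour[i]);
    return 0;
}
```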
- Registers are a precious commodity.
- C's indirection operation can wreak havoc to register allocation.
- int *p; *p = 3;
  - Where is 'p' pointing to?
  - If it's pointing to a variable held in a register, we also need to update that register's value.
  - If it's pointing to a memory address, we need to update that one as well.
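A small illustration (mine) of why this indirection is a problem for the register allocator:

```c
#include <stdio.h>

int main(void) {
    int x = 1;
    int *p = &x;   /* x is now aliased: it cannot safely live only in a register */

    *p = 3;        /* this store must be visible the next time x is read */
    printf("%d\n", x);   /* must print 3, so the compiler keeps x and *p consistent */
    return 0;
}
```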
- It's possible to create a one-off custom generator which does pattern matching against the IR and data structures derived from trees.
  - Complexity rises with the number of patterns.
  - Usually, one-off generators for a specific architecture are written.
- Use peephole optimization on the IR instructions and generate machine code.
  - Each IR instruction is translated in isolation.
  - Poor machine code is produced.
  - Danger that not all input will be matched.
- Graham-Glanville approach (match grammar rules against the IR and replace them with machine code).
  - Use context-free grammar matchers on the IR.
  - Replace matching text with machine code.
  - Essentially, it's a parser for the IR.
- Other systems:
  1. Dynamic programming
  2. `twig`
  3. `BURS` (Bottom-Up Rewrite System)
  4. `BURG`
Things I realized after studying compilers.
In a high-level overview:
- We need a frontend capable of producing an IR.
- We need a backend capable of consuming the IR.
- The backend needs to generate machine code targeting our architecture.
Example
GCC has several front-ends: C, C++, Go see: https://gcc.gnu.org/frontends.html
GCC has several target backends: aarch64, i386, arm see: https://gcc.gnu.org/backends.html.
Example
Rust uses LLVM for code generation, see https://rustc-dev-guide.rust-lang.org/backend/codegen.html.
The front-end of Rust produces the LLVM IR, which is consumed by LLVM to produce target machine code.
Particularly, Rust has multiple IRs: https://rustc-dev-guide.rust-lang.org/overview.html#intermediate-representations
ref: https://rustc-dev-guide.rust-lang.org/backend/codegen.html
Is a reusable compiler backend.
Takes input in the form of LLVM IR.
Produces output for a plethora of targets.
What is clang? https://clang.llvm.org/
clang is a frontend (for the C language family) and a replacement for GCC, which uses LLVM as its backend.
These are a subject of their own.
https://docs.python.org/3/reference/grammar.html
Machine Learning relies on specific operations (operations on matrices) which are performed faster on application-specific hardware, such as GPUs or TPUs.
Making a higher level programming language (HLPL) work (compiled) on this hardware is a primary concern of compiler infrastructure creation (GCC, CUDA, LLVM,...).
So:
HLPL => (compiler) => MC => HW
Where:
- HLPL: higher level programming language
- MC: machine code
- HW: hardware (GPU | TPU | NPU)
- NPU: Neural Processing Unit
That is why compiler infrastructure creation is so closely correlated with AI.
- Developing a brand new compiler from scratch is very tough.
However, if machine M already supports another development environment/compiler for another language, C, then we can write a compiler for L using C tools, like flex, bison and GCC. This compiler will be written in C.
Once we have a compiler for L in C, we can write a compiler for L in L, using the previously developed compiler written in C.
The last compiler, written in L, should also be able to compile itself, essentially making L a primary language.
Testing whether a compiler can compile itself is a very strong validity test.
The Java programming language is compiled into a specific bytecode which is run on the JVM.
The JVM is a stack-based interpreter which receives Java bytecode and interprets it into machine operations.
Because the JVM reads Java bytecode, we can load the bytecode of another class while the program is running.
That class can be located locally or on another machine on the internet.
So, the ability to load a class into the JVM during runtime is what defines dynamic class loading in Java.
JIT (just-in-time) compilation is the process of compiling specific pieces of a program into machine code during its execution.
The reason to use JIT for a specific routine is that the routine might be called several times.
JIT compilation of a routine should result in faster execution times.
- Algorithms which can take advantage of parallelisation are immature.
  - People are very ingrained in thinking sequentially.
- What about compilers for quantum computers?
- Can you apply an AI program to generate or parallelise sequential algorithms?
Baby Steps to a C Compiler