compiler optimizations (2)
Before we move on to covering dataflow analysis, I thought it would interesting to cover an example of such an analysis to motivate its discussion.
Live Variable Analysis
A Control flow graph represents our source code in a data structure that allows us to analyze properties of our program very easily.
For example, consider the C program that calculates the minimum of two numbers:
int minimum(int x, int y) {
int r = -1;
if(x > y)
r = y;
else
r = x;
return r;
}
In three-address form:
r = -1
t0 = x > y
if (t0 == 0) goto L1
r = y
goto L0
L1:
r = x
goto L0
L0:
return r
In CFG form:

At the return statement, what value of r is “alive”? Informally, a variable is alive after its definition upto a point in the program where it is re-defined on all possible paths.
In the CFG above, we can see that r = -1
is no longer alive at the return statement as all edges coming into the return statement have a write to r.
Doing live variable analysis helps us remove unnecessary statements from the program while maintaining correctness. In this example, the r = -1
assignment can be safely removed.
How do we calculate live variables for more complex programs?
At the level of a basic block, we need to define two sets: DEF
and USE
.
DEF
is the set of variables written to in the block.
USE
is the set of variables read from, before being written to, in the block.
BB001:
x = 10
y = z + 10
DEF(BB001) = {x, y}
USE(BB001) = {z}
BB002:
a = 10 + j
j = 25 + x
DEF(BB002) = {a, j}
USE(BB002) = {j, x}
BB003:
x = 10
j = x
DEF(BB003) = {x, j}
USE(BB003) = {} # x not included!
Our goal is to calculate IN
and OUT
, the set of variables that are alive, entering and leaving, a basic block.
Calculating OUT
is easy - it’s just the union of all the IN
sets in b’s successors.
OUT(b) = ∪ (IN(s) for s in SUCC(b))
Why don’t we define it the other way around? Take a moment and think about what direction facts flow in. Live variable analysis asks whether our variable is alive in the paths after ours, i.e., subsequent instructions. This means that facts flow from lower nodes to higher nodes, or backwards. Therefore, OUT
is defined as a union of successor IN
sets.
Now let’s try and find IN(b)
BB009:
x = 10 + y
In the above example, y
is in the USE
set of BB009
and hasn’t been defined in the same block. Therefore, y
had to have been defined earlier in the program and has to be alive entering the block.
That’s the first part of the formula: IN(b) = USE(b)
Vice versa, if a variable is alive in subsequent blocks without a definition, it has to have been alive in our block (remember that OUT(b)
is defined on successors). IN(b) = USE(b) ∪ OUT(b)
However, there is a problem in this equation.
If we re-define a variable within our block before it is read, it is no longer alive entering the block as the new definition will kill all of its previous definitions. We need to remove any variables in DEF(b)
from OUT(b)
. Now, a variable defined in a block only ends up in IN(b)
if there was a read before the definition.
This completes our equation: IN(b) = USE(b) ∪ (OUT(b) - DEF(b))
If you think of yourself as a block, the variable is alive coming in if I use it or any of my succesors use it, unless I define it myself.
Wait? Aren’t IN
and OUT
dependent on each other? To fix that, we define a special node called EXIT
whose IN
set is initialized to be empty. This serves as the base case in the recursive equation.
To calculate the IN
and OUT
set for each block, we can start from any block, and begin solving the equations for each block in our CFG. We stop when we reach a fixpoint - a round of iteration on the CFG where none of the IN
sets have changed.
Let us run through an example of live variable analysis. Consider the CFG below where BB001
has an edge to BB002
, which has an edge to BB003
.
For backwards analysis, we usually begin with the last node (BB003
in this example)
BB001:
a = 3;
b = 5;
d = 4;
x = 100;
BB002:
c = a + b;
d = 2;
BB003:
c = 4;
return b * d + c;
DEF(BB003) = {c}
USE(BB003) = {b, d}
OUT(BB003) = {} [Remember: IN(EXIT) = {}]
IN(BB003) = {b, d, c} ∪ ({} - {c}) = {b, d}
We then move onto BB002
:
BB001:
a = 3;
b = 5;
d = 4;
x = 100;
BB002:
c = a + b;
d = 2;
DEF(BB002) = {c, d}
USE(BB002) = {a, b}
OUT(BB003) = {b, d} [from IN(BB003)]
IN(BB003) = {a, b} ∪ ({b, d} - {c, d}) = {a, b}
BB003:
c = 4;
return b * d + c;
OUT(BB003) = {}
IN(BB003) = {b, d}
We then move onto BB001
:
BB001:
a = 3;
b = 5;
d = 4;
x = 100;
DEF(BB001) = {a, b, d, x}
USE(BB001) = {}
OUT(BB003) = {a, b} [from IN[BB002]]
IN(BB003) = {} ∪ ({a, b} - {a, b, d, x}) = {}
BB002:
c = a + b;
d = 2;
OUT(BB003) = {b, d}
IN(BB003) = {a, b}
BB003:
c = 4;
return b * d + c;
OUT(BB003) = {}
IN(BB003) = {b, d}
Since this CFG is a single chain, we are done. However, if our CFG contained multiple join nodes, we would need to run multiple iterations of equation solving until we reach a fixpoint. These iterations of solving equations is known as iterative dataflow analysis.
Analyzing the OUT
sets from the example lets us remove many instructions and derive an equivalent program.
BB001:
a = 3;
b = 5;
BB002:
d = 2;
BB003:
c = 4;
return b * d + c;
Live variable analysis removed three instructions in this simple program!
It is important to note that due to the way we have setup the IN
and OUT
equations, we will always reach a fixpoint as the size of the OUT
set can only increase (monotonic non-decreasing), and there is a finite bound to its size (the set of variables in the source code).
We can formally prove such properties using dataflow analysis, which I will talk about in the next post. Live variable analysis is an example of a dataflow analysis, which uses backwards iteration and a union meet operator.
Disclaimer: I am not an expert in compilers, and just want to share some cool ideas I’ve been learning. If I’ve made an error somewhere, let me know.
Enjoy Reading This Article?
Here are some more articles you might like to read next: