compiler optimizations (2)

Before we move on to covering dataflow analysis, I thought it would interesting to cover an example of such an analysis to motivate its discussion.

Live Variable Analysis

A Control flow graph represents our source code in a data structure that allows us to analyze properties of our program very easily.

For example, consider the C program that calculates the minimum of two numbers:

int minimum(int x, int y) {
    int r = -1;
  
    if(x > y)
      r = y;
    else
      r = x;
  
    return r;
}

In three-address form:

    r = -1
    t0 = x > y
    if (t0 == 0)  goto L1
    r = y
    goto L0

L1:
    r = x
    goto L0

L0:
    return r

In CFG form:


CFG for minimum code


At the return statement, what value of r is “alive”? Informally, a variable is alive after its definition upto a point in the program where it is re-defined on all possible paths.

In the CFG above, we can see that r = -1 is no longer alive at the return statement as all edges coming into the return statement have a write to r.

Doing live variable analysis helps us remove unnecessary statements from the program while maintaining correctness. In this example, the r = -1 assignment can be safely removed.

How do we calculate live variables for more complex programs?

At the level of a basic block, we need to define two sets: DEF and USE.

DEF is the set of variables written to in the block.

USE is the set of variables read from, before being written to, in the block.

BB001:
    x = 10
    y = z + 10

DEF(BB001) = {x, y}
USE(BB001) = {z}


BB002:
    a = 10 + j
    j = 25 + x

DEF(BB002) = {a, j}
USE(BB002) = {j, x}

BB003:
    x = 10
    j = x

DEF(BB003) = {x, j}
USE(BB003) = {} # x not included!

Our goal is to calculate IN and OUT, the set of variables that are alive, entering and leaving, a basic block.

Calculating OUT is easy - it’s just the union of all the IN sets in b’s successors.

OUT(b) = ∪ (IN(s) for s in SUCC(b))

Why don’t we define it the other way around? Take a moment and think about what direction facts flow in. Live variable analysis asks whether our variable is alive in the paths after ours, i.e., subsequent instructions. This means that facts flow from lower nodes to higher nodes, or backwards. Therefore, OUT is defined as a union of successor IN sets.

Now let’s try and find IN(b)

BB009:
    x = 10 + y

In the above example, y is in the USE set of BB009 and hasn’t been defined in the same block. Therefore, y had to have been defined earlier in the program and has to be alive entering the block.

That’s the first part of the formula: IN(b) = USE(b)

Vice versa, if a variable is alive in subsequent blocks without a definition, it has to have been alive in our block (remember that OUT(b) is defined on successors). IN(b) = USE(b) ∪ OUT(b)

However, there is a problem in this equation.

If we re-define a variable within our block before it is read, it is no longer alive entering the block as the new definition will kill all of its previous definitions. We need to remove any variables in DEF(b) from OUT(b). Now, a variable defined in a block only ends up in IN(b) if there was a read before the definition.

This completes our equation: IN(b) = USE(b) ∪ (OUT(b) - DEF(b))

If you think of yourself as a block, the variable is alive coming in if I use it or any of my succesors use it, unless I define it myself.

Wait? Aren’t IN and OUT dependent on each other? To fix that, we define a special node called EXIT whose IN set is initialized to be empty. This serves as the base case in the recursive equation.

To calculate the IN and OUT set for each block, we can start from any block, and begin solving the equations for each block in our CFG. We stop when we reach a fixpoint - a round of iteration on the CFG where none of the IN sets have changed.

Let us run through an example of live variable analysis. Consider the CFG below where BB001 has an edge to BB002, which has an edge to BB003.

For backwards analysis, we usually begin with the last node (BB003 in this example)

BB001:
    a = 3; 
    b = 5;
    d = 4;
    x = 100;  

BB002: 
    c = a + b;
    d = 2;

BB003: 
    c = 4;
    return b * d + c;

    DEF(BB003) = {c}
    USE(BB003) = {b, d}

    OUT(BB003) = {} [Remember: IN(EXIT) = {}]
    IN(BB003) = {b, d, c} ∪ ({} - {c}) = {b, d}

We then move onto BB002:

BB001:
    a = 3; 
    b = 5;
    d = 4;
    x = 100;  

BB002: 
    c = a + b;
    d = 2;

    DEF(BB002) = {c, d}
    USE(BB002) = {a, b}

    OUT(BB003) = {b, d} [from IN(BB003)]
    IN(BB003) = {a, b} ∪ ({b, d} - {c, d}) = {a, b}

BB003: 
    c = 4;
    return b * d + c;

    OUT(BB003) = {}
    IN(BB003) = {b, d}

We then move onto BB001:

BB001:
    a = 3; 
    b = 5;
    d = 4;
    x = 100;  

    DEF(BB001) = {a, b, d, x}
    USE(BB001) = {}

    OUT(BB003) = {a, b} [from IN[BB002]]
    IN(BB003) = {} ∪ ({a, b} - {a, b, d, x}) = {}

BB002: 
    c = a + b;
    d = 2;

    OUT(BB003) = {b, d}
    IN(BB003) = {a, b}

BB003: 
    c = 4;
    return b * d + c;

    OUT(BB003) = {}
    IN(BB003) = {b, d}

Since this CFG is a single chain, we are done. However, if our CFG contained multiple join nodes, we would need to run multiple iterations of equation solving until we reach a fixpoint. These iterations of solving equations is known as iterative dataflow analysis.

Analyzing the OUT sets from the example lets us remove many instructions and derive an equivalent program.

BB001:
    a = 3; 
    b = 5;  

BB002: 
    d = 2;

BB003: 
    c = 4;
    return b * d + c;

Live variable analysis removed three instructions in this simple program!

It is important to note that due to the way we have setup the IN and OUT equations, we will always reach a fixpoint as the size of the OUT set can only increase (monotonic non-decreasing), and there is a finite bound to its size (the set of variables in the source code).

We can formally prove such properties using dataflow analysis, which I will talk about in the next post. Live variable analysis is an example of a dataflow analysis, which uses backwards iteration and a union meet operator.

Disclaimer: I am not an expert in compilers, and just want to share some cool ideas I’ve been learning. If I’ve made an error somewhere, let me know.




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • compiler optimizations (1)