Getting started with PHP-CFG

PHP-CFG is a PHP project developed by Anthony Ferrara. It uses PHP-Parser to generate an AST from PHP code, then transforms that AST into another intermediate representation in the form of a control flow graph. This gives us an easy way to traverse the graph and reason about the execution of the underlying PHP code.

Getting started

To get started, you need to know some basic compiler concepts. The audience of this post is people from a variety of backgrounds, so I’ll try to cover as much as possible. If some concepts don’t make sense as you read, you probably need to learn more about compiler design. Hopefully I have given you enough clues to know what to look for.

How does the PHP engine run a PHP script?

PHP code runs inside the Zend engine. This means that, to run a PHP script, a compiler will parse and transform the PHP code into an intermediate representation (IR) for the PHP Zend engine (a Virtual Machine) to execute.

For a very simple PHP script, you can see the IR here:

<?php

$a = 1; 
$b = 2; 
$c = $a + $b;
-----
number of ops:  5
compiled vars:  !0 = $a, !1 = $b, !2 = $c
line     #* E I O op                           fetch          ext  return  operands
-------------------------------------------------------------------------------------
   3     0  E >   ASSIGN                                                   !0, 1
   4     1        ASSIGN                                                   !1, 2
   5     2        ADD                                              ~5      !0, !1
         3        ASSIGN                                                   !2, ~5
   6     4      > RETURN                                                   1

The top part is a simple PHP script and the bottom part is the equivalent IR. You can see that !0, !1 and !2 are compiled variables that represent $a, $b and $c respectively. Op #0 (source line 3) assigns 1 to !0 ($a). Op #2 (source line 5) adds !0 ($a) and !1 ($b) and returns the result in ~5 (a temporary variable), which op #3 then stores in !2 ($c).
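
To make the listing above more concrete, here is a toy interpreter for those five ops. It is only a sketch: the array layout and the runOps() name are my own inventions, and the real Zend engine is written in C and works very differently.

```php
<?php
// Toy interpreter for the opcode listing above. NOT how the Zend engine
// is implemented; the op layout [opcode, result, op1, op2] is made up.
function runOps(array $ops): array
{
    $slots = []; // holds compiled variables (!n) and temporaries (~n)

    // Resolve an operand: names starting with "!" or "~" are slots,
    // anything else is treated as a literal value.
    $value = function ($operand) use (&$slots) {
        if (is_string($operand) && ($operand[0] === '!' || $operand[0] === '~')) {
            return $slots[$operand];
        }
        return $operand;
    };

    foreach ($ops as [$opcode, $result, $op1, $op2]) {
        switch ($opcode) {
            case 'ASSIGN':
                $slots[$op1] = $value($op2);
                break;
            case 'ADD':
                $slots[$result] = $value($op1) + $value($op2);
                break;
            case 'RETURN':
                return $slots;
        }
    }
    return $slots;
}

$slots = runOps([
    ['ASSIGN', null, '!0', 1],
    ['ASSIGN', null, '!1', 2],
    ['ADD',    '~5', '!0', '!1'],
    ['ASSIGN', null, '!2', '~5'],
    ['RETURN', null, 1, null],
]);

echo $slots['!2'], PHP_EOL; // value of $c, prints 3
```

Each op reads its operands from the slot table and writes its result back, which is all an IR like this needs for straight-line code.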

The PHP binary, whether it is called from the command line or by the web server, first generates this IR and then executes it. PHP itself is written in C.

What is an Abstract Syntax Tree (AST)?

In order to understand the source code, a compiler parses it and transforms the language constructs into a tree structure. In the example below, 2, 7 and 3 (or the variables that hold them) are the leaves of the tree, and * and + are the operators that combine them into expressions. By traversing the tree, the compiler is able to generate the IR (in the case of PHP). PHP-Parser is a PHP package that handles the first half of that job: it generates the AST for a given piece of PHP code.
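
Here is a small sketch of what traversing such a tree means. I am assuming the expression is 2 * 7 + 3, and the node shape ('op', 'left', 'right') is invented for this example; it is not PHP-Parser’s node format.

```php
<?php
// AST for the assumed expression 2 * 7 + 3, written as nested arrays.
// The 'op' / 'left' / 'right' keys are hypothetical, not PHP-Parser's.
$ast = [
    'op'    => '+',
    'left'  => ['op' => '*', 'left' => 2, 'right' => 7],
    'right' => 3,
];

// Post-order traversal: evaluate both children, then apply the operator.
function evaluate($node)
{
    if (!is_array($node)) {
        return $node; // leaf: a literal value
    }
    $left  = evaluate($node['left']);
    $right = evaluate($node['right']);
    switch ($node['op']) {
        case '+':
            return $left + $right;
        case '*':
            return $left * $right;
    }
    throw new InvalidArgumentException("Unknown operator: {$node['op']}");
}

echo evaluate($ast), PHP_EOL; // prints 17
```

A compiler does the same walk, but instead of computing a value at each node it emits the instructions that would compute it.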

What is the Control Flow Graph (CFG)?

The AST includes all the information about the program syntax. However, it lacks information about the flow of the program: which instruction is executed before which, and how conditions affect the execution. A CFG gives us this information. Below is the representation of an if-else block of code. The nodes within the CFG are the Basic Blocks of the program.

A basic block is a sequence of consecutive program instructions that will always execute once we enter the block. That is, there are no branches in between. In the example below, $a = 2 and $b = 4 would form a basic block because they will always get executed consecutively.

<?php
if ($a > 1) { 
    $a = 2; 
    $b = 4; 
} 
else { 
    ... 
}

We plan to use PHP-CFG to generate this graph and traverse it to analyze the source code.
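
Before touching PHP-CFG, it helps to see how little is needed to represent and walk such a graph. The sketch below builds the if/else CFG by hand; the block names and the array layout are made up for illustration and have nothing to do with PHP-CFG’s own classes.

```php
<?php
// Hand-written CFG for the if/else above. Each key is a basic block,
// 'ops' are its instructions and 'next' lists its successor blocks.
// Block names and layout are invented for this sketch.
$cfg = [
    'entry' => ['ops' => ['$a > 1'],           'next' => ['then', 'else']],
    'then'  => ['ops' => ['$a = 2', '$b = 4'], 'next' => ['exit']],
    'else'  => ['ops' => ['...'],              'next' => ['exit']],
    'exit'  => ['ops' => ['return'],           'next' => []],
];

// Depth-first traversal that visits every reachable block exactly once.
function walk(array $cfg, string $block, array &$seen = []): array
{
    if (isset($seen[$block])) {
        return []; // already visited through another path
    }
    $seen[$block] = true;
    $order = [$block];
    foreach ($cfg[$block]['next'] as $successor) {
        $order = array_merge($order, walk($cfg, $successor, $seen));
    }
    return $order;
}

$order = walk($cfg, 'entry');
echo implode(' -> ', $order), PHP_EOL; // entry -> then -> exit -> else
```

The $seen set is what keeps the traversal from visiting the join block twice, and it is also what stops it from looping forever once the graph contains loops.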

Running the PHP-CFG Project

Now that we know the basics, or at least know what to look for to learn them, it’s time to start playing with the PHP-CFG project.

You can find my fork of this project here. I have made some slight modifications to make it fit my needs, like adding symbolic variable nodes. For the purpose of this blog post, it shouldn’t matter which fork you look at.

1. Clone project and install PHP 7.4

First of all, clone the project so you can run its demo.php file and see how it works. You need PHP 7.4 to run this project. Luckily, PHP 7.4 is not in beta anymore, so you should be able to grab it using your system’s package manager.

2. Install composer packages

PHP-CFG, like most mature PHP projects, relies on external dependencies such as PHP-Parser. To fetch these, you need to use Composer. Get Composer from here. Then install the packages by running the following command from within the PHP-CFG directory.

php composer.phar install

3. Run demo.php

Let’s create a sample PHP file. Call it test.php:

<?php
$a = 2; 
if ($a > 1) { 
    echo "$a > 1"; 
} 
else { 
    echo "$a <= 1"; 
}

Now, run demo.php on this file. This will output the control flow graph in DOT format. We will then use xdot to visualize this file.

php demo.php test.php > graph.dot
xdot graph.dot

The CFG for the code above looks like this:

Note that we are using a fork of PHP-CFG. This fork includes some extra code to handle constructs such as try/catch blocks. It also includes extra information such as block coverage (a red block is uncovered and a green block is covered) and symbolic nodes. Ignore the block colors for now.

Starting from the very top block in the graph, every PHP file starts with a fake “main()” function as the entry point. The next line is the path to the source script for that block (…/tmp/php-cfg/demo.php). This line would point to the included file when resolving includes (we will deal with this in the future, as it’s not part of PHP-CFG itself). Then we can see the opcodes listed in the first block, which assign 2 to $a. After that we branch to the if/else blocks and finally return from the script.

PHP-CFG is also able to produce text-based output instead of the DOT format. You can enable this by changing $graphviz to false on line 16 of the demo.php file.

4. Using PHP-CFG in other projects

For this step, we will use pretty much the same APIs that demo.php uses. First, create a parser instance. This uses PHP-Parser to generate the AST and then the PHP-CFG parser to generate the IR used in the CFG. Within PHP-Parser, you can set the preferred syntax to PHP 5 or PHP 7, depending on the PHP version that the code under analysis is written in.

Several visitors are defined by default: DeclarationFinder, CallFinder and VariableFinder. We will see what they do later. For now, it is sufficient to know that each visitor hooks into specific events throughout the parsing process. For example, we can run our code in a visitor each time a new opcode is generated, or each time a new basic block is generated. These visitors either extract some high-level information from the code or modify the output on the fly. We add our visitors to the traverser object using addVisitor() and call parse(). This returns an array which includes the CFG in PHP-CFG’s own IR format. Again, the demo.php file can be used as a reference here.

Project structure

Looking at the repository, we see the following structure:

lib/PHPCfg

Hosts the main files of the project.

test

The unit tests of this project reside in this directory. Each unit test is a PHP file inside the test\code directory, written in the format shown below. The actual unit tests are executed from the PHPCfg directory by iterating over all the samples in the code directory. The code above the dashes is run through PHP-CFG and the result is compared with the text below the dashes. As far as we are concerned, we can add new unit tests to the test\code directory without having to worry about the rest.

PHP Code
----- (5 dashes)
Expected result

<?php

if ($a) {
    echo "a";
} else {
    echo "b";
}
echo "c";
-----
Block#1
    Stmt_JumpIf
        cond: Var#1<$a>
        if: Block#2
        else: Block#3

Block#2
    Parent: Block#1
    Terminal_Echo
        expr: LITERAL('a')
    Stmt_Jump
        target: Block#4

Block#3
    Parent: Block#1
    Terminal_Echo
        expr: LITERAL('b')
    Stmt_Jump
        target: Block#4

Block#4
    Parent: Block#2
    Parent: Block#3
    Terminal_Echo
        expr: LITERAL('c')
    Terminal_Return
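
If you want to read these fixtures from your own tooling, splitting them is straightforward. The helper below is a sketch under my own assumptions: splitFixture() is a hypothetical name and the project’s test runner may well do this differently.

```php
<?php
// Split a fixture into the PHP code above the separator and the expected
// output below it. splitFixture() is a hypothetical helper, not part of
// PHP-CFG; the upstream test runner may work differently.
function splitFixture(string $contents): array
{
    // The separator is a line made of exactly five dashes.
    $parts = preg_split('/^-----\R/m', $contents, 2);
    if (count($parts) !== 2) {
        throw new RuntimeException('Fixture has no ----- separator');
    }
    return ['code' => rtrim($parts[0]), 'expected' => rtrim($parts[1])];
}

// A tiny stand-in fixture; the expected part is a placeholder, not real output.
$fixture = implode("\n", ['<?php', 'echo "c";', '-----', 'EXPECTED OUTPUT', '']);
$result = splitFixture($fixture);

echo $result['code'], PHP_EOL;
echo $result['expected'], PHP_EOL;
```

Limiting the split to two parts keeps any later run of dashes in the expected output intact.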

Parser

This is the heart of PHP-CFG. It resides in \lib\PHPCfg\Parser.php. It’s a giant class whose functions handle the different node types and generate the corresponding IR opcodes.

[ parse($code, $fileName, $main_function_name = '{main}'): 105 ]
On line 105, the parse() function takes the source code as input, calls $astParser (which is a PhpParser\Parser) on it and retrieves the AST.

[ parseAst($ast, $fileName, $main_function_name): Script: 114 ]
The AST is then passed to parseAst function. It will start by creating the {main} function as the entry point to the CFG and will iterate through the AST nodes to generate the graph.

[ parseFunc(Func $func, array $params, array $stmts): 145 ]
Next, parseFunc is called on the main entry point of the program. It will generate a new block and start parsing nodes.

[ parseNode(Node $node): 188 ]
parseNode is rather straightforward. It takes a node and checks whether the node is an expression. If it is, the node is passed to parseExprNode. If it is not, parseNode tries to parse it by calling the function named after the type of the node. For example, parseStmt_Class will parse a Stmt_Class node (which represents a class definition).
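
The “function named after the type of the node” trick is easy to reproduce. The following is a sketch only: ToyParser and its array-shaped nodes are invented, and just the "parse" + node type naming scheme is taken from the description above.

```php
<?php
// Sketch of dispatch-by-node-type. ToyParser and the node arrays are
// hypothetical; only the "parse" + type naming scheme mirrors the text.
class ToyParser
{
    public function parseNode(array $node): string
    {
        // Build the handler name from the node type, e.g. parseStmt_Class.
        $method = 'parse' . $node['type'];
        if (!method_exists($this, $method)) {
            throw new RuntimeException("Unknown node type: {$node['type']}");
        }
        return $this->$method($node);
    }

    protected function parseStmt_Class(array $node): string
    {
        return "class {$node['name']}";
    }

    protected function parseStmt_Echo(array $node): string
    {
        return "echo {$node['value']}";
    }
}

$parser = new ToyParser();
echo $parser->parseNode(['type' => 'Stmt_Class', 'name' => 'Foo']), PHP_EOL; // class Foo
```

The upside of this design is that supporting a new node type only means adding one more method; the downside is that a typo in a method name fails at run time rather than at load time.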

I’ll stop here for the parser. You can follow the rest of the flow through the code if you like. The code is pretty much self-explanatory if you know what you’re looking for. Understanding the parser helps you get a good grasp of the whole CFG generation process, and it is also needed if you want to make slight modifications to the code, such as adding a new type of block (Try, Catch blocks) or adding Block identifiers to the structure.

Opcodes

Opcodes are the IR instructions in our language. They usually perform an operation on one or more operands and sometimes return a result or change the state of the program. I will try to document them on a separate page. They can be found under the lib\PHPCfg\Op directory.

Operands

These are either inputs to the Ops or produced by them. The list of available operands is under lib\PHPCfg\Operand.

  • Literal: These are concrete values in our program, such as the boolean true, the integer 3 or a string.
  • NullOperand: This operand type represents the NULL values. We use this type to distinguish between nulls in our parser vs. the null in the CFG itself.
  • Symbol: This operand type is something that I included in this project to represent symbolic values. These are values that are unknown at the time of analysis (e.g., a GET parameter with an unknown value).
  • Variable: [Missing documentation]
  • Bound Variable: [Missing documentation]
  • Temporary: The IR format generated by the parser is in SSA format. This means that each variable in the IR gets assigned to only once. This makes certain types of analyses easier, because we don’t have to track assignments to variables. To generate code in SSA form, intermediate variables need to be used. These variables do not necessarily (and mostly don’t) represent a variable in the source of the program. Temporary variables are a representation of these intermediary variables.

In the following example, !0, !1 and !2 are temporary variables. Like other expressions, they have an $expr property that points to the nodes that can be evaluated to get the value of the temporary (e.g., for Line 7, Op\BinaryOp\Plus would be referenced). Sometimes the $original property is filled in and points to the original variable represented by this temporary node.

<?php
$a = 2;
$a += 3;
$b = $a - 1;
// Will translate to the following SSA format (!# are temporary variables)
!0 = 2
!1 = !0 + 3
!2 = !1 - 1
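
The renaming itself is mechanical for straight-line code. Here is a minimal sketch that reproduces the listing above; toSsa() is a name I made up, statements are plain strings, and branches (and therefore phi nodes) are deliberately left out. It says nothing about how PHP-CFG does this internally.

```php
<?php
// Rename straight-line assignments into SSA form. Each statement is
// [target variable, expression string]. toSsa() is a hypothetical helper;
// branches and phi nodes are out of scope for this sketch.
function toSsa(array $statements): array
{
    $current = []; // source variable => temporary that currently holds it
    $next = 0;     // next free temporary number
    $out = [];

    foreach ($statements as [$target, $expr]) {
        // Replace every variable read with the temporary holding its value.
        $renamed = preg_replace_callback(
            '/\$\w+/',
            function (array $m) use ($current) {
                return $current[$m[0]];
            },
            $expr
        );
        // Every assignment gets a fresh temporary, so nothing is written twice.
        $temp = '!' . $next++;
        $current[$target] = $temp;
        $out[] = "$temp = $renamed";
    }
    return $out;
}

$ssa = toSsa([
    ['$a', '2'],
    ['$a', '$a + 3'], // $a += 3, written out in full
    ['$b', '$a - 1'],
]);
echo implode(PHP_EOL, $ssa), PHP_EOL;
```

Note that the right-hand side is renamed before the target gets its new temporary, which is why `$a + 3` still reads the old value of $a.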

Printer

This directory (lib\PHPCfg\Printer) includes the two modules that can be used to output the CFG. Text.php displays the CFG in text format and GraphViz.php is responsible for generating the dot file that we’ve seen before. If you want to change how these graphs are displayed (e.g., change the display color or the information included in the blocks), these are the files that need to be modified.
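
If you have never written DOT by hand, the format is simple enough to emit in a few lines. The function below is a stand-in for illustration and is not the code in GraphViz.php; toDot() and the adjacency-list input are my own assumptions.

```php
<?php
// Minimal DOT emitter for a CFG stored as block => list of successors.
// A stand-in for illustration only; this is not the code in GraphViz.php.
function toDot(array $edges): string
{
    $lines = ['digraph cfg {'];
    foreach ($edges as $from => $successors) {
        // Declare the block as a box-shaped node, then add its outgoing edges.
        $lines[] = sprintf('    "%s" [shape=box];', $from);
        foreach ($successors as $to) {
            $lines[] = sprintf('    "%s" -> "%s";', $from, $to);
        }
    }
    $lines[] = '}';
    return implode("\n", $lines) . "\n";
}

// Same shape as the if/else unit test shown earlier.
$dot = toDot([
    'Block#1' => ['Block#2', 'Block#3'],
    'Block#2' => ['Block#4'],
    'Block#3' => ['Block#4'],
    'Block#4' => [],
]);
echo $dot;
```

Saving that output to a file and opening it with xdot gives the familiar diamond shape of an if/else.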

Visitor

Visitors are a common pattern in compilers. The gist of the idea is that we are looking for specific operations on the source code (or the IR in this case). By running our function on every basic block or every instruction, we will have the opportunity to extract this information.

To get familiar with this concept, take a look at this project: https://github.com/silverfoxy/builtin_functions_usage
Here, visitor.php defines a visitor on the AST. The purpose of this code is to count the number of function calls to specific functions. For that, on every node (function enterNode(Node $node)), we check if there is a function call. For each function call node, we check the function name and match it with our list of functions. This provides us an easy way to count the occurrence of calls to specific functions.
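
The counting idea fits in a few lines once you strip away the parser. In the sketch below, the CallCounter class, the traverse() function and the array-shaped nodes are all invented stand-ins; in the real project the nodes come from PHP-Parser and the traversal is done by its traverser.

```php
<?php
// Self-contained visitor sketch. CallCounter, traverse() and the node
// arrays are hypothetical stand-ins for PHP-Parser's classes.
class CallCounter
{
    public $counts = [];
    private $targets;

    public function __construct(array $targets)
    {
        $this->targets = $targets;
    }

    // Called once for every node in the tree.
    public function enterNode(array $node): void
    {
        if ($node['type'] === 'FuncCall' && in_array($node['name'], $this->targets, true)) {
            $this->counts[$node['name']] = ($this->counts[$node['name']] ?? 0) + 1;
        }
    }
}

// Pre-order walk that hands every node to the visitor.
function traverse(array $node, CallCounter $visitor): void
{
    $visitor->enterNode($node);
    foreach ($node['children'] ?? [] as $child) {
        traverse($child, $visitor);
    }
}

$tree = ['type' => 'Root', 'children' => [
    ['type' => 'FuncCall', 'name' => 'eval'],
    ['type' => 'If', 'children' => [
        ['type' => 'FuncCall', 'name' => 'eval'],
        ['type' => 'FuncCall', 'name' => 'strlen'],
    ]],
]];

$counter = new CallCounter(['eval', 'system']);
traverse($tree, $counter);
print_r($counter->counts);
```

The traversal knows nothing about what is being counted and the visitor knows nothing about the tree shape, which is exactly why the pattern scales to many independent analyses.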

Another use of visitors could be to prepend a logger to every call to certain functions. Whenever we see a call to mysqli->query, we add a line before it to log the query.

PHP-CFG defines certain visitors to extract function and method calls, declarations and variables. For instance, DeclarationFinder can be used to identify class and function definitions within the CFG. Note that there is no direct link inside the CFG between a call to a function and its body. As a result, we will use DeclarationFinder to find the corresponding blocks for the target function.