PHP builtin functions usage statistics

I was working on a project that required implementing PHP builtin functions. To understand the scope of the project and the implementation effort, I thought it would be a good idea to estimate the total number of these functions and how frequently they are used in popular PHP applications. This would give me a rough idea of how many functions I need to implement myself. At the same time, the results give insight into how effective debloating is for these applications, which is something I’ve been working on for over a year now.

PHP builtin functions

These are the functions that ship with PHP itself. They closely resemble some of the language constructs. For instance, “echo” is a language construct that prints to the output stream, whereas “preg_replace” is a builtin function. The list can be found here.
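To make the distinction concrete, here is a minimal sketch that asks the running interpreter for its builtin functions. It relies only on the standard get_defined_functions() and function_exists() calls. The helper name builtin_function_names is my own, for illustration, and the count it prints depends on which extensions are loaded, so it will not match any particular total.

```php
<?php
// Hypothetical helper: list the builtin ("internal") functions that the
// running interpreter exposes. The result depends on the loaded extensions.
function builtin_function_names(): array
{
    // get_defined_functions() returns ['internal' => [...], 'user' => [...]].
    $names = get_defined_functions()['internal'];
    sort($names);
    return $names;
}

$builtins = builtin_function_names();
printf("%d builtin functions available in this build\n", count($builtins));

// preg_replace is a builtin function, so it appears in the list.
var_dump(in_array('preg_replace', $builtins, true));

// echo is a language construct, so PHP does not treat it as a function.
var_dump(function_exists('echo'));
```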

Debloating

Debloating is the idea of removing unused pieces of code from an application, that is, the parts that its users never exercise. The outcome of this process is a reduced attack surface, since any vulnerability that resided in the unused parts is removed along with them. In my previous research, we showed that many of the known vulnerabilities and CVEs indeed reside in the unused parts of applications. This means that, by identifying and removing these parts, one could mitigate roughly half of the known CVEs for a web application. Historic CVEs are used in this context to estimate the total number of vulnerabilities in an application. You can read more about this at https://debloating.com if you are interested.

Usage statistics of builtin functions

The plan is to use php-parser, a PHP parser written in PHP. It lets you define visitors that run over the AST (Abstract Syntax Tree). Given some PHP code, I define a visitor that looks for call nodes; the specific target node types are FuncCall, StaticCall and MethodCall. The AnalyzeBuiltinFunctionUsage class takes a directory path as input and iterates recursively over all files with the .php extension. It then parses every PHP file and traverses its syntax tree using the visitor defined below.

<?php
// ...
public function enterNode(Node $node) {
        $method_call = new Node\Expr\MethodCall(new Node\Expr\Exit_(), null);
        $func_call = new Node\Expr\FuncCall(null);
        $static_call = new Node\Expr\StaticCall(null, null);
        $node_name = null;
        switch ($node->getType()) {
            case $method_call->getType():
            case $func_call->getType():
                 // ...

This visitor runs enterNode() on every node in the syntax tree. On lines 9 and 10, I check whether the current node is a method call or a function call. Given the list of builtin functions, we can then count the number of calls to each of them. Notice that this is a static analysis, which means that dynamic function calls (where the function name is only known at run time) have to be ignored.
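For readers who want to try the idea without installing php-parser, the sketch below swaps in PHP's builtin tokenizer (token_get_all). It is a rough approximation, not the analysis used for the numbers in this post: it only sees unqualified names followed by an opening parenthesis, so it misses fully qualified calls such as \strlen() and, like the AST approach, it ignores dynamic calls. All function names in the sketch are hypothetical.

```php
<?php
// Hypothetical helper: yield the path of every *.php file below $root.
function php_files(string $root): Generator
{
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'php') {
            yield $file->getPathname();
        }
    }
}

// Sketch only: count direct calls to builtin functions in one source string,
// using the tokenizer instead of php-parser.
function count_builtin_calls(string $code): array
{
    $builtins = array_flip(get_defined_functions()['internal']);

    // Drop whitespace and comments so "foo (" and "foo(" look the same.
    $skip = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $tokens = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token) && in_array($token[0], $skip, true)) {
            continue;
        }
        $tokens[] = $token;
    }

    // A name preceded by one of these is a declaration, a method call,
    // a static call or an instantiation, not a plain function call.
    $not_a_call = [T_FUNCTION, T_OBJECT_OPERATOR, T_DOUBLE_COLON, T_NEW];
    if (defined('T_NULLSAFE_OBJECT_OPERATOR')) {
        $not_a_call[] = T_NULLSAFE_OBJECT_OPERATOR; // PHP 8+
    }

    $counts = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token) || $token[0] !== T_STRING) {
            continue;
        }
        if (($tokens[$i + 1] ?? null) !== '(') {
            continue;
        }
        $prev = $tokens[$i - 1] ?? null;
        if (is_array($prev) && in_array($prev[0], $not_a_call, true)) {
            continue;
        }
        $name = strtolower($token[1]);
        if (isset($builtins[$name])) {
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        }
    }
    arsort($counts);
    return $counts;
}

// Sum the per-file counts for every PHP file below $root.
function count_builtin_calls_in_tree(string $root): array
{
    $totals = [];
    foreach (php_files($root) as $path) {
        $source = (string) file_get_contents($path);
        foreach (count_builtin_calls($source) as $name => $n) {
            $totals[$name] = ($totals[$name] ?? 0) + $n;
        }
    }
    arsort($totals);
    return $totals;
}

$sample = <<<'CODE'
<?php
$len = strlen('abc') + strlen('de');
$out = preg_replace('/a/', 'b', 'banana');
$name = 'strtoupper';
echo $name('dynamic call, not counted');
CODE;

print_r(count_builtin_calls($sample));
```

The last call in the sample goes through a variable, so neither this sketch nor a visitor that only matches named FuncCall nodes can attribute it to strtoupper. To scan a whole application, pass its root directory to count_builtin_calls_in_tree().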

Now that we have the code to extract the number of calls to builtin functions, we can run it on popular PHP applications. For now, we focus on WordPress, Magento, MediaWiki and phpMyAdmin, targeting several versions of each application (similar to the versions used on https://debloating.com).

In the chart above, the blue bar indicates the total number of distinct builtin functions used across the various versions of the target web application. By emulating fewer than 900 builtin functions (out of ~11,000, including the ones from extensions), we can perform a sound static analysis on these popular PHP applications.

The orangish bar shows the number of builtin functions still used after function debloating (i.e., after removing unused functions from the applications). The yellow bar indicates builtin functions with more than 50 call sites. These are the most common and important builtin functions within these applications, and a good place to start when emulating builtin functions.

Effect of debloating on sensitive function calls

Among the builtin functions, some are more security sensitive than others. Take eval() as an example: it takes a string as input and executes it as PHP code. If an attacker-controlled value reaches the call to eval, the attacker is essentially able to execute arbitrary code on the target web server.
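A small, deliberately unsafe sketch shows why eval() is such a sensitive sink. The function unsafe_calculate is hypothetical and exists only to illustrate the pattern; php_uname() stands in for a far more harmful payload.

```php
<?php
// Deliberately unsafe, hypothetical helper: it pastes caller-supplied text
// into a string that eval() then executes as PHP code.
function unsafe_calculate(string $expression)
{
    return eval('return ' . $expression . ';');
}

// Intended use: simple arithmetic.
var_dump(unsafe_calculate('2 + 3'));

// The same sink also runs any other PHP expression the caller sends,
// here a call that reveals the operating system name.
var_dump(unsafe_calculate('php_uname("s")'));
```

If $expression comes from a request parameter, whoever sends the request decides which code runs on the server.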

Sensitive builtin functions can be categorized as follows [Reference]:

  • Command Execution: These functions will run a command directly on the target operating system. Examples include exec, system and shell_exec.
  • PHP Code Execution: These functions execute arbitrary PHP code. eval, assert (yes, assert!) and include are a few examples.
  • Functions with Callbacks: For these functions, if the attacker gets to control the callback parameter, they will be able to call arbitrary PHP functions and divert the control flow. For instance, call_user_func and register_shutdown_function will take the callback name as their first parameter.
  • Information Disclosure: Functions that disclose sensitive information about the execution environment (e.g., Apache version) are in this category. phpinfo is the well-known example.
  • Other: Any other builtin function that can be security sensitive in other contexts goes here. Examples include extract, parse_str, mail and header.
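The categories above can be expressed as a simple lookup table. The sketch below uses only the example functions named in the list, so it is far from complete, and the helper names are hypothetical. Note that eval and include are language constructs, so php-parser reports them as their own node types (Eval_ and Include_) rather than as FuncCall nodes; a real counter would have to handle them separately.

```php
<?php
// Category map built only from the examples named in the list above.
// A real analysis would need a much longer list per category.
const SENSITIVE_FUNCTIONS = [
    'Command Execution'        => ['exec', 'system', 'shell_exec'],
    'PHP Code Execution'       => ['eval', 'assert', 'include'],
    'Functions with Callbacks' => ['call_user_func', 'register_shutdown_function'],
    'Information Disclosure'   => ['phpinfo'],
    'Other'                    => ['extract', 'parse_str', 'mail', 'header'],
];

// Hypothetical helper: return the category of a function name, or null.
function sensitive_category(string $name): ?string
{
    $name = strtolower($name);
    foreach (SENSITIVE_FUNCTIONS as $category => $functions) {
        if (in_array($name, $functions, true)) {
            return $category;
        }
    }
    return null;
}

// Hypothetical helper: sum per-function call counts into per-category totals.
function calls_per_category(array $call_counts): array
{
    $totals = [];
    foreach ($call_counts as $name => $count) {
        $category = sensitive_category((string) $name);
        if ($category !== null) {
            $totals[$category] = ($totals[$category] ?? 0) + $count;
        }
    }
    return $totals;
}

// Made-up counts, for illustration only.
print_r(calls_per_category(['exec' => 2, 'system' => 1, 'strlen' => 40, 'mail' => 3]));
```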

Calls to these functions are perfectly natural in any web application. Still, the total number of such calls can serve as a rough estimate of the potentially vulnerable points in these applications. Taint analysis tools usually focus on a similar list of sensitive APIs and check whether user-controlled input can reach these sinks.

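The numbers below are the sums of all call sites across the different versions of each application. Each cell lists the number of call sites before debloating, the number after, and the reduction in parentheses. As a result, compare the reduction ratios between applications rather than the raw numbers.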

Application | Command Execution | PHP Execution | Callbacks | Information Disclosure | Filesystem | Other
WordPress | 50 → 14 (72%▼) | 3156 → 2495 (20%▼) | 4678 → 4032 (14%▼) | 107 → 68 (36%▼) | 3132 → 1666 (47%▼) | 1587 → 1238 (22%▼)
Magento | 171 → 0 (100%▼) | 776 → 100 (87%▼) | 1805 → 244 (86%▼) | 264 → 6 (98%▼) | 2899 → 153 (95%▼) | 737 → 61 (92%▼)
MediaWiki | 115 → 23 (80%▼) | 1802 → 639 (65%▼) | 2792 → 952 (66%▼) | 185 → 36 (81%▼) | 1779 → 298 (83%▼) | 1212 → 443 (63%▼)
phpMyAdmin | 76 → 0 (100%▼) | 976 → 242 (75%▼) | 970 → 182 (81%▼) | 133 → 16 (88%▼) | 1304 → 253 (81%▼) | 892 → 314 (65%▼)
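The percentages in parentheses are the share of call sites removed by debloating. A minimal sketch of that arithmetic follows; the function name is hypothetical, and rounding to the nearest integer is my assumption, since the rounding rule behind the table is not stated and some cells may differ by a point.

```php
<?php
// Hypothetical helper: percentage of call sites removed by debloating,
// rounded to the nearest integer (an assumption, see the note above).
function reduction_percent(int $before, int $after): int
{
    if ($before <= 0) {
        throw new InvalidArgumentException('$before must be positive');
    }
    return (int) round(($before - $after) / $before * 100);
}

// WordPress, Command Execution: 50 call sites before, 14 after.
echo reduction_percent(50, 14), "%\n";

// Magento, Command Execution: 171 call sites before, 0 after.
echo reduction_percent(171, 0), "%\n";
```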

 

Overall, debloating proves successful in removing the majority of calls to security-sensitive builtin PHP functions. I have also uploaded the number of calls to each builtin function to the GitHub repo, under the results directory.

Finally, if you are planning to emulate PHP builtin functions for the purpose of static PHP code analysis, you know where to start.