(Do not be afraid of)
PHP Compiler Internals
Sebastian Bergmann
Who I Am
Sebastian Bergmann
Involved in the PHP project since 2000
Creator of PHPUnit
Co-Founder and Principal Consultant with thePHP.cc
Under PHP's Hood
Server API (SAPI)
(mod_php, FastCGI, CLI, ...)
PHP Core
Request Management
File and Network Operations
Extensions
(date, dom, gd, json, mysql, pcre, pdo, reflection, session, standard, …)
Zend Engine
Compilation and Execution
Memory and Resource Allocation
How PHP executes code
Lexical Analysis
Converts the source from a sequence of characters into a sequence of tokens
Syntax Analysis
Analyzes a sequence of tokens to determine their grammatical structure
Bytecode Generation
Generates bytecode based on the information gathered by analyzing the source code
Bytecode Execution
Lexical Analysis
Scan a sequence of characters
<?php
if (TRUE) {
    print '*';
}
?>
T_OPEN_TAG
T_IF T_WHITESPACE ( T_STRING ) T_WHITESPACE { T_WHITESPACE
T_PRINT T_WHITESPACE T_CONSTANT_ENCAPSED_STRING ; T_WHITESPACE
} T_WHITESPACE
T_CLOSE_TAG
Lexical Analysis
You do not want to write a scanner by hand
At least when the code for the scanner should be efficient and maintainable
Tools such as flex or re2c generate the code for a scanner from a set of rules
Scanner Generators
"if" {
    return T_IF;
}
<ST_IN_SCRIPTING>"if" {
    return T_IF;
}
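To see what a scanner generator saves you from writing, here is a minimal hand-written sketch of the same idea in C. It is not the code that flex or re2c would emit: the token ids (TOK_IF, TOK_STRING, TOK_WHITESPACE) and the function next_token() are hypothetical names chosen for this illustration, and keywords are matched case-sensitively for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token ids for this sketch; the real ones are generated
   from the parser grammar. */
enum { TOK_EOF = 0, TOK_IF = 256, TOK_STRING, TOK_WHITESPACE };

/* Scan one token starting at *src, advance *src past it and return its id.
   Single-character tokens such as '(' are returned as their character code. */
int next_token(const char **src)
{
    const char *p;

    assert(src != NULL && *src != NULL);
    p = *src;

    if (*p == '\0') {
        return TOK_EOF;
    }
    if (isspace((unsigned char)*p)) {
        /* A run of whitespace becomes one token. */
        while (isspace((unsigned char)*p)) {
            p++;
        }
        *src = p;
        return TOK_WHITESPACE;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {
        const char *start = p;
        while (isalnum((unsigned char)*p) || *p == '_') {
            p++;
        }
        *src = p;
        /* Hand-written counterpart of the rule: "if" { return T_IF; } */
        if (p - start == 2 && strncmp(start, "if", 2) == 0) {
            return TOK_IF;
        }
        return TOK_STRING;
    }
    *src = p + 1;
    return (unsigned char)*p;
}
```

Every additional keyword, operator, or scanner state would need more branches like these, which is the maintenance burden the rule-based approach avoids.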
Lexical Analysis
PHP Tokens
T_ABSTRACT T_AND_EQUAL T_ARRAY T_ARRAY_CAST T_AS T_BAD_CHARACTER T_BOOLEAN_AND T_BOOLEAN_OR T_BOOL_CAST T_BREAK T_CASE T_CATCH T_CHARACTER T_CLASS T_CLASS_C T_CLONE T_CLOSE_TAG T_COMMENT T_CONCAT_EQUAL T_CONST T_CONSTANT_ENCAPSED_STRING T_CONTINUE T_CURLY_OPEN T_DEC T_DECLARE T_DEFAULT T_DIR T_DIV_EQUAL T_DNUMBER T_DOC_COMMENT T_DO T_DOLLAR_OPEN_CURLY_BRACES T_DOUBLE_ARROW T_DOUBLE_CAST T_DOUBLE_COLON T_ECHO T_ELSE T_ELSEIF T_EMPTY T_ENCAPSED_AND_WHITESPACE T_ENDDECLARE T_ENDFOR T_ENDFOREACH T_ENDIF T_ENDSWITCH T_ENDWHILE T_END_HEREDOC T_EVAL T_EXIT T_EXTENDS T_FILE T_FINAL T_FOR T_FOREACH T_FUNCTION T_FUNC_C T_GLOBAL T_GOTO T_HALT_COMPILER T_IF T_IMPLEMENTS T_INC T_INCLUDE T_INCLUDE_ONCE T_INLINE_HTML T_INSTANCEOF T_INT_CAST T_INTERFACE T_ISSET T_IS_EQUAL T_IS_GREATER_OR_EQUAL T_IS_IDENTICAL
T_IS_NOT_EQUAL T_IS_NOT_IDENTICAL T_IS_SMALLER_OR_EQUAL T_LINE T_LIST T_LNUMBER T_LOGICAL_AND T_LOGICAL_OR T_LOGICAL_XOR T_METHOD_C T_MINUS_EQUAL T_ML_COMMENT T_MOD_EQUAL T_MUL_EQUAL T_NAMESPACE T_NS_C T_NEW T_NUM_STRING T_OBJECT_CAST T_OBJECT_OPERATOR T_OLD_FUNCTION T_OPEN_TAG T_OPEN_TAG_WITH_ECHO T_OR_EQUAL T_PAAMAYIM_NEKUDOTAYIM T_PLUS_EQUAL T_PRINT T_PRIVATE T_PUBLIC T_PROTECTED T_REQUIRE T_REQUIRE_ONCE T_RETURN T_SL T_SL_EQUAL T_SR T_SR_EQUAL T_START_HEREDOC T_STATIC T_STRING T_STRING_CAST T_STRING_VARNAME T_SWITCH T_THROW T_TRY T_UNSET T_UNSET_CAST T_USE T_VAR T_VARIABLE T_WHILE T_WHITESPACE T_XOR_EQUAL
Syntax Analysis
You do not want to write a parser by hand
At least when the code for the parser should be efficient and maintainable
Tools such as bison or lemon generate the code for a parser from a set of rules
Parser Generators
T_IF '(' expr ')' { ... } statement { ... }
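The rule above can be read as a recipe: expect T_IF, an opening parenthesis, an expression, a closing parenthesis, and then a statement. The following sketch spells that recipe out by hand as a tiny recursive-descent recognizer in C. It is only an illustration under simplifying assumptions: bison generates a table-driven parser rather than code like this, expr is reduced to a single T_STRING, whitespace tokens are assumed to have been dropped already, and all names (parser, match, parse_program, the TOK_ ids) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token ids for this sketch. */
enum { TOK_EOF = 0, TOK_IF = 256, TOK_STRING };

/* Cursor over a TOK_EOF-terminated token array. */
typedef struct {
    const int *tokens;
    size_t pos;
} parser;

/* Consume the current token if it is the expected one. */
static int match(parser *p, int tok)
{
    if (p->tokens[p->pos] == tok) {
        p->pos++;
        return 1;
    }
    return 0;
}

/* expr: simplified to a single T_STRING such as TRUE */
static int parse_expr(parser *p)
{
    return match(p, TOK_STRING);
}

/* statement: T_IF '(' expr ')' statement
            | '{' statement* '}'
            | expr ';'                     */
static int parse_statement(parser *p)
{
    if (match(p, TOK_IF)) {
        return match(p, '(') && parse_expr(p) && match(p, ')')
            && parse_statement(p);
    }
    if (match(p, '{')) {
        while (p->tokens[p->pos] != '}' && p->tokens[p->pos] != TOK_EOF) {
            if (!parse_statement(p)) {
                return 0;
            }
        }
        return match(p, '}');
    }
    return parse_expr(p) && match(p, ';');
}

/* Return 1 if the whole token array forms one valid statement, else 0. */
int parse_program(const int *tokens)
{
    parser p = { tokens, 0 };

    assert(tokens != NULL);
    return parse_statement(&p) && p.tokens[p.pos] == TOK_EOF;
}
```

In the real grammar the `{ ... }` blocks between the symbols are semantic actions: C code that runs when that part of the rule has been matched. This sketch only accepts or rejects the input.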
PHP Bytecode
Disassembling with vld
<?php
if (TRUE) {
    print '*';
}
?>
sb@thinkpad ~ % php -dextension=vld.so -dvld.active=1 -dvld.execute=0 if.php
filename:       /home/sb/if.php
function name:  (null)
number of ops:  8
compiled vars:  none
line  #  op        fetch  ext  return  operands
   2  0  EXT_STMT
      1  JMPZ                          true, ->6
   3  2  EXT_STMT
      3  PRINT                 ~0      '%2A'
      4  FREE                          ~0
   4  5  JMP                           ->6
   6  6  EXT_STMT
      7  RETURN                        1
PHP Bytecode
<?php
if (TRUE) {
    print '*';
}
?>
sb@thinkpad ~ % bytekit if.php
bytekit-cli 1.0.0 by Sebastian Bergmann.
Filename: /home/sb/if.php
Function: main
Number of oplines: 8
line  #  opcode    result  operands
   2  0  EXT_STMT
      1  JMPZ              true, ->6
   3  2  EXT_STMT
      3  PRINT     ~0      '*'
      4  FREE              ~0
   4  5  JMP               ->6
   6  6  EXT_STMT
      7  RETURN            1
PHP Bytecode
Bytecode visualization with bytekit-cli
(Figure: bytekit-cli visualization of the bytecode for if.php; image not preserved)
PHP Bytecode
<?php
$a = 1;
$b = 2;
print $a + $b;
?>
sb@thinkpad ~ % bytekit add.php
bytekit-cli 1.0.0 by Sebastian Bergmann.
Filename: /home/sb/add.php
Function: main
Number of oplines: 10
Compiled variables: !0 = $a, !1 = $b
line  #  opcode    result  operands
   2  0  EXT_STMT
      1  ASSIGN            !0, 1
   3  2  EXT_STMT
      3  ASSIGN            !1, 2
   4  4  EXT_STMT
      5  ADD       ~2      !0, !1
      6  PRINT     ~3      ~2
      7  FREE              ~3
   6  8  EXT_STMT
      9  RETURN            1
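To make the add.php listing above less abstract, here is a toy interpreter loop in C that runs a simplified version of it. This is a sketch and not how the Zend Engine is implemented: the opline layout, the opcode names, and the single slots array that stands in for both compiled variables (!0, !1) and temporaries (~2, ~3) are assumptions made for this example, and EXT_STMT and FREE are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified opcodes and oplines. */
typedef enum { OP_ASSIGN_CONST, OP_ADD, OP_PRINT, OP_RETURN } opcode;

typedef struct {
    opcode op;
    int result; /* slot that receives the result, -1 if unused */
    int op1;    /* slot index (the target slot for OP_ASSIGN_CONST) */
    int op2;    /* slot index (the constant for OP_ASSIGN_CONST) */
} opline;

/* The add.php listing without EXT_STMT and FREE. */
const opline add_program[] = {
    { OP_ASSIGN_CONST, -1, 0, 1 },  /* ASSIGN !0, 1      */
    { OP_ASSIGN_CONST, -1, 1, 2 },  /* ASSIGN !1, 2      */
    { OP_ADD,           2, 0, 1 },  /* ADD ~2 !0, !1     */
    { OP_PRINT,         3, 2, -1 }, /* PRINT ~3 ~2       */
    { OP_RETURN,       -1, -1, -1 } /* RETURN            */
};

/* Execute oplines until OP_RETURN. Each printed value is stored in out[]
   (up to out_cap entries); the number of stored values is returned. */
size_t execute(const opline *ops, long *slots, long *out, size_t out_cap)
{
    const opline *o;
    size_t printed = 0;

    assert(ops != NULL && slots != NULL && out != NULL);
    for (o = ops; o->op != OP_RETURN; o++) {
        switch (o->op) {
        case OP_ASSIGN_CONST:
            slots[o->op1] = o->op2;
            break;
        case OP_ADD:
            slots[o->result] = slots[o->op1] + slots[o->op2];
            break;
        case OP_PRINT:
            if (printed < out_cap) {
                out[printed++] = slots[o->op1];
            }
            slots[o->result] = 1; /* print always yields 1 in PHP */
            break;
        case OP_RETURN:
            break;
        }
    }
    return printed;
}
```

Running add_program with four slots records the value 3 as printed and leaves 1 in slot 3. That is the result of print, which the real bytecode discards with FREE.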
PHP Bytecode
List of Opcodes
NOP ADD SUB MUL DIV MOD SL SR CONCAT BW_OR BW_AND BW_XOR BW_NOT BOOL_NOT BOOL_XOR IS_IDENTICAL IS_NOT_IDENTICAL IS_EQUAL IS_NOT_EQUAL IS_SMALLER IS_SMALLER_OR_EQUAL CAST QM_ASSIGN ASSIGN_ADD ASSIGN_SUB ASSIGN_MUL ASSIGN_DIV ASSIGN_MOD ASSIGN_SL ASSIGN_SR ASSIGN_CONCAT ASSIGN_BW_OR ASSIGN_BW_AND ASSIGN_BW_XOR PRE_INC PRE_DEC POST_INC POST_DEC ASSIGN ASSIGN_REF ECHO PRINT JMPZ JMPNZ JMPZNZ JMPZ_EX JMPNZ_EX CASE SWITCH_FREE BRK BOOL INIT_STRING ADD_CHAR ADD_STRING ADD_VAR BEGIN_SILENCE END_SILENCE INIT_FCALL_BY_NAME DO_FCALL DO_FCALL_BY_NAME RETURN RECV RECV_INIT SEND_VAL SEND_VAR SEND_REF NEW FREE INIT_ARRAY ADD_ARRAY_ELEMENT INCLUDE_OR_EVAL UNSET_VAR UNSET_DIM UNSET_OBJ FE_RESET FE_FETCH EXIT FETCH_R FETCH_DIM_R FETCH_OBJ_R FETCH_W FETCH_DIM_W FETCH_OBJ_W FETCH_RW FETCH_DIM_RW FETCH_OBJ_RW FETCH_IS FETCH_DIM_IS FETCH_OBJ_IS FETCH_FUNC_ARG
FETCH_DIM_FUNC_ARG FETCH_OBJ_FUNC_ARG FETCH_UNSET FETCH_DIM_UNSET FETCH_OBJ_UNSET FETCH_DIM_TMP_VAR FETCH_CONSTANT EXT_STMT EXT_FCALL_BEGIN EXT_FCALL_END EXT_NOP TICKS SEND_VAR_NO_REF CATCH THROW FETCH_CLASS CLONE INIT_METHOD_CALL INIT_STATIC_METHOD_CALL ISSET_ISEMPTY_VAR ISSET_ISEMPTY_DIM_OBJ PRE_INC_OBJ PRE_DEC_OBJ POST_INC_OBJ POST_DEC_OBJ ASSIGN_OBJ INSTANCEOF DECLARE_CLASS DECLARE_INHERITED_CLASS DECLARE_FUNCTION RAISE_ABSTRACT_ERROR ADD_INTERFACE VERIFY_ABSTRACT_CLASS ASSIGN_DIM ISSET_ISEMPTY_PROP_OBJ HANDLE_EXCEPTION
Test First!
--TEST--
unless statement
--FILE--
<?php
unless (FALSE) {
    print 'unless FALSE is TRUE, this is printed';
}
unless (TRUE) {
    print 'unless TRUE is TRUE, this is printed';
}
?>
--EXPECT--
unless FALSE is TRUE, this is printed
Extending the Compiler
Add token for unless to the scanner
Add rule for unless to the parser
Generate bytecode for unless in the compiler
Add token for unless to ext/tokenizer
Add unless scanner token
<ST_IN_SCRIPTING>"if" {
    return T_IF;
}
<ST_IN_SCRIPTING>"unless" {
    return T_UNLESS;
}
<ST_IN_SCRIPTING>"elseif" {
    return T_ELSEIF;
}
<ST_IN_SCRIPTING>"endif" {
    return T_ENDIF;
}
<ST_IN_SCRIPTING>"else" {
    return T_ELSE;
}
Zend/zend_language_scanner.l
Add unless parser rule
%token T_NAMESPACE
%token T_NS_C
%token T_DIR
%token T_NS_SEPARATOR
%token T_UNLESS
.
.
unticked_statement:
        '{' inner_statement_list '}'
    |   T_IF '(' expr ')' {
.
.
    |   T_UNLESS '(' expr ')' { zend_do_unless_cond(&$3, &$4 TSRMLS_CC); }
        statement { zend_do_if_after_statement(&$4, 1 TSRMLS_CC); }
        { zend_do_if_end(TSRMLS_C); }
.
.
Zend/zend_language_parser.y
How if is compiled
void zend_do_if_cond(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
}
zend_do_if_cond() is called when an if statement is compiled
Zend/zend_compile.c
typedef struct _znode {
    int op_type;
    union {
        zval constant;
        zend_uint var;
        zend_uint opline_num;
        zend_op_array *op_array;
        zend_op *jmp_addr;
        struct {
            zend_uint var;
            zend_uint type;
        } EA;
    } u;
} znode;
How if is compiled
void zend_do_if_cond(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number = get_next_op_number(CG(active_op_array));
    zend_op *opline = get_next_op(CG(active_op_array) TSRMLS_CC);
}
Allocate a new opline in the current oparray
Zend/zend_compile.c
struct _zend_op {
    opcode_handler_t handler;
    znode result;
    znode op1;
    znode op2;
    ulong extended_value;
    uint lineno;
    zend_uchar opcode;
};
How if is compiled
void zend_do_if_cond(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number = get_next_op_number(CG(active_op_array));
    zend_op *opline = get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPZ;
}
Set the opcode of the new opline to JMPZ (jump if zero)
How if is compiled
void zend_do_if_cond(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number = get_next_op_number(CG(active_op_array));
    zend_op *opline = get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPZ;
    opline->op1 = *cond;
}
Set the first operand of the new opline to the if condition
How if is compiled
void zend_do_if_cond(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int if_cond_op_number = get_next_op_number(CG(active_op_array));
    zend_op *opline = get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPZ;
    opline->op1 = *cond;
    closing_bracket_token->u.opline_num = if_cond_op_number;
    SET_UNUSED(opline->op2);
    INC_BPC(CG(active_op_array));
}
Perform bookkeeping tasks such as marking the second operand of the new opline as unused or incrementing the backpatching counter for the current oparray
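Why remember if_cond_op_number in closing_bracket_token? When the JMPZ opline is created, the compiler does not yet know where the jump has to go, because the body of the if statement has not been compiled. The target is filled in later, a technique known as backpatching. The following C sketch shows the idea on a toy op array. The types and functions (op, op_array, emit, begin_if, end_if_body) are hypothetical and much simpler than the Zend Engine's; only the general technique is illustrated.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_JMPZ, OP_JMP, OP_PRINT, OP_RETURN };

/* target is the jump destination, -1 while it is still unknown. */
typedef struct {
    int opcode;
    int target;
} op;

typedef struct {
    op ops[16];
    int count;
} op_array;

/* Append a new opline and return its number. */
int emit(op_array *a, int opcode)
{
    assert(a != NULL && a->count < 16);
    a->ops[a->count].opcode = opcode;
    a->ops[a->count].target = -1;
    return a->count++;
}

/* After the condition: emit the conditional jump with an unknown target
   and hand its opline number back to the caller. */
int begin_if(op_array *a)
{
    return emit(a, OP_JMPZ);
}

/* After the body: emit the JMP that ends the branch, then backpatch the
   conditional jump so that it skips the body. There is no else branch in
   this sketch, so both jumps go to the opline after the JMP. */
void end_if_body(op_array *a, int cond_op_number)
{
    int jmp = emit(a, OP_JMP);

    a->ops[cond_op_number].target = jmp + 1;
    a->ops[jmp].target = jmp + 1;
}
```

Emitting JMPZ, PRINT, JMP, and RETURN in that order yields oplines 0 to 3, with both jumps pointing at opline 3. This is the same shape as the `JMPZ true, ->6` and `JMP ->6` pair in the if.php listing.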
Add unless to compiler
void zend_do_unless_cond(const znode *cond, znode *closing_bracket_token TSRMLS_DC)
{
    int unless_cond_op_number = get_next_op_number(CG(active_op_array));
    zend_op *opline = get_next_op(CG(active_op_array) TSRMLS_CC);
    opline->opcode = ZEND_JMPNZ;
    opline->op1 = *cond;
    closing_bracket_token->u.opline_num = unless_cond_op_number;
    SET_UNUSED(opline->op2);
    INC_BPC(CG(active_op_array));
}
All we have to do to generate code for the unless statement, compared to generating code for the if statement, is to use the JMPNZ (jump if not zero) opcode instead of the JMPZ (jump if zero) opcode
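The effect of swapping the opcode can be checked with a few lines of C. In both statements the conditional jump skips the body, so the body runs exactly when the jump is not taken. The function below is a hypothetical model of that decision, not engine code.

```c
#include <assert.h>

enum { OP_JMPZ, OP_JMPNZ };

/* Return 1 if the statement body is executed for the given condition
   value, 0 if the conditional jump skips it. */
int body_is_executed(int opcode, int cond)
{
    int jump_taken;

    assert(opcode == OP_JMPZ || opcode == OP_JMPNZ);
    /* JMPZ jumps when the condition is zero, JMPNZ when it is not. */
    jump_taken = (opcode == OP_JMPZ) ? (cond == 0) : (cond != 0);
    return !jump_taken;
}
```

With OP_JMPZ the body runs for a true condition (if); with OP_JMPNZ it runs for a false one (unless).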
Add unless to compiler
The generated bytecode
<?php
unless (FALSE) {
    print '*';
}
?>
sb@thinkpad ~ % bytekit unless.php
bytekit-cli 1.0.0 by Sebastian Bergmann.
Filename: /home/sb/unless.php
Function: main
Number of oplines: 8
line  #  opcode    result  operands
   2  0  EXT_STMT
      1  JMPNZ             true, ->6
   3  2  EXT_STMT
      3  PRINT     ~0      '*'
      4  FREE              ~0
   4  5  JMP               ->6
   6  6  EXT_STMT
      7  RETURN            1
Run the test
sb@thinkpad php-5.3-unless % make test TESTS=Zend/tests/unless.phpt
Build complete.
Don't forget to run 'make test'.
=====================================================================
PHP         : /usr/local/src/php/php-5.3-unless/sapi/cli/php
PHP_SAPI    : cli
PHP_VERSION : 5.3.0RC3-dev
ZEND_VERSION: 2.3.0
PHP_OS      : Linux 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686 GNU/Linux
INI actual  : /usr/local/src/php/php-5.3-unless/tmp-php.ini
More .INIs  :
CWD         : /usr/local/src/php/php-5.3-unless
Extra dirs  :
VALGRIND    : Not used
=====================================================================
Running selected tests.
PASS unless statement [Zend/tests/unless.phpt]
=====================================================================
Number of tests :    1                 1
Tests skipped   :    0 (  0.0%) --------
Tests warned    :    0 (  0.0%) (  0.0%)
Tests failed    :    0 (  0.0%) (  0.0%)
Expected fail   :    0 (  0.0%) (  0.0%)
Tests passed    :    1 (100.0%) (100.0%)
---------------------------------------------------------------------
Time taken      :    0 seconds
Add unless to ext/tokenizer
sb@thinkpad tokenizer % ./tokenizer_data_gen.sh
Wrote tokenizer_data.c
ext/tokenizer/tokenizer_data.c
The End
Thank you for your interest!
These slides will be linked soon from
http://sebastian-bergmann.de/
You can vote for this talk on
http://joind.in/582
Acknowledgements
Thomas Lee, whose Python Language Internals presentation at OSDC 2008 inspired this presentation
Stefan Esser for creating the Bytekit extension that provides PHP bytecode access and analysis features
Derick Rethans, David Soria Parra, and Scott MacVicar for reviewing
References
http://www.php.net/manual/en/tokens.php
http://www.zapt.info/opcodes.html
Sara Golemon: “Extending and Embedding PHP”
http://derickrethans.nl/vld.php
http://bytekit.org/
This presentation material is published under the Attribution-Share Alike 3.0 Unported license.
You are free:
✔ to Share – to copy, distribute and transmit the work.
✔ to Remix – to adapt the work.
Under the following conditions:
● Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
● Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
For any reuse or distribution, you must make clear to others the license terms of this work.
Any of the above conditions can be waived if you get permission from the copyright holder.