Lexical analysis and parsing are prerequisite for any language. There are many tools available in the industry which can help in achieving this goal.

To give you an idea, following is a listing of some of the available lexer and parser generator.

Lexical analyzer generators:-

  • ANTLR – Can generate lexical analyzers and parsers.
  • Flex – Alternative variant of the classic “lex” (C/C++).
  • JFlex – A rewrite of JLex.
  • Ragel – A state machine and lexical scanner generator with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code.

Parser generators:-

  • ANTLR
  • Happy for Haskell
  • CUP for Java
  • Bison for C and C++
  • (YACC = Yet Another Compiler Compiler – the classic for C)

 

In this post, I will be discussing a brief about the major functions, classes and concepts used during MySQL query parsing.

Lexer used in MySQL

On the contrary, MySQL uses a hand-written lexer for lexical analysis. The benefit being most of the code can be optimized keeping SQL syntax in mind.

(Please refer to the files sql_lex.h and sql_lex.cc)

Some key observations:-

  • MYSQLlex() is the main lex function, called by the parser (yyparse()).
  • lex_one_token() is the key function used for tokenizing.
  • lex_start() and lex_end() are the functions which are called before and after every input query to be parsed.
  • All lexer’s terminals (MySQL tokens) are defined in the file lex.h, mainly divided into two groups:-
    • symbols (these are SQL language keywords)
    • sql_functions (these are SQL language functions)
  • Digital searching algorithm is used for searching the terminals. (“The Art of Computer Programming” by Donald E. Knuth Volume 3 “Sorting and searching” chapter 6.3). The code for this can be found in the file gen_lex_hash.cc and lex_hash.h (generated from gen_lex_hash file). This calls for further reading!!

The following classes gives an outline of the lex implementation in MySQL. (along with short description, pulled from the comments in the source code.)

  1. class st_select_lex_node

This is the base class for st_select_lex and st_select_lex_unit.

  1. class st_select_lex_unit: public st_select_lex_node

st_select_lex_unit is container of either

– One SELECT

– UNION of selects

  1. class st_select_lex: public st_select_lex_node

Store information of parsed SELECT statement (without union)

st_select_lex and st_select_lex_unit are both inherited form select_lex_node

  1. class Lex_input_stream

This class represents the character input stream consumed during lexical analysis.

In addition to consuming the input stream, this class performs some comment pre processing, by filtering out out-of-bound special text from the query input stream.

Two buffers, with pointers inside each buffers, are maintained in parallel. The ‘raw’ buffer is the original query text, which may contain out-of-bound comments. The ‘cpp’ (for comments pre processor) is the pre-processed buffer that contains only the query text that should be seen once out-of-bound data is removed.

  1. class Query_tables_list

Class representing list of all tables used by statement and other information which is necessary for opening and locking its tables, like SQL command for this statement.

Also contains information about stored functions used by statement since during its execution we may have to add all tables used by its stored functions/triggers to this list in order to pre-open and lock them.

Also used by LEX::reset_n_backup/restore_backup_query_tables_list() methods to save and restore this informatio.

  1. struct LEX: public Query_tables_list

The state of the lex parsing. This is saved in the THD struct

  1. class Yacc_state

The internal state of the syntax parser. This object is only available during parsing, and is private to the syntax parser implementation (sql_yacc.yy).

  1. class Parser_state

Internal state of the parser.

The complete state consist of:

– state data used during lexical parsing,

– state data used during syntactic parsing.

 

Parser used in MySQL

MySQL uses “bison” tool as it’s parser generator.

The parser contains the standard set of functions generated by bison tool.

All the grammar rules being defined in the file sql_yacc.yy

sql_yacc.cc and sql_yacc.h are the generated output file from bison.

 

Note: Since bison gives a flexibility to override the name of functions and variables used in the parser, in MySQL code it replaces “yy” with “MYSQL”.In the Makefile, bison is called with the argument “-p”  (“/usr/bin/bison -y -p MYSQL“).Please go through the man page of bison for more details.The idea behind providing such flexibility was to allow multiple parsers in the same application
Note: So, the parser function is not yyparse(), it is MYSQLparse(). The same being called from the function parse_sql(), in file sql_parser.cc. Macros are internally used in the generated code to achieve this goal.A code snippet from sql_yacc.cc:#define yyparse MYSQLparse#define yylex MYSQLlex

#define yyerror MYSQLerror

#define yylval MYSQLlval

#define yychar MYSQLchar

#define yydebug MYSQLdebug

#define yynerrs MYSQLnerrs
The same idea is used to merge the custom lex code with the bison parser. Since the bison parser only knows yylex() as the lexer function. But this is overridden by the macro to match the hand written lexer: MYSQLlex().

References: