While digging through Apache Spark SQL’s source code, I realized that reading it closely is a must for anyone interested in the implementation details of any SQL-based data processing engine. Spark SQL has three major API modules: the Datasource API, the Catalyst API and the Catalog API. This series of posts concentrates mostly on Catalyst, walking through the query processing stages (parsing, binding, validation, optimization, code generation, query execution) to highlight the entry points for further exploration.

PREREQUISITE: Overview of Spark Catalyst. You can start with this link.

NOTE: Code snippets in these posts refer to Spark SQL 2.0.1.

As we know, since 2.0 SparkSession is the new entry point for the DataFrame and Dataset APIs. Every SparkSession is associated with a SessionState, which holds all of the session-scoped state (the old SQLContext remains only as a thin backward-compatible wrapper around SparkSession).

Every SessionState has instances of SQLConf, SessionCatalog, Analyzer, SparkOptimizer, SparkSqlParser, SparkPlanner and many more.
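
SessionState itself is marked private[sql], but several of its pieces surface through public handles on SparkSession. A minimal sketch (the app name and config value here are arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("session-state-demo")   // arbitrary name, just for the demo
  .getOrCreate()

// RuntimeConfig is backed by the session's SQLConf
spark.conf.set("spark.sql.shuffle.partitions", "8")

// The public Catalog API is backed by the SessionCatalog
spark.catalog.listTables().show()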


Parsing:

Spark SQL uses ANTLR4 to parse SQL statements into a parse tree (see SqlBaseParser, whose rules are generated from SqlBase.g4). The SparkSqlParser that SessionState holds as its sqlParser extends AbstractSqlParser and uses a subclass of AstBuilder, SparkSqlAstBuilder, to transform that tree into a LogicalPlan, Expression, TableIdentifier or DataType.
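
If you want to poke at the parser directly, CatalystSqlParser (the configuration-free sibling of SparkSqlParser, also an AbstractSqlParser) exposes the same parse methods. A quick sketch; the table and column names are made up:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// A full statement becomes an (as yet unresolved) LogicalPlan
val plan = CatalystSqlParser.parsePlan("SELECT name FROM sometable WHERE age > 15")
println(plan.treeString)

// A standalone predicate becomes a Catalyst Expression
val expr = CatalystSqlParser.parseExpression("age > 15")

// A (possibly qualified) name becomes a TableIdentifier
val ident = CatalystSqlParser.parseTableIdentifier("somedb.sometable")

// A DDL-style type string becomes a Catalyst DataType
val dt = CatalystSqlParser.parseDataType("struct<name:string,age:int>")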


Let’s take a command that you may have executed on Spark:

sparkSession.sql("select * from sometable")

SparkSession calls parsePlan (shown below) to get the LogicalPlan while creating the DataFrame.
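
For reference, sql() boils down to a one-liner; the snippet below is a trimmed paraphrase of the Spark 2.0.1 source, not standalone code:

// Paraphrased from org.apache.spark.sql.SparkSession (2.0.1):
// parse the text into a LogicalPlan, then wrap it in a DataFrame.
def sql(sqlText: String): DataFrame = {
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}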


Also, for something like val filteredDs = ds.filter("age > 15"), the parser’s parseExpression is called to get the equivalent Catalyst Expression, which is used to construct the resulting DataFrame, i.e. Dataset[Row].
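
The String overload of filter delegates to the same parser. Again a trimmed paraphrase of the 2.0.1 source, not standalone code:

// Paraphrased from org.apache.spark.sql.Dataset (2.0.1):
// the condition string is parsed into a Catalyst Expression and wrapped in a Column.
def filter(conditionExpr: String): Dataset[T] = {
  filter(Column(sparkSession.sessionState.sqlParser.parseExpression(conditionExpr)))
}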


Likewise, parseTableIdentifier resolves the table name, with the help of the Catalog API, to create the DataFrame.
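
sparkSession.table(...) follows the same pattern; here is a trimmed paraphrase of the 2.0.1 source:

// Paraphrased from org.apache.spark.sql.SparkSession (2.0.1):
// the name is parsed into a TableIdentifier; the TableIdentifier overload then
// looks the relation up via the SessionCatalog and wraps it in a DataFrame.
def table(tableName: String): DataFrame = {
  table(sessionState.sqlParser.parseTableIdentifier(tableName))
}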


So, every DataFrame is associated with a LogicalPlan, the initial QueryPlan that ultimately gets transformed into actual RDD code.
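
You can inspect that LogicalPlan (and the rest of the pipeline) yourself through queryExecution and explain; the sample data and view name below are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("plan-inspection").getOrCreate()
import spark.implicits._

// Register some hypothetical data so that "sometable" resolves during analysis
Seq(("ann", 20), ("bob", 12)).toDF("name", "age").createOrReplaceTempView("sometable")

val df = spark.sql("select * from sometable")

// The parsed LogicalPlan the DataFrame was created from
println(df.queryExecution.logical)

// Parsed, analyzed, optimized and physical plans in one shot
df.explain(true)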