In my previous post, I discussed the parsing phase, which produces a Logical Plan. In this post I’ll show how this Logical Plan gets transformed into its corresponding RDD.

Every SQLExecution is associated with an id used to track it at runtime, and with a query processing/transformation pipeline, the QueryExecution. The QueryExecution provides access to all of the stages of query plan transformation. Let’s go through these stages in detail.
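To follow along, the QueryExecution of any Dataset/DataFrame can be inspected directly; a minimal sketch (the query itself is just a made-up example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("query-execution-stages")
  .getOrCreate()

// Any query will do; this one is purely illustrative.
val evens = spark.range(0, 100).filter("id % 2 = 0")

val qe = evens.queryExecution  // the QueryExecution pipeline for this query

println(qe.logical)        // initial (unresolved) logical plan
println(qe.analyzed)       // Stage 1: analyzed logical plan
println(qe.withCachedData) // Stage 2: cached relations substituted in
println(qe.optimizedPlan)  // Stage 3: optimized logical plan
println(qe.sparkPlan)      // Stage 4: physical plan
println(qe.executedPlan)   // Stage 4: after codegen/exchange-reuse preparations
```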


Stage 1: Logical Plan to Analyzed Logical Plan (Binding)

The analyzer associated with the session state of the execution (mentioned in my previous post) transforms the initial logical plan into its resolved form, resolving tables and data sources into LogicalRelation nodes. The rules for these transformations can be found in DataSourceStrategy.scala (DataSourceAnalysis, FindDataSourceTable) and rules.scala (PreprocessTableInsertion, ResolveDataSource).
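To see the binding, compare the plan before and after analysis. A small sketch, reusing the spark session from the snippet above (the Parquet path and view name are hypothetical):

```scala
// Hypothetical Parquet file; any file-based data source behaves the same way.
val users = spark.read.parquet("/tmp/users.parquet")
users.createOrReplaceTempView("users")

val query = spark.sql("SELECT name FROM users")

// Before analysis the table shows up as an UnresolvedRelation; after analysis
// it is bound to a LogicalRelation wrapping the Parquet data source.
println(query.queryExecution.logical)
println(query.queryExecution.analyzed)
```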


Stage 2: Validation and Cached Data Source Resolution

The analyzed plan is further validated, and its data source segments are replaced with cached versions if they have already been cached.
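The substitution is visible in the withCachedData plan once a relation has been cached; a sketch continuing the example above:

```scala
users.cache()  // registers the relation with the cache manager (lazily)

val cachedQuery = spark.sql("SELECT name FROM users")

// The analyzed plan still refers to the original LogicalRelation, while this
// stage replaces it with an InMemoryRelation.
println(cachedQuery.queryExecution.analyzed)
println(cachedQuery.queryExecution.withCachedData)
```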


Stage 3: Query Optimization

A set of predefined, Catalyst-provided SQL optimization rules, along with any additional custom rules, transforms the analyzed logical plan into an optimized logical plan. Custom rules are experimental and can be added through ExperimentalMethods.scala.

You can find these rules in Optimizer.scala.
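As a sketch of how a custom rule is plugged in through ExperimentalMethods (the rule below is deliberately trivial and hypothetical; a real rule would rewrite the plan it receives):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing rule that only logs the plan it sees.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizing plan:\n$plan")
    plan
  }
}

// Runs as an extra batch alongside the built-in Catalyst optimizer rules.
spark.experimental.extraOptimizations = Seq(LogPlanRule)
```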


Stage 4: Spark Plan and Code Generation

The SparkPlanner, which holds a set of strategies, transforms the optimized logical plan into a SparkPlan, replacing logical operators with their corresponding physical operators. Whole-stage codegen collapsing and exchange reuse are also important performance-related transformations that happen at this stage. Users can also add custom strategies through ExperimentalMethods.scala.
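Custom strategies are registered the same way; a minimal sketch of a placeholder strategy (returning Nil tells the planner to fall back to the built-in strategies), followed by a look at the executed plan where whole-stage codegen is visible:

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A placeholder strategy: a real one would return physical operators for
// the logical nodes it knows how to plan.
object NoOpStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

spark.experimental.extraStrategies = Seq(NoOpStrategy)

// Operators collapsed by whole-stage codegen appear under WholeStageCodegen
// nodes (shown with a '*' prefix in the plan string).
println(spark.range(0, 100).filter("id % 2 = 0").queryExecution.executedPlan)
```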


Stage 5: Execution

The physical plan then gets transformed into an RDD for execution on the Spark runtime.
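The resulting RDD is exposed on QueryExecution itself; a short sketch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

// toRdd invokes execute() on the executed physical plan and hands back the
// RDD[InternalRow] that actually runs on the cluster.
val rdd: RDD[InternalRow] =
  spark.range(0, 100).filter("id % 2 = 0").queryExecution.toRdd
println(rdd.toDebugString)
```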