SQL Parser Framework

The SQL Parser Framework in Spark SQL uses ANTLR to parse SQL text and turn it into data types, Catalyst Expressions, TableIdentifiers, and LogicalPlans.

The contract of the SQL Parser Framework is described by the ParserInterface trait. The contract is partially implemented by the AbstractSqlParser class so that subclasses only have to provide a custom AstBuilder.

There are two concrete implementations of AbstractSqlParser:

  1. SparkSqlParser that is the default parser of the SQL statements supported in Spark SQL.

  2. CatalystSqlParser that is used to parse data types from their canonical string representation.

ParserInterface — SQL Parser Contract

ParserInterface is the parser contract for extracting LogicalPlan, Catalyst Expressions (to create Columns from), and TableIdentifiers from a given SQL string.

package org.apache.spark.sql.catalyst.parser

trait ParserInterface {
  def parsePlan(sqlText: String): LogicalPlan

  def parseExpression(sqlText: String): Expression

  def parseTableIdentifier(sqlText: String): TableIdentifier
}

Its only implementation is the abstract class AbstractSqlParser.
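
As a quick illustration, you can take the session's ParserInterface and call it directly (a spark-shell sketch; exact output may vary between versions):

scala> val parser = spark.sessionState.sqlParser
parser: org.apache.spark.sql.catalyst.parser.ParserInterface = ...

scala> parser.parseExpression("id > 1")
res0: org.apache.spark.sql.catalyst.expressions.Expression = ('id > 1)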

AbstractSqlParser

The AbstractSqlParser abstract class is a ParserInterface that provides the foundation for the SQL parsing infrastructure in Spark SQL, with the two concrete implementations, SparkSqlParser and CatalystSqlParser, available in Spark 2.0.

AbstractSqlParser creates a layer of indirection and expects that subclasses provide a custom AstBuilder that in turn converts an ANTLR ParseTree into a data type, Expression, TableIdentifier, or LogicalPlan.

protected def astBuilder: AstBuilder

AbstractSqlParser simply routes all the final parsing calls (that translate a SQL string into the respective Spark SQL object) to that AstBuilder.

When parsing a SQL string, it first uses its own protected parse method to set up the ANTLR parsing infrastructure.
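
For example, parsePlan (shown here as a simplified sketch of the Spark 2.0 sources) uses parse to run the ANTLR parser and hands the resulting parse tree over to astBuilder:

// A simplified sketch of AbstractSqlParser.parsePlan
// (the real code raises a ParseException for unsupported statements)
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
  astBuilder.visitSingleStatement(parser.singleStatement()) match {
    case plan: LogicalPlan => plan
    case _ => sys.error(s"Unsupported SQL statement: $sqlText") // simplified
  }
}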

parse method

parse[T](command: String)(toResult: SqlBaseParser => T): T

parse is a protected method that sets up the ANTLR parsing infrastructure with SqlBaseLexer and SqlBaseParser, which are the ANTLR-specific classes of Spark SQL that are auto-generated at build time.

Tip
Review the definition of ANTLR grammar for Spark SQL in sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4.

When called, parse prints out the following INFO message to the logs:

INFO SparkSqlParser: Parsing command: [command]
Tip
Enable INFO logging level for SparkSqlParser or CatalystSqlParser to see the INFO message.
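
Internally, parse wires up the ANTLR pipeline roughly as follows (a simplified sketch; the real method also installs custom error listeners and retries with the slower LL prediction mode when the faster SLL mode fails):

// A simplified sketch of AbstractSqlParser.parse
// (error listeners and the SLL-to-LL fallback elided)
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  logInfo(s"Parsing command: $command")
  val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
  val parser = new SqlBaseParser(new CommonTokenStream(lexer))
  toResult(parser)
}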

AstBuilder

AstBuilder is an ANTLR SqlBaseBaseVisitor that converts an ANTLR ParseTree (representing a SQL string) into Spark SQL's corresponding entity using the following methods:

  1. visitSingleDataType to produce a DataType

  2. visitSingleExpression to produce an Expression

  3. visitSingleTableIdentifier to produce a TableIdentifier

  4. visitSingleStatement to produce a LogicalPlan

AstBuilder belongs to org.apache.spark.sql.catalyst.parser package.

Note
SqlBaseBaseVisitor is an ANTLR-specific base class for parser visitors that is auto-generated at build time.
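
Schematically, AstBuilder has the following shape (a simplified sketch, not the actual implementation):

// A simplified sketch of AstBuilder
class AstBuilder extends SqlBaseBaseVisitor[AnyRef] {
  override def visitSingleDataType(ctx: SqlBaseParser.SingleDataTypeContext): DataType =
    visit(ctx.dataType).asInstanceOf[DataType] // simplified; the real code also tracks the node's origin
  // visitSingleExpression, visitSingleTableIdentifier and visitSingleStatement
  // follow the same pattern for Expression, TableIdentifier and LogicalPlan
}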

SparkSqlParser

SparkSqlParser is the default parser of the SQL statements supported in Spark SQL. It is available as sqlParser (the current ParserInterface object) through SessionState.

The common idiom in Spark SQL is as follows:

sparkSession.sessionState.sqlParser

SparkSqlParser is an AbstractSqlParser with the astBuilder being SparkSqlAstBuilder. It supports variable substitution.

SparkSqlParser is used to parse expression strings into their corresponding Column objects in the following:

  1. expr function

  2. selectExpr method (of Dataset)

  3. filter method (of Dataset)

  4. where method (of Dataset)

scala> expr("token = 'hello'")
16/07/07 18:32:53 INFO SparkSqlParser: Parsing command: token = 'hello'
res0: org.apache.spark.sql.Column = (token = hello)

SparkSqlParser is used to parse table strings into their corresponding table identifiers in the following:

  1. table methods in DataFrameReader and SparkSession

  2. insertInto and saveAsTable methods of DataFrameWriter

  3. createExternalTable and refreshTable methods of Catalog (and SessionState)
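
For example, parseTableIdentifier turns a dot-separated name into a TableIdentifier (a spark-shell sketch):

scala> spark.sessionState.sqlParser.parseTableIdentifier("mydb.mytable")
INFO SparkSqlParser: Parsing command: mydb.mytable
res1: org.apache.spark.sql.catalyst.TableIdentifier = `mydb`.`mytable`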

SparkSqlParser is used to parse SQL strings into their corresponding logical query plans in the following:

  1. sql method in SparkSession
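
You can also call parsePlan directly to see the unresolved logical plan of a SQL statement (a spark-shell sketch; output abbreviated):

scala> spark.sessionState.sqlParser.parsePlan("select * from t")
INFO SparkSqlParser: Parsing command: select * from t
res2: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
'Project [*]
+- 'UnresolvedRelation `t`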

Tip

Enable INFO logging level for org.apache.spark.sql.execution.SparkSqlParser logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.execution.SparkSqlParser=INFO

Refer to Logging.

Variable Substitution

Caution
FIXME See SparkSqlParser and substitutor.

CatalystSqlParser

CatalystSqlParser is an AbstractSqlParser object with the astBuilder being AstBuilder.

CatalystSqlParser is used to parse data types (from their canonical string representation), e.g. when adding fields to a schema or casting a column to a different data type.

import org.apache.spark.sql.types.StructType
scala> val struct = new StructType().add("a", "int")
struct: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))

scala> val asInt = expr("token = 'hello'").cast("int")
asInt: org.apache.spark.sql.Column = CAST((token = hello) AS INT)

When parsing, you should see INFO messages in the logs:

INFO CatalystSqlParser: Parsing command: int
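
You can also invoke CatalystSqlParser directly, e.g. to parse a data type from its canonical string representation (a spark-shell sketch):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

scala> CatalystSqlParser.parseDataType("array<int>")
INFO CatalystSqlParser: Parsing command: array<int>
res3: org.apache.spark.sql.types.DataType = ArrayType(IntegerType,true)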

It is also used in HiveClientImpl (when converting columns from Hive to Spark) and in OrcFileOperator (when inferring the schema for ORC files).

Tip

Enable INFO logging level for org.apache.spark.sql.catalyst.parser.CatalystSqlParser logger to see what happens inside.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.sql.catalyst.parser.CatalystSqlParser=INFO

Refer to Logging.
