SQL Parser Framework

SQL Parser Framework in Spark SQL uses ANTLR to parse a SQL text and then creates data types, Catalyst’s Expression, TableIdentifier, and LogicalPlan.

The contract of the SQL Parser Framework is described by the ParserInterface interface:

package org.apache.spark.sql.catalyst.parser

trait ParserInterface {
  def parsePlan(sqlText: String): LogicalPlan
  def parseExpression(sqlText: String): Expression
  def parseTableIdentifier(sqlText: String): TableIdentifier
}

The contract is then abstracted in the AbstractSqlParser class so that subclasses have only to provide a custom AstBuilder.
There are two concrete implementations of AbstractSqlParser:

- SparkSqlParser that is the default parser of the SQL statements supported in Spark SQL.
- CatalystSqlParser that is used to parse data types from their canonical string representation.
ParserInterface — SQL Parser Contract

ParserInterface is the parser contract for extracting LogicalPlan, Catalyst Expressions (to create Columns from), and TableIdentifiers from a given SQL string.

AbstractSqlParser is its only abstract implementation.
AbstractSqlParser

AbstractSqlParser abstract class is a ParserInterface that provides the foundation of the SQL parsing infrastructure in Spark SQL, with the two concrete implementations (SparkSqlParser and CatalystSqlParser) available in Spark 2.0.

AbstractSqlParser creates a layer of indirection and expects subclasses to provide a custom AstBuilder that in turn converts an ANTLR ParseTree into a data type, Expression, TableIdentifier, or LogicalPlan.
protected def astBuilder: AstBuilder
AbstractSqlParser simply routes all the final parsing calls that translate a SQL string into the respective Spark SQL object to that AstBuilder.

When parsing a SQL string, AbstractSqlParser first uses its own protected parse method that sets up a proper ANTLR parsing infrastructure.
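The following is a simplified sketch of that delegation, modeled on parsePlan in Spark 2.0 (the real code reports errors through a ParseException rather than sys.error):

override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
  // Let the AstBuilder convert the ANTLR parse tree into a LogicalPlan
  astBuilder.visitSingleStatement(parser.singleStatement()) match {
    case plan: LogicalPlan => plan
    case _ => sys.error(s"Unsupported SQL statement: $sqlText") // simplified error handling
  }
}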
parse method
parse[T](command: String)(toResult: SqlBaseParser => T): T
parse is a protected method that sets up a proper ANTLR parsing infrastructure with SqlBaseLexer and SqlBaseParser, which are the ANTLR-specific classes of Spark SQL that are auto-generated at build time.
Tip: Review the definition of the ANTLR grammar for Spark SQL in sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4.
When called, parse prints out the following INFO message to the logs:
INFO SparkSqlParser: Parsing command: [command]
Tip: Enable INFO logging level for SparkSqlParser or CatalystSqlParser to see the INFO message.
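A simplified sketch of what parse does under the covers (the error listeners and post-processing that Spark registers on the parser are omitted; ANTLRNoCaseStringStream is Spark’s case-insensitive character stream):

protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  logInfo(s"Parsing command: $command")

  // Tokenize the SQL text with the auto-generated ANTLR lexer
  val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
  val tokenStream = new CommonTokenStream(lexer)

  // Build the auto-generated ANTLR parser and hand it to the caller-provided function
  val parser = new SqlBaseParser(tokenStream)
  toResult(parser)
}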
AstBuilder

AstBuilder is an ANTLR SqlBaseBaseVisitor that converts an ANTLR ParseTree (that represents a SQL string) into Spark SQL’s corresponding entity using the following methods:

- visitSingleDataType to produce a DataType
- visitSingleExpression to produce an Expression
- visitSingleTableIdentifier to produce a TableIdentifier
- visitSingleStatement for a LogicalPlan
AstBuilder belongs to the org.apache.spark.sql.catalyst.parser package.
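As a sketch of how the two pieces meet, every parse* method in AbstractSqlParser pairs an ANTLR entry rule with the matching visit* method (simplified; the entry rule names singleDataType and singleExpression come from SqlBase.g4):

override def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
  astBuilder.visitSingleDataType(parser.singleDataType())
}

override def parseExpression(sqlText: String): Expression = parse(sqlText) { parser =>
  astBuilder.visitSingleExpression(parser.singleExpression())
}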
Note: SqlBaseBaseVisitor is an ANTLR-specific base class for parser visitors that is auto-generated at build time.
SparkSqlParser

SparkSqlParser is the default parser of the SQL statements supported in Spark SQL. It is available as sqlParser (as the current ParserInterface object) through SessionState.
The common idiom in Spark SQL is as follows:
sparkSession.sessionState.sqlParser
SparkSqlParser is an AbstractSqlParser with the astBuilder being SparkSqlAstBuilder. It supports variable substitution.
SparkSqlParser is used to parse expression strings into their corresponding Columns objects in the following:

- expr function
- selectExpr method (of Dataset)
- filter method (of Dataset)
- where method (of Dataset)
scala> expr("token = 'hello'")
16/07/07 18:32:53 INFO SparkSqlParser: Parsing command: token = 'hello'
res0: org.apache.spark.sql.Column = (token = hello)
SparkSqlParser is used to parse table strings into their corresponding table identifiers in the following:

- table methods in DataFrameReader and SparkSession
- insertInto and saveAsTable methods of DataFrameWriter
- createExternalTable and refreshTable methods of Catalog (and SessionState)
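You can also see parseTableIdentifier at work directly, using the sqlParser idiom from above (a minimal sketch; spark stands for an active SparkSession):

import org.apache.spark.sql.catalyst.TableIdentifier

// Parses "mydb.mytable" into TableIdentifier("mytable", Some("mydb"))
val tableId: TableIdentifier =
  spark.sessionState.sqlParser.parseTableIdentifier("mydb.mytable")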
SparkSqlParser is used to parse sql strings into their corresponding logical query plans in the following:

- sql method in SparkSession (see the sketch below)
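You can reproduce the parsing step of sql yourself (a minimal sketch; spark stands for an active SparkSession):

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// sql first parses the statement into a LogicalPlan through the session's sqlParser
val plan: LogicalPlan =
  spark.sessionState.sqlParser.parsePlan("SELECT 1 AS id")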
Tip: Enable INFO logging level for SparkSqlParser to see what happens inside. Refer to Logging.
Variable Substitution

Caution: FIXME See SparkSqlParser and substitutor.
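Pending the FIXME above, the hook itself is small: SparkSqlParser overrides parse to run the command through a variable substitutor first (a simplified sketch modeled on Spark 2.x, where VariableSubstitution lives in org.apache.spark.sql.internal):

class SparkSqlParser(conf: SQLConf) extends AbstractSqlParser {
  val astBuilder = new SparkSqlAstBuilder(conf)

  // Substitutes variable references such as ${var}, ${system:var} and ${env:var}
  private val substitutor = new VariableSubstitution(conf)

  protected override def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    super.parse(substitutor.substitute(command))(toResult)
  }
}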
CatalystSqlParser

CatalystSqlParser is an AbstractSqlParser object with the astBuilder being AstBuilder.

CatalystSqlParser is used to parse data types (using their canonical string representation), e.g. when adding fields to a schema or casting a column to a different data type.
import org.apache.spark.sql.types.StructType
scala> val struct = new StructType().add("a", "int")
struct: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,true))
scala> val asInt = expr("token = 'hello'").cast("int")
asInt: org.apache.spark.sql.Column = CAST((token = hello) AS INT)
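You can also invoke CatalystSqlParser directly to turn a canonical type string into a DataType (a minimal sketch; parseDataType comes from AbstractSqlParser):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Parses the canonical string representation into a DataType (IntegerType here)
val intType = CatalystSqlParser.parseDataType("int")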
When parsing, you should see INFO messages in the logs:
INFO CatalystSqlParser: Parsing command: int
It is also used in HiveClientImpl (when converting columns from Hive to Spark) and in OrcFileOperator (when inferring the schema for ORC files).
Tip: Enable INFO logging level for CatalystSqlParser to see what happens inside. Refer to Logging.