LogicalPlan — Logical Query Plan

LogicalPlan is a QueryPlan that represents a logical operator of a structured query in Spark SQL.

Note
LogicalPlan uses the Catalyst Framework to represent itself as a tree of child LogicalPlans.

A LogicalPlan is what describes a Dataset; you can access it through the Dataset's QueryExecution.

val plan = dataset.queryExecution.logical
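
Since a LogicalPlan is a Catalyst TreeNode, you can inspect the operator tree directly (the exact operators printed depend on the query):

// Print the operator tree, one numbered line per operator
println(plan.numberedTreeString)

// Or walk the immediate child operators yourself
plan.children.foreach(child => println(child.nodeName))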

A logical plan can be analyzed, which means that the plan (including its children) has gone through analysis and verification.

scala> plan.analyzed
res1: Boolean = true
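
You can also request the analyzed variant of the plan directly from QueryExecution:

// The plan after the analyzer has run
val analyzedPlan = dataset.queryExecution.analyzed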

A logical plan can also be resolved to a specific schema.

scala> plan.resolved
res2: Boolean = true
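
A resolved plan can report its output schema (inherited from QueryPlan):

// Pretty-print the schema the plan resolved to
println(plan.schema.treeString)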

A logical plan knows the size of the objects that its query operators (e.g. a join) produce, through its Statistics object.

scala> val stats = plan.statistics
stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(8,false)
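
In this Spark version the two fields of Statistics are sizeInBytes and isBroadcastable, so you can read the estimated size directly:

// Estimated size in bytes of the data the plan produces
println(stats.sizeInBytes)  // 8 for the plan above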

A logical plan may know the maximum number of records it can compute (None when unknown).

scala> val maxRows = plan.maxRows
maxRows: Option[Long] = None
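
Operators that bound their result size, such as limits, do define maxRows. A minimal illustration (the exact behavior depends on the Spark version):

// GlobalLimit (created by Dataset.limit) defines maxRows
val limited = dataset.limit(3)
println(limited.queryExecution.logical.maxRows)  // Some(3)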

A logical plan can be streaming if it contains one or more structured streaming sources.

resolveQuoted method

Caution
FIXME

Specialized LogicalPlans

  • LeafNode

  • UnaryNode

  • BinaryNode
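
These base classes fix the number of child operators at zero, one and two, respectively. As a minimal sketch, a custom leaf operator only has to declare its output attributes (MyRelation is a hypothetical name):

import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.types.StringType

// A hypothetical leaf operator with a single string column and no children
case class MyRelation() extends LeafNode {
  override def output: Seq[Attribute] =
    Seq(AttributeReference("value", StringType)())
}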

Command — Logical Commands

Command is the base for leaf logical plans that represent non-query commands to be executed by the system. It defines output to return an empty collection of Attributes.

Known commands are:

  1. CreateTable

  2. Any RunnableCommand
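
Conceptually (paraphrasing the Spark source, not quoting it verbatim), Command is little more than a marker on top of LeafNode with no output attributes:

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode

// Roughly what Command boils down to in Catalyst
trait Command extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
}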

RunnableCommand — Logical Commands with Side Effects

RunnableCommand is the base trait for logical commands that are executed eagerly for their side effects.

RunnableCommand defines one abstract method run that computes a collection of Rows.

run(sparkSession: SparkSession): Seq[Row]

Note
RunnableCommand is translated to ExecutedCommandExec in the BasicOperators strategy.
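
As a minimal sketch, a custom RunnableCommand implements run with the side effect and returns whatever rows (often none) should be presented to the caller (HelloCommand is a hypothetical example):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.command.RunnableCommand

// A hypothetical command executed solely for its side effect
case class HelloCommand(message: String) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    println(s"Side effect: $message") // the side effect itself
    Seq.empty[Row]                    // nothing to show the caller
  }
}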

Is Logical Plan Structured Streaming — isStreaming method

isStreaming: Boolean

isStreaming is part of the public API of LogicalPlan and is enabled (i.e. true) when a logical plan contains a streaming source.

By default, it walks over subtrees and calls itself, i.e. isStreaming, on every child node to find a streaming source.
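
This boils down to the following (paraphrasing the implementation):

// Effectively what the default isStreaming does
def isStreaming: Boolean = children.exists(_.isStreaming)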

val spark: SparkSession = ...

// Regular dataset
scala> val ints = spark.createDataset(0 to 9)
ints: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ints.queryExecution.logical.isStreaming
res1: Boolean = false

// Streaming dataset
scala> val logs = spark.readStream.format("text").load("logs/*.out")
logs: org.apache.spark.sql.DataFrame = [value: string]

scala> logs.queryExecution.logical.isStreaming
res2: Boolean = true
