```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp")  // (1)
  .getOrCreate
```
(1) Set spark.sql.warehouse.dir for the Spark session.

Settings

The following is a list of the settings used to configure Spark SQL applications. You can set them on a SparkSession upon instantiation using the config method, as in the example above.
spark.sql.warehouse.dir

spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse) is the default location for managed databases and tables.
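As a quick check, you can read the value in effect back from the session's runtime configuration (a minimal sketch, assuming the spark session created in the example above):

```scala
// Assumes the `spark` session created above
// Read the effective warehouse location back from the runtime configuration
println(spark.conf.get("spark.sql.warehouse.dir"))
```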
spark.sql.parquet.filterPushdown

spark.sql.parquet.filterPushdown (default: true) is a flag that controls the filter predicate push-down optimization for data sources using the parquet file format.
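A minimal sketch of toggling the optimization at runtime; the parquet path below is only an example:

```scala
// Assumes an existing SparkSession `spark`; /tmp/people.parquet is a hypothetical dataset
// Disable parquet filter push-down, e.g. to compare physical plans
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

// With push-down enabled (the default), the filter below could be evaluated
// by the parquet reader instead of after the rows are loaded
val people = spark.read.parquet("/tmp/people.parquet").filter("age > 21")
people.explain(true)
```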
spark.sql.catalogImplementation

spark.sql.catalogImplementation (default: in-memory) is an internal setting to select the active catalog implementation. There are two possible values:

- in-memory (default)
- hive
Tip: You can enable Hive support in a SparkSession using the enableHiveSupport builder method.
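A minimal sketch of a Hive-enabled session (assumes the Hive classes are on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("Hive-backed Catalog")
  .enableHiveSupport()  // switches spark.sql.catalogImplementation to hive
  .getOrCreate

println(spark.conf.get("spark.sql.catalogImplementation"))  // hive
```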
spark.sql.shuffle.partitions

spark.sql.shuffle.partitions (default: 200) is the default number of partitions to use when shuffling data for joins or aggregations.
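For small local datasets the default of 200 is often too high; a minimal sketch of lowering it at runtime (assumes an existing SparkSession `spark`):

```scala
// Lower the number of shuffle partitions for this session
spark.conf.set("spark.sql.shuffle.partitions", "8")

// The aggregation below shuffles into 8 partitions instead of the default 200
val counts = spark.range(1000).groupBy("id").count()
println(counts.rdd.getNumPartitions)  // 8
```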
spark.sql.allowMultipleContexts

spark.sql.allowMultipleContexts (default: true) controls whether creating multiple SQLContexts/HiveContexts is allowed.
spark.sql.autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold (default: 10 * 1024 * 1024) configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. If the size in the statistics of a DataFrame's logical plan is at most this setting, the DataFrame is broadcast for the join. Negative values or 0 disable broadcasting.

Consult Broadcast Join for more information about the topic.
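A minimal sketch of adjusting or disabling the threshold at runtime (assumes an existing SparkSession `spark`):

```scala
// Setting the threshold to -1 (or 0) disables automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Alternatively, raise it so bigger dimension tables are still broadcast, e.g. 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

val facts = spark.range(1000000).toDF("id")
val dims  = spark.range(100).toDF("id")
// explain shows whether a broadcast join was chosen for the plan
facts.join(dims, "id").explain()
```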
spark.sql.columnNameOfCorruptRecord

…FIXME
spark.sql.dialect

FIXME
spark.sql.sources.default

spark.sql.sources.default (default: parquet) sets the default data source to use for input and output.

It is used when reading or writing data with DataFrameReader and DataFrameWriter, when creating an external table from a path (in Catalog.createExternalTable), and in the streaming DataStreamReader and DataStreamWriter.
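A minimal sketch; the paths are only examples:

```scala
// Assumes an existing SparkSession `spark`
// With the default of parquet, no explicit format is needed
val events = spark.read.load("/tmp/events.parquet")
events.write.save("/tmp/events-copy")

// Change the default data source for this session to json
spark.conf.set("spark.sql.sources.default", "json")
val eventsJson = spark.read.load("/tmp/events.json")  // now read as JSON
```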
spark.sql.streaming.checkpointLocation

spark.sql.streaming.checkpointLocation is the default location for storing checkpoint data for continuously executing queries.
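A minimal sketch of relying on the session-wide default instead of setting a checkpoint location per query (the paths and the text source are only examples):

```scala
// Assumes an existing SparkSession `spark`
// Streaming queries in this session checkpoint under this directory by default
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

val lines = spark.readStream.text("/tmp/streaming-input")
val query = lines.writeStream
  .format("console")
  .start()  // picks up the session-wide checkpoint location set above
```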