Schema — Structure of Data
A schema is the description of the structure of your data (the data and its schema together make up a Dataset in Spark SQL). It can be implicit (and inferred at runtime) or explicit (and known at compile time).
A schema is described using StructType, which is a collection of StructField objects (each in turn a tuple of a name, a type, and a nullability flag). StructType and StructField belong to the org.apache.spark.sql.types package.
You can use the canonical string representation of SQL types to describe the types in a schema (that is inherently untyped at compile time) or use type-safe types from the org.apache.spark.sql.types package.

import org.apache.spark.sql.types.StructType
val schemaUntyped = new StructType()
  .add("a", "int")
  .add("b", "string")
// it is equivalent to the above expression
import org.apache.spark.sql.types.{IntegerType, StringType}
val schemaTyped = new StructType()
  .add("a", IntegerType)
  .add("b", StringType)
Tip: Read up on SQL Parser Framework in Spark SQL to learn about CatalystSqlParser, which is responsible for parsing data types.
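For example, you can feed a canonical type string to CatalystSqlParser and inspect the DataType it produces. A quick REPL sketch (CatalystSqlParser is an internal Catalyst API, used here only for exploration):

scala> import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

scala> CatalystSqlParser.parseDataType("int")
res0: org.apache.spark.sql.types.DataType = IntegerType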
It is, however, recommended to use the DataTypes class and its static factory methods to create schema types.
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.types.DataTypes._
val schemaWithMap = StructType(
  StructField("map", createMapType(LongType, StringType), false) :: Nil)
StructType offers printTreeString, which prints the schema as an easy-to-read tree.
scala> schemaTyped.printTreeString
root
 |-- a: integer (nullable = true)
 |-- b: string (nullable = true)

scala> schemaWithMap.printTreeString
root
 |-- map: map (nullable = false)
 |    |-- key: long
 |    |-- value: string (valueContainsNull = true)
// You can use the prettyJson method on any DataType
scala> println(schemaTyped.prettyJson)
{
"type" : "struct",
"fields" : [ {
"name" : "a",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "b",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}
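The JSON representation round-trips: DataType.fromJson rebuilds a schema from its JSON string. A minimal sketch, reusing the schemaTyped value defined above (json is the compact counterpart of prettyJson):

import org.apache.spark.sql.types.{DataType, StructType}
val restored = DataType.fromJson(schemaTyped.json).asInstanceOf[StructType]
assert(restored == schemaTyped)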
As of Spark 2.0, you can describe the schema of your strongly-typed datasets using encoders.
import org.apache.spark.sql.Encoders
scala> Encoders.INT.schema.printTreeString
root
 |-- value: integer (nullable = true)
scala> Encoders.product[(String, java.sql.Timestamp)].schema.printTreeString
root
 |-- _1: string (nullable = true)
 |-- _2: timestamp (nullable = true)
case class Person(id: Long, name: String)
scala> Encoders.product[Person].schema.printTreeString
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
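Besides describing strongly-typed datasets, an explicit schema is commonly passed to a DataFrameReader to avoid the cost of schema inference. A minimal sketch, assuming a SparkSession named spark and reusing schemaTyped from above (the CSV path is hypothetical):

// "a-and-b.csv" is a made-up path for illustration
val csvDF = spark.read
  .schema(schemaTyped)
  .csv("a-and-b.csv")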
Implicit Schema
val df = Seq((0, s"""hello\tworld"""), (1, "two  spaces inside")).toDF("label", "sentence")
scala> df.printSchema
root
 |-- label: integer (nullable = false)
 |-- sentence: string (nullable = true)
scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(label,IntegerType,false), StructField(sentence,StringType,true))
scala> df.schema("label").dataType
res1: org.apache.spark.sql.types.DataType = IntegerType
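You can also traverse a schema programmatically, e.g. to list every field with its type and nullability. A small sketch using the df defined above:

df.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType.simpleString} (nullable=${f.nullable})")
}
// label: int (nullable=false)
// sentence: string (nullable=true)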