import org.apache.spark.sql.streaming.ProcessingTime
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.OutputMode.Complete
df.writeStream
  .queryName("textStream")
  .outputMode(Complete)
  .trigger(ProcessingTime(10.seconds))
  .format("console")
  .start
DataStreamWriter
is part of the Structured Streaming API as of Spark 2.0. It is responsible for writing the output of streaming queries to sinks and hence for starting their execution.
- queryName to set the name of a query
- outputMode to specify the output mode
- trigger to set how often the query is executed
- start to start continuous writing to a sink
Specifying Output Mode — outputMode method
outputMode(outputMode: OutputMode): DataStreamWriter[T]
outputMode
specifies the output mode of a streaming Dataset, i.e. what gets written to a streaming sink when new data is available.
Currently, the following output modes are supported:
- OutputMode.Append — only the new rows in the streaming dataset will be written to a sink.
- OutputMode.Complete — the entire streaming dataset (with all the rows) will be written to a sink every time there are updates. It is supported only for streaming queries with aggregations.
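The difference between the two modes can be sketched with the df from the opening example (the groupBy column name below is illustrative):

```scala
import org.apache.spark.sql.streaming.OutputMode

// Append mode: only rows added since the last trigger reach the sink.
// Suited to queries without aggregations.
df.writeStream
  .outputMode(OutputMode.Append)
  .format("console")
  .start

// Complete mode: the whole result table is rewritten on every trigger.
// Only allowed for queries with an aggregation, such as this count.
df.groupBy("value").count
  .writeStream
  .outputMode(OutputMode.Complete)
  .format("console")
  .start
```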
Setting Query Name — queryName method
queryName(queryName: String): DataStreamWriter[T]
queryName
sets the name of a streaming query.
Internally, it is just an additional option with the key queryName.
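A short sketch of setting and reading back the name. The name set here is also reported on the StreamingQuery handle returned by start:

```scala
import org.apache.spark.sql.streaming.OutputMode.Complete

// Name the query so it can be identified later (e.g. in the Spark UI).
val query = df.groupBy("value").count
  .writeStream
  .queryName("textStream")
  .outputMode(Complete)
  .format("console")
  .start

// The name is available on the returned StreamingQuery.
query.name  // "textStream"
```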
Setting How Often to Execute Streaming Query — trigger method
trigger(trigger: Trigger): DataStreamWriter[T]
trigger
sets the time interval of the trigger (batch) for a streaming query.
Note: Trigger specifies how often results should be produced by a StreamingQuery. See Trigger.
The default trigger is ProcessingTime(0L), which runs a streaming query as often as possible.
Tip: Consult Trigger to learn about the Trigger and ProcessingTime types.
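ProcessingTime accepts either a Scala Duration or an interval string, so the opening example can equivalently be written either way:

```scala
import org.apache.spark.sql.streaming.ProcessingTime
import scala.concurrent.duration._

// Fire a new batch every 10 seconds, using a Duration...
df.writeStream
  .trigger(ProcessingTime(10.seconds))
  .format("console")
  .start

// ...or the equivalent interval string.
df.writeStream
  .trigger(ProcessingTime("10 seconds"))
  .format("console")
  .start
```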
Starting Continuous Writing to Sink — start methods
start(): StreamingQuery
start(path: String): StreamingQuery (1)
(1) Sets the path option to path and calls start()
start
methods start a streaming query and return a StreamingQuery object to continually write data.
Note: Whether or not you have to specify the path option depends on the DataSource in use.
Recognized options:
- queryName is the name of the active streaming query.
- checkpointLocation is the directory for checkpointing.
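Both options can be combined in a single query. A minimal sketch with a file sink, where start(path) sets the output directory (the paths below are illustrative):

```scala
// checkpointLocation enables fault-tolerant recovery for the query;
// the path argument of start is where the parquet output files go.
df.writeStream
  .queryName("fileStream")
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoints")  // illustrative path
  .start("/tmp/output")                              // illustrative path
```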