Scheduler Backends
Spark comes with a pluggable backend mechanism called scheduler backend (aka backend scheduler) to support various cluster managers, e.g. Apache Mesos, Hadoop YARN or Spark’s own Spark Standalone and Spark local.
These cluster managers differ in their task scheduling modes and resource offer mechanisms, and Spark abstracts the differences away in the SchedulerBackend Contract.
A scheduler backend is created and started as part of SparkContext’s initialization (when TaskSchedulerImpl is started - see Creating Scheduler Backend and Task Scheduler).
Caution: FIXME Image how it gets created with SparkContext in play here or in SparkContext doc.
Scheduler backends are started and stopped as part of TaskSchedulerImpl’s initialization and stopping.
Being a scheduler backend in Spark assumes an Apache Mesos-like model in which "an application" gets resource offers as machines become available and can launch tasks on them. Once a scheduler backend obtains the resource allocation, it can start executors.
Tip: Understanding how Apache Mesos works can greatly improve your understanding of Spark.
SchedulerBackend Contract
Note: org.apache.spark.scheduler.SchedulerBackend is a private[spark] Scala trait in Spark.
Every SchedulerBackend has to follow the following contract (a simplified trait sketch follows below):

- Can be started (using start()) and stopped (using stop())
- Calculates the default level of parallelism
- Answers isReady() to inform whether it is currently started or stopped. It returns true by default.
- Knows the application id for a job (using applicationId()).
Caution: FIXME applicationId() doesn't accept an input parameter. How is a scheduler backend related to a job and an application?
- Knows the application attempt id (see applicationAttemptId)
- Knows the URLs for the driver's logs (see getDriverLogUrls).
Caution: FIXME Screenshot the tab and the links
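Putting the contract together, the following is a simplified sketch of the trait with the methods described above. The default bodies and the appId value are approximations for illustration only; refer to the Spark sources for the exact definitions.

package org.apache.spark.scheduler

private[spark] trait SchedulerBackend {
  private val appId = "spark-application-" + System.currentTimeMillis

  def start(): Unit
  def stop(): Unit

  // Asks the cluster manager for (new) resource offers to run pending tasks.
  def reviveOffers(): Unit

  // Hint the TaskScheduler uses for sizing jobs.
  def defaultParallelism(): Int

  // Killing a single task is optional; backends that cannot do it keep the default.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException

  // Whether the backend is ready to accept work; true by default.
  def isReady(): Boolean = true

  // The id of the Spark application the backend works for.
  def applicationId(): String = appId

  // Only cluster managers that support multiple application attempts (YARN) override this.
  def applicationAttemptId(): Option[String] = None

  // URLs of the driver's logs; only YarnClusterSchedulerBackend provides them.
  def getDriverLogUrls: Option[Map[String, String]] = None
}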
reviveOffers Method
Note: It is used in TaskSchedulerImpl (through the backend internal reference) when submitting tasks.
There are currently three custom implementations of reviveOffers available in Spark for the different clustering options.
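As a rough illustration of the note above (the names are simplified and this is not the actual Spark source), TaskSchedulerImpl keeps a reference to its SchedulerBackend and revives offers right after new tasks have been submitted:

// Simplified, self-contained sketch of how reviveOffers gets triggered.
trait SchedulerBackend {
  def reviveOffers(): Unit
}

class TaskSchedulerImpl(backend: SchedulerBackend) {
  def submitTasks(taskDescriptions: Seq[String]): Unit = {
    // ... register the tasks with the scheduling pool (elided) ...
    backend.reviveOffers() // ask the cluster manager for resources to run them
  }
}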
Default Level of Parallelism (defaultParallelism method)
defaultParallelism(): Int
The default level of parallelism is used by TaskScheduler as a hint for sizing jobs.
Note: It is used in TaskSchedulerImpl.defaultParallelism.
Refer to LocalBackend for local mode.
Refer to Default Level of Parallelism for CoarseGrainedSchedulerBackend.
Refer to Default Level of Parallelism for CoarseMesosSchedulerBackend.
No other custom implementations of defaultParallelism() exist.
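To see what the hint means in practice, here is a minimal local-mode example. It assumes spark.default.parallelism is not set; the exact value depends on the master URL.

import org.apache.spark.{SparkConf, SparkContext}

object DefaultParallelismDemo {
  def main(args: Array[String]): Unit = {
    // In local mode, LocalBackend reports the number of cores from the master URL
    // (here 4), unless spark.default.parallelism is set explicitly.
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("demo"))
    println(sc.defaultParallelism)                       // 4
    println(sc.parallelize(1 to 100).partitions.length)  // also 4 - parallelize uses the hint
    sc.stop()
  }
}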
Killing Task (killTask method)
killTask(taskId: Long, executorId: String, interruptThread: Boolean)
killTask throws an UnsupportedOperationException by default.
applicationAttemptId
applicationAttemptId(): Option[String] = None
applicationAttemptId returns the application attempt id of a Spark application.
It is currently only supported by the YARN cluster scheduler backend, as the YARN cluster manager supports multiple application attempts.
Note: applicationAttemptId is also a part of the TaskScheduler contract and TaskSchedulerImpl directly calls the SchedulerBackend's applicationAttemptId.
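A minimal sketch of that delegation (illustrative only; the real classes have many more members):

// Illustrative sketch of the delegation mentioned in the note above.
trait SchedulerBackend {
  def applicationAttemptId(): Option[String] = None
}

class TaskSchedulerImpl(backend: SchedulerBackend) {
  // Part of the TaskScheduler contract: simply forward to the scheduler backend.
  def applicationAttemptId(): Option[String] = backend.applicationAttemptId()
}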
getDriverLogUrls
getDriverLogUrls: Option[Map[String, String]]
getDriverLogUrls returns no URLs by default.
It is currently only supported by YarnClusterSchedulerBackend.
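For illustration, a value returned by a YARN cluster backend could look like the following; the keys, host and paths are made-up placeholders and not taken from the actual sources.

// Illustrative value only - in reality the URLs point at the YARN NodeManager
// that serves the driver container's logs.
val driverLogUrls: Option[Map[String, String]] = Some(Map(
  "stdout" -> "http://nodemanager-host:8042/node/containerlogs/container_.../stdout",
  "stderr" -> "http://nodemanager-host:8042/node/containerlogs/container_.../stderr"
))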
Available Implementations
Spark comes with the following scheduler backends:
- LocalBackend (local mode)
- CoarseGrainedSchedulerBackend
  - SparkDeploySchedulerBackend used in Spark Standalone (and local-cluster - FIXME)
  - YarnSchedulerBackend
    - YarnClientSchedulerBackend (for client deploy mode)
    - YarnClusterSchedulerBackend (for cluster deploy mode)
  - CoarseMesosSchedulerBackend
- MesosSchedulerBackend