MapOutputTracker
A MapOutputTracker is a Spark service that tracks the locations of the (shuffle) map outputs of a stage. It uses an internal mapStatuses map that holds, for every shuffle id, an array of MapStatus with one entry per map partition.
There are two versions of MapOutputTracker:
- MapOutputTrackerMaster for a driver
- MapOutputTrackerWorker for executors
MapOutputTracker is available under SparkEnv.get.mapOutputTracker. It is also available as MapOutputTracker in the driver’s RPC Environment.
Tip
|
Enable DEBUG logging level for the org.apache.spark.MapOutputTracker logger to see what happens inside. Add the following line to your log4j configuration:
log4j.logger.org.apache.spark.MapOutputTracker=DEBUG
|
It works with ShuffledRDD when it asks for preferred locations for a shuffle using tracker.getPreferredLocationsForShuffle.
It is also used for mapOutputTracker.containsShuffle and MapOutputTrackerMaster.registerShuffle when a new ShuffleMapStage is created.
Caution
|
FIXME DAGScheduler.mapOutputTracker
|
MapOutputTrackerMaster.getStatistics(dependency) returns MapOutputStatistics, which becomes the result of JobWaiter.taskSucceeded for a ShuffleMapStage if it’s the final stage in a job.
MapOutputTrackerMaster.registerMapOutputs is called with a shuffle id and a list of MapStatus when a ShuffleMapStage is finished.
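The bookkeeping described above can be sketched in plain Scala. This is a simplified model, not Spark's code: the `MapStatus` and `MapOutputStatistics` case classes are pared-down stand-ins, executor locations are plain strings, and `getStatistics` takes a shuffle id and a reducer count instead of a dependency. It also assumes that the statistics are per-reduce-partition byte totals.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's classes (illustration only).
object TrackerSketch {
  // Where a map task ran, plus the estimated block size per reduce partition.
  final case class MapStatus(location: String, sizes: Array[Long]) {
    def getSizeForBlock(reduceId: Int): Long = sizes(reduceId)
  }

  final case class MapOutputStatistics(shuffleId: Int, bytesByPartitionId: Array[Long])

  class MapOutputTrackerMaster {
    // shuffle id -> one MapStatus slot per map partition
    private val mapStatuses = mutable.Map.empty[Int, Array[MapStatus]]

    def containsShuffle(shuffleId: Int): Boolean = mapStatuses.contains(shuffleId)

    // Called when a new ShuffleMapStage is created: reserve empty slots.
    def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
      require(!containsShuffle(shuffleId), s"Shuffle $shuffleId registered twice")
      mapStatuses(shuffleId) = new Array[MapStatus](numMaps)
    }

    // Called when a ShuffleMapStage has finished: store all statuses at once.
    def registerMapOutputs(shuffleId: Int, statuses: Array[MapStatus]): Unit =
      mapStatuses(shuffleId) = statuses.clone()

    // Assumed semantics: total estimated bytes per reduce partition.
    // Expects registerMapOutputs to have filled every slot beforehand.
    def getStatistics(shuffleId: Int, numReducers: Int): MapOutputStatistics = {
      val totals = new Array[Long](numReducers)
      for (status <- mapStatuses(shuffleId); r <- 0 until numReducers)
        totals(r) += status.getSizeForBlock(r)
      MapOutputStatistics(shuffleId, totals)
    }
  }
}
```

In this model a shuffle goes through `registerShuffle` once, `registerMapOutputs` when its stage completes, and can then be queried with `getStatistics`.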
unregisterShuffle
Caution
|
FIXME |
MapStatus
A MapStatus is the result returned by a ShuffleMapTask to DAGScheduler that includes:
- the location where ShuffleMapTask ran (as def location: BlockManagerId)
- an estimated size for the reduce block, in bytes (as def getSizeForBlock(reduceId: Int): Long).
There are two types of MapStatus:
- CompressedMapStatus that compresses the estimated map output size to 8 bits (Byte) for efficient reporting.
- HighlyCompressedMapStatus that stores the average size of non-empty blocks, and a compressed bitmap for tracking which blocks are empty.
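One way to fit a block size into a single Byte is a logarithmic encoding. The sketch below illustrates that idea. The base of 1.1 and the object name `SizeCompressionSketch` are assumptions for illustration and are not confirmed by the text above. Decoding recovers an estimate within roughly 10% above the true size, and very large sizes are capped at the largest code (255).

```scala
object SizeCompressionSketch {
  // Assumed base: smaller values give better precision but a smaller range.
  private val LogBase = 1.1

  // Encode a size as ceil(log_base(size)), capped at 255 so it fits in 8 bits.
  // 0 is reserved for "empty block".
  def compressSize(size: Long): Byte =
    if (size == 0) 0
    else if (size <= 1) 1
    else math.min(255, math.ceil(math.log(size.toDouble) / math.log(LogBase)).toInt).toByte

  // Decode by raising the base to the stored exponent.
  // `& 0xFF` reads the signed Byte as an unsigned value in 0..255.
  def decompressSize(compressed: Byte): Long =
    if (compressed == 0) 0
    else math.pow(LogBase, compressed & 0xFF).toLong
}
```

Because the exponent is rounded up, the decoded value never underestimates a size that is within the encodable range.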
When the number of blocks (the size of uncompressedSizes) is greater than 2000, HighlyCompressedMapStatus is chosen.
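The selection rule and the averaged representation can be modeled as follows. This is a sketch under stated assumptions: `PerBlockStatus` and `AveragedStatus` are hypothetical stand-ins for CompressedMapStatus and HighlyCompressedMapStatus, a plain `java.util.BitSet` is swapped in for the compressed bitmap, and the per-block variant keeps full Long sizes rather than 8-bit codes.

```scala
import java.util.BitSet

object MapStatusChoiceSketch {
  sealed trait MapStatus { def getSizeForBlock(reduceId: Int): Long }

  // Stand-in for CompressedMapStatus: one entry per block.
  final class PerBlockStatus(sizes: Array[Long]) extends MapStatus {
    def getSizeForBlock(reduceId: Int): Long = sizes(reduceId)
  }

  // Stand-in for HighlyCompressedMapStatus: one average plus an "empty" bitmap.
  final class AveragedStatus(emptyBlocks: BitSet, avgSize: Long) extends MapStatus {
    def getSizeForBlock(reduceId: Int): Long =
      if (emptyBlocks.get(reduceId)) 0L else avgSize
  }

  // Threshold taken from the text above.
  val Threshold = 2000

  def apply(uncompressedSizes: Array[Long]): MapStatus =
    if (uncompressedSizes.length > Threshold) averaged(uncompressedSizes)
    else new PerBlockStatus(uncompressedSizes)

  // Record which blocks are empty and average the sizes of the rest.
  private def averaged(sizes: Array[Long]): AveragedStatus = {
    val empty = new BitSet(sizes.length)
    var total = 0L
    var nonEmpty = 0
    for (i <- sizes.indices) {
      if (sizes(i) > 0) { total += sizes(i); nonEmpty += 1 }
      else empty.set(i)
    }
    val avg = if (nonEmpty > 0) total / nonEmpty else 0L
    new AveragedStatus(empty, avg)
  }
}
```

The trade-off in this model is memory against accuracy: beyond the threshold, every non-empty block reports the same average size, so skew between blocks is no longer visible to the reader of the status.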
Caution
|
FIXME What exactly is 2000? Is this the number of tasks in a job? |
Caution
|
FIXME Review ShuffleManager |
Epoch Number
Caution
|
FIXME |
MapOutputTrackerWorker
A MapOutputTrackerWorker is the MapOutputTracker for executors. The internal mapStatuses map serves as a cache and any miss triggers a fetch from the driver’s MapOutputTrackerMaster.
Note
|
The only difference between MapOutputTrackerWorker and the base abstract class MapOutputTracker is that the internal mapStatuses mapping between ints and an array of MapStatus objects is an instance of the thread-safe java.util.concurrent.ConcurrentHashMap.
|
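The cache-then-fetch behavior can be sketched with a ConcurrentHashMap. This is a minimal model, not Spark's implementation: `fetchFromDriver` is a hypothetical function standing in for the request to the driver's MapOutputTrackerMaster, `getStatuses` is an illustrative method name, and `MapStatus` is reduced to a location string.

```scala
import java.util.concurrent.ConcurrentHashMap

object WorkerCacheSketch {
  final case class MapStatus(location: String)

  // `fetchFromDriver` is a hypothetical stand-in for the call to the driver.
  class MapOutputTrackerWorker(fetchFromDriver: Int => Array[MapStatus]) {
    // Thread-safe cache: shuffle id -> statuses of its map outputs.
    private val mapStatuses = new ConcurrentHashMap[Int, Array[MapStatus]]()

    def getStatuses(shuffleId: Int): Array[MapStatus] = {
      val cached = mapStatuses.get(shuffleId) // null when the shuffle is not cached
      if (cached != null) cached
      else {
        // Cache miss: ask the driver, then remember the answer.
        // Concurrent callers may both fetch here; this sketch does not deduplicate.
        val fetched = fetchFromDriver(shuffleId)
        mapStatuses.put(shuffleId, fetched)
        fetched
      }
    }
  }
}
```

Only the first lookup for a given shuffle id reaches the driver in this model; later lookups are served from the local map.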