MapOutputTracker

A MapOutputTracker is a Spark service to track the locations of the (shuffle) map outputs of a stage. It uses an internal MapStatus map with an array of MapStatus for every partition for a shuffle id.

There are two versions of MapOutputTracker:

MapOutputTracker is available under SparkEnv.get.mapOutputTracker. It is also available as MapOutputTracker in the driver’s RPC Environment.

Tip

Enable DEBUG logging level for org.apache.spark.MapOutputTracker logger to see what happens in MapOutputTracker.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.MapOutputTracker=DEBUG

It works with ShuffledRDD when it asks for preferred locations for a shuffle using tracker.getPreferredLocationsForShuffle.

It is also used for mapOutputTracker.containsShuffle and MapOutputTrackerMaster.registerShuffle when a new ShuffleMapStage is created.

Caution
FIXME DAGScheduler.mapOutputTracker

MapOutputTrackerMaster.getStatistics(dependency) returns MapOutputStatistics that becomes the result of JobWaiter.taskSucceeded for ShuffleMapStage if it’s the final stage in a job.

MapOutputTrackerMaster.registerMapOutputs for a shuffle id and a list of MapStatus when a ShuffleMapStage is finished.

unregisterShuffle

Caution
FIXME

MapStatus

A MapStatus is the result returned by a ShuffleMapTask to DAGScheduler that includes:

  • the location where ShuffleMapTask ran (as def location: BlockManagerId)

  • an estimated size for the reduce block, in bytes (as def getSizeForBlock(reduceId: Int): Long).

There are two types of MapStatus:

  • CompressedMapStatus that compresses the estimated map output size to 8 bits (Byte) for efficient reporting.

  • HighlyCompressedMapStatus that stores the average size of non-empty blocks, and a compressed bitmap for tracking which blocks are empty.

When the number of blocks (the size of uncompressedSizes) is greater than 2000, HighlyCompressedMapStatus is chosen.

Caution
FIXME What exactly is 2000? Is this the number of tasks in a job?
Caution
FIXME Review ShuffleManager

Epoch Number

Caution
FIXME

MapOutputTrackerWorker

A MapOutputTrackerWorker is the MapOutputTracker for executors. The internal mapStatuses map serves as a cache and any miss triggers a fetch from the driver’s MapOutputTrackerMaster.

Note
The only difference between MapOutputTrackerWorker and the base abstract class MapOutputTracker is that the internal mapStatuses mapping between ints and an array of MapStatus objects is an instance of the thread-safe java.util.concurrent.ConcurrentHashMap.

results matching ""

    No results matching ""