MapOutputTrackerMaster

A MapOutputTrackerMaster is the MapOutputTracker for a driver.

A MapOutputTrackerMaster is the source of truth for the collection of MapStatus objects (map output locations) per shuffle id (as recorded from ShuffleMapTasks).

MapOutputTrackerMaster uses Spark’s org.apache.spark.util.TimeStampedHashMap for mapStatuses.

Note
There is currently a hardcoded limit of map and reduce tasks above which Spark does not assign preferred locations aka locality preferences based on map output sizes — 1000 for map and reduce each.

It uses MetadataCleaner with MetadataCleanerType.MAP_OUTPUT_TRACKER as cleanerType and cleanup function to drop entries in mapStatuses.

You should see the following INFO message when the MapOutputTrackerMaster is created (FIXME it uses MapOutputTrackerMasterEndpoint):

INFO SparkEnv: Registering MapOutputTracker

registerShuffle Method

Caution
FIXME

getStatistics Method

Caution
FIXME

unregisterMapOutput Method

Caution
FIXME

registerMapOutputs Method

Caution
FIXME

incrementEpoch Method

Caution
FIXME

cleanup Function for MetadataCleaner

cleanup(cleanupTime: Long) method removes old entries in mapStatuses and cachedSerializedStatuses that have timestamp earlier than cleanupTime.

It uses org.apache.spark.util.TimeStampedHashMap.clearOldValues method.

Tip

Enable DEBUG logging level for org.apache.spark.util.TimeStampedHashMap logger to see what happens in TimeStampedHashMap.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.util.TimeStampedHashMap=DEBUG

You should see the following DEBUG message in the logs for entries being removed:

DEBUG Removing key [entry.getKey]

getEpoch Method

Caution
FIXME

Settings

Table 1. MapOutputTrackerMaster’s Spark Properties
Spark Property Default Value Description

spark.shuffle.reduceLocality.enabled

true

Controls whether to compute locality preferences for reduce tasks.

When enabled (i.e. true), MapOutputTrackerMaster computes the preferred hosts on which to run a given map output partition in a given shuffle, i.e. the nodes that the most outputs for that partition are on.

results matching ""

    No results matching ""