掌握Apache Spark 2.0
Introduction
Overview of Apache Spark
Spark SQL
Spark SQL — Structured Queries on Large Scale
SparkSession — The Entry Point to Spark SQL
Builder — Building SparkSession with Fluent API
Datasets — Strongly-Typed DataFrames with Encoders
Encoders — Internal Row Converters
InternalRow — Internal Binary Row Format
DataFrame — Dataset of Rows
Row
RowEncoder — DataFrame Encoder
Schema — Structure of Data
StructType
StructField
Data Types
Dataset Operators
Column Operators
Standard Functions — functions object
User-Defined Functions (UDFs)
Aggregation — Typed and Untyped Grouping
UserDefinedAggregateFunction — User-Defined Aggregate Functions (UDAFs)
Window Aggregate Operators — Windows
Joins
Caching
DataSource API — Loading and Saving Datasets
DataFrameReader — Reading from External Data Sources
DataFrameWriter
DataSource
DataSourceRegister
CSVFileFormat
ParquetFileFormat
Custom Formats
BaseRelation
SparkPlanner — Query Planner
DDLStrategy
FileSourceStrategy
DataSourceStrategy
JoinSelection
DataSinks
Structured Query Plan
QueryPlanner — Transforming Logical Plans to Physical Queries
Query Execution
LogicalPlan — Logical Query Plan
LocalRelation Logical Operator
Join Logical Operator
ExplainCommand Logical Command
Logical Query Plan Analyzer
CheckAnalysis
SparkPlan — Physical Execution Plan
LocalTableScanExec Physical Operator
WindowExec Physical Operator
CoalesceExec Physical Operator
ExecutedCommandExec Physical Operator
BroadcastNestedLoopJoinExec Physical Operator
Debugging Query Execution
Datasets vs DataFrames vs RDDs
SQLConf
Catalog
ExternalCatalog — System Catalog of Permanent Entities
SessionState
SQL Parser Framework
SQLExecution Helper Object
Logical Query Plan Optimizer
Predicate Pushdown / Filter Pushdown
Combine Typed Filters
Propagate Empty Relation
Simplify Casts
Column Pruning
Constant Folding
Nullability (NULL Value) Propagation
Vectorized Parquet Decoder
GetCurrentDatabase / ComputeCurrentTime
Eliminate Serialization
CatalystSerde
Tungsten Execution Backend (aka Project Tungsten)
Whole-Stage Code Generation (CodeGen)
Hive Integration
Spark SQL CLI - spark-sql
CacheManager
Thrift JDBC/ODBC Server — Spark Thrift Server (STS)
SparkSQLEnv
Catalyst — Tree Manipulation Framework
TreeNode
Expression TreeNode
Attribute Expression
Generator
(obsolete) SQLContext
Settings
Spark MLlib
Spark MLlib — Machine Learning in Spark
ML Pipelines and PipelineStages (spark.ml)
ML Pipeline Components — Transformers
ML Pipeline Components — Estimators
ML Pipeline Models
Evaluators
CrossValidator
Params and ParamMaps
ML Persistence — Saving and Loading Models and Pipelines
Example — Text Classification
Example — Linear Regression
Latent Dirichlet Allocation (LDA)
Vector
LabeledPoint
Streaming MLlib
GeneralizedLinearRegression
Structured Streaming
Structured Streaming — Streaming Datasets
DataStreamReader
DataStreamWriter
Streaming Sources
FileStreamSource
KafkaSource
TextSocketSource
MemoryStream
Streaming Sinks
ConsoleSink
ForeachSink
StreamSourceProvider — Streaming Source Provider
KafkaSourceProvider
TextSocketSourceProvider
StreamSinkProvider
StreamingQueryManager
StreamingQuery
Trigger
StreamExecution
StreamingRelation
StreamingQueryListenerBus
MemoryPlan Logical Query Plan
Spark Streaming
Spark Streaming
StreamingContext
Stream Operators
Windowed Operators
SaveAs Operators
Stateful Operators
web UI and Streaming Statistics Page
Streaming Listeners
Checkpointing
JobScheduler
InputInfoTracker
JobGenerator
DStreamGraph
Discretized Streams (DStreams)
Input DStreams
ReceiverInputDStreams
ConstantInputDStreams
ForEachDStreams
WindowedDStreams
MapWithStateDStreams
StateDStreams
TransformedDStream
Receivers
ReceiverTracker
ReceiverSupervisors
ReceivedBlockHandlers
Ingesting Data from Kafka
KafkaUtils — Creating Kafka DStreams and RDDs
DirectKafkaInputDStream — Direct Kafka DStream
ConsumerStrategy — Kafka Consumers' Post-Configuration API
ConsumerStrategies Factory Object
LocationStrategy — Preferred Hosts per Topic Partitions
KafkaRDD
HasOffsetRanges and OffsetRange
RecurringTimer
Backpressure
Dynamic Allocation (Elastic Scaling)
ExecutorAllocationManager
StreamingSource
Settings
Spark Core / Tools
Spark Shell — spark-shell shell script
Web UI — Spark Application’s Web Console
Jobs Tab
Stages Tab
Stages for All Jobs
Stage Details
Pool Details
Storage Tab
Environment Tab
EnvironmentListener Spark Listener
Executors Tab
ExecutorsListener Spark Listener
SQL Tab
SQLListener Spark Listener
JobProgressListener Spark Listener
StorageStatusListener Spark Listener
StorageListener Spark Listener
RDDOperationGraphListener Spark Listener
BlockStatusListener Spark Listener
SparkUI
Spark Submit — spark-submit shell script
SparkSubmitArguments
SparkSubmitOptionParser — spark-submit's Command-Line Parser
SparkSubmitCommandBuilder Command Builder
spark-class shell script
AbstractCommandBuilder
SparkLauncher — Launching Spark Applications Programmatically
Spark Core / Architecture
Spark Architecture
Driver
Executors
TaskRunner
ExecutorSource
Master
Workers
Spark Core / RDD
Anatomy of Spark Application
SparkConf — Programmable Configuration for Spark Applications
Spark Properties and spark-defaults.conf Properties File
Deploy Mode
SparkContext
HeartbeatReceiver RPC Endpoint
Inside Creating SparkContext
ConsoleProgressBar
Local Properties — Creating Logical Job Groups
RDD - Resilient Distributed Dataset
Operators
Transformations
Actions
RDD Lineage — Logical Execution Plan
Partitions and Partitioning
Shuffling
Checkpointing
Dependencies
ParallelCollectionRDD
ParallelCollectionRDD
MapPartitionsRDD
PairRDDFunctions
CoGroupedRDD
HadoopRDD
ShuffledRDD
BlockRDD
Spark Core / Optimizations
Caching and Persistence
Broadcast variables
Accumulators
Spark Core / Services
SerializerManager
MemoryManager — Memory Management
UnifiedMemoryManager
SparkEnv — Spark Runtime Environment
DAGScheduler
Jobs
Stages
ShuffleMapStage — Intermediate Stage in Job
ResultStage — Final Stage in Job
DAGSchedulerEventProcessLoop — dag-scheduler-event-loop DAGScheduler Event Bus
JobListener and JobWaiter
Task Scheduler
Tasks
TaskSets
Schedulable
TaskSetManager
Schedulable Pool
Schedulable Builders
FIFOSchedulableBuilder
FairSchedulableBuilder
Scheduling Mode — spark.scheduler.mode Spark Property
TaskSchedulerImpl — Default TaskScheduler
Speculative Execution of Tasks
TaskResultGetter
TaskContext
TaskResults — DirectTaskResult and IndirectTaskResult
TaskMemoryManager
MemoryConsumer
TaskMetrics
TaskSetBlacklist — Blacklisting Executors and Nodes For TaskSet
Scheduler Backend
CoarseGrainedSchedulerBackend
Executor Backend
CoarseGrainedExecutorBackend
BlockManager
MemoryStore
DiskStore
BlockDataManager
ShuffleClient
BlockTransferService
BlockManagerMaster — BlockManager for Driver
BlockInfoManager
BlockInfo
Dynamic Allocation (of Executors)
ExecutorAllocationManager — Allocation Manager for Spark Core
ExecutorAllocationClient
ExecutorAllocationListener
ExecutorAllocationManagerSource
Shuffle Manager
ExternalShuffleService
ExternalClusterManager — Pluggable Cluster Managers
HTTP File Server
Broadcast Manager
Data Locality
Cache Manager
Spark, Akka and Netty
OutputCommitCoordinator
RPC Environment (RpcEnv)
Netty-based RpcEnv
ContextCleaner
MapOutputTracker
MapOutputTrackerMaster
TransportConf — Transport Configuration
Spark Deployment Environments
Deployment Environments — Run Modes
Spark local (pseudo-cluster)
Spark on cluster
Spark on YARN
Spark on YARN
YarnShuffleService — ExternalShuffleService on YARN
ExecutorRunnable
Client
YarnRMClient
ApplicationMaster
AMEndpoint — ApplicationMaster RPC Endpoint
YarnClusterManager — ExternalClusterManager for YARN
TaskSchedulers for YARN
YarnScheduler
YarnClusterScheduler
SchedulerBackends for YARN
YarnSchedulerBackend
YarnClientSchedulerBackend
YarnClusterSchedulerBackend
YarnSchedulerEndpoint RPC Endpoint
YarnAllocator
Introduction to Hadoop YARN
Setting up YARN Cluster
Kerberos
ConfigurableCredentialManager
ClientDistributedCacheManager
YarnSparkHadoopUtil
Settings
Spark Standalone
Spark Standalone
Standalone Master
Standalone Worker
web UI
Submission Gateways
Management Scripts for Standalone Master
Management Scripts for Standalone Workers
Checking Status
Example 2-workers-on-1-node Standalone Cluster (one executor per worker)
StandaloneSchedulerBackend
Spark on Mesos
Spark on Mesos
MesosCoarseGrainedSchedulerBackend
About Mesos
Execution Model
Execution Model
Security
Spark Security
Securing Web UI
Spark Core / Data Sources
Data Sources in Spark
Using Input and Output (I/O)
Spark and Parquet
Serialization
Spark and Cassandra
Spark and Kafka
Couchbase Spark Connector
Spark GraphX
Spark GraphX — Distributed Graph Computations
Graph Algorithms
Monitoring, Tuning and Debugging
Unified Memory Management
Spark History Server
HistoryServer
SQLHistoryListener
FsHistoryProvider
HistoryServerArguments
Logging
Performance Tuning
Spark Metrics System
MetricsConfig — Metrics System Configuration
Metrics Source
Spark Listeners — Intercepting Events from Spark Scheduler
LiveListenerBus
ReplayListenerBus
EventLoggingListener — Event Logging
StatsReportListener — Logging Summary Statistics
Debugging Spark using sbt
Varia
Building Apache Spark from Sources
Spark and Hadoop
Spark and software in-memory file systems
Spark and The Others
Distributed Deep Learning on Spark
Spark Packages
Interactive Notebooks
Interactive Notebooks
Apache Zeppelin
Spark Notebook
Spark Tips and Tricks
Spark Tips and Tricks
Access private members in Scala in Spark shell
SparkException: Task not serializable
Running Spark on Windows
Exercises
One-liners using PairRDDFunctions
Learning Jobs and Partitions Using take Action
Spark Standalone - Using ZooKeeper for High-Availability of Master
Spark’s Hello World using Spark shell and Scala
WordCount using Spark shell
Your first complete Spark application (using Scala and sbt)
Spark (notable) use cases
Using Spark SQL to update data in Hive using ORC files
Developing Custom SparkListener to monitor DAGScheduler in Scala
Developing RPC Environment
Developing Custom RDD
Working with Datasets using JDBC (and PostgreSQL)
Causing Stage to Fail
Further Learning
Courses
Books
Spark Distributions
DataStax Enterprise
MapR Sandbox for Hadoop (Spark 1.5.2 only)
Spark Workshop
Spark Advanced Workshop
Requirements
Day 1
Day 2
Spark Talk Ideas
Spark Talks Ideas (STI)
10 Lesser-Known Tidbits about Spark Standalone
Learning Spark internals using groupBy (to cause shuffle)
Powered by
GitBook
TaskSetBlacklist — Blacklisting Executors and Nodes For TaskSet
TaskSetBlacklist
— Blacklisting Executors and Nodes For TaskSet
Caution
FIXME
results matching "
"
No results matching "
"