General Hadoop Performance Considerations
MapReduce Jobs and Tasks
- Each ScaleR algorithm running in MapReduce invokes one or more MapReduce Jobs, one after another
- Each MapReduce Job consists of one or more Map tasks
- Map tasks can execute in parallel
- Set RxHadoopMR( … consoleOutput=TRUE … ) to track job progress (see the sketch after this list)
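A minimal sketch of setting up such a compute context follows, assuming the standard RxHadoopMR arguments; the user name, host name, and share directories are placeholders to be replaced with values for your cluster:

    # Hypothetical cluster details -- replace with values for your environment
    myHadoopCluster <- RxHadoopMR(
        sshUsername   = "analyst",                 # assumed edge-node user
        sshHostname   = "hadoop-edge.example.com", # assumed edge-node host
        shareDir      = "/var/RevoShare/analyst",  # local exchange directory
        hdfsShareDir  = "/user/RevoShare/analyst", # HDFS exchange directory
        consoleOutput = TRUE                       # echo MapReduce job progress to the R console
    )

    # Subsequent ScaleR calls (rxLogit, rxLinMod, rxDForest, ...) now run in Hadoop
    rxSetComputeContext(myHadoopCluster)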
MapReduce Job and Task Scaling
- Random Forest with rxExec (small to medium data)
  - #jobs = 1
  - #tasks = nTrees (default is 10)
- Random Forest (large data, e.g. 100 GB+)
  - #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2; see the first sketch after this list)
  - #tasks = #inputSplits
- Logistic Regression, GLM, k-Means
  - #jobs = #iterations (typically 4-15)
  - #tasks = #inputSplits
- Linear Regression, Ridge Regression, rxImport
  - #jobs = 1-2
  - #tasks = #inputSplits
- Control #inputSplits by setting mapred.min.split.size (see the second sketch after this list)
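For the large-data Random Forest case, the number of jobs grows roughly with nTrees * maxDepth, so it pays to start small and scale up once the per-job timing is understood. A minimal sketch, assuming the Hadoop compute context set above and a hypothetical airline XDF file on HDFS (the path and variable names are placeholders):

    # Hypothetical data source on HDFS
    hdfsFS  <- RxHdfsFileSystem()
    airData <- RxXdfData("/share/airline/airlineXdf", fileSystem = hdfsFS)

    # Start with a small forest (2 trees x depth 2, on the order of 4 jobs),
    # then raise nTree and maxDepth toward the defaults (10 x 10)
    rfSmall <- rxDForest(ArrDelay ~ DayOfWeek + CRSDepTime,
                         data = airData,
                         nTree = 2, maxDepth = 2)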
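Because #tasks tracks #inputSplits for these algorithms, raising mapred.min.split.size yields fewer, larger splits and therefore fewer Map tasks per job. One way to pass the setting is sketched below, assuming the hadoopSwitches argument of RxHadoopMR; the 1 GB value is only an example:

    # Ask Hadoop for splits of at least ~1 GB, reducing #tasks per job;
    # hadoopSwitches passes generic options through to the hadoop command line
    myHadoopCluster <- RxHadoopMR(
        sshUsername    = "analyst",
        sshHostname    = "hadoop-edge.example.com",
        shareDir       = "/var/RevoShare/analyst",
        hdfsShareDir   = "/user/RevoShare/analyst",
        consoleOutput  = TRUE,
        hadoopSwitches = "-Dmapred.min.split.size=1073741824"  # 1 GB in bytes
    )
    rxSetComputeContext(myHadoopCluster)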