General Hadoop Performance Considerations

MapReduce Jobs and Tasks
  • Each ScaleR algorithm running in MapReduce invokes one or more MapReduce Jobs, one after another
  • Each MapReduce Job consists of one or more Map tasks
  • Map tasks can execute in parallel
  • Set RxHadoopMR( …  consoleOutput=TRUE … ) to track job progress
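The compute-context call referred to above can be sketched as follows. This is a minimal, hedged example: the share directories, user name, and host name are hypothetical placeholders, not values from this article.

```r
# Minimal sketch (RevoScaleR): create a Hadoop MapReduce compute context
# with consoleOutput = TRUE so each MapReduce job's progress is printed
# as the ScaleR algorithm runs. All paths and names below are placeholders.
cc <- RxHadoopMR(
  hdfsShareDir  = "/user/RevoShare/myuser",  # shared HDFS directory (placeholder)
  shareDir      = "/var/RevoShare/myuser",   # shared local directory (placeholder)
  sshUsername   = "myuser",                  # placeholder
  sshHostname   = "mycluster-edge-node",     # placeholder
  consoleOutput = TRUE                       # track job progress in the console
)
rxSetComputeContext(cc)
```

Once this context is set, subsequent ScaleR calls (rxLogit, rxLinMod, and so on) run as MapReduce jobs on the cluster and report their progress to the R console.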
MapReduce Job and Task Scaling
  • Random Forest with rxExec (small to medium data)
    • #jobs = 1
    • #tasks = nTrees (default is 10)
  • Random Forest (large data, e.g. 100 GB+)
    • #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2)
    • #tasks = #inputSplits
  • Logistic Regression, GLM, k-Means
    • #jobs = #iterations (typically 4 - 15 iterations)
    • #tasks = #inputSplits
  • Linear Regression, Ridge Regression, rxImport
    • #jobs = 1-2
    • #tasks = #inputSplits
  • Control #inputSplits by setting mapred.min.split.size
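Because #tasks equals #inputSplits for most of the algorithms above, the minimum split size is the main lever for map-task parallelism. A sketch of passing that setting through the compute context follows; the property name is the classic mapred.* spelling used in this article (newer Hadoop versions call it mapreduce.input.fileinputformat.split.minsize), and the HDFS data path is a hypothetical placeholder.

```r
# Minimal sketch: raise the minimum input split size to 256 MB so the
# input file is divided into fewer, larger splits (and hence fewer,
# larger map tasks per job).
cc <- RxHadoopMR(
  hadoopSwitches = "-Dmapred.min.split.size=268435456",  # 256 MB in bytes
  consoleOutput  = TRUE
)
rxSetComputeContext(cc)

# With this context, an iterative algorithm such as rxLogit runs as
# #iterations MapReduce jobs, each with #inputSplits map tasks.
# The .xdf path and formula below are placeholders for illustration.
inData <- RxXdfData("/share/loanData", fileSystem = RxHdfsFileSystem())
model  <- rxLogit(default ~ balance + income, data = inData)
```

Larger splits reduce per-task startup overhead but also reduce parallelism, so the best value depends on cluster size and data volume.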
Article ID: 3104164 - Last updated: Nov 1, 2015 - Revision: 1