General Hadoop Performance Considerations

MapReduce Jobs and Tasks
  • Each ScaleR algorithm running in MapReduce invokes one or more MapReduce jobs, executed sequentially
  • Each MapReduce job consists of one or more map tasks
  • Map tasks can execute in parallel
  • Set RxHadoopMR( …  consoleOutput=TRUE … ) to track job progress
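As a sketch, enabling progress output means setting consoleOutput=TRUE on the Hadoop compute context before running a ScaleR algorithm. The directory paths and host name below are hypothetical placeholders; substitute values for your own cluster:

```r
# Minimal sketch of a Hadoop compute context with job progress enabled.
# hdfsShareDir, shareDir, and sshHostname values are assumptions for illustration.
myHadoopCC <- RxHadoopMR(
    hdfsShareDir  = "/user/RevoShare/myuser",  # shared HDFS working directory
    shareDir      = "/var/RevoShare/myuser",   # shared local working directory
    sshHostname   = "my-edge-node",            # cluster edge-node host name
    consoleOutput = TRUE                       # stream MapReduce job progress to the R console
)
rxSetComputeContext(myHadoopCC)
```

With this context active, each ScaleR call (rxLogit, rxDForest, and so on) prints the progress of its underlying MapReduce jobs as they run.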
MapReduce Job and Task Scaling
  • Random Forest with rxExec (small to medium data)
    • #jobs = 1
    • #tasks = nTrees (default is 10)
  • Random Forest (large data, e.g. 100 GB+)
    • #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2)
    • #tasks = #inputSplits
  • Logistic Regression, GLM, k-Means
    • #jobs = #iterations (typically 4 - 15 iterations)
    • #tasks = #inputSplits
  • Linear Regression, Ridge Regression, rxImport
    • #jobs = 1-2
    • #tasks = #inputSplits
  • Control #inputSplits by setting mapred.min.split.size
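Since most of the task counts above are driven by #inputSplits, one practical lever is the minimum split size. A sketch of passing mapred.min.split.size through the compute context is below; the hadoopSwitches argument forwards generic -D options to the underlying Hadoop job, and the 256 MB value is an illustrative assumption, not a recommendation:

```r
# Sketch: raise the minimum input split size so the data is divided into
# fewer, larger splits, and therefore fewer map tasks per job.
# mapred.min.split.size is the classic MR1 property name used in this article;
# on YARN/MR2 clusters the equivalent is
# mapreduce.input.fileinputformat.split.minsize.
myHadoopCC <- RxHadoopMR(
    hadoopSwitches = "-Dmapred.min.split.size=268435456",  # 256 MB in bytes (assumed value)
    consoleOutput  = TRUE
)
rxSetComputeContext(myHadoopCC)
```

Fewer, larger splits reduce per-task startup overhead but also reduce parallelism, so the right value depends on data size and cluster capacity.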
Note This is a "FAST PUBLISH" article created directly from within the Microsoft support organization. The information contained herein is provided as-is in response to emerging issues. As a result of the speed in making it available, the materials may include typographical errors and may be revised at any time without notice. See Terms of Use for other considerations.
Properties

Article ID: 3104164 - Last Review: 11/01/2015 04:45:00 - Revision: 1.0

