General Hadoop Performance Considerations

MapReduce Jobs and Tasks

  • Each ScaleR algorithm running in MapReduce invokes one or more MapReduce Jobs, one after another

  • Each MapReduce Job consists of one or more Map tasks

  • Map tasks can execute in parallel

  • Set RxHadoopMR( …  consoleOutput=TRUE … ) to track job progress
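
The job and task structure above can be pictured with a small back-of-the-envelope model. The Python sketch below is illustrative only: the function name and `map_slots` (how many Map tasks the cluster can run at the same time) are assumptions introduced here, not ScaleR or Hadoop settings. It counts how many sequential "waves" of Map tasks a run needs when jobs execute one after another and the tasks within a job run in parallel.

```python
import math

def total_map_waves(tasks_per_job, map_slots):
    """Count sequential waves of Map tasks across a chain of jobs.

    tasks_per_job: list with the number of Map tasks in each job;
                   the jobs run one after another.
    map_slots:     hypothetical number of Map tasks that can run
                   at the same time on the cluster.
    """
    if map_slots < 1:
        raise ValueError("map_slots must be at least 1")
    # Within a job, tasks run in parallel, up to map_slots at a time,
    # so a job with n tasks needs ceil(n / map_slots) waves.
    return sum(math.ceil(n / map_slots) for n in tasks_per_job)

# Example: 3 jobs with 40 Map tasks each on 16 hypothetical slots.
print(total_map_waves([40, 40, 40], map_slots=16))  # 3 * ceil(40 / 16) = 9
```

Under this simplified model, adding jobs always adds waves, while adding tasks to a job only adds waves once the tasks exceed the available slots.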

MapReduce Job and Task Scaling

  • Random Forest with rxExec (small to medium data)

    • #jobs = 1

    • #tasks = nTrees (default is 10)

  • Random Forest (large data, e.g. 100 GB+)

    • #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2)

    • #tasks = #inputSplits

  • Logistic Regression, GLM, k-Means

    • #jobs = #iterations (typically 4 - 15 iterations)

    • #tasks = #inputSplits

  • Linear Regression, Ridge Regression, rxImport

    • #jobs = 1-2

    • #tasks = #inputSplits

  • Control #inputSplits by setting mapred.min.split.size
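
The scaling rules above reduce to simple arithmetic, sketched below in Python. This is a rough estimator under stated assumptions, not part of ScaleR: all function names and workload labels are hypothetical, #tasks is read as the number of Map tasks per job, and the input is treated as one large splittable file. The split-size rule follows Hadoop's `FileInputFormat` (split size = max(min split size, min(max split size, block size))), which is why raising `mapred.min.split.size` (a value in bytes) above the block size lowers #inputSplits.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def effective_split_size(block_size, min_split_size=1, max_split_size=None):
    """Split size per Hadoop's FileInputFormat rule, in bytes."""
    if max_split_size is None:
        max_split_size = float("inf")
    return max(min_split_size, min(max_split_size, block_size))

def estimate_input_splits(data_bytes, split_size):
    """Approximate split count; ignores per-file boundaries (assumption)."""
    return max(1, math.ceil(data_bytes / split_size))

def estimate_jobs_and_tasks(workload, input_splits,
                            n_trees=10, max_depth=10, iterations=10):
    """Return (min_jobs, max_jobs, tasks_per_job) for a hypothetical label."""
    if workload == "rf_rxexec":      # Random Forest with rxExec
        return (1, 1, n_trees)
    if workload == "rf_large":       # Random Forest on large data
        jobs = n_trees * max_depth   # approximate, per the rule above
        return (jobs, jobs, input_splits)
    if workload == "iterative":      # Logistic Regression, GLM, k-Means
        return (iterations, iterations, input_splits)
    if workload == "single_pass":    # Linear/Ridge Regression, rxImport
        return (1, 2, input_splits)
    raise ValueError(f"unknown workload: {workload}")

# Example: 100 GB of input, 128 MB blocks, min split size raised to 256 MB.
split = effective_split_size(128 * MB, min_split_size=256 * MB)
splits = estimate_input_splits(100 * GB, split)
print(splits)                                                        # 400
print(estimate_jobs_and_tasks("rf_large", splits, n_trees=2, max_depth=2))
# (4, 4, 400)
```

In this example, doubling the split size from the 128 MB block size to 256 MB halves the estimated Map tasks per job from 800 to 400, and starting the large-data Random Forest at 2 x 2 keeps the run to about 4 jobs instead of about 100.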
