General Hadoop Performance Considerations
MapReduce Jobs and Tasks
- Each ScaleR algorithm running in MapReduce invokes one or more MapReduce Jobs, one after another
- Each MapReduce Job consists of one or more Map tasks
- Map tasks can execute in parallel
- Set RxHadoopMR( … consoleOutput=TRUE … ) to track job progress (see the sketch after this list)
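A minimal sketch of setting up such a compute context follows, assuming the standard RxHadoopMR arguments; the user name, host name, and share directories are placeholders to be replaced with values for your cluster:

    # Hypothetical cluster details -- replace with values for your environment
    myHadoopCluster <- RxHadoopMR(
        sshUsername   = "analyst",                 # assumed edge-node user
        sshHostname   = "hadoop-edge.example.com", # assumed edge-node host
        shareDir      = "/var/RevoShare/analyst",  # local exchange directory
        hdfsShareDir  = "/user/RevoShare/analyst", # HDFS exchange directory
        consoleOutput = TRUE                       # echo MapReduce job progress to the R console
    )

    # Subsequent ScaleR calls (rxLogit, rxLinMod, rxDForest, ...) now run in Hadoop
    rxSetComputeContext(myHadoopCluster)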
MapReduce Job and Task Scaling
- Random Forest with rxExec (small to medium data)
  - #jobs = 1
  - #tasks = nTrees (default is 10)
- Random Forest (large data, e.g. 100 GB+)
  - #jobs ~ nTrees * maxDepth (default is 10 x 10; start smaller, e.g. 2 x 2; see the first sketch after this list)
  - #tasks = #inputSplits
- Logistic Regression, GLM, k-Means
  - #jobs = #iterations (typically 4-15)
  - #tasks = #inputSplits
- Linear Regression, Ridge Regression, rxImport
  - #jobs = 1-2
  - #tasks = #inputSplits
- Control #inputSplits by setting mapred.min.split.size (see the second sketch after this list)
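For the large-data Random Forest case, the number of jobs grows roughly with nTrees * maxDepth, so it pays to start small and scale up once the per-job timing is understood. A minimal sketch, assuming the Hadoop compute context set above and a hypothetical airline XDF file on HDFS (the path and variable names are placeholders):

    # Hypothetical data source on HDFS
    hdfsFS  <- RxHdfsFileSystem()
    airData <- RxXdfData("/share/airline/airlineXdf", fileSystem = hdfsFS)

    # Start with a small forest (2 trees x depth 2, on the order of 4 jobs),
    # then raise nTree and maxDepth toward the defaults (10 x 10)
    rfSmall <- rxDForest(ArrDelay ~ DayOfWeek + CRSDepTime,
                         data = airData,
                         nTree = 2, maxDepth = 2)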
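Because #tasks tracks #inputSplits for these algorithms, raising mapred.min.split.size yields fewer, larger splits and therefore fewer Map tasks per job. One way to pass the setting is sketched below, assuming the hadoopSwitches argument of RxHadoopMR; the 1 GB value is only an example:

    # Ask Hadoop for splits of at least ~1 GB, reducing #tasks per job;
    # hadoopSwitches passes generic options through to the hadoop command line
    myHadoopCluster <- RxHadoopMR(
        sshUsername    = "analyst",
        sshHostname    = "hadoop-edge.example.com",
        shareDir       = "/var/RevoShare/analyst",
        hdfsShareDir   = "/user/RevoShare/analyst",
        consoleOutput  = TRUE,
        hadoopSwitches = "-Dmapred.min.split.size=1073741824"  # 1 GB in bytes
    )
    rxSetComputeContext(myHadoopCluster)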