Forest and Boosted Tree Prediction Speed on Hadoop

  • By default, rxPredict launches one MapReduce (MR) job per tree to minimize memory usage

  • For small data sets, call rxPredict inside rxExec or set scheduleOnce=TRUE (available in 7.3) to reduce the scheduling overhead

–      rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = TRUE, ...)

  • For larger data sets, set scheduleOnce=1 to run prediction in parallel using a single MR job (available in 7.3; internally, this uses rxDataStep to call predict.randomForest and requires the randomForest package)

–      rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = 1, ...)
