Tuning Forest and Boosted Tree Prediction Speed on Hadoop

  • By default, rxPredict launches one MapReduce (MR) job per tree to minimize memory usage.
  • For small data sets, call rxPredict inside rxExec or set scheduleOnce=TRUE (available in 7.3) to reduce the scheduling overhead:
–      rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = TRUE, ...)
  • For larger data sets, set scheduleOnce=1 to run prediction in parallel using a single MR job (available in 7.3; internally, this uses rxDataStep to call predict.randomForest and requires the randomForest package):
–      rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = 1, ...)
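
The calls above assume that a fitted forest object and HDFS data sources already exist. As a rough sketch of how the pieces might fit together, the wrapper below builds the data sources and forwards scheduleOnce to rxPredict. The function name predictForestOnHadoop and the HDFS paths are hypothetical, and the sketch assumes RevoScaleR 7.3 or later with the compute context already set to Hadoop; the meaning of each scheduleOnce value is taken from the bullets above, not verified here.

```r
# Hypothetical wrapper (not part of RevoScaleR): builds HDFS data sources and
# forwards scheduleOnce to rxPredict. Assumes the compute context has already
# been set to Hadoop by the caller.
predictForestOnHadoop <- function(dforestObject, inputPath, outputPath,
                                  scheduleOnce = TRUE) {
  if (!requireNamespace("RevoScaleR", quietly = TRUE)) {
    stop("RevoScaleR is required for this sketch")
  }

  # Input and output are XDF data sources stored in HDFS.
  hdfs      <- RevoScaleR::RxHdfsFileSystem()
  myData    <- RevoScaleR::RxXdfData(inputPath,  fileSystem = hdfs)
  myOutData <- RevoScaleR::RxXdfData(outputPath, fileSystem = hdfs)

  # scheduleOnce is passed through unchanged, as in the calls shown above.
  RevoScaleR::rxPredict(dforestObject, data = myData, outData = myOutData,
                        scheduleOnce = scheduleOnce)
}

# Example use on a cluster (paths are placeholders; connection arguments for
# RxHadoopMR() are omitted):
#   library(RevoScaleR)
#   rxSetComputeContext(RxHadoopMR())
#   predictForestOnHadoop(dforestObject, "/share/scoring/input",
#                         "/share/scoring/output", scheduleOnce = TRUE)
```

Keeping scheduleOnce as a plain argument makes it easy to time both settings on the same data and pick whichever is faster for a given data size.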
Note: This is a "FAST PUBLISH" article created directly from within the Microsoft support organization. The information contained herein is provided as-is in response to emerging issues. As a result of the speed in making it available, the materials may include typographical errors and may be revised at any time without notice. See Terms of Use for other considerations.
Properties

Article ID: 3104165 - Last Review: 11/01/2015 04:56:00 - Revision: 1.0

Revolution Analytics

  • KB3104165