Tuning Forest and Boosted Tree Prediction Speed on Hadoop

  • By default, rxPredict launches one MapReduce (MR) job per tree to minimize memory usage.
  • For small data sets, call rxPredict inside rxExec, or set scheduleOnce=TRUE (available in 7.3), to reduce the scheduling overhead:
–      rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = TRUE, ...)
  • For larger data sets, set scheduleOnce=1 to run the prediction in parallel in a single MR job (available in 7.3). Internally, this uses rxDataStep to call predict.randomForest, and it requires the randomForest package:
–      rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = 1, ...)
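The rxExec option mentioned above has no example in the article, so here is a minimal sketch of one way it might look. It assumes that RevoScaleR is installed and that a Hadoop compute context has already been set with rxSetComputeContext. The helper name predictInOneJob is hypothetical and not part of RevoScaleR. Whether a given data source can be read and written from inside the rxExec task depends on your cluster setup and has not been verified here.

```r
# Sketch only: predictInOneJob is a hypothetical helper, not a RevoScaleR function.
# Assumes a Hadoop compute context (for example RxHadoopMR) is already active.
predictInOneJob <- function(model, inData, outData, ...) {
  # rxExec submits the call to the scheduler as one job; timesToRun = 1 asks
  # for a single execution, so all trees are scored inside that one task
  # instead of one MR job per tree.
  RevoScaleR::rxExec(
    RevoScaleR::rxPredict,
    modelObject = model,
    data = inData,
    outData = outData,
    ...,
    timesToRun = 1
  )
}
```

A call would then look like predictInOneJob(dforestObject, myData, myOutData). rxExec returns a list with one element per task, so the rxPredict result would be the first element of that list.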
Properties

Article ID: 3104165 - Last Review: 1 Nov 2015 - Revision: 1
