Forest and Boosted Tree Prediction Speed on Hadoop
- By default, rxPredict launches one MapReduce (MR) job per tree to minimize memory usage, so job-scheduling overhead grows with the number of trees
- For smaller data sets, call rxPredict inside rxExec or set scheduleOnce = TRUE (available in 7.3) to reduce the scheduling overhead
– rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = TRUE, ...)
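
The rxExec alternative mentioned above could look roughly like the following. This is a minimal sketch, not a tested recipe: it assumes a Hadoop compute context has already been set for the cluster, that dforestObject, myData, and myOutData are the same objects used in the call above, and that the data source is reachable from the node that runs the task. The variable name predResult is hypothetical.

```r
library(RevoScaleR)

# Assumption: the Hadoop MapReduce compute context was configured earlier with
# rxSetComputeContext(); its arguments are site-specific and are not shown here.

# Wrap the whole prediction in one rxExec task so it is scheduled once,
# rather than letting rxPredict schedule a separate MR job for each tree.
predResult <- rxExec(
  rxPredict,
  modelObject = dforestObject,  # fitted forest or boosted tree model from the text
  data        = myData,         # input data source from the text
  outData     = myOutData,      # output data source from the text
  timesToRun  = 1               # run the function a single time
)
```

With this approach all trees are scored inside a single task, which trades the memory savings of the per-tree default for lower scheduling overhead; that trade-off is why it is suggested only for smaller data sets.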
- For larger data sets, set scheduleOnce = 1 to run prediction in parallel in a single MR job (available in 7.3). Internally, this uses rxDataStep to call predict.randomForest, so it requires the randomForest package
– rxPredict(dforestObject, data = myData, outData = myOutData, scheduleOnce = 1, ...)