Hadoop Composite XDF Block Size tuning suggestions

Hadoop Composite XDF Block Size

 MapReduce splits each input text file into one or more input splits, each of which by default is the size of an HDFS block, e.g. 128 MB
  • Each input split is converted from uncompressed, unparsed text to a compressed and parsed output binary xdfd file in the “data” subdirectory of the output directory – header info for the set of xdfd’s is in a single xdfm metadata file in the “metadata” directory 
  • For efficiency in subsequent analyses, each output xdfd file should roughly match the HDFS block size
  • To compensate for XDF compression, you will therefore usually need to increase the output xdfd file size by increasing the input split size, using this parameter to RxHadoopMR():
    • hadoopSwitches="-Dmapred.min.split.size=1000000000"
    • For more recent Hadoop installations using YARN, the equivalent parameter is mapreduce.input.fileinputformat.split.minsize
  • Increasing the input split size further may reduce the number of composite XDF files and hence the number of parallelized map tasks in subsequent analyses. This may be useful if the number of available map slots or containers is small relative to the number of splits. Conversely, when many map slots or containers are available, smaller input splits and more xdfd’s may result in faster completion.
  • Example
Importing an input CSV of 670 MB in the Hortonworks Sandbox using the default input split size (32 MB) created 670/32, rounded up to 21, xdfd's, with an rxSummary run time of 185 seconds. Increasing the input split size to 150 MB created 5 xdfd's, each about 32 MB after compression, with an rxSummary run time of 68 seconds.
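The arithmetic in the example above can be sketched as follows. This is a rough estimate only, ignoring HDFS block boundaries and record alignment; the function name is illustrative, not part of any API:

```python
import math

def estimated_xdfd_count(input_mb, split_mb):
    """Each input split produces one output xdfd file, so the xdfd
    count is the input size divided by the split size, rounded up.
    Rough estimate: ignores block boundaries and record alignment."""
    return math.ceil(input_mb / split_mb)

# Figures from the Hortonworks Sandbox example above:
print(estimated_xdfd_count(670, 32))   # 21 xdfd's with the 32 MB default split
print(estimated_xdfd_count(670, 150))  # 5 xdfd's with a 150 MB split
```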

 rxSetComputeContext(RxHadoopMR(hadoopSwitches =
     "-Dmapred.min.split.size=150000000"))  # 150 MB split, as in the example above

rxImport(myCSV, myCXdf, overwrite=TRUE)

rxSetComputeContext(RxHadoopMR())  # set it back when done

Article ID: 3104166 - Last Review: 1 Nov 2015 - Revision: 1