Hadoop Composite XDF Block Size tuning suggestions

Hadoop Composite XDF Block Size

 MapReduce divides each input text file into one or more input splits, which by default are the size of an HDFS block (for example, 128 MB).
  • Each input split is converted from uncompressed, unparsed text into a compressed, parsed binary xdfd file in the “data” subdirectory of the output directory. Header information for the set of xdfd files is stored in a single xdfm metadata file in the “metadata” directory.
  • For efficiency in subsequent analyses, each output xdfd file should roughly match the HDFS block size
  • To compensate for XDF compression, you will therefore usually need to increase the output xdfd file size by increasing the input split size, using this parameter to RxHadoopMR():
    • hadoopSwitches="-Dmapred.min.split.size=1000000000"
    • For more recent Hadoop installations using YARN, the parameter is mapreduce.input.fileinputformat.split.minsize
  • Increasing the input split size further may reduce the number of composite XDF files and hence the number of parallelized map tasks in subsequent analyses. This may be useful if the number of available map slots or containers is small relative to the number of splits. Conversely, when many map slots or containers are available, smaller input splits and more xdfd’s may result in faster completion.
  • Example
Importing a 670 MB input CSV in the Hortonworks Sandbox with the default input split size (32 MB) created 670/32 ≈ 21 xdfd files, and a subsequent rxSummary took 185 seconds. Increasing the input split size to 150 MB created 5 xdfd files of about 32 MB each, and rxSummary then took 68 seconds.

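# Raise the minimum input split size to 150 MB (value is in bytes)
rxSetComputeContext(RxHadoopMR(hadoopSwitches =
    "-Dmapreduce.input.fileinputformat.split.minsize=150000000"))

# myCSV and myCXdf are the input text and output composite XDF data sources
rxImport(myCSV, myCXdf, overwrite=TRUE)

rxSetComputeContext(RxHadoopMR())  # set it back when done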
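
The arithmetic behind the example can be sketched in base R. This is only an illustration under stated assumptions: the helper names estimateXdfdCount, suggestSplitMB and makeSplitSwitch are hypothetical and not part of RevoScaleR, 1 MB is taken as 1,000,000 bytes to match the 150000000 setting above, and the compression ratio observed in one import (150 MB of text to roughly 32 MB of xdfd) is assumed to hold at other split sizes, which may not be true for your data.

```r
# Hypothetical helpers (not part of RevoScaleR). Sizes are in MB,
# with 1 MB = 1e6 bytes, matching the 150000000-byte setting above.

# Number of input splits, and therefore xdfd files, for a given input size.
estimateXdfdCount <- function(inputMB, splitMB) {
  ceiling(inputMB / splitMB)
}

# Input split size (MB) needed for xdfd files to approach a target size,
# extrapolated from one observed (split size, xdfd size) pair.
# Assumes the compression ratio stays roughly constant.
suggestSplitMB <- function(targetXdfdMB, observedSplitMB, observedXdfdMB) {
  compressionRatio <- observedSplitMB / observedXdfdMB
  targetXdfdMB * compressionRatio
}

# Build the hadoopSwitches string. Use yarn = FALSE for the older
# mapred.min.split.size property name.
makeSplitSwitch <- function(splitMB, yarn = TRUE) {
  property <- if (yarn) {
    "mapreduce.input.fileinputformat.split.minsize"
  } else {
    "mapred.min.split.size"
  }
  bytes <- format(round(splitMB * 1e6), scientific = FALSE)
  paste0("-D", property, "=", bytes)
}

estimateXdfdCount(670, 32)    # 21 xdfd files at the 32 MB default
estimateXdfdCount(670, 150)   # 5 xdfd files at 150 MB

# Split size that would aim for ~128 MB xdfd files, given the observed ratio
splitMB <- suggestSplitMB(128, observedSplitMB = 150, observedXdfdMB = 32)
splitMB                       # 600
makeSplitSwitch(splitMB)
```

The resulting string could then be passed as the hadoopSwitches argument of RxHadoopMR(), as in the example above. Treat the suggested split size as a starting point and check the actual xdfd file sizes after a trial import.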
Note: This is a "FAST PUBLISH" article created directly from within the Microsoft support organization. The information contained herein is provided as-is in response to emerging issues. As a result of the speed in making it available, the materials may include typographical errors and may be revised at any time without notice. See Terms of Use for other considerations.
Properties

Article ID: 3104166 - Last Review: 11/01/2015 05:05:00 - Revision: 1.0

Revolution Analytics

  • KB3104166