Sign in with Microsoft
Sign in or create an account.
Hello,
Select a different account.
You have multiple accounts
Choose the account you want to sign in with.

Hadoop Composite XDF Block Size

 MapReduce splits each input text file into one or more input splits which by default are the HDFS block size, e.g. 128 MB

  • Each input split is converted from uncompressed, unparsed text to a compressed and parsed output binary xdfd file in the “data” subdirectory of the output directory – header info for the set of xdfd’s is in a single xdfm metadata file in the “metadata” directory

  • For efficiency in subsequent analyses, each output xdfd file should roughly match the HDFS block size

  • To compensate for XDF compression you’ll therefore usually need to increase the output xdfd file size by  increasing the input split size using this parameter to RxHadoopMR():

    • hadoopSwitches="-Dmapred.min.split.size=1000000000"

    • For more recent Hadoop installations using YARN, the parameter is mapreduce.input.fileinputformat.split.minsize

  • Increasing the input split size further may reduce the number of composite XDF files and hence the number of parallelized map tasks in subsequent analyses. This may be useful if the number of available map slots or containers is small relative to the number of splits. Conversely, when many map slots or containers are available, smaller input splits and more xdfd’s may result in faster completion.

  • Example

Importing an input CSV of 670 MB in the Hortonworks Sandbox using the default input split size (32MB) created 670/32=21 xdfd’s with an rxSummary performance of 185”.  Increasing input split size to 150MB created 5 xdfd’s each about 32MB with an rxSummary performance of 68”.

 rxSetComputeContext(RxHadoopMR(hadoopSwitches =

        "-Dmapreduce.input.fileinputformat.split.minsize=150000000"))

rxImport(myCSV, myCXdf, overwrite=TRUE)

rxSetComputeContext(RxHadoopMR())  # set it back when done

Need more help?

Want more options?

Explore subscription benefits, browse training courses, learn how to secure your device, and more.

Communities help you ask and answer questions, give feedback, and hear from experts with rich knowledge.

Was this information helpful?

What affected your experience?
By pressing submit, your feedback will be used to improve Microsoft products and services. Your IT admin will be able to collect this data. Privacy Statement.

Thank you for your feedback!

×