QA: Running MapReduce jobs using RevoScaleR

1. How can customers monitor their MapReduce jobs at 'http://xxxxxxx:50030/'?

You can monitor MapReduce jobs in two ways:

Via the Hadoop JobTracker URL, 'http://<jobTrackerhost>:50030/', where you can drill down into task details.
Via the job output files that Revolution R creates when it runs your MapReduce job. By default these files are deleted after the job finishes, but you can keep them by setting 'autoCleanup = FALSE' when you create the Hadoop compute context with RxHadoopMR(). You can also use the RevoScaleR command 'rxGetJobOutput' to list the Hadoop output from the job.
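As a rough sketch of the second approach, the function below creates a compute context that keeps its job files, submits a trivial non-waiting job, and prints the Hadoop output. It assumes RevoScaleR is installed and that RxHadoopMR() can reach your cluster with its default connection settings; the function name and the test job are illustrative only.

```r
# Sketch only: assumes RevoScaleR is installed and the default
# RxHadoopMR() connection settings work for your cluster.
showHadoopJobOutput <- function() {
  # Keep the job's output files instead of deleting them after the run.
  cc <- RevoScaleR::RxHadoopMR(autoCleanup = FALSE, wait = FALSE)
  RevoScaleR::rxSetComputeContext(cc)

  # Illustrative job: report the host name from four tasks.
  job <- RevoScaleR::rxExec(function() Sys.info()[["nodename"]], timesToRun = 4)

  RevoScaleR::rxWaitForJob(job)     # block until the job finishes
  RevoScaleR::rxGetJobOutput(job)   # print the Hadoop output for the job
  RevoScaleR::rxGetJobResults(job)  # return the per-task results
}
```

Calling showHadoopJobOutput() on a configured client should print the job log. In practice you will probably need to pass your own connection arguments to RxHadoopMR().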

2. Can I control the number of map and reduce tasks when running my code via RxHadoopMR()?

Recently we added an optional parameter to RxHadoopMR() called hadoopSwitches. This argument allows you to specify any generic Hadoop command-line switches. For example, to specify a queue to run the job on, you could do this:

hadoopSwitches = "-Dmapred.job.queue.name=default"

Multiple switches can be set by separating them with a space character, just as one would do in a command line.
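For instance, several switches could be assembled in R before being handed to RxHadoopMR(). This is a minimal sketch; the switch values are examples only, not recommendations.

```r
# Each element is one generic Hadoop switch; the values are examples only.
switches <- c(
  "-Dmapred.job.queue.name=default",
  "-Dmapred.min.split.size=1073741824"  # 1 GB, in bytes
)

# hadoopSwitches takes a single space-separated string.
hadoopSwitches <- paste(switches, collapse = " ")
print(hadoopSwitches)
```

The resulting string would then be supplied as RxHadoopMR(hadoopSwitches = hadoopSwitches) when creating the compute context.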

Controlling the number of mappers in MapReduce is somewhat tricky. The basic rule is that the number of map tasks equals the number of input splits. If your input files are "non-splittable", e.g. certain types of compressed files, then the number of input splits equals the number of input files. The individual files within a Composite XDF set are non-splittable. On the other hand, if your file is splittable, for example it is a CSV file, then FileInputFormat will split that file into chunks close to the HDFS block size, typically 128 MB. If you have a very large CSV file or files (e.g. 10 TB) and do not want too many map tasks, you can set mapred.min.split.size to a large number, thereby getting larger input splits and fewer map tasks. This can be set using the hadoopSwitches argument. The downside of this trick is that you will sacrifice data locality. To have huge splits AND data locality, you need to increase the HDFS block size. There is a little more info at this page: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
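To get a feel for the numbers, the sketch below estimates the map task count for splittable input. It is an approximation under stated assumptions: the split size is taken as the larger of the minimum split size and the HDFS block size, and any maximum split size setting is ignored. The function name estimateMapTasks is made up for this example.

```r
# Rough estimate of map tasks for splittable input files.
# Assumption: split size = max(min split size, HDFS block size), ignoring
# any maximum split size and the small slack Hadoop allows on the last split.
estimateMapTasks <- function(fileBytes, blockBytes = 128 * 1024^2, minSplitBytes = 1) {
  splitBytes <- max(minSplitBytes, blockBytes)
  sum(ceiling(fileBytes / splitBytes))  # at least one split per file
}

tenTB <- 10 * 1024^4
estimateMapTasks(tenTB)                          # 128 MB splits
estimateMapTasks(tenTB, minSplitBytes = 1024^3)  # 1 GB splits
```

Under these assumptions, a single 10 TB file yields 81920 map tasks with 128 MB splits and 10240 with a 1 GB minimum split size.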

For HPC jobs (i.e. rxExec()), you can set the number of map tasks directly using rxExec()'s timesToRun and taskChunkSize arguments. The number of map tasks will be equal to:

timesToRun / taskChunkSize.
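A small worked example of this rule follows. The helper name is hypothetical, and rounding up when timesToRun is not an exact multiple of taskChunkSize is an assumption rather than documented behavior.

```r
# Number of map tasks for an rxExec() job, per the rule above.
# Assumption: a final partial chunk still needs its own task (hence ceiling).
rxExecMapTasks <- function(timesToRun, taskChunkSize) {
  ceiling(timesToRun / taskChunkSize)
}

rxExecMapTasks(timesToRun = 1000, taskChunkSize = 50)
```

Here 1000 runs in chunks of 50 give 20 map tasks. The matching call would look something like rxExec(myFun, timesToRun = 1000, taskChunkSize = 50), where myFun stands for your own function.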

3. Is it possible to call/create a custom Mapper / Reducer function in RevoScaleR?

There are a few ways to do this:

Use 'rxExec()': it lets you distribute and run arbitrary R code in parallel. This assumes that you have already created a Hadoop compute context using 'RxHadoopMR()'.
Use 'rxDataStep()': if you already have an RxHadoopMR() compute context defined, you can use it to call a 'Reducer' function on your data in HDFS, since rxDataStep() can call an arbitrary R function via its 'transformFunc' argument.

Use the 'rmr' package that is part of RHadoop.
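As an illustration of the 'transformFunc' route, a transform function is an ordinary R function that receives a chunk of data as a named list of vectors and returns a named list. The sketch below uses a hypothetical column called amount, and it can be tried locally on a plain list before being run on the cluster.

```r
# Hypothetical transform: add a log-scaled copy of an "amount" column.
# rxDataStep() passes each chunk of data to transformFunc as a named list
# of vectors and expects a named list back.
addLogAmount <- function(dataList) {
  dataList$logAmount <- log1p(dataList$amount)
  dataList
}

# Try the function locally on a small chunk before using it on the cluster.
chunk <- list(amount = c(0, 9, 99))
str(addLogAmount(chunk))
```

With a Hadoop compute context set, the function would be passed as rxDataStep(inData = ..., outFile = ..., transformFunc = addLogAmount, transformVars = "amount"), with the data sources pointing at your own files in HDFS.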

4. For accessing 'Hive/HBase' do you have any specific packages or is it ok to use the 'RHBase' package?

RevoScaleR doesn't contain any specific functionality for Hive/HBase, so you can use the RHBase package to supplement the other R functions that exist in RevoScaleR. If you have an ODBC driver installed for HBase, you can use the RxOdbcData() function to import data and run SQL queries against data stored in HBase. Take a look at the RevoScaleR ODBC Data Import/Export Guide for specific information on how to import data via ODBC:

http://packages.revolutionanalytics.com/doc/7.1.0/linux/RevoScaleR_ODBC.pdf
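A minimal sketch of the ODBC route might look like the following. It assumes RevoScaleR is installed and that a working ODBC driver and data source are configured; the wrapper name importViaOdbc is hypothetical, and any connection string or query you pass in is your own.

```r
# Sketch only: assumes RevoScaleR plus a working ODBC driver and data source.
# The connection string and query supplied by the caller are placeholders.
importViaOdbc <- function(connectionString, sqlQuery) {
  src <- RevoScaleR::RxOdbcData(sqlQuery = sqlQuery,
                                connectionString = connectionString)
  RevoScaleR::rxImport(inData = src)  # returns a data frame when no outFile is given
}
```

A call would then look like importViaOdbc("DSN=<your DSN>", "SELECT ..."), with the DSN and the query filled in for your environment.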
