Tuning Options for ScaleR Text Imports

Windows/Linux Block Size
  • When choosing a block size, select rowsPerRead so that each block holds about 10M elements (rows × columns), or fewer
    • With 20 columns, rowsPerRead=500e3
    • With 1000 columns, rowsPerRead=1000 (about 1M elements, well under the guideline)
  • This tends to give blocks small enough that you can process several blocks per read
  • Use blocksPerRead > 1
    • The exact value depends on how much RAM you have available
    • Generally having multiple blocks in memory simultaneously improves performance
  • It is easy to increase blocksPerRead but expensive to re-block, so err on the side of smaller blocks
  • If you use rxSplit() or rxDataStep() to create samples (e.g. training/validation sets), then use rxDataStep() to re-block the resulting files following the same block size guideline
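
The guidelines above can be sketched in R as follows. This is a minimal illustration, not a definitive recipe: the helper rows_per_read(), the file names, and the blocksPerRead value of 4 are assumptions made for the example, and the RevoScaleR calls are skipped unless the package and the input file are present.

```r
# Target number of elements (rows x columns) per block, per the ~10M guideline.
target_elements <- 10e6

# Hypothetical helper (not part of RevoScaleR): the largest rowsPerRead
# that keeps one block at or under the target element count.
rows_per_read <- function(n_cols, target = target_elements) {
  stopifnot(n_cols >= 1)
  max(1, floor(target / n_cols))
}

rpr_20   <- rows_per_read(20)    # equals 500000, the 20-column example
rpr_1000 <- rows_per_read(1000)  # equals 10000; a smaller value such as 1000 is also fine

# Placeholder file names, assumed for illustration only.
csv_file      <- "mydata.csv"
xdf_file      <- "mydata.xdf"
reblocked_xdf <- "mydata_reblocked.xdf"

# Run the ScaleR steps only where RevoScaleR and the input file are available.
if (requireNamespace("RevoScaleR", quietly = TRUE) && file.exists(csv_file)) {
  library(RevoScaleR)

  # Import the text file into XDF using the chosen block size.
  rxImport(inData = csv_file, outFile = xdf_file,
           rowsPerRead = rpr_20, overwrite = TRUE)

  # Read several blocks per pass; 4 is an arbitrary starting point
  # that should be adjusted to the RAM available.
  xdf <- RxXdfData(xdf_file, blocksPerRead = 4)
  print(rxGetInfo(xdf, getBlockSizes = TRUE))

  # Re-block an existing XDF file (for example, a sample produced by rxSplit())
  # by rewriting it with a new rowsPerRead.
  rxDataStep(inData = xdf_file, outFile = reblocked_xdf,
             rowsPerRead = rpr_20, overwrite = TRUE)
}
```

Because blocksPerRead is set on the data source rather than baked into the file, it can be raised later without rewriting the data, whereas changing rowsPerRead requires a full rxDataStep() pass over the file.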

Note: This is a "FAST PUBLISH" article created directly from within the Microsoft support organization. The information contained herein is provided as-is in response to emerging issues. As a result of the speed in making it available, the materials may include typographical errors and may be revised at any time without notice. See Terms of Use for other considerations.

Article ID: 3104210 - Last Review: 10/29/2015 06:04:00 - Revision: 1.0

Revolution Analytics
