Tuning Options for ScaleR Text Imports

Windows/Linux Block Size
  • When choosing a block size, set rowsPerRead so that each block holds roughly 10M elements (rows x columns), or even fewer
    • With 20 columns, rowsPerRead=500e3 (10M elements)
    • With 1000 columns, rowsPerRead=1000 (1M elements)
  • This tends to give blocks small enough that you can process multiple blocks per read
  • Use blocksPerRead > 1
    • The exact value depends on how much RAM you have available
    • Generally, having multiple blocks in memory simultaneously improves performance
  • It is easy to increase blocksPerRead but expensive to re-block, so err on the side of smaller blocks
  • If you use rxSplit() or rxDataStep() to create samples (e.g. training and validation sets), then use rxDataStep() to re-block the resulting files following the block-size guidance above
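
The block-size arithmetic above can be sketched in base R. This is a minimal illustration, not part of ScaleR: the helper name rows_per_read and its target_elements argument are hypothetical, and the file names in the commented calls are placeholders. The commented calls only show where the result would be passed as the rowsPerRead argument of rxImport() and rxDataStep().

```r
# Hypothetical helper (not a RevoScaleR function): choose rowsPerRead so that
# rows * columns stays at or below a target number of elements per block.
rows_per_read <- function(n_cols, target_elements = 10e6) {
  stopifnot(n_cols >= 1, target_elements >= n_cols)
  as.integer(floor(target_elements / n_cols))
}

rows_20   <- rows_per_read(20)                           # 20 columns, ~10M elements
rows_1000 <- rows_per_read(1000, target_elements = 1e6)  # 1000 columns, ~1M elements

# Assumed usage with RevoScaleR installed (file names are placeholders):
# library(RevoScaleR)
# rxImport(inData = "input.csv", outFile = "input.xdf",
#          rowsPerRead = rows_20, overwrite = TRUE)
# Re-block a sample created earlier with rxSplit() or rxDataStep():
# rxDataStep(inData = "train.xdf", outFile = "train_reblocked.xdf",
#            rowsPerRead = rows_20, overwrite = TRUE)
```

With these inputs the helper reproduces the two examples in the list: 500,000 rows for 20 columns and 1,000 rows for 1000 columns. Lowering target_elements gives smaller blocks, which is the cheaper direction to err in, since blocksPerRead can be raised later without re-blocking.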

Article ID: 3104210 - Last Review: 29 Oct 2015 - Revision: 1