Stranger and Stranger Said Alice; Big Data and Big Calculations

By Tom Oldfield | Monday, 14 January 2019 - 08:00
"Stranger and Stranger" Said Alice

“My calculation in Vortex sometimes runs and sometimes fails with a memory error?”

I was recently asked this question: why does a calculation sometimes run and sometimes fail in Vortex (with a memory error) even though the program had access to the same amount of memory in both cases? In fact, the situation was stranger still: the process could be run successfully on the full dataset after first doing the calculation on a subset of the data. Surely that is crazy, since the pages of memory available remain unchanged (assuming no memory leakage).

The first thing to note is that this was a clustering calculation, which is of order N² in memory, so a process can quickly run out of runtime memory; yet this problem was only asking for about 1 gigabyte (GB) of memory when 4 GB had been allocated to the Vortex JVM.  The second critical issue is that the process was asking for a single array, which requires contiguous memory to be allocated.
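To put rough numbers on that, here is a back-of-the-envelope sketch (not Vortex's actual code, which may store only half of the symmetric matrix or use a smaller data type) of why a pairwise clustering calculation is O(N²) in memory: a full N × N matrix of double-precision distances, stored as one flat (and therefore contiguous) array, already passes 1 GB at a little over eleven thousand rows.

```java
public class PairwiseMemory {
    // Bytes needed for a full N x N matrix of doubles, stored as one flat array.
    static long fullMatrixBytes(long n) {
        return n * n * 8L; // 8 bytes per double
    }

    public static void main(String[] args) {
        // ~11,600 rows already needs about 1 GB as a single contiguous block
        long n = 11_600;
        System.out.println(fullMatrixBytes(n) / (1024 * 1024) + " MB"); // 1026 MB
    }
}
```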

Memory allocation and data

Many people will have run a disk-defragmentation program and watched the graphical representation of the disk contents: initially a mess of tiny file-block fragments scattered all over the disk, and afterwards a nice clean block of data followed by (hopefully) a large empty space.  In some ways this is how memory is used within a program's address space.  In general, a simple memory allocator will try all the free blocks of memory, starting at the beginning of the program space, and stop looking when it finds an address with enough contiguous free space after it to fit the data structure.  (Note that this differs from disk allocation, which works in disk-block-sized pieces linked together; RAM is designed for speed, so an array is always contiguous – see later.)  You can see that asking for a large array:

  • will probably take more time – as the search is likely to take longer to find a large enough free block
  • will become more problematic over time – as the program space becomes more fragmented
  • may not succeed even when plenty of memory is available to the process – because that memory is fragmented across the process space
  • may become possible later if progressively larger blocks of memory are requested and then deallocated – as this tends to defragment the constantly updating memory: each request marks out the largest free region, which is likely to grow as the memory around it is deallocated

Note also that a single-use, single-action program is likely to have nearly all of its process space available as contiguous memory, while a long-running program that performs many actions (such as Vortex) is likely to have a highly fragmented memory space.
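The first-fit behaviour described above can be illustrated with a toy allocator (a simulation only – real allocators, and the JVM's garbage-collected heap in particular, are far more sophisticated). The point it makes is the third bullet: the total free memory can comfortably exceed a request that still fails, because no single free block is large enough.

```java
import java.util.ArrayList;
import java.util.List;

// Toy first-fit allocator: free blocks are (start, length) pairs over a fake heap.
public class FirstFit {
    static class Block {
        final int start, length;
        Block(int start, int length) { this.start = start; this.length = length; }
    }

    // Return the start address of the first free block that fits, or -1 if none does.
    static int allocate(List<Block> freeList, int size) {
        for (Block b : freeList) {
            if (b.length >= size) return b.start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // A long-running program has allocated and freed many objects, so the
        // free space is scattered: 4 separate blocks of 100 units (400 free in total).
        List<Block> freeList = new ArrayList<>();
        for (int i = 0; i < 4; i++) freeList.add(new Block(i * 200, 100));

        System.out.println(allocate(freeList, 80));  // 0: fits in the first free block
        System.out.println(allocate(freeList, 250)); // -1: 400 units free, but fragmented
    }
}
```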

So, running large calculations in Vortex will hit this data-size problem much more quickly than a single-use program might.  However, we can mitigate the problem:

  • by assigning more memory to the process – though when the calculation is O(N²) in memory this does not provide much of a solution
  • by using a different algorithm – such as a block-diagonal algorithm, O(nm), or a dynamic algorithm, O(n), both of which are available within Vortex (e.g., k-means clustering or the sequence-cluster dynamic blocking algorithm)
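As a rough illustration of the difference those algorithmic choices make (the element counts below are illustrative only; the actual algorithms in Vortex will differ in detail):

```java
public class ClusterMemory {
    // Doubles needed by three clustering strategies for n items:
    static long fullMatrix(long n)            { return n * n; } // O(n^2): all pairwise distances
    static long blockDiagonal(long n, long m) { return n * m; } // O(nm): m-wide blocks only
    static long dynamic(long n)               { return n; }     // O(n): e.g. one value per item

    public static void main(String[] args) {
        long n = 100_000, m = 1_000;
        System.out.println(fullMatrix(n));       // 10,000,000,000 doubles (~80 GB)
        System.out.println(blockDiagonal(n, m)); // 100,000,000 doubles (~800 MB)
        System.out.println(dynamic(n));          // 100,000 doubles (~800 KB)
    }
}
```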

Is there a solution? We are getting into the realm of big data, where our normal approach to a problem becomes difficult to apply, and the user must start to consider alternative ways to handle data once it reaches a critical size.

Really big data and memory

As a final note, as we move towards really big data – genomics.  We have been told that we now have 64-bit addressing on nearly all computers, so as long as you have enough money you can just throw more memory at the problem.  This is not quite correct for contiguous memory allocation; it comes down to a decision made years ago by chip-set manufacturers.  In many computer chip-sets, a machine-code instruction consists of a fixed block of two 64-bit words: the first is a 64-bit address and the second holds the operand and other information.  In the case of memory arrays, an example is the REL instruction, which consists of:

  • Word 1: the 64-bit base address of the array
  • Word 2 (32 bits): an operand – for example, add to accumulator
  • Word 2 (remaining 32 bits): a relative address

The relative address defines the offset from the base address, which mirrors the high-level language code array[56]: array is the data object, the base address is where we find the 64-bit address of the beginning of the array – array[0] (that might not be true, depending on the implementation) – and the offset is 56 data units (i.e., bytes, ints, doubles).  This means vector operations can traverse an array with a single instruction per element, and we are guaranteed that the 56th offset position holds the data we want.
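The base-plus-offset arithmetic can be written out explicitly (the base address below is hypothetical, chosen only for the illustration):

```java
public class BaseOffset {
    // Byte address of array[index], mirroring base + offset addressing:
    // address = base + index * elementSize
    static long address(long base, long index, long elementSize) {
        return base + index * elementSize;
    }

    public static void main(String[] args) {
        long base = 0x1000; // hypothetical base address of the array (= array[0])
        // array[56] for an array of doubles (8-byte elements):
        System.out.println(Long.toHexString(address(base, 56, 8))); // 11c0
    }
}
```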

You can see that the offset address is only 31 bits (32 minus the sign bit), so it is limited to 2³¹ values – in Java, array indices run from 0 to 2,147,483,647.  Why does this matter – why would anyone ever want to allocate a single byte array of over 2 GB, or a double array of 16 GB?  Because of genomics: at one byte per base, the human genome is larger than 3 GB.
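In Java this limit is visible as Integer.MAX_VALUE, and a quick check shows that a one-byte-per-base human genome cannot fit in a single array (3.2 billion bases is an assumed round figure for the illustration):

```java
public class IndexLimit {
    public static void main(String[] args) {
        long genomeBases = 3_200_000_000L; // human genome, roughly 3.2 billion bases
        // The largest legal Java array length / index bound:
        System.out.println(Integer.MAX_VALUE);               // 2147483647
        // One byte per base will not fit in a single byte[]:
        System.out.println(genomeBases > Integer.MAX_VALUE); // true
    }
}
```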

Of course, you would probably allocate memory by chromosome, or only consider the exons, or even compress the data in memory – but what about the rice genome? Things are getting close to the limit, and we must start to understand the limitations of machine code when we consider genomic analysis.  So, what happens when we try to write code around data that is too big? We would need to declare an array with a size too large for an int (Java error: integer number too large) and use a long data type for the array index (Java error: incompatible index data type) – the people who wrote the compilers obviously knew about this limit.
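One workaround – and presumably the idea behind the in-house types mentioned below – is a two-level array that translates a long index into a chunk number and an offset, keeping every chunk under the 2³¹ limit. A minimal sketch (this is not the actual Vortex implementation, whose details are not published here):

```java
public class LongByteArray {
    // A byte array indexed by long: the data is split into 1 GB chunks, and a
    // long index is translated into (chunk, offset). Each chunk stays well
    // under the 2^31 element limit of a single Java array.
    private static final int CHUNK_BITS = 30;             // 2^30 bytes = 1 GB per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final byte[][] chunks;

    LongByteArray(long length) {
        int nChunks = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        chunks = new byte[nChunks][];
        long remaining = length;
        for (int i = 0; i < nChunks; i++) {
            chunks[i] = new byte[(int) Math.min(remaining, CHUNK_SIZE)];
            remaining -= chunks[i].length;
        }
    }

    byte get(long index) {
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, byte value) {
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    public static void main(String[] args) {
        // Small sizes work exactly like a normal array; with enough heap, the
        // same class addresses lengths far beyond Integer.MAX_VALUE.
        LongByteArray a = new LongByteArray(100);
        a.set(42, (byte) 7);
        System.out.println(a.get(42)); // 7
    }
}
```

The cost of the extra indirection is one shift, one mask, and one additional array lookup per access – a modest price for breaking the 31-bit offset barrier.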

The final word

So what about Vortex and genomics? Vortex does handle genomic data, with zero-order performance on graphics, but not as a single sequence, since that would hit this memory-allocation problem.  However, the genomic packages of Vortex have the option of using a special set of data types written in house, e.g., “longByte”.  These packages are designed so they can also be used as standalone programs (e.g., sequence alignment); they are therefore future-proofed against huge data and enable data sizes beyond the 31-bit offset limit – until we hit the 64-bit limit, and that will never happen….

