Principles of Computing – Big Data Challenges

By Tom Oldfield | Wednesday, December 5, 2018 - 08:00 UTC
Principles of Computing

Previous blog articles discussed how Vortex has zero-order graphics, whose measured performance approaches zero order as a function of data size.  Vortex also has a close-to-first-order sequence alignment algorithm that is some 100x faster than BLAST – the de facto standard program in the life sciences.  So how do we realise this?

It’s the algorithm that is key, stupid!

The key to high-performance computing is the design of the algorithm; in fact, the field of data science is essentially mathematics: numerical-analysis solutions to problems drawn from many branches of maths and statistics.  Whether the goal is fast graphics, or simply calculating difficult problems on huge data sets faster than the user expects, the algorithm is key.

Since an algorithm is a mathematical (numerical) solution to a problem, we need to understand how to implement approximations and compromises that do not detract from the overall goal for the user.  This is particularly a problem for NP-hard algorithms, where exact solutions would take a very long time to run.  It requires thinking outside the box: how do we re-imagine the problem we are solving to give a huge performance increase without loss of knowledge?  An example from Dotmatics Vortex is the life-science sequence clustering algorithm, which uses a block-grouping all-vs-all approach; in a real customer case of clustering 138,000 sequences (more than 19,000,000,000 pairwise sequence comparisons), the run time dropped from an expected 200+ years for a BLAST-like algorithm to an actual 86 seconds on a Dell business laptop.  Given that we can succeed at this first hurdle – algorithm design – what else can we do to design high-performance big-data software?
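To make that concrete, here is a minimal sketch in Java of the general block-grouping idea. It is emphatically not the Vortex algorithm: the length-based bucket key and the similar() stand-in are invented for illustration. The point is simply that one cheap grouping pass means the expensive comparison only ever runs inside small blocks, so most of the 19-billion-entry comparison matrix is never visited.

```java
import java.util.*;

// Illustrative only: block-grouped all-vs-all comparison.
public class BlockedClustering {

    // Cheap grouping key; a real implementation would use something far
    // smarter (e.g. a sequence signature), this just buckets by length.
    static int bucket(String seq) {
        return seq.length() / 50;
    }

    // Stand-in for the expensive pairwise similarity calculation.
    static boolean similar(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> sequences = List.of("ACGT", "ACGA", "TTTTTT");

        // 1. One cheap pass to group sequences into blocks.
        Map<Integer, List<String>> blocks = new HashMap<>();
        for (String s : sequences) {
            blocks.computeIfAbsent(bucket(s), k -> new ArrayList<>()).add(s);
        }

        // 2. Expensive all-vs-all comparison only within each block.
        long comparisons = 0;
        for (List<String> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    comparisons++;
                    similar(block.get(i), block.get(j));
                }
            }
        }
        System.out.println("Pairwise comparisons actually run: " + comparisons);
    }
}
```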

It’s a number, stupid – don’t complicate the problem!

There has been massive growth in “modern” computer languages such as Python and JavaScript, and then there is Scala and so on.  Are these languages suitable for big-data analytics?  Well, read on.  Key to understanding big data is keeping the problem as small as possible, while realising that compressing the data beyond its native state results in a computational overhead.  If we have many numerical data values, then we need to store each value in the smallest number of bytes that provides the necessary precision and range for the data – that means primitive data.  Using objects for each value (even strings) is hugely detrimental in both memory and access terms.  Another important consideration is to use data structures that suit the problem: use arrays if you need to calculate something from all the data, or if the data is sequential (i.e. ordered).  Only use hash maps where you genuinely need random access; that type of data structure scales badly for truly big data.  Particularly for vector-style calculations, the data should be contiguous in memory, and data arrays provide exactly that – CPUs have addressing modes and instructions designed for sequential array access.
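As a rough illustration of the primitive-versus-object point (the memory figures in the comments are typical 64-bit JVM values, not measurements from Vortex):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: boxed values versus a primitive array.
public class PrimitiveVsBoxed {
    public static void main(String[] args) {
        int n = 10_000_000;

        // Boxed: every value is a separate Double object on the heap
        // (typically ~16 bytes of object plus a reference in the list),
        // scattered in memory, so any scan over the data chases pointers.
        List<Double> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) boxed.add((double) i);

        // Primitive: one contiguous block of 8-byte values; a scan is a
        // sequential read the CPU can prefetch and vectorise.
        double[] primitive = new double[n];
        for (int i = 0; i < n; i++) primitive[i] = i;

        double sum = 0;
        for (double v : primitive) sum += v;
        System.out.println(sum);
    }
}
```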

Computing languages are generic – design for the problem!

Vortex is implemented in the computer language Java, which has many convenient classes and methods that can be used to quickly implement a program.  Unfortunately, many of these were not designed with big or complex data in mind; they were designed so that as many programmers as possible could quickly produce working programs on “normal” data.  If you want to work with big data – and life-science data is huge – we need to write some high-performance code.  Just because a computer language lets you solve your problem by calling a single method on a single line of code, that does not mean the provided method does it efficiently.
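A small invented example of that last point: both snippets below compute the same sum, and the one-liner is certainly convenient, but it boxes every value on the way through.

```java
import java.util.Arrays;

// Illustrative only: a convenient one-liner versus an explicit loop.
public class ConvenienceCost {
    public static void main(String[] args) {
        double[] values = new double[1_000_000];
        Arrays.fill(values, 1.0);

        // Convenient single expression, but .boxed() wraps every double in
        // a Double object before the reduction.
        double boxedSum = Arrays.stream(values).boxed().reduce(0.0, Double::sum);

        // Explicit loop over the primitive array: no allocation at all, and
        // the JIT compiles it to a tight machine-code loop.
        double sum = 0;
        for (double v : values) sum += v;

        System.out.println(boxedSum + " " + sum);
    }
}
```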

The largest Java class to be re-implemented in Vortex to date is Swing’s graphical JTable, which has been replaced by a high-performance, richly featured table that is not only graphically zero-order up to the memory size of the computer, but also has renderers for a huge range of scientific data types layered on top of an N-dimensional hyper-cube data system.  This allows recursive sub-rows within rows and columns within columns.  How else are we going to show sequence alignments and annotations for whole genomes at 70 fps (frames per second)?
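I won’t reproduce the Vortex table here, but the principle that makes painting zero-order can be sketched in a few lines: only the rows that intersect the viewport are ever drawn, so the cost of a frame depends on the viewport height rather than on how many rows the table holds. The RowPainter callback below is hypothetical.

```java
// Illustrative only: virtualised row painting, the idea behind zero-order
// table graphics. Per frame, at most viewportHeightPx / rowHeight rows are
// touched, whether the table holds a thousand rows or a whole genome.
public class VirtualRendering {

    // Hypothetical row-drawing callback.
    interface RowPainter {
        void paintRow(long rowIndex, int yPx);
    }

    static void paintVisible(long totalRows, int rowHeight,
                             long scrollOffsetPx, int viewportHeightPx,
                             RowPainter painter) {
        long firstRow = scrollOffsetPx / rowHeight;
        long lastRow = Math.min(totalRows - 1,
                                (scrollOffsetPx + viewportHeightPx) / rowHeight);
        for (long r = firstRow; r <= lastRow; r++) {
            painter.paintRow(r, (int) (r * rowHeight - scrollOffsetPx));
        }
    }
}
```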

Go beyond the problem!

There are situations within any program where we require extreme performance, or where access across massive amounts of data is required.  This is where a good knowledge of hardware design and compiler implementation counts; it is possible to work with the computer to squeeze the last bit of performance out of a well-designed algorithm at the core of a computation engine.  Most important, we must not fail on sustained throughput of huge data.  For example, how we work with chunks of data needs to match the L1, L2 and L3 cache sizes of typical chipsets, the bus widths for data flow (i.e. big-data processing), or even how many registers a typical CPU has.

For example, data is paged at every level – i.e. a block of data can be read into L3 cache from main memory – and micro-code/machine code is highly optimised to move sequential arrays of data with good throughput, where hash maps and other random-access data structures would be pathological.
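The classic textbook illustration of matching chunk size to the cache – not taken from Vortex – is a tiled matrix multiply. The tile size below is an assumption: 64 x 64 doubles is 32 KiB per tile, so the three tiles in play (about 96 KiB) sit comfortably in a typical L2 cache, and each value loaded from main memory is reused 64 times before it is evicted.

```java
// Illustrative only: cache-blocked (tiled) matrix multiply.
public class TiledMatMul {
    static final int BLOCK = 64; // 64 * 64 * 8 bytes = 32 KiB per tile

    // c += a * b for n x n matrices stored row-major in flat arrays;
    // n is assumed to be a multiple of BLOCK to keep the sketch short.
    static void multiply(double[] a, double[] b, double[] c, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    // the three BLOCK x BLOCK tiles touched here stay cache-resident
                    for (int i = ii; i < ii + BLOCK; i++)
                        for (int k = kk; k < kk + BLOCK; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jj + BLOCK; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }
}
```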

The second part of this is knowledge of the compilers that produce machine code, typically for RISC-style hardware/firmware with many performance features such as speculative execution.  Can we get the compiler to see how to use as few machine code instructions as possible?  Yes, but only for programming languages “close to the machine code”.  The most extreme example of this technique is the latest sequence alignment algorithm included in Vortex, some 100x faster than BLAST (the life-science benchmark): the core algorithm has been reduced to just 4 machine code instructions per monomer comparison and is close to first order, even though it handles mutations/insertions/deletions and can identify small regions of alignment.
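I can’t show the Vortex alignment kernel itself, but the style of inner loop the paragraph is describing looks something like the invented fragment below: sequences pre-encoded as byte arrays, and a loop body with no allocation, no method calls and a single comparison, which the JIT can typically reduce to a handful of instructions per monomer, or vectorise outright.

```java
// Illustrative only: a JIT-friendly inner loop over pre-encoded sequences.
public class TightLoop {

    // Count matching positions between two equal-length encoded sequences.
    static int matches(byte[] a, byte[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            // the conditional compiles down to a compare-and-increment; no
            // objects, no calls, nothing to stop the JIT unrolling or
            // vectorising the loop
            count += (a[i] == b[i]) ? 1 : 0;
        }
        return count;
    }
}
```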

Summary

I have only scratched the surface of the principles of high-performance programming here, and they extend the normal good practice of program design.  It really comes down to experience, thinking beyond the problem, and keeping one step ahead of customer expectations; a fast program is one where the customer thinks it’s fast.  This mirrors the simple description of “big data” as data that is too big to handle in conventional ways.
