There is often much confusion between hypothesis-driven and discovery-driven problem-solving methods. The whole of science is built on the idea that we propose a hypothesis and the community then tries to take it apart by finding the single negative case that breaks it. This is the “scientific method”, and it is predicated on defining a hypothesis that is falsifiable. The key point about a falsifiable hypothesis is that we can never conclusively prove a scientific principle, but a single negative result can disprove it. We can only approach a “Theory”, the ultimate goal of scientific discovery: a hypothesis that the entire world has failed, over 10-100 years, to disprove or to find evidence against. Examples include the theory of gravitation and the theory of evolution.
From the algorithm or data analysis point of view, we create hypothesis-driven functionality when we try to prove (or disprove) that a target exists; the search is based on the science we wish to look for. An example of this is the simple string search algorithm, where we attempt to find similar strings within a data set; in the life sciences the equivalent is sequence alignment. In this case we have a target sequence and try to identify sections of, for example, a genome that match this sequence. The hypothesis here is that the search sequence either exists or does not exist in the data. In this way, we identify information within the data that immediately provides knowledge, since we know the question that was asked.
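As an illustration of the hypothesis-driven case, the search for a target sequence can be sketched in a few lines of Python. This is a minimal exact-match sketch, not a real sequence alignment tool (which would also score mismatches and gaps); the function name `find_occurrences` and the short example sequences are made up for illustration.

```python
def find_occurrences(genome: str, target: str) -> list[int]:
    """Return every 0-based start index where target occurs in genome."""
    if not target:
        return []
    hits = []
    start = genome.find(target)
    while start != -1:
        hits.append(start)
        # Restart one position later so overlapping matches are also found.
        start = genome.find(target, start + 1)
    return hits


# Hypothetical data: the hypothesis is "the target sequence exists in the genome".
genome = "ATGCGATACGCTTGAGATACG"
target = "GATACG"

hits = find_occurrences(genome, target)
print(hits)  # [4, 15]
print("hypothesis supported" if hits else "hypothesis rejected")
```

An empty list rejects the hypothesis for this data set; a non-empty list supports it and says where the target lies, which is why the result can be interpreted immediately.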
So, what is discovery-driven data analysis?
The technical description is “to identify information within data by applying techniques outside the data domain (usually mathematical targets) to identify patterns, clusters or outliers”. Because the question is posed mathematically, the answer cannot tell us the reason for the information found, but it can reveal information we had no knowledge of. It can discover things not previously known, since we do not need an existing hypothesis to test.
For example, we can look for correlations in atomic position throughout protein structure coordinate data (Oldfield T.J. 2002) using statistical methods that find atomic clusters based on position and chemistry. This analysis finds the known active sites within proteins, such as the catalytic triad or the zinc finger, but it also finds many other atomic clusters that represent previously unknown local structural features within proteins. For example, we find that the catalytic triad is very often closely associated with a nearby phenylalanine residue that is always in the same orientation. We could not have assumed that this residue should be included in a search hypothesis, since there is no immediately obvious chemical reason for it to be there in multiple classes of protein.
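To make the contrast with a targeted search concrete, here is a toy sketch of clustering with no target at all. It is not the statistical method used in Oldfield (2002); it is plain single-linkage grouping by a distance cutoff, and the coordinates, the cutoff and the name `cluster_by_distance` are assumptions chosen only for illustration.

```python
from math import dist


def cluster_by_distance(points, cutoff):
    """Label points so that any two within `cutoff` of each other,
    directly or through a chain of neighbours, share a cluster label."""
    labels = [-1] * len(points)  # -1 means "not yet assigned"
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:  # grow the cluster outwards from the seed
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and dist(points[i], points[j]) <= cutoff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical 3D coordinates; no search target is supplied anywhere.
atoms = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
         (10, 10, 10), (10, 11, 10),
         (50, 50, 50)]

labels = cluster_by_distance(atoms, cutoff=1.5)
print(labels)  # [0, 0, 0, 1, 1, 2]
```

The output only says that two groups and one isolated point exist. Why they exist is a question the mathematics cannot answer, which is exactly the limitation described above.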
So, if discovery-driven data analysis is so powerful, why is it not used more? First, it works between the boundaries of data noise and the known data properties (systematic bias), so finding additional information can be a very difficult process of fine-tuning and optimising. Second, it does not deliver knowledge; it only points the way and leaves the scientist to find the reason.
Within Dotmatics software (Vortex) a multitude of discovery functions are provided. The life-science functions include biological sequence activity relationships, principal component analysis of sequence mutations and matched pairs analysis of sequence mutations. Extending these ideas are the predictive tools: general/ordinary N-dimensional least squares, Bayes classification and Bayes probability prediction methods. Finally, methods based on maximum likelihood, neural networks, genetic algorithms and probabilistic methods (i.e. HMMs) are being developed and will become available soon. Key to these tools is a measure of predictive power, which tells us whether we have a sensible result; such measures are built into all predictive analysis methods using various cross-validation methods. Without this feedback it would not be easy for the scientist to know whether the results were significant. In fact, this has gone one step further: the predictive tools can automatically remove parametric data that provides little or no predictive power, in order to auto-optimise the probability targets.
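The idea of using cross-validation as a measure of predictive power, and of dropping parameters that contribute none, can be sketched as follows. This is not the Vortex implementation: it is a simplified single-descriptor least-squares fit scored by leave-one-out cross-validation, and the data values, the descriptor names and the 0.5 threshold are all assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def loo_q2(xs, ys):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without point i, then predict the held-out point.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = mean(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss_tot


# Hypothetical measurements and two candidate descriptors.
activity = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
descriptors = {
    "descriptor_a": [1, 2, 3, 4, 5, 6],  # tracks activity closely
    "descriptor_b": [2, 5, 1, 5, 2, 4],  # essentially unrelated to activity
}

scores = {name: loo_q2(xs, activity) for name, xs in descriptors.items()}
# Keep only descriptors whose cross-validated score clears the assumed threshold.
kept = [name for name, q2 in scores.items() if q2 >= 0.5]
print(kept)  # ['descriptor_a']
```

A q^2 close to 1 means the model predicts held-out points well, while a value near or below zero means it predicts no better than the mean. Using the held-out score rather than the in-sample fit is what gives the scientist feedback on whether a result is significant.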
Oldfield T.J. (2002). Data mining the protein data bank: residue interactions. Proteins 49(4), 510-28.