How and why to use DNA sequence alignment methods

How do you perform DNA sequence alignment?

DNA sequence alignment is a method of arranging nucleotide sequences so that regions of similarity line up, which makes it possible to study genetic and evolutionary relationships. Once the sequences are arranged, the similarities between them can be identified and used to draw conclusions about how the sequences are related.

These similarities may be a consequence of functional, structural, or evolutionary relationships. Identifying them is useful in many biological applications, such as genome assembly, and the same alignment techniques are also applied to some non-biological sequences, including text in natural language processing.

Benefits of DNA alignment

A DNA sequence alignment is typically represented as a matrix in which each row holds one sequence of nucleotide residues. Gaps are inserted between residues so that identical or similar characters line up in successive columns.
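To make the matrix representation concrete, here is a minimal Python sketch. The two sequences are hypothetical and are assumed to be already aligned, with "-" standing for a gap; the helper name match_line is chosen here for illustration only.

```python
# Two hypothetical sequences that have already been aligned; "-" marks a gap.
aligned = [
    "ACGT-ACGT",
    "ACGTTAC-T",
]

# Every row of an alignment matrix must have the same number of columns.
assert len({len(row) for row in aligned}) == 1

# Transpose the rows so that each tuple holds one column of the matrix.
columns = list(zip(*aligned))


def match_line(columns):
    """Mark each column "|" if all rows hold the same base, else " "."""
    return "".join(
        "|" if len(set(col)) == 1 and "-" not in col else " "
        for col in columns
    )


# Print the alignment with a marker line between the two rows.
print(aligned[0])
print(match_line(columns))
print(aligned[1])
```

Reading the matrix column by column, rather than row by row, is what lets matches, mismatches, and gaps be compared position by position.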

When two aligned sequences share a common ancestor, the differences between them can be interpreted as mutations. Mismatches are most often interpreted as point mutations, and gaps as insertion or deletion mutations (indels). These mutations may have been introduced in one or both lineages in the time since they diverged.
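The interpretation above can be sketched as a simple column count over a pairwise alignment. This is an illustrative sketch, not a phylogenetic method: the function name classify_columns and the example sequences are hypothetical, and the counts only flag candidate mutations.

```python
def classify_columns(seq_a, seq_b):
    """Count match, mismatch, and gap columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            counts["gap"] += 1       # candidate insertion or deletion
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1  # candidate point mutation
    return counts


# Hypothetical aligned pair: two mismatch columns and one gap column.
counts = classify_columns("ACGTTACGT", "ACCT-ACGA")
print(counts)
```

The counts alone cannot say in which lineage a change occurred; that requires comparison with additional sequences.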

Compared with protein sequence alignment, DNA sequence alignment works with a much smaller alphabet, and nucleotide bases are more similar to one another than amino acids are. In DNA and RNA alignments, the conservation of base pairs can indicate a shared structural or functional role.

How to align DNA sequences

DNA sequence alignment can be performed both manually and computationally.

Short sequences can simply be aligned by hand. However, most DNA sequence alignment involves long, highly variable sequences, which require computational tools. These tools, typically known as DNA sequence alignment software, implement algorithms designed to produce high-quality alignments.

Different categories of computational sequence alignment

Within computational sequence alignment, there are two different categories:

Local alignments

Local alignments identify regions of similarity within long sequences that are often widely divergent overall. They are often preferable to global alignments, but they can be more difficult to calculate because of the added challenge of identifying the regions of similarity. Local alignments are commonly computed with dynamic programming, as in the Smith-Waterman algorithm, or approximated with heuristic or probabilistic methods when the data sets are large.
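As an illustration of the dynamic programming approach, the following is a minimal sketch of Smith-Waterman scoring with a linear gap penalty. The scoring values (match=2, mismatch=-1, gap=-2) and the function name are assumptions chosen for this example, and the sketch returns only the best score, not the aligned region itself.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score of a local alignment that ends at
    # a[i - 1] and b[j - 1]; the first row and column stay at zero.
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment here
                score[i - 1][j - 1] + sub,  # align a[i - 1] with b[j - 1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            best = max(best, score[i][j])
    return best


# Hypothetical sequences that share only the short region "ACG".
print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

Clamping each cell at zero is what makes the alignment local: a poorly matching stretch resets the score instead of dragging down a good region that follows it.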

Global alignments

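Global alignments force the alignment to span the entire length of every query sequence. They are most useful when the query sequences are similar and of roughly equal size. Although conceptually simpler than local alignment, an exact global alignment, for example with the Needleman-Wunsch dynamic programming algorithm, takes time proportional to the product of the sequence lengths, so aligning long sequences end to end can be computationally intensive.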
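A global alignment can be sketched with the Needleman-Wunsch algorithm. The version below is a minimal illustration under assumed scoring values (match=1, mismatch=-1, gap=-1) with a linear gap penalty; the function name global_align is hypothetical, and when several alignments are equally good the traceback returns just one of them.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: every cell covers full prefixes of both sequences.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i - 1] with b[j - 1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


# Hypothetical pair that differs by a single deleted base.
aligned_a, aligned_b, total = global_align("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

The nested loops visit every cell of a matrix whose size is the product of the two sequence lengths, which is where the cost of exact global alignment on long sequences comes from.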