Bioinformatics search algorithm From Wikipedia, the free encyclopedia
In bioinformatics, BLAST (basic local alignment search tool)[3] is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.
Original author(s) | Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David Lipman |
---|---|
Developer(s) | NCBI |
Stable release | 2.16.0+[1] / 25 June 2024 |
Written in | C and C++[2] |
Operating system | UNIX, Linux, Mac, MS-Windows |
Type | Bioinformatics tool |
License | Public domain |
Website | blast |
BLAST is one of the most widely used bioinformatics programs for sequence searching.[4] It addresses a fundamental problem in bioinformatics research. The heuristic algorithm it uses is much faster than other approaches, such as calculating an optimal alignment. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.
BLAST extended the alignment work of FASTA, a previously developed program for protein and DNA sequence similarity searches, by adding a novel stochastic model developed by Samuel Karlin and Stephen Altschul.[5] Karlin and Altschul proposed "a method for estimating similarities between the known DNA sequence of one organism with that of another",[3] and their work has been described as "the statistical foundation for BLAST."[6] Subsequently, Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the NIH designed and implemented the BLAST program, which was published in the Journal of Molecular Biology in 1990 and has been cited over 100,000 times since.[7]
While BLAST is faster than any Smith-Waterman implementation in most cases, it cannot "guarantee the optimal alignments of the query and database sequences" as the Smith-Waterman algorithm does. The Smith-Waterman algorithm was an extension of a previous optimal method, the Needleman–Wunsch algorithm, which was the first sequence alignment algorithm guaranteed to find the best possible alignment. However, the time and space requirements of these optimal algorithms far exceed those of BLAST.
BLAST is more time-efficient than FASTA because it searches only for the more significant patterns in the sequences, yet it retains comparable sensitivity. The description of the BLAST algorithm below makes this clearer.
Examples of other questions that researchers use BLAST to answer are:
BLAST is also often used as part of other algorithms that require approximate sequence matching.
BLAST is available on the web on the NCBI website. Different types of BLASTs are available according to the query sequences and the target databases. Alternative implementations include AB-BLAST (formerly known as WU-BLAST), FSA-BLAST (last updated in 2006), and ScalaBLAST.[8][9]
The original paper by Altschul, et al.[7] was the most highly cited paper published in the 1990s.[10]
The input consists of the query sequences (in FASTA or GenBank format), the database to search, and other optional parameters such as the scoring matrix.
BLAST output can be delivered in a variety of formats, including HTML, plain text, and XML. For NCBI's webpage, the default output format is HTML. When performing a BLAST search on NCBI, the results are given as a graphical overview of the hits found, a table showing sequence identifiers for the hits with scoring-related data, and alignments between the sequence of interest and the hits, with the corresponding BLAST scores. The easiest to read and most informative of these is probably the table.
If one is attempting to search for a proprietary sequence or simply one that is unavailable in databases available to the general public through sources such as NCBI, there is a BLAST program available for download to any computer, at no cost. This can be found at BLAST+ executables. There are also commercial programs available for purchase. Databases can be found on the NCBI site, as well as on the Index of BLAST databases (FTP).
Using a heuristic method, BLAST finds similar sequences by locating short matches between the two sequences. This process of finding short initial matches is called seeding. Only after this first match does BLAST begin to make local alignments. When searching for similarity between sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the following stretch of letters: GLKFA. If a BLAST search were conducted under normal conditions, the word size would be 3 letters. In this case, the searched words would be GLK, LKF, and KFA. The heuristic algorithm of BLAST locates all common three-letter words between the sequence of interest and the hit sequence or sequences from the database. This result is then used to build an alignment. After the words for the sequence of interest are made, similar words, known as neighborhood words, are also assembled. These neighborhood words must have a score of at least the threshold T when compared with the query words using a scoring matrix.
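The word-generation step described above can be sketched in a few lines of Python. This is only an illustration of the GLKFA example, not NCBI's implementation; the function name `query_words` is hypothetical.

```python
def query_words(sequence: str, word_size: int = 3) -> list[str]:
    """Return every overlapping word of length `word_size`, in order."""
    if word_size <= 0:
        raise ValueError("word_size must be positive")
    # Slide a window of width `word_size` one letter at a time.
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


print(query_words("GLKFA"))  # ['GLK', 'LKF', 'KFA']
```

A sequence shorter than the word size yields no words at all, which is why very short queries cannot seed an alignment at that word size.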
One commonly used scoring matrix for BLAST searches is BLOSUM62,[11] although the optimal scoring matrix depends on sequence similarity. Once both words and neighborhood words are assembled and compiled, they are compared to the sequences in the database in order to find matches. The threshold score T determines whether or not a particular word will be included in the alignment. Once seeding has been conducted, the alignment, which is only 3 residues long, is extended in both directions by the algorithm used by BLAST. Each extension changes the score of the alignment, either increasing or decreasing it. If this score is higher than a pre-determined T, the alignment will be included in the results given by BLAST. However, if this score is lower than this pre-determined T, the alignment will cease to extend, preventing areas of poor alignment from being included in the BLAST results. Note that increasing the threshold T limits the amount of space available to search, decreasing the number of neighborhood words, while at the same time speeding up the BLAST process.
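The effect of the threshold T on the number of neighborhood words can be illustrated with a small sketch. Note the assumptions: `toy_score` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62 (it simply gives +5 for identical residues and -1 otherwise), and the brute-force enumeration of all 8,000 three-letter words is chosen for clarity, not speed.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring: +5 for identical residues, -1 otherwise."""
    return 5 if a == b else -1


def word_score(word: str, candidate: str, score=toy_score) -> int:
    """Sum the pairwise residue scores of two equal-length words."""
    return sum(score(a, b) for a, b in zip(word, candidate))


def neighborhood_words(word: str, threshold: int, score=toy_score) -> list[str]:
    """Return all words scoring at least `threshold` against `word`."""
    candidates = ("".join(p) for p in product(ALPHABET, repeat=len(word)))
    return sorted(c for c in candidates
                  if word_score(word, c, score) >= threshold)


# Raising T shrinks the neighborhood, as the text describes.
for t in (3, 9, 15):
    print(t, len(neighborhood_words("GLK", t)))
```

With a real substitution matrix the neighborhood would contain biochemically similar words rather than words that merely differ in a fixed number of positions, but the dependence on T is the same: a higher threshold admits fewer words.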
To run the software, BLAST requires a query sequence to search for, and a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences. BLAST will find subsequences in the database that are similar to subsequences in the query. In typical usage, the query sequence is much smaller than the database; for example, the query may be one thousand nucleotides while the database is several billion nucleotides.
The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs). BLAST searches for high-scoring sequence alignments between the query sequence and the existing sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank, so BLAST uses a heuristic that is less accurate than the Smith-Waterman algorithm but over 50 times faster.[12] The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.
An overview of the BLAST algorithm (a protein to protein search) is as follows:[12]
BLASTn compares one or more nucleotide sequences to a database or to another sequence. This is useful when trying to identify evolutionary relationships between organisms.[14]
tBLASTn is used to search for proteins in sequences that have not yet been translated into proteins. It takes a protein sequence and compares it to all possible translations of a DNA sequence. This is useful when looking for similar protein-coding regions in DNA sequences that have not been fully annotated, such as ESTs (short, single-read cDNA sequences) and HTGs (draft genome sequences). Since these sequences do not have known protein translations, they can only be searched using tBLASTn.[15]
BLASTx compares a nucleotide query sequence, which can be translated into six different protein sequences, against a database of known protein sequences. This tool is useful when the reading frame of the DNA sequence is uncertain or contains errors that might cause mistakes in protein-coding. BLASTx provides combined statistics for hits across all frames, making it helpful for the initial analysis of new DNA sequences.[16]
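The six translations mentioned above (three reading frames on each strand) can be sketched as follows. This is a minimal illustration assuming the standard genetic code and an unambiguous A/C/G/T input; the function names are hypothetical and the frame labels (`+1` to `-3`) are a labeling choice for this sketch, not BLASTx's own output format.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict[str, str]:
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

Each of the six protein strings would then be compared against the protein database, which is why an uncertain reading frame in the query is not an obstacle for this kind of search.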
BLASTp, or Protein BLAST, is used to compare protein sequences. One or more query protein sequences are compared against a single protein sequence or a database of protein sequences. This is useful when trying to identify a protein by finding similar sequences in existing protein databases.[17]
Parallel versions of BLAST that operate on split databases are implemented using MPI and Pthreads, and have been ported to various platforms including Windows, Linux, Solaris, Mac OS X, and AIX. Popular approaches to parallelizing BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partitioning). In database segmentation, the database is split into equal-sized pieces that are stored locally on the nodes. Each query is run on all nodes in parallel, and the resulting BLAST output files from all nodes are merged to yield the final output. Specific implementations include MPIblast, ScalaBLAST, and DCBLAST.[18]
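The database segmentation pattern (split, search each piece, merge) can be sketched as below. This is only an illustration under stated assumptions: Python threads stand in for cluster nodes, `search_chunk` is a hypothetical toy search that counts shared three-letter words rather than a real BLAST run, and none of these names come from MPIblast or the other tools mentioned.

```python
from concurrent.futures import ThreadPoolExecutor


def split_database(database: list, pieces: int) -> list[list]:
    """Split a list of (id, sequence) records into nearly equal chunks."""
    size, extra = divmod(len(database), pieces)
    chunks, start = [], 0
    for i in range(pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(database[start:end])
        start = end
    return chunks


def search_chunk(query: str, chunk: list, word_size: int = 3) -> list:
    """Toy stand-in for a BLAST run: count words shared with the query."""
    words = {query[i:i + word_size]
             for i in range(len(query) - word_size + 1)}
    hits = []
    for seq_id, seq in chunk:
        shared = sum(seq[i:i + word_size] in words
                     for i in range(len(seq) - word_size + 1))
        if shared:
            hits.append((seq_id, shared))
    return hits


def parallel_search(query: str, database: list, workers: int = 2) -> list:
    """Run the query on every chunk in parallel and merge the results."""
    chunks = split_database(database, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda c: search_chunk(query, c), chunks))
    merged = [hit for hits in partial for hit in hits]
    # Best hits first; ties broken by identifier for a stable order.
    return sorted(merged, key=lambda h: (-h[1], h[0]))


db = [("s1", "GLKFAAA"), ("s2", "WWWWW"), ("s3", "XXLKFXX"), ("s4", "GLK")]
print(parallel_search("GLKFA", db, workers=2))
```

The merged result is the same regardless of how many pieces the database is split into, which is the property that makes this approach attractive: only the final merge step needs to see all partial outputs.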
MPIblast makes use of a database segmentation technique to parallelize the computation process.[19] This allows for significant performance improvements when conducting BLAST searches across a set of nodes in a cluster. In some scenarios a superlinear speedup is achievable. This makes MPIblast suitable for the extensive genomic datasets that are typically used in bioinformatics.
The running time of BLAST is generally O(n), where n is the size of the database;[20] that is, the time to complete a search increases linearly with the size of the database. MPIblast uses parallel processing to speed up the search. The ideal running time for a parallel computation is O(n/p), where p is the number of processors, which corresponds to the job being evenly distributed among the p processors. This is visualized in the included graph. The superlinear speedup that can sometimes occur with MPIblast corresponds to a running time better than O(n/p). This occurs because the cache memory can be used to decrease the run time.[21]
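The relationship between these running times and speedup can be written out explicitly. The symbols below are introduced for illustration and are not taken from the cited sources: T_1 is the run time on one processor, T_p the run time on p processors, S(p) the speedup, and c a constant of proportionality.

```latex
T_1 \approx c\,n, \qquad
T_p^{\text{ideal}} \approx \frac{c\,n}{p}, \qquad
S(p) = \frac{T_1}{T_p}, \qquad
S^{\text{ideal}}(p) = p, \qquad
\text{superlinear speedup: } S(p) > p .
```

Under these definitions, a superlinear speedup means the measured T_p is smaller than T_1 / p, which is consistent with the cache explanation above: each processor works on a database piece small enough to benefit from faster memory.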
The predecessor to BLAST, FASTA, can also be used for protein and DNA similarity searching. FASTA provides a similar set of programs for comparing proteins to protein and DNA databases, DNA to DNA and protein databases, and includes additional programs for working with unordered short peptides and DNA sequences. In addition, the FASTA package provides SSEARCH, a vectorized implementation of the rigorous Smith-Waterman algorithm. FASTA is slower than BLAST, but provides a much wider range of scoring matrices, making it easier to tailor a search to a specific evolutionary distance.
An extremely fast but considerably less sensitive alternative to BLAST is BLAT (Blast Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster.[22] Another software alternative similar to BLAT is PatternHunter.
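The idea of k-mer indexing can be sketched as follows. This is a simplified illustration of the general technique, not BLAT's actual data structure: the index here records every overlapping k-mer of the database, and the names `build_kmer_index` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict:
    """Map each k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query: str, index: dict, k: int) -> list[tuple]:
    """Look up each query k-mer in the index instead of scanning the database."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        # .get avoids adding empty entries to the defaultdict.
        for seq_id, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, seq_id, dpos))
    return seeds


database = {"chr1": "ACGTACGT", "chr2": "TTTACG"}
index = build_kmer_index(database, k=4)
print(find_seeds("TACG", index, k=4))  # [(0, 'chr1', 3), (0, 'chr2', 2)]
```

The index is built once and reused for every query, so the cost of finding seeds depends on the query length and the number of matches rather than on a scan of the whole database.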
Advances in sequencing technology in the late 2000s have made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically rely on BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie.
For protein identification, a popular alternative is to search for known domains (for instance from Pfam) by matching with hidden Markov models, using tools such as HMMER.
An alternative to BLAST for comparing two banks of sequences is PLAST. PLAST provides a high-performance general purpose bank to bank sequence similarity search tool relying on the PLAST[23] and ORIS[24] algorithms. Results of PLAST are very similar to BLAST, but PLAST is significantly faster and capable of comparing large sets of sequences with a small memory (i.e. RAM) footprint.
For applications in metagenomics, where the task is to compare billions of short DNA reads against tens of millions of protein references, DIAMOND[25] runs at up to 20,000 times as fast as BLASTX, while maintaining a high level of sensitivity.
The open-source software MMseqs is an alternative to BLAST/PSI-BLAST, which improves on current search tools over the full range of speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed.[26]
Optical computing approaches have been suggested as promising alternatives to the current electrical implementations. OptCAM is an example of such approaches and is shown to be faster than BLAST.[27]
While both Smith-Waterman and BLAST are used to find homologous sequences by searching and comparing a query sequence with those in the databases, they do have their differences.
Because BLAST is based on a heuristic algorithm, its results will not include all the possible hits within the database. BLAST misses hard-to-find matches.
An alternative that finds all the possible hits is the Smith-Waterman algorithm. It differs from BLAST in two respects: accuracy and speed. Smith-Waterman provides better accuracy, in that it finds matches that BLAST cannot, because it does not exclude any information. It is therefore needed for detecting remote homology. Compared to BLAST, however, it is more time-consuming and requires large amounts of computing power and memory. Advances such as FPGA chips and SIMD technology have nevertheless sped up the Smith-Waterman search process dramatically.
For more complete results from BLAST, the settings can be changed from their defaults. The optimal settings for a given sequence, however, may vary. The settings that can be changed are the E-value, gap costs, filters, word size, and substitution matrix.
Note, the algorithm used for BLAST was developed from the algorithm used for Smith-Waterman. BLAST employs an alignment which finds "local alignments between sequences by finding short matches and from these initial matches (local) alignments are created".[28]
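For comparison, the exhaustive approach can be sketched as a short dynamic program that computes the best Smith-Waterman local alignment score. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; practical searches use a substitution matrix and usually affine gap costs.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: the alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "GATTACA"))  # 14
```

Every cell of the len(a) by len(b) matrix is filled in, so the run time grows with the product of the two sequence lengths. This is the exhaustive work that BLAST avoids by extending only from seed matches, at the cost of possibly missing some alignments.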
Different software is available to help users interpret BLAST results. Classified by installation and use, analysis features, and technology, some of the available tools are:[29]
Example visualizations of BLAST results are shown in Figures 4 and 5.
BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.