Blast Algorithm
From DrugPedia: A Wikipedia for Drug discovery
Line 48: | Line 48: | ||
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment. | After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment. | ||
- | + | …..???? PQL ??? …. | |
- | + | …. $$$ PQL $$$ …. | |
Line 65: | Line 65: | ||
[[Image:blast2.png]] | [[Image:blast2.png]] | ||
+ | |||
Trimming back for getting hsp | Trimming back for getting hsp | ||
+ | |||
+ | Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal. | ||
+ | |||
+ | [[Image:blast3.png]] | ||
+ | |||
+ | non overlapping hsp | ||
+ | |||
+ | [[Image:blast4.gif]] | ||
+ | |||
+ | extension of seed | ||
+ | |||
+ | [[Image:blast5.gif]] | ||
+ | |||
+ | extension until cumulative score drops | ||
+ | |||
+ | ==Evaluation Step== | ||
+ | |||
+ | we know that Blast is based on Karlin-Altschul statistics . | ||
+ | |||
+ | The Karlin-Altschul equation is given by:- | ||
+ | |||
+ | [[Image:blast6.png]] | ||
+ | |||
+ | e value | ||
+ | |||
+ | |||
+ | |||
+ | E = expected no. of alignments | ||
+ | k= minor constant | ||
+ | m= length of query | ||
+ | n= length of database | ||
+ | λ= Scaling factor | ||
+ | S= raw score | ||
+ | λS= Normalized score | ||
+ | mn= search space | ||
+ | |||
+ | Raw score (S): Sum of scores for each aligned position and scores for gaps | ||
+ | S = sum(matches) - sum(mismatches) - sum(gap penalties) | ||
+ | note: this score varies with the scoring matrix used and thus may not be meaningfully | ||
+ | compared for different searches | ||
+ | |||
+ | Bit score (S’): Version of the raw score that is normalized by the scale of the scoring matrix ( ) and the scale of the search space size (K) | ||
+ | S’ = (sum(S) – ln(K)) / ln(2) | ||
+ | note: because it is normalized the bit score can be meaningfully compared across searches | ||
+ | |||
+ | E value: Number of alignments with score S’ or better that one would expect to find by chance in a search of a database of the same size | ||
+ | E = mn2-S’ | ||
+ | m = effective length of database | ||
+ | n = effective length of query sequence | ||
+ | note: E values may change if databases of different sizes are searched | ||
+ | |||
+ | |||
+ | Once we have all the HSP’s, we use a different cutoff “S” to rank the HSP’s. | ||
+ | The raw scores of HSPs are sorted base on “S” | ||
+ | The “S” alignment threshold can remove many random and low-scoring alignments. But if it's | ||
+ | set too high, it may also remove real alignments. | ||
+ | Once the HSPs are organized, they can be evaluated with a Final threshold based on the entire search space and corresponding to the value E set for the search. | ||
+ | BLAST reports any alignment or group of alignments that meets the E requirement. | ||
+ | |||
+ | Sometimes, two or more HSP regions that can be made into a longer alignment will be found. | ||
+ | |||
+ | For example, HSP score #1: 65 and 40 and HSP score #2: 52 and 45 | ||
+ | |||
+ | Poisson method: probability of multiple score is higher when the lower score of each set is higher. (45 is higher than 40) | ||
+ | Sum-of-scores method: 65+40=105 is higher than 52+45=97. | ||
+ | |||
+ | original BLAST uses the Poisson method whereas gapped BLAST and the WU-BLAST use the sum-of scores method. | ||
+ | |||
+ | Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database. | ||
+ | Those matches whose E value is lower then a threshold value are reported. | ||
+ | |||
+ | References:- | ||
+ | D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis." | ||
+ | Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool" | ||
+ | http://en.wikipedia.org/wiki/BLAST | ||
+ | |||
+ | |||
+ | Submitted by:- | ||
+ | Ekta Pathak |
Revision as of 09:15, 13 September 2008
Blasts stands for Basic Local Alignment Search Tool.Its Based on Karlin-Altschul statistics.
It compares novel sequence with that present in nucleotide & protein databases which are already characterized. It finds regions of sequence similarity which will yield functional & evolutionary clues. Regions of similarity can be:
local: where the region of similarity is based on small stretches in the sequence global: regions of similarity are present across the sequence.
Main idea of basic BLAST is that "Homologous sequences are likely to contain a short high scoring similarity region a hit. Each hit gives a seed that BLAST tries to extend on both sides".
Contents |
How does blast works?
First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.
- Seeding step
- Extension step
- Evaluation step
The Seeding step
BLAST assumes that significant alignments have words in common. A word refers to number of letters.For example, if 3 letters is a word, then the sequence PQGEFG has words PQG,QGE, GEF and EFG. Protein sequences have word length of 3 and 11 for DNA sequences.
listing of words in a sequence
BLAST cares about only the high-scoring words. The scores are created by comparing the word in the list(eg. PQG) with all the 3-letter words (PQG,QGE, GEF and EFG). By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
Seq Score Seq Score ….. …. ....... ........ PQG aligns with PEG 15 PEG 15 PRG 14 T=13 PRG 14 PSG 13 PSG 13 PQA 12 PQA 12
If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for?
50 x 250 = 12,500 !!
As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!
The Extension Step
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.
…..???? PQL ??? …. …. $$$ PQL $$$ ….
When do we stop the extension? ???
Let’s do an example:
Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right...
Use a drop off score variable, X = 5
The quick brown fox jumps over the lazy dog The quiet brown cat purrs when she sees him 123 45654 56789 876 5654… score 000 00012 10000 123 4345… drop off score
After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.
Trimming back for getting hsp
Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal.
non overlapping hsp
extension of seed
extension until cumulative score drops
Evaluation Step
we know that Blast is based on Karlin-Altschul statistics .
The Karlin-Altschul equation is given by:-
e value
E = expected no. of alignments k= minor constant m= length of query n= length of database λ= Scaling factor S= raw score λS= Normalized score mn= search space
Raw score (S): Sum of scores for each aligned position and scores for gaps S = sum(matches) - sum(mismatches) - sum(gap penalties) note: this score varies with the scoring matrix used and thus may not be meaningfully compared for different searches
Bit score (S’): Version of the raw score that is normalized by the scale of the scoring matrix ( ) and the scale of the search space size (K) S’ = (sum(S) – ln(K)) / ln(2) note: because it is normalized the bit score can be meaningfully compared across searches
E value: Number of alignments with score S’ or better that one would expect to find by chance in a search of a database of the same size E = mn2-S’ m = effective length of database n = effective length of query sequence note: E values may change if databases of different sizes are searched
Once we have all the HSP’s, we use a different cutoff “S” to rank the HSP’s.
The raw scores of HSPs are sorted base on “S”
The “S” alignment threshold can remove many random and low-scoring alignments. But if it's
set too high, it may also remove real alignments.
Once the HSPs are organized, they can be evaluated with a Final threshold based on the entire search space and corresponding to the value E set for the search.
BLAST reports any alignment or group of alignments that meets the E requirement.
Sometimes, two or more HSP regions that can be made into a longer alignment will be found.
For example, HSP score #1: 65 and 40 and HSP score #2: 52 and 45
Poisson method: probability of multiple score is higher when the lower score of each set is higher. (45 is higher than 40) Sum-of-scores method: 65+40=105 is higher than 52+45=97.
original BLAST uses the Poisson method whereas gapped BLAST and the WU-BLAST use the sum-of scores method.
Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database. Those matches whose E value is lower then a threshold value are reported.
References:- D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis." Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool" http://en.wikipedia.org/wiki/BLAST
Submitted by:-
Ekta Pathak