Blast Algorithm
From DrugPedia: A Wikipedia for Drug discovery
Line 61: | Line 61: | ||
123 45654 56789 876 5654… score | 123 45654 56789 876 5654… score | ||
000 00012 10000 123 4345… drop off score | 000 00012 10000 123 4345… drop off score | ||
+ | |||
+ | After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair. | ||
+ | |||
+ | [[Image:blast2.png]] | ||
+ | Trimming back for getting hsp |
Revision as of 08:57, 13 September 2008
Blasts stands for Basic Local Alignment Search Tool.Its Based on Karlin-Altschul statistics.
It compares novel sequence with that present in nucleotide & protein databases which are already characterized. It finds regions of sequence similarity which will yield functional & evolutionary clues. Regions of similarity can be:
local: where the region of similarity is based on small stretches in the sequence global: regions of similarity are present across the sequence.
Main idea of basic BLAST is that "Homologous sequences are likely to contain a short high scoring similarity region a hit. Each hit gives a seed that BLAST tries to extend on both sides".
How does blast works?
First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.
- Seeding step
- Extension step
- Evaluation step
The Seeding step
BLAST assumes that significant alignments have words in common. A word refers to number of letters.For example, if 3 letters is a word, then the sequence PQGEFG has words PQG,QGE, GEF and EFG. Protein sequences have word length of 3 and 11 for DNA sequences.
listing of words in a sequence
BLAST cares about only the high-scoring words. The scores are created by comparing the word in the list(eg. PQG) with all the 3-letter words (PQG,QGE, GEF and EFG). By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
Seq Score Seq Score ….. …. ....... ........ PQG aligns with PEG 15 PEG 15 PRG 14 T=13 PRG 14 PSG 13 PSG 13 PQA 12 PQA 12
If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for?
50 x 250 = 12,500 !!
As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!
The Extension Step
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.
When do we stop the extension? ??? Let’s do an example: Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right... Use a drop off score variable, X = 5
The quick brown fox jumps over the lazy dog The quiet brown cat purrs when she sees him 123 45654 56789 876 5654… score 000 00012 10000 123 4345… drop off score
After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.