Blast Algorithm

From DrugPedia: A Wikipedia for Drug discovery

(Difference between revisions)
Jump to: navigation, search
(New page: Blasts stands for ''Basic Local Alignment Search Tool''.Its Based on ''Karlin-Altschul'' statistics. It compares novel sequence with that present in nucleotide & protein databases which ...)
Line 12: Line 12:
First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.  
First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.  
-
1 Seeding step
+
#Seeding step
-
 
+
#Extension step
-
2 Extension step
+
#Evaluation step
-
 
+
-
3 Evaluation step
+
==The Seeding step==
==The Seeding step==
Line 38: Line 36:
                                             PSG       13                        PSG   13
                                             PSG       13                        PSG   13
                                             PQA       12                        PQA   12
                                             PQA       12                        PQA   12
 +
 +
 +
If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for?
 +
                                                     
 +
                                                              50 x 250 = 12,500 !!
 +
 +
As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!
 +
 +
==The Extension Step==
 +
 +
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.
 +
 +
 +
 +
 +
 +
When do we stop the extension? ???
 +
Let’s do an example:
 +
Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right...
 +
Use a drop off score variable, X = 5
 +
 +
                    The quick  brown fox  jumps over the lazy dog
 +
                    The quiet  brown cat  purrs when she sees him
 +
                    123 45654  56789 876 5654…   score
 +
                    000 00012 10000 123 4345…     drop off score

Revision as of 08:54, 13 September 2008

Blasts stands for Basic Local Alignment Search Tool.Its Based on Karlin-Altschul statistics.

It compares novel sequence with that present in nucleotide & protein databases which are already characterized. It finds regions of sequence similarity which will yield functional & evolutionary clues. Regions of similarity can be:

local: where the region of similarity is based on small stretches in the sequence global: regions of similarity are present across the sequence.

Main idea of basic BLAST is that "Homologous sequences are likely to contain a short high scoring similarity region a hit. Each hit gives a seed that BLAST tries to extend on both sides".

How does blast works?

First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.

  1. Seeding step
  2. Extension step
  3. Evaluation step

The Seeding step

BLAST assumes that significant alignments have words in common. A word refers to number of letters.For example, if 3 letters is a word, then the sequence PQGEFG has words PQG,QGE, GEF and EFG. Protein sequences have word length of 3 and 11 for DNA sequences.

Image:blast1.png

listing of words in a sequence


BLAST cares about only the high-scoring words. The scores are created by comparing the word in the list(eg. PQG) with all the 3-letter words (PQG,QGE, GEF and EFG). By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.



                                           Seq       Score                      Seq     Score       
                                            …..       ….                       .......  ........
     PQG        aligns with                PEG	       15                        PEG	   15
                                           PRG	       14       T=13             PRG	   14 
                                           PSG	       13                        PSG	   13
                                           PQA	       12                        PQA	   12


If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for?

                                                              50 x 250 = 12,500 !! 

As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!

The Extension Step

After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.



When do we stop the extension? ??? Let’s do an example: Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right... Use a drop off score variable, X = 5

                    The quick  brown fox  jumps over the lazy dog
                    The quiet   brown cat  purrs when she sees him
                    123 45654  56789 876 5654…    score
                    000 00012 10000 123 4345…     drop off score