Blast Algorithm

From DrugPedia: A Wikipedia for Drug discovery

(Difference between revisions)
Jump to: navigation, search
Line 22: Line 22:
[[Image:blast1.png]]
[[Image:blast1.png]]
-
listing of words in a sequence
+
===Listing of words in a sequence===
Line 38: Line 38:
-
If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for?
+
If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for
-
                                                     
+
 
-
                                                              50 x 250 = 12,500 !!  
+
50 x 250 = 12,500 !!  
As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!
As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!
Line 47: Line 47:
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.
After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.
 +
…..???? PQL ??? ….
 +
 +
…. $$$ PQL $$$ ….
-
                                                  …..???? PQL ??? ….
 
-
                                                  …. $$$ PQL $$$ ….
 
 +
===When do we stop the extension?===
-
When do we stop the extension? ???
 
Let’s do an example:  
Let’s do an example:  
 +
Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right...  
Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right...  
Use a drop off score variable, X = 5
Use a drop off score variable, X = 5
Line 63: Line 65:
After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.
After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.
-
 
[[Image:blast2.png]]
[[Image:blast2.png]]
-
Trimming back for getting hsp
+
===Trimming back for getting hsp===
Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal.
Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal.
 +
[[Image:blast3.png|thumb|300px|non overlapping hsp]]
 +
[[Image:blast4.gif|thumb|300px|extension of seed]]
 +
[[Image:blast5.gif|thumb|300px|extension until cumulative score drops]]
-
[[Image:blast3.png]]
 
-
non overlapping hsp
+
==Evaluation Step==
 +
We know that Blast is based on  Karlin-Altschul statistics .
 +
The  Karlin-Altschul equation is given by:
 +
[[Image:blast6.png|100px|e value]] 
 +
                   
 +
E = expected no. of alignments
-
[[Image:blast4.gif]]
+
k= minor constant
-
extension of seed
+
m= length of query
-
[[Image:blast5.gif]]
+
n= length of database
-
+
-
extension until cumulative score drops
+
-
==Evaluation Step==
+
λ= Scaling factor
-
 
+
-
we know that Blast is based on  Karlin-Altschul statistics .
+
-
          The  Karlin-Altschul equation is given by:-
+
S= raw score
-
                  [[Image:blast6.png]]
+
λS= Normalized score
-
                         
+
-
                e value
+
-
                                                                                     
+
-
 
+
-
            E = expected no. of alignments
+
mn= search space
-
            k= minor constant
+
-
            m= length of query
+
-
            n= length of database
+
-
            λ= Scaling factor
+
-
            S= raw score
+
-
            λS= Normalized score
+
-
            mn= search space
+
Raw score (S):  Sum of scores for each aligned position and scores for gaps
Raw score (S):  Sum of scores for each aligned position and scores for gaps
Line 138: Line 131:
Those matches whose E value is lower then a threshold value are reported.
Those matches whose E value is lower then a threshold value are reported.
-
References:-
 
-
D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis."
 
-
Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool"
+
==References==
 +
* D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis."
 +
 
 +
* Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool"

Revision as of 11:31, 15 September 2008

Blasts stands for Basic Local Alignment Search Tool.Its Based on Karlin-Altschul statistics.

It compares novel sequence with that present in nucleotide & protein databases which are already characterized. It finds regions of sequence similarity which will yield functional & evolutionary clues. Regions of similarity can be:

local: where the region of similarity is based on small stretches in the sequence global: regions of similarity are present across the sequence.

Main idea of basic BLAST is that "Homologous sequences are likely to contain a short high scoring similarity region a hit. Each hit gives a seed that BLAST tries to extend on both sides".

Contents

How does blast works?

First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.

  1. Seeding step
  2. Extension step
  3. Evaluation step

The Seeding step

BLAST assumes that significant alignments have words in common. A word refers to number of letters.For example, if 3 letters is a word, then the sequence PQGEFG has words PQG,QGE, GEF and EFG. Protein sequences have word length of 3 and 11 for DNA sequences.

Image:blast1.png

Listing of words in a sequence

BLAST cares about only the high-scoring words. The scores are created by comparing the word in the list(eg. PQG) with all the 3-letter words (PQG,QGE, GEF and EFG). By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.



                                           Seq       Score                      Seq     Score       
                                            …..       ….                       .......  ........
     PQG        aligns with                PEG	       15                        PEG	   15
                                           PRG	       14       T=13             PRG	   14 
                                           PSG	       13                        PSG	   13
                                           PQA	       12                        PQA	   12


If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for

50 x 250 = 12,500 !!

As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!

The Extension Step

After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment. …..???? PQL ??? ….

…. $$$ PQL $$$ ….


When do we stop the extension?

Let’s do an example:

Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right... Use a drop off score variable, X = 5

                    The quick  brown fox  jumps over the lazy dog
                    The quiet  brown cat  purrs when she sees him
                    123 45654  56789 876 5654…    score
                    000 00012  10000 123 4345…     drop off score

After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair. Image:blast2.png

Trimming back for getting hsp

Extension should be invoked only when two non-overlapping hits are found within distance A on the same diagonal.

Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast3.png': No such file or directory @ blob.c/OpenBlob/2480.
convert: unable to open file `/usr1/webserver/cgidocs/drugpedia/images/Blast3.png' @ png.c/ReadPNGImage/2889.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast3.png/300px-Blast3.png' @ convert.c/ConvertImageCommand/2800.
non overlapping hsp
Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast4.gif': No such file or directory @ blob.c/OpenBlob/2480.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast4.gif/300px-Blast4.gif' @ convert.c/ConvertImageCommand/2800.
extension of seed
Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast5.gif': No such file or directory @ blob.c/OpenBlob/2480.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast5.gif/300px-Blast5.gif' @ convert.c/ConvertImageCommand/2800.
extension until cumulative score drops


Evaluation Step

We know that Blast is based on Karlin-Altschul statistics . The Karlin-Altschul equation is given by:

Error creating thumbnail: convert: unable to open image `/usr1/webserver/cgidocs/drugpedia/images/Blast6.png': No such file or directory @ blob.c/OpenBlob/2480.
convert: unable to open file `/usr1/webserver/cgidocs/drugpedia/images/Blast6.png' @ png.c/ReadPNGImage/2889.
convert: missing an image filename `/usr1/webserver/cgidocs/drugpedia/images/thumb/Blast6.png/100px-Blast6.png' @ convert.c/ConvertImageCommand/2800.

E = expected no. of alignments

k= minor constant

m= length of query

n= length of database

λ= Scaling factor

S= raw score

λS= Normalized score

mn= search space

Raw score (S): Sum of scores for each aligned position and scores for gaps S = sum(matches) - sum(mismatches) - sum(gap penalties) note: this score varies with the scoring matrix used and thus may not be meaningfully compared for different searches

Bit score (S’): Version of the raw score that is normalized by the scale of the scoring matrix ( ) and the scale of the search space size (K) S’ = (sum(S) – ln(K)) / ln(2) note: because it is normalized the bit score can be meaningfully compared across searches

E value: Number of alignments with score S’ or better that one would expect to find by chance in a search of a database of the same size E = mn2-S’ m = effective length of database n = effective length of query sequence note: E values may change if databases of different sizes are searched


Once we have all the HSP’s, we use a different cutoff “S” to rank the HSP’s. The raw scores of HSPs are sorted base on “S” The “S” alignment threshold can remove many random and low-scoring alignments. But if it's set too high, it may also remove real alignments. Once the HSPs are organized, they can be evaluated with a Final threshold based on the entire search space and corresponding to the value E set for the search. BLAST reports any alignment or group of alignments that meets the E requirement.

Sometimes, two or more HSP regions that can be made into a longer alignment will be found.

For example, HSP score #1: 65 and 40 and HSP score #2: 52 and 45

Poisson method: probability of multiple score is higher when the lower score of each set is higher. (45 is higher than 40) Sum-of-scores method: 65+40=105 is higher than 52+45=97.

original BLAST uses the Poisson method whereas gapped BLAST and the WU-BLAST use the sum-of scores method.

Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database. Those matches whose E value is lower then a threshold value are reported.


References

  • D.W. Mount (2004). "Bioinformatics: Sequence and Genome Analysis."
  • Ian Korf, Mark Yandell & Joseph Bedell."An essential guide to basic local alignment search tool"