Revision as of 08:57, 13 September 2008

Blasts stands for Basic Local Alignment Search Tool.Its Based on Karlin-Altschul statistics.

It compares novel sequence with that present in nucleotide & protein databases which are already characterized. It finds regions of sequence similarity which will yield functional & evolutionary clues. Regions of similarity can be:

local: where the region of similarity is based on small stretches in the sequence global: regions of similarity are present across the sequence.

Main idea of basic BLAST is that "Homologous sequences are likely to contain a short high scoring similarity region a hit. Each hit gives a seed that BLAST tries to extend on both sides".

How does blast works?

First BLAST optionally filters out low-complexity regions that are not useful for producing meaningful sequence alignments. Then, a three-step layers to sequentially refine the “good alignments”.

Seeding step
Extension step
Evaluation step

The Seeding step

BLAST assumes that significant alignments have words in common. A word refers to number of letters.For example, if 3 letters is a word, then the sequence PQGEFG has words PQG,QGE, GEF and EFG. Protein sequences have word length of 3 and 11 for DNA sequences.

listing of words in a sequence

BLAST cares about only the high-scoring words. The scores are created by comparing the word in the list(eg. PQG) with all the 3-letter words (PQG,QGE, GEF and EFG). By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T will remain in the possible matching words list, while those with lower scores will be discarded. For example, PEG is kept, but PQA is abandoned when T is 13.

                                           Seq       Score                      Seq     Score       
                                            …..       ….                       .......  ........
     PQG        aligns with                PEG	       15                        PEG	   15
                                           PRG	       14       T=13             PRG	   14 
                                           PSG	       13                        PSG	   13
                                           PQA	       12                        PQA	   12

If T=13 gives us approximately 50 word hits and if a sequence is of length 250 amino acids, what is the approximate total number of words to search for?

                                                              50 x 250 = 12,500 !!

As oppose to 20^3 x 250 = 2,000,000!! After getting the list of 50 words in the first sequence position, we can use these words to search the database. These 50 words can be pre-computed and this speeds up the overall search process!

The Extension Step

After rounding up all the word hits as seeds in the neighborhood region, both directions of a seed are extended to generate an alignment.

When do we stop the extension? ??? Let’s do an example: Using match score = +1, mismatch score = -1, and “T” is our seed and extending to the right... Use a drop off score variable, X = 5

                    The quick  brown fox  jumps over the lazy dog
                    The quiet   brown cat  purrs when she sees him
                    123 45654  56789 876 5654…    score
                    000 00012 10000 123 4345…     drop off score

After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.

Trimming back for getting hsp

Blast Algorithm

From DrugPedia: A Wikipedia for Drug discovery

Revision as of 08:57, 13 September 2008

How does blast works?

The Seeding step

The Extension Step

Views

Personal tools

Search

Navigation

Toolbox

@@ Line 61: / Line 61: @@
 45654  56789 876 5654…    score
 00012 10000 123 4345…     drop off score
+After terminating the extension, the alignment is trimmed back to the maximum score, 9. And the larger stretch of sequence is our High Scoring Pair.
+[[Image:blast2.png]]
+Trimming back for getting hsp