Experiment-9: Introduction to Bio-informatics Tools: Sequence Alignment
To perform sequence alignment to determine levels of similarity between two or more protein sequences.
Bioinformatics combines two major disciplines, the biological and computer sciences: biological data are analyzed and represented with the help of computational tools. The need for bioinformatics arose with the Human Genome Project, which aimed to sequence the entire human genome. Today, bioinformatics is indispensable to biological scientists and is part of day-to-day research, having grown to encompass proteomics, transcriptomics, molecular modeling and several other disciplines of biological research. One such application of bioinformatics tools, sequence alignment, is explained in this module.
Sequence alignment is the process of arranging two or more sequences so as to achieve the maximum level of identity between them. A query sequence is compared with the sequences in a database. This helps to determine the functional, structural and evolutionary relationships between them.
Information generated from sequence alignment can help to assign functions to unknown proteins, determine the evolutionary relationships of organisms and predict the 3D structure of a protein. Homology is similarity attributed to descent from a common ancestor, i.e. if two sequences from different organisms are similar because of shared ancestry, they are termed homologous.
Types of sequence alignment-
Based on sequence length – Depending on whether the sequences are compared in parts (similar sections) or as a whole, there are two methods:
1. Global sequence alignment – In this alignment, the entire length of all the sequences is scanned for similarities, and gaps are inserted to fill unaligned spaces.
Seq 1 - TAGC-GC-GT
Seq 2 - TA-CA-CAGT
2. Local sequence alignment – This alignment focuses on local regions of high similarity; the portions that fall outside the aligned region are shown as gaps.
Seq 1 - CGATAACGTAT
Seq 2 - --ATAAAC---
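The two example alignments above can be summarized numerically. The following is a minimal sketch, not part of any particular tool: the function name `alignment_stats` is hypothetical, and percent identity is computed over all alignment columns, which is only one of several conventions in use.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Count matches, mismatches and gap columns in two aligned strings."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            gaps += 1          # column with a gap in either sequence
        elif x == y:
            matches += 1       # identical residues
        else:
            mismatches += 1    # aligned but different residues
    # Assumption: identity is expressed relative to the full alignment length.
    identity = 100.0 * matches / len(aligned_a)
    return {"matches": matches, "mismatches": mismatches,
            "gaps": gaps, "identity": identity}


# The global and local examples from the text.
global_stats = alignment_stats("TAGC-GC-GT", "TA-CA-CAGT")
local_stats = alignment_stats("CGATAACGTAT", "--ATAAAC---")
print(global_stats)
print(local_stats)
```

For the global example this counts 6 matching columns and 4 gap columns out of 10; for the local example, 4 matches, 2 mismatches and 5 gap columns.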
Based on number of sequences – Depending on the number of sequences being compared, there are two methods:
1. Pairwise sequence alignment – This involves aligning two sequences to find the best region of similarity.
Seq 1 - 1 KTSSGNGAEDS 11
Seq 2 - 1 KTSSGNGAEDS 11
Various methods used to perform pairwise alignment of nucleotide and protein sequences:
Ø Dot Plot: A graphical method in which two sequences are placed along the axes of a grid, and regions of similarity and dissimilarity are shown by the presence or absence of dots.
Ø Dynamic Programming: This method breaks a problem into small sub-problems and uses the solutions of the sub-problems to compute the solution of the larger one. Algorithms such as Needleman-Wunsch (global alignment) and Smith-Waterman (local alignment) use this approach.
Ø Heuristic Method: When a single sequence is to be compared against a whole database, methods like BLAST and FASTA are used. They compare the query sequence against the database sequences and generate a result that shows the alignments of the query sequence with all matches, in decreasing order of similarity.
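The dynamic programming idea described above can be sketched in a few lines. This is an illustrative sketch only: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example, and the traceback step that recovers the actual alignment is omitted.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Fill the dynamic-programming matrix and return the best score.

    local=False follows the Needleman-Wunsch (global) recurrence;
    local=True follows the Smith-Waterman (local) recurrence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: gaps at the start of either sequence are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Each cell reuses the solutions of three smaller sub-problems.
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences; local: best cell anywhere.
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                        # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # local
```

The only differences between the two modes are the initialization of the first row and column, the floor of zero in each cell, and where the final score is read from.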
Some of the parameters used to produce an optimum alignment are as follows:
Ø Max target sequences – It sets the maximum number of aligned sequences displayed in the results.
Ø Expected Threshold – The E value is a statistical indicator giving the number of alignments with an equal or better score that are expected to occur by random chance. The lower the E value, the more significant the score. The default threshold is 10, meaning that up to 10 matches in the results are expected to be found by chance (stochastic model of Karlin & Altschul, 1990).
Ø Query match - It limits the number of matches reported for any one region of the query. This is useful when many strong matches to one part of the query would otherwise crowd out weaker matches to another part.
Ø Word size – The search algorithm works by finding word matches between the query and the database sequences. It searches for exact word matches and then extends them to a full-length alignment. A word size of 3 is optimum for standard protein alignment, and a word size of 2 is used for short and nearly exact matches.
Ø Scoring schemes – Different scoring schemes are devised to obtain an optimum alignment. A substitution matrix assigns a score to each possible pair of aligned residues, and these scores are combined to score the alignment. Different PAM and BLOSUM matrices are used to assess the quality of a pairwise sequence alignment.
- PAM (Point Accepted Mutation) – This is developed by calculating the amino acid substitutions that are naturally accepted during evolution. PAM30 is used for query sequences shorter than 35 residues, whereas PAM70 is used for sequences of 35 to 50 residues.
- BLOSUM (BLOcks SUbstitution Matrix) – This has been developed using conserved regions, called blocks, of distantly related protein sequences available in the BLOCKS database. Of these, the BLOSUM62 matrix is best for detecting most protein similarities. BLOSUM45 may be used for longer and weaker alignments.
Ø Gap costs – A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. Too many gaps should be avoided in an alignment, and hence a gap penalty (gap score) is assigned. Extending a gap to encompass additional nucleotides or amino acids is also penalized. The penalty is subtracted from the alignment score whenever a gap is introduced or extended, so increasing the gap costs results in fewer gaps in the alignment. The penalty for creating a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time. Commonly used values are existence 10 or 11 and extension 1.
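Two of the parameters above, word size and gap costs, lend themselves to a short illustration. The sketch below is a simplification under stated assumptions: `word_hits` and `affine_gap_penalty` are hypothetical helper names, only exact word matches are found (as described in the text), and a gap of length k is assumed to cost existence + k × extension. Some tools instead charge the extension penalty only from the second gapped residue onward.

```python
def word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an exact word match occurs."""
    # Index every word of the subject sequence by its start position.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    # Look up each word of the query; every hit is a seed for later extension.
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def affine_gap_penalty(length, existence=11, extension=1):
    """Penalty for a single gap spanning `length` residues (assumed convention)."""
    if length <= 0:
        return 0
    return existence + extension * length


print(word_hits("ACGT", "TTACGT", word_size=3))
print(affine_gap_penalty(3))                 # one gap of three residues
print(affine_gap_penalty(1) * 3)             # three separate one-residue gaps
```

With existence 11 and extension 1, one gap of three residues is penalized less than three separate single-residue gaps, which reflects the idea that insertions and deletions tend to span several residues at a time.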
2. Multiple sequence alignment (MSA) - This involves the alignment of more than two (protein/DNA) sequences and is used to assess the sequence conservation of protein domains and structures. It is an extension of pairwise sequence alignment that aligns several similar sequences together. Various analyses, such as homology modeling for the prediction of protein structure, phylogenetic analysis and motif detection, are based on the results of multiple sequence alignment.
Many programs, such as Clustal, T-Coffee, PHYLIP, MSA and MUSCLE, are used to obtain multiple sequence alignments. Example –
Seq 1 - PQGGGGWGQ
Seq 2 - P-HGGGWGQ
Seq 3 - P-HGGGWGQ
Seq 4 - P-HGGGWGQ
Seq 5 - P-HGGGWGQ
The following parameters should be considered while aligning multiple sequences:
- Protein weight matrix - The substitution matrix used to score aligned residues. It should be chosen so that the best alignment receives the highest score. E.g. PAM and BLOSUM.
- Gap open – The penalty to open a gap. The presence of a gap is frequently given more significance than the length of the gap. By default, the gap opening penalty is 10.
- Gap extension – The penalty to extend a gap. Extending a gap across additional amino acids is penalized in the scoring of the alignment. By default, the gap extension penalty is 0.20.
Application of MSA results:
Phylogenetic analysis – It is one of the major areas where multiple sequence alignment results are used to find the evolutionary relatedness between sequences. The results are displayed in the form of a phylogenetic tree, which consists of a set of nodes and the branches that link them.
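Distance-based tree building usually starts from pairwise distances computed on the aligned sequences. The following is a minimal sketch of that first step only, assuming the simple p-distance (the fraction of compared positions that differ, with gap columns skipped). The function names are hypothetical, and the tree construction itself is not shown.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing positions between two aligned sequences."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue            # assumption: gap columns are ignored
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else 0.0


def distance_matrix(aligned_seqs):
    """Symmetric matrix of pairwise p-distances."""
    n = len(aligned_seqs)
    return [[p_distance(aligned_seqs[i], aligned_seqs[j]) for j in range(n)]
            for i in range(n)]


# The first three sequences of the MSA example from the text.
msa = ["PQGGGGWGQ", "P-HGGGWGQ", "P-HGGGWGQ"]
matrix = distance_matrix(msa)
for row in matrix:
    print(row)
```

Here Seq 1 differs from the others at one of the eight compared positions, while Seq 2 and Seq 3 are identical, so they would be grouped together first in a tree.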
File format view:
- PHYLIP - PHYLogeny Inference Package. It is the format used by Joe Felsenstein’s phylogenetic applications, with a maximum of 8 letters for the sequence ID.
- Cladogram - In a cladogram, the external taxa line up neatly in a row. Their branch lengths are not proportional to the number of evolutionary changes and thus no evolutionary time analysis can be done. Only the relative ordering of the taxa can be analyzed.
- Phylogram - In a phylogram, the branch lengths represent the amount of evolutionary divergence. Such trees are said to be scaled.
- Pearson/FASTA - Text based format to represent amino acids in single letter code. It also has sequence names followed by comments.
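The Pearson/FASTA layout described above can be read with a few lines of code. This is a minimal sketch, assuming each record starts with a ">" header line whose first word is the sequence name and whose remainder is the comment. The function name `parse_fasta` and the sample record names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, comment, sequence) tuples from FASTA text."""
    records = []
    name = comment = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)   # sequence may be wrapped over several lines
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


example = """>seq1 hypothetical example protein
KTSSGNG
AEDS
>seq2
PQGGGGWGQ
"""
records = parse_fasta(example)
for record in records:
    print(record)
```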
Consensus Symbol indication:
1) An * (asterisk) indicates positions which have a single, fully conserved residue.
2) A : (colon) indicates conservation between groups of strongly similar properties - scoring > 0.5 in the PAM 250 matrix.
3) A . (period) indicates conservation between groups of weakly similar properties - scoring <= 0.5 in the PAM 250 matrix.
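The first of these symbols can be derived directly from an alignment. The sketch below marks only fully conserved columns with an asterisk. The ":" and "." symbols depend on residue similarity groups that are not reproduced here, so those columns are left blank. The function name `conservation_line` is hypothetical.

```python
def conservation_line(aligned_seqs, gap="-"):
    """Return a string with '*' under every fully conserved column."""
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        # Fully conserved: one distinct residue and no gap in the column.
        if len(residues) == 1 and gap not in residues:
            marks.append("*")
        else:
            marks.append(" ")
    return "".join(marks)


# The five-sequence MSA example from the text.
msa_example = ["PQGGGGWGQ", "P-HGGGWGQ", "P-HGGGWGQ", "P-HGGGWGQ", "P-HGGGWGQ"]
consensus = conservation_line(msa_example)
for seq in msa_example:
    print(seq)
print(consensus)
```

For the example alignment, the first column and the last six columns are fully conserved, while the second and third columns are not.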
The following is the color code of the amino acids according to the properties of the amino acids.
- Jalview - Java alignment editor. It is a visualization tool for alignment algorithms and other database search results.