Multiple Alignment on eBioKit & wEMBOSS

Teacher: Hans-Henrik Fuxelius

Introduction to ClustalW in wEMBOSS

The multiple sequence alignment tool Clustal W was developed by Julie Thompson and Toby Gibson (both at EMBL, Heidelberg, Germany) and Des Higgins (University College Cork, Cork, Ireland).

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns to characterise protein families; to detect or demonstrate homology between new sequences and existing families of sequences; to help predict the secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; and as an essential prelude to molecular evolutionary analysis. The rate of appearance of new sequence data is steadily increasing, and the development of efficient and accurate automatic methods for multiple alignment is therefore of major importance. The majority of automatic multiple alignments are now carried out using the "progressive" approach of Feng and Doolittle. The methods described here are implemented in a program called CLUSTAL W, which is freely available and portable to a wide variety of computers and operating systems.

The basic alignment method

The basic multiple alignment algorithm consists of three main stages:

  1. all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences
  2. a guide tree is calculated from the distance matrix
  3. the sequences are progressively aligned according to the branching order in the guide tree.
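
The order in which stage 3 visits the guide tree can be sketched in a few lines of Python. This is only an illustration: the nested-tuple tree format and the function name `merge_order` are assumptions made for this example and are not part of CLUSTAL W or wEMBOSS.

```python
def merge_order(tree):
    """List the pairwise merges implied by a rooted binary guide tree.

    A leaf is a sequence name (str); an internal node is a (left, right)
    tuple. Merges are returned tips first, root last.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)              # a single sequence
        left, right = node
        left_group = visit(left)        # finish everything below first
        right_group = visit(right)
        steps.append((left_group, right_group))
        return left_group + right_group

    visit(tree)
    return steps


# Hypothetical guide tree for five sequences A to E.
guide_tree = (("A", "B"), ("C", ("D", "E")))
for left_group, right_group in merge_order(guide_tree):
    print("align", left_group, "with", right_group)
```

Each printed step aligns two groups that were themselves completed in earlier steps, so the last step joins everything at the root.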

1) The distance matrix/pairwise alignments

In the original CLUSTAL programs, the pairwise distances were calculated using a fast approximate method. This allows very large numbers of sequences to be aligned, even on a modern laptop. The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap. CLUSTAL W offers a choice between this method and slower but more accurate scores from full dynamic programming alignments, using two gap penalties (for opening and extending gaps) and a full amino acid weight matrix. These slower scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). Both kinds of score are initially calculated as percent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0, which gives the number of differences per site. We do not correct for multiple substitutions in these initial distances.
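
The conversion from percent identity to distance can be illustrated with a minimal Python sketch. It assumes the two sequences have already been aligned pairwise (the alignment step itself is not shown), and the function names are chosen for this example only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue                    # gap positions are excluded
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


def identity_to_distance(pid):
    """Differences per site, with no correction for multiple substitutions."""
    return 1.0 - pid / 100.0


pid = percent_identity("ACGT-ACGT", "ACGTTACCT")
print(pid, identity_to_distance(pid))
```

In this toy pair, 8 columns are compared and 7 of them are identical, so the percent identity is 87.5 and the distance is 0.125 differences per site.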

2) The guide tree

The trees used to guide the final multiple alignment process are calculated from the distance matrix of step 1 using the Neighbour-Joining method. This produces unrooted trees with branch lengths proportional to estimated divergence along each branch. The root is placed by a "mid-point" method at a position where the means of the branch lengths on either side of the root are equal. These trees are also used to derive a weight for each sequence. The weights are dependent upon the distance from the root of the tree but sequences which have a common branch with other sequences share the weight derived from the shared branch.
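
One way to read the weighting rule is that each branch length is divided equally among the sequences below that branch, and a sequence's weight is the sum of its shares on the path from the root. The Python sketch below follows that reading. The tree representation, the function names and the branch lengths are all assumptions made for illustration; the tree is taken as already rooted, and any normalisation of the weights is left out.

```python
def count_leaves(node):
    """Number of sequences below a node (a leaf is a sequence name)."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _ in node)


def sequence_weights(node, inherited=0.0, weights=None):
    """Sum each sequence's share of every branch between it and the root.

    An internal node is a list of (child, branch_length) pairs.
    """
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        # A branch shared by n sequences gives each of them length / n.
        share = length / count_leaves(child)
        sequence_weights(child, inherited + share, weights)
    return weights


# Hypothetical rooted tree: A and B share a branch of length 0.3.
rooted_tree = [([("A", 0.2), ("B", 0.1)], 0.3), ("C", 0.5)]
print(sequence_weights(rooted_tree))
```

Here A receives 0.2 + 0.3/2 = 0.35, B receives 0.1 + 0.3/2 = 0.25, and C, which shares no branch, receives 0.5. Closely related sequences are therefore down-weighted relative to isolated ones.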

3) Progressive alignment

The basic procedure at this stage is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree. You proceed from the tips of the rooted tree towards the root. At each stage a full dynamic programming algorithm is used with a residue weight matrix and penalties for opening and extending gaps. Each step consists of aligning two existing alignments or sequences. Gaps that are present in older alignments remain fixed. In the basic algorithm, new gaps that are introduced at each stage get full gap opening and extension penalties, even if they are introduced inside old gap positions. In order to calculate the score between a position from one sequence or alignment and one from another, the average of all the pairwise weight matrix scores from the amino acids in the two sets of sequences is used, i.e. if you align 2 alignments with 2 and 4 sequences respectively, the score at each position is the average of 8 (2x4) comparisons. If either set of sequences contains one or more gaps in one of the positions being considered, each gap versus a residue is scored as zero. The default amino acid weight matrices we use are rescored to have only positive values. Therefore, this treatment of gaps gives a residue versus a gap the worst possible score.
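
The scoring of one alignment position against another can be sketched as follows. The tiny matrix `TOY_MATRIX` holds made-up positive values for three amino acids and is not a real weight matrix; the function names are likewise assumptions for this example. The sketch covers only the score for a single pair of columns, not the dynamic programming around it.

```python
# Hypothetical, all-positive scores for three residues (not a real matrix).
TOY_MATRIX = {
    ("L", "L"): 8, ("I", "I"): 8, ("V", "V"): 8,
    ("L", "I"): 6, ("L", "V"): 5, ("I", "V"): 7,
}


def pair_score(matrix, x, y):
    """Symmetric lookup of the score for residues x and y."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def column_score(col_a, col_b, matrix, gap="-"):
    """Average of all pairwise scores between two alignment columns.

    Any comparison involving a gap contributes zero to the sum but still
    counts in the average.
    """
    total = 0.0
    for x in col_a:
        for y in col_b:
            if x == gap or y == gap:
                continue                # gap versus residue scores zero
            total += pair_score(matrix, x, y)
    return total / (len(col_a) * len(col_b))


# One column from a 2-sequence alignment against one from a 4-sequence one.
print(column_score(["L", "I"], ["L", "L", "V", "-"], TOY_MATRIX))
```

The example makes 2x4 = 8 comparisons, two of which involve the gap and add nothing, so the score is (8 + 8 + 5 + 6 + 6 + 7) / 8 = 5.0. Because every matrix entry is positive, a gap can only lower the average.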

Selected parts taken from [CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson]

Below is the wEMBOSS splash screen. The ClustalW program is found under "Alignment->Multiple->emma". To view an alignment, open Jalview under "Project Files->View with->Jalview".

wEMBOSS start page

Below is a window from Jalview. It can be downloaded and run as a standalone application on OS X or Windows. If you have problems running Jalview from inside wEMBOSS, install it locally.

Jalview window