The Phylogeny of HIV
(Based on wEMBOSS & eBioKit)
Teacher: Hans-Henrik Fuxelius
Overview
During this exercise you will:
- Perform a multiple alignment of gp120 protein
sequences from HIV and SIV (using Clustal)
- Construct an unrooted tree from the alignment of
gp120 sequences (using the "neighbor joining" algorithm in Clustal)
- Visualize the gp120-based tree using the program Jalview
- Consider the evolutionary implications of the
gp120-based tree
- Investigate the robustness of your tree by
bootstrapping
- Perform a second multiple alignment based on POL
protein sequences from HIV and SIV.
- Construct a new neighbor joining tree from the POL
alignment
- Investigate whether the POL-based tree supports the
conclusions from the gp120-based analysis
- Perform a multiple alignment of the same POL
sequences and a POL sequence from HTLV-1.
- Root the POL-based tree using the HTLV sequence as an
outgroup
Background: AIDS, HIV1, HIV2, and SIV
Acquired Immune Deficiency Syndrome (AIDS) is caused by two
divergent viruses, Human Immunodeficiency Virus type 1 (HIV-1) and
Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for
the global pandemic, while HIV-2 has, until recently, been restricted to
West Africa and appears to be less virulent in its effects. Viruses
related to HIV have been found in many species of non-human primates
(monkeys, apes, ...) and have been named Simian Immunodeficiency Virus,
SIV.
These primate viruses are lentiviruses, a subfamily of
the retroviruses. Retroviruses have RNA genomes but are unique among
RNA viruses because they have a replication cycle that involves the
reverse transcription of their RNA genome into DNA (this is the
opposite direction compared to the usual flow of information from
DNA to RNA). The reverse-transcribed viral DNA is stably
incorporated into the genomic DNA of an infected cell and subsequent
transcription can then create multiple copies of mRNA encoding new
viral material.
Like other retroviruses, particles of HIV are made up of
2 copies of the single-stranded RNA genome packaged inside a protein
core, or capsid. The core particle also contains viral proteins that
are essential for the early steps of the virus life cycle, such as
reverse transcription and integration. A lipid envelope, derived from
the infected cell, surrounds the core particle. Embedded in this
envelope are the surface glycoproteins of HIV: gp120 and gp41. The
gp120 protein is crucial for binding of the virus particle to target
cells. It is the specific affinity of gp120 for the CD4 protein that
targets HIV to those cells of the immune system that express CD4 on
their surface (e.g., T-helper lymphocytes, monocytes, and
macrophages).
Purpose of exercise, description of data:
In this exercise you are going to investigate the
phylogenetic relationship
between HIV and SIV and consider its evolutionary implications.
You will do this using two different data sets:
- A set consisting of 27 different gp120 protein
sequences from isolates of
HIV1, HIV2, chimpanzee SIV and macaque monkey SIV: gp120.fasta
- A set consisting of 20 different POL-polyprotein
sequences from HIV1, HIV2, chimpanzee SIV and sooty mangabey
SIV: hiv-siv-pol.fasta
(the same set with the HTLV-1 sequence added: htlv-hiv-siv-pol.fasta)
(Note for enthusiasts: a number of lines of evidence
have indicated that
macaques are not naturally infected with SIV and that they have
acquired their SIV
infection while in captivity by cross-species transmission of SIV from
sooty
mangabeys. This means that both the macaque SIVs and the sooty mangabey
SIVs originate from sooty mangabeys).
Finally - The Exercise:
First, we will use the file gp120.fasta, which contains
27 gp120 envelope
protein sequences from isolates of HIV-1, HIV-2, and SIV in
FASTA format.
- Create a working directory called multalign on your hard disk,
download the gp120 file to this directory and take
a look at its contents (using a text editor like Notepad or
NEdit, or preferably a sequence alignment editor like Jalview with ClustalX colouring).
In this file, all HIV-1 sequences have names starting
HV1. All HIV-2 sequences
have names starting HV2. SIVCZ was isolated from chimpanzee. SIVMK,
SIVM1, and
SIVML were isolated from macaques.
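If you prefer to check the naming scheme programmatically rather than by eye, the following is a minimal sketch in Python. It assumes nothing beyond plain FASTA text; the function names read_fasta and count_by_prefix and the three toy records are made up for illustration and are not part of the course material. To use it on the real data you would read the contents of gp120.fasta into the text variable instead.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after the ">" sign.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated onto the current record.
            records[name] += line
    return records


def count_by_prefix(names, prefix_len=3):
    """Count sequence names by their first prefix_len characters."""
    return Counter(name[:prefix_len] for name in names)


# Toy records (invented, NOT real gp120 sequences).
example = """>HV1_TOY
MRVKEK
YQHLWR
>HV2_TOY
MMNQLL
>SIVCZ_TOY
MKVMEK
"""

records = read_fasta(example)
counts = count_by_prefix(records)
print(counts)
```

With the real file, the counts for the prefixes HV1, HV2 and SIV should add up to the 27 sequences mentioned above.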
Multiple alignment
We will use the program Clustal (named emma in wEMBOSS) to make a multiple
alignment of the
virus sequences.
- Load the sequences
into ClustalW (wEMBOSS->emma):
- The first thing you have to do is load the sequences. In the
wEMBOSS menu
choose "Alignment->multiple->emma", and select gp120.fasta
from the
multalign directory.
- In emma the sequences are displayed on the
screen together with all available parameter settings. Use the preset
values and run the program. Scroll down the pop-up window and
right-click to save the resulting alignment to your multalign folder,
naming it "gp120_emma.fas". Take a minute to view gp120_emma.fas in
Jalview, apply a useful colour scheme and look for areas of similarity.
You will notice that the conservation graph at the bottom of the window
has several peaks and plateaus corresponding to the conserved regions
of gp120. Above the sequences there is a ruler running from 1 at the
first residue position to the last position. Keep Jalview open; it will be
used in step 2 below.
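To get a feeling for what peaks and plateaus in a conservation graph mean, here is a small Python sketch. Note that it is a deliberate simplification: Jalview computes its conservation graph differently, whereas this sketch only reports, for each alignment column, the fraction of sequences that share the most common residue. The function name column_identity and the three short aligned sequences are invented for illustration.

```python
from collections import Counter


def column_identity(aligned):
    """For each column, return the fraction of sequences that carry the
    most common residue. Gaps ("-") never count as a residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in aligned if seq[i] != "-"]
        if not residues:
            scores.append(0.0)  # column consisting of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(aligned))
    return scores


# Toy alignment (invented, NOT real gp120 data).
toy = ["MKV-EK", "MRVAEK", "MKVAQK"]
scores = column_identity(toy)
print([round(s, 2) for s in scores])
```

Columns where every sequence has the same residue score 1.0; these correspond to the plateaus you see under conserved regions.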
- Computing an unrooted tree:
In this part of the exercise we will use Jalview with the
gp120_emma.fas (from above) to
produce a phylogenetic
tree. The tree is built with the neighbour joining algorithm, and
is
based on distances computed from the multiple alignment you just
constructed.
- Use "Jalview->Select->Select all" to select all
sequences.
- Use "Jalview->Calculate->Calculate
tree->Neighbour Joining using BLOSUM62" to calculate the tree.
- You now get a tree view in a new pop-up window. Unselect
"View->fit to window" and enlarge it for an easy view.
- Save the tree file with "File->Save As->Newick Format" in the multalign
directory and call it "gp120_emma.tree".
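For those who want to see what the neighbour joining algorithm does with a distance matrix, here is a minimal Python sketch. It is not the code Jalview runs: Jalview first derives distances from the alignment using BLOSUM62, while this sketch starts from a ready-made distance matrix. The function name neighbor_joining, the five labels a to e and the distance values are all invented for illustration.

```python
import itertools


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbour joining.

    names: list of at least three labels.
    dist:  dict mapping frozenset({x, y}) -> distance between x and y.
    Returns a Newick string with a trifurcation at the top level.
    """
    if len(names) < 3:
        raise ValueError("need at least three sequences")
    d = dict(dist)
    nodes = list(names)  # each node is stored as its Newick substring
    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each node to all other nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}
        # Pick the pair that minimises the neighbour joining criterion.
        i, j = min(itertools.combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to their new parent node u.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"({i}:{li:g},{j}:{lj:g})"
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Join the last three nodes at a central point.
    a, b, c = nodes
    dab, dac, dbc = (d[frozenset((a, b))], d[frozenset((a, c))],
                     d[frozenset((b, c))])
    la = (dab + dac - dbc) / 2
    lb = (dab + dbc - dac) / 2
    lc = (dac + dbc - dab) / 2
    return f"({a}:{la:g},{b}:{lb:g},{c}:{lc:g});"


# Toy distance matrix (invented values).
labels = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
dist = {frozenset((labels[i], labels[j])): matrix[i][j]
        for i in range(5) for j in range(i + 1, 5)}
tree = neighbor_joining(labels, dist)
print(tree)
```

The printed Newick string has three branches at the top level rather than two, which is the usual way to write down a tree that has no root.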
- View a plot of the unrooted tree:
There are several programs for visualizing
tree files like gp120_emma.tree. Today we will use the Java
version of the program Dendroscope, which can be downloaded for
OS X or Windows.
Jalview is very nice for viewing trees, but Dendroscope is the better
choice for advanced features, editing and publication-ready printing.
- Download and install Dendroscope.
- Start Dendroscope. Dendroscope starts out by asking for a
tree file. Select the gp120_emma.tree file.
- Select "Dendroscope->Tree" and "Draw Radial Phylogram",
which is an appropriate view since
the tree is not rooted.
(Jalview cannot display trees in this mode.)
- Think for a minute about the implications!
What does this tree tell us about the phylogenetic
relationship of HIV-1, HIV-2 and SIV? Notice especially where the two
different
groups of SIV cluster compared to the two different groups of HIV.
When you've thought about the problem, you can read
a brief
explanation. Additionally, you can find a good description of
HIV
evolution here:
http://evolution.berkeley.edu/evolibrary/article/0_0_0/medicine_04
Bootstrapping a neighbor joining tree
- wEMBOSS can also bootstrap your neighbor
joining tree:
- Reopen your gp120_emma.fas alignment from before in
"wEMBOSS->PHYLOGENY->SEQUENCE->fseqboot" and run it with the default
values. Look through the result file in the pop-up window. Notice
that it contains the same species resampled 100 times. This will allow
Dendroscope to visualize the bootstrap values, as it will recognize
bootstrap labels in PHYLIP files (fseqboot is from the PHYLIP package)
where the labels are located on branches. If your program instead
offers "Bootstrap NJ-tree" in its Trees menu, that choice opens a
window where you can change the number of resampled data sets. The
default there is 1000, but you may want to change this to 100 in
order not to wait too long.
- Save the bootstrapped file with a right-click; save it in
multalign and call it "gp120_bootstrap.fas".
- Open "gp120_bootstrap.fas" in
"wEMBOSS->PHYLOGENY->SEQUENCE->fprotdist" and run it. Save the result
with a right-click and name it "gp120_protdist.txt". This step calculates
all distances between the sequences. It might take some time, since it
repeats the calculation for every resampled data set.
- Open "gp120_protdist.txt" in
"wEMBOSS->PHYLOGENY->CHARACTERS->Distance
Matrix->fneighbor" and run it. Browse through the pop-up window and
save the tree file, naming it "gp120_protdist.treefile".
- Open "gp120_protdist.treefile" in
"wEMBOSS->PHYLOGENY->CONSENSUS->fconsense" and run it.
Browse through the pop-up window and save the consensus tree file,
naming it "gp120_protdist.concensus".
- View the bootstrapped tree:
- Load the "gp120_protdist.concensus" file (the consensus tree)
in Dendroscope.
Use "Dendroscope->Show->Edge weights" to see the bootstrap
values. Use "Dendroscope->Tree->Radial Phylogram". You should now
be able to see the tree with the values
attached to all internal branches. Remember that each number tells how
often, among the resampled data sets, the sequences were
divided into the two groups present on either side of that branch.
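The meaning of a bootstrap value can be illustrated with a few lines of Python. This is only a sketch of the counting idea, not what fconsense does internally: trees are written as nested tuples, and the function names clades and split_support as well as the labels A to D are invented for illustration. A split counts as present in a tree if either of its two sides appears as a group, so the result does not depend on how the tree happens to be drawn.

```python
def clades(tree):
    """Return (all leaves, set of leaf groups) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def split_support(trees, group):
    """Count the trees in which `group` is separated from all other leaves."""
    all_leaves, _ = clades(trees[0])
    group = frozenset(group)
    complement = all_leaves - group
    hits = 0
    for tree in trees:
        _, found = clades(tree)
        if group in found or complement in found:
            hits += 1
    return hits


# Three toy replicate trees (invented labels).
replicate_trees = [
    (("A", "B"), ("C", "D")),
    ("A", ("B", ("C", "D"))),   # same split A,B | C,D, drawn differently
    (("A", "C"), ("B", "D")),   # a different split
]
hits = split_support(replicate_trees, {"A", "B"})
print(f"{hits} of {len(replicate_trees)} replicate trees contain the split")
```

With 100 resampled data sets, a value of 100 on a branch therefore means that every replicate tree separated the same two groups.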
- Test: What is the bootstrap value on the
internal branch separating
the HIV1/SIVCZ cluster from the HIV2/SIVMK cluster? Note the value on
the form.
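If you are curious about the resampling step that starts the bootstrap procedure above, the idea can be sketched in Python as follows. This is an illustration of column resampling with replacement, not the fseqboot implementation; the function name bootstrap_alignment, the fixed random seed and the toy alignment are assumptions made for this example.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Return one pseudo-replicate of an alignment.

    Columns are drawn at random with replacement until the replicate is
    as long as the original, so some columns occur several times and
    others not at all. Every sequence uses the same drawn columns.
    """
    length = len(next(iter(aligned.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in aligned.items()}


# Toy alignment (invented, NOT real gp120 data).
toy = {"HV1_TOY": "MKVAEK", "HV2_TOY": "MRVAEK", "SIV_TOY": "MKVAQK"}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

Each of the 100 replicates has the same sequence names and the same length as the original, which is why a distance matrix and a tree can be computed for every one of them.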
Rooting a tree using an outgroup
In this part of the exercise you will use a data set of
20 different
POL-polyprotein sequences isolated from HIV-1, HIV-2, chimpanzee SIV,
and sooty
mangabey SIV. (The Pol gene encodes three different polypeptides:
integrase,
reverse transcriptase, and protease. It is expressed as a single
polyprotein and is
subsequently cleaved by protease into its three separate parts).
First, you will construct a neighbor-joining tree like
before and investigate
whether this new, independent data set confirms the conclusions you
made based on
the alignment of gp120 sequences. Then you will add a POL-polyprotein
sequence from
HTLV-1 to the data set and construct a new tree, that you can then root
using the
HTLV sequence as an outgroup. (HTLV-1 is another member of the
retrovirus family but is more distantly related to HIV, which,
by the way, was originally named HTLV-3.)
- Download and have a look at the POL sequence file:
Download the aligned hiv-siv-pol.aln
file to the working directory, and inspect the alignment with a text
editor or an alignment viewer such as Jalview. As mentioned, this file
contains POL-polyprotein sequences from HIV-1, HIV-2,
chimpanzee SIV, and sooty mangabey SIV.
- Construct a neighbor-joining tree with no outgroup:
Re-open the Jalview window, load the sequence
file hiv-siv-pol.fasta and build the tree by
choosing:
- "Jalview->File->Input alignment->From file"
- Browse through the alignment and apply a colour scheme to it.
- "Jalview->Select->Select all"
- "Jalview->Calculate->Calculate tree->Neighbour
Joining using BLOSUM62"
- "File->Save as->Newick format", name it "hiv_siv_pol.tree";
it now contains the tree from the neighbour-joining calculation.
- Inspect the unrooted tree in Dendroscope:
- Open "hiv_siv_pol.tree" in Dendroscope
- "Dendroscope->Tree->Draw radial phylogram"
- This tree has been constructed from an entirely
independent set of sequences.
Does it support the conclusions that could be made from the gp120-based
tree?
- Test: Make a sketch of the POL-based tree
(again, just loosely indicate
the position of the HIV1-cluster, the HIV2-cluster, the SIVCZ sequence
and the sooty mangabey SIV sequences).
- Construct a neighbor-joining tree with an added
outgroup:
- Download the aligned htlv-hiv-siv-pol.aln
file to the working directory, and inspect the alignment with a text
editor or Jalview.
This file contains the same sequences as the file hiv-siv-pol.fasta
plus an additional
POL-sequence from the
related virus HTLV-1 (the first sequence in this file). The HTLV POL
sequence will be used as an outgroup in this part of the exercise.
- "Jalview->File->Input alignment->From file"
- Browse through the alignment and apply a colour scheme to it.
- "Jalview->Select->Select all"
- "Jalview->Calculate->Calculate tree->Neighbour
Joining using BLOSUM62"
- "File->Save as->Newick format", name it "htlv-hiv-siv-pol.tree";
it now contains the tree from the neighbour-joining calculation.
- Inspect the unrooted tree in Dendroscope:
- Open "htlv-hiv-siv-pol.tree" in Dendroscope
- "Dendroscope->Tree->Draw rectangular phylogram"
- Observe how the outgroup HTLV is located quite
distantly from the other
sequences.
- Define outgroup:
We will now use the same data for constructing a
rooted tree, using the HTLV sequence as a way of defining where to
place the root.
Open "htlv-hiv-siv-pol.tree" in Dendroscope if it is not already open.
To place the root, select the HTLV branch of the tree with the mouse
(it will be highlighted in red) and choose "EDIT" -> "Reroot".
The outgroup will be used to place the root of the
tree. The rationale
is as follows: our data set consists of sequences from HIV-1, HIV-2,
SIV
and HTLV. We know from other evidence that the lineage leading to HTLV
branched off before any of the remaining viruses diverged from each
other. The root of the tree connecting the viruses investigated here
must therefore be located between the HTLV sequence (the "outgroup")
and the rest (the "ingroup"). This way of finding a root is called
"outgroup rooting", and constructs a tree where the outgroup is a
monophyletic sister group to
the ingroup.
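The logic of outgroup rooting can also be written down in a few lines of Python. This is a sketch under stated assumptions, not the procedure Dendroscope uses: the unrooted tree is given as a list of branches, the root is placed halfway along the outgroup branch (an arbitrary choice made here only to get branch lengths), and the function name root_with_outgroup, the node names n1 to n3 and the labels OUT and A to D are invented for illustration.

```python
def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup.

    edges: list of (node_a, node_b, branch_length) tuples.
    Returns a rooted Newick string with the outgroup as one child of
    the root and everything else (the ingroup) as the other child.
    """
    adjacent = {}
    for a, b, length in edges:
        adjacent.setdefault(a, []).append((b, length))
        adjacent.setdefault(b, []).append((a, length))
    if len(adjacent[outgroup]) != 1:
        raise ValueError("the outgroup must be a leaf of the tree")
    neighbour, length = adjacent[outgroup][0]

    def subtree(node, parent):
        """Newick text for everything reachable from node, away from parent."""
        children = [(n, l) for n, l in adjacent[node] if n != parent]
        if not children:
            return node  # a leaf
        inner = ",".join(f"{subtree(n, node)}:{l:g}" for n, l in children)
        return f"({inner})"

    half = length / 2  # assumption: root placed midway on the outgroup branch
    return f"({outgroup}:{half:g},{subtree(neighbour, outgroup)}:{half:g});"


# Toy unrooted tree (invented labels and branch lengths).
edges = [
    ("OUT", "n1", 4.0),
    ("n1", "n2", 1.0),
    ("n1", "n3", 1.0),
    ("n2", "A", 0.5),
    ("n2", "B", 0.5),
    ("n3", "C", 0.5),
    ("n3", "D", 0.5),
]
rooted = root_with_outgroup(edges, "OUT")
print(rooted)
```

Whatever the shape of the ingroup, the root always ends up with exactly two children: the outgroup on one side and all remaining sequences on the other.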
The results from the rooting service show the original tree(s) first
and the constructed rooted tree(s) at the bottom.
Test: On the sketch you made before,
indicate which branch the
root is located on. Was this where you expected it?
- What can now be said in general about the evolution of
HIV, in particular in relation to humans?