Inferring horizontal gene transfer

Matt Ravenhall (1), Nives Škunca (tba), Christophe Dessimoz (tba) *

(1) To be added

(2) To be added


 * Corresponding author, email: to be added

Horizontal or Lateral Gene Transfer (HGT or LGT) occurs when a host genome obtains foreign DNA in a process that circumvents vertical inheritance. In contrast to vertical transfer, HGT is not strictly intraspecific; the presence of HGT events can therefore complicate investigations of evolutionary history. Furthermore, HGT events are often implicated in the transfer of antibiotic resistance and pathogenicity.

Computational identification of HGT events relies upon the investigation of sequence composition or evolutionary history of genes. Sequence composition-based (parametric) methods search for deviations from the genomic average whereas evolutionary history-based (phylogenetic) approaches identify genes whose evolutionary history significantly differs from that of the host species. Benchmarking for both types of methods typically relies upon simulated genomes. Currently, different inference methods tend to identify conflicting sets of HGT events, and it can be difficult to ascertain all but very simple HGT events.

Introduction
Initial discoveries of horizontal gene transfer events relied upon observation of a particular trait, such as virulence, moving from trait-positive to trait-negative organisms. Most prominently, the first evidence that genetic information could pass between bacteria was witnessed in 1928, when Frederick Griffith demonstrated that virulence was able to pass to non-virulent strains of Streptococcus pneumoniae in what is now known as "Griffith's Experiment". Later evidence for conjugation and transduction, the other known mechanisms of horizontal gene transfer, was found in the 1940s and 1950s through similar observations. Given that HGT is not limited to the transfer of externally visible traits, other methods are required to detect these events. Contemporary methods rely upon genomic data and can broadly be separated into two groups: parametric and phylogenetic.

Parametric methods identify sections of a genome that differ significantly from a genomic average, such as GC percentage or codon usage, whilst phylogenetic approaches examine evolutionary histories and identify conflicting phylogenies. Phylogenetic methods can be further divided into those that reconstruct and compare phylogenetic trees explicitly, and those that use surrogate measures in place of the phylogenetic trees. Whilst the parametric approaches benefit from only requiring the genome under study they are limited, due to amelioration of transferred sequences, to only discovering recent HGT events and must take intra-genomic variation into account to reduce false positives. On the other hand, the phylogenetic approaches benefit from many recently sequenced genomes and hold an improved ability to identify the specific donor, time and direction of horizontal transfer but also bring significant computational demands, especially when considering explicit methods.

Significant improvements have occurred through combining different parametric methods, offering a potential solution to discrepancies between different methods. Future advances may occur through wider combinations, perhaps with parametric and phylogenetic methods together. For this a database could effectively collate benchmarking data for future evaluation. Advances in HGT detection should also benefit from future increases in computational power but must be founded upon effective calculation of false positive and negative results. Simulated genomes with known non-native insertions provide a powerful evaluation tool but it is vital that non-native sequences are incorporated as realistically as possible to draw meaningful conclusions.

Parametric methods


Many aspects of genome sequence composition are species-specific "genomic signatures" which can be utilised to identify sequences that have arrived through horizontal transfer. Commonly used signatures include nucleotide composition, oligonucleotide frequencies, or structural features of the genome. Parametric HGT inference methods identify fragments of a genome with atypical signatures.

A requirement for parametric methods is that the host's genomic signature is clearly recognizable whilst still taking intra-genomic variability into account, so as to reduce false positives. For example, it has been observed that the GC content of the third codon position is lower closer to the replication terminus. False positives may also occur for genes with significantly high or low rates of expression as GC content has been found to be higher with greater expression. Larger sliding windows can account for this variability at the cost of a reduced ability to detect smaller HGT regions.

Just as importantly, horizontally transferred segments need to exhibit the donor's genomic signature, although specific identification of the donor still may not be possible. This requirement can represent an issue for ancient transfer events as the transferred segments will have been subjected to the same mutational processes as the rest of the host genome, causing their distinct signatures to ameliorate. A parametric approach will therefore be restricted to the identification of more recent transfers. Similarly, if the inserted segment was previously adapted to the host's genome, as is the case for prophage insertions, the power of sequence composition methods in detecting HGT is reduced.

One notable example of the potential flaws with parametric methods is that of Bdellovibrio bacteriovorus, a predatory δ-Proteobacterium. Its initial analysis, based on homogeneous GC content, found that its genome is resistant to HGT. However, subsequent research with phylogenetic analysis identified a number of ancient HGT events in the genome.

Nucleotide composition


Bacterial GC content falls within a wide range, with Carsonella ruddii having a GC content of 16.5% and Anaeromyxobacter dehalogenans having a GC content of 75%. Even within a closely related group of α-Proteobacteria, values range from about 30% to about 65%. These differences can be exploited when detecting HGT events, as a significantly different GC content for a genome segment can be an indication of foreign origin.
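As a rough illustration of this idea, the following Python sketch scans a genome with a sliding window and reports windows whose GC fraction departs from the genome-wide value. The function names, the window and step sizes, and the 0.10 cutoff are illustrative assumptions rather than values taken from any published method; a real analysis would also need to model intra-genomic variation.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(genome, window=5000, step=500, threshold=0.10):
    """Return (start, gc) for windows whose GC fraction differs from the
    whole-genome GC fraction by more than `threshold` (an assumed cutoff)."""
    genome_gc = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - genome_gc) > threshold:
            flagged.append((start, gc))
    return flagged
```

A larger `window` smooths out local variation in the host genome but makes short inserted regions harder to see, which is the trade-off described for sliding windows in general.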

Oligonucleotide frequencies
Oligonucleotide frequency varies less within a genome than between genomes and therefore represents a valid genomic signature. Any deviation from this signature suggests that a genomic segment may have arrived through horizontal transfer. This discriminatory power relies upon the large number of possible oligonucleotides. To demonstrate, if 'n' is the size of the vocabulary and 'w' is the oligonucleotide size, the number of possible distinct oligonucleotides is n^w; for example, there are 4^4 = 256 possible tetranucleotides.

One of the first detection methods used in methodical assessments of HGT was codon usage bias, which uses trinucleotide frequencies. This approach requires a host genome with a strong bias towards certain synonymous codons (different codons which code for the same amino acid), and this bias must be distinct from the bias found within the donor genome. In contrast, the simplest oligonucleotide used as a genomic signature is the dinucleotide. For example, the third nucleotide in a codon and the first nucleotide in the following codon form the dinucleotide least restricted by amino acid preference and codon usage.
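The two signatures mentioned here can be tabulated with a few lines of Python. This is a minimal sketch that assumes the input is a coding sequence already in reading frame 0; the function names are hypothetical, and comparing the resulting counts between a gene and the rest of the genome is left to the reader.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in a coding sequence read in frame from position 0."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def third_first_dinucleotides(cds):
    """Count dinucleotides made of codon position 3 and position 1 of the
    following codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i + 2:i + 4] for i in range(0, usable - 3, 3))
```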

Optimising the size of the sliding window is of great importance, as a larger sliding window can better account for the variability in the host genome at the cost of a reduced ability to detect smaller HGT regions. To balance reliability with computational demand, one suggestion is to use tetranucleotide frequencies in a sliding window, such as 5 kb with a step of 0.5 kb.
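The sliding-window scheme can be sketched as follows, using the 5 kb window and 0.5 kb step quoted above as defaults. The distance measure (sum of absolute frequency differences) and the function names are assumptions made for this illustration; published methods use a variety of distance measures and significance thresholds.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Relative frequency of every DNA k-mer (4**k entries) in `seq`."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # k-mers with other symbols are ignored
    return [counts[m] / total if total else 0.0 for m in kmers]


def window_divergence(genome, window=5000, step=500, k=4):
    """Return (start, distance) per window, where distance is the sum of
    absolute differences between window and whole-genome k-mer frequencies."""
    genome_freq = kmer_frequencies(genome, k)
    result = []
    for start in range(0, len(genome) - window + 1, step):
        win_freq = kmer_frequencies(genome[start:start + window], k)
        dist = sum(abs(a - b) for a, b in zip(win_freq, genome_freq))
        result.append((start, dist))
    return result
```

Windows with the largest distances are the candidates for atypical composition; where to draw the line between typical and atypical is a separate decision that this sketch does not make.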

A more complex method of characterising a genomic signature utilises a set of typical host genes. In a Markov model-based approach, a transition probability matrix is derived using the typical genes; in a Bayesian model, the posterior probabilities of a sequence are calculated based upon the typical genomic signatures.
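To make the Markov model variant concrete, the sketch below trains a first-order transition matrix on a set of typical genes and scores a candidate sequence by its mean log-likelihood per transition. The model order, the pseudocount and the function names are assumptions chosen for brevity; actual methods may use higher-order models and formal significance tests.

```python
import math

BASES = "ACGT"


def train_transitions(typical_genes, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)
    from a list of typical host gene sequences."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for gene in typical_genes:
        gene = gene.upper()
        for a, b in zip(gene, gene[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }


def mean_log_likelihood(seq, transitions):
    """Average log-probability per transition; lower values indicate a
    sequence that looks less like the typical genes."""
    seq = seq.upper()
    pairs = [(a, b) for a, b in zip(seq, seq[1:])
             if a in transitions and b in transitions[a]]
    if not pairs:
        return float("-inf")
    return sum(math.log(transitions[a][b]) for a, b in pairs) / len(pairs)
```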

Structural features
Just as the nucleotide composition of a DNA molecule can be represented by a sequence of letters, its structural features can be encoded in a numerical sequence. The structural features include interaction energies between neighbouring base pairs, the twist that makes two bases of a pair non-coplanar, or DNA deformability induced by the proteins shaping the chromatin. The autocorrelation analysis of this numerical sequence shows characteristic periodicities in complete genomes. For instance, upon detecting archaea-like regions in the thermophilic bacterium Thermotoga maritima, periodicity spectra of these regions were compared to the periodicity spectra of the homologous regions in the archaeon Pyrococcus horikoshii. The revealed similarities in the periodicity were strong supporting evidence for a case of massive HGT between two domains: bacteria and archaea.
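The autocorrelation step can be illustrated with a short Python function. It operates on any numerical sequence; how structural values are assigned to the DNA sequence is deliberately left out, since that encoding depends on the structural model used. The function name and normalisation are assumptions for this sketch.

```python
def autocorrelation(values, max_lag):
    """Normalised autocorrelation of a numeric sequence for lags 0..max_lag.
    Peaks at lag p suggest a periodicity of p positions."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0:
        return [0.0] * (max_lag + 1)
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]
```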

Genomic context
The existence of genomic islands, short (typically 10-200kb long) regions of a genome which have been acquired horizontally, lends support to the ability to identify non-native genes by their location in a genome. For example, a gene of ambiguous origin which forms part of a non-native operon could be considered to be non-native. Alternatively, flanking repeat sequences or the presence of nearby integrases or transposases can indicate a non-native region. A context-aware approach has been considered as a secondary identification method, after removal of genes which are significantly native or non-native through the use of other parametric methods.

Explicit phylogenetic methods


The aim of explicit phylogenetic methods is to compare phylogenetic trees for various genes with the tree for their associated species. Significant differences between the two can be suggestive of an HGT event. Such an approach can produce more detailed results than parametric approaches because the involved species, time and direction of transfer can potentially be identified.

As discussed in more detail below, phylogenetic methods range from simple and efficient methods of discordance identification to complex mechanistic models that infer probable sequences of HGT events. An intermediate strategy consists of deconstructing the gene tree into smaller parts until it matches the species tree (genome spectral approaches). Explicit phylogenetic methods rely upon the accuracy of the input species and gene trees. However, the computational complexity of reconstructing a well-resolved, rooted gene tree or species tree can be a challenge.

Even if the input trees are not in doubt, conflicting phylogenies can be the result of evolutionary processes other than HGT, such as gene duplication and loss (which can result in undetected paralogy) or incomplete lineage sorting. An additional complication arises if the donor species is not represented among the set of species (or their ancestors) considered.

Tests of topologies
To detect sets of genes that fit poorly to the reference tree, one can use statistical tests of topology, such as Kishino-Hasegawa (KH), Shimodaira-Hasegawa (SH), and Approximately Unbiased (AU). These tests assess the likelihood of the gene sequence alignment when the reference topology is given as the null hypothesis.

The rejection of the reference topology is an indication that the evolutionary history for that gene family is inconsistent with the reference tree. When these inconsistencies cannot be explained using a small number of non-horizontal events such as gene loss or mutational change, a HGT event is inferred.

[Note: check for likelihood ratio tests as well; include EEEP, which uses Bayesian posterior probability.]

One such analysis checked for HGT in groups of homologs, as best bidirectional hits, of the γ-Proteobacterial lineage. Here six reference trees were reconstructed using either the highly conserved small subunit ribosomal RNA sequences, a consensus of the available gene trees or concatenated alignments of orthologs. The failure to reject the six evaluated topologies, and the rejection of seven alternative topologies, was interpreted as evidence for a small number of HGT events in the selected groups.

Tests of topology provide a way to account for the uncertainty in tree reconstruction, but they do not indicate where in the tree any HGT events may have occurred. For that, genome spectral or subtree pruning and regrafting methods are required.

Genome spectral approaches
To identify the location of HGT events, genome spectral approaches decompose a gene tree into substructures (such as bipartitions or quartets) and identify those that are consistent or inconsistent with the species tree.

Bipartitions: Removing one edge from a reference tree produces two unconnected sub-trees, each a disjoint set of nodes (a bipartition). If a bipartition can exist on both the gene and species tree it is compatible, otherwise it is conflicting. These conflicts can indicate an HGT event, or may be the result of uncertainty in gene tree inference. To reduce uncertainty, bipartition analyses typically focus on strongly supported bipartitions, such as those associated with a branch with a bootstrap value above a certain threshold. Any gene family found to have one or several conflicting, but strongly supported, bipartitions is considered an HGT candidate. By considering how a particular bipartition conflicts with the reference tree (e.g. which 'leaves' are on the 'wrong' side), a plausible HGT scenario can be inferred.
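The compatibility check can be sketched in Python. Trees are represented here as nested tuples of leaf names, which is an assumption made to keep the example self-contained, and bootstrap support is ignored, so every bipartition is treated as strongly supported. Two bipartitions of the same leaf set are compatible when at least one of the four pairwise intersections of their sides is empty.

```python
def leaves(tree):
    """Set of leaf names under a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions (both sides have at least two leaves)."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) > 1 and len(other) > 1:
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def compatible(split_a, split_b):
    """True if both bipartitions could exist on the same tree."""
    a1, a2 = tuple(split_a)
    b1, b2 = tuple(split_b)
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def conflicting_splits(gene_tree, species_tree):
    """Gene-tree bipartitions that conflict with any species-tree bipartition."""
    species_splits = bipartitions(species_tree)
    return {
        g for g in bipartitions(gene_tree)
        if not all(compatible(g, s) for s in species_splits)
    }
```

Inspecting which leaves sit on the unexpected side of a conflicting bipartition is the starting point for proposing a transfer scenario.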

Quartet decomposition: Quartets are trees consisting of four leaves. In bifurcating (fully resolved) trees, each internal branch induces a quartet whose leaves are either subtrees of the original tree or actual leaves of the original tree. Quartets are often utilised in the construction of larger phylogenies. By deconstructing candidate phylogenies into quartets and comparing these to all candidate trees, potential HGT events can be flagged within incompatible quartets.
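A simplified version of this comparison is sketched below: for every set of four shared leaves, the topology induced by the gene tree is compared with the one induced by the species tree. Trees are again assumed to be nested tuples of leaf names, and the function names are hypothetical. Enumerating all quartets grows with the fourth power of the number of leaves, so this is only practical for small trees.

```python
from itertools import combinations


def leaf_set(tree):
    """Set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of every internal node of the tree."""
    if isinstance(tree, str):
        return []
    result = [leaf_set(tree)]
    for child in tree:
        result.extend(clades(child))
    return result


def quartet_topology(tree, four):
    """Pairing of four leaves induced by the tree, e.g. {{A,B},{C,D}},
    or None if the tree does not resolve them."""
    four = frozenset(four)
    for clade in clades(tree):
        inside = clade & four
        if len(inside) == 2:
            return frozenset([inside, four - inside])
    return None


def conflicting_quartets(gene_tree, species_tree):
    """Quartets of shared leaves resolved differently by the two trees."""
    shared = sorted(leaf_set(gene_tree) & leaf_set(species_tree))
    conflicts = []
    for four in combinations(shared, 4):
        g = quartet_topology(gene_tree, four)
        s = quartet_topology(species_tree, four)
        if g is not None and s is not None and g != s:
            conflicts.append(four)
    return conflicts
```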

Subtree pruning and regrafting
A mechanistic way of modelling an HGT event on the reference tree is to first cut an edge (prune the tree) and then regraft the sub-tree to another edge. If the gene tree was topologically consistent with the original reference tree, the editing results in an inconsistency. Similarly, when the original gene tree is inconsistent with the reference tree, it is possible to prune and regraft the reference tree to obtain a consistent topology. By interpreting the edit path of pruning and regrafting, HGT candidate nodes can be flagged and the host and donor genomes inferred.

Because finding the minimum number of subtree pruning and regrafting (SPR) operations between two trees is NP-hard, the problem becomes considerably more difficult as more nodes are considered. The computational challenge lies in finding the optimal edit path, the one that requires the fewest steps, and different strategies are used in solving the problem. For example, the HorizStory algorithm reduces the problem by first eliminating the consistent nodes; recursive pruning and regrafting reconciles the reference tree with the gene tree, and optimal edits are interpreted as HGT events.
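A single prune-and-regraft operation on a rooted tree can be sketched as follows. Trees are assumed to be nested tuples with unique leaf names, and the function names are illustrative. This shows one edit only; it is not the HorizStory algorithm and does not search for an optimal edit path.

```python
def prune(tree, subtree):
    """Remove `subtree` from the tree and collapse the node left with a
    single child."""
    if tree == subtree:
        return None
    if isinstance(tree, str):
        return tree
    kept = [prune(child, subtree) for child in tree]
    kept = [k for k in kept if k is not None]
    return kept[0] if len(kept) == 1 else tuple(kept)


def regraft(tree, subtree, target):
    """Attach `subtree` as the sister of `target` (on the edge above it)."""
    if tree == target:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    return tuple(regraft(child, subtree, target) for child in tree)


def spr(tree, subtree, target):
    """One SPR move; `target` must not lie inside `subtree`."""
    return regraft(prune(tree, subtree), subtree, target)
```

For example, `spr((("A", "B"), ("C", "D")), "B", "C")` moves leaf B next to C, which can be read as a gene in lineage B having been acquired from the lineage leading to C.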

Implicit phylogenetic methods
In contrast to explicit phylogenetic methods, which rely upon the creation and compatibility of phylogenetic trees, implicit methods compare evolutionary distances. Here an unexpected distance from a given reference, which can be the gene family or genomic average, is suggestive of an HGT event. Because tree construction is not required, implicit approaches are faster and arguably more robust than explicit methods.

Implicit methods can, however, be limited by disparities between the phylogeny and evolutionary distances, or by over-reliance upon top BLAST hits that reflect closely related species rather than donor species. Additionally, whilst a list of top sequence similarity hits is used to create phyletic patterns that detect whether a gene was lost from the genome, considering gene remnants as lost genes could inflate the predicted number of HGT events.

In principle, implicit methods can detect all three types of HGT events: insertion of a novel gene, insertion of a paralog, or displacement of an ortholog by a xenolog. However, analysis is limited to the detection of xenologous displacement if exceptionally small groups, or those without representatives in all taxa, such as 'ORFans', are omitted.

Sequence similarity
The potential identification of HGT events through sequence comparison is achieved when the top-scoring BLAST hits are associated with a distantly related species. For example, phyletic profiles of the bacterium Thermotoga maritima have shown that most of the best BLAST matches are in archaea rather than closely related bacteria; these predictions were later supported by an analysis of the structural features of the DNA molecule.

However, this approach can be limited to uncovering relatively recent HGT events as speciation after a transfer will result in the top BLAST hit being a more closely related species, therefore potentially registering as a false negative.
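The best-hit criterion can be expressed in a few lines of Python. The input format, a list of (query gene, hit taxon, bit score) tuples, is an assumption for this sketch and would in practice be parsed from BLAST output; the function name is hypothetical.

```python
def best_hit_outside_group(hits, close_relatives):
    """Return genes whose top-scoring hit is in a taxon that is not among
    the expected close relatives.

    hits: iterable of (query_gene, hit_taxon, bit_score) tuples.
    """
    best = {}
    for gene, taxon, score in hits:
        if gene not in best or score > best[gene][1]:
            best[gene] = (taxon, score)
    return sorted(
        gene for gene, (taxon, _) in best.items()
        if taxon not in close_relatives
    )
```

As noted above, a gene transferred before a later speciation event will have its best hit in a sister species and will not be flagged by this criterion.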

Outliers within orthologous groups
For a group of orthologs, the molecular clock hypothesis states that the evolutionary distances of genes are proportional to the evolutionary distances of their respective genomes. If a group of orthologs contains xenologs, the proportionality of evolutionary relationships will hold only for the orthologs, not the xenologs.

One approach finds violations of the expected evolutionary distances by ranking similarity scores of Open Reading Frames (ORFs) to a "virtual genome", a collection of ORFs of the respective strains from the GenBank database. If an ORF's evolutionary distance to the virtual genome was inconsistent with the distances of other ORFs from the same genome, the authors inferred an HGT event. Another algorithm for HGT detection compares all pairs of genes in predefined groups of orthologs: if a likelihood ratio test comparing the HGT hypothesis with the null hypothesis of no HGT rejects the null, a putative HGT event is inferred. In addition, a pair-wise comparison allows inference of potential donors and provides an estimation of the time since the HGT event.
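The proportionality argument can be illustrated with a deliberately simple heuristic: compute the ratio of gene distance to genome distance for every species pair and flag pairs whose ratio is far from the median. This is not the likelihood ratio test described above; the `factor` cutoff and the function name are assumptions made for the sketch.

```python
from statistics import median


def distance_ratio_outliers(gene_dist, genome_dist, factor=3.0):
    """Flag species pairs whose gene/genome distance ratio differs from the
    median ratio by more than `factor`-fold (an assumed cutoff).

    gene_dist, genome_dist: dicts keyed by (species_a, species_b).
    """
    ratios = {
        pair: gene_dist[pair] / genome_dist[pair]
        for pair in gene_dist
        if genome_dist.get(pair)  # skip missing or zero genome distances
    }
    if not ratios:
        return []
    typical = median(ratios.values())
    return sorted(
        pair for pair, ratio in ratios.items()
        if ratio > typical * factor or ratio < typical / factor
    )
```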

Phyletic profiles
A group of orthologs or homologs can be analyzed in terms of the presence/absence of group members in the reference genomes; such patterns are called phyletic profiles. To find HGT events, phyletic profiles are scanned for an unusual distribution of genes. Absence of a homolog in a group of closely related species is an indication that the examined gene might have arrived via an HGT event. For example, the genomes of the three facultatively symbiotic Frankia sp. strains are of strikingly different sizes: 5.43 Mbp, 7.50 Mbp and 9.04 Mbp, depending on their range of hosts. Marked portions of strain-specific genes had no significant hit in the reference database, and were possibly acquired by HGT from other bacteria. Similarly, the three phenotypically diverse E. coli strains (uropathogenic, enterohemorrhagic and benign) share about 40% of the total combined gene pool, with the other 60% being strain-specific genes and HGT candidates. Further evidence that these genes are present due to HGT came from their strikingly different codon usage patterns compared with the core genes and a mostly conserved gene order.
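A scan of phyletic profiles for this pattern might look like the following sketch. The profile representation (a mapping from gene family to the set of genomes that contain a member) and the function name are assumptions; the genome labels in any usage would be placeholders.

```python
def patchy_families(profiles, focal, close_relatives):
    """Gene families present in the focal genome but absent from all of its
    close relatives, which makes them HGT candidates.

    profiles: dict mapping family name -> set of genomes with a member.
    """
    relatives = set(close_relatives)
    candidates = []
    for family, genomes in profiles.items():
        if focal in genomes and not (set(genomes) & relatives):
            candidates.append(family)
    return sorted(candidates)
```

Absence from close relatives can also be explained by repeated gene loss, so families reported by such a scan are candidates that need further evidence.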

Impact of polymorphic sites
It is commonly considered that genes are the units transferred through an HGT event; however, it is also possible for recombination to occur within genes. For example, it has been shown that horizontal transfer between closely related species often results in the exchange of ORF fractions. The analysis of a group of four E. coli and two Shigella flexneri strains also revealed that the sequence stretches common to all six strains contain polymorphic sites, which are consequences of homologous recombination. This method of detection is, however, restricted to the sites common to all analysed species, limiting the analysis to a group of closely related organisms.

Evaluation
Assessing the methods used for detecting HGT events is crucial for the interpretation of their results but represents a significant challenge. Heterogeneity of the current methods has so far prevented a comprehensive assessment of all principles, although case studies on nitrogen fixation genes and the use of artificial genomes have shown the potential for benchmarking and subsequent refinement. There is therefore potential for a database of benchmarking results to be used in the development of new methods, but for now conclusions about the power of various HGT detection principles largely depend upon the theoretical considerations employed.

One major issue with existing HGT detection methods is the high rate of false positive and false negative results. Related to this is the tendency for some inference methods to identify conflicting groups of potentially non-native genes. To determine these rates, the number of non-native genes within a genome must be known. Whilst some HGT mechanisms leave tell-tale clues in the genome, obtaining an unbiased benchmark is hindered by the large evolutionary scale on which HGT operates. Evaluation strategies therefore utilise artificial genomes and phylogenetic trees to simulate known HGT events.
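When the set of non-native genes is known, as in a simulation, the two error rates can be computed directly. The sketch below assumes genes are identified by name and that predictions and the known truth are given as collections of names; the function name is hypothetical.

```python
def error_rates(predicted, truly_foreign, all_genes):
    """Return (false_positive_rate, false_negative_rate) for a set of
    predicted HGT genes, given the known non-native genes."""
    predicted = set(predicted)
    truly_foreign = set(truly_foreign)
    native = set(all_genes) - truly_foreign
    false_positives = len(predicted & native)
    false_negatives = len(truly_foreign - predicted)
    fpr = false_positives / len(native) if native else 0.0
    fnr = false_negatives / len(truly_foreign) if truly_foreign else 0.0
    return fpr, fnr
```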

Artificial genomes: Inserting known donor genes into a known position in the host genome results in a chimeric genome. These donor and host sequences can either be obtained from a sequence database or can be simulated in silico. Artificial genomes are obtained, for example, using Markov models or by simulating whole-genome evolution. These altered genomes benefit from the number of non-native genes being a known value and therefore allow for both type I and type II errors to be identified.
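Constructing a chimeric genome with known insertion coordinates can be sketched as follows. The function name and the input format are assumptions; the sketch performs plain insertion and does not attempt to model realistic integration sites or subsequent amelioration.

```python
def make_chimera(host, donor_fragments):
    """Insert donor fragments into a host sequence.

    donor_fragments: list of (position_in_host, fragment) pairs.
    Returns (chimera, intervals), where intervals are the half-open
    (start, end) coordinates of each insert in the chimeric sequence.
    """
    parts = []
    intervals = []
    cursor = 0   # position in the host already copied
    offset = 0   # total length of donor sequence inserted so far
    for position, fragment in sorted(donor_fragments):
        parts.append(host[cursor:position])
        start = position + offset
        parts.append(fragment)
        intervals.append((start, start + len(fragment)))
        offset += len(fragment)
        cursor = position
    parts.append(host[cursor:])
    return "".join(parts), intervals
```

The returned intervals serve as the ground truth against which the regions reported by a detection method can be compared.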

Subtree pruning and regrafting: The presence of an HGT event will cause the phylogenetic tree for that gene to conflict with the reference tree for the host species. With this considered, the effectiveness of a phylogeny-based method can be determined through the creation of trees which simulate HGT events. For example, by switching branches within a phylogenetic tree, known HGT events are simulated, allowing explicit phylogenetic methods to be tested.

Acknowledgements
Fran, Jelena, Daniel, Steffan provided constructive comments.