User:Nick Williams/Draft of Inferring Horizontal Gene Transfer: Methods and Benchmarking

Horizontal Gene Transfer (HGT) occurs when a host genome obtains foreign DNA in a process that circumvents vertical inheritance.

Introduction
Evolutionary processes include mutations, duplications, deletions and insertions of genes and gene fragments. An insertion event that involves two distinct species, where the donor's genetic material is inserted into the host genome, is referred to as Lateral (or Horizontal) Gene Transfer (LGT or HGT). Because LGT is a historical event that operates on a large evolutionary scale, often the only evidence of a transfer is in the genome sequence itself, and computational methods infer LGT events by analyzing that sequence. Detecting LGT is not straightforward because of its reach in shaping genomes: the donor of genetic material is not constrained by its relatedness to the host, and transfers occur between kingdoms as well as within a phylum. Based on the principles employed in inferring LGT events, computational methods can be divided into two broad groups (Figure 1). We start our survey with the methods that rely on the atypical sequence composition of LGT candidates in the host genome (Figure 1A). Second, we present the methods that examine the evolutionary history of the host with respect to the LGT candidate (Figure 1B). We finish our survey by commenting on the challenges in the evaluation of these methods (Figure 2).



Computational methods used in detecting Lateral Gene Transfer (LGT)
Computational methods used in genome-wide detection of LGT are more often applied to Prokaryotes than to Eukaryotes, resulting in the unequivocal acceptance of LGT's influence on Prokaryotic evolution. Although LGT is known to shape Eukaryotic genomes, the extent of its influence appears to be lower. Nevertheless, apart from the computational obstacles, which are not negligible when analyzing the much larger Eukaryotic genomes, most computational methods can be adapted to find LGT in Eukaryotes.

Sequence composition methods
A number of computational methods for detecting LGT are based on sequence composition: nucleotide composition, oligonucleotide frequencies, or structural features (Figure 1 A). In the context of these methods, if a fragment of the genome strongly deviates from the genomic average, it is a possible lateral transfer.

Nucleotide composition: GC content and codon usage bias
The Bacterial GC content falls within a wide range (Fig. 2): on the one end is Carsonella ruddii with GC content of 16.5% and on the other is Anaeromyxobacter dehalogenans with GC content of 75%. Even within a closely related group of α-Proteobacteria values range from about 30% to about 65%. Such differences are exploited in detecting LGT events: a strikingly different GC content of a genome segment is an indication of its foreign origin. The evolutionary processes determining GC content are reflected in the host's preference for certain synonymous codons—codon usage bias. In fact, codon usage bias was the detection method used in one of the first methodical assessments of LGT in Escherichia coli. Two features characterized LGT candidate genes: first, they had a strong codon usage bias, and second, the preferred codons of LGT candidate genes were distinct from those preferred in the host's genome.
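The GC-content comparison described above can be sketched as a sliding-window scan. This is a minimal illustration and not a published method: the function names, the window and step sizes, and the `max_dev` cutoff are all assumptions chosen for the example.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(genome, window=5000, step=500, max_dev=0.10):
    """Return (start, gc) for every window whose GC content differs from
    the whole-genome GC content by more than max_dev.

    max_dev is an arbitrary illustrative cutoff, not a recommended value.
    """
    genome_gc = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - genome_gc) > max_dev:
            flagged.append((start, gc))
    return flagged
```

A fixed cutoff like this ignores the host's own intra-genomic variability, so flagged windows are at most candidates for closer inspection.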

Genomic signature
GC content and codon usage bias rely on mononucleotide frequency, either in the entire genome or at particular positions in the genome; the extended vocabulary of oligonucleotide frequencies should allow more discriminatory power. In fact, oligonucleotide frequencies vary less along a genome than between genomes, leading to the concept of a genomic signature: a deviation from the genomic signature makes the fragment an LGT candidate.

Oligonucleotide frequencies
The simplest oligonucleotide used as a genomic signature is a dinucleotide. For example, a genomic signature based on the frequencies of the dinucleotide formed by the third nucleotide of a codon and the first nucleotide of the following codon captures the dinucleotide least restricted by amino acid preference and codon usage. Similarly, tetranucleotide frequencies in a sliding window (e.g. 5 kb with a step of 0.5 kb) can be used to find fragments that are possible lateral transfers.
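A sliding-window scan of oligonucleotide frequencies might look like the following sketch. The default window and step mirror the 5 kb / 0.5 kb example in the text. The distance measure (mean absolute difference between frequency vectors) and all function names are assumptions made for illustration, not the measure used by any particular published tool.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k=4):
    """Relative frequencies of all 4**k overlapping k-mers over ACGT.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def scan_signature(genome, window=5000, step=500, k=4):
    """Return (start, distance) per window, where distance is the mean
    absolute difference between the window's k-mer frequencies and the
    whole-genome k-mer frequencies (an illustrative choice of distance)."""
    ref = kmer_freqs(genome, k)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        win = kmer_freqs(genome[start:start + window], k)
        dist = sum(abs(x - y) for x, y in zip(win, ref)) / len(ref)
        scores.append((start, dist))
    return scores
```

Windows with the largest distances are the ones that deviate most from the genomic signature; where to draw the line between typical and atypical is left open here.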

Modeling the genomic signature
A more complex way of capturing the genomic signature uses a model of a set of typical host genes instead of raw frequencies. In the case of a Markov model-based approach, the model includes a transition probability matrix derived from typical genes; in the case of a Bayesian model, the posterior probabilities of a sequence are calculated based on the typical genomic signatures.
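As a rough illustration of the Markov model idea, the sketch below estimates a first-order transition matrix from a set of typical host genes and scores a candidate sequence by its average log transition probability. The model order, the pseudocount and the function names are assumptions for the example; published approaches may use higher-order models and different scoring.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)
    from training sequences, with a pseudocount to avoid zero entries."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts[a]:
                counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
            for a in ALPHABET}


def mean_log_likelihood(seq, model):
    """Average log transition probability per step under the host model.

    Lower values mean the sequence looks less like the training genes."""
    steps = [(a, b) for a, b in zip(seq, seq[1:])
             if a in model and b in model[a]]
    if not steps:
        return float("nan")
    return sum(math.log(model[a][b]) for a, b in steps) / len(steps)
```

Genes scoring far below the typical host genes would be the LGT candidates; normalizing by length keeps scores comparable between genes of different sizes.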

Structural features
Just as the nucleotide composition of a DNA molecule can be represented by a sequence of letters, its structural features can be encoded in a numerical sequence. The structural features include interaction energies between neighboring base pairs, the twist that makes two bases of a pair non-coplanar, or DNA deformability induced by the proteins shaping the chromatin. The autocorrelation analysis of this numerical sequence shows characteristic periodicities in complete genomes. In fact, upon detecting Archaea-like regions in the thermophilic bacterium Thermotoga maritima, the periodicity spectra of these regions were compared to the periodicity spectra of the homologous regions in the archaeon Pyrococcus horikoshii. The revealed similarities in periodicity were strong supporting evidence for a case of massive LGT between two kingdoms, Bacteria and Archaea.
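The autocorrelation step can be illustrated with a small sketch. It assumes the DNA has already been converted to a list of numbers by some structural scale; that conversion is deliberately left out because the scale values are not given here. The function name and the normalization are illustrative choices.

```python
def autocorrelation(values, max_lag):
    """Normalized autocorrelation of a numeric sequence for lags 1..max_lag.

    The sequence is mean-centered and each lag is divided by the lag-0
    sum of squares, so values fall between -1 and 1."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    var = sum(c * c for c in centered)
    if var == 0:
        return [0.0] * max_lag
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var
        for lag in range(1, max_lag + 1)
    ]
```

For a signal that repeats every three positions, such as `[1, 0, 0] * 30`, the largest value appears at lag 3. Peaks of this kind are the periodicities that would be compared between regions.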

Limitations of sequence composition methods
To detect LGT, sequence composition methods need the host's average signature to be clearly recognizable: not accounting for the host's intra-genomic variability results in overprediction, i.e. flagging native segments as possible LGT events. For example, the GC content of the third codon position is lower close to the replication terminus. Just as important, the transferred segments need to exhibit the donor's signature. However, this might not be the case for ancient transfers: the transferred segments are subject to the same mutational processes as the rest of the host genome, so their distinct signatures ameliorate, i.e. gradually converge on the host's signature. Similarly, if the inserted segment was previously adapted to the host's genome, as is the case for prophage insertions, the power of sequence composition methods to detect LGT is reduced. A notable example of the reduced power of sequence composition methods is Bdellovibrio bacteriovorus, a predatory δ-Proteobacterium. The first analysis, based on the bacterium's homogeneous GC content, found that its genome is resistant to LGT. However, subsequent research using phylogenetic analysis identified a number of ancient LGT events in the genome.

Phylogenetic methods
The use of phylogenetic analysis in the detection of LGT was advanced by the availability of many newly sequenced genomes. Phylogenetic methods detect inconsistencies in gene and genome evolutionary history in two ways: 1) by reconstructing the gene tree and reconciling it with the reference species tree (explicit methods) or 2) by implicitly examining gene history, e.g. its pattern of presence/absence in species or expected/unexpected evolutionary distance from its gene family (Fig. 1 B).

Explicit phylogenetic methods
To find LGT events, explicit phylogenetic methods first reconstruct the reference tree and then reconcile the gene and reference tree, both computationally demanding procedures. To reduce computational complexity, explicit phylogenetic methods either focus on smaller phylogenetic groups such as one phylum (tests of topologies), or borrow from the dynamic programming paradigm by deconstructing the problem, the gene tree, into smaller parts (genome spectral approaches and subtree pruning and regrafting).

Tests of topologies
Gene trees and the reference tree can be directly compared using likelihood-based tests of topology, e.g. the Kishino-Hasegawa (KH), Shimodaira-Hasegawa (SH), and Approximately Unbiased (AU) tests. Each test compares the alignment of homologs used to create the gene tree to the reference topology. If the reference topology is rejected by the alignment, the evolutionary histories are inconsistent. When these inconsistencies cannot be explained using a small number of non-lateral events such as gene loss or mutational change, an LGT event is inferred. One such analysis checked for LGT in groups of homologs (best bidirectional hits) of the γ-Proteobacterial lineage. Reference trees were reconstructed in one of three ways: 1) using the highly conserved small subunit ribosomal RNA (SSU rRNA) sequences, 2) using a consensus of the available gene trees, or 3) using concatenated alignments of orthologs. The combination of data and tree building methods resulted in six reference tree topologies. The failure to reject the six evaluated topologies, together with the rejection of seven alternative topologies, was interpreted as evidence for a small number of LGT events in the selected groups.

Genome spectral approaches
The selection of reference topologies in the example above was guided by the principle of 'best practice', limiting the number of analyzed reference trees to 13 out of 13 749 310 575 possible unrooted tree topologies. In an ideal case, all reference tree topologies should be evaluated, a feasible task when the analysis includes four or five taxa. However, the number of possible topologies for a larger number of taxa increases dramatically (TODO Fig. 3 A). Methods that deconstruct the tree into smaller pieces—bipartitions or quartets—were introduced to circumvent the problem of unreasonable computing time.
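The growth in the number of topologies follows from the standard count of unrooted, fully resolved trees on n labeled taxa, (2n-5)!! = 3 × 5 × ... × (2n-5). The short sketch below evaluates it; the function name is illustrative. The figure quoted above matches this formula for 13 taxa, which is an inference from the number itself and not something the text states.

```python
def unrooted_topologies(n_taxa):
    """Number of unrooted bifurcating tree topologies on n labeled taxa,
    computed as the double factorial (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


# Four and five taxa are easily enumerated; thirteen are not.
for n in (4, 5, 13):
    print(n, unrooted_topologies(n))
```

With four taxa there are 3 topologies and with five there are 15, which is why exhaustive evaluation is feasible only for very small analyses.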

Bipartitions
When the reference tree is deconstructed by removing one edge, two unconnected sub-trees represent a bipartition. If two bipartitions can exist on one gene tree, they are compatible. Otherwise, they are conflicting. The first step in the bipartition analysis is selecting strongly supported bipartitions: those created by deconstructing a node of high confidence—e.g. a node with a bootstrap value above a threshold. In the second step, each strongly supported bipartition is mapped to all available gene trees, counting the compatible and conflicting bipartitions. Finally, a gene family found to have many conflicting strongly supported bipartitions is considered to have LGT candidates.
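The compatibility check at the core of the bipartition analysis can be sketched as follows. Two bipartitions of the same taxon set can coexist on one tree exactly when at least one of the four pairwise intersections of their sides is empty. The sketch represents a bipartition by one of its sides. It also assumes the strongly supported bipartitions have already been selected upstream, so no bootstrap handling is shown. All names are illustrative.

```python
def compatible(split_a, split_b, taxa):
    """True if two bipartitions (each given as one side) can coexist on a
    single tree: at least one of the four side intersections is empty."""
    taxa = frozenset(taxa)
    a1, b1 = frozenset(split_a), frozenset(split_b)
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def count_conflicts(reference_splits, gene_tree_splits, taxa):
    """For each reference bipartition, count the gene-tree bipartitions
    that conflict with it."""
    return {
        tuple(sorted(ref)): sum(
            not compatible(ref, g, taxa) for g in gene_tree_splits
        )
        for ref in reference_splits
    }
```

For taxa A to E, the bipartition AB|CDE is compatible with ABC|DE but conflicts with AC|BDE. A gene family accumulating many such conflicts would be flagged.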

Quartet decomposition
A similar principle of deconstructing the tree was used in quartet decomposition. All possible four-taxa sub-trees of the reference tree were compared with the available gene trees. If the topology of the quartet is embedded in the gene tree, the quartet is compatible with the gene tree. Similar to the bipartition analysis, candidate LGT events were flagged when the strongly supported embedded quartet disagreed with the gene tree.
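A minimal sketch of the quartet comparison is given below. It assumes each tree is supplied as a list of its bipartitions (one side per bipartition) over the same taxon set, and it omits the support values that a real analysis would use to keep only strongly supported quartets. The function names are hypothetical.

```python
from itertools import combinations


def quartet_topology(quartet, splits, taxa):
    """Return the pairing a tree induces on four taxa, e.g. {{a,b},{c,d}},
    or None if no bipartition of the tree resolves the quartet."""
    a, b, c, d = quartet
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    all_taxa = frozenset(taxa)
    for side in splits:
        side = frozenset(side)
        other = all_taxa - side
        for p, q in pairings:
            if (set(p) <= side and set(q) <= other) or \
               (set(p) <= other and set(q) <= side):
                return frozenset((frozenset(p), frozenset(q)))
    return None


def conflicting_quartets(ref_splits, gene_splits, taxa):
    """List the quartets resolved differently by the two trees."""
    conflicts = []
    for quartet in combinations(sorted(taxa), 4):
        ref = quartet_topology(quartet, ref_splits, taxa)
        gene = quartet_topology(quartet, gene_splits, taxa)
        if ref is not None and gene is not None and ref != gene:
            conflicts.append(quartet)
    return conflicts
```

The number of quartets grows with the fourth power of the number of taxa, so this exhaustive loop is only practical for modest taxon sets.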

Subtree Pruning and Regrafting (SPR)
A mechanistic way of modeling an LGT event on the reference tree is to first cut an edge (prune the tree) and then regraft the sub-tree to another edge. If the gene tree was topologically consistent with the original reference tree, the editing results in an inconsistency. Similarly, when the original gene tree is inconsistent with the reference tree, it is possible to prune and regraft the reference tree to obtain a consistent topology. By interpreting the edit path of pruning and regrafting one can flag LGT candidate nodes and infer the host and the donor genomes. The computational challenge lies in finding the optimal edit path, the one that requires the fewest steps, and different strategies are used to solve the problem. For example, the HorizStory algorithm reduces the problem by first eliminating the consistent nodes; recursive pruning and regrafting reconciles the reference tree with the gene tree and optimal edits are interpreted as LGT events.

Limitations of explicit phylogenetic methods
The main limitation of the explicit phylogenetic methods is the reference tree topology, as the selection of its reconstruction method is still controversial due to the statistical bias involved in creating the reference tree. Even if there is no doubt about the reference tree, conflicting phylogenies can be the result of unrecognized paralogy or gene loss as well as of an LGT event. Moreover, inferred LGT scenarios are not necessarily unique. For example, SPR can provide conflicting scenarios with the same minimal number of tree edits. Similarly, the most parsimonious solution in tree reconciliation does not need to be the correct one. Explicit phylogenetic methods allow a direct inference of the donor node. The limitation to keep in mind is that this node also represents all the extinct and unsequenced taxa, so the primary donor is not exactly pinpointed. The availability of genomes included in the analysis limits the application for two more reasons: first, the computational complexity of reconstructing a gene tree or a species tree is still a challenge. Second, long branch attraction will confound the interpretation when fast evolving orthologous groups are considered.

Implicit phylogenetic methods: evolutionary distance
In comparative genomics, estimating evolutionary distance (the number of per-site substitutions since the genes diverged from their common ancestor) is equivalent to estimating sequence similarity. If a gene has an unexpected evolutionary distance from the reference, e.g. its gene family or the genomic average, implicit phylogenetic methods infer an LGT event.

Correlating sequence similarity
For a group of orthologs (homologs that started diverging after a speciation event), the molecular clock hypothesis states that the evolutionary distances of genes are proportional to the evolutionary distances of the respective genomes. If a group of orthologs contains xenologs (genes obtained in an LGT event), the proportionality of evolutionary relationships will hold for orthologs, but not for laterally transferred genes. One approach finds violations of the expected evolutionary distances by ranking similarity scores of Open Reading Frames (ORFs) to a "virtual genome", a collection of ORFs of the respective strains from the GenBank database. If an ORF's evolutionary distance to the virtual genome is inconsistent with the distances of other ORFs from the same genome, an LGT event is inferred. Another algorithm for LGT detection compares all pairs of genes in predefined groups of orthologs: if a likelihood ratio test rejects the null hypothesis of no LGT in favor of the LGT hypothesis, a putative LGT event is inferred. In addition, a pairwise comparison allows inference of potential donors and provides an estimate of the time since the LGT event.

Phyletic profiles—presence and absence of genes
A group of orthologs or homologs can be analyzed in terms of the presence/absence of group members in the reference genomes; such patterns are called phyletic profiles. To find LGT events, phyletic profiles are scanned for an unusual distribution of genes. Arguably, the most straightforward use of phyletic profiles in the detection of LGT is the presence of top-scoring BLAST hits in an unrelated species. For example, when phyletic profiles of the bacterium T. maritima showed that most of the best BLAST matches are found in archaeal species and not in the closely related Bacteria, Nelson et al. obtained a list of LGT candidate genes; the predictions were later analyzed using structural features of the DNA molecule (Section XX). Absence of a homolog in a group of closely related species is a clue that the examined gene might have arrived via an LGT event. For example, the genomes of three facultatively symbiotic Frankia sp. strains are of strikingly different sizes: 5.43 Mbp, 7.50 Mbp and 9.04 Mbp, depending on their range of hosts. Marked portions of strain-specific genes had no significant hit in the reference database, and were possibly acquired by LGT from other Bacteria. Similarly, the three phenotypically diverse E. coli strains (uropathogenic, enterohemorrhagic and benign) share about 40% of the combined gene pool, the core genes; the remaining 60%, the strain-specific genes, are likely LGT candidates. To strengthen the case for the strain-specific genes originating from LGT, Welch and coworkers showed they have a strikingly different codon usage pattern from the core genes and a mostly conserved gene order.
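The core versus strain-specific split described above can be sketched from phyletic profiles as follows. The data structure (a mapping from gene family to the set of genomes that contain a member) and the function name are assumptions made for the example. The sketch only partitions the families; it does not decide whether a strain-specific family really arose by LGT.

```python
def classify_families(profiles, genomes):
    """Split gene families into core (present in every genome) and
    genome-specific (present in exactly one genome).

    profiles: dict mapping family name -> set of genomes with a member.
    Returns (core_families, {genome: [specific_families]}).
    """
    genomes = set(genomes)
    core, specific = [], {}
    for family, present in sorted(profiles.items()):
        present = set(present) & genomes
        if present == genomes:
            core.append(family)
        elif len(present) == 1:
            specific.setdefault(next(iter(present)), []).append(family)
    return core, specific
```

Families present in some but not all genomes fall into neither group here; they are the ambiguous cases where gene loss and transfer are hard to tell apart.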

Scanning for polymorphic sites
Larger stretches of sequence identity allow for multiple homologous recombination sites, so lateral transfer between closely related species often results in the exchange of ORF fractions. The analysis of a group of four E. coli and two Shigella flexneri strains revealed that the sequence stretches in common to all six strains contain polymorphic sites, consequences of homologous recombination. The method is, however, restricted to the sites in common to all analyzed species, limiting the analysis to a group of closely related organisms.

Limitations of implicit phylogenetic methods
In principle, implicit phylogenetic methods can detect any of the three sorts of LGT events: insertion of a new gene, insertion of a paralog, and insertion of a xenolog in orthologous gene displacement. However, practical considerations can result in limitations. Evolutionary distance approaches analyze groups of homologs, usually created by grouping best bidirectional hits. If the analysis leaves out groups that do not have a representative in all taxa (ORFans, i.e. ORFs present in only one genome, being an extreme example) and/or groups that are too small, only xenologous displacement can be detected. Moreover, the computational search for homologs (and orthologs in particular) is often based on sequence similarity, i.e. best bidirectional BLAST hits; the most appropriate candidate, however, does not need to be the top BLAST hit. A list of top sequence similarity hits is used to create phyletic patterns that detect if a gene was lost from the genome; considering gene remnants as lost could inflate the number of LGT events.

Evaluating LGT inference
Assessing the methods for the detection of LGT is crucial for the interpretation of their results. However, this is not a trivial task. Although some LGT mechanisms leave telltale clues in the genome and can readily be recognized, obtaining an unbiased benchmark is hindered by the large evolutionary scale on which LGT operates. Just as there are two different computational bases for detecting LGT events—sequence composition and phylogeny—there are corresponding computational strategies for their evaluation (Fig. XX). One strategy is sequence-based. Inserting known donor genes in the known position of the host genome results in a chimeric genome. The donor and the host sequences can either be obtained from the sequence database (real genomes) or they can be simulated (artificial genomes). Artificial genomes are obtained, for example, using Markov models or by simulating whole-genome evolution. The other strategy is based on swapping branches on the phylogenetic tree. This implementation is simpler than the sequence-based evaluation. However, the major drawback is that only explicit phylogenetic methods can be tested.
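The sequence-based strategy can be illustrated with a sketch that builds a chimeric genome and keeps the insertion coordinates as ground truth. The random placement, the seed parameter, the overlap-based recall and all function names are assumptions for the example and do not describe any specific benchmark.

```python
import random


def make_chimera(host, donor_genes, seed=0):
    """Insert donor genes at random host positions.

    Returns the chimeric sequence and the (start, end) coordinates of
    each insert in the chimera, in the order the genes were supplied."""
    rng = random.Random(seed)
    positions = sorted(rng.randrange(len(host) + 1) for _ in donor_genes)
    pieces, truth, prev, offset = [], [], 0, 0
    for pos, gene in zip(positions, donor_genes):
        pieces.append(host[prev:pos])
        start = pos + offset  # shift by the length of earlier inserts
        truth.append((start, start + len(gene)))
        pieces.append(gene)
        offset += len(gene)
        prev = pos
    pieces.append(host[prev:])
    return "".join(pieces), truth


def recall(predicted, truth):
    """Fraction of true inserts overlapped by at least one prediction."""
    hit = sum(any(ps < te and ts < pe for ps, pe in predicted)
              for ts, te in truth)
    return hit / len(truth) if truth else 0.0
```

A detection method would be run on the chimera and its predicted intervals compared with `truth`; a full evaluation would also need a measure of false positives, which is omitted here.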



Conclusions
Computational methods used in inferring lateral gene transfer are based on two principles: one is sequence composition and the other is evolutionary history. Sequence composition-based methods look for deviations from the genomic average and need only the genome under study for the analysis. Although straightforward in principle, they cannot identify the donor genome, and the signal is lost for ancient transfers. Phylogenetic methods are able to pinpoint the donor and are not sensitive to amelioration of the donor sequence; they are, however, often restricted to detecting xenologous gene displacement, either by the principle itself (explicit phylogeny-based methods) or by practical considerations (implicit phylogeny-based methods). Due to the heterogeneity of the methods used in detecting LGT, a comprehensive assessment of all principles used in the detection of LGT still does not exist. However, case studies on nitrogen fixation genes and studies using artificial genomes show the potential of benchmarking and subsequent refinement in providing a clearer picture of the direction and the impact of lateral gene transfer.

For now, conclusions about the power of the various LGT detection principles depend largely on theoretical considerations. We therefore see potential for a database of benchmarking results. It would provide a reference set, not only to establish the weak and strong points of the current methods, but also to put exact numbers on their accuracy and to guide the development of new methods. Most importantly, we would obtain a more precise estimate of LGT and gain insight into the evolutionary history of modern genomes.

Acknowledgements
Fran, Jelena, Daniel, Steffan provided constructive comments.