Re: Morpho v molecular (was Re: Tinamous: living dinosaurs)
- To: dinosaur@usc.edu
- Subject: Re: Morpho v molecular (was Re: Tinamous: living dinosaurs)
- From: evelyn sobielski <koreke77@yahoo.de>
- Date: Fri, 1 Jul 2011 11:29:08 +0100 (BST)
- In-reply-to: <BANLkTikjT+aAyu7OdXKoFUocYYtAMKH9sg@mail.gmail.com>
- Reply-to: koreke77@yahoo.de
- Sender: owner-DINOSAUR@usc.edu
> So do you string all the genes together, and analyze the concatenated
> multigene dataset as a single "supergene" using common parameters - and
> hope that all the quirks of individual genes "come out in the wash" to
> generate the "true" phylogeny. Or do you try and deal with potential
> heterogeneities across different genes by partitioning the multigene
> dataset - and analyze the different genes separately, each with its own
> parameters, and combine them all at the end as a kind of gene
> 'supertree'. Both approaches have their pros and cons, and each has its
> own proponents and detractors. To be honest, I don't know which approach
> is better at capturing the true phylogeny.
Ask me again in two years, but until then I can only say that partitioned
analyses at least allow you to exclude those genes with a bad signal-to-noise
ratio (SNR), i.e. those which yield phylogenies that are categorically refuted
by other evidence (fossils, for example).
Another problem with multigene datasets (either method) is that they combine
sequences evolving at different speeds. If you compare cytB, RAG-1 and ND2
sequences for Accipitridae, you'll find that each locus has its best resolution
(where support is consistently high except for very short branches) in a
different area of the phylogeny, corresponding to a different period in time.
Before that period (i.e. further from the present) the noise is too large.
After that period (i.e. closer to the present) the signal becomes too weak.
Either way the SNR drops off.
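As a rough illustration of that drop-off, here is a minimal sketch using the Jukes-Cantor (JC69) model, under which the expected fraction of differing sites between two sequences is p = 3/4 * (1 - exp(-4d/3)) for d substitutions per site. The two rates and the locus names are hypothetical values chosen only to contrast a fast and a slow locus; they are not measured rates for cytB, RAG-1 or ND2, and real sequences violate JC69 in many ways.

```python
import math

def jc69_observed_fraction(rate_per_site_per_myr, age_myr):
    """Expected fraction of differing sites between two lineages under JC69."""
    # Both lineages accumulate changes since the split, hence the factor 2.
    d = 2.0 * rate_per_site_per_myr * age_myr
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Hypothetical rates (substitutions/site/Myr), for illustration only.
RATES = {"fast_locus": 0.01, "slow_locus": 0.0005}
AGES_MYR = [1, 5, 20, 60, 100]

def table(seq_len=1000):
    """Expected number of observed differences per locus for each split age."""
    rows = []
    for age in AGES_MYR:
        row = {"age_myr": age}
        for name, rate in RATES.items():
            row[name] = round(seq_len * jc69_observed_fraction(rate, age), 1)
        rows.append(row)
    return rows

if __name__ == "__main__":
    for row in table():
        print(row)
```

With these made-up rates the fast locus approaches the saturation ceiling of 75% differing sites at old splits (mostly noise), while the slow locus shows only about one expected difference per kilobase at a 1 Myr split (hardly any signal).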
It is easy to see that, depending on your choice of loci, "more genes" can mean
"making the SNR decline badly across all periods of time". If I combined 1
kilobase pair of the three loci above, I'd probably get worse results than
I'd get from each locus individually: with individual loci, you get good
resolution within a short period of time and crap resolution otherwise. Using
loci with non-overlapping "optimal resolution periods", you'll have the crap
resolution cancel out the good resolution.
For cytB, RAG-1/2 and ND2/4 in accipitrids, this problem is not so bad as to
render the results problematic, but you can already see that some branches are
less well resolved in the combined dataset than in the best-resolving
individual locus. E.g. cytB resolves best at about 5-20 Ma, whereas the other
two don't (they evolve more slowly), and combining the three will decrease
resolution at the intrageneric/sister-genus levels.
If you go further back in time - say, to the base of Neoaves - it gets risky
indeed. The Hackett et al. study was monumental in its scope, but honestly
nobody knows which of the numerous strange sister groupings they found are good
and which are artefacts. And it's sad to see that the International
Ornithological Congress fairly blindly relied on the results for their
taxonomy, because almost none of it has been tested (only "Coronaves/Metaves"
has, and the results were not encouraging).
And this is another problem of these huge multigene studies: I'd always prefer
half the number of loci PLUS some dedicated attempt at hypothesis-testing over
a huge number of loci with only the barest of testing (if any at all).
The fact that we can do high-throughput sequencing doesn't remove the need
to test, test, test.
If you find a weakly-supported "clade" that is unusual, don't just claim it's
good and true and submit your results to Science or Nature (which will publish
any crap these days as long as it's not too obviously flawed and as long as
it's sufficiently "novel" to splash the journal's name all across the daily
papers).
Rather, try to determine whether you have actually found something that all
previous researchers have overlooked, or whether it's simply due to some error
somewhere. In the above example, remove either branch of the "clade" from the
dataset and see what happens to the other. If you find that it groups in
another position with a) better (quantitative) support in the DNA study and b)
better (qualitative) support from the fossil record and from fossil and modern
biogeography (essentially, minimizing the number of open-ocean crossings), your
new "clade" should NOT be touted as genuine.
But this is not how it's being done. "Mihi itch" is not a disease of the past;
people simply seem to like their own name behind a nomen too much to exercise
restraint (the 6 authors of _Raptorex_ come to mind).
But as I said, ask me again in 2 years; hopefully I can then give some more
detailed examples. Until then, you may want to consult the literature on
Hoatzin relationships, where we had precisely this phenomenon: each locus
yields a phylogenetic hypothesis with better-than-equivocal (but still not
entirely convincing) support, but taken together and from a falsificationist
perspective, they simply cancel each other out, leaving NO satisfying
hypothesis at all.
Perhaps the only feasible approach, until this phenomenon has been studied in
detail in its own right, is building simple additive supertrees from those
clades which show extremely high support (95% and above) in each particular
dataset. Passeriformes phylogeny could be well resolved by this approach
(Jonsson & Fjeldsa supertree).
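To make the idea concrete, here is a minimal sketch of that kind of filtering. It is not the method of any published supertree study. It assumes every dataset samples the same taxa, represents each clade as a set of taxon names with a support value in percent, and treats two clades as compatible when they are nested or disjoint. The taxon names A-F, the locus names and the support values are invented for illustration.

```python
from itertools import combinations

def compatible(a, b):
    """Two clades can coexist in one tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)

def high_support_clades(datasets, threshold=95.0):
    """Collect every clade that reaches the threshold in at least one dataset."""
    kept = set()
    for clades in datasets.values():
        for taxa, support in clades:
            if support >= threshold:
                kept.add(frozenset(taxa))
    return kept

def additive_supertree_clades(datasets, threshold=95.0):
    """Return (accepted, conflicting) clades among the high-support ones."""
    kept = high_support_clades(datasets, threshold)
    conflicting = set()
    for a, b in combinations(kept, 2):
        if not compatible(a, b):
            # Neither clade can be trusted without further testing.
            conflicting.update((a, b))
    return kept - conflicting, conflicting

# Invented example: two loci, six taxa, support values in percent.
EXAMPLE = {
    "locus1": [({"A", "B"}, 99), ({"A", "B", "C"}, 97), ({"E", "F"}, 60)],
    "locus2": [({"A", "B"}, 96), ({"C", "D"}, 98), ({"E", "F"}, 99)],
}

if __name__ == "__main__":
    accepted, conflicting = additive_supertree_clades(EXAMPLE)
    print("accepted:", sorted(sorted(c) for c in accepted))
    print("conflicting:", sorted(sorted(c) for c in conflicting))
```

In this toy case A+B and E+F go into the supertree, while A+B+C and C+D are both set aside: each is strongly supported by one locus, yet they contradict each other, which is exactly the situation that calls for testing rather than picking a favourite.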
Regards,
Eike