
Re: [dinosaur] Phylogenetics in general was Re: Placental Mammal Diversification Across the K-Pg Boundary (free pdf)



On Sun, Dec 1, 2019 at 15:23, David Marjanovic (david.marjanovic@gmx.at) wrote:

> That said, I wonder how expensive it really is. It costs a lot more in person-hours; but the work itself is much cheaper: no expensive machines, no expensive chemicals, no unbelievably expensive electrophoresis gel every week, no ultrafreezers, not even liquid nitrogen. Just pay someone's costs of living and a few travels, and the work will get done.

That's only true for some types of data collection; high-throughput CT scanning, for example, is not exactly cheap, and judging from a couple of recent papers, the next big thing in morphological data collection may well be neutron tomography, where that holds to an even greater extent.

On the other hand, calling sequencing supplies "unbelievably expensive" does a disservice to just how insanely cheap sequencing has gotten over the last 20 years. That whole thing about how it cost $3 billion to sequence a whole human genome back in 2003 (when the HGP was completed) but about $200 now has become a cliché, but the point is undeniable. The amount of data stored in GenBank has been growing at a superexponential rate, which is simply not true for morpho phylogenetic datasets, where even a well-studied group like the tetraodontiforms can have over a decade's worth of studies recycling the same ~200-character matrix.

> Does anybody understand what this "ascertainment bias" is, and if it applies to anything but maximum likelihood?

It applies to all model-based phylogenetics (ML, Bayesian inference, and distance methods insofar as they use models to correct the raw, observed distances) and it refers to a situation where constant characters (characters for which every taxon has the same state) are not included in the dataset for some reason, usually because the very nature of the characters makes the notion ill-defined. Model-based analyses need constant characters to estimate branch lengths (in non-clock analyses, which confound rates and times and treat their products as i.i.d. variables) or branch rates; if they don't get them, these parameters will be overestimated, and the bias might then propagate to other parameters of interest (topology, divergence times).

This is also known as "acquisition bias", and it first came up for restriction sites, where it was dealt with by Felsenstein (1992) using conditional likelihoods. Of course, it's also inherent in analyzing discrete morphological data: even if people were in principle willing to score constant characters for their datasets, there would be no way of knowing when they scored the right number of them, which is an issue that just doesn't arise for sequence data. Therefore, starting with Lewis's (2001) paper, most attempts to estimate phylogenies from discrete morphological characters in a model-based framework have relied on corrections implemented using Felsenstein's method. RAxML, MrBayes, BEAST 2, and RevBayes all do it that way, sometimes with the extra option of specifying whether to condition on variable characters or on parsimony-informative characters only.
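To make the correction concrete, here is a toy sketch in Python (my own illustration, not code from any of the programs named above). It assumes a binary Mk model on a four-taxon star tree with one shared branch length t, and it feeds the likelihood the expected frequencies of variable patterns rather than simulated data. The function names and the grid search are hypothetical choices made for brevity. The conditional likelihood divides each pattern probability by the probability that a character is variable at all, P(pattern | variable) = P(pattern) / (1 - P(constant)).

```python
import math

def pattern_probs(t):
    """Pattern probabilities on a 4-taxon star tree under a binary Mk model.

    t is the expected number of changes per character on each branch.
    """
    q = 0.5 - 0.5 * math.exp(-2.0 * t)  # tip state differs from root state
    p = 1.0 - q                         # tip state matches root state
    singleton = 0.5 * p * q * (p * p + q * q)  # one pattern like 0001 (8 exist)
    split = p * p * q * q                      # one pattern like 0011 (6 exist)
    constant = p ** 4 + q ** 4                 # 0000 and 1111 combined
    return singleton, split, constant

def mean_loglik(t, f_single, f_split, condition_on_variable):
    """Per-character log-likelihood for a dataset of variable characters only."""
    s, d, c = pattern_probs(t)
    ll = f_single * math.log(s) + f_split * math.log(d)
    if condition_on_variable:
        # Conditional-likelihood correction: divide by P(character is variable).
        ll -= math.log(1.0 - c)
    return ll

TRUE_T = 0.1  # assumed "true" branch length for the illustration

# Expected share of singleton and 2-2 patterns once constant ones are dropped.
s0, d0, c0 = pattern_probs(TRUE_T)
f_single = 8 * s0 / (1.0 - c0)
f_split = 6 * d0 / (1.0 - c0)

grid = [i / 100 for i in range(1, 201)]  # candidate branch lengths 0.01 .. 2.00
uncorrected = max(grid, key=lambda t: mean_loglik(t, f_single, f_split, False))
corrected = max(grid, key=lambda t: mean_loglik(t, f_single, f_split, True))

print("true t:", TRUE_T)
print("uncorrected estimate:", uncorrected)
print("corrected estimate:", corrected)
```

In this setup the uncorrected likelihood keeps improving as t grows, so the estimate ends up far above the true value, while the conditioned version peaks at the true branch length. Real datasets with many taxa and unequal branches will not be this extreme, but the direction of the bias is the same.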

Refs:

Felsenstein J 1992 Phylogenies from restriction sites: a maximum-likelihood approach. Evolution 46(1): 159–173

Lewis PO 2001 A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50(6): 913–25


Best,
-- 
David Černý