Phylogeny simulation 101

Species across a landscape, as simulate by Boucher et al. It's this pretty stained-glass-esque diagram that inspired me to do this!

Species across a landscape, as simulate by Boucher et al. It’s this pretty stained-glass-esque diagram that inspired me to do this!

Lots of people have written code to simulate phylogenies, and yet more have written code that simulate traits across phylogenies. I’m not claiming any great novelty in what I’ve just done, but sim.bd.tree simulates a phylogeny under given speciation and extinction rates, and sim.bd.tr.tree simulates a phylogeny, and a trait that affects speciation and extinction rate.

What I like about this is that the dependence on trait values is actually a dependence on what values of that trait other species have – in other words, it’s a niche-packing kind of model. Again, I’m not claiming any great novelty in these sorts of models (read this excellent paper, for example), but this was my first stab in what I hope will be a long stream of work. This is very raw code (a quick skim of it makes me realise a refactor would more than halve its length) but it does get the job done (I hope!).

My first impression is that this stuff isn’t very hard, so if you have any interest you should definitely give it a go. Moreover, the insight I gained from it was quite important – the shape and size of a phylogeny changes a great deal over different simulations, and while on some level I knew this it was only while checking to see if I’d messed something up that I really gathered a true appreciation for it.

Next post… either a model incorporating communities/biogeography (–> some model of allopatric speciation), or a vague attempt at fitting estimting the parameters I set once the phylogeny/traits are calculated. I have a feeling those two will be much harder!…

Advertisements

Taxonomic cleaning and building phylogenies

The Uma Thurman wasp, courtesy of crazy taxonomist Donald Quicke (well, he's not that crazy, but there we go). Via geekosystem.

The Uma Thurman wasp, courtesy of taxonomist Donald Quicke. Have you any idea how hard it is to find a picture to accompany a post on taxonomy cleaning? Via geekosystem.

I spend most of my time cleaning up species names, and using them to build phylogenies (have I screamed the word phyloGenerator at you recently?). However, most of the time if you’ve got a pretty small species set and you’re studying something well-known, you can get away without building a tree at all if you’re smart with how you match all the names up.

A number of people have been emailing me asking for my taxonomy scripts, and so I’ve now pushed some phylogeny making and taxonomic cleaning scripts (based heavily on the code that ships with phyloGenerator, so please cite that if you use it) into willeerd. It takes a list of plant names, cleans them up, and then sticks them into a phylogeny you specify (I’d recommend this plant phylogeny). You can also use the functions congeneric.merge and make.composite.with.polytomies to put missing species into any kind of phylogeny, either based on genus names or based on binding groups (intended to be genera) in to replace something else in a phylogeny. They all, by default, follow the bladj algorithm, which is a very fancy way of saying that inserted nodes are evenly spaced wthin a clade.

If you’re going to use these functions, please be careful. Taxonomy != phylogeny, and there is no guarantee that using taxonomy to fill in for phylogeny won’t cause major, major problems. Indeed, if taxonomy were sufficient to fill in for phylogeny, you should ask yourself why you’re even trying to use a phylogeny in the first place. The dating of nodes within a phylogeny is really hard, and while there are excellent methods to try and correct for a lack of information, it’s still really hard. I’d strongly recommend using a program that incorporates DNA data (like phyloG– OK, I’ll shut up), and even then be careful. This stuff is hard for a reason…!

As with the plotting functions I put up a few posts back, I use this code quite a lot myself, so please don’t be surprised if it changes in the coming days/weeks/months. Let me know if there’s anything in particular you’d like to see up there – I take requests (this post is one), so just let me know.

Kill your unit test fetish

They truly are everywhere.

They truly are everywhere.

I’ve been looked at strangely for saying that I’m not a big fan of unit testing. Since I don’t have time to write any code this week, I thought I’d formalise this a little. Let me start by saying, in big bold font, unit testing is really helpful and you should definitely do it.

Unit testing is a way of formally asserting that your code works. In R, we have the great testthat package, which allows your to say things like:

expect_that(my_new_factorial_function(3), equals(6))

…and you can be damn sure that your new factorial function will return the correct value when it’s given a value of 3. When working on big projects with lots of people it’s a great way of making sure nobody breaks someone else’s code; if all the tests pass at the end of your edits, everything is fine. It’s great for stuff only you are doing too – my shiny database must be right because all the tests assert test data is loaded correctly.

However, quantity is not quality when it comes to testing. Testing a function that computes a factorial twenty times to make sure it fails when given phylogenetic trees and character strings is not the same as testing to make sure the function works. Test the things that matter first – if someone’s going to pass your function a character string and they need a specific error for that, there’s probably nothing you can do. Achieving complete coverage of tests is no use if all of your checks are nonsense.

I think the most important thing in programming is writing good code, and that many have fetishised unit testing to the point where it’s no longer helpful for maintaining good code. So, in the spirit of being helpful, here are some general practices I think would make everything better:

  • Check the interactions.
    At a certain level, your code isn’t going to fail because you missed a comma somewhere, it will fail because two functions will interact in a way you hadn’t predicted (…and therefore couldn’t have written a test for). Check how the individual components of your program interact with one-another, and then (go nuts) write some tests for that.
  • Test your assumptions.
    If you know what will break your program, either fix it, or write a check for that conditions and make it fail gracefully. Everyone has confirmation bias – if you’re not finding errors when you thoroughly check your code with different input values, then you’re either the best programmer in the world or you’re not being honest with yourself. Find bugs and squash them, don’t write lots of line of tests that make you feel better about the bugs you’re pretending you don’t have.
  • Write DRY code.
    Don’t Repeat Yourself. You wouldn’t have to write so many damn tests if you abstracted and re-factored your code once in a while. Modularise everything so that you are quicker writing new things. The whole purpose of computers is to make our lives easier – grab discrete modules of code you know already work (and are tested), and use them to get your current task done.
  • Listen to feedback.
    When I leave a bug report, it’s because I’m hoping to help someone. Even if I’m being stupid, can you say in all honesty that you couldn’t make your program a little clearer? I often leave bug reports because I want to draw the authors’ attentions to things that could cause problems further down the line; these aren’t bugs, they’re just things that might be helpful. Such ‘bug’ hunters probably like you and your work, so chill out!
  • You’re not a developer.
    You’re a scientist. So you’re probably not going to be able to write meaningful tests and then write a program that fulfills them, because you’re exploring and treading new ground. Don’t worry about it – just make sure your code works. And yes, that probably means some unit tests.

Most of these pieces of advice involve unit testing. There’s a good reason for this: when done correctly, unit testing is helpful. When fetishised, it’s incredibly dangerous, because people think they can churn out more and more tests and so any kind of spaghetti code and all will be fine. I promise you, it won’t.

Modelling Communities as DNA (in C++, sorry again!)

Inachis io © Eddie John (from UKBMS website)

Inachis io © Eddie John (from UKBMS website)

A while ago I was lucky enough to spend a few months in the Exelixis lab in Heidelberg, with rather grand hopes of modelling turnover in species composition as if they were nucleotides in a DNA molecule. C++ (sorry, not R, I know…) code implementing this method is up online here (to build you will need the Boost library, and ‘make install’ should build it all for you).

You can read a more detailed description of the method here (I’m about to put it on arXiv), but the general principle is that, at each time-step, an individual can reproduce, be replaced by another individual of another species, or die and not be replaced. New individuals can join a community, allowing the overall abundance to grow. The first two of these events (reproduction and replacement) are analogous to models of DNA substitution in phylogenetics; in phylogenetics such models help us figure out what has happened in the past based on extant DNA.

In this method we need this model to estimate the historical event that took place in a community; there is a circularity because you need the model to estimate the events, and the events to estimate the model (we never see when an individual replaces another individual, right?). While the method doesn’t work so well on simulated data (albeit data I’ve simulated to show the method’s failings), there is some hope that the program detects signal in real, biological data. I’ve decided to put this up online because I don’t really have time to do anything more with this, but if you’re interested here are a few things I’d like to try:

  • Re-write the whole thing and use MCMC; I did some experiments with Filzbach a while ago and it looks like integrating out the uncertainty in figuring out the events, as well as having a search procedure that’s less-likely to get caught in local optima, works wonders.
  • Stop treating the individuals in this model as real individuals, and instead view them as positions in niche space that are being fought for among species. This niche space battle can then be a ‘hidden model’ that generates observed abundances.
  • Apply this to some real data in simpler systems. I’m convinced that species like butterflies, where (generally) there’s an alternation of generations with little overlap, have a lot of fluctuations that are modelled quite nicely by something like this where abundances are essentially a random draw from a slightly biased bag.

Let me know what you think! If you’re interested in doing something with this… let me know!

Cartoon phylogenies and circular tip labels

A cartoon fan phylogeny with spaced-out tip labels. More fiddly to do than you might think!

A cartoon fan phylogeny with spaced-out tip labels. More fiddly to do than you might think!

I love the ape phylogenetics package, but nothing is perfect for everything and I sometimes find myself messing around a . I’ve put many of those modifications up onlineand written a quick ‘cartoon’ polytomy plotting script (cartoon.plot). The figure above shows off a few of the things (pretty polytomies, radially-spaced tiplabels) you can do with these functions.

It’s not very quick, but you can use it to automatically plot terminal polytomies as if they were just big ‘cartoon’ blocks, or you can specify tips and the function will find the monophyletic clade to group for you. The figure above uses the auto-detect feature, along with willeerd.tiplabel‘s radial.adj to put circles at distances around the phylogeny.

Maybe some of you are aware of this already, but you can use a line like

pp <- get("last_plot.phylo", envir = .PlotPhyloEnv)

…to get information about the last phylogeny you plotted. This makes manually adding things to tips and nodes much easier, because you don’t have to reverse-engineer where you think ape put them! I found this nugget while poking around inside ape and it’s made my life much easier!

Evolutionary transitions

tiktaalik-transitional-fossil

It’s very hard to find pictures of evolutionary transitions! Taken from transitional-fossil.com, which made me smile.

I’ve been thinking a little about transitions between discrete characters, in part because of the excellent corHMM package. corHMM is great for measuring the rate at which discrete characters transition from one to the other, but you can’t tell whether certain transitions tend to happen under certain conditions. For example, it’s one thing to know how frequently plants transition between dormant and non-dormant seeds, but another to know under what global temperatures that tends to happen.

In theory, that’s where transition.calc comes in. It takes a discrete (e.g., seed dormancy) and continuous (e.g., cold tolerance) character, reconstructs ancestral states and transitions, and then plots one against the other. I prefix all of this with “in theory” because:

  • It’s a very, very dirty hack, and I’m certain there’s an analytical solution for this. Moreover, I’m not convinced that taking the modal transition that happens along a branch and working with that will actually work.
  • I pulled the code together from some other scripts and I’m not convinced that it matches up the discrete and continuous reconstructions properly (seems important, right?…)
  • I haven’t validated the script with simulated data. To do that, I need to write a flexible way of defining models of evolution across a phylogeny (hopefully, that’s next fortnight’s post…)

…however, I do have some hope that this works because of the scripts I pilfered it from (…which have proven remarkably hard to generalise, thank you very much Past Will). I found what appear to be significant results in a seed dataset that weren’t there when I randomly evolved two separate traits across the phylogeny. That’s not to say I haven’t introduced another error somewhere, but it’s a good sign. I’m currently having a debate with co-authors as to whether there’s a difference between this and just regressing the continuous variable against the discrete; I’d be grateful for feedback!

Phylogenetics in Julia! Not R, sorry…

...and now for something completely different.  The Monty Python foot, taken from  ecogirlcosmoboy

…and now for something completely different. The Monty Python foot, taken from ecogirlcosmoboy

When I first read about Julia, I dismissed it as a nice idea that was coming too late into an already crowded niche space of programming languages. Fast forward almost two years and Doug Bates (of mixed effects models fame) is using it, and even Wired is is talking about it. So I decided to give it a try by making a phylogenetic library; don’t get too excited because it only sort-of loads a Newick phylogeny and not much else.

My first impressions were very good. It is fast – my code is dreadful (I wrote it with a few beers) and even with recursion it’s usable. It’s also very readable; despite a few quirks, it’s probably easier to read than R. The features that impressed me (in vague order of increasing nerdiness) are:

  • Easy to use multiple processors
  • Package management is all linked into GitHub, so there’s no messing around with CRAN-like central repositories (…yes, that could also be bad)
  • The typing system is well-done and saved me time. If I say a function only takes a phylogeny, it only takes a phylogeny, and is very vocal about it without me writing checking code
  • Types are defined explicitly, and there aren’t three kinds of class (R…!) to mess around with
  • Emacs (through ESS) already gives me a good coding environment. I couldn’t get JuliaStudio to work on my Ubuntu computer, but I’m not a fan of RStudio anyway so it was no biggie

That said, there are kinks, and while it’s only at version 0.2 some are surprising gives it’s over two years old. You’re going to find yourself on developer discussion pages quite quickly because the help files are still being built. You can’t delete or redefine variables or types, which is a nightmare when you’re experimenting. I’ve had a rather vocal falling out with Julia’s regular expression matching (matchall, not match, is your friend), the debugger hasn’t been touched in a while (…but it works), and there’s no standardisation of graphics yet. More fundamentally, Julia needs to take statistical formulae seriously; the expression notation is sort of alright, but not every function listens to it. I want to be able to plot(y ~ x), damnit!

Bottom line: it’s not ready yet, it made me grow a few grey hairs, and The Queen isn’t dead so long live R. However, take a quick look at any package up on GitHub, and you’ll find something you won’t see on any R packages – there’s no C code. I’m tired of people talking about how great it is that you can easily use C++ in R – I learnt R specifically because I wanted to move away from C++. If you have to solve your problem using another language, maybe you weren’t using the right language to begin with. I don’t think Julia will supplant R any time soon, but I am going to keep plugging away at phylogenies in it; I want to do some big simulations and I really don’t want to return to C++. I doubt I’m the only one.