pez: Phylogenetics for the Environmental Sciences

You’re not famous until they put your head on a pez dispenser. Or something. Image from pez.com; hopefully advertisement isn’t copyright infringement…

I have a new R package up on CRAN now (pez: Phylogenetics for the Environmental Sciences). We worked really hard to make sure the vignette was informative, but briefly, if you’re interested in:

  • a combined data structure for phylogeny, community ecology, trait, and environmental data (comparative.comm)
  • phylogenetic structure (e.g., shape, dispersion, and their traitgram options; read my review paper)
  • simulating phylogenies and ecological assembly (e.g., scape, sim.meta, ConDivSim)
  • building a phylogeny (phy.build)
  • applying regression methods based on some of the work of Jeannine Cavender-Bares and colleagues (eco.xxx.regression, fingerprint.regression)

…there’s a function in there for you. For what it’s worth, I’m already using the package a lot myself, so there is at least one happy user already. I had a lot of fun working on this, mostly because all the co-authors are such lovely people. This includes the people who run CRAN – I was expecting a hazing for any kind of minor mistake, but they’re all lovely people!
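For a flavour of the first two bullets, here is a minimal sketch. It assumes pez is installed from CRAN and that the package still ships its `laja` example dataset (with `invert.tree`, `river.sites`, and `invert.traits`), as its documentation describes; the object names `cc` and `shape` are just illustrative.

```r
# Guarded so the script still runs if pez is not installed
if (requireNamespace("pez", quietly = TRUE)) {
  library(pez)
  data(laja)  # example data assumed to ship with pez

  # Bundle phylogeny, community matrix, and traits into one object
  cc <- comparative.comm(invert.tree, river.sites, invert.traits)

  # Calculate the 'shape' family of phylogenetic structure metrics
  shape <- pez.shape(cc)
  print(head(shape))
} else {
  message("pez is not installed; try install.packages('pez')")
}
```

The vignette remains the place to look for the full set of options.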

I learnt a few things while getting this ready to go which might be of interest if, like me, you’re very naive as to how to do collaborative projects well. I don’t think much of this is R-specific, but here are things that I was surprised by the importance of…

  • People are sometimes too nice.
    • So you have to needle them a bit to be constructively nasty. Some (I’m looking at you, Steve Walker) are so nice that they feel mean making important suggestions for improvements. Some feel things are lost in writing them down and prefer talking over Skype (e.g., me), others are quicker over email.
    • Everyone has different skills, and you have to use them. Some write lots of code, others write small but vital pieces, some check methods, others document them, and some will do all of that but be paranoid they’ve done nothing. Everyone wants to feel good about themselves, and if you don’t tell them what you want from them, they won’t be happy!
  • Be consistent about methods.
    • I love using GitHub issues, but that meant the 5% of the time I was just doing something without making an issue about it… someone else was doing the same thing at the same time. Be explicit!
    • If you’re going to use unit tests, make sure everyone knows what kind of tests to be writing (checking values of simulations? checking return types?…), and that they always run them before pushing code. Otherwise pain will ensue…
    • Whatever you do, make sure everyone has the same version of every dependency. I imagine at least one person has made some very, very loud noises about my having an older version of roxygen2 installed…
  • Have a plan.
    • There will never be enough features, because there will never be an end to science. Start tagging things for ‘the next version’; you’ll be glad of it later.
    • Don’t be afraid to say no. Some things are ‘important’, but if no one cares enough to write the code and documentation for it, it will never get done. So just don’t do it!
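
The ‘be consistent’ points above (agreeing what tests should check, and agreeing dependency versions) can be sketched in base R. Everything named here is hypothetical: `simulate_community`, `check_min_version`, and the version number are illustrations, not anything from pez.

```r
# Hypothetical simulation function, standing in for real project code
simulate_community <- function(n_species, n_sites) {
  matrix(rpois(n_species * n_sites, lambda = 2),
         nrow = n_sites, ncol = n_species)
}

# Kind 1: check return types and shapes (does not depend on the random draws)
sim <- simulate_community(n_species = 5, n_sites = 3)
stopifnot(is.matrix(sim), identical(dim(sim), c(3L, 5L)), all(sim >= 0))

# Kind 2: check simulated values, which only makes sense with a fixed seed
set.seed(42)
first <- simulate_community(5, 3)
set.seed(42)
second <- simulate_community(5, 3)
stopifnot(identical(first, second))

# Fail early if a collaborator has an older version of a dependency
check_min_version <- function(pkg, min_version) {
  installed <- utils::packageVersion(pkg)
  if (installed < package_version(min_version)) {
    stop(sprintf("%s %s is installed, but this project needs >= %s",
                 pkg, as.character(installed), min_version),
         call. = FALSE)
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this call should pass on any recent installation
check_min_version("stats", "3.0.0")
```

Putting a call like the last one at the top of a shared build script is one way to turn ‘very, very loud noises’ into a readable error message.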

Dataframes in Ruby: always double-check online

D’oh! If only I had checked beforehand! Courtesy of Simpson Crazy; apparently a hand-traced image and so OK for copyright…!

I absolutely love the Ruby programming language; I wouldn’t necessarily say I’m very good at it (or any language for that matter), but I always smile as I type ‘irb’ at a console. I find the language is more expressive, the naming conventions easy to use, and there are none of the silly indentation issues you find with Python. So, when faced with a solo project, I of course chose Ruby, and when I couldn’t find a reasonable data.frame gem (the Ruby equivalent of a package) I saw an opportunity, not a problem!

Behold! data_frame was born! Marvel! At how it’s very similar to a hash but with only a few extra features. Gaze adoringly! At how it can load CSV and Excel (xls and xlsx) files! Scream in shock! When you discover an identically-named package already available on rubygems, that happens to be much nicer (albeit without the Excel features). D’oh! If only I’d Googled more thoroughly earlier!

On a more positive note, I found the new GitHub-Zenodo integration really convenient for getting a citable DOI, and I’ll definitely be using that for all projects in the future. Moreover, making a gem (documentation and all) and getting everything ready took a single afternoon with a relaxed glass or two of wine. This is going from scratch, mind you, and included the time taken to re-install Ruby, get everything into the right gem format, figure out jeweler, and get everything online. I somehow can’t imagine having the same experience working with R…

Load that package! Etc.

I’m travelling back to the UK this weekend (yaaaaay!) and so, while I might write some (buggy) code on the plane, I thought it would be a push to get something new up this week. So, instead, I’ve “checked over” the willeerd package, and you can now install it like so:

library(devtools)
install_github("willpearse/willeerd")

Tah-dah! My hope, in the next few weeks, is to have a few more posts with actual code within the page (like the above, but slightly less trivial). I might even veer off into posting about the usual sorts of R/ecology/evolution questions I get asked a lot, so if you have any preferences please do let me know!

Kill your unit test fetish

They truly are everywhere.

I’ve been looked at strangely for saying that I’m not a big fan of unit testing. Since I don’t have time to write any code this week, I thought I’d formalise this a little. Let me start by saying, in big bold font, unit testing is really helpful and you should definitely do it.

Unit testing is a way of formally asserting that your code works. In R, we have the great testthat package, which allows you to say things like:

expect_that(my_new_factorial_function(3), equals(6))

…and you can be damn sure that your new factorial function will return the correct value when it’s given a value of 3. When working on big projects with lots of people it’s a great way of making sure nobody breaks someone else’s code; if all the tests pass at the end of your edits, everything is fine. It’s great for stuff only you are doing too – my shiny database must be right because all the tests assert test data is loaded correctly.

However, quantity is not quality when it comes to testing. Testing a function that computes a factorial twenty times to make sure it fails when given phylogenetic trees and character strings is not the same as testing to make sure the function works. Test the things that matter first – if someone’s going to pass your function a character string and they need a specific error for that, there’s probably nothing you can do. Achieving complete coverage of tests is no use if all of your checks are nonsense.
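
To make the last two paragraphs concrete, here is a minimal sketch: a hypothetical implementation of `my_new_factorial_function` (the body is an assumption, not code from any package), a couple of tests on values that matter, and one test on bad input rather than twenty. It uses `expect_equal()`, the current testthat spelling of the `expect_that(..., equals(...))` form, and is guarded in case testthat is not installed.

```r
# Hypothetical implementation: one clear error for anything that is not
# a single non-negative whole number
my_new_factorial_function <- function(n) {
  if (!is.numeric(n) || length(n) != 1 || is.na(n) || n < 0 || n != round(n)) {
    stop("n must be a single non-negative whole number", call. = FALSE)
  }
  prod(seq_len(n))  # prod() of an empty vector is 1, so factorial(0) works
}

if (requireNamespace("testthat", quietly = TRUE)) {
  # Test the things that matter first: known values
  testthat::test_that("factorial returns known values", {
    testthat::expect_equal(my_new_factorial_function(3), 6)
    testthat::expect_equal(my_new_factorial_function(0), 1)
  })

  # One representative bad input is enough to show the check works
  testthat::test_that("factorial rejects non-numeric input", {
    testthat::expect_error(my_new_factorial_function("three"))
  })
}
```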

I think the most important thing in programming is writing good code, and that many have fetishised unit testing to the point where it’s no longer helpful for maintaining good code. So, in the spirit of being helpful, here are some general practices I think would make everything better:

  • Check the interactions.
    At a certain level, your code isn’t going to fail because you missed a comma somewhere, it will fail because two functions will interact in a way you hadn’t predicted (…and therefore couldn’t have written a test for). Check how the individual components of your program interact with one another, and then (go nuts) write some tests for that.
  • Test your assumptions.
    If you know what will break your program, either fix it, or write a check for that condition and make it fail gracefully. Everyone has confirmation bias – if you’re not finding errors when you thoroughly check your code with different input values, then you’re either the best programmer in the world or you’re not being honest with yourself. Find bugs and squash them; don’t write lots of lines of tests that make you feel better about the bugs you’re pretending you don’t have.
  • Write DRY code.
    Don’t Repeat Yourself. You wouldn’t have to write so many damn tests if you abstracted and re-factored your code once in a while. Modularise everything so that you are quicker writing new things. The whole purpose of computers is to make our lives easier – grab discrete modules of code you know already work (and are tested), and use them to get your current task done.
  • Listen to feedback.
    When I leave a bug report, it’s because I’m hoping to help someone. Even if I’m being stupid, can you say in all honesty that you couldn’t make your program a little clearer? I often leave bug reports because I want to draw the authors’ attentions to things that could cause problems further down the line; these aren’t bugs, they’re just things that might be helpful. Such ‘bug’ hunters probably like you and your work, so chill out!
  • You’re not a developer.
    You’re a scientist. So you’re probably not going to be able to write meaningful tests and then write a program that fulfills them, because you’re exploring and treading new ground. Don’t worry about it – just make sure your code works. And yes, that probably means some unit tests.
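
The first two points in the list can be sketched in base R. Both helpers are hypothetical (`standardise_names` and `abundance_for_tips` are made up for illustration): each is trivial on its own, and the interesting check is whether they agree with each other about what a species name looks like.

```r
# Hypothetical helper 1: tidy species names into tip-label style
standardise_names <- function(x) {
  gsub(" ", "_", tolower(trimws(x)))
}

# Hypothetical helper 2: pull abundances in tip order, failing gracefully
# (with a readable message) when a tip has no data
abundance_for_tips <- function(abundance, tips) {
  missing <- setdiff(tips, names(abundance))
  if (length(missing) > 0) {
    stop("No abundance data for: ", paste(missing, collapse = ", "),
         call. = FALSE)
  }
  abundance[tips]
}

# Check the interaction: messy field names must match tidy tip labels
abundance <- c("Quercus robur" = 3, "Quercus alba " = 5)
tips <- c("quercus_alba", "quercus_robur")
names(abundance) <- standardise_names(names(abundance))
result <- abundance_for_tips(abundance, tips)
stopifnot(identical(names(result), tips),
          identical(unname(result), c(5, 3)))

# Test the assumption: a missing species gives a clear error, not an NA
msg <- tryCatch(abundance_for_tips(abundance, "pinus_nigra"),
                error = function(e) conditionMessage(e))
stopifnot(identical(msg, "No abundance data for: pinus_nigra"))
```

Without the explicit check, `abundance["pinus_nigra"]` would quietly return `NA`, which is exactly the kind of interaction bug that a pile of single-function tests would not catch.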

Most of these pieces of advice involve unit testing. There’s a good reason for this: when done correctly, unit testing is helpful. When fetishised, it’s incredibly dangerous, because people think they can churn out more and more tests, write any kind of spaghetti code, and all will be fine. I promise you, it won’t.