(This is an edited version of a rant that started life on Twitter. I may add some links later.)
Can we talk for a few minutes about the weird academic-integrity hostage situation going on in CS research right now?
We share a lot of data here at Mozilla. As much as we can – never PII, not active security bugs, but anyone can clone our repos or get a bugzilla account, follow our design and policy discussions, even watch people design and code live. We default to open, and close up only selectively and deliberately. And as part of my job, I have the enormous good fortune to periodically go to conferences where people have done research, sometimes their entire thesis, based on our data.
Yay, right?
Some of the papers I’ve seen promise results that would be huge for us. Predicting flaws in a patch prereview. Reducing testing overhead 80+% with a four-nines promise of no regressions and no loss of quality.
I’m excitable, I get that, but OMFG some of those numbers. 80 percent reductions of testing overhead! Let’s put aside the fact that we spend a gajillion dollars on the physical infrastructure itself, let’s only count our engineers’ and contributors’ time and happiness here. Even if you’re overoptimistic by a factor of five and it’s only a 20% savings we’d hire you tomorrow to build that for us. You can have a plane ticket to wherever you want to work and the best hardware money can buy and real engineering support to deploy something you’ve already mostly built and proven. You want a Mozilla shirt? We can get you that shirt! You like stickers? We have stickers! I’ll get you ALL THE FUCKING STICKERS JUST SHOW ME THE CODE.
I did mention that I’m excitable, I think.
But that’s all I ask. I go to these conferences and basically beg, please, actually show me the tools you’re using to get that result. Your result is amazing. Show me the code and the data.
But that never happens. The people I talk to say I don’t, I can’t, I’m not sure, but, if…
Because there’s all these strange incentives to hold that data and code hostage. You’re thinking, maybe I don’t need to hire you if you publish that code. If you don’t publish your code and data and I don’t have time to reverse-engineer six years of a smart kid’s life, I need to hire you for sure, right? And maybe you’re not proud of the code, maybe you know for sure that it’s ugly and awful and hacks piled up over hacks, maybe it’s just a big mess of shell scripts on your lab account. I get that, believe me; the day I write a piece of code I’m proud of before it ships will be a pretty good day.
But I have to see something. Because from our perspective, making a claim about software that doesn’t include the software you’re talking about is very close to worthless. You’re not reporting a scientific result at that point, however miraculous your result is; you’re making an unverifiable claim that your result exists.
And we look at that and say: what if you’ve got nothing? How can we know, without something we can audit and test? Of course, all the supporting research is paywalled PDFs with no concomitant code or data either, so by any metric that matters – and the only metric that matters here is “code I can run against data I can verify” – it doesn’t exist.
Those aren’t metrics that matter to you, though. What matters to you is either “getting a tenure-track position” or “getting hired to do work in your field”. And by and large the academic tenure track doesn’t care about open access, so you’re afraid that actually showing your work will actively hurt your likelihood of getting either of those jobs.
So here we are in this bizarro academic-research standoff, where I can’t work with you without your tipping your hand, and you can’t tip your hand for fear I won’t want to work with you. And so all of this work that could accomplish amazing things for a real company shipping real software that really matters to real people – five or six years of the best work you’ve ever done, probably – just sits on the shelf rotting away.
So I go to academic conferences and I beg people to publish their results and paper and data open access, because the world needs your work to matter. Because open access plus data/code as a minimum standard isn’t just important to the fundamental principles of repeatable experimental science, the integrity of your field, and your career. It’s important because if you want your work to matter to people, then you’d better put it somewhere that people can see it and use it and thank you for it and maybe even improve on it.
You did this as an undergrad. You insist on this from your undergrads, for exactly the same reasons I’m asking you to do the same: understanding, integrity and plain old better results. And it’s a few clicks and a GitHub account for you to do the same now. But I need you to do it one last time.
Full marks here isn’t “a job” or “tenure”. Your shot at those will be no worse, though I know you can’t see it from where you’re standing. But they’re still only a strong B. An A is doing something that matters, an accomplishment that changes the world for the better.
And if you want full marks, show your work.