A few years ago I got asked to investigate a claimed software-license violation – I was the principal author of our licensing runbook at the time – in which some meaningful chunk of C code had, the claim went, been inappropriately re-purposed into a differently-licensed context. The twist in the tale was that the C in question was being re-used as JavaScript.
Strange, but plausible? If you decide to write them both a particular way they’re syntactically close enough that this is believable, and while I don’t think this has ever been tested in court for code, as far as text is concerned translation doesn’t obviate copyright. And the bar we’d set for ourselves wasn’t “the law might be arguably maybe ambiguous so we’ll do whatever we can maybe get away with”, it was “we will act as good citizens of the larger open source software community”, so the investigation was necessary.
As far as I could tell there aren’t (or at least weren’t at the time) any available tools that would look at two codebases written in different languages and tell you that any given pair of files is similar… ish. I thought about the problem for a while but couldn’t figure it out; I’m not particularly good at any sort of statistical analysis, so while I had sort of intuited that there might be a path to the answer through compression – you’d think that pairs of somewhat-similar files would compress better than pairs of very dissimilar files, was my guess? – I couldn’t figure out how to make that approach work. So, without some sort of available tools written by somebody much smarter than me, I found myself left with the option of a manual examination, comparing all the files in one codebase to all the files in the other, and I really didn’t want to do that.
But I didn’t see any other way to do it and be sure. I don’t mind tedious work, but this was an MxN problem for not-quite-tiny M and N, and it was going to be a slog. Sometimes you’ve just got to grind out the job however unpleasant it is, I get it, but there’s a line where tedium turns into impracticality and this felt like it was definitely on the wrong side of it.
Eventually while pondering in the office kitchen I put the problem to my friend Blake, who immediately came up with the approach that worked.
His idea was that you can easily get a measure of how disjoint two files are with diff, and that you’d expect almost every file in a codebase in one language to be either completely different from every file in the other one or (e.g. license text, small files that are mostly header) completely identical. So if you carved these files up character by character and your diff was neither close to empty nor about the same length in lines as the two files put together, that pairing was worth investigating.
And that insight, it turned out, was very easy to implement in the shell.
- Run all the files in both codebases through sed, so they’re all one character per line and stripped of whitespace.
- Take a diff of every file in one repo against every file in the other and count how many lines it is with “wc -l”.
- Your output is each pair of filenames and their fractional difference, expressed as “length of diff in lines” / “sum of lengths of both files”.
- And finally, go looking for inliers rather than outliers; those places where the fractional difference is anything other than “under 1%” or “over 99%”.
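The steps above can be sketched roughly as follows. This is not the original script, just a minimal reconstruction in POSIX shell under a few assumptions of mine: the function names (explode, diff_percent, compare_trees) are made up for illustration, tr and fold stand in for the sed step, shell integer arithmetic stands in for bc, and only diff’s “<” and “>” lines are counted so that hunk headers don’t push the result past 100%.

```shell
# Print a file one character per line, with all whitespace removed.
# (tr + fold stand in for the sed step described above.)
explode() {
    tr -d '[:space:]' < "$1" | fold -w 1
    echo   # terminate the final line so diff sees a trailing newline
}

# Print the difference between two files as an integer percentage:
# changed lines in the character-level diff / total lines in both files.
# Note: plain sh has no local variables, so a, b, total, changed are global.
diff_percent() {
    a=$(mktemp) && b=$(mktemp) || return 1
    explode "$1" > "$a"
    explode "$2" > "$b"
    total=$(cat "$a" "$b" | wc -l)
    # Count only removed/added lines, not diff's hunk headers or separators.
    changed=$(diff "$a" "$b" | grep -c '^[<>]')
    rm -f "$a" "$b"
    echo $(( 100 * changed / total ))
}

# Compare every file under DIR_A with every file under DIR_B and print
# "<percent>% <file_a> <file_b>", sorted by percentage.
# usage: compare_trees DIR_A DIR_B
compare_trees() {
    find "$1" -type f | while IFS= read -r fa; do
        find "$2" -type f | while IFS= read -r fb; do
            printf '%s%% %s %s\n' "$(diff_percent "$fa" "$fb")" "$fa" "$fb"
        done
    done | sort -n
}
```

With two hypothetical checkouts, something like `compare_trees ./c-project ./js-project | awk -F% '$1 > 1 && $1 < 99'` would then print only the inliers. The sketch re-explodes each file for every pairing, which is wasteful; exploding everything once into a scratch directory, as the first step above describes, is the obvious improvement.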
If you’re even slightly comfortable with a few command-line tools that are available everywhere, this is a couple of lines of bash calling out to diff, wc and bc, not difficult to implement at all. I didn’t even need to think about parallelizing it – all computers are insanely fast now, if you get out of their way! – but you can see how easy it would be to farm this off across however many cores you have to hand if the job got large enough.
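For what it’s worth, here is one way the farming-out could look; I never needed it, so treat this as an untested-in-anger sketch. It assumes an xargs that supports -0 and -P (GNU and BSD both do, though neither flag is POSIX), and the function name and the default of 4 jobs are placeholders of mine. Each worker handles one pair of files, using the same character-per-line diff as before.

```shell
# Compare every file under DIR_A with every file under DIR_B, running up to
# JOBS comparisons at once, and print "<percent>% <file_a> <file_b>".
# usage: compare_trees_parallel DIR_A DIR_B [JOBS]
compare_trees_parallel() {
    # Emit every (file_a, file_b) pairing, one filename per line...
    find "$1" -type f | while IFS= read -r fa; do
        find "$2" -type f | while IFS= read -r fb; do
            printf '%s\n%s\n' "$fa" "$fb"
        done
    done |
    # ...then hand the names to xargs two at a time, NUL-separated so that
    # spaces in filenames survive (newlines in filenames would not).
    tr '\n' '\0' | xargs -0 -n 2 -P "${3:-4}" sh -c '
        [ "$#" -eq 2 ] || exit 0
        a=$(mktemp) && b=$(mktemp) || exit 1
        # One character per line, whitespace stripped, for each file.
        { tr -d "[:space:]" < "$1" | fold -w 1; echo; } > "$a"
        { tr -d "[:space:]" < "$2" | fold -w 1; echo; } > "$b"
        total=$(cat "$a" "$b" | wc -l)
        changed=$(diff "$a" "$b" | grep -c "^[<>]")
        rm -f "$a" "$b"
        printf "%s%% %s %s\n" "$(( 100 * changed / total ))" "$1" "$2"
    ' sh | sort -n
}
```

The final sort is what puts the results back in order, since the workers finish whenever they finish.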
[hundreds of lines of "0% file1 file2"]
[hundreds of lines of "1% file1 file2"]
22% fileX fileY
43% fileX fileY
48% fileX fileY
84% fileX fileY
[dozens of lines of "99% file1 file2"]
And, let me tell you, it works like a charm. I wish I could share the details (and the protagonists, and the real output) with you, but it’ll have to be enough for me to say: that “bullseye on the first shot” feeling, that’s a pretty good feeling. In any case, the situation was resolved to everyone’s satisfaction most of a decade ago. But I wanted to introduce you to the technique we used for finding similarities between syntactically-similar languages in disparate codebases, in case you ever find yourself needing it.
Thinking on it, this whole exercise is sort of a microcosm of whatever success I’ve had in my career. I don’t know that I have any meaningful advice to give anyone about anything, but “be competent at assembling novel tools out of the vast, if often ill-fitting, selection of tools free software offers” and “have really smart friends” have carried me a very long way. The second one in particular; definitely have smart friends. If you have the opportunity, be somebody’s smart friend.