Spell checking Java source code

Dammit, Jim, I’m an Engineer, not an English major!

API documentation is incredibly important. Tutorials and other reference documents can be written in conventional text editors or word processors with nice built-in spell checkers, but the public API of a Java project is conveyed through JavaDoc comments that live in the source code, right next to each class and method as they are declared. So we inevitably find ourselves with tons of generated web pages covered in prose… and full of spelling mistakes.

So I’ve had it in mind for a while to run aspell over the java-gnome sources. Raving insanity, that’s for sure. Spell check source code? Are you mad?

Well, being suitably euphoric on New Year’s Day, I decided “what the hell, why not” and gave it a try. One thing that was obvious was that I was going to end up with a ridiculous number of unknown words that would need adding. Rather than filling my personal dictionary (the one in $HOME) with tons of project-specific crud, I used aspell’s -p option to specify a new word list in the project’s top-level directory. Easy enough with bzr: they’ve got a bzr root command that tells you the path of the project root. Nice.

Here’s the command line I used:

aspell -x -c -p `bzr root`/.aspell.en.pws -H Button.java

It worked pretty well. The tokens sure did add up in a hurry, though. Java language keywords? OK, no problem. Class names? Sure, makes sense to add them; many had already turned up when spell-checking other documentation in the project. But uh oh: it wants to know about getLabel( and num and x and every other bit of source code. Yikes. It’s quite a pain to add all of that while working through each file, just to reach the JavaDoc and ordinary comments and fix the spelling there.

But worth it… we now have spell checked API documentation!

What would really be neat is to write a little module for aspell that adds a mode which only spell checks the text between /* and */ delimiters. The -H flag above tells aspell to ignore HTML markup, and there are modes for LaTeX and others. So hopefully a “source code” mode would be feasible, and I could start again and have a slightly better signal-to-noise ratio :)
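
As a rough idea of what such a mode would have to do, here is a minimal Python sketch that blanks out everything outside /* ... */ blocks and leaves the comments where they were, so line numbers still match the original file. This is not an aspell module; the names mask_non_comments and _blank are made up for illustration, and the regular expression is deliberately naive (it ignores // comments and would be fooled by a /* inside a string literal).

```python
import re

# Naive pattern for /* ... */ and /** ... */ block comments, spanning lines.
# Assumption: no "/*" appears inside string literals in the source.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def _blank(text):
    """Replace every character except newlines with a space."""
    return re.sub(r"[^\n]", " ", text)


def mask_non_comments(source):
    """Return source with all text outside block comments turned into spaces.

    Newlines are kept, so the result has the same length and the same
    line numbering as the input.
    """
    pieces = []
    pos = 0
    for match in BLOCK_COMMENT.finditer(source):
        pieces.append(_blank(source[pos:match.start()]))  # code: hide it
        pieces.append(match.group(0))                     # comment: keep it
        pos = match.end()
    pieces.append(_blank(source[pos:]))                   # trailing code
    return "".join(pieces)
```

The masked text could then be handed to the spell checker in place of the raw file, for example by writing it to a temporary copy, so that identifiers like getLabel( never reach the word list prompt. Any fixes would still have to be carried back to the real source by hand.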

Happy New Year!