Enabling DP development with a developer VM

Getting started with development on the DP code can be quite challenging. You can get a copy of the source code readily enough, but creating a system to test any changes gets complicated due to the code's dependencies — primarily its tight integration with phpBB.

For a long time now, developers could request an account on our TEST server, which has all the prerequisites installed, including a shared database with loaded data. There are a few downsides to using the TEST server, however. The primary one is that everyone uses the same shared database, significantly limiting the changes that can be made without impacting others. Another is that you need internet connectivity to do development work.

Having a way to do development locally on your desktop would be ideal. Installations on modern desktops are almost impossible, however, given our current dependency on magic quotes, a “feature” which has us locked on PHP 5.3, a very archaic version that no modern Linux desktop includes.

Environments like this are a perfect use case for virtual machines. While validating the installation instructions for the recent release, I set out to create a DP development VM. This ensured that our instructions could be used to set up a fully-working installation of DP as well as produce a VM that others could use.

The DP development VM is a VMware VM running Ubuntu 12.04 LTS with a fully-working installation of DP. It comes pre-loaded with a variety of DP user accounts (proofer, project manager, admin) and even a sample project ready for proofing. The VM is running the R201601 release of DP source directly from the master git repo, so it’s easy to update to newer ‘production’ milestones when they come out. With the included instructions a developer can start doing DP development within minutes of downloading the VM.

I used VMware because it was convenient: I already had Fusion on my Mac, and VMware Player is freely available for Windows and Linux. A better approach would have been VirtualBox1, as it's freely available for all platforms. Thankfully it should be fairly straightforward to create a VirtualBox VM from the VMware .vmdk (I leave this as an exercise for another developer).

After I had the VM set up and working I discovered vagrant while doing some hacking on OpenLibrary. If I had to create the VM again I would probably go the vagrant route. Although I expect it would take me a lot longer to set up, it would significantly improve the development experience.
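For anyone curious what the vagrant route might look like, here's a minimal sketch. The box name is an assumption; any Ubuntu 12.04 "precise" base box would work.

```shell
# Minimal sketch of a vagrant workflow. The box name is an
# assumption; substitute any Ubuntu 12.04 "precise" base box.
vagrant init hashicorp/precise64   # writes a Vagrantfile to the current directory
vagrant up                         # downloads the box and boots the VM
vagrant ssh                        # log in to install the DP prerequisites
```

The Vagrantfile also lets you script the provisioning steps, which is where the real win over a hand-built VM comes from.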

It’s too early to know if the availability of the development VM will increase the number of developers contributing to DP, but having yet another tool in the development tool-box can’t hurt.

1 Although I feel dirty using VirtualBox because it’s owned by Oracle. Granted, I feel dirty using MySQL for the same reason…

A new release of the DP site code, 9 years in the making

Today we released a new version of the Distributed Proofreaders code that runs pgdp.net! The announcement includes a list of what’s changed in the 9 years since the last release as well as a list of contributors, some statistics, and next steps. I’ve been working on getting a new release cut since mid-September so I’m pretty excited about it!

The prior release was in September 2006 and since that time there have been continuous, albeit irregular, updates to pgdp.net, but no package available for folks to download for new installations or to update their existing ones. Instead, enterprising individuals had to pull code from the ‘production’ tag in CVS (yes, seriously).

In the process of getting the code ready for release I noticed that there had been changes to the database on pgdp.net that hadn’t been reflected in the initial DB schema or the upgrade scripts in the code. So even if someone had downloaded the code from CVS they would have struggled to get it working.

As part of cutting the release I walked through the documentation that we provide, including the installation, upgrade, and configuration steps, and realized how much implied knowledge was in there. Much of the release process was me updating the documentation after learning what you were supposed to do.1 I ended up creating a full DP installation on a virtual machine to ensure the installation steps produced a working system. I'm not saying they're now perfect, but they are certainly better than before.

Cutting a release is important for multiple reasons, including the ability for others to use code that is known to work. But the most important one to me as a developer is the ability to reset dependency versions going forward. The current code, including that released today, continues to work on severely antiquated versions of PHP (4.x up through 5.3) and MySQL (4.x up to 5.1). This was a pseudo design decision in order to allow sites running on shared hosting, with no control over their middleware, to continue to function. Given how drastically the hosting landscape has changed over the past 9 years, and just how old those versions are, we decided it's time to change that.

Going forward we’re resetting the requirements to be PHP 5.3 (but not later, due to our frustrating dependency on magic quotes) and MySQL 5.1 and later. This will allow us to use modern programming features like classes and exceptions that we couldn’t before.

Now that we have a release behind us, I’m excited to get more developers involved and start making some much-needed sweeping changes. Things like removing our dependency on magic quotes and creating a RESTful API to allow programmatic access to DP data. I’m hoping being on git and the availability of a development VM (more on that in a future blog post) will accelerate development.

If you’re looking for somewhere to volunteer as a developer for a literary2 great cause, come join us!

1 A serious hat-tip to all of my tech writer friends who do this on a daily basis!

2 See what I did there?

Development leadership failure

Last night I did some dev work for DP. Mostly some code cleanup (heaven knows we need it) but also rolling out some committed code to production. I’ve made a concerted effort to get committed-but-not-released code deployed — some of which has been waiting for, literally, years.

Even worse, we have reams of code updates sitting uncommitted (and slowly suffering from bitrot) in volunteers' sandboxes waiting for code review. In the case of Amy's new quizzes, for almost 5(!!!!) years. In other cases volunteers have done a crazy amount of legwork to address architectural issues that remain unimplemented because there was no solid commitment that, if they did the work, it would be reviewed, committed, and deployed — like Laurent's site localization effort.

These are clear systematic failures by development leadership, i.e. me. It's obvious why, even when the project attracts developers, we can't retain them.

The first step is to get through the backlog of outstanding work. I have Laurent's localization work almost finished. This will allow the site to be translated into other languages — I think Portuguese and French are already done. Next up is getting Amy's new quizzes pushed out. She's done a marvelous job of keeping her code up to date with HEAD based on my initial work last night. Now to get them committed and rolled out. Then comes a site-wide change to our include()s, required to get full site localization implemented.

After all that, we need to address how to better keep code committed and rolled out. I think we as a team suffer from “don’t commit until it’s perfect, then wait until it’s simmered before rolling it out”. Where “simmered” means “sitting in CVS with no active testing done on it”. We need to move to a more flexible check-in criteria or a more liberal roll-out. There’s no good reason why the bar is so crazy high on both ends of that.

But first – the backlog.

Mystery of the terrible throughput (or how I solved a TCP problem)

It all started out with a simple single stream reading test. Just a simple request for the entirety of an 8GB file. We do this stuff all the time. Except this time instead of 700 MB/s I was getting 130 MB/s. What?

Usually we test with jumbo frames (9000 MTU) but for this exercise we were using standard frames (1500 MTU). Still, there’s no way that was the difference. After 2 days I discover a method to consistently reproduce the problem: while the streaming test is running, toggle the LRO flag on the server’s network interface. This is just as crazy as making your car go faster by removing your soda from the cupholder. There’s no way that it has anything to do with it, but for some reason it does. Consistently. At last I have a reproducible, if ludicrous, defect.
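For the curious, toggling the flag is nothing exotic; on FreeBSD it's just an ifconfig option. The interface name below is hypothetical:

```shell
# Toggle large receive offload (LRO) on the server's interface.
# "em0" is a hypothetical interface name; use your own.
ifconfig em0 -lro   # disable LRO
ifconfig em0 lro    # re-enable LRO
```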

Fast forward through 5 days of eliminating nodes, clients, switches, and NFS overcommits. Add in packet traces, kernel debugging output, and assorted analysis. Eventually Case catches the first real clue: the congestion window behaves distinctly differently between the 'fast' and 'slow' states. In the 'fast' state, the congestion window stays fairly constant. In the 'slow' state, the window oscillates wildly – starting at the MTU, growing really large, and then starting over.

The LRO trick worked by causing enough retransmits that the stack dropped into slow start mode — one mystery solved. And the reason we hadn't clearly identified this before: after a node-client pair gets into the fast state, the slow start threshold is retained in the TCP hostcache between connections — another mystery solved.

Fast forward through a few more days of slogging through TCP code down the path of blaming the slow start threshold (or rather, the lack of slow start in the slow state). By this time I'm way more familiar with the TCP code, and our kernel debugging framework, than I want to be. I notice that every time the congestion window drops back to the MTU, it's caused by an ENOBUFS error. It's very unlikely we're running out of buffer space, though. Checking the called function reveals that the error shows up not only when we're out of buffers, but also if one can't be returned immediately. We surmise the problem is some contention causing an inability to immediately get the requested buffer. So I change the code to reduce the congestion window by a single segment size (aka the MTU) instead of dropping it all the way down to the segment size. The assumption is that the next time we request a buffer of this size, we're likely to get one.

And performance shoots up to 900 MB/s — even higher than the previous fast state.

The reason we’re unable to return the requested buffer immediately is unclear, and frankly above my paygrade. I’ll happily let the kernel devs work on that (it involves slabs and uma and things geekier than me).

The core of the problem remains “why aren’t we able to return the requested buffer immediately” but until the devs conquer that one we have a valid, shippable, workaround. And a lowly tester found, identified, and fixed it!

A geek and his keyboard

Accepting the death of one keyboard and the failure of its backup was simply not an option, so I started off this morning with my trusty screwdriver.

I opened up the bottom of the dead keyboard and studied its innards. From top to bottom the keyboard consists of:

  1. keys
  2. translucent rubber layer
  3. flexible transparent layer with printed circuit
  4. flexible transparent buffer layer with no circuit
  5. flexible transparent layer with printed circuit
  6. 3 large white plastic structural pieces
  7. 1 PCB

Given the simple structure it is apparent that the PCB is the failing component of the backup keyboard. The PCB design and rev number differ between the two keyboards, but I thought swapping them out would be worth a shot. Fortunately the physical structure of both keyboards is identical. Unfortunately swapping them didn't work, and examining the circuit layers (#3) made it obvious why: they changed the circuit layout to the PCB between revisions.

I went with Plan B, which was determining why those specific keys on the dead keyboard were dead. One look at layer #3 confirmed that all the dead keys are on the same circuit. Bringing out my trusty multimeter, I discovered a break in the circuit to the PCB. But how to fix that? The circuit layers are printed on plastic, so even if I had my soldering iron here in Denver, there was no way that was going to work. The dead gap wasn't all that large, just a couple of millimeters; I just needed something to bridge it. A small piece of wire wasn't optimal as it wouldn't be flat and it would be hard to secure. Then the light bulb went off: aluminum foil. Conductive, easily trimmed down to the right size, and flat. Throw in a small piece of Scotch tape, and a few minutes later I had my first hardhack.

And thus far it works beautifully. As a bonus I moved layers #3-5 and #7 to the shell of the backup keyboard so I get the pearly white keys of the backup with the tried-and-true workings of the original.

I’m a bit concerned that the failure of that one circuit is simply a foreshadowing of things to come with different circuits. By the looks of the backup keyboard’s circuits it’s clear that the degradation isn’t from use but from age (which makes perfect sense anyway). We’ll see how long my hardhack works and if there are future failures elsewhere. Who knows, by the time I’m through maybe I’ll have a completely rebuilt keyboard full of aluminum foil.

The effective lifespan of a Microsoft Natural keyboard: ~13 years

I’m sad to report the demise of my Microsoft Natural keyboard (not the Elite, or the Pro or the MultiMedia – the original circa 1995). I turned the computer on today and the keys 67yhnujm no longer work. Given that I’ve had it for a minimum of 13 years, it’s had a good run.

Never one to be left unprepared I went to the basement and brought up my spare. Yes, I have a spare Microsoft Natural keyboard for just this circumstance. I love the keyboard so much that when I heard they were no longer making it and replacing it instead with the much inferior Elite, I purchased a spare. It’s been in its box for a good 8 or more years. (Don’t ask about the lengths I’ve gone to keep an original Logitech TrackMan Marble working, it’ll just make me sound obsessive.) Anyone who spends as much time in front of a computer as I do will completely understand about the attachment to specific input devices. The rest of you will call us freaks.

I plugged in the spare keyboard, gently caressing the palm rest and marveling at the perfectly white keys, only to discover that there’s Something Wrong with it. Yes, my backup keyboard failed. Upon certain key combinations the keyboard starts sending escape sequences. Suck a duck.

So now I’m left typing on a crappy Dell keyboard and trying to figure out where to go from here. Looks like I need to crack open both keyboards and see if I can’t merge the two together to make one workable version.

And here I thought I was set for another 13 years…

Inkscape development dependencies on Fedora 12

Just FYI, if you want to compile Inkscape from source, you’ll need (at least) the following dependency RPMs on Fedora 12:

  • gc-devel
  • glib-devel
  • gtk+-devel
  • gsl-devel
  • libxml
  • libxml-devel
  • poppler-devel
  • poppler-glib-devel
  • libsigc++20-devel
  • glibmm24-devel
  • cairomm-devel
  • pangomm-devel
  • gtkmm24-devel
  • ghostscript
  • ghostscript-devel
  • jasper-devel
  • ImageMagick-devel
  • ImageMagick-c++-devel
  • libwpd-devel
  • libwpg-devel
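If you'd rather not hunt them down one at a time, the whole list can be installed in one shot (package names exactly as above; yum will complain about any that have since been renamed):

```shell
# Install all of the Inkscape build dependencies on Fedora 12 in one go
sudo yum install gc-devel glib-devel gtk+-devel gsl-devel libxml libxml-devel \
    poppler-devel poppler-glib-devel libsigc++20-devel glibmm24-devel \
    cairomm-devel pangomm-devel gtkmm24-devel ghostscript ghostscript-devel \
    jasper-devel ImageMagick-devel ImageMagick-c++-devel libwpd-devel libwpg-devel
```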