A new release of the DP site code, 9 years in the making

Today we released a new version of the Distributed Proofreaders code that runs pgdp.net! The announcement includes a list of what’s changed in the 9 years since the last release as well as a list of contributors, some statistics, and next steps. I’ve been working on getting a new release cut since mid-September so I’m pretty excited about it!

The prior release was in September 2006 and since that time there have been continuous, albeit irregular, updates to pgdp.net, but no package available for folks to download for new installations or to update their existing ones. Instead, enterprising individuals had to pull code from the ‘production’ tag in CVS (yes, seriously).

In the process of getting the code ready for release I noticed that there had been changes to the database on pgdp.net that hadn’t been reflected in the initial DB schema or the upgrade scripts in the code. So even if someone had downloaded the code from CVS they would have struggled to get it working.

As part of cutting the release I walked through the documentation that we provide, including the installation, upgrade, and configuration steps, and realized how much implied knowledge was in there. Much of the release process was me updating the documentation after learning what you were suppose to do.1 I ended up creating a full DP installation on a virtual machine to ensure the installation steps produced a working system. I’m not saying they’re now perfect, but they are certainly better than before.

Cutting a release is important for multiple reasons, including the ability for others to use code that is known to work. But the most important to me as a developer is the ability to reset dependency versions going forward. The current code, including that released today, continues to work on severely antiquated versions of PHP (4.x up through 5.3) and MySQL (4.x up to 5.1). This was a pseudo design decision in order to allow sites running on shared hosting with no control over their middleware to continue to function. Given how the hosting landscape has changed drastically over the past 9 years, and how really old those versions are, we decided it’s time to change that.

Going forward we’re resetting the requirements to be PHP 5.3 (but not later, due to our frustrating dependency on magic quotes) and MySQL 5.1 and later. This will allow us to use modern programming features like classes and exceptions that we couldn’t before.

Now that we have a release behind us, I’m excited to get more developers involved and start making some much-needed sweeping changes. Things like removing our dependency on magic quotes and creating a RESTful API to allow programmatic access to DP data. I’m hoping being on git and the availability of a development VM (more on that in a future blog post) will accelerate development.

If you’re looking for somewhere to volunteer as a developer for a literary2 great cause, come join us!

1 A serious hat-tip to all of my tech writer friends who do this on a daily basis!

2 See what I did there?

Secret sauce to including data_files in python bdist_rpm

2017/02/10 update: I’ve discovered a better solution (using an updated version of setuptools) and have updated the post below.

To add additional data files to python installation files — such as service files in /etc/init.d for RH-based distros — all the docs tell you to simply add a data_files line to your setup.py file, like:

    data_files=[('/etc/init.d', ['init.d/service_name'])],

If you do this, however, might get this lovely error when attempting to create the RPM:

$ python setup.py bdist_rpm
< snip >
running install_data
error: can't copy 'init.d/service_name': doesn't exist or not a regular file
error: Bad exit status from /var/tmp/rpm-tmp.DRsp99 (%install)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.DRsp99 (%install)
error: command 'rpmbuild' failed with exit status 1

This will perplex you for ages considering the file clearly exists in the path relative to setup.py, per the instructions:

$ ls -l setup.py
-rw-rw-r--. 1 cpeel cpeel 1233 Mar  9 14:37 setup.py
$ ls -l init.d/service_name
-rw-rw-r--. 1 cpeel cpeel 1242 Mar  9 14:40 init.d/service_name

The problem is that there are (at least) two different flavors of setup() functions and they are not completely alike. Both distutils and setuptools provide a setup() function and both packages have changed over time. See this StackOverflow example discussing the python setup tools landscape as of January 2017.

I don’t know what version of distutils I was using when I wrote this blog post, but it was probably some version with python 2.6. The version of distutils bundled with python 2.7.12, and presumably later, does not generate the RPM build error.

I also get the above RPM build error when using setup() from older versions of setuptools (at least 20.1.1). Updating to the most recent version of setuptools (34.1.1 as of February 2017) resolves the issue.

It appears that the general internet consensus, and official Python Packaging User Guide, is to use setuptools which can be updated independently of the python version you are using, so just be sure you’re using the latest version.

If you’re stuck on an older version of setuptools or distutils, the missing piece that took me hours to discover, is that even with data_files in setup.py, you need a MANIFEST.in file that tells them to include the files in the package:

$ cat MANIFEST.in 
recursive-include init.d *

Decreasing bitmap resolution when exporting PDFs in Inkscape

When exporting a document with bitmap images in it, Inkscape won’t downsample them to a lower resolution, despite the “Resolution for rasterization (dpi)” setting in the PDF export dialog. That export dialog setting only applies to bitmaps that have had filters applied against them. See bug 246677.

4 years ago I figured out how to work around this and today I had to relearn it. So as to not repeat this in another 4 years, I’m documenting this here for posterity.

The workaround is to apply an identity filter to each bitmap before saving it. An identity filter is one that doesn’t do anything, but because it is a filter it forces Inkscape to do the downsampling upon export. Because an identity filter doesn’t actually do anything, there isn’t one available in the Filters menu, but we can create one.

To create a reusable identity filter:

  1. Open a new Inkscape file
  2. From the Filters menu, select Filter Editor…
  3. In the Filter Editor pane, click the New button. This will add the filter “filter1” to the Filter list
  4. Double click “filter1” and rename it “Identity”
  5. To the right of the Add Effect: button there is a drop-down. Change it to Color Matrix, and hit the Add Effect: button
  6. Save the document as “Identity filter.svg”
  7. Now put the file where Inkscape can find it:
    • Linux/OS X: ~/.config/inkscape/filters
    • Windows XP: C:\Documents and Settings\[USERNAME]\Application Data\Inkscape\filters
    • Windows Vista and later: C:\Users\[USERNAME]\AppData\Roaming\Inkscape\filters
  8. The ‘filters’ directory may not exist, in which case just create it.
  9. Close and restart Inkscape
  10. This should add the Personal submenu under the Filters menu which should include our Identity filter.

To use the identity filter:

  1. In your Inkscape file, select the bitmaps you want to downsample. Be sure not to select any vector images or text since you don’t want to rasterize that.
  2. From the Filters menu, select Personal > Identity
  3. Now when you save the document as a PDF, adjust the “Resolution for rasterization (dpi)” setting. 90 is a decent number for documents that are being viewed on a computer screen. 150 is probably the smallest you want to go for anything that is being printed.

iozone and caching – the bane of benchmarking

iozone is a common tool used by companies and researchers when benchmarking storage systems. What most iozone users don’t seem to realize is that unless care is taken, the test may exercise only the storage system’s cache and not the underlying system.

iozone supports different test types, the most common used are:

  • -i0 – sequential write
  • -i1 – sequential read
  • -i2 – random write / random read

These are often used together in a single iozone execution, like:

iozone -i0 -i1 -i2 [...]

This will do a sequential writes test, followed by a sequential read test, followed by a random write test, followed by a random read test. The problem with this approach is that it will exercise both the client’s VM cache as well as whatever front-end cache the storage system is using. Depending on the storage system, in addition to going to nonvolatile storage, the initial sequential write may also go into the storage system’s front-end cache, so the subsequent sequential read may not hit the nonvolatile storage at all making the read test a pure cached test. Ditto the random read test. Similarly, the client may cache some or all of the file as well making the results even less useful.

The client cache is why the iozone docs recommend using a file size 2x the size of the client’s memory. At Isilon we find this unwieldy — particularly when you have 24GB clients. Moreover it does’t solve the problem of the storage system cache at all.

Instead, we have a wrapper script that will run each test in isolation and between test runs flush both the client’s cache (assuming the client is Linux – sysctl vm.drop_caches=1, or just unmount and remount the storage) and the Isilon cluster’s cache (isi_for_array isi_flush) between each run. This allows us to use smaller file sizes while getting results from the underlying storage and not caches.

The above “-i0 -i1 -i2” test gets broken down executed like this (but in a loop of course):

Isilon cluster cache flush
client cache flush
iozone -i0
Isilon cluster cache flush
client cache flush
iozone -i1
Isilon cluster cache flush
client cache flush
iozone -i2

There’s only one other problem with this approach: random writes and reads are done with the same test operation (-i2). This prevents iozone from being able to provide cache-free results as the random read immediately follows the random write. At Isilon we’ve modified our iozone binary to split the random write and random read operations into separately runable tests to work around this limitation.

If your intent is to test a system’s cache, then by all means run all the iozone tests in the same execution. But if you’re wanting to test the underlying storage you need to run them separately and flush caches between the executions.

Hadoop OutOfMemory errors

If, when running a hadoop job, you get errors like the following:

11/10/21 10:51:56 INFO mapred.JobClient: Task Id : attempt_201110201704_0002_m_000000_0, Status : FAILED
Error: Java heap space

The OOM isn’t with the JVM that the hadoop JobTracker or TaskTracker is running in (the maximum heap size for those are set in conf/hadoop-env.sh with HADOOP_HEAPSIZE) but rather the separate JVM spawned for each task. The maximum JVM heap size for those can be controlled via parameters in conf/mapred-site.xml. For instance, to change the default max heap size from 200MB to 512MB, add these lines:

   <property>
       <name>mapred.map.child.java.opts</name>
       <value>-Xmx512m</value>
   </property>
   <property>
       <name>mapred.reduce.child.java.opts</name>
       <value>-Xmx512m</value>
   </property>

I find it sad that this took me a day to figure out. I kept googling for variations of “hadoop java out of memory” which were all red herrings. If I had just googled for the literal error “Error: Java heap space” plus hadoop I’d have gotten there a lot faster. Lesson learned: don’t try to outsmart google with the actual problem.

CSS box model woes

This weekend I spent a couple of hours struggling with a new DIV-based layout design for peelinc.com (it’s not up yet, don’t bother looking). The core of my problem was that I was designing for the border box model but the browser was rendering it with the W3C content box model. It wasn’t until I realized that I needed to explicitly state that I wanted the border box model that things started working correctly.

And really W3C – content box model? Trying to develop a fully-fluid layout using the content box model is nuts.

Except I couldn’t get IE8 to render using the border box model even though it’s supported. And yes, I had the !DOCTYPE specified so it was suppose to be rendering it in standards compliance mode. After enough digging I figured out that despite the DOCTYPE, it was rendering it in IE7 compatability mode instead. Arg. I was able to get IE8 to cooperate by using the IE document compatibility meta tag:

<meta http-equiv=”X-UA-Compatible” content=”IE=edge” >

Yet another reason why I hate hate hate IE.

But now, thankfully, the page renders exactly as I had intended on Safari, Chrome, Firefox, and IE8 without any browser hacks (if you don’t consider telling IE8 to use the freakin’ standards a hack). No idea what it’ll look like on IE6 or IE7, but frankly I’m not worried if it looks a bit wonky in those — the content will still be perfectly readable.

gridengine qmon font problems

Even after you get gridengine installed, you’ll have some lovely font problems when using qmon. After trying a few things, the easiest is just to follow the advice at this web page and set the fonts to fixed. Note that copying $SGE_ROOT/qmon/Qmon to $HOME is an important part — updating the file directly won’t work. This should be done on the system with the qmon binary (should be obvious, but you never know).

If you decide to try and fix the fonts themselves, remember that the fonts need to exist on the Xserver rendering the qmon dialog, not necessarily the one on the system where qmon is being run from. In other words, if you’re doing X11 forwarding from a remote host to your desktop, it’s your desktop’s X11 server that needs the fonts, not the fonts on the remote host. Thanks to this post clarifying that pesky detail.

And finally, if you decide to poke around and get non-fixed fonts working, you’ll probably need these RHEL packages (via):

  • xorg-x11-font-utils
  • xorg-x11-fonts-100dpi
  • xorg-x11-fonts-75dpi
  • xorg-x11-fonts-misc

and possibly futzing with xfs and fc-cache. I tried this approach and fixed the initial error messages from the X server (yay!) but resulted in boxes instead of glyphs (boo) and failed back to fixed fonts.