Hadoop OutOfMemory errors

If, when running a Hadoop job, you get errors like the following:

11/10/21 10:51:56 INFO mapred.JobClient: Task Id : attempt_201110201704_0002_m_000000_0, Status : FAILED
Error: Java heap space

The OOM isn’t in the JVM that the Hadoop JobTracker or TaskTracker runs in (the maximum heap size for those is set in conf/hadoop-env.sh with HADOOP_HEAPSIZE), but rather in the separate JVM spawned for each task. The maximum heap size for those task JVMs can be controlled via parameters in conf/mapred-site.xml. For instance, to raise the default max heap size from 200MB to 512MB, add these lines:

   <property>
       <name>mapred.map.child.java.opts</name>
       <value>-Xmx512m</value>
   </property>
   <property>
       <name>mapred.reduce.child.java.opts</name>
       <value>-Xmx512m</value>
   </property>
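If you’d rather not change the cluster-wide default, the same properties can also be set per job from the driver code before submitting. Here’s a minimal sketch (the class name and job name are made up for illustration), assuming the same mapred.* property names as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HeapBumpDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Raise the max heap of the child task JVMs for this job only,
            // without editing conf/mapred-site.xml on the cluster.
            conf.set("mapred.map.child.java.opts", "-Xmx512m");
            conf.set("mapred.reduce.child.java.opts", "-Xmx512m");

            Job job = new Job(conf, "heap-bump-example");
            // ... configure mapper, reducer, input and output paths as usual ...

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }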

I find it sad that this took me a day to figure out. I kept googling for variations of “hadoop java out of memory”, which were all red herrings. If I had just googled for the literal error “Error: Java heap space” plus “hadoop”, I’d have gotten there a lot faster. Lesson learned: search for the literal error message instead of trying to outsmart Google with your own description of the problem.