Wednesday, July 2, 2014

Hey, Don't Fire that Guy

A few weeks ago at Silicon Valley DevOps Days, in an open spaces discussion about outages, somebody asked a question that has been on my mind ever since. The question was: "Has anyone ever fired someone for causing an outage?" It's not a new question, but the jury is still out on the right answer.

Some of my favorite insight on the topic comes from a post by John Allspaw. In the post, he states that in order to learn from a mistake without inhibition, there must be not only complete trust but also no fear of retribution. I take this view to heart and I'm a devoted defender of the concept.

However, I feel like the view comes with a few provisos. There is possibly a greater philosophy at play in the question. What really got me thinking about this was one of the stories I heard at the conference. The story was about a site outage caused by a technician on the customer support team running a tool written by the dev team, which was known to have a very high risk of causing site outages. The technician had been trained in how to use the tool and, according to the storyteller, knew full well the consequences of his actions. So, for the company it was a straightforward decision to fire him. ...Or was it?

Human Error

If you read about human error, you learn that situations within a system and organization can foster human errors. This is often the first explanation of a failure induced by a human. In our investigation here, it's difficult to imagine any resident situation in the system or org which would have led a technician to *knowingly* induce an outage of the application he was paid to support. So, that led me next to consider that perhaps the technician felt running the script would have no connection to an outage. But again, since he had been fully trained in the consequences of the script, this felt unlikely. Perhaps a final explanation might be that the technician felt the outcome of running the tool justified the risk of an outage. Maybe; but from the storyline, it didn't seem like that was a likely decision.

So, were the theories about human error just somehow inapplicable to this situation?

Theory Prerequisites

That's when I realized something about each of the questions. I realized that they were posited on the assumption that the technician was qualified for and satisfied with his job. I realized that if the technician didn't care about his job or hated his boss, then all of my thinking was flawed. All the explanations are based on the belief that the technician wants what is best for the company. If he had no thought for the success of the company, then it would make easy sense for him to take the shortest path to accomplishing an immediate goal despite negative side effects. So, rethinking the first question -- could an organizational, non-technical situation foster human errors? Yes, I believe so. A non-technical situation could certainly foster feelings of dissatisfaction or frustration resulting in a technician who didn't care about the outcome of running a tool that would solve his immediate problem, but potentially cause a site outage.

Excellent! So, the key to success is maintaining an inspired work force, right? Partly, yes; but not exactly.

For the Workforce

If the company is fostering a disgruntled workplace, then the solution is straightforward: keep them happy and inspired. But, what if your employees are happy? How could the tech's actions be explained?

The next idea that occurred to me was perhaps it wasn't the organizational situation at all, but actually the employee himself. What if the employee was generally just a disgruntled or careless person? DevOps and other strong cultures are built upon trust, but that trust needs to be built on the prerequisite that you hire the right people. If poor hiring choices are made, the whole ecosystem breaks down.

A Different Type of Outage

As it turns out, although hiring the right people is such an important task to do well, it's often underdone and rushed. And why? For the exact same reason we end up with technical debt: urgency. If we've learned anything from technical debt, we know that we create it to solve an immediate problem, but that solution leaves us in a much worse long term situation. If we know this is true, then maybe we should treat firing an employee the same way we treat a site outage. Specifically, we should postmortem all the steps that led up to hiring: the phone screen, the recruiter's notes, the interviews, the round-up, the closing. I think that if firing were treated as a much costlier action, then perhaps hiring would be conducted with a much higher bar. Ultimately, in the case of the fired technician, perhaps the fault lies more with the hiring manager for putting the tech in the situation to use the tool at all.

So, to bring it back to the original question: would you ever fire anyone who caused an outage? My answer is a lot more complicated now. I will continue to apply the concepts of human error and would never fire an employee only because of an outage. Would I fire an employee who consistently underperforms, is habitually unhappy and careless, costs more than he or she produces, continues to fail at personal performance improvement plans, and is uninterested in other roles within the company? Very possibly. Whatever the case, if an engineer is fired for any reason, I want to postmortem how we came to the situation, starting with the initial phone screen in the hiring process.

Open Questions

What metrics can drive improvement?

Sunday, May 4, 2014

Debugging Gitosis "Read Access Denied"

I recently set up gitosis to serve some side projects that I'd like to share with a few friends. I've used it in the past professionally and really enjoy the sanity it brings to managing users and permissions.

Things started off pretty well and I was committing and pushing changes in no time. A week or so passed and I wanted to add a new user to a project. I made the necessary changes to my clone of the gitosis-admin project, but when I tried to push my changes upstream, I was suddenly unable to push! This was a major issue since the admin project is the heart of the configuration.

I put on my spelunking hat and ssh-ed to the box and switched to the git user to start debugging. The first thing I did was revert the gitosis.conf file back to its original state. You can find this file in ~git/repositories/gitosis-admin.git/gitosis.conf. Changing it had no effect.

I took a closer look at the error message from my failed push command, and noticed that it was complaining "Read Access Denied," but for a different user name (I could see this because I had loglevel = DEBUG). There are a total of three users involved in the projects, one of which I'd just added and only locally. So, on the server, there were only two users at play. OK, so maybe that user is causing issues. I next removed his key file from the server. This file was at ~git/repositories/gitosis-admin.git/gitosis-export/keydir/.

I tried to push again. No luck.

Hmm, next I looked into ~git/.ssh/authorized_keys. I found there was still a reference to the user there, so I deleted that line.

I tried to push again. It worked!

Ok, so are things working now? I tried to fetch. No dice.

So, when I pushed, gitosis re-applied the configuration and undid all of my debugging steps, essentially reverting the system back to the previous state, including new additions for the new user.

At this point, it dawned on me to check my ssh agent identities. Lo and behold, I had two identities and one of them was for the other user! Oops! This was completely my mistake. I had generated his keys a few weeks ago and tested them to ensure they worked. Apparently I had not been so thoughtful as to delete the identity when done.

After running ssh-add -D, things started working again.
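In hindsight, a quick look at the agent would have saved me most of the spelunking. Here is a minimal sketch of that check; count_identities is a made-up helper name, and the host and key path in the comments are placeholders, not my real setup.

```shell
# count_identities: read `ssh-add -l` output on stdin and print how many keys are loaded.
count_identities() {
    # ssh-add prints one line per key, or a single "The agent has no identities." line.
    # grep -c exits non-zero when the count is 0, hence the `|| true`.
    grep -c -v '^The agent has no identities' || true
}

# Typical use (commented out so the sketch has no side effects):
#   ssh-add -l | count_identities   # more than 1 means ssh may offer the wrong key first
#   ssh-add -D                      # drop every identity from the agent
#   ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa git@git.example.com   # hypothetical host and key path
```

Since gitosis maps the ssh key to a user, forcing a single key with IdentitiesOnly is another way to make sure you are talking to the server as the user you think you are.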

Monday, March 31, 2014

Use Jenkins REST API to Update Job Configurations Automatically

I manage several Jenkins jobs for a single git repository that has several modules (i.e. directories). Each of the modules has the same structure and the job for each of the modules is essentially the same, differentiated only by the name of the directory.

I recently needed to make a change to each of the job configurations and rather than do so by hand, I thought I'd investigate the possibility of doing so automatically using the Jenkins REST API.

I had originally created all the jobs using the API by posting to the generic http://jenkins.example.com/createItem URI. I had since moved the jobs into folders, and hitting createItem only created new jobs rather than updating the existing ones.

I did some searching on the internet and couldn't find any help. I eventually found my answer on the api page for the folder. I'm posting my code below for anyone else with similar questions.

# First, get the http://jenkins.example.com/job/folder-name/job/sample-job--template/configure looking like you want

read -s token
# type token from http://jenkins.example.com/user/$userName/configure

# Download the configuration XML for the template job (which will be our model template)
curl -v -u "bvanevery:$token" http://jenkins.example.com/job/folder-name/job/sample-job--template/config.xml > generic-config.xml

# My modules
declare modules=('module1' 'module2' 'module3')

# POST the updated configuration XML to Jenkins
for m in "${modules[@]}"; do
   echo "module $m"
   sed "s/MODULE/$m/g" generic-config.xml > "$m-config.xml"
   curl -v -X POST --data-binary "@$m-config.xml" -u "bvanevery:$token" \
        -H 'Content-Type: application/xml' \
        "http://jenkins.example.com/job/folder-name/job/$m/config.xml"
done
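Before POSTing anything, it can be reassuring to dry-run the templating step on its own. The sketch below assumes, as in the script above, that the template job uses the literal token MODULE wherever the module name belongs. render_config and the one-line stand-in template are hypothetical, there only so the example runs without a Jenkins server.

```shell
# render_config TEMPLATE MODULE: print TEMPLATE with every MODULE token replaced.
render_config() {
    sed "s/MODULE/$2/g" "$1"
}

# Hypothetical stand-in for generic-config.xml.
template=$(mktemp)
printf '<project><description>Build MODULE</description><path>MODULE/pom.xml</path></project>\n' > "$template"

# Print what would be sent for each module instead of sending it.
for m in module1 module2; do
    render_config "$template" "$m"
done

rm -f "$template"
```

Note that a module name containing a slash or an ampersand would confuse sed and would need escaping first.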


Tuesday, January 7, 2014

Script to Fix Gerrit: LDAP floods log for gerrit-only users

We recently upgraded to Gerrit 2.7 and started to see lots of LDAP related errors in the logs. We tracked it down to this bug report: https://code.google.com/p/gerrit/issues/detail?id=1640.

I wrote a quick script to fix the issue and thought I'd share it.

read -s pwd

echo "SELECT external_id FROM account_external_ids WHERE external_id LIKE 'gerrit:%';" | mysql -h db.example.com -u gerrit -p${pwd} reviewdb | sed 's/^gerrit://' > usernames.txt

for u in $(< usernames.txt); do
   # Delete the gerrit-only external ID when no matching system account exists
   if ! id "$u" > /dev/null 2>&1; then
      echo "DELETE FROM account_external_ids WHERE external_id = 'gerrit:$u' LIMIT 1;" | mysql -h db.example.com -u gerrit -p${pwd} reviewdb
   fi
done
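Since the DELETE is destructive, a dry run that only lists the affected usernames may be worth doing first. missing_users is a hypothetical helper; it applies the same `id` check as the loop above and prints the names without touching the database.

```shell
# missing_users: read usernames on stdin, print those with no matching system account.
missing_users() {
    while IFS= read -r u; do
        # Skip blank lines
        [ -n "$u" ] || continue
        if ! id "$u" > /dev/null 2>&1; then
            printf '%s\n' "$u"
        fi
    done
}

# Typical use:
#   missing_users < usernames.txt
```

Reviewing that list by eye before running the real loop is cheap insurance against deleting an ID you still need.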

Wednesday, December 4, 2013

How To Build Gerrit Replication Plugin

After far too long of a wait, we've finally upgraded to Gerrit Code Review 2.7.

In versions prior to 2.5, the replication feature was packaged with the main war file. No longer is this the case. Now, if you want Gerrit to replicate your changes upstream to other repositories, you'll need to add the replication plugin using the command line tool `plugin add`. Sadly, I could not find the replication jar hosted anywhere and it appears that you need to build it by hand.

I ran into some confusion [1] with this end to end process and didn't find a sufficiently succinct answer online, so I'm writing up my own =) Here are the steps that led me to successfully building the replication jar and installing it.

  1. git clone --recursive https://gerrit.googlesource.com/gerrit
    • You need the --recursive here because the plugins are actually git submodules and won't otherwise be cloned along with your repo.
    • If you've already cloned, you can run `git submodule init; git submodule update`
  2. cd gerrit
  3. git checkout -b stable-2.7 origin/stable-2.7
  4. mvn install -DskipTests=true -Dmaven.javadoc.skip=true
    • It's not necessary to skip the tests or Javadoc generation, but doing so will greatly improve your compile time and decrease the amount of memory maven uses
  5. cd gerrit-plugin-api
  6. mvn package -Dmaven.javadoc.skip=true
    • This creates the jar that will be necessary for the replication plugin to get built
  7. cd ../plugins/replication
  8. mvn package -Dmaven.javadoc.skip=true
  9. At this point, you have compiled and packaged the replication jar! All you need to do now is register it with your Gerrit server. For simplicity, I'll pretend your gerrit server is running at gerrit.example.com.
  10. scp target/replication-2.7.jar gerrit.example.com:/tmp/
  11. ssh -p 29418 gerrit.example.com gerrit plugin install -n replication /tmp/replication-2.7.jar

I hope this helps out anyone who was struggling with the same issues as I was!

PS

Our Gerrit Code Review server runs inside an environment with no outside internet access. When upgrading Gerrit, the service assumes that it has internet access and tries to download any jars that are not packaged into its bundle. In my upgrade situation, it tried to download mysql-connector-java-5.1.21.jar from http://repo2.maven.org/maven2/. It obviously failed.

In order to resolve this issue, I downloaded it to a system that had external access and then scp-ed the jar to $review_site/lib and restarted the gerrit.war init upgrade process.

FOOTNOTES

[1] -- Some errors I saw:
  • Maven out of memory
  • Child module gerrit/plugins/commit-message-length-validator/pom.xml of gerrit/pom.xml does not exist
  • Child module gerrit/plugins/replication/pom.xml of gerrit/pom.xml does not exist
  • Child module gerrit/plugins/reviewnotes/pom.xml of gerrit/pom.xml does not exist
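
A rough sketch of how these errors might be headed off before running mvn. The MAVEN_OPTS sizes are guesses to tune for your machine, not values I measured, and check_plugins is a made-up helper that assumes the "Child module ... does not exist" errors come from submodules that were never checked out.

```shell
# Give Maven more heap and PermGen (sizes are assumptions; adjust as needed).
export MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=256m"

# check_plugins GERRIT_DIR: report each plugin directory that is missing its pom.xml.
check_plugins() {
    for p in commit-message-length-validator replication reviewnotes; do
        [ -f "$1/plugins/$p/pom.xml" ] || echo "missing: plugins/$p (try: git submodule update --init)"
    done
}

# Typical use, from the directory that contains the gerrit clone:
#   check_plugins gerrit
```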


Friday, November 22, 2013

Talkin’ ’bout my Generation, or How I Learned More Than I Ever Wanted About JVM Memory

Profiling a Java application is an experience many developers may never encounter. Identifying the source of a memory leak is probably even more rare. Those kinds of investigations are typically handled by teams dedicated to the subject, or just deferred by throwing more memory at the problem. Until recently, I’ve been one of those lucky developers, blissfully ignorant of what goes on within a JVM. In this article, I seek to share my journey in tracking down a performance issue and what I learned along the way.
The journey begins with an open source scala project built on the play framework running on a 64 bit SL6 box with Java hotspot 1.6. The code base has a small footprint and is not overly complex. Its external resources consist of a mysql database, an internal solr index for searching, and minimal file system interaction. The critical feature of the app is a REST endpoint that handles very large HTTP PUT requests.
We’ve been running the app for some time in production under light load with no complaints. Just recently we started importing a large amount of records into the system via the aforementioned REST endpoint. That’s when we started to observe problems. Periodically the service would crash. The error was consistently “java.lang.OutOfMemoryError: PermGen space”. We observed that given enough time, this error was guaranteed to occur. Critically, it wasn’t simply time: it was after enough requests. I determined that the issue wasn’t going to go away and we needed to face it head on. And thus I embarked on my journey.
The first step was understanding what “PermGen” even means. Research revealed that PermGen stands for “Permanent Generation,” but before that could make sense, I needed to know a little bit more about JVM garbage collection (GC), which is where the term “generation” comes from. Put simply, garbage collection is the process by which old objects get removed from memory (aka “the heap”). Old objects consist of things like class instances that have been created within a function or class scope. When those scopes go away then so should those instances. That makes sense. So in more detail, the actual process of garbage collection consists of sweeping through the heap and determining each object’s classification. The GC has four major classifications for objects named Eden, Survivor 1, Survivor 2, and Old. These classifications are known as Generations. Objects are born into Eden and progressively promoted into the Old Generation at which point they are referred to as tenured [Footnote 1].
OK, so where does the Permanent Generation fit in? The Permanent Generation is a space outside the heap reserved for storing information about the Java classes your app uses. The PermGen has a fixed size that can be set with a JVM option. Class information in the PermGen is managed by ClassLoader instances, the most common of which is the Java system class loader or sun.misc.Launcher$AppClassLoader. Generally, the loaded class information takes up a small, fixed memory footprint and you don’t need to put much thought into it.
Classes get loaded into the PermGen on demand, such that only the classes used by your app will take up space. The twist is when you start using Java Reflection. The Java Reflection package is an extremely powerful meta programming tool that gives you the ability to query class information and even create new classes at runtime. Reflection works in Java by calling through the Java Native Interface to get back the loaded class information from the JVM. By default, if information about the same class is requested more than 15 times [Footnote 2], then a new class will be created to hold that information. This process is called inflation and the new class information is stored in a DelegatingClassLoader.
Application frameworks that make heavy use of reflection will observe several instances of DelegatingClassLoaders in their PermGen. One such framework is the scala play framework.
Now that we have a pretty good idea of the PermGen’s role, it’s important to understand how garbage collection works in the PermGen space. I read in a few places that “classes are forever,” but in fact that is not the case. When garbage collection runs in the PermGen, it is true that it will not collect classes. However, what it will collect are ClassLoader instances that no longer hold references to any active classes. So, if all the classes in a given ClassLoader instance have expired, then that instance and all classes it references will go away. This is pretty much never going to happen with the Java system class loader [Footnote 3], but it could very well happen with a DelegatingClassLoader.
With all this knowledge in hand, I finally felt ready to start investigating what might be going on. The JDK comes with a tool called jmap. jmap is an excellent tool for analyzing what’s happening with the memory in a JVM. The first command I tried out was jmap -permstat [Footnote 4]. Wow! Pretty cool; this shows me every class loader in the PermGen, the number of classes for which it’s responsible, the space it’s occupying, and its health. The first thing to jump out at me was that I observed a good deal of dead DelegatingClassLoader instances.
Weird. So, why aren’t those getting garbage collected if they are dead? After some more researching into JVM options, I discovered the following two options: -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC. In the absence of these two flags, garbage collection will not remove unused class loader instances from the PermGen space [Footnote 5].
Alright! Now we’re getting somewhere. These flags might be just what I’m missing. In order to find out, I set up a test system on which I could simulate the initial problem and then apply the new JVM options, rerun the simulation, and see how it holds up.
For the simulation, I wanted to start up my app with a small PermGen and keep my eye on the usage. The PermGen size is configurable at JVM startup with the option: -XX:MaxPermSize=68M. I chose 68M due to my observation that typically on startup the app immediately used up just under 60MB and quickly grew to near 68MB after a few requests. To keep my eye on the PermGen usage, I used jmap -heap, which shows both the total PermGen allocated as well as the current PermGen usage. Both of these values are important because although the PermGen has a max size, it will not use it all right away. Next up, jmap displays a whole load of useful information, but I was only interested in PermGen, so a little grep filtering brought me to jmap -heap $pid | grep -A 4 'Perm Gen'. Finally, I wanted to watch this as it changes, so throwing watch into the mix produced 
declare pid=$(pidof java); 
watch -n 2 "sudo /usr/java/jdk1.6.0_26/bin/jmap -heap $pid 2>&1 | grep -A 4 'Perm' "
With my watch in place, I started simulating load with a while loop in bash to hit the web endpoint with curl requests. The PermGen usage quickly rose and eventually peaked at 100%, and the app crashed after about 500 total curl requests. I set 500 as my baseline for comparison and restarted the JVM with the GC flags -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC. Re-running my load simulation script, I watched the PermGen rise just like before. This time however, at around 500 requests, the PermGen usage dropped. The load simulation continued and I reached around 2000 requests before the PermGen finally ran out.
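
A load loop along these lines could look like the sketch below. It is an illustration, not the script actually used: count_until_failure is a made-up helper name, and the endpoint, port, and payload file in the comment are placeholders.

```shell
# count_until_failure CMD...: run CMD repeatedly and print how many runs
# succeeded before the first failure.
count_until_failure() {
    n=0
    while "$@" > /dev/null 2>&1; do
        n=$((n + 1))
    done
    echo "$n"
}

# Typical use (hypothetical endpoint and payload; --fail makes curl exit non-zero on HTTP errors):
#   count_until_failure curl --fail --silent -X PUT --data-binary @payload.json http://localhost:9000/records
```

Counting successful requests rather than elapsed time matches the observation that the crash depended on the number of requests. Be aware that the loop never ends if the command never fails.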
My conclusion from the experiment is that the GC flags successfully configured the JVM to clean up the PermGen space and consequently improved performance over time. The new parameters have been deployed into production and we haven’t seen a crash!
My journey isn’t over yet. I’m not 100% satisfied with this solution because I still see dead DelegatingClassLoader instances in my jmap -permstat output and I’m mildly suspicious there may be a memory leak with the framework. I will continue to monitor PermGen usage and see how things progress. As an ultimate fall back option, I can disable inflation.
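
For reference, here is a rough sketch of the options discussed above gathered in one place. JAVA_OPTS as the variable name is an assumption, since how the flags reach the JVM depends on how you launch the app. The commented-out line is a guess at what disabling inflation could look like on HotSpot: raising sun.reflect.inflationThreshold (see Footnote 2) so that reflective calls stay on the native accessor. It is untested here.

```shell
# Flags from the experiment. 68M was picked only to hit the limit quickly on the
# test box; a production value would presumably be larger.
JAVA_OPTS="-XX:MaxPermSize=68M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled"

# Assumption, untested: make the inflation threshold effectively unreachable.
# JAVA_OPTS="$JAVA_OPTS -Dsun.reflect.inflationThreshold=2147483647"

export JAVA_OPTS
echo "$JAVA_OPTS"
```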
…and that’s it! Hopefully this can serve as a helpful starting point for anyone facing similar situations.
Some More Notes
I’m also including some notes on heap dumps and their analysis that I picked up along the way but that weren’t really pertinent to the discussion above.
  • The JVM can dump your heap to a file when it runs out of memory. To do this enable -XX:+HeapDumpOnOutOfMemoryError and set a path for the dump file, -XX:HeapDumpPath=/path/for/dumps/java-.hprof [Footnote 6]. 
  • I used the Eclipse MAT tool to analyze my dump file and had a great experience. It’s worth reading the manual on its basic functions before using it. The default reports available gave me an immense amount of actionable information. If the OOM errors recur, I will likely be turning to MAT to dig to the bottom of the (potential) leak.
  • This article was substantially valuable in wrapping my head around how the JVM interacts with native memory, http://www.ibm.com/developerworks/java/library/j-nativememory-linux/

FOOTNOTES:
Footnote 1: To see details of tenure calculations, enable -XX:+PrintTenuringDistribution
Footnote 2: See a quick description of how to control the process with system properties (sun.reflect.noInflation and inflationThreshold = 15), http://anshuiitk.blogspot.com/2010/11/excessive-full-garbage-collection.html
Footnote 3: This is actually not completely out of the question. Application frameworks that employ the “hot deploy” method of swapping in new jars without stopping the JVM will ideally recycle the Java system class loader. See http://frankkieviet.blogspot.com/2006/10/classloader-leaks-dreaded-permgen-space.html for a good description of how this can get you into undesirable situations. To read about more people experiencing this problem, see http://stackoverflow.com/questions/5066044/java-lang-outofmemoryerror-permgen-space-on-web-app-usage
Footnote 4: The permstat information is expensive to calculate, and I’d suggest redirecting the output to a file so you can manipulate it. http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jmap.html
Footnote 6: JVM Options http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html WARNING You must ensure there is no file with the same name at the dump path or the JVM will refuse to overwrite it. To increase entropy in the dump file name, use the placeholder in the path you pass.

Tuesday, October 29, 2013

There's No Room for Heroes

I've been thinking lately about how to successfully grow & mature an engineering team. What are the changes and shifts in mindset and practice that need to occur, naturally or synthetically, to remain effective, productive, and happy? One idea that's caught my attention is the shift from a culture of all-nighter heroes to a culture of on-time delivery of 100% finished projects.

A culture of all-nighter heroes revolves around the star players. The ones who get all the glory consistently saving the day. Every startup at which I've worked has had at least one person like this. You know you can turn to them whenever there's a problem and be confident that no matter the issue, it will get fixed. These are the guys or gals who're answering questions in the chat room 24*7, know every intricacy of your architecture, and have memorized the root password. These are the people the whole team looks up to and for whom you say a prayer every night hoping that they won't quit.

In a bootstrapped startup, the time for long term thinking is brief. Solutions are slapped together and shoved out the door just in time to evade catastrophe. Work is never quite finished. Instead the Pareto principle of 80% good enough is put to the test. Tech debt is an ever looming, always growing shadow of which no one wants to speak. "We'll fix it later" is the general motto. This is the haven of the hero: an undocumented, uncontrolled environment in which only the native knows their way around. Business becomes increasingly reliant on your heroes' ability to keep the technical ship afloat and battle scars are signs of prestige.

Some might argue that this is unavoidable, and perhaps even necessary. Yet as business becomes more stable and clients become less tolerant of failure, this method of operation becomes unsustainable. Suddenly, 80%-done work isn't good enough, tech debt has become a substantial hindrance, and that missing documentation is causing daily problems. The architecture has reached a point in which no group of heroes can reasonably hold it together.

This is the tipping point.

Simply acknowledging this is critical, but the biggest challenge to survival is the cultural shift. The heroes of this new business environment are the types that recognize long term goals. They realize that good solutions aren't invented overnight, but via healthy design, debate, experimentation, and measurement. They create simple solutions that eliminate tech debt and avoid gotchas. They share their approaches and engage the rest of the team. They automate every possible touch point. Ultimately, the heroes of this new field do everything they can to be replaceable.

So, your challenge is to embrace automation, celebrate diligently crafted and deployed solutions, and put a damper on rewarding superhuman efforts riddled with tech debt. In my opinion, this is a major inflection point in the professional development of a young team of engineers into a senior team. It must be closely shepherded and properly incentivized.