There was a time when I did some JVM tuning. It’s time to reactivate the part of the brain responsible for this activity.
The key principle is to prepare a baseline. What is it?
The baseline refers to:
- the first measuring mechanisms (np. gc.log) – we should measure on different levels (cpu, I/O, database etc) if in doubt follow thus rule: “more is better”
- a load test script that makes JIRA work (e.g. Apache JMeter script),
- a test environment – should be identical as production (different environment create different baseline), and,
- a desired goal (we tend to tune solutions not having any goal). Never tune just for tuning.
The next required value is time… a lot of time. The reason is simple. We make only one change at a time and when it is completed, we launch load scripts and use metrics to see if we’re going in the right direction. The whole operation is based on making several steps forward followed by a few steps back.
Our recent goal was to make JIRA stable, despite working slow (no restart during the day).
We launched the analysis of gc.loc. Due to short pauses, we used the Concurrent Mark Sweep collector. Unfortunately, there are two sides of a coin. As CMS does not compact the memory, you may not be able to place any object in free spaces despite having free memory.
This usually results in the death of JVM – when looking for some free space, you do some cleaning, which takes pretty long time. This causes a “slashdot effect”. As a result you need more and more time, because of which the main GC works and the remaining part is stuck -> reset!
OK, we update two parameters causing CMS to start working earlier – usually, it is launched when the old generation is filled up to 68% (exact value vary depending on JVM version):
Because of that, the whole commotion happens on the bottom of the stack. On the top, there is a place for larger objects. Obviously, you may (and will) experience a problem with stack fragmentation.
We launch load tests. The situation was better, but….
Unfortunately, we killed JIRA. The next decision is to limit the work of GC by reducing the pipe. We lower the maxThreads parameter on tomcat. We launch tests on one of the machines. Nothing happen. We add the second machine and start acting more aggressively. Result: although slow, Jira responds in a stable manner. This is what we were hoping for!
I may still try two other options:
They were added in java 6u21 and 6u20. It is an interesting option, because it optimizes String-type objects, so that – whenever possible – it used 1 byte tables instead of 2 byte char tables. Most of the texts can be squeezed in a single byte. We check the memory usage histogram (previously 98% – char). This time a byte takes precedence over a char.
In the meantime, the jConsole operator screams with surprise: “Pedro, what have you done?!”. This draws our attention (we think we’ve f… up something). Fortunately, all this hassle was about a 50% drop in memory usage. Uff, this is yet another confirmation that our switch is working.
Time to finish. We note down our ideas for later. Migration is in just two days. Quick help of our colleague gave us a new machine. We’ve finally got a place to use the G1 collector.
@pedrowaty Good to know about new string switches, thanks!
The jConsole operator screams with surprise: “Pedro, what have you done?!”