Overview
I've spent a good few months working with the Loway team trying to track down a performance problem in QueueMetrics, and it looks like we have finally made a breakthrough. I'm currently testing a "beta" version which looks very promising. I thought I would post some of the history and some of the useful information I've gathered over time. Although I believe the improvements Loway have made account for most of the overall solution, your Java performance settings play a key role as well.
For simplicity I will refer to QM as the application. It is served by Tomcat, which runs on Java. Most of the troubleshooting and setting changes need to happen in Tomcat and Java, but it is the action of running QM that causes Tomcat and Java to become unstable.
The symptoms I typically experienced were some or all of the following:
1) QueueMetrics GUI becomes terribly slow or inaccessible
2) High CPU usage caused by Java
3) Out of memory errors in catalina.out
4) High run time values recorded in catalina.out
5) XMLRPC queries time out
For a number of clients, simply setting up a cron job to restart Tomcat once a day was generally enough to prevent slowdowns (they might still happen once or twice a month). Unfortunately this did not work for the larger sites with 400+ agents, where I'd often have to restart Tomcat multiple times during office hours.
Java Visual VM
So where does one start? The first thing you want to do is get your Java Visual VM monitoring working. This is detailed in the QM Advanced Manual:
http://www.queuemetrics.com/manuals/QM_AdvancedConfig-chunked/ar01s07.html
The 3 things you want to look at on the Monitor page are:
1) CPU
2) (Memory) Heap
3) (Memory) PermGen
Memory Settings - Heap
According to Loway, QM needs 5 to 6 MB of Heap per agent accessing the GUI. On top of that you need to allow overhead for Java itself as well as for your reporting. At one client site I had about 400 agents, so 400 x 6 = 2400 MB. I'm not sure how much to allocate for reports, so I played it safe and rounded up to 4096 MB as they do pull large reports. You then use this value to set your Xms and Xmx values. You can read how to set them in the QM Manual:
http://queuemetrics.com/manuals/QM_UserManual-chunked/ar01s02.html#_understanding_queuemetrics_memory_requirements
(I think this section of the manual may need a revisit in terms of memory allocation advice.) Loway suggested that I set the Xms and Xmx values the same, so I used: -Xms4096M -Xmx4096M. You also want to make sure to add -server, as this changes the compiler Java uses. Read more here:
http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client
Note: Be sure that your memory settings are within the limits of your physical RAM (bearing in mind that your OS and other applications like MySQL also need resources). I have 12GB of RAM in my 400 agent server, of which 8GB is in use (mostly Tomcat and MySQL).
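A minimal sketch of the sizing arithmetic above, written as shell helpers. The function names are hypothetical, and the separate allowance figure for reports and JVM overhead is an assumption: the only numbers taken from the discussion are 6 MB per agent, 400 agents and the 4096 MB that was finally chosen.

```shell
#!/bin/sh
# Hypothetical helpers for the heap sizing arithmetic; names are illustrative.

# heap_estimate_mb AGENTS MB_PER_AGENT ALLOWANCE_MB
# Prints agents x MB-per-agent plus an allowance for reports and JVM overhead.
heap_estimate_mb() {
    echo $(( $1 * $2 + $3 ))
}

# fits_in_ram HEAP_MB TOTAL_RAM_MB RESERVED_MB
# Succeeds if the heap fits once RAM reserved for the OS, MySQL etc. is set aside.
fits_in_ram() {
    [ "$1" -le $(( $2 - $3 )) ]
}

# Per-agent baseline from the example above: 400 agents at 6 MB each.
heap_estimate_mb 400 6 0
```

The last line prints the 2400 MB baseline. A check such as fits_in_ram 4096 12288 4096 would confirm that a 4096 MB Heap fits on a 12GB server if you chose to reserve 4096 MB for everything else; that reserve is an illustrative figure, not a recommendation from Loway.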
Memory Settings - PermGen
The next thing to look at is PermGen. Often the OutOfMemory events are in fact not from Heap, but from PermGen. You might see this in the catalina.out log:
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: PermGen space.
I hadn't realised that, just like the Heap, you can also set the PermGen size. By default this seems to be about 80 MB. I experimented with 256 MB and eventually settled on 512 MB. So add these settings to your config for Tomcat: -XX:PermSize=512M -XX:MaxPermSize=512M. This change made a significant difference to the stability of QM. You can read more about PermGen here:
https://blogs.oracle.com/jonthecollector/entry/presenting_the_permanent_generation
Garbage Collection
Next up is Garbage Collection. There is a lot of information about Garbage Collection and much of it differs between Java versions, so make sure what you read matches your Java version. The default collector in Java 6 is selected based on your hardware and OS, but you can force a particular collector by adjusting your Tomcat settings. For single CPU setups use the serial collector: -XX:+UseSerialGC. For multi CPU servers use the parallel (aka throughput) collector: -XX:+UseParallelGC. Before I discovered my PermGen size problem I also tried the concurrent collector: -XX:+UseConcMarkSweepGC. This seems to perform better where PermGen size is limited. Once I increased my PermGen size I went back to UseParallelGC, as Loway recommended this. My server has 2 x quad core CPUs with HT, so it makes sense to use it.
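The serial versus parallel rule of thumb above can be sketched as a tiny shell helper. The function name is hypothetical, and reading the CPU count with getconf _NPROCESSORS_ONLN is an assumption that holds on typical Linux systems; the sketch falls back to 1 if it is unavailable.

```shell
#!/bin/sh
# Hypothetical helper: choose a collector flag from the number of CPUs.

# pick_gc_flag NCPU
pick_gc_flag() {
    if [ "$1" -le 1 ]; then
        echo "-XX:+UseSerialGC"      # single CPU: serial collector
    else
        echo "-XX:+UseParallelGC"    # multiple CPUs: parallel (throughput) collector
    fi
}

# Count online CPUs (assumed to be available via getconf), defaulting to 1.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
pick_gc_flag "$ncpu"
```

This only covers the two default choices; it deliberately leaves out -XX:+UseConcMarkSweepGC, which was a temporary workaround rather than the recommended setting.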
While we are talking about GC, let's also look at some additional logging you can turn on for it. You can add the following to your Tomcat settings: -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails. This adds GC logging to your catalina.out file. Often when QM was in a hung state I would only see GC log events in catalina.out, and this generally coincided with Heap being maxed out. Later, when I paid more attention to PermGen and CPU, I saw the same effect when they were maxed out. You can also add settings to alert you when Java runs out of memory. Add the following to your Tomcat settings: -XX:OnError=/bin/javaerrormailgen.sh -XX:OnOutOfMemoryError=/bin/javaerrormailmem.sh. The scripts can contain anything you like (you could for instance trigger a restart of Tomcat). In my case I just used them to send me email, e.g.
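javaerrormailgen.sh
#!/bin/sh
# Mail the current date and time when the JVM hits a fatal error (-XX:OnError)
date | mail -s "SITENAME Java Error: General" me@domain.com
javaerrormailmem.sh
#!/bin/sh
# Mail the current date and time when the JVM runs out of memory (-XX:OnOutOfMemoryError)
date | mail -s "SITENAME Java Error: OutOfMemory" me@domain.com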
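Once the extra GC logging is on, a quick way to see how busy the collector has been is to count the relevant lines in catalina.out. This is a rough sketch with hypothetical function names; it assumes full collections are logged with the text "Full GC", which is how HotSpot's GC logging normally labels them, and the log path is whatever your Tomcat install uses.

```shell
#!/bin/sh
# Hypothetical helpers for a quick look at catalina.out; names are illustrative.

# count_full_gc LOGFILE: number of lines mentioning a full collection.
count_full_gc() {
    grep -c 'Full GC' "$1"
}

# count_oom LOGFILE: number of lines mentioning an OutOfMemoryError (Heap or PermGen).
count_oom() {
    grep -c 'java.lang.OutOfMemoryError' "$1"
}
```

For example, count_full_gc /path/to/catalina.out prints the number of full collections logged so far. A count that climbs quickly while QM is slow is consistent with the "only GC events in the log" behaviour described above.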
You can read more about GC here:
http://java.sun.com/performance/reference/whitepapers/6_performance.html and here
http://techfeast-hiranya.blogspot.com/2010/11/taming-java-garbage-collector.html
Troubleshooting
Once you have these things in place you can start monitoring JavaVVM and the Tomcat logs and capture details for feedback to Loway. Capturing jstack and jmap output is detailed in the same document as the JavaVVM setup, but I will list some changes to these commands which I found worked better.
jstack -F -l 21472
-F Forces the thread dump. I often found that in a hung state I was unable to get a thread dump without this.
-l Prints a long listing with additional information about locks
21472 Is the Java (Tomcat) PID
jmap -F -dump:live,format=b,file=heap.bin 21472
-F Forces the heap dump. Note that the jmap documentation states that the live suboption is not supported in this mode.
-dump Dumps the heap in binary format into a file called heap.bin. Make sure you have disk space available as this file can get very large. It compresses reasonably well with bz2 if you need to upload it somewhere for Loway.
21472 Is the Java (Tomcat) PID
Note: I have found that both these commands pause Tomcat while the information is extracted, so running them on a working system will cause it to stop while they execute. Obviously if the system is already hung, it doesn't matter.
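When the system is hung you usually want both dumps quickly, so the two commands above can be wrapped in a small script. This is a sketch only: the function names, the timestamped file naming and the output directory argument are all hypothetical, and the jstack and jmap options are simply the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: capture a thread dump and a heap dump for one JVM.

# dump_name PREFIX PID EXT: builds a timestamped file name.
dump_name() {
    echo "$1-$2-$(date +%Y%m%d-%H%M%S).$3"
}

# capture_dumps PID OUTDIR
# Both tools pause the JVM while they run, so only use this on a hung system.
capture_dumps() {
    pid=$1
    outdir=$2
    # Forced thread dump with the long listing.
    jstack -F -l "$pid" > "$outdir/$(dump_name jstack "$pid" txt)"
    # Forced binary heap dump, then compress it for upload.
    heap="$outdir/$(dump_name heap "$pid" bin)"
    jmap -F -dump:live,format=b,file="$heap" "$pid"
    bzip2 "$heap"
}
```

With the PID from the examples above this would be called as capture_dumps 21472 /tmp, where /tmp is just an illustrative location with enough free disk space. On a standard Tomcat install the PID can usually be found with pgrep -f org.apache.catalina.startup.Bootstrap, assuming Tomcat is started through its usual Bootstrap class.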
Once I had a larger PermGen set I did see an improvement, in the sense that QM would no longer simply hang, but it would still slow down. This was evident in the JVVM, where you could see that as PermGen usage climbed, so did the CPU. In the past, when PermGen was maxed out, it would eventually cause QM to become completely unresponsive. Once you have more headroom in PermGen it can actually recover. So better, but not quite fixed.
Throughout this process I tested a number of different combinations of settings and QM versions from Loway, each time sending them back jstack and jmap dumps so they could locate what was slowing things down and make improvements to their code. I'll leave the detailed fix(es) up to Loway to explain (it's Greek to me), but essentially it came down to the handling of unique strings, which was slowing down when using the intern() function, so they replaced it with ChmInterner (hope I got that right).
Final Settings
For a quick copy and paste, here are my final settings for a 400+ agent server with 2 x quad core CPUs and 12GB RAM running Tomcat, MySQL & Apache.
Essentials:
-Xms4096M -Xmx4096M -server -XX:+UseParallelGC -XX:PermSize=512M -XX:MaxPermSize=512M
With extra logging, JVVM and Java alerts:
-Xms4096M -Xmx4096M -server -Dcom.sun.management.jmxremote.port=9003 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParallelGC -XX:PermSize=512M -XX:MaxPermSize=512M -XX:OnError=/bin/javaerrormailgen.sh -XX:OnOutOfMemoryError=/bin/javaerrormailmem.sh
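If you prefer something easier to read and edit than one long line, the "essentials" can be built up one group at a time. This is a sketch, not the QM-documented method: it assumes your Tomcat start-up script picks up the JAVA_OPTS environment variable (Tomcat's own catalina.sh does), so check the QM manual for where your install actually sets its Java options.

```shell
#!/bin/sh
# Sketch only: builds the "essentials" flags in JAVA_OPTS, one group per line.
# Where this variable is read depends on how Tomcat is started on your system.
JAVA_OPTS="-server"
JAVA_OPTS="$JAVA_OPTS -Xms4096M -Xmx4096M"                      # fixed-size Heap
JAVA_OPTS="$JAVA_OPTS -XX:PermSize=512M -XX:MaxPermSize=512M"    # fixed-size PermGen
JAVA_OPTS="$JAVA_OPTS -XX:+UseParallelGC"                        # parallel collector
export JAVA_OPTS

# Show the resulting option string for a quick sanity check.
echo "$JAVA_OPTS"
```

The logging, JMX and alert flags from the longer line can be appended in the same way, one group per line.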