Why is my munin slow and how to speed it up
At $work we are monitoring a network of hundreds of servers, which means we end up recording hundreds of thousands of variable values every five minutes.
After a while, the server started slowing down, taking more than 300 seconds to collect the data. Since munin holds a global lock for the whole update run, overrunning the five-minute interval means the next collection is simply skipped, which leads to missing data and ugly holes in the graphs.
The most frustrating thing was that the time to collect the values varied from 270 to 340 seconds without any changes to the configuration, so I started wondering where munin was spending its time.
I wrote a little tool that draws a Gantt-like chart of when in the collection cycle each host was collected. To make things clearer, I colored the successful collections green and the problematic ones red. The tool reads /var/log/munin/munin-update.log as it is being written (using my File::Tail module) and, whenever a collection cycle is completed, it outputs an HTML file with a chart. By using CSS instead of graphics, I was able to make that chart scale with the browser window:
Using this tool for a while (and using other tools in conjunction with it), I learned several things about munin:
- With a few hundred machines, munin spends around 42 seconds at the start of each cycle burning 100% of one CPU core. I suspect that time goes into building, and writing to disk, the big data structure that holds the information about all the variables on all the hosts, and the limits associated with them. On my system that file is more than a gigabyte in size, and note that all the graphing tools have to read, and keep in memory, that huge structure too. That blob of data is a major cause of munin slowdowns as the amount of information you store grows. I understand munin 2.1 (the unstable branch) uses an SQLite database instead (or in addition to it? The datafile is still there). I haven't yet had a chance to compare the speed of 2.1 to 2.0; we will see.
- Failing hosts can cause big delays. A host that returns an error usually causes a much shorter delay than a host that simply doesn't respond. If a host that gets polled toward the end of the data collection period times out, your munin process can easily overrun its time limit.
- The order in which hosts are collected is more or less random (maybe just the order in which keys %hash returns the hostnames?). No attempt is made to optimize the order of collection to minimize the total collection time.
- Munin has a bug that makes it allocate exactly half the number of collection processes you specified in the configuration (the programmers overlooked that flattening %hash into a list yields its keys and values, i.e. twice the number of keys).
- If a munin-async host accumulates more data than it can send between the time it is contacted and the time munin kills the collection process, the problem only gets worse: it will never manage to send its data until you go in, clean out the old spool by hand, and endure the gaps in the graphs. There is no way to collect only part of the data, and no way to feed the data in by hand to clear the backlog. (I have a fix that greatly reduces this problem; I'll write about it some other time.)
- The interim fixes for munin slowing down are:
- Use rrdcached: it greatly speeds up writing to the RRD files.
- The directory /var/lib/munin/cgi-tmp/munin-cgi-graph should definitely be mounted on a RAMdisk.
- Use munin-async: collecting from munin-async hosts is significantly faster than collecting from hosts running just munin-node.
- More cores are better: they won't help with the initial single-threaded phase (the gap between zero and 42 seconds in the graph above), but they will let you fetch from more hosts at the same time, making it more likely you will finish within 300 seconds.
- Pack plenty of RAM into your server. Each collection process takes memory, rrdcached eats memory, and the graphing processes eat enormous amounts of memory (on my host, rrdcached is currently taking 3.8 GB, each munin-cgi-html is taking 2.3 GB, and the munin-cgi-graph processes are taking a mere 900 MB. Each).
- A fast disk is nice, but even an SSD won't make as much difference as you'd hope. I wish I could specify that /var/lib/munin/datafile.storable should go to a RAMdisk, but it lives in munin's main state directory (which can't be mounted elsewhere without taking the whole shebang along) and it gets recreated every 300 seconds, so I don't think symlinks will work.
- Replace generic SNMP modules with dedicated ones that use multigraph and collect as many SNMP variables in one pass as possible (this needs a cost-benefit judgement, and depends on how good you are at module programming…).
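And for the munin-async suggestion, the shape of the configuration is roughly this (hostname, user, and binary path are examples and vary by distribution): munin-asyncd runs on the node and spools values locally, and the master pulls the spool over ssh with --spoolfetch:

```
# On the master, in /etc/munin/munin.conf:
[web01.example.com]
    address ssh://munin-async@web01.example.com /usr/lib/munin/munin-async --spoolfetch
```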
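To make the rrdcached fix above concrete, here is a sketch of a setup; the socket path, journal directory, and timings are examples rather than defaults, and munin 2.0 is pointed at the daemon with the rrdcached_socket directive:

```shell
# Run rrdcached with a journal so queued updates survive a crash
# (paths and timings are examples; adjust to your system):
rrdcached -l unix:/var/run/rrdcached.sock \
          -j /var/lib/rrdcached/journal \
          -b /var/lib/munin \
          -w 1800 -z 1800 -F 3600

# Then tell munin about the socket in /etc/munin/munin.conf:
#   rrdcached_socket /var/run/rrdcached.sock
```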
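The RAMdisk mount for the CGI graph cache is a one-liner; the size here is a guess, pick one that fits your graph volume:

```shell
# One-off tmpfs mount for the graph cache (size is an example):
mount -t tmpfs -o size=512m,mode=0775 tmpfs /var/lib/munin/cgi-tmp/munin-cgi-graph

# Or make it permanent with an /etc/fstab entry:
#   tmpfs /var/lib/munin/cgi-tmp/munin-cgi-graph tmpfs size=512m,mode=0775 0 0
```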
Here is a tar.gz with the source of the Gantt chart tool and the template it uses to generate the HTML/CSS combination that makes it possible to draw the chart without using any graphics. By default it writes to the /var/www/munin/gantt directory, and latest.html is always a link to the latest file it wrote.