
Why is my munin slow and how to speed it up

At $work we are monitoring a network of hundreds of servers, and that means that we end up recording hundreds of thousands of variable values every five minutes.

After a while, the server started slowing down, taking more than 300 seconds to collect the data. Since munin holds a whole-system lock while collecting, the next collection is then simply skipped, which leads to missing data and ugly holes in the graphs.

The most frustrating thing was that the time to collect the values varied from 270 to 340 seconds without any changes to the configuration, so I started wondering where munin was spending its time.

I wrote a little tool that draws a Gantt-like chart of when in the collection cycle each host was collected. To make things clearer, I colored the successful collections green and the problematic ones red. The tool reads /var/log/munin/munin-update.log as it is written (using my File::Tail module), and, whenever a collection cycle is completed, it outputs an HTML file with a chart. Using CSS instead of graphics, I was able to make the chart scale with the browser window:

[Gantt chart: one row per host showing when in the cycle it was collected; successful collections in green, problematic ones in red]
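
For the curious, the heart of the tool is just a File::Tail loop over the log. A minimal sketch follows; the regular expressions are stand-ins, since the exact munin-update.log line format differs between munin versions:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Tail;

    # Follow the log as munin-update appends to it.
    my $log = File::Tail->new(
        name        => '/var/log/munin/munin-update.log',
        maxinterval => 10,        # look for new lines at least every 10 s
    );

    my %start;                    # host => epoch when its collection began
    while (defined(my $line = $log->read)) {
        # Hypothetical patterns -- adjust to your log format.
        if ($line =~ /connecting to ([\w.-]+)/) {
            $start{$1} = time;
        }
        elsif ($line =~ /finished ([\w.-]+)/ and exists $start{$1}) {
            printf "%s: %d s\n", $1, time - $start{$1};
            # ...accumulate rows for the per-cycle HTML chart here...
        }
    }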

Using this tool for a while (and using other tools in conjunction with it), I learned several things about munin:

  • With a few hundred machines, munin spends around 42 seconds burning 100% of one CPU core. I suspect the time is spent building and writing to disk the big data structure that contains the information about all the variables on all the hosts, and the limits associated with them. On my system, that file is more than a gigabyte in size. Note that all the graphing tools have to read, and keep in memory, that huge structure; that blob of data is a major cause of munin slowdowns as the amount of information you store grows (a quick way to peek at it is sketched after this list). I understand munin-2.1 (the unstable branch) uses an SQLite database instead (or in addition to? The datafile is still there). I haven't yet had a chance to compare the speed of 2.1 to 2.0; we will see.
  • Failing hosts can cause big delays. A host that returns an error usually causes a much shorter delay than one that simply doesn't respond. If a host that gets polled toward the end of the data collection period times out, your munin process can easily overrun its time limit.
  • The order in which hosts are collected is more or less random (maybe just the order in which keys %hash returns the hostnames?). No attempt is made to optimize the order of collection so as to minimize total collection time.
  • Munin has a bug which causes it to allocate exactly half the number of collection processes you specified in the configuration (the programmers overlooked that a Perl hash flattened into a list yields both keys and values, i.e. twice the number of keys; a minimal demonstration follows this list).
  • If a munin-async host accumulates more data than it can send between the time it is contacted and the time munin kills the collection process, things only get worse: it will never manage to send its data until you go in, clean out the old data by hand, and endure the gaps in the graphs. There is no way to collect only part of the data, and no way to feed the data in by hand to clear the backlog. (I have a fix that greatly reduces this problem; I'll write about it some other time.)
  • The intermediate fixes for munin slowing down are:
    • Use rrdcached – it greatly speeds up writing to the RRD files (a configuration sketch for this and the RAMdisk follows this list).
    • The directory /var/lib/munin/cgi-tmp/munin-cgi-graph should definitely be mounted on a RAMdisk.
    • Use munin-async: collecting from munin-async hosts is significantly faster than collecting from hosts running just munin-node (a sample master-side configuration follows this list).
    • More cores are better: they won't help with the initial phase (the gap between zero and 42 in the graph above), but they will let you fetch from more hosts at the same time, making it more likely you will finish within 300 seconds.
    • Pack plenty of RAM into your server. Each collection process takes memory, rrdcached eats memory, and the graphing processes eat enormous amounts of memory (on my host, rrdcached is currently taking 3.8 GB, each munin-cgi-html is taking 2.3 GB, and the munin-cgi-graph processes are taking a mere 900 MB each).
    • A fast disk is nice, but even an SSD won't make as much difference as you'd hope. I wish I could put /var/lib/munin/datafile and /var/lib/munin/datafile.storable on a RAMdisk, but they live in the main directory (which can't be mounted elsewhere without taking the whole shebang along), and they get recreated every 300 seconds, so I don't think symlinks will work.
    • Replace generic SNMP plugins with dedicated plugins that use multigraph and collect as many SNMP variables in one go as possible (this needs a cost-benefit judgement and depends on how good you are at plugin programming; a skeleton is sketched below).
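
To get a feel for how big munin's state really is, you can unthaw the Storable dump yourself. A minimal sketch, assuming the standard munin 2.0 path and that the dump unthaws to a hash reference:

    #!/usr/bin/perl
    # Peek at munin's on-disk state; adjust the path if your
    # distribution puts it elsewhere.
    use strict;
    use warnings;
    use Storable qw(retrieve);

    my $state = retrieve('/var/lib/munin/datafile.storable');
    printf "%d top-level entries\n", scalar keys %$state;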
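
As for the half-the-processes bug, here is a minimal demonstration of the underlying Perl pitfall (host names invented for the example):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %hosts = (alpha => 1, beta => 1, gamma => 1);

    # A hash flattened into a list yields key,value pairs...
    my @flat = %hosts;
    print scalar(@flat), "\n";         # 6 -- twice the number of hosts

    # ...while what you almost always want is the number of keys:
    print scalar(keys %hosts), "\n";   # 3

    # Size a worker pool by the flattened count and every cycle
    # runs with half the processes you configured.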
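
A configuration sketch for the rrdcached and RAMdisk suggestions. The socket path, journal directory, and sizes are examples; rrdcached_socket is the munin 2.0 directive for pointing the master at the daemon, so check the documentation for your version:

    # /etc/munin/munin.conf -- hand all rrd writes to rrdcached
    rrdcached_socket /var/run/rrdcached.sock

    # Example rrdcached invocation (paths and timings illustrative):
    #   rrdcached -l unix:/var/run/rrdcached.sock \
    #             -j /var/lib/rrdcached/journal \
    #             -b /var/lib/munin -B -w 1800 -z 1800

    # /etc/fstab -- keep the CGI graph cache on a RAMdisk
    tmpfs /var/lib/munin/cgi-tmp/munin-cgi-graph tmpfs rw,uid=munin,gid=munin,size=1g 0 0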
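
For munin-async, one common master-side pattern is to pull the spool over SSH; the user, host, and munin-async path below are examples, so adapt them to your installation:

    # /etc/munin/munin.conf on the master
    [web01.example.com]
        address ssh://munin@web01.example.com /usr/lib/munin/munin-async --spoolfetch
        use_node_name yes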
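
Finally, a skeleton of the multigraph approach in Perl. The interface walk is a placeholder (in a real plugin it would be a single Net::SNMP get_table shared by all the graphs); the multigraph lines themselves follow the standard plugin protocol:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Munin::Plugin;            # provides need_multigraph()

    need_multigraph();            # bail out on masters without multigraph support

    # Placeholder: one SNMP walk fetching every interface counter at
    # once, instead of one SNMP session per graph.
    sub walk_interfaces { return (eth0 => 123456, eth1 => 654321) }

    my %octets = walk_interfaces();

    if (@ARGV and $ARGV[0] eq 'config') {
        for my $if (sort keys %octets) {
            print "multigraph if_$if\n";
            print "graph_title Interface $if traffic\n";
            print "graph_category network\n";
            print "recv.label received\n";
            print "recv.type DERIVE\n";
            print "recv.min 0\n";
        }
        exit 0;
    }

    for my $if (sort keys %octets) {
        print "multigraph if_$if\n";
        print "recv.value $octets{$if}\n";
    }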

Here is a tar.gz with the source for the Gantt chart tool and the template it uses to generate the HTML/CSS combination that makes it possible to draw the chart without using any graphics. It writes to the /var/www/munin/gantt directory by default, and latest.html always points to the latest file it wrote.


3 Comments

  • Henrik on May 25, 2017

    Hi,
The link seems to be dead; is your Gantt tool available elsewhere?

    • Matija on May 25, 2017

Thank you for pointing it out. It is an old script; I didn't think anybody would still be interested.

      The link should be OK now, and if you are still having trouble let me know and I will send you the .tar.gz by email (it is fairly small).

      • Henrik on May 25, 2017

        Thank you very much!
