A munin plugin to monitor each CPU core separately

Monitoring each core separately may seem like a waste – after all, we have an overall CPU usage already available under “system” in munin, isn’t that enough?

It turns out that it isn’t. Sometimes, when using top on a multicore or multi-CPU machine, you can see one process pegged at 100% while other processes are comfortably using 150% or 240%. That is a sign of a single-threaded process using as much CPU as it can get. The system graph will not show you such processes, but a CPU-per-core graph will. Of course, if the process is not at 100% CPU all the time, it’s hard to spot even in top. But look at the picture below:

[Image: cpu_per_core daily graph]

That really shows that there is something unusual going on, right?

But CPU-per-core also came in handy in diagnosing a more subtle problem: we were testing our streamers with 10 Gbit Ethernet cards, and we found we couldn’t stream more than 3-4 Gbit/s. Everything looked OK: the system was at about 30% CPU load, there was no IOwait, we had plenty of memory – but the streaming wouldn’t go any faster. The only indication of possible trouble was the interrupts graph – one of the interrupts was pretty high. Once we installed the cpu-per-core plugin, we could see that one core was 100% busy, while the others were doing much less.
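To make the arithmetic concrete, here is a minimal sketch of how one saturated core can hide behind a modest overall figure. The counters are made up for illustration (they are not measurements from the streamers), and busy_percent is a hypothetical helper that, like the plugin, counts everything except idle time as busy.

```perl
use strict;
use warnings;

# Busy percentage of one core between two /proc/stat samples.
# Each sample is an array ref of tick counters in /proc/stat order:
# user, nice, system, idle, iowait, irq, softirq, steal.
sub busy_percent {
    my ($prev, $curr) = @_;
    my ($total, $idle) = (0, 0);
    for my $i (0 .. $#$curr) {
        my $delta = $curr->[$i] - $prev->[$i];
        $total += $delta;
        $idle = $delta if $i == 3;    # the fourth counter is idle time
    }
    return 0 if $total <= 0;          # no ticks elapsed between samples
    return 100 * ($total - $idle) / $total;
}

# Made-up counters for a four-core machine, for illustration only.
my @zero = (0) x 8;
my %prev = map { $_ => [@zero] } 0 .. 3;
my %curr = (
    0 => [ 0, 0, 10,  0, 0, 30, 60, 0 ],   # core 0 saturated by interrupts
    1 => [ 5, 0,  2, 93, 0,  0,  0, 0 ],
    2 => [ 5, 0,  2, 93, 0,  0,  0, 0 ],
    3 => [ 5, 0,  2, 93, 0,  0,  0, 0 ],
);

my $sum = 0;
for my $core (sort { $a <=> $b } keys %curr) {
    my $busy = busy_percent($prev{$core}, $curr{$core});
    $sum += $busy;
    printf "cpu%d busy %.1f%%\n", $core, $busy;
}
printf "average %.2f%%\n", $sum / scalar(keys %curr);
```

With these numbers core 0 is fully busy while the average over the four cores stays near 30%, which is the kind of picture an overall CPU graph cannot distinguish from four evenly loaded cores.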

It turned out that this core was servicing ALL the interrupts for the 10 Gbit card, while the other cores were sitting almost idle. Once we saw that, it took minutes to find out that the card supported multiple interrupts, and I wrote a little tool that spread the interrupts over the cores. Once that was done, the cores no longer poked out of the graph so outrageously, and the whole system streamed at a much better speed (then it was the streamer developer’s turn to optimize stuff, and he did it wonderfully, but that is another story).
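The tool itself is not shown in this post. As a rough idea of what spreading interrupts can look like on Linux, here is a sketch that assigns a list of IRQs to cores round-robin and computes the hex mask for /proc/irq/<irq>/smp_affinity. The IRQ numbers, the core count, and the helper names core_mask and spread_irqs are all assumptions made for the example, and the script only prints what it would write.

```perl
use strict;
use warnings;

# Hex CPU bitmask selecting a single core, in the format accepted by
# /proc/irq/<irq>/smp_affinity. Limited to cores 0..31 so the mask
# stays within a single 32-bit group.
sub core_mask {
    my ($core) = @_;
    die "core $core out of range for this sketch\n"
        if $core < 0 || $core > 31;
    return sprintf "%x", 1 << $core;
}

# Assign IRQs to cores round-robin; returns a list of [irq, core, mask].
sub spread_irqs {
    my ($irqs, $ncores) = @_;
    my @plan;
    my $i = 0;
    for my $irq (@$irqs) {
        my $core = $i++ % $ncores;
        push @plan, [ $irq, $core, core_mask($core) ];
    }
    return @plan;
}

# Hypothetical IRQ numbers for the card's queues; on a real system they
# would be looked up in /proc/interrupts.
my @nic_irqs = (64 .. 71);
for my $entry (spread_irqs(\@nic_irqs, 4)) {
    my ($irq, $core, $mask) = @$entry;
    # Dry run: print the write instead of performing it (it needs root).
    print "/proc/irq/$irq/smp_affinity <- $mask (core $core)\n";
}
```

A real tool would write each mask to the corresponding file as root, and would have to account for anything else on the system that manages IRQ affinity.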

This plugin has one small problem: you can either view the daily and weekly graphs for all cores in the host page, or you can view the graphs for each individual core (but not for all cores together) in the cpu-per-core page, but you can’t get the monthly and yearly summary of all cores. It’s a munin limitation connected with the way it generates the URLs for multigraph plugins: the URL that would normally show the daily, weekly, monthly and yearly graphs is used, instead, to show the graphs for individual cores.

There are two ways to work around that problem: either print the totals twice (once for the display in the main page, and again for the each-core page), or have ALL the graphs in the main page (which would be really messy if you have a dual-CPU, 12-core system). You choose which one you want by manipulating the names you print in the “multigraph” line, but I’ll write about that in another post.

Here is the code for the plugin (you can also download it here):

 
#!/usr/bin/perl -w
# -*- cperl -*-
use JSON;

=head1 NAME

cpu_per_core - plugin to monitor CPU usage for each CPU core

=head1 CONFIGURATION

=head1 NOTES

=head1 AUTHOR

Matija Grabnar

=head1 LICENSE

GPLv2

=head1 MAGIC MARKERS

 #%# family=system
 #%# capabilities=autoconf

=cut

use strict;
use Munin::Plugin;

my $cache = "/tmp/cpu_per_core.json";

my( $cpu,
    $user,
    $nice,
    $system,
    $idle,
    $iowait,
    $irq,
    $softirq,
    $steal,
    $guest,
    $guest_nice);
my @cpu;

sub print_values {
  my ($json,$str);
  if (open(CACHE,"<","$cache")) {
    # Read the counters saved by the previous run (a single line of JSON)
    my $str=<CACHE>;
    close(CACHE);
    eval {
      $json = decode_json($str);
    };
  }
  print "multigraph cpu_per_core\n";
  open(INP,"<","/proc/stat") || die "Can not open /proc/stat: $!\n";
  while (<INP>) {
    next unless /^cpu(\d+)\s+(\d+)(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?\s+/;
    $cpu     = $1;
    $user    = $2;
    $nice    = $3 || 0;
    $system  = $4 || 0;
    $idle    = $5 || 0;
    $iowait  = $6 || 0;
    $irq     = $7 || 0;
    $softirq = $8 || 0;
    $steal   = $9 || 0;
    $guest   = $10 || 0;
    $guest_nice = $11 || 0;
    push(@cpu,{
	       cpu     => $1,
	       user    => $2,
	       nice    => $3 || 0,
	       system  => $4 || 0,
	       idle    => $5 || 0,
	       iowait  => $6 || 0,
	       irq     => $7 || 0,
	       softirq => $8 || 0,
	       steal   => $9 || 0,
	       guest   => $10 || 0,
	       guest_nice => $11 || 0,
	      });
    if (defined($json->[$cpu])) {
      $user =    $cpu[$cpu]->{user}    - $json->[$cpu]->{user};
      $nice =    $cpu[$cpu]->{nice}    - $json->[$cpu]->{nice};
      $system =  $cpu[$cpu]->{system}  - $json->[$cpu]->{system};
      $idle =    $cpu[$cpu]->{idle}    - $json->[$cpu]->{idle};
      $iowait =  $cpu[$cpu]->{iowait}  - $json->[$cpu]->{iowait};
      $irq =     $cpu[$cpu]->{irq}     - $json->[$cpu]->{irq};
      $softirq = $cpu[$cpu]->{softirq} - $json->[$cpu]->{softirq};
      $steal =   $cpu[$cpu]->{steal}   - $json->[$cpu]->{steal};
      $guest =   $cpu[$cpu]->{guest}   - $json->[$cpu]->{guest};
      $guest_nice = $cpu[$cpu]->{guest_nice} - $json->[$cpu]->{guest_nice};
    } else {
      $user = $cpu[$cpu]->{user};
      $nice = $cpu[$cpu]->{nice};
      $system = $cpu[$cpu]->{system};
      $idle = $cpu[$cpu]->{idle};
      $iowait = $cpu[$cpu]->{iowait};
      $irq = $cpu[$cpu]->{irq};
      $softirq = $cpu[$cpu]->{softirq};
      $steal = $cpu[$cpu]->{steal};
      $guest = $cpu[$cpu]->{guest};
      $guest_nice = $cpu[$cpu]->{guest_nice};
    }
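    my $total = $user+$nice+$system+$idle+$iowait+
      $irq+$softirq+$steal+$guest+$guest_nice;
    # Guard against a zero delta (no ticks counted since the previous run)
    my $usage = $total ? int(100-100*($idle/$total)) : 0;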
    print sprintf "cpu%d_usage.value %d\n",$cpu,$usage;
  }

  foreach my $cpu (sort {$a->{cpu}<=>$b->{cpu}} @cpu) {
    if (defined($json->[$cpu->{cpu}])) {
      $user =    $cpu->{user}    - $json->[$cpu->{cpu}]->{user};
      $nice =    $cpu->{nice}    - $json->[$cpu->{cpu}]->{nice};
      $system =  $cpu->{system}  - $json->[$cpu->{cpu}]->{system};
      $idle =    $cpu->{idle}    - $json->[$cpu->{cpu}]->{idle};
      $iowait =  $cpu->{iowait}  - $json->[$cpu->{cpu}]->{iowait};
      $irq =     $cpu->{irq}     - $json->[$cpu->{cpu}]->{irq};
      $softirq = $cpu->{softirq} - $json->[$cpu->{cpu}]->{softirq};
      $steal =   $cpu->{steal}   - $json->[$cpu->{cpu}]->{steal};
      $guest =   $cpu->{guest}   - $json->[$cpu->{cpu}]->{guest};
      $guest_nice = $cpu->{guest_nice} - $json->[$cpu->{cpu}]->{guest_nice};
    } else {
      $user       = $cpu->{user};
      $nice       = $cpu->{nice};
      $system     = $cpu->{system};
      $idle       = $cpu->{idle};
      $iowait     = $cpu->{iowait};
      $irq        = $cpu->{irq};
      $softirq    = $cpu->{softirq};
      $steal      = $cpu->{steal};
      $guest      = $cpu->{guest};
      $guest_nice = $cpu->{guest_nice};
    }
    my $total = $user + $nice + $system + $idle + $iowait + $irq +
      $softirq + $steal + $guest + $guest_nice;

    # Guard against a zero delta (no ticks counted since the previous run)
    my $factor = $total ? 100/$total : 0;

    print sprintf "multigraph cpu_per_core.cpu%d\n",$cpu->{cpu};
    print sprintf "cpu%d_system.value %3.6f\n",$cpu->{cpu},$system * $factor;
    print sprintf "cpu%d_user.value %3.6f\n",$cpu->{cpu},$user * $factor;
    print sprintf "cpu%d_nice.value %3.6f\n",$cpu->{cpu},$nice * $factor;
    print sprintf "cpu%d_idle.value %3.6f\n",$cpu->{cpu},$idle * $factor;
    print sprintf "cpu%d_iowait.value %3.6f\n",$cpu->{cpu},$iowait * $factor;
    print sprintf "cpu%d_irq.value %3.6f\n",$cpu->{cpu},$irq * $factor;
    print sprintf "cpu%d_softirq.value %3.6f\n",$cpu->{cpu},$softirq * $factor;
    print sprintf "cpu%d_steal.value %3.6f\n",$cpu->{cpu},$steal * $factor;
    print sprintf "cpu%d_guest.value %3.6f\n",$cpu->{cpu},$guest * $factor;
    print sprintf "cpu%d_guest_nice.value %3.6f\n",$cpu->{cpu},$guest_nice
       * $factor;
  }

  $str = encode_json(\@cpu);
  open(CACHE,">",$cache) ||
    die "Can not write to cache file $cache : $!\n";
  print CACHE $str;
  close(CACHE);
}

need_multigraph();

$ARGV[0]='' unless defined($ARGV[0]);

if ( $ARGV[0] eq "autoconf" ) {
  if (open(INP,"<","/proc/stat")) {
    print "yes\n";
    exit 0;
  } else {
    print "no\n";
    exit 0;
  }
}

if ( $ARGV[0] eq "config" ) {

  # The headers
  print "multigraph cpu_per_core\n";
  print "graph_info Monitoring CPU per core\n";
  print "graph_title CPU per Core usage\n";
  print "graph_vlabel %\n";
  print "graph_category system\n";
  print "graph_scale no\n";
  print "graph_args --upper-limit 100 --lower-limit 0 --rigid\n";
  print "graph_vlabel %\n";

  open(INP,"<","/proc/stat") || die "Can not open /proc/stat: $!\n";
  while (<INP>) {
    next unless /^cpu(\d+)\s+(\d+)(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?(\s+\d+)?\s+/;
    $cpu     = $1;
    $user    = $2;
    $nice    = $3 || 0;
    $system  = $4 || 0;
    $idle    = $5 || 0;
    $iowait  = $6 || 0;
    $irq     = $7 || 0;
    $softirq = $8 || 0;
    $steal   = $9 || 0;
    $guest   = $10 || 0;
    $guest_nice = $11 || 0;
    push(@cpu,{
	       cpu     => $1,
	       user    => $2,
	       nice    => $3 || 0,
	       system  => $4 || 0,
	       idle    => $5 || 0,
	       iowait  => $6 || 0,
	       irq     => $7 || 0,
	       softirq => $8 || 0,
	       steal   => $9 || 0,
	       guest   => $10 || 0,
	       guest_nice => $11 || 0,
	      });
    print "cpu${cpu}_usage.label CPU core $cpu - % busy\n";
    print "cpu${cpu}_usage.type GAUGE\n";
    print "cpu${cpu}_usage.max 100\n";
    print "cpu${cpu}_usage.warning 0:85\n";
    print "cpu${cpu}_usage.critical 0:90\n";
  }

  foreach my $cpu (sort {$a->{cpu}<=>$b->{cpu}} @cpu) {
    print sprintf "multigraph cpu_per_core.cpu%d\n",$cpu->{cpu};
    print sprintf "graph_info CPU core %d\n",$cpu->{cpu};
    print sprintf "graph_title CPU core %d usage\n",$cpu->{cpu};
    print "graph_scale no\n";
    print "graph_args --upper-limit 100 --lower-limit 0 --rigid\n";
    print "graph_vlabel %\n";
    print "graph_category system\n";

    print sprintf "cpu%d_system.label system\n",$cpu->{cpu};    
    print sprintf "cpu%d_system.draw AREA\n",$cpu->{cpu};
    print sprintf "cpu%d_system.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_system.info CPU time spent in system state\n",$cpu->{cpu};

    print sprintf "cpu%d_user.label user\n",$cpu->{cpu};
    print sprintf "cpu%d_user.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_user.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_user.info CPU time spent in user state\n",$cpu->{cpu};

    print sprintf "cpu%d_nice.label nice\n",$cpu->{cpu};
    print sprintf "cpu%d_nice.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_nice.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_nice.info CPU time spent in nice state\n",$cpu->{cpu};

    print sprintf "cpu%d_idle.label idle\n",$cpu->{cpu};
    print sprintf "cpu%d_idle.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_idle.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_idle.info CPU time spent in idle state\n",$cpu->{cpu};

    print sprintf "cpu%d_iowait.label iowait\n",$cpu->{cpu};
    print sprintf "cpu%d_iowait.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_iowait.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_iowait.info CPU time spent in iowait state\n",$cpu->{cpu};

    print sprintf "cpu%d_irq.label irq\n",$cpu->{cpu};
    print sprintf "cpu%d_irq.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_irq.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_irq.info CPU time spent in irq state\n",$cpu->{cpu};

    print sprintf "cpu%d_softirq.label softirq\n",$cpu->{cpu};
    print sprintf "cpu%d_softirq.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_softirq.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_softirq.info CPU time spent in softirq state\n",$cpu->{cpu};

    print sprintf "cpu%d_steal.label steal\n",$cpu->{cpu};
    print sprintf "cpu%d_steal.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_steal.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_steal.info CPU time spent in steal state\n",$cpu->{cpu};

    print sprintf "cpu%d_guest.label guest\n",$cpu->{cpu};
    print sprintf "cpu%d_guest.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_guest.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_guest.info CPU time spent in guest state\n",$cpu->{cpu};

    print sprintf "cpu%d_guest_nice.label guest_nice\n",$cpu->{cpu};
    print sprintf "cpu%d_guest_nice.draw STACK\n",$cpu->{cpu};
    print sprintf "cpu%d_guest_nice.type GAUGE\n",$cpu->{cpu};
    print sprintf "cpu%d_guest_nice.info CPU time spent in guest_nice state\n",$cpu->{cpu};
  }

  exit 0;
}

print_values();
