CoTop

A Monitoring Infrastructure for PlanetLab
Part of the CoDeeN project

What Is It?


CoMon provides monitoring statistics for PlanetLab at both the node level and the slice level. It can be used to see what is affecting the performance of nodes, and to examine the resource profiles of individual experiments.

Status


The status page provides several views of PlanetLab, including node-centric, slice-centric, and others. To see more views, click on any of the links shown in the "Summaries:" line at the top of each page. Also available are pages showing the nodes with problems, and the slices with problems, which can be useful for general problem monitoring.

How Do You Use CoMon?


How you use CoMon depends on what you need. If you suspect that your experiment is acting strangely on some nodes, you may want to use the node-centric view to see if any of its statistics seem out of line with respect to other nodes.

If you're running a long-running test, you can use the slice-centric views to see how your slice is consuming resources, in aggregate or on each node. For example, the "Slice Max" page shows the maximum amount of memory your slice is using on any node, and you can click on that value to see a two-day history. This technique is useful for spotting problems such as memory leaks.

General Tips


Practically everything in CoMon is sortable by clicking on the column headings. On the node-centric view, most cells have two values, and so each column has two separate headings, which are both clickable.

Almost every value is also graphable. Clicking on any value loads a two-day history of that value. In the case of the per-slice/per-node pages, the values tend to lag a little. This is normal, and happens because we simply generate more graphs than the disk can handle. None of the data gets lost, but the graphs do lag a little.

Dynamic Queries


CoMon also supports selecting rows based on user-provided criteria. You add a C-style expression to a CoMon query, and only the rows that satisfy the expression are returned. The current set of supported operators covers the common comparison operators (>, <, >=, <=, ==, and !=), parentheses, and logical and/or (&&, ||). Column names are derived from the column headings by keeping just the alphanumeric characters, with no spaces, and constants are also recognized.
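
Since the operator list has no unary "not", a condition can be inverted by comparing it with 0, which is the idiom the sample queries use. The Python sketch below builds select expressions as plain strings. The helper names (all_of, any_of, negate) are hypothetical and not part of CoMon; the column names and constants are the ones that appear in the sample queries.

```python
def all_of(*exprs):
    """Join sub-expressions with logical AND, parenthesizing each one."""
    return " && ".join(f"({e})" for e in exprs)


def any_of(*exprs):
    """Join sub-expressions with logical OR, parenthesizing each one."""
    return " || ".join(f"({e})" for e in exprs)


def negate(expr):
    """Invert an expression by comparing its C-style truth value with 0."""
    return f"(({expr}) == 0)"


# The "problem" conditions used in the sample queries.
problems = any_of(
    "drift > 1m",
    all_of("dns1udp > 80", "dns2udp > 80"),
    "gbfree < 5",
    "sshstatus > 2h",
)

# Live nodes that match none of the problem conditions.
healthy = all_of("resptime > 0", negate(problems))
print(healthy)
```

The helpers parenthesize more than strictly necessary, which keeps operator precedence out of the picture when expressions are nested.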

To make it easier to use CoMon with various forms of scripting, the data output format can be specified. By default, CoMon's output is HTML tables in a human-friendly format. The script-friendly formats are nameonly for just the row names, formatcsv for comma-separated values, and formatspaces for space-separated values. You can also specify a limit on the number of rows to display, and you can sort by the column of your choice. Example queries using all of these features are shown below.

Sample Queries


  • Alive - node is responding to CoMon
    Uses select='resptime > 0'
  • Popular nodes - more than 50 in-memory slices, of which at least 10 are actively running
    Uses select='liveslices >= 10 && numslices > 50'
  • Lightly loaded nodes - checks for nodes that have a response time (so aren't dead), that have a 1-minute load less than 5, and have 5 or fewer slices
    Uses select='resptime > 0 && 1minload < 5 && liveslices <= 5'
  • Drifting nodes - clock skew greater than 1 minute
    Uses select='drift > 1m'
  • Overloaded Primary DNS - the secondary DNS seems to be better than the primary DNS
    Uses select='dns1udp >1 && dns2udp >= 0 && dns1udp > dns2udp'
  • Severe DNS problems - both primary and secondary DNS servers look dead
    Uses select='dns1udp > 80 && dns2udp > 80'
  • Low disk - disk space less than 5GB
    Uses select='resptime > 0 && gbfree < 5'
  • SSH Failing - we have not been able to ssh in more than 2 hours
    Uses select='sshstatus > 2h'
  • All problems - merges lists of high skew, DNS problems, disk space < 5GB, and ssh problems
    Uses select='drift > 1m || (dns1udp > 80 && dns2udp > 80) || (resptime > 0 && gbfree < 5) || sshstatus > 2h'
  • Alive, no problems - intersects the live list with the inverse of the problem list
    Uses select='resptime > 0 && ((drift > 1m || (dns1udp > 80 && dns2udp > 80) || gbfree < 5 || sshstatus > 2h) == 0)'
  • Light load, no problems - intersects the lightly-loaded list with the inverse of the problem list
    Uses select='(resptime > 0 && 1minload < 5 && liveslices <= 5) && ((drift > 1m || (dns1udp > 80 && dns2udp > 80) || gbfree < 5 || sshstatus > 2h) == 0)'
  • Sorting on a column - we can sort the results by specifying the column name. In this example, we sort on the 5-minute load. We could reverse the sort order by specifying the negative of the column name.
    Adds sort=5minload or sort=-5minload to the URL
  • Pruning nodes per site - we can drop to showing only the best node per site, as they appear in the sort order. In this example, we show only one node per site, but this can be specified.
    Adds persite=1 to the URL
  • Pruning the list - we specify a limit to select fewer rows
    Adds limit=10 to the URL
  • Extracting a few columns - to ease analysis, we can make the data easier to export by asking for it in CSV format instead of HTML, and only requesting the particular fields we want instead of the entire row
    Adds format=formatcsv and dumpcols='name,location,uptime' to the URL
  • Just the names - to ease scripting, we can ask for just the node names, instead of the full table
    Adds format=nameonly to the URL
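
For scripting, the parameters above can be assembled into a query URL programmatically. The following Python sketch is one way to do it, under some loud assumptions: the base URL (http://comon.example/query) is a placeholder, since the query page's address is not given here, and the function name build_query_url is hypothetical. The single quotes around the select and dumpcols values are kept because the samples show them; this sketch cannot confirm whether the server requires them.

```python
from urllib.parse import quote, urlencode


def build_query_url(base_url, select=None, sort=None, persite=None,
                    limit=None, fmt=None, dumpcols=None):
    """Append CoMon query parameters to base_url (a placeholder address)."""
    params = []
    if select is not None:
        # Quoted as in the sample queries; assumption that this is wanted.
        params.append(("select", f"'{select}'"))
    if sort is not None:
        params.append(("sort", sort))        # e.g. "5minload" or "-5minload"
    if persite is not None:
        params.append(("persite", persite))  # nodes to keep per site
    if limit is not None:
        params.append(("limit", limit))      # maximum number of rows
    if fmt is not None:
        params.append(("format", fmt))       # nameonly, formatcsv, formatspaces
    if dumpcols is not None:
        params.append(("dumpcols", "'" + ",".join(dumpcols) + "'"))
    separator = "&" if "?" in base_url else "?"
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base_url + separator + urlencode(params, quote_via=quote)


url = build_query_url(
    "http://comon.example/query",  # hypothetical base URL
    select="resptime > 0",
    sort="-5minload",
    limit=10,
    fmt="formatcsv",
    dumpcols=["name", "location", "uptime"],
)
print(url)
```

Percent-encoding the expression avoids trouble with characters such as &, >, and spaces, which would otherwise be misread as part of the URL syntax.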

Data Access


The best way of accessing the data from CoMon is to use the query interface above. For the most part, we try to keep the column names and representations stable. In the event that you like pain, two daemons run on each node and make its data available in different ways. The node-centric daemon listens on port 3121, and returns its data as soon as a connection is made - no request needs to be sent. The slice-centric daemon listens on port 3120 and expects an HTTP request for "/cotop". For example, you could generate a request for http://planetlab-1.cs.princeton.edu:3120/cotop
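
If you do want to talk to the daemons directly, the two access styles look roughly like this in Python. This is a minimal sketch, not a reference client: the function names are hypothetical, and it assumes that the node-centric daemon closes the connection once it has sent its data and that the output is text that can be decoded as UTF-8.

```python
import socket
from urllib.request import urlopen

NODE_PORT = 3121   # node-centric daemon
SLICE_PORT = 3120  # slice-centric daemon


def read_node_data(host, port=NODE_PORT, timeout=10.0):
    """Connect and read until the peer closes; no request is sent."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the daemon closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def slice_data_url(host, port=SLICE_PORT):
    """URL of the slice-centric daemon's /cotop resource on a node."""
    return f"http://{host}:{port}/cotop"


def read_slice_data(host, port=SLICE_PORT, timeout=10.0):
    """Fetch the slice-centric data with an ordinary HTTP GET."""
    with urlopen(slice_data_url(host, port), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, read_slice_data("planetlab-1.cs.princeton.edu") would request the URL shown above. The timeouts matter in practice, since an unresponsive node would otherwise stall a script that polls many nodes.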

The data from these sensors is what is actually used to generate the data seen in CoMon, and these values are archived into the "monall" (node-centric) and "topall" (slice-centric) dumps. The "monall" dump begins with a line indicating the start time, both in seconds since the epoch and as a human-readable value. Then, each node has a set of name-value lines, with the node's name and IP address included in the lines. Nodes are separated by blank lines. The "topall" dump is more web-like, but each node's data is delimited by a line specifying the start time via the word "Start". Following that is the node name, and then the dump from the daemon.
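
The "monall" layout described above (one start-time line, then blank-line-separated blocks of name-value lines) can be split apart with a few lines of Python. Treat this as a sketch under assumptions: the exact name-value syntax is not specified here, so the code assumes each line has the form "name: value", and the function name parse_monall and the sample text are hypothetical, not real CoMon output.

```python
def parse_monall(text, separator=":"):
    """Split a monall-style dump into (header_line, list of per-node dicts).

    Assumes each name-value line looks like "name<separator> value".
    If a name repeats within one node, the last value wins.
    """
    lines = text.splitlines()
    header = lines[0] if lines else ""
    nodes = []
    current = {}
    for line in lines[1:]:
        if not line.strip():
            # A blank line ends the current node's block.
            if current:
                nodes.append(current)
                current = {}
            continue
        name, _, value = line.partition(separator)
        current[name.strip()] = value.strip()
    if current:
        nodes.append(current)
    return header, nodes


# Hypothetical sample in the assumed layout, for illustration only.
sample = (
    "1200000000 (human-readable start time)\n"
    "Name: node-a.example\n"
    "Address: 10.0.0.1\n"
    "\n"
    "Name: node-b.example\n"
    "Address: 10.0.0.2\n"
)
header, nodes = parse_monall(sample)
print(header)
print([node["Name"] for node in nodes])
```

The header line is returned unparsed, because its exact format is not spelled out above; if the real separator differs, pass it via the separator argument.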

The archived files contain the results of the checks every 5 minutes. They are available as http://comon.cs.princeton.edu/status/dump_cotop_YYYYMMDD.bz2 for the slice-centric data and http://comon.cs.princeton.edu/status/dump_comon_YYYYMMDD.bz2 for the node-centric data. Obviously, replace YYYYMMDD with the year, month, and day of the data being requested. The current day's file does not have the bz2 extension, and is not compressed. Older data may be moved to another directory or offline -- if you need it, please contact us.
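
The naming rule for the archived files is easy to get wrong in scripts, so here is a small Python sketch of it. The function names are hypothetical. One assumption to be aware of: "the current day" is decided here by the caller's local date, and which time zone the server uses for the rollover is not stated above.

```python
import bz2
from datetime import date

BASE_URL = "http://comon.cs.princeton.edu/status/"
PREFIXES = {"slice": "dump_cotop_", "node": "dump_comon_"}


def dump_url(kind, day, today=None):
    """URL of the archived dump for `day`; kind is "slice" or "node"."""
    today = today or date.today()  # assumption: local date marks "today"
    name = PREFIXES[kind] + day.strftime("%Y%m%d")
    if day != today:
        name += ".bz2"  # only files from earlier days are compressed
    return BASE_URL + name


def decode_dump(raw, compressed):
    """Turn downloaded bytes into text, decompressing bz2 data if needed."""
    data = bz2.decompress(raw) if compressed else raw
    return data.decode("utf-8", errors="replace")


print(dump_url("slice", date(2008, 1, 10), today=date(2008, 1, 11)))
print(dump_url("node", date(2008, 1, 11), today=date(2008, 1, 11)))
```

Passing today explicitly makes the behavior reproducible in tests and lets a script pick the date convention it wants.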

People


KyoungSoo Park and Vivek Pai, with input from lots of others. We may collectively be contacted at princeton_comon at slices.planet-lab.org