RightScale has long used a monitoring system built from a combination of open source and proprietary components. The system consists of three major parts: the monitoring agent, which runs on customer servers and sends data to the RightScale monitoring system; the back-end storage system, which RightScale operates to store the monitoring data; and the UI, which queries and visualizes the monitoring data and generates alerts. The legacy monitoring system uses collectd as the agent, RRDtool plus custom software for storage, and a custom UI generated by Rails with graphs drawn by RRDtool.

The new monitoring system replaces the back-end storage component (rrdcached) with TSS (Time Series Storage) and keeps the monitoring agent and the UI unchanged. The following table compares the configurations of the legacy and TSS-based monitoring systems.

Component               | Legacy Monitoring System (sketchy) | New Monitoring System (TSS)
Monitoring agent        | collectd v4                        | collectd v4 & v5
Back-end storage system | RRDtool (rrdcached)                | TSS
User interface          | RRDtool (rrd graph) + Rails        | <unchanged>

How Does the Monitoring System Work?

The overall process of RightScale's monitoring system is as follows:

  • The collectd system statistics daemon is installed on an instance at boot time using the SYS Monitoring install RightScript.
  • The website passes the hostname of the monitoring server to the instance using the EC2 launch data (for example, RS_TSS=tss3-1.rightscale.com).
  • collectd auto-detects data sources (disks, processes, etc.), starts collecting data, and sends the data every 20 seconds via UDP to the specified monitoring server.
  • The monitoring server stores the data in an RRD database.
  • When you view a Monitoring tab in the Dashboard for an individual server or deployment, the request is proxied through to the monitoring server.
  • The monitoring server produces graphs using the data in the RRD database.
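The hand-off between the launch data and the agent in the steps above can be sketched in Python. This is only an illustration, not the SYS Monitoring install RightScript itself: the helper names are hypothetical, and the launch data is assumed to be newline-separated KEY=VALUE pairs (the source shows only the single RS_TSS example). The generated text uses standard collectd.conf directives (Interval, LoadPlugin network, and a Server entry on collectd's default UDP port 25826).

```python
COLLECTD_DEFAULT_PORT = 25826  # default UDP port of collectd's network plugin


def parse_launch_data(launch_data: str) -> dict:
    """Parse KEY=VALUE pairs, one per line (assumed launch-data format)."""
    settings = {}
    for line in launch_data.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def render_collectd_network_config(server: str,
                                   interval_seconds: int = 20,
                                   port: int = COLLECTD_DEFAULT_PORT) -> str:
    """Render a collectd.conf fragment that sends data to `server` via UDP."""
    return (
        f"Interval {interval_seconds}\n"
        "LoadPlugin network\n"
        "<Plugin network>\n"
        f'  Server "{server}" "{port}"\n'
        "</Plugin>\n"
    )


# Example launch data taken from the text above.
launch_data = "RS_TSS=tss3-1.rightscale.com"
settings = parse_launch_data(launch_data)
config = render_collectd_network_config(settings["RS_TSS"])
print(config)
```

The 20-second interval matches the send frequency described above; the actual configuration installed on an instance may contain additional plugins and options.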

Can Websites that Scale Way Up View Monitoring Information Too?

Customers whose applications scale into the tens, hundreds, or even thousands of instances via a scalable server array may wonder whether they can still view monitoring data in the RightScale Dashboard and whether they can view aggregate data from all of the servers. The answer to both is yes. However, we had to implement a policy to maintain performance while users view their data.

Remember that each server instance registers with the load balancers (e.g. HAProxy) as the site scales so that the workload can be evenly balanced between application servers in the array. For example, you could have 1000 instances registered with two front-end servers running load balancing software. If each of those 1000 instances had its own individual set of graphs and also contributed to an aggregate graph, the amount of data would be overwhelming and would take too long to retrieve and graph. For this reason, servers are classified as active or inactive with respect to monitoring. A server is considered inactive if it has not sent data for one day or more. The most important points about viewing monitoring data for sites with many servers are:

  • Cumulative graphs show the total activity of all servers but provide detailed information only for the active servers (inactive server information is not available).
  • Active server graphing data is always available (either in thumbnail graphs or through a link that produces the graph on demand).
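The active/inactive rule can be illustrated with a short Python sketch. It assumes a hypothetical mapping from server name to the time of the last received sample; the names and data structure are illustrative and do not reflect how the monitoring server actually tracks this. The only rule taken from the text is that a server counts as inactive once it has not sent data for one day or more.

```python
from datetime import datetime, timedelta, timezone

INACTIVE_AFTER = timedelta(days=1)  # "one day (or more)" without data


def split_active_inactive(last_seen, now):
    """Return (active, inactive) server names, each sorted by name.

    last_seen maps a server name to the datetime of its last sample
    (a hypothetical structure used only for this illustration).
    """
    active, inactive = [], []
    for name, timestamp in sorted(last_seen.items()):
        if now - timestamp >= INACTIVE_AFTER:
            inactive.append(name)
        else:
            active.append(name)
    return active, inactive


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "app-001": now - timedelta(seconds=20),  # reported on the last interval
    "app-002": now - timedelta(hours=23),    # quiet, but less than a day
    "app-003": now - timedelta(days=1),      # exactly one day: inactive
    "app-004": now - timedelta(days=3),      # long gone
}
active, inactive = split_active_inactive(last_seen, now)
print("active:", active)
print("inactive:", inactive)
```

Under this rule, only the servers in the active list would get individual graphs, while cumulative graphs still reflect total activity.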
