How Electric Imp Monitors The Public impCloud

The Electric Imp impCloud™ processes considerably more than five billion application-generated messages a month. To ensure that it continues to do so efficiently and is able to process the much larger volumes of messages that we anticipate in the very near future, we need to keep a very close eye on the well-being of the system.

To provide our customers and stakeholders with a view of the health of our system, we host a public status page at http://status.electricimp.com/.

Figure: Electric Imp’s public status page

We keep the status page intentionally simple and easy to read. Behind the scenes, however, we run extensive health checks and collate data from thousands of metrics to give us a moment-by-moment picture of the health of our system.

These metrics are used to pre-emptively identify when hardware or software is misbehaving, so that we can remedy it before it impacts our service. If something does go wrong, we can use the metrics (and log files) to understand what went wrong and prevent it from happening again.

Monitoring imps

In order to update the status page, we run continuous health checks by using ‘monitoring imps’.

These are actually ‘virtual imps’. They run almost exactly the same code as a real imp, but are compiled to run on Linux. This allows us to operate them outside of the Electric Imp impCloud without needing to install and manage physical imps in a remote location.

We have at least one monitoring imp for each server. Each imp and its agent run a series of continuous health checks, including:

Is the imp connected? Can it send messages to the agent?
Can the agent send messages to the imp?
Does HTTP access to the agent work?
Does HTTP access from the agent work?
Are all of these checks completed within a reasonable time?
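
The checks above amount to a set of round trips, each of which has to succeed and also finish in time. The following is a minimal sketch of that idea in Python, not the code the monitoring imps actually run (that code is not shown in this article). The check names, the `run_checks` helper and the two-second deadline are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], bool]],
               deadline_s: float) -> Dict[str, Tuple[bool, float]]:
    """Run each named check in turn.

    A check passes only if it returns a truthy value AND finishes within
    deadline_s seconds. Returns {name: (passed, elapsed_seconds)}.
    """
    results = {}
    for name, check in checks.items():
        start = time.monotonic()
        try:
            ok = bool(check())
        except Exception:
            ok = False  # an exception counts as a failed check
        elapsed = time.monotonic() - start
        results[name] = (ok and elapsed <= deadline_s, elapsed)
    return results


def all_healthy(results: Dict[str, Tuple[bool, float]]) -> bool:
    """True only if every check passed."""
    return all(passed for passed, _ in results.values())


if __name__ == "__main__":
    # Hypothetical stand-ins for the real device/agent round trips.
    checks = {
        "device_to_agent": lambda: True,
        "agent_to_device": lambda: True,
        "http_to_agent": lambda: True,
        "http_from_agent": lambda: True,
    }
    results = run_checks(checks, deadline_s=2.0)
    print("healthy" if all_healthy(results) else "escalate to on-call")
```

Note that this sketch only measures how long a check took after it returns. It does not abort a check that hangs, so a real implementation would also need its own timeouts on the underlying network calls.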

We also have other imps and agents which run more specialized health checks, such as checking that the factory flow works correctly.

These health checks are used to update various in-house dashboards — some of which are themselves imp-powered! — and to update the status page.

When any of these health checks fail, the problem is escalated — via a third-party messaging system called PagerDuty — to the human on call, who can quickly evaluate the problem and set about remedying it.

Monit

We use Monit for monitoring the health of our cloud-based daemons.

For example, a daemon that uses too much memory, or too many connections to our RabbitMQ message management cluster, might indicate a problem. In either case Monit raises an alert so that an Electric Imp admin can take a look before the issue affects our availability.
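
Monit itself is driven by its own control file rather than by application code, so the following Python is only a sketch of the underlying idea: compare current readings against fixed limits and report any that are exceeded. The metric names and limit values are hypothetical and are not Electric Imp's real thresholds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limit:
    metric: str      # name of the reading to watch (hypothetical)
    maximum: float   # alert when the reading goes above this value


def breached(readings: Dict[str, float], limits: List[Limit]) -> List[str]:
    """Return one alert message per limit that the readings exceed.

    Metrics with no current reading are skipped rather than alerted on.
    """
    alerts = []
    for limit in limits:
        value = readings.get(limit.metric)
        if value is not None and value > limit.maximum:
            alerts.append(f"{limit.metric}={value} exceeds {limit.maximum}")
    return alerts


if __name__ == "__main__":
    limits = [
        Limit("daemon.memory_mb", 512),
        Limit("rabbitmq.connections", 1000),
    ]
    readings = {"daemon.memory_mb": 640, "rabbitmq.connections": 200}
    for alert in breached(readings, limits):
        print(alert)
```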

Nagios

We use Nagios for monitoring the health of all of our hosts and the services running on those hosts. For example, we have one Nagios script that verifies that it can connect to imp_server, the server that interfaces with all the devices in the field, and another that connects to the impCentral API.
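
A Nagios check is typically a small program that prints one line of status text and signals its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). As a rough sketch of a connectivity check of this kind, the Python below simply tries to open a TCP connection. It is not Electric Imp's actual script: the host and port arguments are placeholders, and the real check against imp_server may well speak a higher-level protocol.

```python
import socket
from typing import List, Tuple

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host: str, port: int, timeout_s: float = 5.0) -> Tuple[int, str]:
    """Try to open a TCP connection; return (exit_code, status_line)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return OK, f"OK - connected to {host}:{port}"
    except OSError as exc:  # covers refused connections and timeouts
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port}: {exc}"


def main(argv: List[str]) -> int:
    """Plugin entry point: print the status line, return the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("UNKNOWN - usage: check_tcp.py HOST PORT")
        return UNKNOWN
    code, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return code

# When installed as a plugin, the script would end with:
#     sys.exit(main(sys.argv))
```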

Metrics

In the Electric Imp impCloud, we measure everything.

The precise number of metrics we track depends on how you count them, but we record several thousand individual measurements covering every aspect of our service’s performance. These range from the number of open connections to each of our Redis instances to how long each step in the imp connection process takes.

We gather these metrics in a variety of different ways:

We use collectd for machine-level metrics. These include the number of established TCP connections (i.e., how many imps are currently connected) and the memory consumption of various processes.
We use statsd for recording and collating metrics from our Node.js and C++ components. Examples include the number of outstanding agent HTTP requests or timers, and the length of the runnable agent queue.
We use Folsom and Folsomite for recording metrics from our Erlang daemons.
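
To give a feel for how lightweight this kind of instrumentation is, here is a minimal Python sketch of the statsd wire format, in which each metric is a short text line of the form name:value|type, usually sent as a UDP datagram. The metric names below are invented for illustration, and the host and port (8125 is the conventional statsd default) are assumptions rather than details of our deployment.

```python
import socket

# statsd metric types used here: counter, gauge, timer (milliseconds).
VALID_KINDS = ("c", "g", "ms")


def statsd_line(name: str, value, kind: str) -> str:
    """Format one metric in the statsd text format: name:value|type."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unsupported statsd type: {kind!r}")
    return f"{name}:{value}|{kind}"


def send(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: send one metric line as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, loosely modelled on the examples above.
    print(statsd_line("agent.http.outstanding", 42, "g"))
    print(statsd_line("agent.queue.runnable", 7, "g"))
    print(statsd_line("imp.connect.handshake", 183, "ms"))
```

Because the transport is UDP, a component that emits metrics this way is not slowed down or broken if the metrics collector is unavailable.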

All of these metrics are sent to a Graphite backend for storage and collation. We use Grafana for viewing the various graphs and dashboards. Monit compares a number of these values to pre-defined figures in order to perform its checks and, if necessary, issue alerts (see above).
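
For readers unfamiliar with Graphite, its plaintext protocol accepts one metric per line in the form "path value timestamp", where the timestamp is in Unix seconds. The Python below is a small sketch of that format. The metric path and the carbon host are hypothetical, and port 2003 is simply Graphite's conventional plaintext default. In practice collectd, statsd and Folsomite handle this forwarding for us.

```python
import socket
import time
from typing import Optional


def graphite_line(path: str, value: float,
                  timestamp: Optional[int] = None) -> str:
    """Format one metric for Graphite's plaintext protocol.

    Each metric is a single newline-terminated line:
        <dotted.metric.path> <value> <unix-timestamp>
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_to_carbon(line: str, host: str, port: int = 2003) -> None:
    """Send one line to a carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; prints the line rather than sending it.
    print(graphite_line("servers.host1.tcp.established", 1234), end="")
```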

Logging

We make extensive use of operational logging. Each host and daemon writes to the syslog service, which makes it easy for us to centralize the logging data.
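
Part of what makes syslog easy to centralize is that every message carries a numeric priority that encodes both its source category (facility) and its severity, calculated as facility × 8 + severity. The Python below sketches how a daemon's message might be framed in the traditional BSD syslog style. The tag and message text are hypothetical, and the choice of the local0 facility is an assumption rather than a description of our configuration.

```python
# Standard syslog severities (lower number = more severe).
SEVERITY = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}

FACILITY_LOCAL0 = 16  # local0..local7 are reserved for site-specific use


def syslog_frame(tag: str, message: str, severity: str = "info",
                 facility: int = FACILITY_LOCAL0) -> bytes:
    """Build a minimal BSD-style syslog frame: <PRI>tag: message.

    The timestamp and hostname are omitted here; a receiving syslog
    service normally fills these in when they are missing.
    """
    pri = facility * 8 + SEVERITY[severity]
    return f"<{pri}>{tag}: {message}".encode("utf-8")


if __name__ == "__main__":
    # Hypothetical daemon tag and message.
    print(syslog_frame("imp_server", "device connected"))
    print(syslog_frame("imp_server", "handshake failed", severity="err"))
```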

We use Logstash to ingest the log data into an Elasticsearch cluster, where we can use Kibana to run queries.

The Result

The combination of continuous health monitoring, extensive metrics gathering and comprehensive logging means that we can keep a close eye on the impCloud and ensure that its availability continues to meet or exceed the expectations of our customers.