What We Measure To Check The Health Of Our Platform
The Electric Imp impCloud™ processes considerably more than five billion application-generated messages a month. To ensure that it continues to do so efficiently and is able to process the much larger volumes of messages that we anticipate in the very near future, we need to keep a very close eye on the well-being of the system.
To provide our customers and stakeholders with a view of the health of our system, we host a public status page at http://status.electricimp.com/. We also issue updates via the @impstatus Twitter account.
Electric Imp’s public status page
We keep these intentionally simple and easy to read, but behind the scenes we run extensive health checks and collate data from a huge number of metrics in order to give us a moment-by-moment picture of the health of our system.
These metrics are used to pre-emptively identify when hardware or software is misbehaving, so that we can remedy it before it impacts our service. If something does go wrong, we can use the metrics (and log files) to understand what went wrong and prevent it from happening again.
In order to update the status page, we run continuous health checks by using ‘monitoring imps’.
These are actually ‘virtual imps’. They run almost exactly the same code as a real imp, but are compiled to run on Linux. This allows us to operate them outside of the Electric Imp impCloud without needing to install and manage physical imps in a remote location.
We have at least one imp for each server. Each imp and its agent runs various continuous health checks, including:
We also have other imps and agents which run more specialized health checks, such as checking that the factory flow works correctly.
These health checks are used to update various in-house dashboards — some of which are themselves imp-powered! — and to update the status page.
When any of these health checks fail, the problem is escalated — via a third-party messaging system called PagerDuty — to the human on call, who can quickly evaluate the problem and set about remedying it.
We use Monit for monitoring the health of our cloud-based daemons.
For example, if a daemon uses too much memory, or there are too many connections to our RabbitMQ message management cluster, this might indicate a problem, so Monit will raise an alert, and an Electric Imp admin can take a look before it impacts our availability.
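The kind of threshold comparison described above can be sketched in a few lines of Python. This is not Monit configuration and not our production code; it only illustrates the logic. The metric names (`daemon_memory_mb`, `rabbitmq_connections`) and the limits are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """A metric name paired with the highest value considered healthy."""
    metric: str
    limit: float


def find_breaches(sample: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message for each metric whose value exceeds its limit."""
    alerts = []
    for threshold in thresholds:
        value = sample.get(threshold.metric)
        # A metric missing from the sample is skipped rather than treated as a breach.
        if value is not None and value > threshold.limit:
            alerts.append(f"{threshold.metric}={value} exceeds limit {threshold.limit}")
    return alerts


# Hypothetical limits, for illustration only.
THRESHOLDS = [
    Threshold("daemon_memory_mb", 512.0),
    Threshold("rabbitmq_connections", 1000.0),
]
```

Calling `find_breaches` with a sample of current readings returns an empty list while everything is within limits, and one message per breached limit otherwise. A real monitor would pass those messages on to an alerting channel.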
We use Nagios for monitoring the health of all of our hosts and the services running on those hosts. For example, we have one Nagios script that verifies that it can connect to imp_server, the server that interfaces with all the devices in the field, and another that verifies that it can connect to the impCentral API.
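A connectivity check of this kind might look roughly like the sketch below. It follows the standard Nagios plugin convention of reporting status through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) plus one line of output. Everything else is an assumption made for illustration: the function names, the one-second warning threshold, and the use of a plain TCP connection rather than whatever protocol our real scripts speak.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=1.0, timeout=5.0):
    """Try to open a TCP connection and return (exit_code, status_line)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Covers refused connections, timeouts and name-resolution failures.
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port} ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - connected to {host}:{port} in {elapsed:.3f}s"
    return OK, f"OK - connected to {host}:{port} in {elapsed:.3f}s"


def main(argv):
    """Parse HOST and PORT arguments, run the check, print the status line."""
    if len(argv) != 2:
        print("UNKNOWN - usage: check_tcp.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[1])
    except ValueError:
        print(f"UNKNOWN - invalid port: {argv[1]}")
        return UNKNOWN
    code, message = check_tcp(argv[0], port)
    print(message)
    return code
```

To run this as a plugin, the file would end with `sys.exit(main(sys.argv[1:]))` (after `import sys`), so that Nagios sees the exit code.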
In the Electric Imp impCloud, we measure everything.
The precise number of metrics we track depends on how you count them, but we record several thousand individual measurements covering every aspect of our service’s performance. These range from the number of open connections to each of our Redis instances to how long each step in the imp connection process takes.
We gather these metrics in a variety of different ways:
All of these metrics are sent to a Graphite backend for storage and collation. We use Grafana for viewing the various graphs and dashboards. Monit compares a number of these values to pre-defined figures in order to perform its checks and, if necessary, issue alerts (see above).
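As an illustration of how a single measurement can reach Graphite, the sketch below uses Graphite's plaintext protocol: one line per data point in the form `path value timestamp`, sent over TCP to the Carbon listener, which defaults to port 2003. The metric path and the helper names are hypothetical, and this is not necessarily how our own collectors are written.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: 'path value timestamp\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted metric lines to a Carbon plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric path, for illustration only.
line = format_metric("impcloud.redis.open_connections", 42, timestamp=1500000000)
```

Here `line` holds `"impcloud.redis.open_connections 42 1500000000\n"`. Several such lines can be batched into one `send_metrics` call to avoid opening a connection for every data point.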
We make extensive use of operational logging. Each host and daemon writes to the syslog service, which makes it easy for us to centralize the logging data.
We use Logstash to ingest the log data into an Elasticsearch cluster, where we can use Kibana to run queries.
The combination of continuous health monitoring, gathering of extensive metrics and comprehensive logging means that we can keep a close eye on the impCloud, allowing us to ensure that its availability continues to meet and exceed the expectations of our customers.