Real-World SRE

Displaying monitoring information

Now that you are collecting your data and writing it into storage, you can start to display it to users. Many of the previously mentioned monitoring tools provide their own visualization systems. Others recommend that you bring your own. One popular open-source tool for this is Grafana.

Whatever you use to visualize and access your metrics, there are four categories of tools that people often use to get and share their data:

  • Arbitrary queries
  • Graphs
  • Dashboards
  • Chatbots

Let us look at why each of these is useful and what to keep in mind when using it.

Arbitrary queries

Everyone needs the ability to query the metrics and logs you are collecting. The reason is pretty straightforward: metrics are useless if people cannot access them. You can build graphs and dashboards for people, but while they are working they will often come up with new and exciting questions, and answering those means writing new queries against the metric datastore. You want to empower your coworkers to be able to do their jobs without you. If you're the only one who can run queries, either your job will quickly become "query executor" or no one will look at the metrics. Neither of those outcomes is great for your sanity.

The most straightforward query is to get the data points for a metric over a period. More complex queries depend on your datastore, what it allows, and how much it stores. Many log systems let you query by a regular expression over a period. Other datastores let you do math in a query, such as dividing the count of errors grouped by a period (let us say five minutes for this example) by the total number of requests grouped by the same period (five minutes again). Multiplied by 100, this gives you a table of the percentage of failed requests in each five-minute window, which can be very useful.
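As a rough illustration of the failure-percentage table, here is a minimal Python sketch that buckets requests into five-minute windows and divides the errors in each window by the total requests in it. The function name, the `(timestamp, is_error)` input shape, and the sample data are all assumptions made for the example; a real datastore would normally do this grouping in its own query language.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # five-minute windows, as in the example in the text


def failure_percentages(requests, window=WINDOW_SECONDS):
    """Return {window_start: percent_failed} for (timestamp, is_error) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for timestamp, is_error in requests:
        bucket = timestamp - (timestamp % window)  # start of this window
        totals[bucket] += 1
        if is_error:
            errors[bucket] += 1
    # Errors divided by total requests, expressed as a percentage.
    return {
        bucket: 100.0 * errors[bucket] / totals[bucket]
        for bucket in sorted(totals)
    }


# Hypothetical sample data: (Unix timestamp in seconds, did the request fail?)
sample = [
    (0, False), (60, True), (120, False), (299, False),  # window starting at 0
    (300, True), (450, True),                            # window starting at 300
]
table = failure_percentages(sample)
for start, percent in table.items():
    print(f"{start:>6}  {percent:5.1f}%")
```

Windows with no requests simply do not appear in the result, which avoids dividing by zero.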

These ad hoc queries are useful because you will often have questions that you had not thought of before. Providing a simple web interface, or command-line interface, that lets anyone query the data helps people to find the value of monitoring.

If someone has a question, you can show them how to query the data and provide them with documentation, so they can run any query they want to, and answer their current and future questions. Making it easy for users to query data, and showing people how, will also help them to write more metrics and become better at deciding how and when to instrument their services.

Often, once users have these tables of data, they want it in a graph so that they can visualize large amounts of data. What is the phrase? If you give a mouse a cookie… they will want to graph cookie consumption over time?

Graphs

Graphs are mostly used as a visual representation of data over time. You probably remember making graphs in school when you were younger or seeing them used to show things on television news channels without any labels, or units, or any of the other things your teachers told you were required for a good graph. Graphs are common, and most people know how to interpret them. Miller's law states that the average human can hold only about seven objects (plus or minus two) in their head at once, which makes long tables of numbers difficult to take in, but a graph can display thousands of numbers at once.

The classic rules for a graph are as follows:

  • The x axis (the bottom edge of the graph) contains time
  • There must be a key for whatever is graphed
  • All axes must be labeled
  • All numbers must have units
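These rules are easy to turn into a checklist. The sketch below is one hedged way to do that in Python: it inspects a plain dictionary describing a graph and reports which of the rules it breaks. The dictionary keys (`x_axis`, `legend`, `x_label`, `y_label`, `y_unit`) are invented for this example and do not correspond to any particular graphing tool.

```python
def missing_graph_details(graph):
    """Return a list of the classic graph rules that this description breaks."""
    problems = []
    if graph.get("x_axis") != "time":
        problems.append("x axis should contain time")
    if not graph.get("legend"):
        problems.append("no key for the plotted series")
    for axis in ("x_label", "y_label"):
        if not graph.get(axis):
            problems.append(f"{axis} is missing")
    if not graph.get("y_unit"):
        problems.append("y values have no unit")
    return problems


# Hypothetical graph description with a title but no y-axis label or unit.
draft = {
    "title": "Errors",
    "x_axis": "time",
    "x_label": "Time (UTC)",
    "legend": ["5xx responses"],
}
issues = missing_graph_details(draft)
for issue in issues:
    print(issue)
```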

Following these rules can be difficult. Some monitoring systems do not support units. Some graph systems only allow you to title graphs, and do not support adding labels to the axes, adding labels to individual lines, or adding keys. The majority of graph systems get the x axis right, though.

I mention this because, often, after you generate a graph from whatever query you have made, you will want to distribute it. You might put it in a document, send it in an email, put it in a group chat, or, if you are feeling particularly aggressive, print it out and slam it against the glass wall of a conference room and yell, "How do you like them apples?" as if you were Matt Damon in Good Will Hunting. Before you do some or all of these things though, remember that the graph will live on, often without you there to explain it. Working at Google, I often read through postmortem documents containing graphs that broke these rules, made by people who had left the organization long ago. A graph that no one knows how to interpret loses its original value.

I also suggest that if your graphing service can output static copies of graphs, you share the static image of the graph, along with a link to the query. As we will mention again in a bit, data is often deleted or downsampled over time, so a graph regenerated later may have lost the granularity, or accuracy, it had when you first made it. If someone tries to figure out what you were talking about six months from now, even if your link to a graph no longer works, they will have the static image available. Also, it's worth mentioning that if you are going to share a link, make sure it points to an absolute time range, not a relative one, so that when people click on it, they see the same period you were talking about.
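To make the absolute-versus-relative point concrete, here is a small Python sketch that turns a relative window such as "the six hours before the incident ended" into a link pinned to fixed timestamps. The base URL is hypothetical, and the `from` and `to` parameter names with epoch-millisecond values are an assumption; check what your own graphing service expects.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical dashboard address; substitute your own graphing service.
BASE_URL = "https://graphs.example.com/d/service-overview"


def absolute_link(end, lookback, base_url=BASE_URL):
    """Build a link pinned to the fixed window [end - lookback, end]."""
    start = end - lookback
    params = {
        # Assumed parameter names, with values in epoch milliseconds.
        "from": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
    }
    return f"{base_url}?{urlencode(params)}"


# Pin the window to the moment of the incident rather than to "now".
incident_end = datetime(2015, 5, 23, 12, 0, tzinfo=timezone.utc)
link = absolute_link(incident_end, timedelta(hours=6))
print(link)
```

Because the timestamps are baked into the link, it shows the same six hours whether it is opened today or six months from now, for as long as the underlying data is retained.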

Sometimes, you do not want to link to a single graph though. You want to share a collection of graphs. Alternatively, there are certain queries that you always want quick access to. This is what dashboards are for.

Dashboards

A dashboard is a collection of graphs that are usually all synchronized to the same time period. There are a lot of philosophies and opinions on what makes a good dashboard. Should a dashboard be on a TV? Should it be readable on mobile? Should there be more than five graphs? All of these questions are up for debate.

The best rule for building dashboards is to let your users see the data they need. If you are designing dashboards for yourself, or your team, here are some suggestions and things to keep in mind:

  • If you are going to put a dashboard on a television, make sure it only has a few graphs and can be read from far away. If you have to run across the room to see what is going on, it probably isn't fulfilling its role.
  • If you can, make a mobile-friendly version of your dashboard. More and more engineers first check outages and graphs on their phones before pulling out a laptop.
  • Some folks suggest that you should limit your dashboards to five graphs at a time. I often end up with dashboards with a lot more than that. So, as a slightly more achievable target, I suggest dedicating the first three graphs to the most important things about a service. This way, when the page first loads, the stuff at the top is where you can focus. One good trick is to have links to other dashboards. If you have a graph about network traffic, put links below it to dashboards with more details about networking, in case the graph shows that something is wrong.

All of the suggestions from the graph section still apply. Make sure your graphs have units and labels if you can. You often will not be the only person looking at your dashboard. Your dashboard will probably last longer than your employment in the job.

Chatbots

I live in chat. I am often idling in Slack, IRC, Signal, Keybase, and Google Hangouts. One very useful tool for me is for graphs to appear in chat, or to be able to request a graph from chat. GitHub writes in its blog (https://githubengineering.com/deploying-branches-to-github-com/) that they can just type /graph me 20150517..20150523 @github.deploys.total in Slack and get an image with a graph of the metric github.deploys.total from May 17, 2015 to May 23, 2015.
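To give a feel for what a bot has to do with a message like that, here is a minimal Python sketch that parses the command into a start date, an end date, and a metric name. This is not GitHub's implementation; the regular expression and the `parse_graph_command` function are assumptions based only on the shape of the example command.

```python
import re
from datetime import datetime

# Matches "/graph me <start>..<end> @<metric>", with dates written as YYYYMMDD.
COMMAND = re.compile(r"^/graph me (\d{8})\.\.(\d{8}) @([\w.]+)$")


def parse_graph_command(text):
    """Return (start_date, end_date, metric), or None if the text is not valid."""
    match = COMMAND.match(text.strip())
    if match is None:
        return None
    try:
        start = datetime.strptime(match.group(1), "%Y%m%d").date()
        end = datetime.strptime(match.group(2), "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that are not a real calendar date
    if end < start:
        return None
    return start, end, match.group(3)


parsed = parse_graph_command("/graph me 20150517..20150523 @github.deploys.total")
print(parsed)
```

A real bot would then query the metric datastore for that range, render the image, and post it back to the channel; those steps depend entirely on the chat platform and graphing service in use.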

In past jobs, when an outage happened, we had the alert email include an image of, and a link to, the graph in question. For my personal websites, I get a message in a private chat once a day, with a graph for each of my sites telling me whether things are working properly. This is a pretty simple bot, which just runs as a cron job.

Note

Cron is a service that runs jobs at regular intervals. We talk about it in depth in Chapter 10, Linux and Cloud Foundations.

Chatbots that provide graphs are not a required part of the monitoring infrastructure, but over the years they have become one of my favorite ways to start an investigation or to begin digging into an issue. They provide a quick snapshot and are often very useful in conversations with coworkers about outages, especially when not everyone is in the same room, or you want a record of the conversation.