Real-World SRE

Managing and maintaining monitoring data

OK, now you have the basics of a monitoring system set up. Data is flowing from your service to a datastore, and you can visualize how things change over time. As that data accumulates, you are going to have to maintain the monitoring system itself. There are a few tricks for dealing with that.

The first way is a classic, tried-and-true method: pay someone else to do it. There are tons of companies that sell hosted monitoring software. Datadog, Honeycomb, GrafanaCloud, InfluxCloud, Librato, Instrumental, New Relic, and many others sell products with all sorts of features and tools, so you do not have to host your monitoring tools yourself.

Sometimes, though, you do not have that kind of money, you have special security requirements, you cannot live with the restrictions that hosted services impose, or you have some other reason not to hand your monitoring infrastructure to someone else. In that case, first and foremost, you should monitor your monitoring.

That sounds straightforward, but if you run out of disk space and stop storing monitoring data, or lose six months of monitoring data to successive server failures, or hit some other failure, you will be sad. Monitoring is one of those systems that people forget about. Often, it gets spun up and then ignored because it just works. Make sure you keep an eye on it so that it does not fail when you need it most. As we've said many times throughout this chapter, make sure your service is doing the thing it is supposed to do. In this case, that means storing metrics.
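As a rough illustration of monitoring your monitoring, here is a minimal Python sketch that checks the two failure modes mentioned above: the datastore's disk filling up, and metrics no longer being stored. The function name, the thresholds, and the idea of passing in the timestamp of the newest stored sample are all assumptions made for this example. How you actually look up that timestamp depends on your datastore.

```python
import shutil
import time
from typing import List, Optional


def check_monitoring_health(
    data_dir: str,
    last_sample_ts: float,
    max_disk_fraction: float = 0.8,   # hypothetical threshold: alert above 80% used
    max_staleness_s: float = 300.0,   # hypothetical threshold: alert after 5 minutes of silence
    now: Optional[float] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    now = time.time() if now is None else now
    problems = []

    # Is the volume that holds the monitoring data close to full?
    usage = shutil.disk_usage(data_dir)
    used_fraction = usage.used / usage.total
    if used_fraction > max_disk_fraction:
        problems.append(
            f"disk at {used_fraction:.0%} of capacity under {data_dir}"
        )

    # Is the monitoring system still doing its job, which is storing metrics?
    staleness = now - last_sample_ts
    if staleness > max_staleness_s:
        problems.append(f"no new samples stored for {staleness:.0f} seconds")

    return problems
```

You would run something like this on a schedule from a machine other than the monitoring server, and send the result somewhere that does not depend on the monitoring system being up.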

Storage tends to be the thing that bites people about monitoring tools, because people often don't realize how much data they are collecting and how quickly it can use up all of their storage. If you have a limited amount of space and you are monitoring many things, you will want to start deleting data at some point. The two common ways to deal with this are sampling and archiving. With sampling, the granularity available gets coarser the farther back into history you look; with archiving, the monitoring system fetches data older than a certain age from a slower data source. Some services have this built in, but with others you will have to deal with it yourself. Thankfully, you can also just throw more storage at the issue, as hard drive prices have dropped consistently year over year.
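If your datastore does not reduce granularity for you, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: samples are (timestamp, value) pairs held in memory, old data is averaged into fixed-size buckets, and the names `downsample` and `compact` are hypothetical rather than part of any particular tool.

```python
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def downsample(samples: List[Sample], bucket_s: int) -> List[Sample]:
    """Average samples into fixed-width time buckets, one point per bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of the bucket it falls in.
        bucket_start = int(ts // bucket_s) * bucket_s
        buckets[bucket_start].append(value)
    return [
        (float(start), sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


def compact(
    samples: List[Sample], now: float, full_res_window_s: float, bucket_s: int
) -> List[Sample]:
    """Keep recent samples at full resolution and downsample everything older."""
    cutoff = now - full_res_window_s
    old = [s for s in samples if s[0] < cutoff]
    recent = sorted(s for s in samples if s[0] >= cutoff)
    return downsample(old, bucket_s) + recent
```

For example, calling `compact` with a one-week full-resolution window and one-hour buckets would keep the last week untouched and store one averaged point per hour for everything before it. Averaging hides short spikes, so if those matter to you, consider storing the minimum and maximum of each bucket as well.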