
Resiliency

"You may encounter many defeats, but you must not be defeated."

- Dr. Maya Angelou

A network consists of physical components (for example, network devices, physical cables, and transmission links from service providers) and software components that run within these physical devices. Like any physical equipment, network devices have a finite lifetime and will fail at some point. Further, the software that runs on these devices, even though carefully designed and tested, will have bugs and will behave erratically or hang at times. Any of these events will leave a part of the network nonfunctional. Resiliency is the property of the network that allows it to remain functional even when certain parts or components of the network are out of service.

Resiliency translates to network availability and can be measured as the uptime of the network. Uptime is usually represented as a percentage, and is defined as the percentage of time that the network delivers specific services with the contracted performance parameters during a specified interval. It is important to define the parameters that determine whether the service is available or not. As an example, if a network were designed to deliver HD video service but the users had to fall back to SD video due to network constraints, the network would be termed unavailable for the HD video service for that duration.

Let's consider the uptime of a network device as an example. A device's uptime is typically characterized by the average time between failures of the device, known as its Mean Time Between Failures (MTBF). Now assume that it takes, on average, a time x to repair or replace the device. This time is typically referred to as the Mean Time To Repair (MTTR). Over a total duration of MTBF + MTTR, the device is therefore up only for the MTBF portion, and the uptime is calculated as follows:

Figure 5: Calculating network uptime
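
Written out, the calculation described above (the device is up for MTBF out of a total of MTBF + MTTR) takes the following form. The result is a fraction; multiplying it by 100 expresses it as a percentage:

```latex
\[
\text{Uptime} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
\]
```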

A network uptime of 99.9 percent translates into a 0.1 percent outage, which means the network can be down for about 526 minutes in a year (0.1 percent of 525,600 minutes). This is more than 8 hours of outage in a year, and could translate into a significant loss if the network runs revenue-generating services. It is not uncommon to find CIOs talking about network uptime of 99.99 percent or 99.999 percent (also called four nines and five nines, respectively). However, the challenge is that the network is made of components that might not have an uptime as high as the target. For example, a WAN link might only give us an uptime of 99.5 percent. Hence, the problem becomes: how do we build a network such that the overall network delivers more uptime than its individual components can deliver?
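
To make the outage budgets concrete, the following minimal Python sketch converts an uptime percentage into minutes of allowed outage per year. The function name `downtime_minutes_per_year` is purely illustrative, and the sketch assumes a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly outage budget, in minutes, for an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Targets mentioned in the text: a 99.5% WAN link, three, four, and five nines.
    for target in (99.5, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime -> {minutes:.1f} minutes of outage per year")
```

Under these assumptions, the budget shrinks from roughly 2,628 minutes a year at 99.5 percent to about 526 minutes at three nines, about 53 minutes at four nines, and about 5 minutes at five nines.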

One way to improve uptime/availability is to use devices that have a high MTBF value. At the same time, uptime can be increased by reducing the time to repair (MTTR) by improving operational practices.
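
The effect of both levers can be sketched in a few lines of Python, assuming uptime is computed as MTBF / (MTBF + MTTR) as described earlier. The function name `availability` and the MTBF and MTTR values below are hypothetical and chosen only for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return uptime as a fraction, computed as MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values, not taken from any real device datasheet.
    baseline = availability(mtbf_hours=10_000, mttr_hours=24)
    better_device = availability(mtbf_hours=20_000, mttr_hours=24)  # higher MTBF
    faster_repair = availability(mtbf_hours=10_000, mttr_hours=4)   # lower MTTR

    print(f"baseline:      {baseline:.5%}")
    print(f"higher MTBF:   {better_device:.5%}")
    print(f"lower MTTR:    {faster_repair:.5%}")
```

With these illustrative numbers, cutting the repair time from 24 hours to 4 hours improves uptime more than doubling the MTBF does, which is why operational practices matter as much as hardware selection.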

The network is a group of network components, and the overall network uptime is a function of the uptime of the individual components and the way in which these components are connected together. Consider a network that has two components, Component A and Component B, connected in such a manner that if either component fails, the service is down. If the uptimes of the individual components are U_a and U_b, then the uptime of the system that has these two components in series (U_s) is the product of the two uptimes:

Figure 6: Calculating availability in a series of elements
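
In symbols, using the U_a, U_b, and U_s defined above and treating each uptime as a fraction between 0 and 1, the series case can be written as follows. The worked line assumes, purely for illustration, two components that each have the 99.5 percent uptime mentioned earlier for a WAN link:

```latex
\[
U_s = U_a \times U_b
\]
\[
U_s = 0.995 \times 0.995 = 0.990025 \approx 99.0\%
\]
```

The series combination is always less available than its weakest component.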

On the other hand, suppose the two components are connected in such a manner that each can act as a backup for the other: when one component fails, there is no outage of the system, because the load can be taken by the second component. The uptime of the system in this case is calculated as shown next:

Figure 7: Calculating availability in a redundant combination
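
In symbols, the redundant system is down only when both components are down at the same time. Assuming the two components fail independently of each other, and writing U_p (a symbol introduced here for illustration) for the uptime of the parallel combination, this gives the following. The worked line again assumes two components with 99.5 percent uptime each:

```latex
\[
U_p = 1 - (1 - U_a)(1 - U_b)
\]
\[
U_p = 1 - (0.005)(0.005) = 0.999975 = 99.9975\%
\]
```

Under the independence assumption, two 99.5 percent components in parallel exceed four nines, whereas the same two components in series fall to roughly 99.0 percent.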

As the preceding discussion shows, placing system components in parallel or load-sharing mode can improve the overall system availability drastically. However, there is another aspect to adding components in such a redundant mode: cost. Each additional component adds to the cost of the overall system, and that additional cost has to be justified by the additional uptime the component achieves. The following table shows the uptime of a system of N components, where all components are in parallel and the uptime of each component is 98 percent. As can be seen from the table, the incremental gain of adding a third component is very small and might not justify the additional investment. Hence, most systems are built with two redundant components; three redundant components are used only where the failure of that system could lead to a total network outage:

Figure 8: Effect of redundancy in availability
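
The diminishing return can be reproduced with a short Python sketch. It assumes that the components fail independently and that any single working component can carry the full load, so the system is down only when all N components are down. The function name `parallel_uptime` is illustrative, and the printed values are computed from that assumption rather than copied from the table:

```python
def parallel_uptime(component_uptime: float, n: int) -> float:
    """Uptime of n identical parallel components, assuming independent failures."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= component_uptime <= 1.0:
        raise ValueError("component_uptime must be a fraction between 0 and 1")
    # The system is down only if every one of the n components is down.
    return 1.0 - (1.0 - component_uptime) ** n


if __name__ == "__main__":
    # Each component has 98 percent uptime, as in the text.
    for n in range(1, 5):
        print(f"N = {n}: system uptime = {parallel_uptime(0.98, n) * 100:.6f}%")
```

Under these assumptions, the second component raises uptime from 98 percent to 99.96 percent, while the third adds only about 0.04 percentage points more.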

We will build on these concepts of redundancy in the network design and implementation chapters, where we will design a network with redundant options.