“Science is the organized skepticism in the reliability of expert opinion”
Reliability, availability and serviceability (RAS) is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by IBM as a term to describe the robustness of their mainframe computers.
However, these terms are often confused in the industry and warrant further definition.
Reliability can be defined as the probability that a system will produce correct outputs up to some given time (t). Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data. Instead, it detects and, if possible, corrects the corruption. For transient (soft) or intermittent errors, it may retry the operation. For uncorrectable errors, it isolates the fault and reports it to higher-level recovery mechanisms (which may fail over to redundant replacement hardware, etc.), or else it halts the affected program or the entire system and reports the corruption. Reliability can be characterized in terms of mean time between failures (MTBF): assuming a constant failure rate, reliability = exp(-t/MTBF).
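As a rough illustration of the reliability formula above, here is a minimal Python sketch. The function name and the 50,000-hour MTBF are assumptions chosen for the example, not figures from any real device, and the exponential model only holds if the failure rate is constant over time.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of running for t_hours without a failure.

    Assumes a constant failure rate (exponential model):
    R(t) = exp(-t / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if t_hours < 0:
        raise ValueError("t_hours must be non-negative")
    return math.exp(-t_hours / mtbf_hours)


# Hypothetical device with an MTBF of 50,000 hours (illustrative value only).
EXAMPLE_MTBF_HOURS = 50_000

r_one_year = reliability(HOURS_PER_YEAR, EXAMPLE_MTBF_HOURS)
r_at_mtbf = reliability(EXAMPLE_MTBF_HOURS, EXAMPLE_MTBF_HOURS)

print(f"Reliability over one year: {r_one_year:.3f}")
print(f"Reliability at t = MTBF:   {r_at_mtbf:.3f}")
```

Note that at t = MTBF the reliability is exp(-1), or only about 37%. An MTBF figure is therefore not a guaranteed lifetime; under this model, most units will have failed at least once by the time they reach it.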
Availability means the probability that a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. High-availability systems may report availability in terms of minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational. Availability is typically given as a percentage of the time a system is expected to be available, e.g., 99.999 percent (“five nines”).
It is critical to understand how we define uptime and downtime. We use the term ‘nines’. This means that we consider a time frame, normally one year, which is 365 days, 8,760 hours, or 525,600 minutes. A typical cell tower site is often built without a UPS, generators, or redundant connection paths, so it has multiple points of failure. If AC power is lost, then the site is down. As a result, as good as cellular service is in Canada, cell sites are built to a standard of just about 95% uptime. That is to say, statistically speaking, a single cell site may experience about 18.3 days of outage per annum. That is a lot of downtime!
So, why do we not notice these service outages more? Well, the cellular coverage patterns overlap, which adds to the robustness of the coverage. We may not drive through a hole in the coverage. Or the site may be down at 3:00 am when we are fast asleep. There are many reasons why an outage may go unnoticed. Sometimes it is noticed, as a dropped call or as strange delays, echoes, or noise on the call; other times, especially in urban areas, the coverage is augmented by surrounding towers. Most likely, we are just not using our smartphones when we are within the affected coverage footprint.
Therefore, the reality is that an uptime of just 95% is statistically appropriate for the service. Chasing perfection is said to be a fool’s game, so good engineering demands a balance between performance and cost. Adding UPS, generators and redundant paths would not likely result in an increase in customer satisfaction or even be noticed. But the added cost burden to the network could make the difference between profit and loss.
Another example is with data centres. Most folks assume that a top-tier data centre is never down, but in reality, to meet a “five nines” (99.999 percent) standard for uptime, we can tolerate up to 5 minutes and 15 seconds of service outage spread out over the 8,760 hours in a year.
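To put a rough number on the effect of overlapping coverage, the sketch below estimates the chance that at least one of several overlapping sites is up. It assumes every site has the same availability and that sites fail independently of one another. That is a simplification: a regional power outage, for example, could take down neighbouring sites together. The function name is hypothetical.

```python
def combined_availability(site_availability: float, overlapping_sites: int) -> float:
    """Probability that at least one of n overlapping sites is up.

    Assumes identical sites that fail independently, which a shared
    power outage or common backhaul failure would violate.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if overlapping_sites < 1:
        raise ValueError("overlapping_sites must be at least 1")
    # All n sites must be down at once for coverage to be lost.
    return 1.0 - (1.0 - site_availability) ** overlapping_sites


for n in (1, 2, 3):
    combined = combined_availability(0.95, n)
    print(f"{n} site(s) at 95% each -> {combined:.4%} combined")
```

Under these assumptions, two overlapping 95% sites give about 99.75% combined coverage, which helps explain why individual site outages so often go unnoticed.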
The secret is to find a reasonable balance between performance and cost.
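The downtime figures quoted above follow from simple arithmetic on the number of minutes in a year. The short Python sketch below converts an availability percentage into allowed downtime per year; the function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def format_downtime(minutes: float) -> str:
    """Render a downtime in minutes as days, hours, minutes, and seconds."""
    total_seconds = round(minutes * 60)
    days, rem = divmod(total_seconds, 86_400)
    hours, rem = divmod(rem, 3_600)
    mins, secs = divmod(rem, 60)
    return f"{days}d {hours}h {mins}m {secs}s"


for percent in (95.0, 99.0, 99.9, 99.99, 99.999):
    downtime = downtime_minutes_per_year(percent)
    print(f"{percent:>7}% -> {format_downtime(downtime)}")
```

At 95% the result is 18 days and 6 hours (18.25 days), in line with the roughly 18.3 days mentioned for a cell site. At 99.999% it is about 5 minutes and 15 seconds. Each additional nine cuts the allowed downtime by a factor of ten, which is why each one tends to cost so much more than the last.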
Serviceability or maintainability is the simplicity and speed with which a system can be repaired or maintained; if the time to repair a failed system increases, then availability will decrease. Serviceability includes various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center (without human intervention) when they experience a fault. The traditional focus has been on making the correct repairs with as little disruption to normal operations as possible.
Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption.
One of the most confusing aspects of understanding these important terms is defining the meaning of ‘real-time’. What does real-time mean?
Below are the time domains that I use in the IT / OT world to define the time constraints. Since my perspective is through a lens that considers networks, the definitions may vary if you use a different lens to ponder time.
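The link between repair time and availability can be made concrete with the common textbook approximation: availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The sketch below is illustrative only. The function name and the 10,000-hour MTBF are assumptions, and the formula describes long-run average behaviour rather than any particular year.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable system is up.

    Uses the standard approximation MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system with an MTBF of 10,000 hours (illustrative value only).
EXAMPLE_MTBF_HOURS = 10_000

for mttr_hours in (1, 4, 24, 72):
    availability = steady_state_availability(EXAMPLE_MTBF_HOURS, mttr_hours)
    print(f"MTTR {mttr_hours:>2} h -> availability {availability:.4%}")
```

With the failure rate held fixed, every extra hour of repair time lowers availability. This is why serviceability features such as fast diagnosis and automatic fault reporting contribute directly to uptime.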
For example, in applications, processes can be real-time or batch processed. So, time can be measured in minutes or hours instead of milliseconds. With this broader time domain, the sense of urgency will change and therefore the definition can change too.
So, the perspective and the technology will definitely influence your perception of time.
I share a few real-life examples of the time it takes for things to happen in order to frame the meaning of a millisecond, or one thousandth of a second.
In summary, high availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period.
Modernization in our technological world has resulted in an increased reliance on these systems. For example, the worlds of oil and gas, mines, smart cities, pipelines, education, transportation, hospitals, and data centres all require high availability of their systems to perform routine daily activities. Availability refers to the ability of the user community to obtain a service or good, or to access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is, from the user’s point of view, unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
About the Author:
Michael Martin has more than 35 years of experience in systems design for broadband networks, optical fibre, wireless and digital communications technologies.
He is a Senior Executive with IBM Canada’s Office of the CTO, Global Services. Over the past 14 years with IBM, he has worked in the GBS Global Center of Competency for Energy and Utilities and the GTS Global Center of Excellence for Energy and Utilities. He was previously a founding partner and President of MICAN Communications and before that was President of Comlink Systems Limited and Ensat Broadcast Services, Inc., both divisions of Cygnal Technologies Corporation (CYN: TSX).
Martin currently serves on the Board of Directors for TeraGo Inc (TGO: TSX) and previously served on the Board of Directors for Avante Logixx Inc. (XX: TSX.V).
He serves as a Member, SCC ISO-IEC JTC 1/SC-41 – Internet of Things and related technologies, ISO – International Organization for Standardization, and as a member of the NIST SP 500-325 Fog Computing Conceptual Model, National Institute of Standards and Technology.
He served on the Board of Governors of the University of Ontario Institute of Technology (UOIT) and on the Board of Advisers of five different Colleges in Ontario. For 16 years he served on the Board of the Society of Motion Picture and Television Engineers (SMPTE), Toronto Section.
He holds three master’s degrees, in business (MBA), communication (MA), and education (MEd). As well, he has diplomas and certifications in business, computer programming, internetworking, project management, media, photography, and communication technology.