Curated Reads

Reliability is a probability of failure-free operation for a specified time in a specified environment for a specified purpose.

Krishna M. Kavi, Robert C. Tausworthe, William W. Everett, Frederick T. Sheldon, Ralph Brettschneider, James T. Yu, Reliability Measurement: From Theory to Practice, IEEE Software 1992, no. 4, p. 13
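
To make the definition quoted above concrete, here is a minimal sketch of reliability as a probability over a specified time. It assumes a constant failure rate (the exponential model, R(t) = exp(-λt)), which is a simplifying assumption of this example and not something the quoted article prescribes; the function name and parameters are hypothetical.

```python
import math


def reliability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """Probability of failure-free operation for `mission_hours`.

    Assumption (not from the quoted article): failures arrive at a constant
    rate, so R(t) = exp(-lambda * t).
    """
    if failure_rate_per_hour < 0 or mission_hours < 0:
        raise ValueError("failure rate and mission time must be non-negative")
    return math.exp(-failure_rate_per_hour * mission_hours)


if __name__ == "__main__":
    # Illustrative numbers only: one failure per 1000 hours on average,
    # evaluated for a 100-hour period of operation.
    print(f"R(100 h) = {reliability(0.001, 100):.4f}")
```

The "specified environment" and "specified purpose" in the definition are not modeled here; in practice they determine which failure rate applies.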

Reliability is the statistical study of failures, which occur because of some defect in the program. The failure is evident, but you don’t know what mistake is responsible or what you can do to make the failure disappear. Reliability models are supposed to tell you what confidence you can have in the program’s correctness.

Dick Hamlet, Are We Testing for True Reliability?, IEEE Software 1992, no. 4, p. 21

Three separate but related functions comprise an integrated reliability program: prediction, control, and assessment.

Ted W. Keller, Norman F. Schneidewind, Applying Reliability Models to the Space Shuttle, IEEE Software 1992, no. 4, p. 28

Not a day goes by that the general public does not come into contact with a real-time system. As their numbers and importance grow, so do the implications for software developers.

William W. Everett, Shinichi Honiden, Guest Editors' Introduction: Reliability and Safety of Real-Time Systems, IEEE Software 1995, no. 3, p. 13

Conventional software reliability assessment validates a system's reliability only at the end of development, resulting in costly defect correction. A … statistical model checking (SMC) … validate reliability at an early stage. SMC computes the probability that a target system will satisfy functional-safety requirements.

Tai-Hyo Kim, Jongmoon Baik, Moonzoo Kim, Okjoo Choi, Youngjoo Kim, Validating Software Reliability Early through Statistical Model Checking, IEEE Software 2013, no. 3, p. 35
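
The idea that SMC "computes the probability" of satisfying a requirement can be sketched as a Monte Carlo estimate over simulated runs. This is one common flavor of statistical model checking and may differ from the approach in the cited article. The toy system (`toy_run`), its step failure probability, and all function names are hypothetical; the sample size uses the standard Chernoff-Hoeffding bound.

```python
import math
import random
from typing import Callable


def required_runs(epsilon: float, delta: float) -> int:
    """Chernoff-Hoeffding sample size: with this many independent runs, the
    estimate is within `epsilon` of the true probability with confidence
    at least 1 - `delta`."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_probability(
    run_once: Callable[[random.Random], bool],
    epsilon: float,
    delta: float,
    seed: int = 0,
) -> float:
    """Estimate the probability that a simulated run satisfies the property."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = required_runs(epsilon, delta)
    successes = sum(1 for _ in range(n) if run_once(rng))
    return successes / n


def toy_run(rng: random.Random) -> bool:
    """Hypothetical system: 20 steps, each failing independently with
    probability 0.01. The property holds if no step fails."""
    return all(rng.random() >= 0.01 for _ in range(20))


if __name__ == "__main__":
    estimate = estimate_probability(toy_run, epsilon=0.02, delta=0.05)
    print(f"estimated probability of satisfying the property: {estimate:.3f}")
```

For this toy system the true probability is 0.99^20, roughly 0.818, so the estimate can be checked by hand. A real SMC tool would execute a model of the target system where `toy_run` is used here.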

An operational profile describes how users employ a system … The operational profile is a quantitative characterization of how a system will be used that shows how to increase productivity and reliability and speed development by allocating development resources to function on the basis of use. Using an operational profile to guide testing ensures … the most-used operations will have received the most testing …

John D. Musa, Operational Profiles in Software-Reliability Engineering, IEEE Software 1993, no. 2, p. 14
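
As an illustration of using an operational profile to guide testing, the sketch below splits a test budget across operations in proportion to how often each is used. The operation names and weights are invented for the example, and the rounding rule (largest remainder) is one reasonable choice, not one taken from the article.

```python
from typing import Dict


def allocate_tests(profile: Dict[str, float], total_tests: int) -> Dict[str, int]:
    """Split `total_tests` across operations in proportion to their usage weight.

    `profile` maps an operation name to its relative frequency of use
    (probabilities or raw counts; only the ratios matter).
    """
    total_weight = sum(profile.values())
    if total_weight <= 0 or total_tests < 0:
        raise ValueError("profile needs positive total weight and a non-negative budget")

    exact = {op: total_tests * w / total_weight for op, w in profile.items()}
    allocation = {op: int(share) for op, share in exact.items()}  # round down

    # Hand the tests lost to rounding to the operations with the largest
    # fractional parts, so the allocation sums exactly to the budget.
    leftover = total_tests - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda op: exact[op] - allocation[op], reverse=True)
    for op in by_remainder[:leftover]:
        allocation[op] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical profile: relative usage of three operations.
    usage = {"search": 6, "checkout": 3, "admin": 1}
    print(allocate_tests(usage, total_tests=50))
```

With this allocation the most-used operation receives the most tests, which is the property the quote highlights.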

Most developers either aren't familiar with reliability models or don't know how to select and apply them. But the need for accurate predictions is acute, focusing attention on this comparatively young field.

Pradip K. Srimani, Yashwant K. Malaiya, Guest Editors' Introduction: Steps to Practical Reliability Measurement, IEEE Software 1992, no. 4, p. 10

2024 (2)

Software Engineering / GOTO Conference Videos (FREE)

2024

Cost vs Stability in a Cloud Environment
GOTO Conference Videos (FREE) 2024; by Cat Swetel

2024

A Field Guide to Reliability Engineering at Zalando
GOTO Conference Videos (FREE) 2024; by Heinrich Hartmann

2023 (5)

Software Engineering / GOTO Conference Videos (FREE)

2023

Why Is My App SLOw? Defining Reliability in Platform Engineering
GOTO Conference Videos (FREE) 2023; by Jez Humble

2023

Practical Magic: The Resilience Potion & Security Chaos Engineering
GOTO Conference Videos (FREE) 2023; by Kelly Shortridge

2023

Expert Talk: Cloud Chaos & How Contract Tests Can Help
GOTO Conference Videos (FREE) 2023; by Holly Cummins, Kevlin Henney

2023

Security Chaos Engineering
GOTO Conference Videos (FREE) 2023; by Kelly Shortridge, Aaron Rinehart, Mark Miller

2023

Getting Started with Chaos Engineering
GOTO Conference Videos (FREE) 2023; by Nora Jones, Casey Rosenthal, James Wickett

2022 (5)

Software Engineering / SE Radio Podcasts (FREE)

2022

Episode 544: Ganesh Datta on DevOps vs Site Reliability Engineering
SE Radio Podcasts (FREE) 2022

Software Engineering / GOTO Conference Videos (FREE)

2022

Managing to Your SLO Amidst Chaos
GOTO Conference Videos (FREE) 2022; by Liz Fong-Jones

2022

Cloud Chaos & Microservices Mayhem
GOTO Conference Videos (FREE) 2022; by Holly Cummins

2022

Expert Talk: Cloud Chaos & How Contract Tests Can Help
GOTO Conference Videos (FREE) 2022; by Holly Cummins, Kevlin Henney

2022

Security Chaos Engineering
GOTO Conference Videos (FREE) 2022; by Kelly Shortridge, Aaron Rinehart, Mark Miller

2021 (15)

Software Engineering / SE Radio Podcasts (FREE)

2021

Episode 453: Aaron Rinehart on Security Chaos Engineering
SE Radio Podcasts (FREE) 2021

Software Engineering / GOTO Conference Videos (FREE)

2021

Beyond Chaos Engineering: Continuous Verification
GOTO Conference Videos (FREE) 2021; by Cat Swetel

2021

Chaos Engineering Panel
GOTO Conference Videos (FREE) 2021; by A. Rinehart, C. Yakomin, C. Nash, D. Lavezzo, K. Shortridge

2021

Cloudy with a Chance of Chaos
GOTO Conference Videos (FREE) 2021; by Christina Yakomin

2021

From Catastrophe to Chaos in Production
GOTO Conference Videos (FREE) 2021; by Kelly Shortridge

2021

Security Chaos Engineering - Winning at Security “Whack-a-Mole”
GOTO Conference Videos (FREE) 2021; by Aaron Rinehart

2021

Security Chaos Engineering: From Theory to Practice
GOTO Conference Videos (FREE) 2021; by Jamie Dicken

2021

Incident Analysis Before Chaos Engineering
GOTO Conference Videos (FREE) 2021; by Nora Jones

2021

The DiRT on Chaos Engineering at Google
GOTO Conference Videos (FREE) 2021; by Jason Cahoon

2021

Combining Chaos, Observability & Resilience to get Chaos Engineering
GOTO Conference Videos (FREE) 2021; by Yury Nio

2021

Improving Business Resiliency with Chaos Engineering
GOTO Conference Videos (FREE) 2021; by Olga Hall

2021

Making Chaos Engineering Boring: Debunking Myths Hampering Adoption
GOTO Conference Videos (FREE) 2021; by Miko Pawlikowski

2021

Risks in Systems Design: Chaos Engineering in Apps & Cloud Security
GOTO Conference Videos (FREE) 2021; by Crystal Hirschorn

2021

Prerequisites for Chaos Engineering
GOTO Conference Videos (FREE) 2021; by Courtney Nash

2021

Leadership During Chaos
GOTO Conference Videos (FREE) 2021; by Ranganathan "Ranga" Balashanmugam

2020 (2)

Software Engineering / GOTO Conference Videos (FREE)

2020

Continuous Verification: Beyond Chaos Engineering
GOTO Conference Videos (FREE) 2020; by Cat Swetel

2020

Getting Started with Chaos Engineering
GOTO Conference Videos (FREE) 2020; by Nora Jones, Casey Rosenthal, James Wickett

2019 (1)

Software Engineering / ACM queue (FREE)

2019

The Reliability of Enterprise Applications
ACM queue (FREE) 2019 (5); by Sanjay Sha

2018 (5)

Software Engineering / SE Radio Podcasts (FREE)

2018

SE Radio Episode 325: Tammy Butow on Chaos Engineering
SE Radio Podcasts (FREE) 2018

Software Engineering / IEEE Software

2018

Tammy Bütow on Chaos Engineering
IEEE Software 2018 (5); by Edaena Salinas

Software Engineering / GOTO Conference Videos (FREE)

2018

GameDays: Practice Thoughtful Chaos Engineering
GOTO Conference Videos (FREE) 2018; by Ho Ming Li

2018

Site Reliability Engineering at Google
GOTO Conference Videos (FREE) 2018; by Christof Leng

2018

Developing a Chaos Architecture Mindset
GOTO Conference Videos (FREE) 2018; by Adrian Cockcroft

2017 (3)

Software Engineering / IEEE Software

2017

Reliability Engineering
IEEE Software 2017 (4); by Xabier Larrucea, Fabien Belmonte, Adam Welc, Tao Xie

2017

Software Reliability Redux
IEEE Software 2017 (4); by Diomidis Spinellis

Software Engineering / GOTO Conference Videos (FREE)

2017

Site Reliability Engineering at Google
GOTO Conference Videos (FREE) 2017; by Christof Leng

2016 (4)

Software Engineering / SE Radio Podcasts (FREE)

2016

SE Radio Episode 276 Björn Rabenstein on Site Reliability Engineering
SE Radio Podcasts (FREE) 2016

Software Engineering / IEEE Software

2016

Chaos Engineering
IEEE Software 2016 (3); by Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, Casey Rosenthal

Software Engineering / GOTO Conference Videos (FREE)

2016

Stability Patterns & Antipatterns
GOTO Conference Videos (FREE) 2016; by Michael T. Nygard

2016

Chaos & Intuition Engineering at Netflix
GOTO Conference Videos (FREE) 2016; by Casey Rosenthal

2013 (1)

Software Engineering / IEEE Software

2013

Validating Software Reliability Early through Statistical Model Checking
IEEE Software 2013 (3); by Young Joo Kim, Okjoo Choi, Moonzoo Kim, Jongmoon Baik, Tai-Hyo Kim

2010 (1)

Software Engineering / IEEE Software

2010

The Rise and Fall of the Chaos Report Figures
IEEE Software 2010 (1); by J. Laurenz Eveleens, Chris Verhoef

2009 (1)

Software Engineering / SE Radio Podcasts (FREE)

2009

Episode 134: Release It with Michael Nygard
SE Radio Podcasts (FREE) 2009

2007 (1)

Software Engineering / IEEE Software

2007

Using Software Reliability Growth Models in Practice
IEEE Software 2007 (6); by Vincent Almering, Michiel van Genuchten, Ger Cloudt, Peter J. M. Sonnemans

2005 (2)

Software Engineering / ACM queue (FREE)

2005

Order from chaos
ACM queue (FREE) 2005 (8); by Natalya Fridman Noy

Software Engineering / IEEE Software

2005

The Virtues of Assessing Software Reliability Early
IEEE Software 2005 (3); by Bojan Cukic

2003 (1)

Software Engineering / IEEE Software

2003

Components and the World of Chaos
IEEE Software 2003 (3); by Rebecca Parsons

2001 (1)

Software Engineering / IEEE Software

2001

From Your Technical Council - Focus on Software Reliability Engineering
IEEE Software 2001 (3); by Melody M. Moore

2000 (1)

Software Engineering / IEEE Software

2000

Bookshelf - Java Application Frameworks Use Case Driven Object, Modeling with UML: A Practical Approach, Chaos and Complexity in Software, Challenging the Industry and the New Science
IEEE Software 2000 (5)

1998 (1)

Software Engineering / IEEE Software

1998

Analyzing and Improving Reliability: A Tree-Based Approach
IEEE Software 1998 (2); by Jeff Tian, Joe Palma

1997 (1)

Software Engineering / IEEE Software

1997

Qualitative and Quantitative Reliability Assessment
IEEE Software 1997 (2); by Karama Kanoun, Mohamed Kaâniche, Jean-Claude Laprie

1996 (3)

Software Engineering / IEEE Software

1996

Reliability Testing of Rule-Based Systems
IEEE Software 1996 (5); by Alberto Avritzer, Johannes P. Ros, Elaine J. Weyuker

1996

Why Software Reliability Predictions Fail
IEEE Software 1996 (4); by Filippo Lanubile

1996

A Generalized Technique for Simulating Software Reliability
IEEE Software 1996 (2); by Robert C. Tausworthe, Michael R. Lyu

1995 (2)

Software Engineering / IEEE Software

1995

Guest Editors' Introduction: Reliability and Safety of Real-Time Systems
IEEE Software 1995 (3); by William W. Everett, Shinichi Honiden

1995

Reliability Through Consistency
IEEE Software 1995 (3); by Kenneth P. Birman, Bradford B. Glade

1993 (2)

Software Engineering / IEEE Software

1993

Operational Profiles in Software-Reliability Engineering
IEEE Software 1993 (2); by John D. Musa

1993

Planning and Certifying Software System Reliability
IEEE Software 1993 (1); by Jesse H. Poore, Harlan D. Mills, David Mutchler

1992 (6)

Software Engineering / IEEE Software

1992

Applying Reliability Models More Effectively
IEEE Software 1992 (4); by Michael R. Lyu, Allen P. Nikora

1992

Using Neural Networks in Reliability Prediction
IEEE Software 1992 (4); by Nachimuthu Karunanithi, L. Darrell Whitley, Yashwant K. Malaiya

1992

Steps to Practical Reliability Measurement - Guest Editors' Introduction
IEEE Software 1992 (4); by Pradip K. Srimani, Yashwant K. Malaiya

1992

Reliability Measurement: From Theory to Practice
IEEE Software 1992 (4); by Frederick T. Sheldon, Krishna M. Kavi, Robert C. Tausworthe, James T. Yu, Ralph Brettschneider, William W. Everett

1992

New Ways to Get Accurate Reliability Measures
IEEE Software 1992 (4); by Sarah Brocklehurst, Bev Littlewood

1992

Applying Reliability Models to the Space Shuttle
IEEE Software 1992 (4); by Norman F. Schneidewind, Ted W. Keller

1990 (2)

Software Engineering / IEEE Software

1990

Software-Reliability Engineering: Technology for the 1990s
IEEE Software 1990 (6); by John D. Musa, William W. Everett

1990

Applying Reliability Measurement: A Case Study
IEEE Software 1990 (2); by Willa K. Ehrlich, S. Keith Lee, Rex H. Molisani

1988 (2)

Software Engineering / IEEE Software

1988

Applying Software-Reliability Models in Industry
IEEE Software 1988 (4)

1988

CASE: Reliability Engineering for Information Systems
IEEE Software 1988 (2); by Elliot J. Chikofsky, Burt L. Rubenstein

Articles in this collection are (co)authored by 91 authors.

History

The article titles in this collection trace how thinking about software reliability and system stability has evolved across several distinct periods.

The Foundations of Software Reliability (1988-2001)

This early period is characterized by a strong academic and methodological focus on "Software Reliability Engineering." The titles frequently emphasize measurement, prediction, and practical application of models to improve software quality. The goal was to quantify and manage the elusive concept of software reliability. We see this in titles like "Applying Software-Reliability Models in Industry" (1988) and "Software-Reliability Engineering: Technology for the 1990s" (1990). The conversation was very much about developing and refining the techniques to ensure software performed as expected, with titles such as "Reliability Measurement: From Theory to Practice" (1992) and "Planning and Certifying Software System Reliability" (1993). Even challenges were framed within this analytical lens, as seen in "Why Software Reliability Predictions Fail" (1996), indicating a continuous effort to refine these early methods. By the turn of the millennium, "Software Reliability Engineering" had cemented itself as a dedicated field, underscored by the "Focus on Software Reliability Engineering" (2001) from a technical council.

Early Encounters with "Chaos" and Broader System Stability (2003-2013)

Apart from a book title mentioned in a 2000 Bookshelf column, the term "Chaos" begins to appear in titles around 2003, though not yet in the context of the engineering discipline it would later become. Initially, it often referred to the inherent complexity and unpredictability within systems or projects, as in "Components and the World of Chaos" (2003) and "Order from chaos" (2005). The focus shifted slightly from purely theoretical reliability models to more practical concerns of release management and early validation, highlighted by "Episode 134: Release It with Michael Nygard" (2009) and "Validating Software Reliability Early through Statistical Model Checking" (2013). While "The Rise and Fall of the Chaos Report Figures" (2010) uses the term "Chaos Report," it refers to project management metrics, not the later engineering practice. This period represents a bridge, where the language of general "chaos" entered the discourse, but the structured approach to "Chaos Engineering" was not yet defined.

The Dawn of Site Reliability Engineering and Chaos Engineering (2016-2018)

This era marks a significant pivot, with the explicit emergence of "Site Reliability Engineering" (SRE) and the formalization of "Chaos Engineering" as distinct, proactive disciplines. Reliability is no longer just about preventing bugs in code, but about keeping complex systems operating. Titles like "SE Radio Episode 276 Björn Rabenstein on Site Reliability Engineering" (2016) and "Site Reliability Engineering at Google" (2017, 2018) clearly indicate SRE's growing prominence, particularly influenced by Google's practices. Simultaneously, "Chaos Engineering" comes into its own, moving beyond general system "chaos" to a deliberate, experimental approach. "Chaos & Intuition Engineering at Netflix" (2016) and "SE Radio Episode 325: Tammy Butow on Chaos Engineering" (2018) point to its pioneering adoption by industry leaders. Titles such as "GameDays: Practice Thoughtful Chaos Engineering" (2018) and "Developing a Chaos Architecture Mindset" (2018) highlight the shift towards embedding these practices into development and operational workflows.

Expanding the Horizon of Chaos Engineering (2020-2021)

Following its formal introduction, Chaos Engineering experienced a rapid expansion and specialization. The sheer volume of titles from 2021 underscores a surge in interest and adoption. Initial guidance on "Getting Started with Chaos Engineering" (2020) quickly evolved into explorations of more advanced and specific applications. A key theme emerging here is "Security Chaos Engineering," a novel application of chaos principles to identify security vulnerabilities, as seen in "Episode 453: Aaron Rinehart on Security Chaos Engineering" (2021) and "Security Chaos Engineering - Winning at Security “Whack-a-Mole”" (2021). Beyond security, Chaos Engineering began to be explicitly linked with broader system qualities, as in "Combining Chaos, Observability & Resilience to get Chaos Engineering" (2021), and with business impact, as in "Improving Business Resiliency with Chaos Engineering" (2021). The discourse also matured to address adoption challenges ("Making Chaos Engineering Boring: Debunking Myths Hampering Adoption", 2021) and next-level practices like "Continuous Verification: Beyond Chaos Engineering" (2020) and "Beyond Chaos Engineering: Continuous Verification" (2021), indicating a move from foundational understanding to widespread implementation and refinement.

Integration, Refinement, and the Cloud Era (2022-2024)

In the most recent period, the focus shifts to integrating Site Reliability Engineering and Chaos Engineering into broader, modern operational paradigms, particularly within cloud environments and platform engineering. We see discussions around how these disciplines relate to each other, as in "Ganesh Datta on DevOps vs Site Reliability Engineering" (2022), signaling a more nuanced understanding of their roles in the software delivery lifecycle. The emphasis on Service Level Objectives (SLOs) becomes prominent for defining and measuring reliability, with titles like "Managing to Your SLO Amidst Chaos" (2022) and "Why Is My App SLOw? Defining Reliability in Platform Engineering" (2023).

The challenges of the cloud era are actively discussed, as in "Cloud Chaos & Microservices Mayhem" (2022), along with the role of contract tests in mitigating them ("Expert Talk: Cloud Chaos & How Contract Tests Can Help", 2022, 2023). Chaos Engineering continues to be applied, notably in "Practical Magic: The Resilience Potion & Security Chaos Engineering" (2023), reinforcing its practical value. Looking to 2024, the discourse incorporates economic considerations ("Cost vs Stability in a Cloud Environment") alongside continued efforts to formalize and share best practices, as evidenced by "A Field Guide to Reliability Engineering at Zalando." This period highlights a shift from defining and adopting these practices to refining them for complex cloud-native architectures, balancing performance, security, and cost, all while leveraging techniques like Chaos Engineering to proactively build resilient systems.