Embedding AI in Reliability Engineering

New Data-Driven Techniques Enhance Traditional Reliability Engineering Techniques

Glance through the most cited articles in the Reliability Engineering & System Safety journal [1] and you may notice the emergence of data-driven techniques such as function approximation, optimization, search, and planning, a collection of techniques broadly studied under the discipline of artificial intelligence. The research community is adopting AI and data-driven approaches to enhance the insights from traditional reliability engineering techniques and advance the field.

While operators are familiar with using data to monitor critical assets, in practice that use is largely limited to real-time monitoring of operations at a control board, rule-based alarms on process variable values, or look-back root cause analysis.

Recent advances in reliability and asset health management that incorporate AI-based insights present a new paradigm that lets operators peer into the future. This new paradigm does not obviate real-time monitoring or historical analysis. Instead, it complements them by enabling predictive insights across the entire value chain to improve the safety, reliability, and availability of these value-generating assets.

Over the last decade, researchers have embraced advances in artificial intelligence and developed techniques to enhance the insights offered by traditional reliability engineering techniques. This will be the decade when we bring these experiments out of the lab.

Reliability Engineering as a Complex Multi-Objective Optimization Challenge

The problem of reliability engineering for physical systems can be posed as a complex multi-objective optimization problem. For a fleet of industrial assets, operators want to:

Maximize uptime – by increasing availability across a fleet.
Minimize unplanned downtime – by predicting impending downtime.
Maximize operations at optimal conditions – for example, maximize yield or power output.
Minimize maintenance costs – by schedule optimization and prioritization of maintenance activities.
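To make the multi-objective framing concrete, here is a minimal Python sketch that filters a handful of candidate maintenance plans down to the Pareto-optimal ones across the four objectives listed above. The plan names and scores are invented purely for illustration, and the `Plan`, `dominates`, and `pareto_front` names are our own; a real formulation would search a far larger space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate maintenance plan with illustrative scores."""
    name: str
    uptime: float            # fraction of time available (maximize)
    unplanned_hours: float   # expected unplanned downtime (minimize)
    output: float            # yield or power output (maximize)
    maintenance_cost: float  # maintenance spend (minimize)


def as_costs(plan):
    # Negate the "maximize" objectives so every entry is "lower is better".
    return (-plan.uptime, plan.unplanned_hours, -plan.output, plan.maintenance_cost)


def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    ca, cb = as_costs(a), as_costs(b)
    return all(x <= y for x, y in zip(ca, cb)) and any(x < y for x, y in zip(ca, cb))


def pareto_front(plans):
    # Keep only the plans that no other plan dominates.
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


plans = [
    Plan("run-to-failure", 0.90, 120.0, 95.0, 10.0),
    Plan("fixed-interval", 0.94, 60.0, 96.0, 40.0),
    Plan("condition-based", 0.97, 20.0, 98.0, 30.0),
    Plan("over-maintain", 0.93, 25.0, 94.0, 80.0),
]
front = pareto_front(plans)
print([p.name for p in front])
```

With these made-up numbers, the cheapest plan and the condition-based plan both survive because neither beats the other on every objective. Choosing between them is a business decision, which is why the problem is multi-objective rather than a single cost to minimize.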

The above objectives are subject to many constraints such as safety, quality, sustainability, and cost. Consider a leading framework for asset health management proposed by Gouriveau and Medjaher [2], which breaks the problem down into five steps (Figure 1).

For any value-generating industrial system, we must first detect and identify the state of the system: whether it is operating in one of many normal operating modes or exhibiting anomalous behavior. Once we have identified the state as undesirable or anomalous, we must diagnose which subsystem is going to fail and how it will fail. Both detection and diagnostics feed into prognostics: given that our system is demonstrating anomalous behavior and we have been able to diagnose why, we can determine how the system will evolve. The final components of this framework are decision scheduling and action planning.

Figure 1. Asset Health Management Framework

Physics-based, Data-driven and Hybrid Techniques for Asset Health Management

All the steps proposed in this framework can be implemented using a variety of techniques, from empirical rules and physics-based techniques, to purely data-driven techniques, to hybrid approaches that combine system physics, expert knowledge, and data. Physics-based methods are inherently explainable since they are grounded in the governing equations describing the system behavior. However, for most complex industrial systems, a closed-form analytical equation governing system operation or degradation is often impractical to derive and solve, greatly limiting the application and scalability of purely physics-based reliability engineering approaches.

On the other hand, purely data-driven approaches can statistically model the probability of failures of these systems but can be less explainable. The trade-offs between physics-based and data-driven approaches have led to the emergence of hybrid approaches that aim to encode system physics and expert rules and use data to surface a digital model of the complex system (Figure 2).

Hybrid approaches promise to leverage the advantages of both modeling approaches. However, their applicability is also limited by whether a harmonious balance between data-driven function approximation and physical system constraints can be achieved. A blanket approach that works well for all physical systems is yet to be derived. The best approach to solving a given problem in this domain depends on the availability of data, the fidelity and complexity of the known system physics, and explainability requirements.
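One simple flavor of hybrid modeling is to keep a physics-based estimate as the backbone and let data learn only what the physics leaves out. The sketch below assumes a hypothetical first-principles model of bearing temperature as a function of load, then fits a least-squares correction on the residuals using ambient temperature. The model form, coefficients, and noise-free synthetic data are all assumptions made for illustration.

```python
def physics_model(load):
    """Hypothetical first-principles estimate: temperature grows linearly with load."""
    return 20.0 + 0.5 * load


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic observations: the "real" asset has an ambient-temperature effect
# that the physics model does not capture (assumed here as 0.3 * ambient - 2.0).
loads = [10, 20, 30, 40, 50, 60]
ambient = [15, 25, 18, 30, 22, 28]
measured = [physics_model(l) + 0.3 * a - 2.0 for l, a in zip(loads, ambient)]

# Data-driven part: model only the residual that the physics cannot explain.
residuals = [m - physics_model(l) for m, l in zip(measured, loads)]
intercept, slope = fit_line(ambient, residuals)


def hybrid_model(load, amb):
    """Physics backbone plus the learned residual correction."""
    return physics_model(load) + intercept + slope * amb


print(round(hybrid_model(35, 20), 3))
```

The physics term stays explainable and the learned correction stays small and inspectable. Whether such a split is appropriate depends on how much of the behavior the known physics actually captures.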

Figure 2. Trade-offs Between Various Approaches to Asset Health Management

Emergence of Cutting-Edge AI Techniques Advances the Field of Reliability Engineering

There is a large body of work that showcases how AI or data-driven techniques can address one or more of the five steps shown in Figure 1.

For example, researchers from NASA’s Jet Propulsion Lab [3] have demonstrated the application of Long Short-Term Memory (LSTM) neural networks for detecting anomalies in spacecraft systems such as the Soil Moisture Active Passive satellite and the Mars Science Laboratory rover, Curiosity. With advances in networking, computing, and storage, NASA is recording and storing an ever-increasing amount of telemetry data from these systems, on the order of terabytes per day. Traditional approaches to monitoring these spacecraft systems have required extensive expert knowledge and labor to define and update normal operating conditions, and they are prone to missing anomalies that occur within these defined limits. Multivariate temporal context matters: AI approaches stand out since they can operate at this immense scale and be tuned to capture anomalies considering both the spatial density and temporal context of the telemetry data.
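The thresholding idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method from [3]: the predictions are supplied directly rather than produced by an LSTM, the multiplier `z` is fixed rather than selected from the data, and there is no pruning step. The function names, window size, and synthetic telemetry are all assumptions.

```python
import statistics


def ewma(values, alpha):
    """Exponentially weighted moving average to smooth prediction errors."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


def flag_anomalies(actual, predicted, window=30, z=4.0, alpha=0.5):
    """Flag time steps whose smoothed error exceeds a threshold recomputed
    from the preceding `window` smoothed errors (mean + z * std)."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    smoothed = ewma(errors, alpha)
    flagged = []
    for t in range(window, len(smoothed)):
        history = smoothed[t - window:t]
        threshold = statistics.mean(history) + z * statistics.pstdev(history)
        if smoothed[t] > threshold:
            flagged.append(t)
    return flagged


# Synthetic telemetry: a repeating pattern, small deterministic noise,
# and one injected spike at index 60. The "predicted" values stand in
# for the output of a forecasting model.
n = 80
predicted = [[0.0, 1.0, 0.0, -1.0][i % 4] for i in range(n)]
noise = [0.05 * ((i * 7) % 5 - 2) for i in range(n)]
actual = [p + e for p, e in zip(predicted, noise)]
actual[60] += 8.0

anomalies = flag_anomalies(actual, predicted)
print(anomalies)
```

Because the threshold is recomputed from recent errors rather than fixed by an expert, it adapts as the noise level of the channel changes, which is the property that makes this style of detection attractive at scale.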

Similarly, for diagnostics we see works such as that from Manjurul Islam and Jong-Myon Kim [4], where they demonstrate state-of-the-art classification performance using heterogeneous features and a one-against-all multiclass support vector machine for fault diagnosis of bearings. There is also a significant effort to jointly tackle multiple pieces of the reliability puzzle, such as the approach proposed by Nguyen and Medjaher [5], which combines failure prognostics and maintenance decisions. Their approach aims to provide the probabilities of system failure at different time horizons, allowing them to formalize and optimize their maintenance policies. Maintenance decision scheduling and action planning are yet another exciting and fitting application of AI-based planning. In this area we see works such as Papakonstantinou and Shinozuka [6], who use dynamic programming and partially observable Markov decision processes to devise optimal maintenance strategies.
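A core building block of the partially observable Markov decision process formulation is the belief update: the operator never sees the true asset state, only noisy observations, and maintains a probability over hidden states. The sketch below shows that Bayesian update for a two-state asset. The states, transition probabilities, observation probabilities, and the 0.6 repair threshold are all hypothetical; an actual POMDP solution would compute the policy rather than hard-code a threshold.

```python
STATES = ("healthy", "degraded")

# Hypothetical transition model: TRANSITION[action][state][next_state]
TRANSITION = {
    "run": {
        "healthy": {"healthy": 0.9, "degraded": 0.1},
        "degraded": {"healthy": 0.0, "degraded": 1.0},
    },
    "repair": {
        "healthy": {"healthy": 1.0, "degraded": 0.0},
        "degraded": {"healthy": 1.0, "degraded": 0.0},
    },
}

# Hypothetical observation model: OBSERVATION[state][observation]
OBSERVATION = {
    "healthy": {"normal": 0.8, "alarm": 0.2},
    "degraded": {"normal": 0.3, "alarm": 0.7},
}


def update_belief(belief, action, observation):
    """Bayes filter: predict with the transition model, then weight by the
    likelihood of the observation and renormalize."""
    predicted = {
        s2: sum(TRANSITION[action][s1][s2] * belief[s1] for s1 in STATES)
        for s2 in STATES
    }
    unnormalized = {s: OBSERVATION[s][observation] * predicted[s] for s in STATES}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


belief = {"healthy": 1.0, "degraded": 0.0}
for obs in ["normal", "alarm", "alarm"]:
    belief = update_belief(belief, "run", obs)

# Illustrative threshold policy (an assumption, not an optimized policy).
action = "repair" if belief["degraded"] > 0.6 else "run"
print(round(belief["degraded"], 3), action)
```

After one normal reading and two alarms, the belief that the asset is degraded rises to roughly 0.72 under these assumed probabilities, which crosses the illustrative repair threshold.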

Steps of Asset Health Management	Example of Applied AI Technique
Detection	Neural networks to detect anomalies in spacecraft systems with a dynamic thresholding approach
Diagnosis	Support vector machines for fault diagnosis of bearings in rotating machinery using a dynamic reliability measure computed with a nearest neighbor approach.
Prognosis and Decision Scheduling	Novel approach of combining prognosis and operations planning using neural networks to provide probabilities of failures at different time horizons to schedule optimal maintenance strategies
Decision Scheduling and Action Planning	Partially observable Markov decision process formulation to optimize inspection and maintenance activities in sequential decision making

Figure 3. Summary of Emerging AI Techniques for Asset Health Management

Advances in machine learning interpretability further help bridge the gap between purely physics-based and purely data-driven reliability engineering approaches. Improvements in interpretability include approaches like local surrogate models, Shapley additive explanations, and counterfactual explanations. These advancements get us closer to explainable data-driven techniques. Another line of work, championed by Brunton et al. [7], digs into the idea that we can discover governing equations from data, in other words, learn a model of the system dynamics that we can use for reliability engineering applications. Imagine observing a video of a pendulum oscillating and being able to derive the equation of pendulum motion:

$$\theta'' = -\frac{g}{L}\sin(\theta)$$
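As a toy illustration of recovering the pendulum equation from data, the Python sketch below simulates a pendulum, estimates derivatives from the sampled angles by finite differences, and fits the acceleration against a tiny library of two candidate terms. This is a minimal least-squares sketch with a hard threshold, not the full sparse identification procedure of [7]; the value of g/L, the candidate terms, and the 0.1 cutoff are assumptions chosen for the example.

```python
import math

G_OVER_L = 9.81  # assumed value (g = 9.81 m/s^2, L = 1 m), used only to generate data


def simulate_pendulum(theta0, dt, steps):
    """Integrate theta'' = -(g/L) sin(theta) with RK4 and return the sampled angles."""
    def deriv(theta, omega):
        return omega, -G_OVER_L * math.sin(theta)

    theta, omega = theta0, 0.0
    samples = [theta]
    for _ in range(steps):
        k1t, k1w = deriv(theta, omega)
        k2t, k2w = deriv(theta + 0.5 * dt * k1t, omega + 0.5 * dt * k1w)
        k3t, k3w = deriv(theta + 0.5 * dt * k2t, omega + 0.5 * dt * k2w)
        k4t, k4w = deriv(theta + dt * k3t, omega + dt * k3w)
        theta += dt * (k1t + 2 * k2t + 2 * k3t + k4t) / 6
        omega += dt * (k1w + 2 * k2w + 2 * k3w + k4w) / 6
        samples.append(theta)
    return samples


dt = 0.001
theta = simulate_pendulum(theta0=1.0, dt=dt, steps=5000)

# Estimate velocity and acceleration from the angle samples alone.
rows = []
for i in range(1, len(theta) - 1):
    vel = (theta[i + 1] - theta[i - 1]) / (2 * dt)
    acc = (theta[i + 1] - 2 * theta[i] + theta[i - 1]) / dt ** 2
    rows.append((math.sin(theta[i]), vel, acc))

# Least squares for acc = a * sin(theta) + b * vel via the 2x2 normal equations.
s11 = sum(s * s for s, _, _ in rows)
s12 = sum(s * v for s, v, _ in rows)
s22 = sum(v * v for _, v, _ in rows)
b1 = sum(s * a for s, _, a in rows)
b2 = sum(v * a for _, v, a in rows)
det = s11 * s22 - s12 * s12
coef_sin = (b1 * s22 - b2 * s12) / det
coef_vel = (s11 * b2 - s12 * b1) / det

# Keep only terms with non-negligible coefficients (the "sparse" step).
candidates = (("sin(theta)", coef_sin), ("theta'", coef_vel))
model = {name: c for name, c in candidates if abs(c) > 0.1}
print(model)
```

The fit keeps only the sin(θ) term, with a coefficient close to the assumed -g/L, and discards the velocity term because the simulated pendulum has no damping. Real measurements are noisy, so derivative estimation and term selection need considerably more care than this sketch suggests.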

Cutting-edge AI techniques have demonstrated value in solving pieces of the reliability engineering puzzle. The technology is proven, and it’s time to operationalize it for the enterprise.

An AI Application Aimed at Improving Reliability of Physical Systems

It is an exciting time to be working at the intersection of physical systems and AI. In this blog, we have discussed a leading framework for the health management of engineering systems and highlighted several examples from the last decade which showcase the ongoing renaissance of AI-based asset reliability.

Thanks to the immense progress in the field, we believe that the next decade is when we operationalize these innovations. At the end of the day, models, data, and AI all inform human decision-makers like engineers, equipment operators, and maintenance planners in the field. There needs to be a scalable system that incorporates these approaches and surfaces insights in a unified manner to the people who act on them. This is how we’ll bring these experiments out of the lab and generate business value!

At C3 AI, we have begun this work with the C3 AI Reliability Suite, which includes C3 AI Reliability, our AI-enabled predictive maintenance application that allows organizations to flexibly implement a variety of reliability engineering and asset health management techniques to monitor and maintain fleets of business-critical industrial assets.

An AI application aimed at improving the reliability of physical systems must support a variety of modeling techniques and provide a central framework to surface actionable insights.

In the next blog, we explore how the C3 AI Reliability application supports AI-based asset reliability and its implementation at scale.

Further Reading & Acknowledgements

[1] “Reliability Engineering & System Safety.” Reliability Engineering & System Safety | Journal | ScienceDirect.com by Elsevier, https://www.sciencedirect.com/journal/reliability-engineering-and-system-safety.

[2] Gouriveau, Rafael, and Kamal Medjaher. “Chapter 2: Prognostics. Part: Industrial Prognostic - An Overview.” Maintenance Modelling and Applications, edited by J. Andrews, Ch. Bérenguer, and L. Jackson, Det Norske Veritas (DNV), 2011, pp. 10–30. ISBN 978-82-515-0316-7.

[3] Hundman, Kyle, et al. “Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, https://doi.org/10.1145/3219819.3219845.

[4] Manjurul Islam, M.M., and Jong-Myon Kim. “Reliable Multiple Combined Fault Diagnosis of Bearings Using Heterogeneous Feature Models and Multiclass Support Vector Machines.” Reliability Engineering & System Safety, vol. 184, 2019, pp. 55–66, https://doi.org/10.1016/j.ress.2018.02.012.

[5] Nguyen, Khanh T.P., and Kamal Medjaher. “A New Dynamic Predictive Maintenance Framework Using Deep Learning for Failure Prognostics.” Reliability Engineering & System Safety, vol. 188, 2019, pp. 251–262, https://doi.org/10.1016/j.ress.2019.03.018.

[6] Papakonstantinou, K.G., and M. Shinozuka. “Planning Structural Inspection and Maintenance Policies via Dynamic Programming and Markov Processes. Part I: Theory.” Reliability Engineering & System Safety, vol. 130, 2014, pp. 202–213, https://doi.org/10.1016/j.ress.2014.04.005.

[7] Brunton, Steven L., et al. “Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems.” Proceedings of the National Academy of Sciences, vol. 113, no. 15, 2016, pp. 3932–3937, https://doi.org/10.1073/pnas.1517384113.

About the Authors

Shrey Satpathy (author) is a Senior Data Science Instructor at C3 AI, where he works with customers and customer-facing teams to implement cutting-edge machine learning applications on the C3 AI Platform. He holds Bachelor's and Master's degrees in Nuclear Engineering and has deep expertise in computational modeling of thermal hydraulic systems. He also holds a Master's degree in computer science focusing on machine learning and artificial intelligence. He is excited to be innovating at this intersection of Physics and Artificial Intelligence.

Lisa Luh (editor) is a Senior Product Marketing Manager at C3 AI, working primarily on the C3 AI Reliability Suite. Prior to C3, Lisa worked in business development for IBM’s cloud business. Lisa has an MBA from the Wharton School of the University of Pennsylvania and a B.S. from the Haas School of Business, University of California, Berkeley.