Supriya Ghosh (Editor)

Resilient control systems

In our modern society, computerized or digital control systems have been used to reliably automate many of the industrial operations that we take for granted, from the power plant to the automobiles we drive. However, the complexity of these systems and how designers integrate them, the roles and responsibilities of the humans that interact with them, and the cyber security of these highly networked systems have led to a new paradigm in research philosophy for next-generation control systems. Resilient control systems consider all of these elements, along with the disciplines that contribute to a more effective design, such as cognitive psychology, computer science, and control engineering, to develop interdisciplinary solutions. These solutions consider such things as how to tailor the control system operating displays to best enable the user to make an accurate and reproducible response, how to design in cyber security protections such that the system defends itself from attack by changing its behaviors, and how to better integrate widely distributed computer control systems to prevent cascading failures that result in disruptions to critical industrial operations. In the context of cyber-physical systems, resilient control systems are an aspect that focuses on the unique interdependencies of a control system, as compared to information technology computer systems and networks, due to its importance in operating critical industrial operations.

Introduction

Originally intended to provide a more efficient mechanism for controlling industrial operations, the development of digital control systems allowed for flexibility in integrating distributed sensors and operating logic while maintaining a centralized interface for human monitoring and interaction. This ease of readily adding sensors and logic through software, which was once done with relays and isolated analog instruments, has led to wide acceptance and integration of these systems in all industries. However, these digital control systems have often been integrated in phases to cover different aspects of an industrial operation and connected over a network, leading to a complex, interconnected and interdependent system. While the control theory applied is often nothing more than a digital version of its analog counterpart, the dependence of digital control systems upon communications networks has precipitated the need for cybersecurity due to potential effects on the confidentiality, integrity and availability of information. Achieving resilience in the next generation of control systems will therefore require addressing the complex control system interdependencies, including human systems interaction and cyber security, a recognized challenge.

Defining resilience

Research in resilience engineering over the last decade has focused on two areas: organizational and information technology. Organizational resilience considers the ability of an organization to adapt and survive in the face of threats, including the prevention or mitigation of unsafe, hazardous or compromising conditions that threaten its very existence. Information technology resilience has been considered from a number of standpoints. Networking resilience has been considered as quality of service. Computing has considered such issues as dependability and performance in the face of unanticipated changes. However, based upon the application of control dynamics to industrial processes, functionality and determinism are primary considerations that are not captured by the traditional objectives of information technology.

Considering the paradigm of control systems, one suggested definition holds that "Resilient control systems are those that tolerate fluctuations via their structure, design parameters, control structure and control parameters". However, this definition is taken from the perspective of control theory application to a control system. It does not directly consider the malicious actor or cyber security, which might suggest another proposed definition, "an effective reconstitution of control under attack from intelligent adversaries". That definition, in turn, focuses only on resilience in response to a malicious actor. To consider the cyber-physical aspects of a control system, a definition for resilience must consider both benign and malicious human interaction, in addition to the complex interdependencies of the control system application.

The term "recovery" has been used in the context of resilience, paralleling a rubber ball's ability to stay intact when a force is exerted on it and to recover its original dimensions after the force is removed. Considering the rubber ball as a system, resilience could then be defined as its ability to maintain a desired level of performance or normalcy without irrecoverable consequences. While resilience in this context is based upon the yield strength of the ball, control systems require an interaction with the environment, namely the sensors, valves and pumps that make up the industrial operation. To be reactive to this environment, a control system requires an awareness of its state in order to make corrective changes to the industrial process and maintain normalcy. With this in mind, in consideration of the discussed cyber-physical aspects of human systems integration and cyber security, as well as other definitions for resilience at a broader critical infrastructure level, the following can be deduced as a definition of a resilient control system:

"A resilient control system is one that maintains state awareness and an accepted level of operational normalcy in response to disturbances, including threats of an unexpected and malicious nature"

Considering the flow of a digital control system as a basis, a resilient control system framework can be designed. Referring to the left side of Fig. 1, a resilient control system holistically considers the measures of performance or normalcy for the state space. At the center, an understanding of performance and priority provides the basis for an appropriate response by a combination of human and automation, embedded within a multi-agent, semi-autonomous framework. Finally, to the right, information must be tailored to the consumer to address the need and elicit a desirable response. Several examples or scenarios of how resilience differs from and benefits control system design are available in the literature.

Areas of resilience

Some primary tenets of resilience, as contrasted with traditional reliability, have presented themselves in considering an integrated approach to resilient control systems. These cyber-physical tenets complement the fundamental concept of dependable or reliable computing by characterizing resilience in regard to control system concerns, including design considerations that provide a level of understanding and assurance in the safe and secure operation of an industrial facility. These tenets are discussed individually below to summarize some of the challenges that must be addressed in order to achieve resilience.

Human systems

The benign human has the ability to quickly understand novel solutions and to adapt to unexpected conditions. This behavior can provide additional resilience to a control system, but reproducibly predicting human behavior is a continuing challenge. The ability to capture historic human preferences can be applied through Bayesian inference and Bayesian belief networks, but ideally a solution would consider direct understanding of human state using sensors such as an electroencephalogram (EEG). Considering control system design and interaction, the goal would be to tailor the amount of automation necessary to achieve some level of optimal resilience for this mixed-initiative response. The human would be presented with the actionable information that provides the basis for a targeted, reproducible response.
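To make the Bayesian idea concrete, the following is a minimal sketch, assuming a simple Beta-Bernoulli model of operator reliability estimated from historical alarm responses. The function names, prior, and threshold are hypothetical illustrations, not taken from any published design:

```python
# Hypothetical sketch: infer operator reliability from historical outcomes
# with a Beta-Bernoulli update, then tailor the automation level.

def update_belief(alpha, beta, responded_correctly):
    """Bayesian update of a Beta(alpha, beta) belief from one observation."""
    if responded_correctly:
        return alpha + 1, beta
    return alpha, beta + 1

def automation_level(alpha, beta, threshold=0.8):
    """Increase automation when expected operator reliability drops."""
    expected_reliability = alpha / (alpha + beta)
    return "advisory" if expected_reliability >= threshold else "automated"

# Start from an uninformative Beta(1, 1) prior and fold in past outcomes.
alpha, beta = 1, 1
for outcome in [True, True, False, True, True]:
    alpha, beta = update_belief(alpha, beta, outcome)

print(automation_level(alpha, beta))  # expected reliability 5/7 ≈ 0.71 → "automated"
```

In a real design the observation model would be far richer (response time, context, physiological sensing), but the same update-then-decide structure applies.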

Cyber security

In contrast to the challenges of predicting and integrating the benign human with control systems, the ability of the malicious actor (or hacker) to undermine desired control system behavior also creates a significant challenge to control system resilience. Application of the dynamic probabilistic risk analysis used in human reliability can provide some basis for the benign actor. However, the decidedly malicious intentions of an adversarial individual, organization or nation make the human difficult to model, variable in both objectives and motives. In defining a control system response to such intentions, note that the malicious actor depends on some level of recognizable behavior to gain an advantage and establish a pathway to undermining the system. Whether performed separately in preparation for a cyber attack, or on the system itself, this reconnaissance can provide the opportunity for a successful attack without detection. Therefore, in considering resilient control system architectures, atypical designs that embed actively and passively implemented randomization of attributes would be suggested to reduce this advantage.
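The randomization-of-attributes idea can be sketched as follows. This is a minimal, hypothetical illustration of denying the attacker predictable timing and addressing: the jitter bounds, port pool, and function names are assumptions, not a specific published mechanism:

```python
# Hypothetical sketch: randomize observable control system attributes --
# here a sensor polling interval and a listening port -- so an adversary
# cannot rely on fixed timing or addressing to plan an attack.
import random

def randomized_poll_interval(base_s=1.0, jitter=0.25, rng=random):
    """Return a polling interval jittered around the nominal value."""
    return base_s * (1.0 + rng.uniform(-jitter, jitter))

def rotate_port(port_pool, rng=random):
    """Pick the next listening port from an approved pool."""
    return rng.choice(sorted(port_pool))

rng = random.Random(42)  # seeded here only so the sketch is reproducible
intervals = [randomized_poll_interval(rng=rng) for _ in range(3)]
assert all(0.75 <= t <= 1.25 for t in intervals)
print(rotate_port({502, 503, 504}, rng=rng) in {502, 503, 504})
```

The design tension is that randomization must stay within bounds the control loop itself can tolerate; the jitter that confuses an attacker must not destabilize the process.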

Complex networks and networked control systems

While much of the current critical infrastructure is controlled by a web of interconnected control systems, under architectures termed distributed control systems (DCS) or supervisory control and data acquisition (SCADA), the application of control is moving toward a more decentralized state. In moving to a smart grid, the complex interconnected nature of individual homes, commercial facilities and diverse power generation and storage creates both an opportunity and a challenge for ensuring that the resulting system is more resilient to threats. The ability to operate these systems to achieve a global optimum for multiple considerations, such as overall efficiency, stability and security, will require mechanisms to holistically design complex networked control systems. Multi-agent methods suggest a mechanism to tie a global objective to distributed assets, allowing for management and coordination of assets for optimal benefit, with semi-autonomous but constrained controllers that can react rapidly to maintain resilience under rapidly changing conditions.
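One simple way to picture tying a global objective to semi-autonomous, constrained controllers is the following sketch, in which a coordinator repeatedly splits the remaining global imbalance among agents and each agent clips the request to its own local limits. All quantities, units and names are hypothetical:

```python
# Hypothetical sketch: distributed agents, each a constrained controller,
# that together track a global generation target. The coordinator proposes
# equal shares of the imbalance; each agent enforces its own limits locally,
# and any shortfall is redistributed on the next round.

def coordinate(setpoints, limits, global_target, rounds=10):
    setpoints = list(setpoints)
    for _ in range(rounds):
        imbalance = global_target - sum(setpoints)
        if abs(imbalance) < 1e-9:
            break
        share = imbalance / len(setpoints)
        for i, (lo, hi) in enumerate(limits):
            # Local constraint enforcement: the agent stays semi-autonomous.
            setpoints[i] = min(hi, max(lo, setpoints[i] + share))
    return setpoints

# Three generators (MW), each with local limits, tracking a 250 MW target.
result = coordinate([50, 80, 60], [(0, 100), (0, 100), (0, 120)], 250)
print(round(sum(result), 1))  # → 250.0
```

Real multi-agent schemes add communication constraints, pricing or consensus mechanisms, and stability guarantees, but the split-propose-clip loop captures the constrained-coordination idea.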

Base metrics for resilient control systems

Establishing a metric that can capture the resilience attributes can be complex, at least if considered based upon differences between the interactions or interdependencies. Evaluating the control, cyber and cognitive disturbances, especially if considered from a disciplinary standpoint, leads to measures that have already been established. However, if the metric were instead based upon a normalizing dynamic attribute, such as a performance characteristic that can be impacted by degradation, an alternative is suggested. Specifically, applications of base metrics to resilience characteristics are given as follows for each type of disturbance:

  • Physical disturbances:
      • Time latency affecting stability
      • Data integrity affecting stability
  • Cyber disturbances:
      • Time latency
      • Data confidentiality, integrity and availability
  • Cognitive disturbances:
      • Time latency in response
      • Data digression from desired response

Such performance characteristics exist for both time and data integrity. Time, both in terms of delay of mission and communications latency, and data, in terms of corruption or modification, are normalizing factors. In general, the idea is to base the metric on "what is expected" and not necessarily the actual initiator of the degradation. Considering time as a metrics basis, resilient and un-resilient systems can be observed in Fig. 2.

Dependent upon the abscissa metrics chosen, Fig. 2 reflects a generalization of the resiliency of a system. Several common terms are represented in this graphic, including robustness, agility, adaptive capacity, adaptive insufficiency, resiliency and brittleness. To overview these terms, the following explanations are provided below:

  • Agility: The derivative of the disturbance curve, defining the ability of the system both to resist degradation on the downward slope and to recover on the upward slope. Primarily a time-based term that indicates impact to the mission, considering both short-term system actions and longer-term human responder actions.
  • Adaptive capacity: The ability of the system to adapt or transform from impact and maintain minimum normalcy. Considered a value between 0 and 1, where 1 is fully operational and 0 is the resilience threshold.
  • Adaptive insufficiency: The inability of the system to adapt or transform from impact, indicating an unacceptable performance loss due to the disturbance. Considered a value between 0 and -1, where 0 is the resilience threshold and -1 is total loss of operation.
  • Brittleness: The area under the disturbance curve as intersected by the resilience threshold, indicating the impact from the loss of operational normalcy.
  • Phases of resilient control system preparation and disturbance response:
      • Recon: Maintaining proactive state awareness of system conditions and degradation
      • Resist: System response to recognized conditions, both to mitigate and to counter
      • Respond: System degradation has been stopped and system performance is being returned
      • Restore: Longer-term performance restoration, which includes equipment replacement
  • Resiliency: The converse of brittleness, which for a resilient system is "zero" loss of minimum normalcy.
  • Robustness: A positive or negative number associated with the area between the disturbance curve and the resilience threshold, indicating either the adaptive capacity or insufficiency, respectively.

On the abscissa of Fig. 2, it can be recognized that cyber and cognitive influences can affect both the data and the time, which underscores the relative importance of recognizing these forms of degradation in resilient control designs. For cybersecurity, a single cyberattack can degrade a control system in multiple ways. Additionally, control impacts can be characterized as indicated. While these terms are fundamental and may seem of little value for those correlating impact in terms such as cost, the development of use cases provides a means by which this relevance can be codified. For example, given the impact to system dynamics or data, the performance of the control loop can be directly ascertained, showing the approach to instability and the operational impact.
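The area-based terms lend themselves to direct computation. The following is a minimal sketch, assuming a performance curve sampled over time and normalized so that 1.0 is fully operational and the resilience threshold is 0.0, matching the adaptive capacity and insufficiency scales; the function names and sample data are hypothetical:

```python
# Hypothetical sketch: area-based resilience metrics from a sampled
# performance curve, using trapezoidal integration.

def robustness(times, performance, threshold=0.0):
    """Signed area between the performance curve and the threshold."""
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        mid = 0.5 * ((performance[i] - threshold) + (performance[i - 1] - threshold))
        area += mid * dt
    return area

def brittleness(times, performance, threshold=0.0):
    """Area below the threshold only (loss of minimum normalcy).
    Clipping at sample points approximates segments that cross the threshold."""
    clipped = [min(p - threshold, 0.0) for p in performance]
    return -robustness(times, clipped, 0.0)

# A disturbance at t=2 drives performance below the threshold, then recovers.
t = [0, 1, 2, 3, 4, 5]
p = [1.0, 1.0, -0.4, -0.2, 0.5, 1.0]
print(round(brittleness(t, p), 2))  # → 0.6
```

A resilient trajectory would keep the clipped area at zero, i.e., brittleness of zero, consistent with the definition of resiliency above.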

Examples of resilient control system developments

1) In current digital control system designs, the cyber security of these systems depends upon what are considered border protections, i.e., firewalls, passwords, etc. If a malicious actor compromised the digital control system for an industrial operation via a man-in-the-middle attack, data could be corrupted within the control system. The industrial facility operator would have no way of knowing the data had been compromised until someone such as a security engineer recognized that an attack was occurring. As operators are trained to provide a prompt, appropriate response to stabilize the industrial facility, the corrupted data would likely lead the operator to react to the situation and cause a plant upset. In a resilient control system, as per Fig. 1, cyber and physical data are fused to recognize anomalous situations and warn the operator.
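The cyber-physical fusion in this example can be sketched as combining a physical-side residual check with a cyber-side indicator. The signals, limits and labels below are hypothetical illustrations of the idea, not a specific detection product:

```python
# Hypothetical sketch: fuse a physical residual (model prediction vs.
# reported measurement) with a cyber indicator (e.g., anomalous network
# traffic) to distinguish likely data corruption from a genuine upset.

def classify(reported, predicted, residual_limit, cyber_alert):
    residual = abs(reported - predicted)
    if residual > residual_limit and cyber_alert:
        return "possible data integrity attack"
    if residual > residual_limit:
        return "process anomaly"
    return "normal"

# A man-in-the-middle corrupts a level reading while traffic looks abnormal.
print(classify(reported=82.0, predicted=55.0, residual_limit=5.0,
               cyber_alert=True))  # → possible data integrity attack
```

The value of the fusion is in the warning given to the operator: the same large residual is presented differently depending on whether the cyber side corroborates it.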

2) As our society becomes more automated, driven by factors including energy efficiency, the need to implement ever more effective control algorithms naturally follows. However, advanced control algorithms depend upon data from multiple sensors to predict the behaviors of the industrial operation and make corrective responses. This type of system can become very brittle, insofar as any unrecognized degradation in the sensor itself can lead to incorrect responses by the control algorithm and potentially a worsened condition relative to the desired operation of the industrial facility. Therefore, implementation of advanced control algorithms in a resilient control system also requires the implementation of diagnostic and prognostic architectures to recognize sensor degradation, as well as failures of the industrial process equipment associated with the control algorithms.
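A very simple form of such a diagnostic is redundancy-based voting, flagging a sensor that disagrees with its peers before the control algorithm acts on its data. The tag names and tolerance below are hypothetical:

```python
# Hypothetical sketch: flag a drifting sensor by comparing each reading
# to the median of a redundant set measuring the same process variable.
from statistics import median

def degraded_sensors(readings, tolerance):
    """Return the tags of sensors that disagree with the group median."""
    m = median(readings.values())
    return sorted(k for k, v in readings.items() if abs(v - m) > tolerance)

# Three temperature sensors on the same vessel; sensor "T2" has drifted.
readings = {"T1": 101.2, "T2": 88.0, "T3": 100.7}
print(degraded_sensors(readings, tolerance=3.0))  # → ['T2']
```

Prognostic architectures go further, trending such residuals over time to predict degradation before it affects the control loop, but the flag-before-act principle is the same.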

Resilient control system solutions and the need for interdisciplinary education

In our world of advancing automation, our dependence upon these technologies grows, as does the need for the skill sets that keep the United States at the forefront of innovation. The challenges may appear rooted in the design of better means to control our infrastructures for greater safety and efficiency in the generation and use of energy. However, the evolution of the technologies developed to achieve the current design of automation has produced a complex environment in which a cyber attack, human error in design or operation, or a damaging storm can wreak havoc on the infrastructure we depend upon as a nation. The next generation of systems will need to consider the broader picture to ensure that, as a path forward, failures do not lead to ever greater catastrophic events. A critical resource is the students of tomorrow, who will be expected to advance these designs and who require both a perspective on the challenges and an appreciation of the contributions of others to fulfill the need. Addressing this need, courses have been developed to provide the perspectives and relevant examples to overview the issues and provide the opportunity to create resilient solutions at such universities as George Mason University and Northeastern. The tie to critical infrastructure operations is an important aspect of these courses.

Through the development of technologies designed to set the stage for next-generation automation, it has become evident that effective teams are composed of several disciplines. However, developing a level of effectiveness can be time consuming, and when done in a professional environment can expend a great deal of energy and time that provides little obvious benefit to the desired outcome. It is clear that the earlier these STEM disciplines can be successfully integrated, the more effective they are at recognizing each other's contributions and working together to achieve a common set of goals in the professional world. Team competition at venues such as Resilience Week would be a natural outcome of developing such an environment, allowing interdisciplinary participation and providing an exciting challenge to motivate students to pursue a STEM education.

Standardizing resilience and resilient control system principles

Standards and policy that define resilience nomenclature and metrics are needed to establish a value proposition for investment by government, academia and industry. The IEEE Industrial Electronics Society has taken the lead in forming a technical committee toward this end. The purpose of this committee is to establish metrics and standards associated with codifying promising technologies that promote resilience in automation. This effort is distinct from the supply-chain community's focus on resilience and security, such as the efforts of ISO and NIST.
