Validate and Enable Machine Learning in Industrial AI
Hongbo Zou, Guangjing Chen, Pengtao Xie, Sean Chen, Yongtian He, Hochih Huang, Zheng Nie, Hongbao Zhang, Tristan Bala, Kazi Tulip, Yuqi Wang, Shenlin Qin, Eric P. Xing
Hongbo Zou*, Guangjing Chen*, Pengtao Xie*, Sean Chen*, Yongtian He*, Hochih Huang*, Zheng Nie*, Hongbao Zhang*, Tristan Bala*, Kazi Tulip*, Yuqi Wang*, Shenlin Qin*, Eric P. Xing*†
* Petuum Inc., Sunnyvale, US
† School of Computer Science, Carnegie Mellon University, Pittsburgh, US
Email: {hongbo.zou, guangjing.chen, pengtao.xie}@petuum.com
ABSTRACT
Industrial Artificial Intelligence (Industrial AI) is an emerging concept which refers to the application of artificial intelligence to industry. Industrial AI promises more efficient future industrial control systems. However, manufacturers and solution partners need to understand how to implement and integrate an AI model into the existing industrial control system. A well-trained machine learning (ML) model provides many benefits and opportunities for industrial control optimization; however, an inferior Industrial AI design and integration limits the capability of ML models. To better understand how to develop and integrate trained ML models into the traditional industrial control system, test the deployed AI control system, and ultimately outperform traditional systems, manufacturers and their AI solution partners need to address a number of challenges. Six top challenges, which were real problems we ran into when deploying Industrial AI, are explored in the paper. The Petuum Optimum system is used as an example to showcase the challenges in making and testing AI models, and more importantly, how to address such challenges in an Industrial AI system.
Introduction
“Industrial AI” is an emerging concept which refers to the application of artificial intelligence to industry [27]. Industrial AI is now underway, transforming traditional industrial control into smart Industry 4.0 and creating new opportunities, where an AI solution is built on pattern recognition with the ability to understand those processes, interact with the environment, and intelligently adapt its behavior [22][26]. The concept of Industrial AI is becoming widely adopted by mainstream manufacturers in their discourse and practice.
The recent popularity of Industrial AI is due to the increase in deployed sensors along with the automated data collection process, the powerful computation capability to perform complex tasks, faster network connection to access cloud services for data management, and computing power outsourcing. However, ML technology alone cannot be applied to industrial automation immediately if it is not sufficiently adapted to the industry domain. According to a Gartner survey in 2018, 75% of early AI projects underwhelmed, and 85% of AI projects failed [17]. This trend is due in large part to the understanding gap between AI system design and user expectations.
To understand how AI works with industrial control systems, we first review the framework of traditional industrial control systems without AI assistance (Figure 1). The system consists of five blocks: a supervisor, a controller, an actuator, a plant, and measuring sensors. The supervisor is responsible for defining the desired values and instructing the connected controller. The controller detects the error signal and amplifies it to trigger the predefined control logic. The output of the controller is fed to the actuator for control logic execution. An actuator is a component of a machine that is responsible for moving and controlling a mechanism. It provides the input to the plant according to the control signal, so that the output signal approaches the reference input signal. The sensors, or measurement elements, are devices that convert output variables, such as displacement, into another manageable variable, such as a voltage, that can be used to compare the output with the reference input signal. This completes the feedback path of the closed-loop system [4]. Automatic control is one of the primary goals of industrial process applications.
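As an illustration of the closed loop described above, the following is a minimal sketch rather than any real plant model. It assumes a purely proportional controller and a toy plant that simply adds the control signal to its output; the function name `simulate_closed_loop`, the gain, and all numeric values are hypothetical.

```python
def simulate_closed_loop(setpoint, initial_output, gain=0.5, steps=20):
    """Toy closed loop: sensor -> controller -> actuator -> plant."""
    output = initial_output
    history = [output]
    for _ in range(steps):
        measured = output                 # sensor: report the plant output
        error = setpoint - measured       # controller: deviation from the desired value
        control_signal = gain * error     # controller: amplify the error (proportional action)
        output = output + control_signal  # actuator + plant: toy plant adds the signal
        history.append(output)
    return history


# The supervisor asks for 100.0 while the plant starts at 60.0 (hypothetical values).
trace = simulate_closed_loop(setpoint=100.0, initial_output=60.0)
print(trace[:4], "...", trace[-1])
```

With a gain between 0 and 1 the remaining error shrinks by a constant factor at every step, which is the "reduce the deviation to zero or a small value" behavior in its simplest form.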
An automatic controller compares the real values of the output of a plant with the input references (the desired values), determines the deviation, and produces a control signal that will reduce the deviation to zero or a small value [1]. The way in which the automatic controller produces the control signal is called the control action. It moves a system from its initial condition to the desired state and, once there, maintains the desired state. AI has helped fundamentally change the scope and pace of automation. In the past, mechanical or electrical hardware components performed most control functions in technological systems. When hardware solutions were insufficient, continuous human participation in the control loop was necessary. In Industrial AI, machine learning enables predictions to take over many control functions in the supervisor and controller blocks. A well-trained machine learning model can provide excellent system performance under widely varying operating conditions.
Figure 1 A typical industrial control system (block labels: Controller, Plant, Sensors, Actuator, Supervisor, Higher-level decision making, Instrumentation)
Despite the strength of Industrial AI for future industrial control systems, ML models alone are not enough. Comprehensive design and testing processes are necessary to ensure their adoption in industry. To better understand how to develop machine learning models, integrate them into the traditional industrial control system, and validate the deployed AI control system, manufacturers and their AI solution partners need to address all requirements and challenges. For example, a machine learning model makes suggestions based on its prediction to avoid undesired outcomes before they appear. This proactive process differs from traditional human control, in which people usually react to outcomes only after the fact. Another challenge is that ML models trained mostly on normal data can never be guaranteed to satisfy all of the edge cases in industrial control [11]. Therefore, the traditional control logic still needs to coexist with AI control for extreme situations. This coexistence brings new questions, such as how to define the working scopes of AI control and traditional control logic in the industrial system, and whether this coexistence introduces conflicts in the control system. All of these questions need to be explored before Industrial AI implementation.
To help enterprises at every step of the AI adoption journey, we built Petuum Symphony, an enterprise AI platform. Petuum Symphony handles AI/ML operations, data processing, AI hardware optimization, and explainable AI, allowing users to rapidly build, deploy, and maintain cutting-edge AI solutions. Operationalized on the Symphony platform, Petuum Optimum, our flagship Industrial AI product, enables manufacturing enterprises to reach unprecedented yield and energy savings by optimizing asset and process operations. By abstracting and sharing our experience in designing and implementing Petuum AI in real projects, we showcase the challenges in making and testing AI models, and more importantly, how to address these challenges in an Industrial AI system.
This paper makes the following contributions:
1. The difference between traditional industrial control and Industrial AI control is analyzed and presented;
2. Six implementation challenges, which were real problems we ran into when deploying Industrial AI, are explored and discussed;
3. Petuum Optimum, our flagship Industrial AI product, is used as an example to showcase the challenges in making and testing AI models;
4. The design and implementation of various ML and system testing approaches for thorough Industrial AI testing are introduced.
Building on these contributions, Petuum Optimum enabled two Industrial AI solutions, which have been evaluated and deployed at two of the largest cement factories in the world.
The rest of this paper is organized as follows. Related work, including the traditional control system, is discussed in Section 2. Section 3 presents the problems and Petuum solutions: six top challenges are proposed and explained, and the Industrial AI validation system adopted by Petuum Optimum is introduced. This validation system gives us a concrete example to present our design and test experiences. The six challenges are then addressed and discussed, respectively, in Sections 4 to 9. The conclusions, limitations, and future work are discussed in Section 10.
Related Work
Thanks to recent advances in machine learning and AI, Industrial AI is an emerging field with a great amount of data, market value, and the potential for a new industrial revolution. Industrial AI control systems use the high volumes of data generated by sensors to identify trends and patterns that can then be used to predict future trends. These predictions help avoid the worst control states, making manufacturing processes more efficient while reducing energy consumption. This process enables plants to adapt continuously as optimization proceeds, with no need for operator input. In addition, as the level of interconnection increases, the AI system can learn from it, which can lead to the discovery of many complex correlations in systems that are not yet, or are no longer, evident to the traditional control system. Industrial AI control systems with sufficiently intelligent analytical technology are already available. But how to design and develop an AI system depends on a thorough, well-studied understanding of the whole Industrial AI system. This goal motivates us to study the challenges of Industrial AI and to explore how to design and implement an Industrial AI system to optimize industrial control processes.
A traditional industrial control system (ICS) is an information system used to control industrial processes such as manufacturing, product handling, production, and distribution. Industrial control systems include supervisory control and data acquisition systems to control geographically dispersed assets, as well as distributed control systems and embedded control systems to control localized processes. The controller’s goal is to move a system from its initial condition to the desired state and, once there, maintain that desired state. The difference between the desired and actual states is called the error signal. It is also possible that the desired state will change over time.
When this happens, the controller must adjust the state of the system to track changes in the desired state. Such adjustment, often made by a human operator, takes effect only after the error has occurred, and after-the-fact adjustment cannot fully optimize the industrial production process in terms of product quality, energy savings, and productivity [19][20].
Built on pattern recognition, Industrial AI proposes a new approach to optimize the industrial control process even further [8]. Interacting with the control system, Industrial AI can predict and avoid potential errors. It can also take over many control functions in the supervisor and controller blocks effectively and efficiently under different operating conditions. Human participation plays an important role in the traditional control loop, but it can handle edge cases only passively, whereas Industrial AI can intelligently adapt to product line changes. Automatic control is always one of the major goals in industrial process applications. In Industrial AI, machine learning enables predictions to take over many control functions in the supervisor and controller blocks. A well-trained machine learning model can provide excellent system performance under widely varying operating conditions. For consistently high performance and robustness, an AI control system must be carefully designed and thoroughly tested [15][34].
Problems and Petuum Solutions
Industrial AI is being touted as the new electricity that will power the next industrial revolution. Companies that harness AI will lead while those that don’t will lag behind. However, the high hurdle of understanding AI, how best to develop and validate an AI control system, and ultimately, how to leverage AI as a game-changer create strong barriers to adoption. To ensure industries can truly harness the power of AI for this next revolution, Petuum is transforming traditional industrial control into Industrial AI.
Problems and Challenges
Although Industrial AI is a promising idea for future control systems, manufacturers and solution partners still need to understand how to implement and integrate an AI model into the existing industrial system [13][14]. A well-trained ML model provides many benefits for industrial control optimization; however, a poor Industrial AI design and integration limits the capability of an ML model. The Industrial AI concept as proposed does not give any feasible proposal or details on Industrial AI design and test. Consequently, Industrial AI is being impeded by a number of challenges and questions, which need to be clarified to engage more manufacturers in embracing this evolution. To better understand how to develop and integrate a trained model into the traditional ICS and test that the deployed AI system is working properly and more effectively than human beings, manufacturers and their AI solution partners need to address all requirements and challenges [23]. Some top challenges and real problems we face when we industrialize AI in manufacturing are discussed below:
1. Domain knowledge transfer. Different manufacturers have various control requirements on their product lines. Such requirements include complex domain knowledge, which is difficult for AI solution partners to understand. However, the developed Industrial AI system does not work if the trained model cannot capture the domain knowledge behind the control logic. Therefore, how to transfer the industrial domain knowledge to solution partners is the first challenge that needs to be addressed.
2. Data preprocessing. Raw data is often incomplete or inconsistent, lacks certain patterns or trends, and likely contains errors. Data preprocessing is the step that transforms raw data into an understandable and expected format for AI training and prediction. In Industrial AI, the data is collected, saved, and processed in streaming mode, and its pattern keeps changing over time. So, consistently applying the data preprocessing methods used in ML training to prediction is the second challenge.
3. AI model accuracy (model validation). AI model accuracy is the measurement used to determine which model is best at identifying relationships and patterns between variables in a dataset based on the input or training data. The better a model can generalize to ‘unseen’ data, the better predictions and insights it can produce, which in turn deliver more industrial control value.
4. Real-time constraints. In industrial control, real-time constraints are restrictions on the timings of events, such that they occur on time. The Industrial AI control time includes multiple latencies covering data collection, data transfer, AI prediction, control response, etc. Some of the latencies are constant costs. How to reduce the optimizable latencies to meet end-to-end real-time constraints is the fourth question that needs to be addressed and tested in house.
5. Coexistence of traditional control logic and AI control. Currently, the AI control system relies on its AI model to predict future errors. Such prediction lets the system avoid the worst state before it happens, rather than stopping the worst state once it has occurred. This difference allows AI control systems to replace traditional control systems gradually. However, machine learning is not a deterministic approach to industrial control and cannot fully satisfy all edge cases. Therefore, the traditional control logic still needs to coexist with AI control for extreme situations. This coexistence brings up new questions regarding how to define the working scope of AI control and traditional control logic in the industrial system, and whether or not this coexistence causes decision conflicts in system control.
6. No supervised-steering data for in-house tests. In most cases, the development of a new Industrial AI control system is based on a running product line. The new Industrial AI control system is developed to replace the existing control system. Therefore, there is no available supervised-steering data or system for in-house testing and debugging. We need to explore how to make in-house testing more feasible for real production line control.
Due to space limitations, only six typical challenges are listed above. All of these challenges need to be addressed in Industrial AI development. In this study, we take the Petuum Optimum system as an example to understand what the challenges for testing and making the AI model productive are and how to address such challenges in an Industrial AI system. The generalization potential of this study can benefit the development of Industrial AI systems by helping AI solution partners understand and interpret user requirements better, and improve the procedure of Industrial AI system design, implementation, and test [24].
Petuum Optimum System
Petuum Optimum is a typical industrial control system for industrial manufacturers. The Petuum Optimum system provides a service which ingests historical and real-time streaming data from the manufacturing plant production line. This data is then fed into expert-built machine learning algorithms to deliver precise predictions of the plant’s operational variables. The Optimum system makes precise predictions by understanding the dynamic and non-linear relationships between the variables. Such predictions can prescribe optimal set points to enable the plant to reach its operational goals. Petuum Optimum also offers a supervised-steer option, which enables feeding the set points into the plant’s control system for a fully autonomous operation. The self-learning capabilities of Petuum Optimum mean it is constantly collecting and processing data to continuously improve the results as it gains a better understanding of the plant and its processes. This paper takes Petuum Optimum as it applies to cement manufacturing plants as an example to discuss the challenges of Industrial AI. Because performance significantly impacts the system design of an Industrial AI control system, this study mainly focuses on the testing of the Industrial AI system. The derived conclusions and experiences can be extended to all of the stages of Industrial AI system development.
Figure 2 shows an overview of the Petuum Optimum industrial control system and how it integrates with the existing industrial control system. In general, the Petuum Optimum system serves prediction and control with three core components: data check, ML model, and control logic. The data check module queries the raw data and the values of the control variable from the controller. The controller caches the raw data collected from sensors and generates dynamic control bounds from the dynamic control table.
The data check module applies data preprocessing based on the selected data processing model and, depending on the result of checking the raw data against the dynamic bounds, sends the preprocessed data to one of the following inference modules. There are two inference modules in the Petuum Optimum system. If the raw data is within bounds in the data check step, the ML models run inference to generate suggestions for industrial control optimization. If the raw data is out of bounds, the control logic module takes over inference. Control logic acts as a safeguard to guarantee edge case handling. This overview shows the fundamental design and integration logic of the Petuum Optimum system. Petuum Optimum provides a simple system implementation and integration solution for Industrial AI. However, whether the developed system works for the plant’s optimization goals needs to be carefully and thoroughly tested [21].
To better understand how to develop and integrate trained models into the industrial control system, how to test the deployed AI control system, and finally how to outperform traditional systems, six top challenges need to be addressed: domain knowledge transfer, data validation, AI model performance test, latency constraints, coexistence of traditional control logic and AI control, and in-house test. The Petuum Optimum system is used as an example to showcase the challenges in making and testing. The six challenges are discussed one by one in the following sections.
Knowledge Transfer
Different manufacturers have various plant physical characteristics and desired outcomes, leading to vastly different control requirements on their product lines. Such requirements include complex domain knowledge that is not clear to solution partners. However, it is impossible to develop an Industrial AI system without adapting it to the customer’s specific requirements. Therefore, how to transfer the industrial domain knowledge to AI solution partners needs to be addressed first. Knowledge transfer requires excellent and consistent communication between the different parties. In general, it breaks down into three stages.
Symbolize and formulate the knowledge
AI models only accept and process digitalized data during training and inference. In general, the data should be as numeric and tabular as possible to allow easier modeling. With this requirement in mind, the domain knowledge should be formulated and tabulated. The Petuum Optimum system defines an AI Configuration Enablement (ACE) sheet to formalize this step. Three items (selected tags, tag bounds, and tags relation) need to be checked and listed in the sheet. These heavily depend on the communication between the two parties. It is critical to allow industry partners to transfer their data engineering experience and solution expectations to the AI solution partners.
Selected tags
In general, there are hundreds of sensors working on each industrial plant production line. The time series data collected from each sensor is defined as a tag per industry tradition. Data is important, but simply adding more tags does not always lead to a better model. Data engineers often find raw data from tags not qualified for model training because of an unsupported format or too much noisy or lost data. Therefore, AI solution partners and end-users should discuss which tags to keep for AI model training. In general, four types of tags are defined for every use case in the Petuum Optimum system. Control, normalize, and optimize tags can be adjusted to keep the product line working in an optimized status. Constraint tags are used to check whether the product line is running in an optimized status or in an unhealthy way. All of the defined tags are selected based on the optimization purpose and target.
Tag bounds and target values
Tag bounds define the accepted range of values during optimization. When a tag is outside its predefined bounds or away from its target value, it enters an edge case where 1) the AI model should prioritize returning it to normal, or 2) special safeguard control gets triggered to amend. Tag bounds can be defined as static or dynamic bounds. Four types of tags have been described and discussed in the previous section. The different tag types determine if the tag bounds will be static or dynamic. Control tags can only be changed gradually and are adjusted following dynamic bounds with a small step. Constraint tags are used to check optimization consequences. The static bounds of constraint tags can be defined by customers to check out-of-bounds edge cases. Optimize tags and normalize tags have predefined optimization weights, which reflect different optimization goals.
Tags relation
Tags relation describes the positive, negative, and sometimes the non-linear relationship between tags. Such relations are the foundation in traditional Industrial AI. They are the so-called golden rules resulting from physics directly or through years of practice. For example, there are two tags, Tag A and Tag B, in the product line. Tag A is a constraint tag, which monitors the product line running status and cannot be adjusted directly. Tag B is a control tag, which can be adjusted to change the production running status. Tag A and Tag B can be defined together to achieve one of the optimization goals. Two tags may have a negative relationship. The relationship can be defined as follows: when Tag A is increasing quickly, Tag B should be decreased to a low amount to keep Tag A from continuing to increase.
Figure 2 Petuum Optimum industrial control system (block labels: Plant, Sensors, Actuator, Higher-level Supervisor Commands, Data Process, Petuum AI-pilot Control System, ML Models, Control Logic, Instrumentation)
Understanding how to hand over some of the control logic to the AI model is one of the challenges in Industrial AI. An agreed-upon definition of tag bounds and tag relations can make user expectations clear. The design of the AI model and validation logic on the AI provider side should then fully follow the definition of the ACE sheet.
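The three ACE items discussed above (selected tags, tag bounds, and tags relation) lend themselves to a tabular, machine-checkable form. The following is a minimal sketch of one possible encoding, not the actual ACE sheet format: the class names `TagSpec` and `TagRelation`, the `max_step` field, the rule used to derive dynamic bounds for control tags, and all numeric values are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class TagType(Enum):
    CONTROL = "control"
    NORMALIZE = "normalize"
    OPTIMIZE = "optimize"
    CONSTRAINT = "constraint"


@dataclass
class TagSpec:
    name: str
    tag_type: TagType
    lower: float                      # static lower bound agreed with the customer
    upper: float                      # static upper bound agreed with the customer
    max_step: Optional[float] = None  # hypothetical: largest allowed change per control step

    def bounds(self, current: Optional[float] = None) -> Tuple[float, float]:
        """Return the accepted (low, high) range for the next value."""
        is_dynamic = (self.tag_type is TagType.CONTROL
                      and self.max_step is not None
                      and current is not None)
        if is_dynamic:
            # Assumed rule: a small window around the current value, clipped to the static range.
            return (max(self.lower, current - self.max_step),
                    min(self.upper, current + self.max_step))
        return (self.lower, self.upper)

    def in_bounds(self, value: float, current: Optional[float] = None) -> bool:
        low, high = self.bounds(current)
        return low <= value <= high


@dataclass
class TagRelation:
    source: str  # tag being monitored, e.g. a constraint tag
    target: str  # tag to adjust in response, e.g. a control tag
    sign: int    # +1 for a positive relationship, -1 for a negative one


# Hypothetical entries mirroring the Tag A / Tag B example.
tag_a = TagSpec("Tag A", TagType.CONSTRAINT, lower=0.0, upper=100.0)
tag_b = TagSpec("Tag B", TagType.CONTROL, lower=0.0, upper=50.0, max_step=2.0)
relation = TagRelation(source="Tag A", target="Tag B", sign=-1)

print(tag_b.bounds(current=10.0))  # dynamic bounds around the current value
print(tag_a.in_bounds(120.0))      # a constraint tag outside its static bounds
```

Keeping the sheet in a structured form like this lets the ML and QA teams validate the same definitions the customer agreed to, rather than each team re-interpreting a free-text description.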
Disseminate knowledge in multiple teams
In general, an AI solution implementation process involves the collaboration of multiple development teams. Different teams (i.e., customer success, ML, and Quality Assurance) need to completely understand user requirements and collaborate coherently to build the Industrial AI solution end to end. For example, the customer success team communicates with the customer, defines data labels and expected control logic, and designs data preprocessing and basic safeguard control logic. The ML team cares more about how to tune the selected model to capture data features with a full understanding of the requirements. The QA team also needs to completely understand the user requirements and design different test cases to validate the solution from the customer’s perspective. Therefore, all of the involved teams should keep customer expectations and transferred domain knowledge consistent. Once the product requirement document (PRD) and ACE have been defined, any update should be re-checked with customers.
Re-check knowledge in supervised-steering test
In our experience, even with a well-defined ACE, not all of the knowledge and expectations are necessarily transferred. More often than not, customers may not fully describe their knowledge and expectations in one iteration, and the AI solution partner does not fully understand the user description either. Many intrinsic reasons could lead to such suboptimal knowledge transfer, and it is in the best interest of both sides to continue communication during and after the ML modeling. Therefore, it is advised to re-check knowledge and expectations with the user in a supervised-steering test. In this phase, the drafted Industrial AI solution is shown to the users, giving them a chance to provide second-round feedback. The re-checking helps the user and solution partner correct the ACE and PRD and fine-tune the deployed solution.
Data Validation
Machine learning is a powerful learning tool that extracts patterns from data. By this nature, a machine learning model is acutely sensitive to the quality of the data. Such sensitivity is relevant in the two core phases of ML model development: training and deployment [2][3].
During training, data quality directly determines the ceiling of model performance. After all, ML models, no matter how powerful, cannot learn a pattern if it is not present in the data. A common challenge ML engineers face is that they do not observe the data patterns mentioned by the customers. While these usually involve further discussion and debugging, there are several common data preprocessing steps that are general enough to be applied in almost all use cases. Below, we introduce the four common data preprocessing topics in the Petuum Optimum system.
Outlier removal
Outliers are extreme values that deviate from other observations in the data. In some cases, customers provide clear definitions for outliers, such as a threshold that defines the normal range of values. In most cases, however, this information is not readily available, and data engineering is needed [10]. Descriptive statistics methods, such as the interdecile range, check the data distribution between the first and the ninth deciles (10% and 90%). Outliers are padded with the last normal value. If outliers occur too often, data should be re-verified with customers before proceeding to ML training. This catches potential errors early, allowing fast iteration of communication between customers and solution providers. Sometimes, such re-verification also helps ML engineers improve their outlier removal method.
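The interdecile approach and the padding rule can be sketched as follows. The text does not specify how the normal range is derived from the deciles, so this sketch assumes fences placed `k` interdecile ranges beyond the first and ninth deciles; the multiplier `k`, the function names, and the synthetic signal are all hypothetical.

```python
import statistics


def interdecile_fences(values, k=1.5):
    """Return (low, high) fences built from the first and ninth deciles."""
    deciles = statistics.quantiles(values, n=10)  # nine cut points: 10%, ..., 90%
    d1, d9 = deciles[0], deciles[-1]
    spread = d9 - d1                              # interdecile range
    return d1 - k * spread, d9 + k * spread


def pad_outliers(values, low, high):
    """Replace each out-of-range value with the last in-range value seen."""
    cleaned, last_normal, n_outliers = [], None, 0
    for v in values:
        if low <= v <= high:
            last_normal = v
            cleaned.append(v)
        else:
            n_outliers += 1
            # Before any normal value has been seen there is nothing to pad with,
            # so the raw value is kept (an assumption of this sketch).
            cleaned.append(last_normal if last_normal is not None else v)
    return cleaned, n_outliers / len(values)


# Synthetic signal fluctuating around 60 with a single spike.
signal = [60 + (i % 5) - 2 for i in range(40)]
signal[20] = 500
low, high = interdecile_fences(signal)
cleaned, outlier_ratio = pad_outliers(signal, low, high)
print(low, high, outlier_ratio)
```

The returned outlier ratio gives a simple trigger for the re-verification step: if it exceeds a threshold agreed with the customer, the data goes back for discussion instead of into training.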
Non-stationary signals
ML models are trained on a limited amount of data. The data pattern is assumed to persist outside the training dataset. Unfortunately, this is sometimes not the case. The data pattern needs to be continuously monitored and validated throughout the various stages of the Industrial AI system. Otherwise, model performance may deteriorate over time. When overlooked, it may even lead to an unexpected abrupt drop in performance.
Figure 3 Outlier removal example
Figure 3 showcases a real example that occurred during Petuum Optimum development. The original signal (blue) had outliers removed (red), as shown in the top panel. Upon closer inspection, however, we realized that the values in the 80s on August 3 are well within the expected range of this signal and therefore should not be removed. Further debugging revealed that they were labeled as outliers because the model was trained on data from June. As shown in the bottom panel, data in June mostly fluctuated around 60 and never exceeded 70. Because of this shift in data pattern, models performed very poorly for hours in August. Fortunately, this was captured in the validation stage before deployment, which underlines the importance of continuous monitoring of data quality in case the signal is non-stationary.
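One simple way to monitor for this kind of shift is to track how much of a recent window falls outside the range learned at training time, and to raise a flag instead of silently padding when that share is large. The sketch below assumes this rule; the threshold `max_ratio`, the function names, and the numbers are synthetic and only loosely modeled on the June and August behavior described above, not taken from Figure 3.

```python
def out_of_range_ratio(window, low, high):
    """Share of points in the window that fall outside [low, high]."""
    outside = sum(1 for v in window if not (low <= v <= high))
    return outside / len(window)


def drift_suspected(window, low, high, max_ratio=0.2):
    """Flag a window when too many points leave the training-time range."""
    return out_of_range_ratio(window, low, high) > max_ratio


# Hypothetical range learned from training data that stayed around 60.
train_low, train_high = 50.0, 70.0

june_window = [58.0, 61.0, 60.0, 63.0, 59.0, 62.0]
august_window = [79.0, 81.0, 80.0, 82.0, 78.0, 80.0]

print(drift_suspected(june_window, train_low, train_high))
print(drift_suspected(august_window, train_low, train_high))
```

A flagged window is a prompt to re-verify the expected range with the customer or retrain on more recent data, rather than evidence that the sensor values are wrong.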
Broken sensors
In the previous section, we discussed outliers. In some production lines, the sensors may generate outlying values so often that they are difficult to identify accurately with statistics. These sensors may restart periodically to clean or re-calibrate. A dedicated data engineering team is required to work with broken sensors. This situation is formulated as follows. At each step, a window of data is injected into the ML model for inference:
$X = [x_{t-w+1}, x_{t-w+2}, \ldots, x_t]$
where $x_t$ represents the value of a signal at time $t$, and $w$ is the window size. Petuum Optimum uses a majority vote to decide whether a window is valid. In a given window, if more than 25% of the data points demonstrate broken characteristics, the window is considered unstable, and no further ML effort will be made on it. To determine whether the sensor is working at one specific data point, customers shall provide a criterion, for example, a “broken sensor bound.” If the sensor value is within the bound, it is considered valid.
Let’s take one tag, ‘Tag Name 1’, as an example. Consider the specific window [5.0, 1.0, 4.0, 7.5, 4.5, 3.8, 16.2, 17.2, 36.0, 44.0, 2.0, 77.0]. Here, the window size is 12, and the broken sensor bound is [15, 50]. We observe that eight data points (5.0, 1.0, 4.0, 7.5, 4.5, 3.8, 2.0, and 77.0) are outside the broken sensor bound. Consequently, 67% (8/12) of the members in this window are invalid, which is above the 25% limit, making the window unstable.
Interpolation
Streaming is the most common data source in industrial control. Streaming data is generated continuously by many sensors, which typically send it simultaneously in small messages. It is common for different sensors to generate data at different sampling rates. To simplify the data processing pipeline, the different data streams are interpolated to the same sampling rate. The same interpolation is used in both ML model training and inference [6].
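To make the resampling step concrete, the sketch below aligns two streams with different sampling rates onto a common time grid by carrying the most recent value forward (last-value hold), which is one simple choice among many. The function name `last_value_hold`, the timestamps, and the sensor values are hypothetical; returning `None` before the first sample arrives is an assumption about how an empty cache is handled.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def last_value_hold(
    samples: Sequence[Tuple[float, float]], grid: Sequence[float]
) -> List[Optional[float]]:
    """Resample (timestamp, value) pairs onto grid by carrying the latest value forward.

    samples must be sorted by timestamp. Grid points earlier than the first
    sample get None because no value has been cached yet.
    """
    times = [t for t, _ in samples]
    out: List[Optional[float]] = []
    for g in grid:
        i = bisect_right(times, g)  # number of samples with timestamp <= g
        out.append(samples[i - 1][1] if i > 0 else None)
    return out


# Two hypothetical sensors sampled at different rates (timestamps in seconds).
fast = [(0, 1.0), (10, 1.2), (20, 1.1), (30, 1.4)]
slow = [(5, 50.0), (25, 52.0)]
grid = [0, 10, 20, 30]

fast_on_grid = last_value_hold(fast, grid)
slow_on_grid = last_value_hold(slow, grid)
print(fast_on_grid)  # [1.0, 1.2, 1.1, 1.4]
print(slow_on_grid)  # [None, 50.0, 50.0, 52.0]
```

Whatever rule is chosen, applying the identical function in training and in inference keeps the two pipelines consistent.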
If the customer suggests the interpolation method, it can be implemented directly. Otherwise, Petuum Optimum defaults to a simple interpolation method that always writes the most recently cached value to the current timestamp. This method is extremely easy to implement, executes fast, and works well with stringent time requirements. Figure 4 shows two examples of interpolation.

Data processing is an integral step in ML. ML engineers may not have the domain knowledge to comprehend the data, whereas customers may not understand what is needed for successful ML training and deployment. Therefore, the data processing method and output should be discussed and verified between the two parties frequently [7].

Machine Learning Model Performance Test
From a quality and testing perspective, developing an Industrial AI system is no different from traditional software development, in the sense that it is also a full life cycle involving unit tests, integration tests, and end-to-end tests [9][12]. However, there are three major differences between software development and ML development which make the testing approach slightly different. 1) An ML model is not deterministic and cannot be guaranteed to generate the same output for every test run; hence defining pass/fail criteria is challenging. 2) The performance of an ML model heavily depends on the data quality in training and serving. The ML model should be able to catch data pattern changes and update its parameters automatically. 3) Industrial AI is developed and deployed to serve streaming applications. ML model performance can be significantly impacted by streaming latency, so a new testing approach is needed to check and validate the impact of such latency [17].

The goal of the Industrial AI system is to prescribe optimized values to the controllers in production. This task is divided into two components: 1) a prediction model that forecasts future production line behavior, usually a certain number of minutes ahead; 2) an optimization model which finds the best control value for the controller and optimizes the future behavior. When framed as an ML task, there are two types of variables in the model: response variables and control variables. The response variables, which are measured by sensors, represent the current status of the system under control. The control variables are set by the actuators to make the system reach the desired status. The two types of data have a causal relation due to their physical origin, which should be learned by the prediction model [29][31].
To drive the response variables to the desired status, Petuum Optimum uses the trained prediction model to search for the best control values and feeds those control values into the system. Concretely, the ML task is formulated as follows:

$X_t$: the response values of the input at instant $t$;
$Y_t$: the control values of the input at instant $t$;
$Z_t$: the output values at instant $t$.

In the regression model, $\vec{X}_t = [X_{t-w+1}, X_{t-w+2}, \ldots, X_t]$ and $\vec{Y}_t = [Y_{t-w+1}, Y_{t-w+2}, \ldots, Y_t]$ are used to predict $Z_{t+\Delta t}$, where $w$ is the window size and $\Delta t$ is the prediction length. Let $\hat{Z}_{t+\Delta t} = F(\vec{X}_t, \vec{Y}_t)$; then the optimization target is

$\min \left( \sum ( \hat{Z}_{t+\Delta t} - Z_{t+\Delta t} ) \right)$,

i.e., to make $F(\vec{X}_t, \vec{Y}_t)$ as close to the actual output $Z_{t+\Delta t}$ as possible. Consequently, the loss function of the regression algorithm is defined as:

$\text{loss function} = \sum ( \hat{Z}_{t+\Delta t} - Z_{t+\Delta t} )$

Figure 4 Interpolation example

Two test cases are used to validate the model performance. First, an offline evaluation is conducted to verify the performance of the trained model. Once a model is trained with some training data, the test uses historical data from a different period, processed with the same preprocessing steps. The prediction results are compared with the raw data to validate the prediction model’s accuracy (Figure 5). The prediction signal is shifted by the prediction length to match the actual data. The example shows poor performance for the prediction model: the prediction often does not catch extreme values in the original data. Through visualization, together with numerical criteria such as the correlation, R squared, and mean square error (MSE), we can confidently reject this model and investigate ways to improve its performance.

The second test case is an online streaming evaluation, which completes the test by checking performance in a semi-real deployment. Many factors impact the performance of the deployed model.
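For the offline evaluation, the numerical criteria named above (correlation, R squared, and MSE) can be computed after shifting the predictions by the prediction length. The sketch below is a plain-Python illustration under stated assumptions: the function name `offline_metrics` is hypothetical, `predicted[i]` is assumed to be the forecast issued at step `i` for step `i + horizon`, and both series are assumed to be non-constant so that the denominators are non-zero. The sample values are made up.

```python
import math
from typing import Dict, Sequence


def offline_metrics(
    actual: Sequence[float], predicted: Sequence[float], horizon: int
) -> Dict[str, float]:
    """Score forecasts against actual values after aligning them by the horizon."""
    # The forecast issued at step i is compared with the actual value at i + horizon.
    y = list(actual[horizon:])
    y_hat = list(predicted[: len(y)])  # trailing forecasts have no actual value yet
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(y_hat) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    ss_pred = sum((p - mean_p) ** 2 for p in y_hat)
    cov = sum((a - mean_y) * (p - mean_p) for a, p in zip(y, y_hat))
    return {
        "mse": ss_res / n,
        "r2": 1.0 - ss_res / ss_tot,
        "corr": cov / math.sqrt(ss_tot * ss_pred),
    }


actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 2.5, 4.5, 4.5, 0.0]  # last forecast is unused in the comparison
metrics = offline_metrics(actual, predicted, horizon=1)
print(metrics)
```

Thresholds for accepting or rejecting a model on these metrics are a project decision and are not specified here.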
First, the data pattern could change. Additionally, the streaming infrastructure could impact the data quality because of latency, missing samples, etc. Specifically, the latency of each sensor could be different, which may significantly impact the model behavior.

Latency
Time is of the essence in industrial control automation. Real-time control requires low latency throughout the pipeline, including potential latency introduced during data collection, data transfer, ML prediction, and control response. Some of the latencies are fixed costs, such as data collection or transfer. Two typical latencies, data subscription and ML inference, are optimized in the Petuum Optimum system [28].

Latency occurs in every step of data streaming, which is based on a publish-subscribe structure. In the publish-subscribe system, the customer’s plants collect and write data into a data warehouse [35][36]. A publisher then reads the time series data from the data warehouse and produces messages with specified topics. Every AI solution needs to subscribe to its required data with a defined topic from the publisher. In the publish-subscribe system, many factors, such as the number of topics, the system cache, and the number of subscribers, could potentially affect the end-to-end latency and thus change the behavior of the ML model. Therefore, the deployed data pipeline needs to be fully designed and tested with strict time constraints [32][33].

Another major source of latency is the optimization model. At each time instance, the best control values are prescribed by the ML model. The optimization process finds the best value in the solution space but is usually very time-consuming, so reducing its running time is pivotal in this step. For example, reducing the inference time from one minute to eight seconds for each iteration is possible with parallel computing. The inference speed is not strictly part of the core performance of ML models, but it is a key metric for evaluating how practical the product is. Customers may also define their requirements in the PRD.

Figure 6 is an example of a tag with occasional sharp spikes within minutes (top panel). Because of network errors, prolonged data loss occurred for hours at one point, which can dramatically affect model performance if not resolved.

Coexistence of traditional control logic and AI control
The Industrial AI control system uses its ML model to predict the future state and avoid undesirable outcomes. Such prediction is key as it preempts an undesirable state before it happens rather than mitigating afterward. This distinction gives AI control systems a tremendous advantage over traditional control systems.
However, ML is often not a deterministic approach, and mistakes may occur. More importantly, the very nature of edge cases means they have little or no representation in the training data. At the current stage, it is unwise to give ML models full control of the plant. We advise that traditional control logic coexist with AI to handle unforeseen corner cases. This coexistence raises new questions as to how the scopes of AI and traditional control logic fit into the industrial system. It is also important to resolve conflicts between the two when differences arise.
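One way to picture this coexistence is a small arbiter that sits between the AI prescription and the controller: an expert safeguard rule overrides the AI whenever a monitored response exceeds a threshold, and otherwise the AI prescription is clamped to the range the plant allows. The sketch below is only an illustration of that idea; the `Safeguard` class, the `arbitrate` function, and every numeric value are hypothetical and are not taken from the Petuum Optimum system.

```python
from dataclasses import dataclass


@dataclass
class Safeguard:
    threshold: float   # response level above which the expert rule takes over
    safe_value: float  # control value prescribed by the expert rule
    lower: float       # lowest control value the plant allows
    upper: float       # highest control value the plant allows


def arbitrate(ai_prescription: float, response: float, guard: Safeguard) -> float:
    """Return the control value actually sent to the controller."""
    if response > guard.threshold:
        # Traditional safeguard logic wins any conflict in the unsafe region.
        return guard.safe_value
    # Otherwise follow the AI, but never leave the allowed operating range.
    return min(max(ai_prescription, guard.lower), guard.upper)


guard = Safeguard(threshold=1.0, safe_value=10.0, lower=10.0, upper=100.0)
print(arbitrate(80.0, response=0.5, guard=guard))   # 80.0
print(arbitrate(150.0, response=0.5, guard=guard))  # 100.0
print(arbitrate(80.0, response=1.2, guard=guard))   # 10.0
```

Keeping the override outside the ML model means the safeguard can be reviewed and tested with conventional, deterministic test cases.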
Figure 5 Prediction results offline evaluation
Figure 6 The streaming data with system latency
We use the industrial control of the vertical mill, a typical grinder used in cement manufacturing, to illustrate the challenge of AI coexistence and our solution. The mill’s body vibration is one of the main metrics that must be monitored for safe production. Several causes lead to high vibration, such as failure of the grout, distortion of the steel structure, or cold joints between pours [1]. The ML model cannot fully predict all of these factors and their interactions because of the lack of data and the long reaction time. To ensure safety, safeguard logic based on expert knowledge in traditional industrial control is still in use. In the Petuum Optimum system, the model enters survival mode if the current vibration is over a defined threshold of 0.45 (Figure 7, top). In this mode, the AI model is required to do everything within its power to reduce vibration. Figure 7 (bottom) shows how the special safeguard logic takes over in survival mode. The feed rate prescription is mostly maintained at the lowest level allowed by the system (indicated by the dashed bounds) because domain knowledge from the customer tells us that a low feed rate alleviates the vibration.

In-house Test
Traditionally, the improvement of an industrial control system is incremental. The Petuum industrial Optimum system is, however, developed from the ground up and not directly on the running product line. This poses a big restriction on model validation during development: as the AI solution partner, we do not have direct access to the controllers in the plant, for both business and safety reasons. How to evaluate a trained ML model before its release and deployment is, therefore, a critical challenge for Industrial AI testing.

We adopted a one-step verification to validate the model output (prediction and optimization) at each timestamp. Instead of looking for cumulative effects over time, we examine whether the outputs are optimized in the correct direction at each step. This is because the AI model’s suggestion cannot be applied to the actual input of the system without physically controlling the production line. An in-house test is still necessary to thoroughly test the speed and amplitude of the optimization over many steps. During in-house testing, AI models are first deployed internally under close supervision from model developers, quality engineers, and industry experts. One-step verification is not a replacement for in-house verification but a convenient testing framework that helps us validate the model’s offline behavior.

Figure 8 demonstrates an example of one-step validation. The signal shown has an upper constraint bound of 5050, meaning it should be decreased when above this threshold. Upon validation, we observe that optimization towards the lower side does occur whenever this signal is above the constraint. If this were in-house testing with an actual deployment, we would expect the original measurement to gradually decrease until it is back within the normal range. This is, of course, not possible in offline testing, but one-step validation offers a basic level of confidence in the model’s optimization power.
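The one-step check described above can be expressed as a direction test at each timestamp: whenever the measured signal is above its upper bound, the prediction under the optimized control values should be lower than the measurement. The sketch below is an assumption-laden illustration, not the actual test harness; the function name `one_step_violations` and the sample series are hypothetical, and only the bound of 5050 comes from the Figure 8 example.

```python
from typing import List, Sequence


def one_step_violations(
    measured: Sequence[float],
    optimized_prediction: Sequence[float],
    upper_bound: float,
) -> List[int]:
    """Indices where the signal exceeds the bound but the optimized prediction is not lower."""
    return [
        i
        for i, (m, p) in enumerate(zip(measured, optimized_prediction))
        if m > upper_bound and not p < m
    ]


# Hypothetical series around the upper constraint bound of 5050.
measured = [5000.0, 5060.0, 5100.0, 5040.0]
optimized_prediction = [5005.0, 5045.0, 5110.0, 5042.0]

violations = one_step_violations(measured, optimized_prediction, upper_bound=5050.0)
print(violations)  # [2]
```

An empty list means the optimization moved in the correct direction at every step where the constraint was violated; it says nothing about cumulative effects, which still require testing against a running system.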
Due to the limitations of in-house testing, and to mitigate risk before we put the model in auto-steer mode in production, we use an on-site testing approach and first run the model in non-auto-steer mode in production. This way we can monitor how the model performs over multiple step changes and compare the results side by side with the traditional control system. Because we have already done thorough in-house testing to validate the models, this final step takes only a very short amount of time before we switch to auto-steer mode in production.

Conclusion and Future Works
The concept of Industrial AI is becoming widely adopted by mainstream manufacturers in their discourse and practice. A well-studied industrial system can provide excellent system performance under widely varying operating conditions. Machine learning enabled predictions and optimization hold a strong advantage over traditional control methods. To ensure a consistently high level of performance and robustness, an AI control system must be carefully designed and thoroughly tested. In this study, we present the Petuum Optimum system as an example to demonstrate the challenges in Industrial AI deployment and, more importantly, how to solve and validate them. Six challenges from our real experience of implementing Industrial AI were discussed. The challenges showcase the implementation details of Industrial AI. Based on these findings, four takeaways summarize the study.

1. Data quality directly determines the performance of the AI model. Data labeling, preprocessing, and quality validation should be designed and checked carefully;
2. The industrial product line is a dynamic system. The running behavior and patterns are always changing. The trained model should catch such changes to better serve live deployment;
3. The implementation of Industrial AI includes the trained model itself and a deployed running system. The model performance could be changed in the running system. A well-designed test should cover both situations;
4. Thorough testing and various testing approaches are required to ensure the high quality of the model and to detect corner cases before the model is deployed to the production environment. Any issues missed in the testing process may cause sensor hardware damage and tremendous loss to the factory as well as the AI company.

Figure 7 Control logic validation
Figure 8 One-step validation

A major limitation of this study is that it covers only one type of data. Petuum Industrial AI is designed for typical industrial control in manufacturing plants that involve multi-channel time series data. Multimedia materials such as images and text will be interesting topics worth further exploration. In addition, some steps, such as knowledge transfer, involve manual configurations and updates. Such manual configurations are error-prone [16]. These limitations will be removed gradually.

Our future work includes 1) extending our Industrial AI solution to a wide variety of time series data types, such as speech and video. Such time series data have different data processing policies, latency requirements, and model selection. 2) removing manual configurations in solution development and validation. Automation code can be added to create configuration files based on predefined templates. Such improvement can significantly reduce the time and cost spent on development and validation.

We summarized our challenges and experience in developing Industrial AI in this study. Ultimately, it shall serve as a foundation for the wide adoption of AI technology in the field of industrial control.
References
[1] Ang, K. H., Chong, G. C. Y., Li, Y., PID Control System Analysis, Design, and Technology, Proceedings of IEEE Transactions on Control Systems Technology, 13(4): pp. 559-576, 2005.
[2] Breck, E., Cai, S., Nielsen, E., et al., The ml test score: A rubric for ml production readiness and technical debt reduction. In IEEE Big Data, 2017.
[3] Breck, E., Polyzotis, N., Roy, S., Whang, et al., Data validation for machine learning, Proceedings of the second SysML Conference (SysML2019).
[4] Demongodin, I., & Koussoulas, N. T., Differential Petri Net Models for Industrial Automation and Supervisory Control, IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, Vol. 36, No. 4, July 2006.
[5] Hsiung, P-A., Real-Time Constraints, In Institute of Information Science, Academia Sinica, Taipei, 2000.
[6] Hynes, N., Sculley, D., and Terry, M. The data linter: Lightweight, automated sanity checking for ml data sets. Proceedings of Workshop on ML Systems at the Thirty-first Conference on Neural Information Processing Systems (NIPS 2017 ML Systems Workshop), 2017.
[7] Krishnan, S., Franklin, M. J., Goldberg, K., and Wu, E. Boostclean: Automated error detection and repair for machine learning. CoRR, abs/1711.01299, 2017. URL http://arxiv.org/abs/1711.01299.
[8]
[9] Pei, K., Cao, Y., Yang, J., and Jana, S. Deepxplore: Automated whitebox testing of deep learning systems. Proceedings of ACM Symposium on Operating Systems Principles (SOSP 2017), pp. 1–18, 2017. ISBN 978-1-4503-5085-3.
[10] Polyzotis, N., Roy, S., Whang, S. E., and Zinkevich, M. Data management challenges in production machine learning. Proceedings of ACM SIGMOD, pp. 1723–1726, 2017.
[11] Qin, J., Liu, Y., and Grosvenor, R., A Categorical Framework of Manufacturing for Industry 4.0 and Beyond, Proceedings of Changeable, Agile, Reconfigurable & Virtual Production Conference, pp. 173-178, 2016.
[12] Renggli, C., Karlas, B., Ding, B., Liu, F., Schawinski, K., Wu, W., and Zhang, C., Continuous integration of machine learning models with ease.ML/CI: towards a rigorous yet practical treatment, Proceedings of the second SysML Conference (SysML2019).
[13] Sallomi, P., Artificial Intelligence Goes Mainstream, The Wall Street Journal, Retrieved Nov. 18 2019.
[14] Seebo Inc., The Impact of Machine Learning and AI on Manufacturing, Retrieved from https://cdn2.hubspot.net/hubfs/2100040/PDFs%202018/MACHINE%20LEARNING%20AI%20MANUFACTURING%20-%20PDF.pdf, Retrieved Nov. 18 2019.
[15]
[16] Schelter, S., Boese, J.-H., Kirschnick, J., Klein, T., and Seufert, S. Automatically tracking metadata and provenance of machine learning experiments. Proceedings of Workshop on ML Systems at the Thirty-first Conference on Neural Information Processing Systems (NIPS 2017 ML Systems Workshop), 2017.
[17] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., and Young, M., Machine learning: the high-interest credit card of technical debt, Proceedings of Twenty-eighth Conference on Neural Information Processing Systems (NIPS 2014 SE4ML Workshop), 2014.
[18]
[19] Terrell, W., Some Fundamental Control Theory I: Controllability, Observability, and Duality, American Mathematical Monthly, 106(9): pp. 705-719, 1999.
[20] Terrell, W., Some Fundamental Control Theory II: Feedback Linearization of Single Input Nonlinear Systems, American Mathematical Monthly, 106(9): pp. 812-828, 1999.
[21]
[22] Wahlster, W., From Industry 1.0 to Industry 4.0: Towards the 4th Industrial Revolution, Forum Business Meets Research, 2012.
[23] Wang, X., Dong, X. L., and Meliou, A. Data x-ray: A diagnostic tool for data errors. In SIGMOD, pp. 1231–1245, 2015. ISBN 978-1-4503-2758-9. doi: 10.1145/2723372.2750549. URL http://doi.acm.org/10.1145/2723372.2750549.
[24] Witten, I. H., Frank, E., and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011. ISBN 0123748569, 9780123748560.
[25] Wuest, T., Weimer, D., Irgens, C., & Thoben, K-D., Machine Learning in Manufacturing: Advantages, Challenges, and Applications, Journal of Production & Manufacturing Research, Vol. 4, Issue 1, 2016.
[26] Xu, L. D., Xu, E. L., et al., Industry 4.0: State of the Art and Future Trends, Proceedings of International Journal of Production Research, pp. 2941-2962, 2018.
[27] Yao, M., 4 Unique Challenges of Industrial Artificial Intelligence, Forbes Post, Retrieved Nov. 18 2019.
[28] Zheng, X., Kim, J. K., Ho, Q., Xing, E. P., Model-Parallel Inference for Big Topic Models, Eprint arXiv:1411.2305, 2014.
[29] Zou, H., Chen, M, & Dey, S., Understanding library user engagement strategies through large-scale Twitter analysis, Proceedings of IEEE Big Data Service and Applications (BigDataService 2015), pp. 361-370, doi: 10.1109/BigDataService.2015.31, 2015.
[30] Zou, H., Chen, M, & Dey, S., Exploring user engagement strategies and their impacts with social media mining: the case of public libraries, Journal of Management Analytics, Vol. 2, Issue 4, pp. 295-313, Dec. 2015. doi: 10.1080/23270012.2015.1100969.
[31] Zou, H., Chen, M, & Dey, S., A quantitative analysis of Pinterest: understanding library user engagement strategies for effective social media use, Journal of Information Technology Management, Vol. 26, Issue 3, pp. 21-32, Oct. 2015.
[32] Zou, H., Yu, Y., Tang, W. & Chen, H., Flexanalytics: a flexible data analytics framework for big data applications with I/O performance improvement, Journal of Big Data Research, pp. 4-13, Aug. 2014.
[33] Zou, H., Schwan, K., Slawinska, M., Eisenhauer, G., Zheng, F., Liu, Q., Klasky, S. et al., FlexQuery: An online query system for interactive remote visual data exploration at large scale, Proceedings of IEEE international conference on Cluster Computing (CLUSTER’13), 2013.
[34] Zou, H., Zheng, F., Eisenhauer, G., et al., Quality-Aware data management for large scale scientific applications, Proceedings of ACM/IEEE High Performance Computing, Networking, Storage and Analysis (SC 2012), Nov. 2012.
[35] Zou, H. et al., A Source-aware Interrupt Scheduling for Modern Parallel I/O Systems, Proceedings of ACM/IEEE 26th Parallel and Distributed Processing Symposium (IPDPS 2012), pp. 156-166, 2012.
[36] Zheng, F., Zou, H., FlexIO: IO middleware for location-flexible scientific data analytics,