## Data Pretreatment: Outliers and Data Reconciliation

Data pretreatment is necessary to assure that data used in modeling, monitoring and control activities provide an accurate representation of what is happening in a process. Data corruption may be caused by failures in sensors or transmission lines, process equipment malfunctions, erroneous recording of measurement and analysis results, or external disturbances. These faults would cause data to have spikes, jumps, or excessive oscillations. For example, sensor faults cause bias change, drift or increase in signal noise and result in abnormal patterns in data. The general strategy is to detect data that are not likely based on other process information (outlier detection) and to substitute these data with estimated values that are in agreement with other process information (data reconciliation). The implementation of this simple strategy is not straightforward. A significant change in a variable reading may be caused by a process equipment fault or a sensor fault. If the change in signal magnitude is due to an equipment fault, this change reflects what is really happening in the process and the signal value should not be modified. Corrective action should be taken to eliminate the effect of this disturbance. However, if the magnitude of the signal has changed because of a sensor fault, the process is most likely behaving the way it should, but the information about the process (measured value) is wrong. In this case, the signal value must be modified. Otherwise, any action taken based on the erroneous reading would cause an unwarranted process upset. The challenge is to decide when the significant change is caused by something that is happening in the process and when it is caused by erroneous reporting of measurements. This necessitates a comprehensive effort that includes signal noise reduction, fault detection, fault diagnosis, and data reconciliation. However, many of these activities rely on the accuracy of measurement information as well.

This section focuses on detection of outliers and gross errors, and data reconciliation. Data reconciliation involves both the elimination of gross errors and the resolution of the contradictions between the measurements and their constraints such as predictions from the model equations of the process. Detection and reduction of random signal noise are discussed in Section 3.5. Techniques for fault diagnosis and sensor fault detection are presented in Chapter 8. The implementation of these techniques must be well coordinated because of the interactions among signal conditioning, fault detection and diagnosis, process monitoring and control activities. The use of an integrated supervisory knowledge-based system for on-line process supervision is discussed in Chapter 8.

Most outlier detection and data reconciliation techniques are developed for continuous processes where the desired values of the important process variables are constant for extended periods of time. Consequently, these techniques are often based on the existence of stationary signals (constant mean value over time) and many of them have focused on assessment and reconciliation of steady state data. The time-dependent nonstationary data that batch processes generate may necessitate modification of the techniques developed for continuous processes prior to their application for batch processes.

### 3.4.1 Data Reconciliation

The objective of data reconciliation is to convert contaminated process data into consistent information by resolving contradictions between measurements and constraints imposed on them by process knowledge. The simplest reconciliation case is steady-state linear data reconciliation, which can be posed as a quadratic programming problem. Given the vector of process measurements $x$, the vector of unmeasured variables $v$, and the data adjustments $a$ (whose final values are the solution of the optimization problem), the data reconciliation problem is formulated as

$$\min_{a} F(a) = a^T \Sigma_1^{-1} a \quad \text{such that} \quad B_1 (x + a) + P v = 0 \qquad (3.14)$$

where $B_1$ and $P$ are the matrices of coefficients corresponding to $x$ and $v$ in Eq. 3.14 and $\Sigma_1$ is the covariance matrix of $x$. For example, if a leak detection problem is being formulated, $x$ will consist of component flow rates and the constraint equations will be the material balances. Matrix projections can be used to remove the unmeasured variables [113] such that the constraints in Eq. 3.14 are transformed to a reduced set of process constraints that retain only the measured variables. The covariance matrix of the reduced constraints is

$$H_e = \operatorname{cov}(e) = B_r \Sigma_1 B_r^T \qquad (3.15)$$

where $B_r$ is the "material balance" coefficient matrix of the reduced constraints with a residual vector (reduced balance residuals)

$$e = B_r x \qquad (3.16)$$

The optimal value of the objective function $F$ is

$$F = e^T H_e^{-1} e$$

which follows a chi-squared ($\chi^2$) distribution with $m$ degrees of freedom, where $m$ is the rank of $H_e$ [113]. Additional restrictions such as flow rates being positive or zero may be introduced so that $(x + a)$ is not negative. This framework can be combined with principal components analysis for gross error detection and reconciliation [259, 591].
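The reduced reconciliation problem has a closed-form solution, which the following minimal sketch works through. It assumes the unmeasured variables have already been projected out, so that the constraints read $B_r (x + a) = 0$, and that $B_r \Sigma_1 B_r^T$ is invertible. The function name `reconcile_linear` and the three-stream splitter data are hypothetical and serve only as illustration.

```python
import numpy as np

def reconcile_linear(x, B_r, Sigma):
    """Minimize a^T Sigma^{-1} a subject to B_r (x + a) = 0 (measured variables only)."""
    e = B_r @ x                        # reduced balance residuals
    H_e = B_r @ Sigma @ B_r.T          # covariance of the residuals
    lam = np.linalg.solve(H_e, e)      # H_e^{-1} e (assumes H_e is invertible)
    a = -Sigma @ B_r.T @ lam           # optimal adjustments
    F = float(e @ lam)                 # optimal objective value, e^T H_e^{-1} e
    m = int(np.linalg.matrix_rank(H_e))  # degrees of freedom for the chi-squared test
    return x + a, a, F, m

# Hypothetical splitter: stream 1 = stream 2 + stream 3
B_r = np.array([[1.0, -1.0, -1.0]])
Sigma = np.diag([0.04, 0.01, 0.01])    # assumed measurement variances
x = np.array([10.3, 6.1, 3.9])         # raw measurements (balance is off by 0.3)

x_hat, a, F, m = reconcile_linear(x, B_r, Sigma)
```

In this toy case the adjustments are spread in proportion to the measurement variances, the reconciled flows close the balance, and $F = 1.5$ with one degree of freedom. This is below the 95% chi-squared critical value of about 3.84, so no gross error would be flagged.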

Other data reconciliation and gross error detection paradigms have been proposed for linear processes operating at steady state. A serial strategy for detecting and identifying multiple gross errors sequentially eliminates measurements susceptible to gross errors, recomputes a test statistic, and compares it against a critical value [258, 519]. The generalized likelihood ratio method (Section 8.3) for identifying abrupt changes [651] has been proposed to discriminate between gross measurement errors and process faults (for example, between malfunctions of flow rate sensors and leaks) [409]. The approach has been extended to dynamic linear processes [410]. Techniques based on Kalman filters [563], maximum likelihood functions [516, 517], successively linearized horizons [491], orthogonal collocation [341], neural networks [269], and discretization of differential-algebraic equation systems [12] have also been developed. Recent books [29, 518] provide details of many techniques for data reconciliation and gross error detection.

The first critical step in data reconciliation is the detection, identification and elimination of gross errors. Some strategies and methods to carry out this task are presented in Section 3.4.2.

### 3.4.2 Outlier Detection

Outliers or gross errors corrupt process data. Spikes in data that are candidates for outliers can be detected easily by visual inspection. Statistical tools or heuristics can then be used to assess validity of the spikes as outliers, and based on this assessment, data analysis, modeling, and/or monitoring activities may be undertaken. Detection of outliers is critical for having reliable data to make decisions about the operation of a process and to develop empirical (data based) models. Consequently, the literature on outliers is dispersed in statistics, process engineering and systems science as robust estimation, regression, system identification, and data analysis. Since many references from the process engineering literature have been provided in Section 3.4.1, outlier detection methods developed by statisticians are outlined first in this section. This is followed by a discussion of outlier detection in multivariable systems by principal components analysis (PCA).

Various books [42, 225, 599, 523] and survey papers [20, 41, 201, 476, 512] in the statistics literature provide a good account of many techniques used in outlier detection. Outlier detection in time series has received significant attention [96, 281, 538]. Fox [159] distinguished two types of outliers: Type I, the additive outlier (AO), consisting of an error that affects only a single observation, and Type II, the innovational outlier (IO), consisting of an error that affects a particular observation and all subsequent observations in the series. Abraham and Chuang [3] considered regression analysis for detection of outliers in time series. Lefrancois [332] developed a tool for identifying over-influential observations in time series, and presented a method for obtaining various measures of influence for the autocorrelation function, as well as thresholds for declaring an observation over-influential.

A popular outlier detection technique in time series is the leave-one-out diagnostic idea for linear regression where one deletes a single observation at a time, and for each deletion computes a Gaussian maximum likelihood estimate (MLE) (Section 8.3) for the missing datum [80, 222, 283]. Because more than one outlier may exist in data, some outliers may be masked by other dominating outliers in their vicinity. A patch of outlying successive measurements is common in time series data, and masking of outliers by other outliers is a problem that must be addressed. One approach for determining patches of outliers is the generalization of the leave-one-out technique to the leave-k-out diagnostics. However, at times the presence of a gross outlier will have sufficient influence such that deletion of aberrant values elsewhere in the data has little effect on the estimate. More subtle types of masking occur when moderate outliers exist close to one another [379]. These types of masking can often be effectively uncovered by an iterative deletion process that removes suspected outlier(s) from the data and recomputes the diagnostics.
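The leave-one-out idea can be illustrated with an ordinary linear regression. The sketch below is a simplified stand-in for the time series diagnostics cited above: it deletes one observation at a time, refits by least squares (which coincides with the Gaussian maximum likelihood estimate under independent Normal errors), and records how far the deleted observation lies from its prediction. The function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def leave_one_out_residuals(t, y):
    """Deleted residuals: y[i] minus its prediction from a fit that excludes point i."""
    n = len(y)
    A = np.column_stack([np.ones(n), t])     # intercept and slope columns
    deleted = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i             # drop observation i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        deleted[i] = y[i] - A[i] @ coef      # estimate of the missing datum vs. its value
    return deleted

# Synthetic, noise-free trend with one spike added at index 7
t = np.arange(20.0)
y = 2.0 * t + 1.0
y[7] += 10.0

d = leave_one_out_residuals(t, y)
suspect = int(np.argmax(np.abs(d)))          # index with the largest deleted residual
```

With a single spike the largest deleted residual points directly at the corrupted sample. With a patch of neighboring outliers, the same loop would need to delete k points at a time or be applied iteratively, as discussed above.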

Several modeling methods have been proposed to develop empirical models when outliers may exist in data [91, 595]. The strategy used in some of these methods first detects and deletes the outlier(s), then identifies the time series models. A more effective approach is to accommodate the possibility of outliers by suitable modifications of the model and/or method of analysis. For example, mixture models can be used to accommodate certain types of outliers [10]. Another alternative is the use of robust estimators that yield models (regression equations) that represent the data accurately in spite of outliers in data [69, 219, 524]. One robust estimator, the $L_1$ estimator, uses the least absolute values regression estimator rather than the traditional least sum of squares of the residuals (the least squares approach). The magnitudes of the residuals (the differences between the measured values and the values estimated by the model equation) have a strong influence on the model coefficients. Usually an outlier yields a large residual. Because the least squares approach takes the square of the residuals (hence it is called $L_2$ regression, indicating that the residual is squared), outliers distort the model coefficients more than in $L_1$ regression, which uses the absolute values of the residuals [523]. An improved group of robust estimators includes the M estimator [216, 391, 245], which substitutes a function of the residual for the square of the residual, and the Generalized M estimator [15, 217], which includes a weight function based on the regressor variables as well. An innovative approach, the least trimmed squares (LTS) estimator, uses the first $h$ ordered squared residuals in the sum of squares ($h < n$, where $n$ is the number of data points), thereby excluding the $n - h$ largest squared residuals from the sum and consequently allowing the fit to stay away from the influence of potential outliers [523]. A different robust estimator, the least median of squares (LMS) estimator, minimizes the median of the squared residuals and is more tolerant of outliers in both dependent and independent (regressor) variables [523].
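The difference between least squares and least trimmed squares can be seen on a small synthetic example. The sketch below is not the exact LTS estimator, which requires a combinatorial search. It uses a simple local search (repeated "concentration" steps started from the ordinary least squares fit) that refits on the $h$ points with the smallest squared residuals until the subset stops changing. The function names, the single starting point, and the data are assumptions. Practical implementations typically use many random starts.

```python
import numpy as np

def fit_ls(x, y):
    """Ordinary least squares line; returns [intercept, slope]."""
    A = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def fit_lts(x, y, h, max_iter=50):
    """Approximate LTS: keep the h smallest squared residuals and refit, repeatedly."""
    A = np.column_stack([np.ones(len(x)), x])
    coef = fit_ls(x, y)                            # start from the non-robust fit
    subset = None
    for _ in range(max_iter):
        r2 = (y - A @ coef) ** 2
        new_subset = np.sort(np.argsort(r2)[:h])   # indices of the h best-fitting points
        if subset is not None and np.array_equal(new_subset, subset):
            break                                  # subset unchanged: local optimum
        subset = new_subset
        coef = fit_ls(x[subset], y[subset])
    return coef, subset

# Synthetic line y = 2x + 1 with three gross outliers at the high-leverage end
x = np.arange(20.0)
y = 2.0 * x + 1.0
y[17:] += 50.0

ls_coef = fit_ls(x, y)
lts_coef, kept = fit_lts(x, y, h=15)
```

On these data the least squares slope is pulled well away from 2 by the three outliers, while the trimmed fit excludes them and recovers the underlying line.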

Subspace modeling techniques such as principal components analysis (PCA) provide another framework for outlier detection [224, 591, 592] and data reconciliation. PCA is discussed in detail in Section 4.1. One advantage of PCA-based methods is their ability to make use of the correlations among process variables, while most univariate techniques are of limited use because they ignore variable correlations. A method that integrates PCA and sequential analysis [592] to detect outliers in linear processes operated at steady state is outlined in the following paragraphs. Then, a PCA-based outlier detection and data reconciliation approach for batch processes is discussed.

PCA can be used to build the model of the process when it is operating properly and the data collected do not have any outliers. In practice, the data sets from good process runs are collected, inspected and cleaned first. Then the PCA model is constructed to provide the reference information. When a new batch is completed, its data are transformed using the same PCs and its scores (see Section 4.1) are compared to those of the reference model. Significant increases in the scores indicate potential outliers. Since the increases in scores may be caused by abnormalities in process operation, the outlier detection activities should be integrated with fault detection activities. The PCA framework can also be used for data reconciliation as illustrated in the example given in this section.
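A minimal sketch of this workflow is given below. It assumes autoscaled data, a PCA model obtained by singular value decomposition, and the two summary statistics used in the example later in this section: $T^2$ for the scores and SPE for the residual. The function names and the synthetic three-variable data set are hypothetical.

```python
import numpy as np

def fit_pca(X_ref, n_pc):
    """Build a reference PCA model from clean data (rows = observations)."""
    mean = X_ref.mean(axis=0)
    scale = X_ref.std(axis=0, ddof=1)
    Z = (X_ref - mean) / scale                       # autoscaling
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_pc].T                                  # loadings (variables x n_pc)
    score_var = s[:n_pc] ** 2 / (len(X_ref) - 1)     # variance of each score
    return mean, scale, P, score_var

def monitor(x_new, mean, scale, P, score_var):
    """Project a new observation and return scores, T^2, SPE, SPE contributions."""
    z = (x_new - mean) / scale
    t = z @ P                                        # scores on the retained PCs
    t2 = float(np.sum(t ** 2 / score_var))           # Hotelling T^2
    resid = z - t @ P.T                              # part not explained by the model
    spe = float(resid @ resid)                       # squared prediction error
    return t, t2, spe, resid ** 2

# Synthetic reference data: three correlated variables driven by one latent factor
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X_ref = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
model = fit_pca(X_ref, n_pc=1)

x_good = np.array([0.5, 1.0, -0.5])                  # consistent with the correlation
x_bad = x_good + np.array([0.0, 3.0, 0.0])           # spike in the second variable
_, t2_good, spe_good, _ = monitor(x_good, *model)
_, t2_bad, spe_bad, contrib = monitor(x_bad, *model)
worst = int(np.argmax(contrib))                      # variable with largest SPE contribution
```

The spiked observation breaks the correlation structure, so its SPE is much larger than that of the consistent observation, and the largest SPE contribution comes from the spiked variable. The other variables also pick up some residual, which is one reason contribution plots should be read together with the raw trajectories.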

Consider a set of linear combinations of the reduced balance residuals $e$ defined in Eq. 3.16:

$$y_e = \Lambda_e^{-1/2} U_e^T e$$

where $\Lambda_e$ is a diagonal matrix whose elements are the magnitude-ordered eigenvalues of $H_e$ (Eq. 3.15). Matrix $U_e$ contains the orthonormalized eigenvectors of $H_e$ (PCA computations are discussed in detail in Section 4.1). The elements of vector $y_e$ are called PC scores and correspond to individual principal components (PCs). The random variable $e$ has a statistical distribution with mean $0$ and covariance matrix $H_e$ ($e \sim (0, H_e)$). Consequently, $y_e \sim (0, I)$, where $I$ denotes the identity matrix (a diagonal matrix with ones on the main diagonal), and the correlated variables $e$ are transformed into an uncorrelated set ($y_e$) with unit variances. Often the measured variables are Normally distributed about their mean values. Furthermore, the central limit theorem would be applicable to the PCs. Consequently, $y_e$ is assumed to follow a Normal distribution ($y_e \sim N(0, I)$) and the test statistic for each PC is

$$y_{e,i} = \left( \Lambda_e^{-1/2} U_e^T e \right)_i \sim N(0, 1)$$

which can be tested against tabulated threshold values. When an outlier is detected by noting that one or more $y_{e,i}$ are greater than their threshold values, Tong and Crowe [592] proposed the use of contribution plots (Section 8.1) to identify the cause of the outlier detected. They have also advocated the use of a sequential analysis approach [625] to make statistical inferences, with fewer observations, about whether the mean values of the PCs are zero.
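The transformation of the balance residuals into PC scores can be sketched as follows, taking the scores to be the residuals projected on the eigenvectors of the residual covariance matrix and scaled by the inverse square roots of the eigenvalues. The sketch assumes that the covariance matrix is positive definite and uses the two-sided 5% standard Normal limit of 1.96 for each score, without any correction for testing several scores at once. The function name and the numerical values are hypothetical.

```python
import numpy as np

def pc_test(e, H_e, threshold=1.96):
    """Transform residuals e ~ (0, H_e) into unit-variance PC scores and flag large ones."""
    eigval, eigvec = np.linalg.eigh(H_e)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]         # reorder so the largest eigenvalue comes first
    lam, U = eigval[order], eigvec[:, order]
    W = U / np.sqrt(lam)                     # U Lambda^{-1/2} (column j scaled by lam_j^{-1/2})
    y = W.T @ e                              # PC scores
    return y, np.abs(y) > threshold, W

# Hypothetical covariance of two reduced balance residuals and one observed residual vector
H_e = np.array([[2.0, 0.6],
                [0.6, 1.0]])
e = np.array([0.5, -3.0])

y, flagged, W = pc_test(e, H_e)
```

Because the scores are uncorrelated with unit variance, the sum of their squares equals the global chi-squared statistic of the reconciliation problem, so the individual scores show how that statistic is distributed over the principal components.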

Outlier detection in batch processes can be done by extending the PCA based approach by using the multiway PCA (MPCA) framework discussed in Section 4.5.1. The MPCA model or reference models based on other paradigms such as functional data analysis (Section 4.4) representing the reference trajectories can also be used for data reconciliation by substituting "reasonable" estimated values for outliers or missing observations. The example that follows illustrates how the MPCA models can be used for outlier detection and data reconciliation.
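Before the example, the substitution step itself can be sketched in code. The sketch below is one simple way to obtain "reasonable" estimates and is an assumption rather than the procedure used in the example: batch data are unfolded batch-wise, a PCA model is fitted to the unfolded reference set, and entries flagged as outliers in a new batch are replaced by iterating between projection on the model and refilling the flagged entries. All names and the small noise-free data set are hypothetical, and the column ordering produced by `reshape` is only one of several valid unfolding conventions.

```python
import numpy as np

def unfold_batchwise(X3):
    """Unfold (batches, variables, samples) into (batches, variables * samples)."""
    return X3.reshape(X3.shape[0], -1)

def fit_reference(X2, n_pc):
    """Autoscale the unfolded reference data and return mean, scale, and loadings."""
    mean = X2.mean(axis=0)
    scale = X2.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd((X2 - mean) / scale, full_matrices=False)
    return mean, scale, Vt[:n_pc].T

def reconcile(x_new, bad, mean, scale, P, n_iter=50):
    """Replace entries flagged in `bad` by estimates consistent with the PCA model."""
    z = (x_new - mean) / scale
    z = np.where(bad, 0.0, z)                # start flagged entries at the reference mean
    for _ in range(n_iter):
        z_hat = (z @ P) @ P.T                # reconstruction from the retained PCs
        z = np.where(bad, z_hat, z)          # update only the flagged entries
    return z * scale + mean

# Hypothetical noise-free reference set: 10 batches, 2 variables, 3 samples each
direction = np.array([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])
center = np.arange(6.0)
X3 = (center + np.outer(np.linspace(-1.0, 1.0, 10), direction)).reshape(10, 2, 3)
mean, scale, P = fit_reference(unfold_batchwise(X3), n_pc=1)

x_true = center + 0.3 * direction            # a new batch that follows the same pattern
x_bad = x_true.copy()
x_bad[4] += 10.0                             # gross error in one entry
bad = np.zeros(6, dtype=bool)
bad[4] = True
x_rec = reconcile(x_bad, bad, mean, scale, P)
```

In this idealized case the reference data lie exactly on a one-dimensional subspace, so the reconciled entry matches the uncorrupted value. With noisy process data the estimate is only as good as the reference model, and the flagged entries must come from a prior detection step.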

Example. Consider a data set collected from a fed-batch penicillin fermentation process. Assume that there are a few outliers in some of the variables such as glucose feed rate and dissolved oxygen concentration due to sensor probe failures. This scenario is realized by adding small and large outliers to the values of these variables as shown in Figure 3.4. Locations of the outliers for the two variables are shown in Table 3.8.

Table 3.8. Locations of the outliers for the two variables.

| Variable | Locations of the outliers (sample no.) |
|---|---|
| Glucose feed rate (no. 3) | 500, 750, 800, 1000, 1500, 1505, 1510 |
| Dissolved O2 conc. (no. 6) | 450, 700, 900, 1400, 1405, 1410 |

Figure 3.4. Raw data profiles containing outliers (horizontal axis: sample number).

In this example, a multiway PCA (MPCA) model with four principal components is developed from a reference set (60 batches, 14 variables, 2000 samples). A number of multivariate charts are then constructed to reveal the variables that might contain outlying data points and the locations of the outliers in those variables. The first group of charts to inspect consists of the SPE and $T^2$ charts and the charts showing variable contributions to these statistics. Contribution plots are discussed in detail in Section 8.1. Both SPE and $T^2$ charts signal the outliers and their locations correctly (Figure 3.5), but they do not give any information about which variable or variables have outliers. At this point of the analysis, plots of the contributions to the SPE and $T^2$ values are inspected to find the variables responsible for inflating SPE and $T^2$. Since outliers will be projected farther from the plane defined by the MPCA model, their SPE values are expected to be very high. Consistent with this, the SPE contribution plot indicates two variables (variables 3 and 6, glucose feed rate and dissolved oxygen concentration, respectively). The $T^2$ contributions convey similar information, although variable 3 is less obvious in that chart, whereas variable 6 can be clearly distinguished. Now that the variables with outliers and the overall locations of outliers are identified, the locations of outliers in each variable need to be found. This task can be accomplished either by directly inspecting individual variable trajectories (Figure 3.4) or by using multivariate temporal contribution plots (to both SPE and $T^2$) for the identified variables (Figure 3.6). Some of the critical change points on these curves (Figure 3.6) indicate the multivariate nature of the process and the important events that take place. For instance, the sudden drop in variable 3 at sample 450 is due to the corresponding outlier in variable 6, and the dip in variable 6 around sample 300 indicates the switch from batch to fed-batch operation. In this example, all of the outliers are clearly detected by using PCA. To prevent multivariate charts from signaling that the process is out of control, these outliers are marked for removal. However, for further analysis and modeling purposes, they should be replaced with estimates. The ability of the PCA technique
