Big Data: Circuit Breaker Failure Rate Analysis



As noted previously, it's important to have a clear objective when performing data mining. In this case the goal is to determine the failure rate for the Fails To Open (FTO) failure mode of a population of 12 kV distribution circuit breakers. The failure rate is calculated from the population of breakers, the time span the data covers, and the number of failures (failures per breaker per year = failures / (breakers x years)). For a better understanding, the distribution of those failures in time is also needed. Different tools were employed in this case. An OLAP application was used to do the majority of the data processing, visualization and calculations. However, because of certain application limitations, and because this was more of a one-time exercise, some key information was visualized in MS Access.
To expose the realities of equipment asset management data mining, this example shows some work done with circuit breaker work order data (about 6,900 records) from a mainframe legacy work order system. The source data does not provide explicit identification of failures, work order type (i.e., PM, CM, etc.), or the priority of the work. However, the comments field included enough information to allow manual classification of the records. An investment of a few person-weeks was required for the classification.

The data was extracted from the mainframe database as a "flat file," or structured text file. In addition to entering codes for grouping and sorting the data, effort was needed to convert the multiple records for each work order (used in the mainframe dump to accommodate varying-length work order completion comments) into a single record per work order. As can be seen in Figure 1, a little less than 50% of the records were "unique" work orders.


Figure 1
The data was processed to separate unique and duplicate records. The records with indications of equipment degradation or failure were segregated and grouped by failure code and eventually by year. These steps and results are reviewed below; they point to actions that should be undertaken immediately by any organization concerned with equipment failures that uses a work management system or CMMS.

The elimination of duplicate records required understanding the context of the records, i.e., the maximum number of records for each work order. Other, more general means (addressing unknown numbers of duplicates) could be used with this and other tools. Whatever the technique, eliminating duplication is one hurdle often encountered with mainframe data dumps.
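As an illustration of the merge step, the sketch below collapses the multi-record dump into one record per work order. This is not the tool actually used in the study; the column names wo_number and comment are hypothetical stand-ins for whatever the flat file contains.

    import csv

    def merge_continuation_records(path):
        # Collapse the multi-record mainframe dump into one record per
        # work order. Each continuation row repeats the work order number
        # and carries one fragment of the variable-length comment.
        merged = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                wo = row["wo_number"]  # hypothetical column name
                if wo in merged:
                    merged[wo]["comment"] += " " + row["comment"]
                else:
                    merged[wo] = dict(row)
        return list(merged.values())

    records = merge_continuation_records("cb_work_orders.csv")
    print(len(records), "unique work orders")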

The raw records indicated 1,447 "failure" records, and another query showed that there were 674 unique circuit breakers documented from 1/1990 to 4/2000. Removing the duplicates shows only 539 failures for the same 674 circuit breakers over the same time range, as can be seen in Figure 2.


Figure 2
Eliminating the duplicate records reduces the gross failure rate for these circuit breakers by a factor of 2.7, from 0.201 to 0.078 failures per CB per year. Unless duplicate records are eliminated (and other data problems solved) before assessing failure rates, the conclusions may mislead the decision makers.
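The arithmetic behind those figures is simple. A quick sketch using the numbers reported above; the exact time span used in the original calculation is not stated, so the 10.25-year figure (1/1990 through 4/2000) is an assumption:

    breakers = 674
    years = 10.25  # 1/1990 through 4/2000; assumed span

    gross = 1447 / (breakers * years)  # raw "failure" records
    net = 539 / (breakers * years)     # after duplicate removal

    print(round(gross, 3))        # ~0.21 failures per CB per year
    print(round(net, 3))          # ~0.078 failures per CB per year
    print(round(gross / net, 1))  # ~2.7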

Are there other aspects of the data that need to be considered? Do you need to look further? The failure rate is represented by over 3,000 records for 674 circuit breakers over 10+ years. Certainly the results are "statistically" significant. But more can be learned by drilling down further into the data. The first drill-down is by failure mode, since the FTO failure mode is of interest. The results are shown in Figure 3.
 

Figure 3
Only 22 of the 539 "failures" were in the FTO failure mode. About one in ten of the almost 3,000 PM records (i.e., those with the MINOR designation) notes either major or minor degradation. This is a finding in itself: if only 10% of the PM tasks find any degradation, should the interval be extended? More detailed information on as-found conditions is needed to make that determination. The Fail To Close (FTC) and Fail to Carry Load (FCL) failures are a concern, but the FTO failures present the greatest risk. The FTO failure probability appears to be 0.0032 failures per CB-year. However, the failure rate changes over time. Drilling down into the FTO data of Figure 3 by year shows interesting changes over the years, presented in Figure 4.
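Once the records carry the manually assigned codes, the drill-downs behind Figures 3 and 4 amount to simple group-bys. A minimal sketch, assuming hypothetical failure_code and year fields on each merged record:

    from collections import Counter

    def drill_down(failures):
        # Figure 3: failure count by failure mode code.
        by_mode = Counter(r["failure_code"] for r in failures)
        # Figure 4: FTO failures by year.
        fto_by_year = Counter(r["year"] for r in failures
                              if r["failure_code"] == "FTO")
        return by_mode, fto_by_year

    # Toy records; the real data held 539 failures, 22 of them FTO.
    sample = [{"failure_code": "FTO", "year": 1993},
              {"failure_code": "FTC", "year": 1995}]
    print(drill_down(sample))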


Figure 4
As can be seen in Figure 4, the vast majority of the FTO failures occurred in 1993 and did not happen again between 1993 and April of 2000. The few failures in 1990 and 1992 may have been leading indicators of a problem. This is the kind of flag to look for in equipment failure data. The occurrence of "high risk" failures, even in low numbers, is significant, and detecting and reporting such occurrences should be a goal of data mining.
The next question that comes to the data miner's mind when viewing Figure 4 is, "What stopped the failures?" Were they all due to one type of circuit breaker, so that replacement addressed the problem? There were 5 different manufacturers represented in the 22 failures, so a common design issue seems unlikely. A look at the PM program, in terms of interval and circuit breakers worked on per year, proves interesting. In this case the OLAP tool was used to determine the PM interval for each PM record but could not easily be used to group that same data and produce the average interval per year. So the data was exported to a spreadsheet and then imported into MS Access, where a query was used to aggregate the data by year and average the PM intervals, as shown in Table 1 below. The count of PM work orders by year was also produced and is shown in Table 2.
 
 Year   Avg. PM Interval (years)
 1990   0.900
 1991   0.960
 1992   1.069
 1993   1.589
 1994   0.957
 1995   1.208
 1996   0.964
 1997   1.241
 1998   1.674
 1999   2.603
 2000   4.203
Table 1

 Year   Interval Est. (years)   Count of PM WOs
 1990   2.844                   237
 1991   2.797                   241
 1992   2.160                   312
 1993   1.332                   506
 1994   1.694                   398
 1995   1.000                   668
 1996   1.221                   552
 1997   4.747                   142
 1998   19.257                  35
 1999   14.978                  45
 2000                           4
Table 2
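For readers without Access at hand, the aggregation behind Tables 1 and 2 is a group-by on year. A minimal Python sketch, using hypothetical year and pm_interval field names:

    from collections import defaultdict

    def pm_summary(pm_records):
        # Tables 1 and 2: average PM interval and PM work order count,
        # grouped by completion year.
        by_year = defaultdict(list)
        for r in pm_records:
            by_year[r["year"]].append(r["pm_interval"])
        return {yr: (sum(v) / len(v), len(v))
                for yr, v in sorted(by_year.items())}

    summary = pm_summary([{"year": 1995, "pm_interval": 1.2},
                          {"year": 1995, "pm_interval": 0.8}])
    print(summary)  # {1995: (1.0, 2)}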
                                                                   
It is important to know whether the intended PM interval is being followed or not. In this data the intended interval was not stated, so an algorithm determined the "actual" PM interval for each record. This may be necessary even with data that states a PM interval, since it is not clear that the planned interval is actually being followed. As can be seen in Table 1 and Table 2, a significant change was taking place over the time span of the data. Two factors produce the apparent one-year PM interval from 1990 to 1996 in Table 1, and the PM counts in Table 2 indicate that it is not entirely accurate: the interval algorithm assigns a value of 1 year to records for which no earlier date is available, which biases the early averages. Another approach would be to drop those records from the analysis. However, an estimate of the PM interval based on the ratio of the number of circuit breakers to the number of PM tasks per year tells a new story.
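Both calculations are easy to sketch. A minimal illustration follows, assuming hypothetical breaker_id and date fields; the reproduced "Interval Est." values agree with Table 2.

    def assign_pm_intervals(pm_records, default_years=1.0):
        # "Actual" PM interval: gap since the same breaker's previous PM.
        # Records with no earlier date get the 1-year default, which
        # inflates the apparent PM frequency early in the data set.
        last_pm = {}
        for r in sorted(pm_records, key=lambda r: r["date"]):
            prev = last_pm.get(r["breaker_id"])
            r["pm_interval"] = ((r["date"] - prev).days / 365.25
                                if prev else default_years)
            last_pm[r["breaker_id"]] = r["date"]
        return pm_records

    def interval_estimate(breaker_count, pm_counts_by_year):
        # "Interval Est." column of Table 2: breakers per PM task per year.
        return {yr: round(breaker_count / n, 3)
                for yr, n in pm_counts_by_year.items() if n}

    print(interval_estimate(674, {1990: 237, 1998: 35}))
    # {1990: 2.844, 1998: 19.257} -- matches Table 2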
The 1993 FTO failures were met with a massive PM campaign, working on almost every CB each year over the next three years. The transition evident in 1999 and 2000 (although the 2000 data is incomplete) indicates a shift back to a three- or four-year PM interval, picking up only those circuit breakers that were coming due again. The expectation would be that the PM count would grow rapidly back to about 224 per year.

So, with significant manual effort to prepare the data and less than half a day of analysis, a great deal can be found in data not designed for that purpose. One lesson here is that planning for the purposeful use of data, through the implementation of sound business practices and properly designed data systems, enables detailed understanding with a small investment of analysis effort. Without that preparation, the investment needed to find the same answers increases many times over.
