MRO Magazine

Manage failure – Integrate CBM information to optimize decision support

December 21, 2017 | By Jeff Smith


The Situation

A bearing fault! With gear noise! Okay, this is bad and it’s on a large planetary gear set. I quickly (well, quickly for someone relatively new to vibration) put together a report showing the issue. This was one of my first big finds early in my career, back when I still had delusions of grandeur: the CBM guy saves the day! This was an $80,000 gearbox and I was going to make the call so we could shut down the pulp mill and fix it. Wanting to be as prepared as possible, I checked inventory and, yes, we had one in stock. I knew enough to estimate the MTTR (Mean Time to Repair) and have strategies in place to minimize it. All set, I walked to the maintenance manager’s office. He was on the phone, so I paced outside until he beckoned me in and continued with his call. I casually observed the electro-luber he had set on an upper shelf of his bookcase; when I had suggested using them, he told me he did not believe they worked. With a smirk, I noted it was still diligently pumping grease onto his shelf, but that is another story.

He finished his call and held out his hand for the report. I tried to contain myself while he read it. As he put it down, I started telling him about the bearing fault, the gearbox being on site, the MTTR, all of which he had just read in the report. Ready to spring into action, I asked him how soon we could run out the line to replace the gearbox. “How long will it last?” was his reply.

What do you mean how long will it last? The bearing is already failing! Considering his question I realized what most seasoned “vibe guys” know – you don’t know for sure how long something will last. It would be a guess. As I sat dumbfounded, he simplified the question: “Do you think it will last a week?” My reply was, “Possibly… but if it fails it will cost more in cascading damage to fix it and we will not be able to properly run out the line.” Without further consideration he told me we were going to run for one week and then shut down to repair it. “Take readings daily.”


One week later, the bearing was in late stages of failure but we had made it. The line was shut down and the gearbox replaced. Still resentful of the decision, I pointed out the cascading damage that had occurred to the shaft and bearing, the metal and debris that had pitted numerous gears. Had we shut down when I wanted we would have saved thousands! The maintenance manager gave me the look that teachers give when the student just isn’t getting it. “Jeff, we were on spec and on grade for a type of pulp that is hard to make. If we shut down when you wanted we would have lost over $500,000 and would have had a very hard time getting back on spec and grade. I chose to chance it and finish the order.”

What’s the goal?

In condition-based monitoring, the value added is the ability to plan and execute the repair outside your operating campaign if possible, or at least to minimize the MTTR. Predicting the failure does not make it go away. Yes, you can argue that if a problem is detected early, a step such as alignment can eliminate the need for secondary action. But alignment itself is the repair. Another example would be a reported oil condition that requires an oil change to prevent cascading damage. The logic I would use is that CBM detected the oil issue, so the task to execute is the oil change. Some groups consider that every CBM finding prevents a catastrophic failure of the component and report the value of the program that way; I find it kind of foolish to report CBM wins that far exceed the maintenance budget… Just saying.

So what is the real deliverable of a CBM program? How does it make my world better if I have a budget and choose to spend part of it detecting failures? My perspective is simple: CBM provides information that can enable an organization to make informed decisions. The more information I have and the better I correlate it, the more I can drive choices that align with my organizational goals. So what are a few examples of organizational goals?

  • Production goals: What are the required tons, units, kilowatt-hours, or whatever the purpose of the industrial facility is? In the case of the pulp mill it was on-spec, on-grade tonnes. So Question 1 is: How do I manage to attain my production numbers with the onset of a CBM-detected failure?
  • Operational campaign: 24/7 with annual shutdown, batch process, 5/8 with weekend maintenance, seasonal, mine plan, etc. The detectability of a failure mode should be advanced enough to provide sufficient lead time to intervene outside the desired operational campaign. (One site I was at had a four-year, 24/7 operational campaign. I was not surprised to find out they had never made it.) Question 2: How do I manage the failure to align with the operational campaign?
  • Safety goals: Reliability and safety are interrelated. Reactive organizations tend to have more incidents. Question 3: What are the safety consequences if this potential failure becomes catastrophic?
  • Environmental goals. Question 4: What are the environmental consequences if this potential failure becomes catastrophic?
  • Intangible value destruction. Often overlooked, this can sink companies; intangible value is the stakeholders’ perception of the company. This could be shareholders, employees or the public in general. Question 5: Will this impact your organization negatively in an intangible way? An example of this would be a reliability issue with exploding cell phones.

These questions form the framework to determine if you should shut down or manage the failure. The process of managing the failure and making the correct decisions is often based on failure progression. Most things that fail will reach a point where they enter a rapid failure profile (the edge of catastrophic). Let’s look at the progression of a bearing failure, for example. As the loaded balls roll around in their race they actually distort the race, stressing the sub-surface. This causes sub-surface cracks that migrate to the surface. As the balls hit the cracks, a ball-pass frequency is detectable. This is the point at which the vibration tech knows there is an issue, establishes the severity (stage), reports it and continues to monitor.
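For readers who want to see where a ball-pass frequency comes from, here is a minimal sketch using the standard kinematic formulas for a bearing with a stationary outer race and no slip. The function name and the bearing geometry in the example are hypothetical, chosen only to give round numbers; real values come from the bearing manufacturer's data.

```python
import math

def ball_pass_frequencies(shaft_hz, n_balls, ball_dia, pitch_dia, contact_angle_deg=0.0):
    """Return (BPFO, BPFI) in Hz, assuming a stationary outer race and no slip.

    BPFO = (n/2) * f * (1 - (d/D) * cos(angle))   outer-race ball-pass frequency
    BPFI = (n/2) * f * (1 + (d/D) * cos(angle))   inner-race ball-pass frequency
    """
    ratio = (ball_dia / pitch_dia) * math.cos(math.radians(contact_angle_deg))
    bpfo = (n_balls / 2.0) * shaft_hz * (1.0 - ratio)
    bpfi = (n_balls / 2.0) * shaft_hz * (1.0 + ratio)
    return bpfo, bpfi

# Hypothetical bearing: 8 balls, 10 mm ball diameter, 50 mm pitch diameter,
# zero contact angle, shaft turning at 25 Hz (1,500 rpm).
bpfo, bpfi = ball_pass_frequencies(shaft_hz=25.0, n_balls=8, ball_dia=10.0, pitch_dia=50.0)
print(f"BPFO = {bpfo:.1f} Hz, BPFI = {bpfi:.1f} Hz")
```

A peak near one of these frequencies in the spectrum is what tells the analyst which race is likely damaged. In practice, measured peaks sit slightly off the calculated values because of slip.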

There are four stages of bearing failure:

Stage one is normal operation within its lubrication cycle; this is managed by ultrasound. (Minor to no wear)

Stage two is the onset of failure, and will be detected as bearing fault frequencies or midrange band energy. (Sub-surface cracking reaching surface)

Stage three shows an increase in fault frequencies with sidebands and harmonics. (Surface cracking with metal loss)

Stage four is end of life, rapid failure mode. Bearing fault frequencies begin to disappear and are replaced by random broadband noise. Heat is generated.
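The four stages above can be sketched as a simple classification of what the analyst sees in the spectrum. This is an illustration only: the class and field names are hypothetical, and a real call on severity relies on the analyst's judgement and trend history, not three yes/no flags.

```python
from dataclasses import dataclass

@dataclass
class SpectrumObservation:
    """Hypothetical summary of what the analyst sees in the vibration data."""
    fault_frequencies: bool       # bearing fault frequencies or midrange band energy
    sidebands_or_harmonics: bool  # fault frequencies accompanied by sidebands/harmonics
    broadband_noise: bool         # discrete peaks giving way to random broadband noise

def bearing_stage(obs: SpectrumObservation) -> int:
    """Map an observation to stages 1-4 as described in the text."""
    if obs.broadband_noise:
        return 4  # end of life, rapid failure mode
    if obs.fault_frequencies and obs.sidebands_or_harmonics:
        return 3  # surface cracking with metal loss
    if obs.fault_frequencies:
        return 2  # onset of failure
    return 1      # normal operation

print(bearing_stage(SpectrumObservation(True, True, False)))
```

The checks run from worst to best so that a late-stage indicator always takes precedence over an earlier one.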

If this failure is on a critical asset and we are trying to survive until a window of opportunity, we need to manage the failure. It is imperative to note that it is very unlikely that bearings will follow a consistent predictable failure pattern; even if you have experienced this bearing failure before, view it as though it may fail at any moment. Let’s look at some ways of managing failures to drive informed decisions.

How do you establish severity?

  • Review the history
  • Take a thermographic image; any heat signature indicates at least stage three.
  • Draw an oil or grease sample and conduct analytical ferrography
  • Evaluate the vibration data to establish the stage of failure

Stage 2: If there is little metal in the sample and little to no heat, then increase the frequency of route-based inspection.

Stage 3: If stage three is evident, very frequent data sets must be collected; one solution that has worked well is to have a remote monitoring vibration cart set up to manage failures. Data historians can also be utilized to monitor temperature increases or online vibration.

Stage 4: A stage-four bearing can fail at any time. Stage parts for the intervention and have contingency plans to deal with cascading damage (Flushing bearing metal out of gearboxes, etc.).
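One way to picture how the technologies combine is a small lookup that takes the vibration stage, lets supporting evidence push it upward, and returns the matching response. The names below are hypothetical, the stage-one wording is an assumption (the text gives actions for stages two to four only), and treating significant metal in the sample as stage-three evidence is an inference from the stage descriptions.

```python
# Responses paraphrased from the text; the stage 1 entry is an assumption.
STAGE_ACTIONS = {
    1: "Continue normal route-based monitoring and lubrication.",
    2: "Increase the frequency of route-based inspection.",
    3: "Collect very frequent data sets (remote vibration cart, data historian).",
    4: "Stage parts for the intervention and prepare contingency plans for cascading damage.",
}

def corroborated_stage(vibration_stage, heat_signature=False, significant_metal=False):
    """Raise the vibration-based stage to at least 3 if heat or metal is found."""
    if vibration_stage not in STAGE_ACTIONS:
        raise ValueError("stage must be 1, 2, 3 or 4")
    if heat_signature or significant_metal:
        return max(vibration_stage, 3)
    return vibration_stage

def management_action(vibration_stage, heat_signature=False, significant_metal=False):
    """Return (stage, recommended action) after combining the evidence."""
    stage = corroborated_stage(vibration_stage, heat_signature, significant_metal)
    return stage, STAGE_ACTIONS[stage]

# Vibration suggests stage 2, but the thermographic image shows heat.
stage, action = management_action(2, heat_signature=True)
print(stage, action)
```

The supporting evidence can only raise the stage, never lower it, which matches the advice to treat the bearing as though it may fail at any moment.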

This logic, as applied to a bearing failure, can be considered for most things that fail. Consider applying the logic to any measurement or trendable dataset. For example, for a wear plate, stage one is just surface erosion, stage two would be a thinning of material, stage three may be the onset of holes, and at stage four it is worn through. Technologies to integrate for wear plates may include in-situ metallography, ultrasound (thickness), or thermographic imaging.

As in most maintenance, the risk management must be balanced against the cost of risk management. Most potential failures can be managed as long as the detectability is aligned with the operational objectives.
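The gearbox decision from the opening story can be framed as a rough expected-cost comparison. Only the $500,000 production loss comes from the story; the failure probability, the cost of extra wear and the cost of an in-service failure below are hypothetical placeholders, and the function names are illustrative.

```python
def expected_cost_of_running(p_failure, failure_cost, wear_cost):
    """Expected cost of running on: an in-service failure with probability
    p_failure, otherwise only the extra cascading wear."""
    return p_failure * failure_cost + (1.0 - p_failure) * wear_cost

def breakeven_probability(stop_cost, failure_cost, wear_cost):
    """Failure probability at which running on costs the same as stopping now."""
    return (stop_cost - wear_cost) / (failure_cost - wear_cost)

STOP_NOW_COST = 500_000  # lost production if the line stops immediately (from the story)
FAILURE_COST = 700_000   # hypothetical: unplanned outage plus cascading damage
WEAR_COST = 10_000       # hypothetical: extra damage from running one more week
P_FAILURE = 0.30         # hypothetical: the analyst's guess for the week

run_cost = expected_cost_of_running(P_FAILURE, FAILURE_COST, WEAR_COST)
decision = "run and monitor" if run_cost < STOP_NOW_COST else "shut down now"
print(f"Expected cost of running: ${run_cost:,.0f} -> {decision}")
print(f"Breakeven probability: {breakeven_probability(STOP_NOW_COST, FAILURE_COST, WEAR_COST):.2f}")
```

The comparison covers production cost only. The safety, environmental and intangible questions would still have to be answered before taking the chance.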

This feature appears in the November 2017 issue of Machinery and Equipment MRO.

Jeff Smith is a reliability subject matter expert and the owner of 4TG Industrial. His work spans a cross-section of industries, including oil sands, mining, pulp and paper, packaging, petrochemical, marine, brewing, transportation, synfuels and others. Reach him at

