Equipment Reliability Institute
your reliability newsletter
February 2001
Larry
George disagreed with some things Kirk Gray and I said in the Fall
2000 issue. He agreed to write an article, and here it is.
Some readers will
find value in "Machinery Metal Fatigue" by Guil Cornejo. His answers
to a question I asked him last year made so much sense that I asked
him to amplify those answers into an article.
Commencing now,
we hope each issue will carry a practical suggestion from one of
our specialists, a solution to some difficulty he has often observed
in his consulting work.
Best wishes,
Wayne
*******************************
Measure Field Reliability with
Statistics
by Larry George
Wayne Tustin kindly sends me the newsletters.
I had the impertinence to dispute a few statements made in the previous
newsletter by Gray and Tustin. Wayne invited me to expound. His
tolerance is commendable.
Gray and Tustin are right about the need for
HALT and HASS, but reliability statistics are valuable too. Statistics
convert data into actionable information, information that helps
you decide whether to do anything and what to do, to what, when,
and how much. This information can save half of your field service
costs, double profit from service, or have unexpected consequences
that companies don't disclose. Misunderstandings about what reliability
is and which data is necessary to measure it limit the value of
reliability statistics. This article describes reliability prediction
and estimation from data required by generally accepted accounting
principles.
Blanchard says reliability is "the probability
that a system or product will perform in a satisfactory manner for
a given period of time when used under specified operating conditions."
The military standard definition of reliability is, "the probability
that an item will perform a required function without failure under
stated conditions for a stated period of time."
Probability has stood the test of time as a
useful measure of survival randomness, so reliability is P[Life
> age] for ages within the useful product life, whether for hardware,
electronics, or humans. The time variable is age for most products,
whether in calendar hours or operating hours, miles, cycles, etc.
As far as customers are concerned, the only appropriate operating
conditions are field conditions, not in a laboratory. Reliability
is not MTBF.
Reliability engineers and their managers believe
they have to test to measure reliability. Have you ever said, "We
need to test at least n units for at least t hours
to verify P[R > .95] > .9"? You figure out the smallest n
and t you can possibly test (http://www.sre.org/sresoft.htm).
Then your manager says you can test only half as many units for
half the time. Typically, n and t are based on an
incorrect constant-failure-rate assumption, thereby eliminating
any chance of learning actionable information. (A constant failure
rate implies the absence of infant mortality, wearout, and the need
for maintenance.)
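For readers who have not done the n-and-t arithmetic themselves, here is a minimal sketch of the usual zero-failure ("success-run") calculation. It is not taken from the SRE software linked above; the function names and the 1000-hour mission are assumptions chosen for illustration. The second function shows how a constant-failure-rate assumption lets you trade units for hours, which is exactly the assumption questioned here.

```python
import math

def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest n such that n units tested with zero failures
    demonstrates `reliability` at the given `confidence` (binomial)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

def unit_hours_required(reliability: float, confidence: float,
                        mission_hours: float) -> float:
    """Total failure-free unit-hours needed IF the failure rate is constant.

    The largest acceptable rate is -ln(R) / mission_hours, and the test
    must satisfy exp(-rate * total_hours) <= 1 - confidence.
    """
    max_rate = -math.log(reliability) / mission_hours
    return -math.log(1.0 - confidence) / max_rate

# Verify P[R > .95] > .9, the example in the text.
n = zero_failure_sample_size(0.95, 0.90)

# Hypothetical 1000-hour mission (an assumption, not from the article).
total_hours = unit_hours_required(0.95, 0.90, mission_hours=1000.0)

print(n, round(total_hours))
```

Under the constant-rate assumption, any split of the total unit-hours between n and t counts the same, so halving both looks like a simple loss of confidence. If the failure rate actually changes with age, a short test on many units and a long test on few units measure different things.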
People believe that it is necessary to track
at least a sample by serial number from birth to death to estimate
field reliability. This data gives ages at failures and survivors'
ages, which are sufficient but not necessary. Most companies have
given up tracking parts by serial number because of errors, data
storage requirements, and failure to use actionable reliability
information. Fortunately, tracking parts by serial number is not
necessary for either field reliability prediction or estimation.
Reliability Prediction?
People make MTBF predictions, argue about them,
and compare lies. "My MTBF is bigger than yours." Most MTBFs are
predictions, seldom verified. They are predictions of averages,
not age-specific reliability. Have you seen predictions in the range
of 500,000 hours for computer hardware? That's 250 years for a computer
operated 2000 hours per year, M-F, 9-4, or more than 50 years for
continuous operation.
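The arithmetic behind those figures is easy to check. The sketch below also shows what a 500,000-hour MTBF would imply for first-year survival if, and only if, the failure rate were constant; the function names are illustrative.

```python
import math

HOURS_PER_YEAR_CONTINUOUS = 24 * 365  # 8760 hours

def mtbf_in_years(mtbf_hours: float, hours_per_year: float) -> float:
    """Express an MTBF in years of use at a given duty cycle."""
    return mtbf_hours / hours_per_year

def one_year_survival_constant_rate(mtbf_hours: float,
                                    hours_per_year: float) -> float:
    """P[survive one year] under an assumed constant failure rate."""
    return math.exp(-hours_per_year / mtbf_hours)

office_years = mtbf_in_years(500_000, 2000)
continuous_years = mtbf_in_years(500_000, HOURS_PER_YEAR_CONTINUOUS)
survival = one_year_survival_constant_rate(500_000, HOURS_PER_YEAR_CONTINUOUS)

print(office_years, round(continuous_years, 1), round(survival, 4))
```

An average of centuries says nothing about how many units fail in the first month or the first year, which is what customers actually ask about.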
To predict age-specific field reliability,
use field reliability of comparable products. Designs may change,
but other factors (process, environment, and customers) that determine
field reliability don't. The field reliability of comparable products
provides a reasonable, relative reliability prediction. Scale the
fielded products' age-specific failure rates to take changes in
MTBF predictions into account to make an age-specific reliability
prediction [George and Langfeldt].
Alternatives to Test and MTBF Prediction
It is a waste of time and credibility to track
annual failure rate (AFR) and argue about wiggles in monthly AFR
charts. AFR, annual returns divided by the installed base, is an
average and provides little actionable information, too late, and
too imprecisely. It is a waste of talent, ability, and initiative
not to use actionable information from available data.
Several clients have asked for age-specific
reliability predictions, because their customers asked. They wanted
to know the probability of being dead on arrival, the probability
of failure in the first month, first three months, six months, year,
etc. Age-specific reliability predictions provide actionable information
because, although designs change, age-specific reliability doesn't
change much: manufacturing, packaging, shipping,
installation, environment, and customers stay the same. Until there's field
experience with new products, age-specific reliability predictions
help plan warranty, service, spares production, and burn-in and
assist the designers.
It's not necessary to track products and parts
by serial number to estimate age-specific reliability. Tracking
parts by serial number requires about 1000 times as much data storage
capacity and probably incurs more than 1000 times as many errors,
compared to ships and returns data (table 1). Generally accepted
accounting principles require ships and returns data, which is sufficient
for estimating age-specific reliability. That means that your company
has sufficient data [George]. Ships and returns are population data,
so reliability estimates from them have no sample uncertainty.
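To see why ships and returns can carry age-specific information at all, consider a deliberately naive sketch. It assumes each unit is returned at most once and that returns in a month are the sum, over earlier shipment cohorts, of cohort size times the probability of failing at that cohort's age. Forward substitution then recovers the failure probabilities. This is not the estimator described in [George], and the numbers are made up (they are not the Ford data, which include repeat returns that this simple model does not handle).

```python
def expected_returns(ships, fail_prob):
    """Returns in month t = sum over cohorts s of ships[s] * fail_prob[t - s]."""
    return [
        sum(ships[s] * fail_prob[t - s]
            for s in range(t + 1) if t - s < len(fail_prob))
        for t in range(len(ships))
    ]

def estimate_fail_prob(ships, returns):
    """Recover age-specific failure probabilities by forward substitution.

    Assumes ships[0] > 0, at most one return per unit, and noise-free counts.
    """
    est = []
    for t in range(len(returns)):
        # Returns already explained by ages 0..t-1 of the later cohorts.
        explained = sum(ships[t - age] * est[age] for age in range(t))
        est.append((returns[t] - explained) / ships[0])
    return est

def reliability(fail_prob):
    """R(age) = 1 - cumulative failure probability through that age."""
    out, cumulative = [], 0.0
    for p in fail_prob:
        cumulative += p
        out.append(1.0 - cumulative)
    return out

# Hypothetical monthly shipments and "true" failure probabilities by age.
ships = [1000, 1200, 900, 1100]
true_p = [0.05, 0.03, 0.02, 0.01]

returns = expected_returns(ships, true_p)
estimated_p = estimate_fail_prob(ships, returns)
field_reliability = reliability(estimated_p)
print([round(p, 4) for p in estimated_p])
print([round(r, 4) for r in field_reliability])
```

With real counts, forward substitution can produce negative or impossible values because of randomness and repeat returns, so a practical estimator needs constraints and a statistical criterion. The sketch only shows that no serial numbers are needed to set up the problem.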
Table 1. 1988 Ford V-8 460-cubic-inch drivetrain ships and returns
Month     Shipments   Monthly returns
Aug-87          213                18
Sep-87         6439               797
Oct-87         6951              1291
Nov-87         5715              1511
Dec-87         5390              1791
Jan-88         6336              2282
Feb-88         6319              2628
Etc.           Etc.              Etc.
Figure 1 shows the field reliability estimated
from the ships and returns data in table 1. It shows two reliability
functions, one for the age at first warranty return and one for
the age between subsequent returns. The probability of a drivetrain's
being returned in the first month was more than 15% initially and
18% subsequently. The former indicates that many were defective
practically from delivery. The latter indicates that the problems
didn't get fixed. The 1988 Ford V-8-460-cubic-inch engine was the
last Ford engine with a carburetor, a very unhappy engine.

Figure 1. 1988 Ford V-8-460-cubic-inch
drivetrain field reliability
Age-specific failure rates help failure
analysis
The failure rate function shows what's happening
(see figure 2). Process defects cause infant mortality, evidenced
by an initially decreasing failure rate. Design defects cause prematurely
increasing failure rates. Other phenomena, such as warranty expiration
anticipation, preventive maintenance, and periodic inspections,
also manifest themselves.

Figure 2. Age-specific failure
rates per month and their possible causes
Engineers regard design defects as more significant
than process defects. They assume that their designs will be produced,
packaged, shipped, installed, and operated in a manner that achieves
inherent reliability. Design defects cause premature wearout, which
becomes apparent pretty early in the product life cycle, although
not as early as process defects, which cause infant mortality. Engineers
should be reassured to know that, for most products, retirement
occurs before wearout, so the failure rate function decreases with
age.
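A failure rate function like the one discussed above can be derived from a reliability function with a few lines of arithmetic. The sketch below uses a hypothetical reliability curve (not the Ford estimates) and the discrete definition h(a) = (R(a-1) - R(a)) / R(a-1); the trend labels follow the interpretation given in this section.

```python
def failure_rates(reliability):
    """Discrete age-specific failure rate from R(age), taking R before age 0 as 1."""
    rates, previous = [], 1.0
    for r in reliability:
        # Fraction of units alive at the start of the period that fail in it.
        rates.append((previous - r) / previous if previous > 0 else float("nan"))
        previous = r
    return rates

def trend(rates, tol=1e-12):
    """Classify the failure rate as decreasing, increasing, or mixed."""
    diffs = [later - earlier for earlier, later in zip(rates, rates[1:])]
    if all(d < -tol for d in diffs):
        return "decreasing (infant mortality)"
    if all(d > tol for d in diffs):
        return "increasing (wearout)"
    return "mixed"

# Hypothetical monthly reliability values, for illustration only.
R = [0.90, 0.85, 0.82, 0.80]
h = failure_rates(R)
shape = trend(h)
print([round(x, 4) for x in h], shape)
```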
Conclusion
Don't give up on statistics, even for reliability
predictions. Population statistics eliminate sample uncertainty
and help you predict, measure, and use age-specific field reliability,
without tracking parts by serial number. Which do you prefer, randomness
with uncertainty or without? Uncertainty means you're gambling without
knowing the odds.
References
- Gray, Kirk and Wayne Tustin, "Electronics
Testing into the 21st Century: Success in Test Is in Capabilities,
Not Specifications," ERI News - Reliability Newsletter, Equipment
Reliability Institute, Nov. 2000.
- George, L. L., "Field Reliability Estimation
Without Life Data," ASA, SPES Newsletter, Dec. 1999, http://web.utk.edu/~asaqp/newsletters/1299newsletter.pdf.
- George, L. L. and Eva Langfeldt, "Age-Specific
Reliability Prediction," to appear in ASQ Reliability Review,
2001.
Larry George is an ASQ Certified Reliability
Engineer. He has a Ph.D. in industrial engineering and operations
research, with a minor in statistics. He taught for 11 years, worked
for a national laboratory for 11 years, and has worked in the real
world for more than 20 years. ASQ just elected him as a Fellow.
Contact him at pstlarry@home.com,
925 447 4969, or http://members.home.net/pstlarry.
Eva Langfeldt edited this article and made
it readable. Eva has her own editorial services business, Text Support,
with clients in high-tech, publishing, and marketing. Contact her
at 925 314 9588 or eva@megapathdsl.net.
Larry offers training in field reliability
analysis and applications. It helps you estimate your products'
and service parts' field reliability from your data and use that
information to help solve your problems. Contact Wayne
or Larry for course information. Send data for free samples
of field reliability estimates and their applications.
*******************************
Machinery metal fatigue, brief
practical notes
by Guil Cornejo
Some people refer to high cycle fatigue (HCF) fracture
as representative of a fatigue induced by many stress cycles. This
could be caused by a self resonance or a near resonance of a component
such as a turbine blade. Either a low stress or high stress condition
goes together with HCF resonance or its counterpart, low cycle fatigue
(LCF).
HCF resonance fracture is usually accompanied by
characteristic "beach marks" at the point of fracture. What is implicit
here is the fact that for a structure (blade, rotor, etc.) to resonate
it must be stressed within the elastic range of the stress-strain
diagram, that is, the range in which Young's modulus of elasticity applies. This type of failure
is not usually accompanied by heat marks like blueing or metal discoloration.
Fractures sometimes show that parts rubbed together. The shiny "finger
print" shows the effect of rubbing (lapping) between the two sides of the
initiating crack. The fracture point then usually reveals
"beach marking" like what we see at the ocean beach as each wave
comes in and then leaves the shore.
Low cycle fatigue means a couple of things. Primarily,
LCF is a failure induced by stressing a component just above the
elastic (Young's modulus) range or well into the material's plastic range.
This type of failure, if purely plastic, will have a characteristic
failure pattern at the fracture point (like breaking a nail by bending
it back and forth, in about 3 to 5 cycles). Or the failure may show
signs of heat discoloration at the fracture point. The metal grain
(austenitic, gamma etc.) is usually revealed under the microscope
and it is unique, recognizable and often clearly undisturbed.
Unfortunately, there is often a combination of both
HCF and LCF at a fracture point. As the stress increases in a component,
as it fractures during HCF, the failure may accelerate from HCF
to LCF. This is where the insight of the engineer and the metallurgist
comes into play, defining the fracture. Sometimes, the failure can
be very complex; after a couple of LCF bends, the failure accelerates
into HCF. Finally, the "high" and the "low" cycle adjectives are
sometimes relative. For a rotor that is crunching rock at very low
speed, a high cycle excitation would be close to LCF frequency range.
In contrast, for a gas turbine rotor running at, say, 400 CPS (24,000
RPM), the distinction between HCF and LCF would be much easier, since
the excitation is well above the clear-cut LCF range.
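A little arithmetic shows why "high" and "low" are relative. The sketch below converts cycles per second to RPM and estimates how long it takes to accumulate a given number of stress cycles. The 10-million-cycle figure and the 30 RPM slow rotor are illustrative assumptions, not values from this article, and one stress cycle per revolution is assumed.

```python
def cps_to_rpm(cps: float) -> float:
    """Convert cycles per second to revolutions per minute."""
    return cps * 60.0

def hours_to_reach_cycles(cycles: float, cps: float) -> float:
    """Running hours needed to accumulate `cycles` at `cps` cycles per second."""
    return cycles / cps / 3600.0

turbine_rpm = cps_to_rpm(400)                      # gas turbine from the text
turbine_hours = hours_to_reach_cycles(1e7, 400)    # assumed 10 million cycles
crusher_hours = hours_to_reach_cycles(1e7, 0.5)    # hypothetical 30 RPM rotor

print(turbine_rpm, round(turbine_hours, 1), round(crusher_hours))
```

At 400 CPS the turbine accumulates ten million cycles in a single shift, while the slow rotor needs thousands of running hours to do the same, so the same cycle count means very different things for the two machines.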
Guillermo "Guil" Cornejo is president
of RPM & Predictive Engineering Inc., a California, USA, corporation.
He has 20 years of factory and worldwide field experience solving
turbomachinery/powertrain vibration problems in industrial and marine
applications. If you want to know more about Guil, visit www.equipment-reliability.com
and click Consulting then Specialists. You can e-mail
Guil at Gcornejo@equipment-reliability.com.
*******************************
Questions our readers
have asked...
You saw Guil Cornejo's answer to my question,
and below is Kirk Gray's answer to a reader's question. Now it's
your turn. What question would you like to ask one of ERI's specialists?
Send it to webmaster@equipment-reliability.com
In this issue, Kirk Gray (KG) tells us about
an email he got from a quality engineer in Israel. You will see
their initials (KG and SS) as guides at the start of each paragraph.
(SS) My name is Samuel Sela (SS) and I am
a Quality Engineer - CQE of ASQ - (Mech. Eng. -M.Sc.- in the past)
in Rafael, Israel. I found your address in the Equipment
Reliability Institute site in the Web, and I'll be very grateful
if you'll accept to present your opinion in a few controversial
issues among our people. So, for ESS of electronic boxes, containing
PCB's, for military use, which fundamental specification do you
recommend to apply?
Temperature cycling:
- Temp. range
(KG) The optimum temperature range should be determined from the
last iteration of HALT. That is the HALT that is completed after
the last improvement of the operating capabilities. Most electronics
should be able to be cycled over a temperature range of 100 C. A
larger temperature-cycling range increases the screening
strength (that is, the ability to precipitate latent defects to
patent defects) and shortens the screening time. In other words, the
wider the range of temperature cycling, the fewer cycles are
required to find the same defects, and the shorter the HASS cycle.
- Number of cycles
(KG) It all depends on the temp range and rate of change. Initially
a HASS process should be developed based on 4 to 6 thermal cycles.
Since the product in HASS is continuously monitored the "HASS
Chamber lifetime bathtub curve" should flatten out before the
last screen cycle. If very few defects are found in the last thermal
cycles, then the screen is too long and can be shortened. If a
significant number of defects are found in the last thermal cycle
then the screen is too short.
- Rate
(KG) Again, as with thermal cycles, more is better. The evidence
from a Hughes/Rome Air Development study done in the 1980s showed
that the higher the rate of change, the higher the screening strength.
Also, the higher the rate of change, the faster cycling can occur,
the faster the HASS process, and the lower the cost of HASS. In
liquid immersion ESS, electronics are subjected to rates of thermal
change of around 500 C per minute without damage to defect-free
circuit boards. A good target for the thermal change rate is around
40 - 50 degrees C per minute measured on the product, not the air temperature.
- Time in max. temperature
(KG) The benefit in thermal cycling is in inducing fatigue damage
through the expansion and contraction of all the thousands of
differing material interfaces that may have flaws in them. Therefore
when most or all of the product reaches equilibrium at the extremes
of the thermal cycling range, the next phase of the cycle can
start. The typical dwell for any temperature in a HALT/HASS chamber
(which forces the product temperature by high volume air circulation
and overdriving the air temperature) is 10 - 20 minutes. Of course
the biggest factor in the rate of change of the product is its
thermal mass and power dissipation. Power supplies with heavy
discrete analog components would take much longer to reach equilibrium
than a handheld PDA, so again each HASS should be developed based
on the product capabilities, not a "standard" process.
- Functional test
(KG) It is critical to continuously monitor the device under
test, because estimates and industry evidence suggest
that about 30% of latent defects are observable only while stress is
applied; the product functions and the defect is undetectable when the stress
is removed. An example is a break in a metal (solder or wire trace)
connection. During the cold contraction the metal break will separate
enough to make an open circuit, yet when the product reaches room
temperature the metal expands again to make contact and close
the circuit. Sometimes a thermal interlock cannot be overridden
for a HASS process. It is then acceptable to go beyond the (designed)
operating temperature, with monitoring occurring when the temperature falls
below the designed-in shutdown.
- Does the box have to be open?
(KG) In most cases it is best if the box can be opened to allow for
faster thermal changes. By helping reduce the mass, you can increase
the thermal rate of change. Many of those doing HASS use a dedicated
fixture to hold and operate the circuit cards and only use vibration
on the final assembly to verify there are no loose connectors
or fasteners. Again, each HASS should be developed based on the characteristics
of the product and the production volumes and throughput.
- PSD & frequency range
(KG) HALT and HASS chambers use multi-axis repetitive shock producing
a pseudo-random, broad frequency range without frequency control
(typical range is from 200 Hz to 10,000 Hz with most of the energy
in the range from 500 - 3000 Hz depending on the vibration table
manufacturer). There are methods that can be used to dampen out
certain frequencies, but it is better to improve the design to
eliminate sensitivities or a weakness for a certain frequency
if possible before changing the vibration spectrum. The level
of vibration intensity during HASS should be based on the Upper
Destruct Limit (UDL) found in HALT. A good rule of thumb is to
use half the UDL level found in HALT. It is very important to
verify that the level will not use a significant portion of fatigue
life through a proof-of-screen (POS) before implementing a HASS
process.
- Duration
(KG) See the explanation of number of cycles. The criteria should
be the same after some data is gathered on the first 100 - 1000
units. If defects are being found towards the end of the last
cycle, the screen may be too short; if no defects are found in
the last cycle(s), the screen might be too long. Most screens
last between 30 minutes and 2 hours, but then again it is specific
to the product. Again, whatever HASS process is developed must
use a POS to ensure you are not using up a significant portion of
fatigue life.
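The figures quoted in the answers above can be combined into a rough estimate of screen length. The sketch below is only illustrative: it assumes a simple cycle of ramp, dwell, ramp, dwell, picks values from the quoted ranges (100 C range, 50 C per minute, 10-minute dwells, 4 cycles), and applies the half-of-UDL rule of thumb to a hypothetical destruct limit. The function names are made up, and a real HASS profile must come from HALT results and a proof-of-screen.

```python
def thermal_cycle_minutes(temp_range_c: float, rate_c_per_min: float,
                          dwell_min: float) -> float:
    """One cycle = ramp up + dwell at hot + ramp down + dwell at cold."""
    ramp = temp_range_c / rate_c_per_min
    return 2 * ramp + 2 * dwell_min

def screen_minutes(cycles: int, temp_range_c: float, rate_c_per_min: float,
                   dwell_min: float) -> float:
    """Total thermal screen time for a given number of identical cycles."""
    return cycles * thermal_cycle_minutes(temp_range_c, rate_c_per_min, dwell_min)

def hass_vibration_level(upper_destruct_limit: float) -> float:
    """Rule of thumb from the text: start at half the UDL found in HALT."""
    return 0.5 * upper_destruct_limit

cycle = thermal_cycle_minutes(100, 50, 10)
total = screen_minutes(4, 100, 50, 10)
level = hass_vibration_level(40.0)  # hypothetical UDL, in the units measured in HALT

print(cycle, total, level)
```

With these choices the dwell time dominates each cycle, which is why low thermal mass and an open box matter as much as chamber ramp rate.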
(SS) Honestly, we use a few versions, originated
from various references and it's difficult to get a common agreement
for a standard specification.
(KG) The paradigm shift from traditional reliability
test processes to HALT and HASS is a significant one that makes
understanding and acceptance of these processes very difficult for
those who have been following a "cookbook" of burn-in testing. The
major shift is away from basing testing on maximum operational component
and material specifications from the supplier, or simulation of
worst case end-use environmental conditions. HALT and HASS approach
the problem with the following fundamental shift in perspective:
1) The best screen for the electronic assembly
is one based on that assembly's operating and destruct limits,
2) That electronics has more than enough life
to exceed its technological obsolescence and that rapidly using
a small portion of fatigue life is the best method for removing
the front end of the life-cycle bathtub curve or what is sometimes
called "infant mortality". Infant mortality is simply a result of
latent defects either in the design or the manufacturing process.
3) That the only "standard" in HALT and HASS
is using step-stress to find the weak links and removing or improving
them, and then developing a HASS based on destruct and operation
limits. This requires a much better understanding of the product
and the Physics-of-failure of latent defects than has been used
in the past in electronics. The understanding and acceptance of
the HALT and HASS approach still suffers from its success. Very
few companies want to let their competitors know how they are able
to reduce warranty returns 90% and reduce the time in test by 95%
(Advanced Energy Industries, Inc. has documented this gain on one
product type). The best manufacturing and test methods usually are
considered to be a competitive advantage and are not the first to
be published after an electronics company has discovered them.
Kirk Gray has over 21 years in
the electronics manufacturing industry and the last 11 years in
the application of HALT and HASS processes. Kirk is Vice Chairman
of the Denver Chapter of the IEEE Reliability Society, Chairman
of the IEEE/CPMT Technical Committee 7 on Reliability, and Registration
Chairman for the annual IEEE/CPMT Workshops on Accelerated Stress
Testing (AST) held in the fall each year. If you would like to contact
him, please send an email to gray@equipment-reliability.com.
Reliability ain't free
Would you rather invest a little up front? Or pay a great deal later?
Invest a little extra now, to give your designers
and test people the tools to do it right? Or pay dearly, later on,
in delays, in performance penalties, in excessive warranty and other
expenses?
"It" refers to the creation of reliable products.
Our products (home, office, factory, telecom,
satellite, military, other) are increasingly complex. As a society,
we are increasingly vulnerable when those products do not work reliably.
A computer failure forces airport closures. Another closes grocery
checkout lines. Another idles or mislocates an expensive satellite.
Another stops my car in traffic.
Reliability requires a small training investment
up-front. Your designers and your test people know what training
they need. Approve their training requests. Invest in them.
Participate at ERI News
You are invited to send news of reliability-oriented
events to supplement ERI's newsletter. Please send an email to the
webmaster.
Vibration and Shock courses coming up
Wayne Tustin will teach the following short courses in vibration
and shock measurement, analysis, calibration, testing, HALT, ESS
and HASS:
ERI classes
Huntsville, Alabama, February 20-22, 2001
Hillsboro (Portland), Oregon, March 20-22, 2001
Pico Rivera (LA), CA, May 16-18, 2001
Farmingdale (LI), NY, June 6-8, 2001
In addition, Wayne will present a super-concentrated 1-day version at Grand Rapids,
MI, March 27, 2001. Details are available from Vibration
Research, phone 616-669-3028 or send an email to
john@vibrationresearch.com
*******
Society of Automotive Engineers
Troy, MI
April 18-20, 2001
*******
Applied Technology Institute
College Park, MD
April 9-12, 2001 (get more information from Wayne Tustin)
Announcements
European EMC Instructor
ERI is seeking an authority
on European EMC directives to teach USA EMC practitioners. Please
e-mail tustin@equipment-reliability.com
Contact information
ERI - Equipment Reliability Institute
1520 Santa Rosa Av.
Santa Barbara - CA - 93109
Tel/Fax: (805) 564-1260
Wayne Tustin tustin@equipment-reliability.com
Webmaster webmaster@equipment-reliability.com
Website http://www.equipment-reliability.com
Free Newsletter
ERI News is sent in both html and plain text formats. If you had
any problems reading this newsletter, please let us know. Send an
email to the webmaster,
reporting your difficulties.
If you would like to subscribe
to ERI News, go to our website,
fill in the form "Free Newsletter" and hit the Submit
button.
If you do not want to receive ERI's quarterly
newsletter, please send a reply to this message with "remove"
as subject.