(*This paper was published in the IEEE Transactions on Reliability, 2001)
Patrick D.T. O'Connor
Abstract
The paper reviews the nature
of reliability in relation to the causes of failures of engineering products,
explains how most of the methods that have been developed and applied by reliability
and quality specialists have been misleading and ineffective, and makes suggestions
for the way ahead.
Keywords: failures, reliability,
quality, variation, ISO9000, engineering management
INTRODUCTION
In the final paragraph of my book "Practical
Reliability Engineering", originally published in 1981, (Reference 1) I wrote:
"It is notable that no undemocratic
state has been able to make any significant contribution to the reliability
and quality revolution. Quality represents the essence of freedom - freedom
to make decisions at work and as a customer. Centralised bureaucratic state
systems do not allow this freedom. The new techniques for controlling the reliability
of design and quality of production enable us to produce complex but reliable
products, but the techniques are dependent for their success on the motivation
that comes only with personal freedom".
Not long afterwards the democracies
won the Cold War, proving once and for all that attempts to regulate
human behaviour by dogma and coercion fail by comparison with the liberation
of human talent and motivation that is inherent in free market economies. Peter
Drucker, in 1955, (Reference 2) had explained the poverty of "scientific management",
the doctrine taught by Frederick Taylor, which had provided the basis for demarcation
at work, the alienation of workers from managers, and the growth of trades unionism.
Scientific management relegated humans at work to the status of automatons (albeit
not necessarily badly treated in a physical sense) and attributed the capability
to plan and decide only to "managers". Marxism germinated and grew on the ground
prepared by Taylorism. It has always puzzled me that Karl Marx is included in
writings on Western philosophy. Marxism was a socio-political argument, that
could be imposed and maintained only by force, not a philosophy. Now we all
appreciate how counterproductive and dangerous were these insidious ideas: they
appealed to those who sought power and who did not realise or acknowledge the
inventive and productive spirit that exists in all human intellect. Drucker
taught that all workers are managers, and that the role of higher management
is to develop and mobilise these talents for the benefit of the firm. Drucker's
"new management" liberated workers at all levels. Later Deming (Reference
3) based his teaching on quality and productivity on Drucker's ideas.
So now we all know that regulation
and control of people at work, whether imposed by the state or by management,
is discredited as a failed philosophy. We know this, but do we live accordingly?
Or do regulation and scientific management still influence our thinking? Taylor's
legacy is still manifest in a number of ways that are counterproductive in relation
to the performance of people and businesses (Reference 4). This is particularly
so in the field of quality and reliability, as I will explain later in this
article.
WHAT IS RELIABILITY?
The commonsense perception of
reliability is the absence of failures. We know that failures have many different
causes and effects, and there are also different perceptions of what kinds of
events might be classified as failures. The burning O-ring seals on the Space
Shuttle booster rockets were not classed as failures, until the ill-fated launch
of Challenger. We also know that all failures, in principle and almost always
in practice, can be prevented.
There are three kinds of engineering
product, from the perspective of failure prevention:
- Intrinsically reliable
components, which are those that have high margins between their strength
and the stresses that could cause failure, and which do not wear out within
their practicable lifetimes. Such items include nearly all electronic components
(if properly applied), nearly all mechanical non-moving components, and
all correct software.
- Intrinsically unreliable
components, which are those with low design margins or which wear out,
such as badly applied components, light bulbs, turbine blades, parts that
move in contact with others, like gears, bearings and power drive belts,
etc.
- Systems which include
many components and interfaces, like cars, dishwashers, aircraft, etc.,
so that there are many possibilities for failures to occur, particularly
across interfaces (e.g. inadequate electrical overstress protection, vibration
nodes at weak points, electromagnetic interference, software that contains
errors, etc.).
It is the task of design engineers
to ensure that all components are correctly applied, that margins are adequate
(particularly in relation to the possible extreme values of strength and stress,
which are often variable), that wearout failure modes are prevented during the
expected life (by safe life design, maintenance, etc.), and that system interfaces
cannot lead to failure (due to interactions, tolerance mismatches, etc.). Because
achieving all this on any modern engineering product is a task that challenges
the capabilities of the very best engineering teams, it is almost certain that
aspects of the initial design will fall short of the "intrinsically reliable"
criterion. Therefore we must submit the design to analyses and tests in order
to show not only that it works, but also to show up the features that might
lead to failures. When we find out what these are we must redesign and re-test,
until the final design is considered to meet the criterion.
Then the product has to be manufactured.
In principle, every one should be identical and correctly made. Of course this
is not achievable, because of the inherent variability of all manufacturing
processes, whether performed by humans or by machines. It is the task of the
manufacturing people to understand and control variation, and to implement inspections
and tests that will identify non-conforming product.
For many engineering products
the quality of operation and maintenance also influences reliability.
The essential points that arise
from this brief and obvious discussion of failures are that:
- Failures are caused primarily by people
(designers, suppliers, assemblers, users, maintainers). Therefore the achievement
of reliability is essentially a management task, to ensure that the right
people, skills, teams and other resources are applied to prevent the creation
of failures.
- Reliability (and quality) are not separate
specialist functions that can effectively ensure the prevention of failures.
They are the results of effective working by all involved.
- There is no fundamental limit to the extent
to which failures can be prevented. We can design and build for ever-increasing
reliability.
Deming explained how, in the
context of manufacturing quality, there is no point at which further improvement
leads to higher costs. This is, of course, even more powerfully true when considered
over the whole product life cycle, so that efforts to ensure that designs are
intrinsically reliable, by good design and effective development testing, generate
even higher payoffs than improvements in production quality. The "kaizen"
(continuous improvement) principle is even more effective when applied to up-front
engineering.
We observe the results of this
practical philosophy every day. Modern complex products such as microprocessors,
car and aircraft engines, spacecraft, electronic systems, etc. are extremely
and increasingly reliable and economic. Their development and production are
based upon recognition and application of the essential points listed above.
Reliability and Time
Reliability is sometimes referred
to as "quality in the time dimension", since it is determined by the
failures that do or do not occur during the life of the product. The most important
task in reliability engineering is to look forward in order to make design,
development and manufacture reliable: it is necessary to anticipate and prevent
future failures. However, the problem with failures that might happen in the
future is that we usually do not know what they might be, and they are the results
of oversights and mistakes, which we expect (or hope) will not be made. It is
usually clear what we must do in order to create the future in terms of parameters
like performance and price of the next product, but thinking ahead into the
unknowns and uncertainties of future failures is difficult. It can also be perceived
as a negative activity: project engineering is an activity based on optimism,
but failure prevention work must be based on pessimism and scepticism. (The
"parts count" approach to predicting reliability implies that we do
know what failures will occur, and how often. This myth will be dealt with later).
Therefore reliability engineering is often perceived as unproductive and
esoteric.
Of course it is also important
to look at the past and the present, in order to reduce the problems of today
and to provide lessons for the future. Failure analysis is part of that. However,
it must not be allowed to dominate the reliability effort, otherwise we will
go on reaping the same fields of weeds on new products. In fact things could
get even worse, as we stretch technology, compress timescales, and compete for
markets.
This point, that reliability engineering
is concerned with the uncertain future whilst most other engineering management
is concerned with the present or with the more predictable future, is of great
philosophical and practical importance. It is not easy for managers to think
long-term about reliability, especially when they are not engineers, and when
their motivations are geared to short-term objectives. This is why reliability
engineering is nearly always inadequately and inappropriately managed and resourced.
The management dimension is crucial, and without it reliability engineering
can degenerate into ineffective design analysis followed later by panic failure
analysis, with minimal impact on future business. Training in reliability engineering,
when given just to staff, is not very effective. Reliability philosophy and
methods should always be taught first to top management, and top management
must drive the reliability effort.
Reliability and Variation
All engineering parameters (strengths,
electrical parameters, dimensions, etc.) are variable. So are all environmental
and other operating conditions (temperatures, vibration-induced stresses, electrical
load, etc.). These variations can affect reliability whenever a single parameter
or condition is exceeded. Failures can also be caused by the interactions of
two or more variables, such as stress and strength, or of electronic component
parameters that allow a circuit to become unstable above a certain temperature.
Variation in quality and reliability
engineering is usually more complex and difficult to deal with than most "natural"
variation, because it seldom follows the conventionally-taught mathematical
form of the s-normal distribution. The s-normal pdf has values between +∞
and -∞. Of course a machined component dimension cannot vary like this.
The machine cannot add material to the component, so the dimension of the stock
(which of course will vary, but not by much) will set an upper limit. The nature
of the machining process, using gauges or other practical limiting features,
will set a lower limit. Therefore the variation of the machined dimension will
be curtailed. Only the central part might be approximately s-normal.
In fact all variables, whether naturally-occurring or resulting from engineering
or other processes, are curtailed in some way, so the s-normal distribution,
while being mathematically convenient, is actually misleading when used to make
inferences well beyond the range of actual measurements, such as the probability
of meeting an adult who is one foot tall.
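The extrapolation problem can be illustrated numerically. The sketch below evaluates the lower tail of an s-normal model far outside any plausible data. The mean and standard deviation of adult height (170 cm and 10 cm) are assumed values chosen only for illustration, and `normal_lower_tail` is a hypothetical helper name.

```python
import math

def normal_lower_tail(x, mean, sd):
    """P(X < x) for an s-normal variable, using erfc to stay accurate in the far tail."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Assumed, illustrative values (not from the paper): adult height in cm.
ASSUMED_MEAN_CM = 170.0
ASSUMED_SD_CM = 10.0
ONE_FOOT_CM = 30.48

p_one_foot = normal_lower_tail(ONE_FOOT_CM, ASSUMED_MEAN_CM, ASSUMED_SD_CM)
p_negative = normal_lower_tail(0.0, ASSUMED_MEAN_CM, ASSUMED_SD_CM)

print(f"P(height < 1 ft) under the model: {p_one_foot:.3e}")
print(f"P(height < 0) under the model:    {p_negative:.3e}")
```

Under these assumptions the model returns small but non-zero probabilities for heights that cannot occur, including negative ones. This is the sense in which the untruncated distribution misleads once it is pushed beyond the measured range.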
There might be other kinds of selection process.
For example, when electronic components such as resistors, microprocessors,
etc. are manufactured, they are all tested at the end of the production process
and are then categorised and sold according to the measured values. Typically,
resistors that fall within ±2% of the nominal resistance value are classified
as precision resistors, and those that fall outside these limits, but within
±10%, become non-precision resistors, and are sold at a lower price. Those
that fall outside ±10% are scrapped. Microprocessors are sold as, say, 166MHz,
200MHz, 400MHz, etc. devices, depending on the maximum speed at which they function
correctly on test, having all been produced on the same process. The different
maximum operating speeds are the result of the variations inherent in the process
of manufacturing millions of transistors and capacitors and their interconnections,
on each chip on each wafer. The technology sets the upper limit for the design
and the process, and the selection criteria the lower limits. Of course, the
process will also produce a proportion that will not meet other aspects of the
specification, or that will not work at all.
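The effect of selection on the resistor populations can be sketched with a small simulation. The ±2% and ±10% bands come from the text; the 1000 ohm nominal value, the 5% process standard deviation, the s-normal process output and the function name `bin_resistors` are assumptions made only for this illustration.

```python
import random

def bin_resistors(nominal, sd_fraction, n, seed=0):
    """Simulate end-of-line sorting of resistors into the bands described above."""
    rng = random.Random(seed)
    precision, non_precision, scrapped = [], [], []
    for _ in range(n):
        value = rng.gauss(nominal, sd_fraction * nominal)
        deviation = abs(value - nominal) / nominal
        if deviation <= 0.02:
            precision.append(value)       # sold as precision parts
        elif deviation <= 0.10:
            non_precision.append(value)   # sold at a lower price
        else:
            scrapped.append(value)
    return precision, non_precision, scrapped

# Assumed process: 1000 ohm nominal, standard deviation 5% of nominal.
precision, non_precision, scrapped = bin_resistors(1000.0, 0.05, 10_000)

print(len(precision), len(non_precision), len(scrapped))
# Smallest deviation from nominal among the non-precision parts, in ohms.
print(min(abs(v - 1000.0) for v in non_precision))
```

Every non-precision part lies outside the ±2% band, so that population has a gap around the nominal value and is bimodal, whatever the shape of the underlying process. A designer who assumes that the purchased parts are s-normally distributed about the nominal value would be wrong.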
The variation might be unsymmetrical, or skewed.
There are mathematical pdfs that represent such distributions, such as the
lognormal and the Weibull distributions. However, it is still important to remember
that these mathematical models will still represent only approximations to the
true variations, and the further into the tails that we apply them the greater
will be the scope for uncertainty and error.
The variation might be multi-modal rather than
unimodal as represented by distribution functions like the s-normal, lognormal
and Weibull functions. For example, a process might be centred on one value,
then an adjustment moves this nominal value. Backlash or hysteresis can also
generate bimodal outputs. A component might be subjected to a pattern of stress
cycles that varies over a range in typical applications, and to a further stress
under particular conditions, for example resonance, lightning strike, etc.
The parts of the distributions of most concern
to quality and reliability engineers are the extreme values in the "tails".
We are concerned by high stresses, high and low temperatures, slow processors,
weak components, etc. However, this is where the data is always less frequent
and more uncertain, and where conventional statistical methods are most misleading.
People like life insurance actuaries, clothes manufacturers and pure scientists
are interested in averages and standard deviations, as represented by the behaviour
of the bulk of the data. Since most of the sample data, in any situation, will
represent this behaviour, they can make credible assertions about population
parameters. However, the further we try to extend the assertions into the tails,
the less credible they become, particularly when the assertions are taken beyond
any of the data. Engineers often have only small samples to measure or test,
so that the data available on extreme values is very limited or non-existent.
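The weakness of small samples in the tails can be shown with a short simulation. The sketch below repeatedly samples a population that is known to be s-normal and estimates the lower 3σ value from each sample. The population (mean 100, standard deviation 10, arbitrary units), the sample sizes and the function name are assumptions for illustration only.

```python
import random
import statistics

def lower_3sigma_estimates(sample_size, repeats, seed=0):
    """Estimate (mean - 3*sd) from repeated samples of a known s-normal population."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        sample = [rng.gauss(100.0, 10.0) for _ in range(sample_size)]
        estimates.append(statistics.mean(sample) - 3.0 * statistics.stdev(sample))
    return estimates

# Assumed population: mean 100 and sd 10, so the true lower 3-sigma value is 70.
small = lower_3sigma_estimates(sample_size=10, repeats=200)
large = lower_3sigma_estimates(sample_size=500, repeats=200)

for label, est in (("n=10", small), ("n=500", large)):
    print(f"{label}: min {min(est):.1f}, max {max(est):.1f}, "
          f"sd of estimates {statistics.stdev(est):.2f}")
```

Even in this favourable case, where the population really is s-normal, estimates from samples of ten scatter widely around the true value. With real engineering data, where the form of the distribution in the tail is itself unknown, the uncertainty is greater still.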
Interaction effects can be difficult to predict,
to detect and to understand. Interactions involve the tails of distributions,
so the uncertainties of the effects are further increased.
Many variables can change over time. Mechanical
strength can vary over time as a result of fatigue, wear or corrosion, dielectric
strength can change over time and applied stress, etc. The relevant measure
of "time" in any application might be hours, load cycles, distance,
etc., or combinations of these. When variation changes over time the uncertainty
of the distribution tails increases disproportionately. Parameter distributions
can also vary batch to batch, supplier to supplier, etc.
Variation of engineering parameters is, to a
large extent, the result of human performance. Factors such as measurements,
calibrations, accept/reject criteria, control of processes, etc. are subject
to human capabilities, judgements, and errors. People do not behave s-normally.
These are the realities of variation
that matter in engineering, and they transcend the kind of basic statistical
theory that is generally taught and applied. Most engineering teaching covers
no more than conventional statistics, and engineers therefore tend to be uncertain
about how to deal with the realities of variation and sceptical about the application
of statistical methods. The use of conventional mathematical statistics to attempt
to understand the nature, causes and effects of variation in engineering can
be misleading.
Despite all of these reasons
why conventional statistical methods can be misleading if used to describe and
deal with variation in engineering, they are widely taught and used, and their
limitations are hardly considered. Examples are:
- Most textbooks and teaching on SPC emphasise
the use of the s-normal distribution as the basis for charting and decision-making.
They emphasise the mathematical aspects, such as probabilities of producing
parts outside arbitrary 2σ or 3σ limits, and pay little attention
to the practical aspects discussed above.
- Many contributions to the literature on statistical
process control contain "exact" calculations of values such as proportions
outside limits, based upon the unrealistic assumption of s-normality for the
processes.
- Methods of "probabilistic design"
are taught and applied, that involve precise determinations of failure probabilities
for items of variable strength subjected to variable stress. These assume
that the relevant distributions are known far into the tails, and they ignore
the practical limitations discussed above.
- Typical design rules for mechanical components
in critical stress application conditions, such as aircraft and civil engineering
structural components, require that there must be a specified factor of safety
between the maximum expected stress and the lower 3σ value of the expected
strength. This approach is really quite arbitrary, and oversimplifies the
true nature of variations such as strength and loads, as described above.
Why, for example, select 3σ? If the strength of the component were truly
normally distributed, about 0.1% of components would be weaker than the 3σ
value. If few components are made and used, the probability of one failing
would be very low. However, if many are made and used, the probability of
a failure among the larger population would increase proportionately. If the
component is used in a very critical application, such as an aircraft engine
suspension bolt, this probability might be considered too high to be tolerable.
- The so-called "six sigma"
approach to achieving high quality is based on the idea that, if any process
is controlled in such a way that only operations that exceed plus or minus
6σ of the underlying distribution will be unacceptable, then only about
one per million operations will fail. The exact quantity is based on arbitrary
and generally unrealistic assumptions about the distribution functions, as
described above. ("Six sigma" entails other features, such as the
use of a wide range of statistical and other methods to identify and reduce
variations of all kinds, and the training and deployment of specialists called
"six sigma black belts". It is not altogether a bad approach, but
it is not the best, it is based to a large extent on "scientific"
management thinking, and it is heavily hyped by consultants).
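The arithmetic behind the 3σ design rule discussed in the list above can be sketched as follows. The calculation assumes, as the rule implicitly does, that strength is truly s-normal and that items are independent; the function names and the population sizes are hypothetical choices for illustration.

```python
import math

def normal_lower_tail(z):
    """P(Z < z) for a standard s-normal variable."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def prob_at_least_one_weak(p_weak, population):
    """Probability that at least one of `population` independent items is weak."""
    return 1.0 - (1.0 - p_weak) ** population

# Fraction weaker than the lower 3-sigma strength, if strength were truly s-normal.
p_weak = normal_lower_tail(-3.0)
print(f"fraction below lower 3-sigma value: {p_weak:.5f}")

# Illustrative population sizes (assumed, not from the paper).
for population in (1, 10, 100, 1000, 10_000):
    print(population, round(prob_at_least_one_weak(p_weak, population), 4))
```

For small populations the chance of having at least one weak item grows roughly in proportion to the number made, and for large populations it approaches certainty. The result is only as good as the assumed distribution, which is the limitation described above.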
Reliability and Quality of Production
The management and achievement
of reliability cannot sensibly be divorced from production quality. It is common
experience that a large proportion of the failures that we experience are caused
by inadequate manufacture. For example, a missile system had a well-documented
reliability in military use of 90%. 90% was also the "predicted" reliability,
using the approved "models" and "data", so no one complained.
When a new production operation was started, the reliability instantly rose
to over 95%. As Deming would have pointed out, 10%, then 5%, failed because
they were built differently to those that worked. The failures almost certainly
cost more to build than the successes. By improving build quality, reliability
was increased and costs reduced. There was no fundamental reason why quality
and reliability could not have been improved even further. What was the use
of the reliability prediction?
Total quality management (TQM)
is the philosophy of design for production and control of production operations,
based upon the ideas taught by leaders such as Shewhart (Reference 5), Deming,
Ishikawa, Juran, Hutchins, Imai, and Crosby, and initially applied in Japan
in the late 1950's. In this approach, every person in the business becomes committed
to a never-ending drive to improve quality and productivity. The drive must
be led by top management, and it must be vigorously supported by intensive training,
the appropriate application of statistical methods, and motivation for all to
contribute. The total quality concept links quality to productivity. It has
been the prime mover of the Japanese industrial revolution, and it is fundamental
to the survival of any modern manufacturing business competing in world markets.
TQM is based firmly on the "new management" of Peter Drucker.
We have seen how TQM has generated
enormous gains in reliability of complex products like cars, machines and electronic
devices and systems. When production quality is managed effectively, complexity
is not the enemy of reliability as it used to be perceived.
WRONG THINKING ABOUT RELIABILITY
Reliability Prediction
It follows from the discussion
of what generates reliability that it cannot be predicted, as though it were
a "parameter" of a design, in ways that are helpful or meaningful.
To state, for example, that a design has an MTBF of X hours ignores the causes,
consequences and costs of failures, and what can be done to reduce them. By
contrast, a statement that products built to the design will weigh Y kilograms
is fully meaningful and credible. The trap of attempting to quantify reliability
was created when Kelvin wrote "when you can measure what you are speaking
about and express it in numbers, you know something about it. When you cannot
measure it, when you cannot express it in numbers, your knowledge is of a meagre
and unsatisfactory kind". Kelvin was right, but only because he was speaking
as a scientist. Engineers are applied scientists. However, Kelvin's logic does
not apply to quality and reliability, which are the results of human behaviour
and perceptions. The uncertainties inherent in human behaviour and perceptions
overwhelm any mathematical "models" of reliability.
We can predict the future only
if we know that the underlying conditions that created the past and present
conditions will be unchanged, and that we fully understand the relevant cause-and-effect
relationships. This is the case in pure and applied science, such as the voltage drop
across a resistor. However, neither of these criteria holds for reliability.
There are no forces of nature that constrain designers and others to repeat
the mistakes of the past, or prevent them from making new ones. New products
entail new technologies. The methods and "models" that have been developed and
used for "predicting" reliability, such as US MIL-HDBK-217, Bellcore, etc.,
are fraudulent and highly misleading. What was the value of the reliability
prediction of the missile mentioned above? Most industry sectors have either
never used or have stopped using such methods. Despite this, some organisations
and reliability "specialists" continue to apply them, and when the
US DOD decided to stop relying on Military Standards for nearly all procurement,
they decided to retain MIL-HDBK-217 "for guidance, until a suitable commercial
equivalent is available".
The only logically correct way
to predict the likely reliability of a new product is to base the prediction
on the management objectives and commitment, in relation to risks and uncertainties
(Reference 6).
Further examples of the futility
and error of inappropriate quantification of reliability and quality are:
- The creation of "models" for the reliability
of software, expressed as the probability of failure over time (time has no
meaning in the context of software operation) or of "error count" (errors
are created by people, and different errors can have widely different consequences.
The Ariane 5 launch control software contained one error, but that was enough
to destroy the entire vehicle and its payload).
- Extremely complex Markov models for the reliability
and availability of systems and networks, when simple empirical formulae,
or even qualitative statements, would usually be more helpful, and could actually
be understood by managers and engineers.
- An overwhelming proportion of contributions
to reliability symposia and literature consists of esoteric papers that provide
"exact" mathematical formulations that have little or no practical
value.
Reliability Testing
All engineering products must
be tested during development to ensure that the design is correct in relation
to performance, reliability, safety and other requirements. Then production
items must be tested to ensure that only good ones are shipped. The logical
and only effective approach to development testing for reliability (including
durability and safety) is to generate failures as quickly and economically as
practicable, so that product and process design weaknesses are discovered and
corrected. This in turn implies that the stresses should be as high as can be
applied, within the limits of the technology. For example, there is no point
in testing an electronics assembly at temperatures exceeding the solder melting
point. The same logic applies to production testing, with the proviso that the
stresses applied must not damage good items, but only cause weak ones to fail.
This philosophy of test has been
applied, for example in structural and fatigue testing and in environmental
stress screening (ESS) of electronic assemblies. However, it is only recently
that the logic has been fully applied, in the methods of highly accelerated
life testing (HALT) and highly accelerated stress screening (HASS) developed
by Hobbs (Reference 7).
It is an intriguing fact that
the subject of testing is largely untaught on engineering degree courses, and
that there are no books that cover the subject in terms of philosophy, physics,
technologies, methods, economics and management. This probably explains why
so much testing in industry, particularly in relation to reliability, is based
upon inappropriate thinking and blind adherence to standards and traditions.
Recently I visited a company in the advanced communications sector, which
was conducting long-term tests of multiple samples of expensive new production
in a large environmental chamber. They explained that they were doing it "to
measure the reliability". I asked them if they had actual in-service reliability
data, and they showed it to me: they already knew the reliability being achieved.
I asked them if they were finding any failures on test that were different to
those in service. They said no. The test was not a requirement of any of their
customers. I explained that they were performing a very expensive test, and
delaying shipments, to obtain zero information or improvement. Months later
they were still doing the test, because "it was in their procedures". I have
come across numerous other examples of sub-optimal testing, in a wide range
of industries.
Methods for "demonstrating" reliability,
such as probability ratio sequential testing (PRST), the basis of US MIL-STD-781,
are misleading because they imply that all failures have the same "value",
that causes of detected failures will not be removed, that no new failure causes
will arise, and that the pattern of failures over time is constant. In nearly
every case these implications are false. Products are tested to "measure"
reliability, when the proper objective of development testing should be to find
opportunities for improving reliability, by forcing failures using accelerated
stresses.
My forthcoming book (Reference
8) is intended to fill the need for a multidisciplinary book on testing in engineering.
Reliability Teaching and Literature
Despite the facts described above,
nearly all reliability training at universities is provided by departments of
mathematics or statistics. The reliability literature is overwhelmingly mathematical,
in journals and at conferences.
International Standards for Reliability
ISO/IEC60300 is the international
standard for "dependability", a term that is supposed to include reliability,
maintainability, availability and safety ("RAMS"). Every aspect of
this family of documents reflects the kind of over-emphasis on inappropriate
quantitative methods described above. The people who make up the drafting committee
(IEC TC 56) and its various working groups seem to be unaware of the criticisms
of these ideas, or of the fact that the companies which lead the world in reliability
do not apply them.
ISO/IEC61508 is a recently released
international standard for assurance of safety of systems that include electronics
and software. This standard also demands the application of a wide range of
inappropriate and controversial methods, including requirements for "independent"
analyses of designs. Few companies involved in the creation of safety-related
hardware and software seem to be aware of the new requirements, which have been
written by "experts" who seem to have been divorced from the practical
engineering and management realities.
ISO9000 AND MANAGEMENT OF QUALITY
The international standard for
quality systems, ISO9000, has been developed to provide a framework for assessing
the management system which an organisation operates in relation to the quality
of the goods or services provided. The concept was developed from the US Military
Standard for quality, MIL-Q-9858, which was introduced in the 1950's as a means
of assuring the quality of products built for the US military services. In the
ISO9000 approach, suppliers' quality management systems (organisation, procedures,
etc.) are audited by independent assessors, who assess compliance with the standard,
and issue certificates of registration. Today many organisations and companies
rely on ISO9000 registration to provide assurance of the quality of products
and services they buy and to indicate quality of their products and services.
The major difference between
ISO9000 and its defence-related predecessors is not in its content, but in the
way that it is applied. The suppliers of defence equipment were assessed against
the standards by their customers. By contrast, the ISO9000 approach relies on
"third party" assessment. Certain organisations, such as the US Underwriters
Laboratories (UL), the British Standards Institution (BSI), Lloyd's Register,
and several others, are "accredited" by the appropriate national accreditation
services, which entitles them to assess companies and other organisations. The
justification given for third party assessment is that it removes the need for
every customer to perform his own assessment of all of his suppliers. However,
the total quality philosophy demands close partnership between supplier and
purchaser. A matter as important as quality cannot safely be left to be assessed
spasmodically by third parties, who are unlikely to have the appropriate specialist
knowledge, and who cannot be members of the joint supplier-purchaser team.
The other main difference is
that ISO9000 is applied to every kind of product and service, and by every kind
of purchasing organisation. Today, schools and colleges, consultancy practices,
local government departments, and window cleaners, in addition to large companies
in every industrial sector, are being forced by their customers to become registered
or are deciding that registration is necessary for future business success.
Some major industry sectors, notably the "big 3" US automotive companies,
and some US telecommunications and aerospace companies have developed industry-specific
variants (QS9000, TL9000, AS9000). It will be interesting to see how QS9000
influences the competitive position of the American automakers. Their Japanese
competitors have not followed this approach.
ISO9000 does not specifically
address the quality of products and services. It describes, in very general
and rather vague terms, the "system" that should be in place to assure
quality. In principle, there is nothing in the standard to prevent an organisation
from producing poor quality goods or services, so long as written procedures
exist and are followed. Obviously an organisation with an effective quality
system would normally be more likely to take corrective action and to improve
processes and service, than would one which is disorganised. However, the fact
of registration cannot be taken as assurance of quality. It is often stated
that registered organisations can, and sometimes do, produce "well-documented
rubbish". An alarming number of purchasing and quality managers, in industry
and in the public sector, seem to be unaware of this fundamental limitation
of the standard.
The effort and expense that must
be expended to obtain and maintain registration tend to engender the attitude
that optimal standards of quality have been achieved. The publicity that typically
goes with initial certification of a business supports this belief. The objectives
of the organisation, and particularly of the staff directly involved in obtaining
and maintaining registration, are directed at the maintenance of procedures
and at audits to ensure that staff work to them. It becomes more important to
work to procedures than to develop better ways of working.
Since its inception, ISO9000
has generated considerable controversy. Some companies and individuals question
the value of the exercise: they do not see how the expensive process of preparing
documentation and undergoing registration improves the quality of their products
and services, or how the benefits can justify the high costs of compliance.
The evidence is, however, variable.
Some organisations have generated real improvements as a result of registration,
and some consultants and registration bodies do provide good service in quality
improvement.
The leading teachers of quality
management all argue against the "systems" approach to quality, and the world's
leading companies do not rely on it. So why is the approach so widely used?
The answer is partly cultural and partly coercion.
The cultural pressure derives
from the tendency to believe that people perform better when told what to do,
rather than when they are given freedom and the necessary skills and motivation
to determine the best ways to perform their work. This belief stems from the
concept of scientific management, as described earlier.
The coercion to apply the standard
comes from several directions. For example, the UK Treasury guidelines to public
purchasing bodies state that they should "consider carefully registered suppliers
in preference to non-registered ones". In practice, many agencies simply exclude
non-registered suppliers, or demand that bidders for contracts must be registered.
All contractors and their subcontractors supplying the UK Ministry of Defence
must be registered, since the MoD decided to drop its own assessments in favour
of the third party approach, and the US Defense Department has recently decided
to apply ISO9000 in place of MIL-Q-9858. Several large companies, as well
as public utilities, demand that their suppliers are registered. The European
Community CE Mark regulations encourage ISO9000 registration.
Other malign effects of ISO9000
include the development and growth of a substantial industry of agencies, registration
bodies and consultants, parasitic on productive industry. In the UK the annual
direct costs of registration exceed $150 million and are growing rapidly. The
"quality" literature, as represented by the journals of the major professional
societies for the discipline, has become overwhelmingly devoted to ISO9000 and
similar standards. Articles, training courses, etc. on traditional quality control
and improvement activities such as measurement, SPC, etc. have almost disappeared.
There is more advertising for ISO9000 services and training in these journals
than for all other services and products combined.
Defenders of ISO9000 say that
the total quality approach is too severe for most organisations, and that ISO9000
can provide a "foundation" for a subsequent total quality effort. However, the
foremost teachers of modern quality management all argue against this view.
(It is notable that none of these serve on the national or international committees
that prepare and "update" the standard). They point out that any organisation
can adopt the total quality philosophy, and that it will lead to far greater
benefits than will registration to the standard, and at much lower costs. The
ISO9000 approach seeks to "standardise" methods which directly contradict
the essential lessons of the modern quality and productivity revolution, as
well as those of the new management.
It is notable that ISO9000 is
very little used in Japan, and then mainly by companies which perceive that
it will provide advantages in Western markets, not because they believe that
it will lead to improvements in quality. Companies that embrace TQM set standards
for product and service quality, internally and from their suppliers, far in
excess of the requirements of ISO9000. These are aimed at the actual quality
achievements of the products and services, and at continuous improvement in
these levels. Much less emphasis is placed on the "system".
The recent changes to ISO9000
("ISO9000/2000") do not deal with these fundamental criticisms. They
will lead to higher costs and greater controversy. Quality and reliability of
products and services will not be assured or improved.
WHERE ARE THE RELIABILITY HEROES?
Names like Shewhart, Juran, Deming,
Ishikawa and Crosby are recognised worldwide as contributors to quality philosophy
and management. Their reputations and influence extend far beyond narrow perceptions
of "quality" to the highest levels of industry and management. They
all emphasised and taught practical, realistic, effective approaches. It is
interesting by contrast that no similar "heroes" of reliability have
emerged over the years since the discipline has been in existence. There have
of course been notable contributors in specific areas, such as Shainin and Taguchi
in test design and analysis, Weibull, Nelson and Crow in data analysis, Hobbs
in accelerated test (mentioned earlier), and others in areas such as failure
physics, etc. However, no name is associated with teaching and applying the
wider philosophy of excellence, as taught by the "quality" heroes,
to the upstream engineering activities of design and development, and to the
higher levels of management.
We know, as Deming taught and
as has been widely demonstrated, that continuous improvement in manufacturing
quality ("kaizen") leads to continuous gains in productivity
and competitiveness. The potential gains from kaizen in engineering design
and development are, in most cases, even greater. Some companies recognise this,
but most still apply, in varying degrees, the inappropriate and sub-optimal
approaches to design and development for reliability that have been described
above.
Reliability needs a hero to lift
the discipline out of its over-reliance on the ideas and methods that have misled
and detracted from practical achievement, and which have resulted in justified
scepticism and distrust from the wider engineering profession. The future must
be based upon effective engineering and management applied to the whole product
cycle, in addition to application of the reliability and quality engineering
techniques that are effective. Therefore the hero must have a reputation that
goes beyond the reliability "profession".
THE WAY AHEAD
- Hard as it may seem, the reliability and quality
"professions" must accept that they have, in the ways discussed
above, misled engineers and managers about how product reliability should
be managed and achieved. We must realise and teach that reliability is achieved
by excellent engineering, in the widest sense, with the objective of minimising
all causes of failures. "Scientific" management of quality and reliability
must be replaced by the methods that are consistent with Drucker's new management.
- Reliability must be managed as an integral
aspect of total product design, development/test and manufacture. Since most
modern engineering systems buy high proportions (typically 70% - 80%) of their
failures from their sub-system and component suppliers, this integrated team
approach must be extended to all key suppliers. Relying on concepts like ISO9000
provides almost no assurance in this respect.
- In the integrated engineering approach to
new product design, test and manufacture we must ensure that, as far as practicable,
all variations that can affect performance, yield, reliability, durability
and costs are identified, understood and controlled. At the same time we must
also appreciate and teach the extent to which traditional mathematical/statistical
methods can misrepresent the true nature of variations and interactions in
engineering.
- We must eliminate the over-reliance on the
"numbers game" in reliability, in relation to prediction, modelling and measurement.
Contributions to symposia and journals should be subjected to tests of practical
reality and applicability.
- Reliability testing should be taught, applied
and managed as an activity to stimulate failures as quickly and economically
as practicable, rather than to generate statistics. This can be performed
only through the application of accelerated combined stresses, not by using
"typical" stresses. The methods of highly accelerated life testing (HALT)
and highly accelerated stress screens (HASS) should be applied (References
7,8).
- Reliability must be taught as an integral
part of all engineering curricula, by engineering teachers, not by mathematicians.
The curriculum should include manufacturing quality aspects and maintenance,
and practical understanding of variation. The ASQ curricula for reliability
and quality engineering already exist to provide the framework for this approach.
Courses that do not cover the ASQ curricula should not be accredited.
- We must stop the development and use of standards
for reliability and quality, such as ISO/IEC60300, ISO/IEC61508 and ISO9000.
- The main professional societies for engineering
and quality should combine to eliminate the inappropriate ideas and methods
that have been developed, taught and applied, and to force through the adoption
of the practical, relevant methods. THEY ARE THE METHODS THAT ACTUALLY
WORK.
References:
- P.D.T. O'Connor: Practical Reliability Engineering,
John Wiley and Sons Ltd. (3rd. edition, 1995).
- P.F. Drucker: The Practice of Management,
Heinemann (1955).
- W.E. Deming: Out of the Crisis, MIT
Press (1981).
- P.D.T. O'Connor: The Practice of Engineering
Management, John Wiley and Sons Ltd. (1985).
- W.A. Shewhart: Economic Control of Quality of Manufactured
Product, Van Nostrand (1931).
- P.D.T. O'Connor: Quantifying Uncertainty in
Reliability and Safety Studies, Microelectronics and Reliability Vol. 35 Nos.
9-10 pp. 1347-1356, (1995).
- G. Hobbs: Accelerated Reliability Engineering:
HALT and HASS, John Wiley and Sons Ltd. (1999).
- P.D.T. O'Connor: Test Engineering (to be published
by John Wiley and Sons Ltd., 2001).