
Posted on 2007-3-10 22:08:04


Tags: hard disk, MTBF, failure, fault

Everything you know about disks is wrong

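From February 13 to 16, 2007, at USENIX FAST '07, the Conference on File and Storage Technologies held in San Jose, California, several papers questioned the MTBF figures supplied by hard drive manufacturers. The two most influential were Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? by Bianca Schroeder and Garth Gibson of Carnegie Mellon University, and Failure Trends in a Large Disk Drive Population by a Google research team. Both papers report the same finding: real-world drive failure rates are several times higher than the figures manufacturers give, and the manufacturers' MTBF specifications do not stand the test of time.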

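As a popular-science writer, I thought this was excellent material. I therefore worked with Ms. Dong Xin of the University of Malaysia to write the article 《我们被硬盘制造商耍了》 ("We Were Played by the Hard Drive Manufacturers"), which analyzes the issue in depth.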

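The provocative blog post below was the first thing on this topic to catch my attention, and it was what inspired me to write about it, so I reproduce it here in full as a thank-you to Mr. Robin Harris.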

 

Everything you know about disks is wrong

By Robin Harris on Wed, 02/21/2007

Filed under: Storage, Robin Harris's blog

 

Two bombshell papers released at the Usenix FAST '07 (File And Storage Technology) conference this week bring a welcome dose of reality to the basic building block of storage: the disk drive.

 

Together the two papers are 29 pages of dense computer science with lots of info on populations, statistical analysis, and related arcana. I recommend both papers. The following summary and two longer analyses at StorageMojo cover what I found interesting.

 

The first conference paper, from researchers at Google, Failure Trends in a Large Disk Drive Population (pdf) looks at a 100,000-drive population of Google PATA and SATA drives. Remember that these drives are in professionally managed, Class A data centers, and once powered on, are almost never powered down. So conditions should be nearly ideal for maximum drive life.

 

The most interesting results came in five areas:

· The validity of manufacturer's MTBF specs

· The usefulness of SMART statistics

· Workload and drive life

· Age and drive failure

· Temperature and drive failure

 

MTBF

 

Google found that Annual Failure Rates were quite a bit higher than vendor MTBF specs suggest. For a 300,000-hour MTBF, one would expect an AFR of 1.46%, but the best the Googlers observed was 1.7% in the first year, rising to over 8.6% in the third year.
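
As a rough check on how an MTBF figure translates into an AFR, here is a minimal sketch. It assumes a drive that is powered on all year (8,760 hours) and a constant failure rate, so that AFR is roughly 8760 / MTBF; the exponential form is shown alongside for comparison. The function names are illustrative and do not come from either paper. Under these assumptions a 300,000-hour MTBF works out to roughly 2.9%, while 1.46% corresponds to about 600,000 hours, so the 1.46% figure quoted above presumably rests on a different MTBF or a different conversion.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: the drive is powered on 24x7


def afr_linear(mtbf_hours: float) -> float:
    """Approximate annual failure rate as power-on hours per year / MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours: float) -> float:
    """AFR under a constant-failure-rate (exponential lifetime) assumption."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Invert the linear approximation: the MTBF implied by a given AFR."""
    return HOURS_PER_YEAR / afr


if __name__ == "__main__":
    for mtbf in (300_000, 600_000, 1_000_000):
        print(f"MTBF {mtbf:>9,} h -> AFR {afr_linear(mtbf):.2%} (linear), "
              f"{afr_exponential(mtbf):.2%} (exponential)")
```

For MTBF values this large the two forms agree to within a few hundredths of a percentage point, which is why the simple division is usually good enough for datasheet comparisons.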

 

SMART: not very smart

 

SMART (Self-Monitoring, Analysis, and Reporting Technology) is supposed to capture drive error data to predict failure. The authors found that several SMART errors were strong predictors of ensuing failure:

· scan errors

· reallocation count

· offline reallocation

· probational count

 

For example, after the first scan error, they found a drive was 39 times more likely to fail in the next 60 days. The other three correlations are less striking, but still significant. The problem: even these four predictors miss over 50% of drive failures. If you get one of these errors, replace your drive, but not getting one doesn't mean you are safe. SMART is simply not reliable.
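
A signal can be a strong predictor and still miss most failures, because the two properties are measured differently: relative risk compares failure rates between flagged and unflagged drives, while recall asks what fraction of all failed drives were flagged at all. The sketch below shows the distinction. The counts are hypothetical, invented purely for illustration, and are not taken from the Google paper.

```python
def relative_risk(flag_fail: int, flag_ok: int,
                  noflag_fail: int, noflag_ok: int) -> float:
    """Failure rate among flagged drives divided by the rate among unflagged drives."""
    rate_flagged = flag_fail / (flag_fail + flag_ok)
    rate_unflagged = noflag_fail / (noflag_fail + noflag_ok)
    return rate_flagged / rate_unflagged


def recall(flag_fail: int, noflag_fail: int) -> float:
    """Fraction of all failed drives that showed the signal beforehand."""
    return flag_fail / (flag_fail + noflag_fail)


if __name__ == "__main__":
    # Hypothetical counts for a 10,000-drive fleet (illustration only).
    flag_fail, flag_ok = 40, 60          # drives that raised the signal
    noflag_fail, noflag_ok = 60, 9840    # drives that never raised it
    rr = relative_risk(flag_fail, flag_ok, noflag_fail, noflag_ok)
    rc = recall(flag_fail, noflag_fail)
    print(f"relative risk: {rr:.0f}x, recall: {rc:.0%}")
```

With these made-up numbers a flagged drive is far more likely to fail than an unflagged one, yet more than half of the failed drives never raised the flag.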

 

Workload and drive life

 

Defining workload isn't easy, but the good news is that the Googlers didn't find much of a correlation.

 

After the first year, the AFR of high utilization drives is at most moderately higher than that of low utilization drives. The three-year group in fact appears to have the opposite of the expected behavior, with low utilization drives having slightly higher failure rates than high utilization ones.

 

They did find infant mortality was higher among high-utilization drives. So burn those babies in!

 

Age and drive failure

 

The authors note that their data doesn't really answer this question due to the mix of drive types and vendors. Nonetheless their drive population does show AFR increases with age.

 

Hot drives = dead drives?

 

 Possibly the biggest surprise in the Google study is that failure rates do not increase when the average temperature increases. At very high temperatures there is a negative effect, but even that is slight. This might mean cooling costs could be significantly reduced at data centers.

 

Beyond Google

Google's paper wasn't the only cool storage paper, or even the best. The paper by Bianca Schroeder and Garth Gibson of CMU's Parallel Data Lab, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, won a "Best Paper" award.

 

They looked at 100,000 drives, including drives in HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as at several unnamed Internet service providers. The drives had different workloads, different definitions of "failure" and different levels of data collection, so the data isn't quite as smooth or complete as Google's. Yet it probably looks more like a typical enterprise data center, IMHO. They also included "enterprise" drives in their sample.

 

Key observations from the CMU paper:

High-end "enterprise" drives versus "consumer" drives?

 

. . . we observe little difference in replacement rates between SCSI, FC and SATA drives, . . . .

So how much of that 1,000,000 hour MTBF are you actually getting?

 

Infant mortality?

 

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

 

The infant mortality effect is slightly different from what Google reported. Both agree on the more important issue of early wear-out.

Vendor MTBF reliability?

 

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs [Annual Replacement Rate] range from 0.5% to as high as 13.5%. . . . up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

 

Actual MTBFs?

 

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

 

In other words, that 1 million hour MTBF is really about 300,000 hours - about what consumer drives are spec'd at.
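
The arithmetic behind that estimate can be sketched as follows. It assumes the simple relation rate = 8760 / MTTF for a drive running around the clock and treats the replacement rate as if it were a failure rate, which is a simplification. The function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: 24x7 operation


def effective_mttf(datasheet_mttf_hours: float, observed_ratio: float) -> float:
    """MTTF implied when the observed annual rate is observed_ratio times
    the datasheet rate, with the rate taken as HOURS_PER_YEAR / MTTF."""
    datasheet_rate = HOURS_PER_YEAR / datasheet_mttf_hours
    observed_rate = observed_ratio * datasheet_rate
    return HOURS_PER_YEAR / observed_rate


if __name__ == "__main__":
    mttf = effective_mttf(1_000_000, 3.4)
    print(f"effective MTTF: about {mttf:,.0f} hours")
```

Dividing 1,000,000 hours by 3.4 gives roughly 294,000 hours, which is where the "about 300,000 hours" figure comes from.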

 

Drive reliability after burn-in?

 

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives are mechanical devices and wear out like machines do, not like electronics.

 

Data safety under RAID 5?

 

The assumption of data safety behind RAID 5 is that drive failures are independent so that the likelihood of two drive failures in a single RAID 5 LUN is vanishingly low. The authors found that this assumption is incorrect.

 

. . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .

In fact, they found that a disk replacement made another disk replacement much more likely.
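
To see what the independence assumption implies, here is a minimal sketch of the baseline that the real data contradicts. It assumes independent, exponentially distributed drive lifetimes; the array size, the time window, and the MTTF values are hypothetical parameters chosen for illustration.

```python
import math


def p_second_failure(n_drives: int, window_hours: float, mttf_hours: float) -> float:
    """Probability that at least one of the remaining n_drives - 1 drives fails
    within window_hours of a first failure, assuming independent exponential
    lifetimes with the given MTTF."""
    surviving = n_drives - 1
    return 1.0 - math.exp(-surviving * window_hours / mttf_hours)


if __name__ == "__main__":
    # Hypothetical 8-drive RAID 5 group with a 24-hour rebuild window.
    for mttf in (1_000_000, 300_000):
        p = p_second_failure(n_drives=8, window_hours=24, mttf_hours=mttf)
        print(f"MTTF {mttf:>9,} h -> P(second failure during rebuild) = {p:.4%}")
```

Under independence the probability is tiny, which is why RAID 5 looks safe on paper. Correlated failures and a lower real-world MTTF both push the true figure upward.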

 

Independence of drive failures in an array?

 

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
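
A decreasing hazard rate can be illustrated with a Weibull distribution whose shape parameter is below 1. In the sketch below the shape and scale values are assumptions chosen for illustration, not values fitted in the paper. A shape of exactly 1 reduces to the constant hazard of the exponential model.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    shape, scale = 0.7, 1000.0  # illustrative values only
    for hours in (1, 10, 100, 1000):
        h = weibull_hazard(hours, shape, scale)
        print(f"{hours:>5} h since last replacement -> hazard {h:.6f} per hour")
```

With a shape below 1 the hazard falls as time passes, so the risk of another replacement is highest right after one has just happened.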

 

Let the dialogue begin!

 

 The importance of these papers is that they present real-world results from large drive populations. Vendors have kept drive-reliability data to themselves for what now seem obvious reasons: they've been inflating their numbers. With good field numbers coming out, smart storage and systems folks can start designing for the real world. It's about time.

System category: Resource sharing   |   User category: Fault diagnosis   |   Source: Repost


  • yugd

    2007-4-15 12:43:46

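    I think Google's paper Failure Trends in a Large Disk Drive Population is a rather poor one. In the months since it was published I have been trying to work out what it is saying and what conclusions it reaches, and I still cannot fully grasp everything the paper means.

    First, the paper's writing style is very poor. Two words recur throughout and leave a strong impression: "population" and "surprisingly". The former keeps stressing that Google has a great many hard drives, far more than anyone else; the latter keeps stressing that the authors have reached astonishing conclusions that overturn all existing theory on the subject.

    But does the paper give us the data we want, or the data the authors meant to present? Across the whole paper, the authors give only the relationships between failure rate and drive age, temperature, and some SMART attributes.
    Here are my questions about these data:

    First, in Figures 2, 3, 5, and 9 of the paper, how were annualized failure rates computed for drives that are only 3 or 6 months old?

    Second, wherever the data relate to drive models, the authors state that the data are proprietary and decline to disclose them.

    As for the temperature data, the temperature range in Figure 4 is very wide, from 15°C to about 52°C, a span of 37°C. Is Google using open-air machine rooms rather than dedicated, professionally maintained rooms with constant temperature and humidity? Why would running drives show such a large temperature spread?

    The paper's test methodology is also problematic. For example, the experiments on the SMART power-cycle count and vibration attributes were never completed.

    Despite these flaws, the authors still reach "surprising" conclusions, claiming that (1) high temperature and high utilization have no significant effect on drive health, and (2) no predictive model can be built from SMART.

    In fact, regarding (1), an upper limit of 52°C is still within the range in which a hard drive is designed to work reliably. The whole span from 30°C to 50°C lies within that design range, so it naturally has no great effect on the failure rate. Note that the temperature here is the drive's own operating temperature as reported by SMART, not the ambient temperature. Converted to ambient temperature, it is roughly 10°C to 30°C, which we would all regard as a very comfortable range. As for the range with the higher failure rate, 15°C to 25°C (-5°C to 5°C when converted to ambient temperature), this is clearly a rather cold environment. Many electronic components cannot work properly at such temperatures, and fluid-bearing motors cannot run properly either because the lubricant has not yet melted. So conclusion 1, that "high temperature has no significant effect on drive health", is wrong. By high temperature we should mean 60°C to 70°C or more, not ordinary temperatures such as 30°C to 50°C.

    Second, on building a predictive model from SMART: has any drive manufacturer ever claimed that SMART can accurately predict drive failure? SMART merely exposes how a drive's parameters change during operation, so as to minimize the data loss caused by sudden drive failure. How to act on those parameter changes is the responsibility of the higher layers: the operating system and the system administrator. SMART obviously cannot predict every drive failure, for example damage caused by an earthquake or a fire. So building an accurate failure-prediction model from SMART seems impossible. But perhaps the people at Google are clever enough to develop such a system eventually.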

  • avan

    2007-4-15 14:53:45

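    Thank you, yugd, for your comment. Your rigorous scholarly attitude is something I should learn from. In response to your points, here are my own views.

    (1) Each of us speaks and acts in a different style; some people are low-key, others more flamboyant. Such differences in attitude naturally show up in writing. From a scholarly point of view, a paper with a bit of personality is not a big problem. As readers, we should draw knowledge and insight from a paper rather than hold others to our own standards. Nitpicking and carping serve no purpose.

    (2) I have not read Google's paper closely, and our article cites only one figure from it, the annual drive replacement rate, as supporting evidence. We did not cite the Google team's conclusion that "temperature has little effect on drive life". You say that the temperature inside the drive and the ambient temperature differ by 20 degrees; I do not know where this conversion comes from. I believe it will not always hold: just after startup, for example, the inside and outside temperatures may be very close, and when the temperature inside the chassis is high the two may also be closer together.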

  • riple

    2007-3-15 13:19:03

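    Thanks, avan. Following the lead you gave, I found the FAST web page and saw a lot of the latest research material on file systems and storage, which has been a huge help for my work.

    Thank you very much!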

  • helixapp

    2008-4-16 17:55:42

    The article 《我们被硬盘制造商耍了》 is nowhere to be found?

  • avan

    2008-4-16 22:30:33

    The article 《我们被硬盘制造商耍了》 was published in 《微型计算机》 (MicroComputer), 2007, issue 4 (second half-month edition), by Dong Xin.