Category Archives: NAS

Consumer hard disks: a large-scale reliability study

Backblaze, an online backup service, has published statistics about the reliability of consumer hard disk drives (HDDs) in its enterprise environment:

Over the last couple of years, they have deployed various models of Seagate, Western Digital, and Hitachi drives at large scale (thousands of individual disks) in their storage pools, enabling them to collect conclusive statistical data, such as the “annual failure rate” (the number of failures divided by the cumulative drive years of service). An impressive visualization is the 3-year survival rate:

Backblaze blog: disk survival rate by vendor

Backblaze data: disk survival rate by vendor

Note: for a typical consumer, a Seagate drive should (on average) last longer than the graph above indicates, since Backblaze puts its drives under heavier load than normal users do (Backblaze certainly performs significantly more write operations on the disks, and a normal user does not spin the drive 24/7). While there is no doubt that this study identified Seagate drives as significantly less reliable than their WD and Hitachi counterparts, Backblaze still likes Seagate drives, because they usually are quite cheap.

You might be wondering why exactly Backblaze runs its service on consumer hard disk drives instead of enterprise-class disks. It is all about money, specifically the value-for-money ratio. Backblaze realized that the relative gain in reliability and warranty of enterprise-class disks over consumer HDDs is not as large as the relative price increase. On a small scale, enterprise disks might be worth it, because one has less trouble with failing hardware. At large scale, however, Backblaze concluded that it is cheaper to operate on consumer HDDs. Obviously, Backblaze has to make sure (and I assume they do) that data is stored with sufficient redundancy, so that the overall likelihood of data loss stays at the same level (almost zero) as with enterprise disks. Backblaze also has to adjust its infrastructure and internal processes/guidelines to the “high” rate of disk failures. With 30,000 disks operating simultaneously and an annual failure rate of 5 % (which is realistic, according to their data), about 4 disks have to be exchanged per day (30000 * 0.05 / 365).
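The back-of-the-envelope calculation above can be written down in a few lines of Python; the fleet size and failure rate are the rough figures from the text, not exact Backblaze numbers:

```python
# Expected number of drive replacements per day for a large fleet,
# given an annual failure rate (AFR). The figures are the rough
# values used in the text, not exact Backblaze numbers.

fleet_size = 30_000  # disks in simultaneous operation
afr = 0.05           # 5 % annual failure rate

failures_per_year = fleet_size * afr
replacements_per_day = failures_per_year / 365

print(f"{failures_per_year:.0f} failures/year, "
      f"about {replacements_per_day:.1f} replacements/day")
```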

For my NAS, I have recently bought Western Digital’s 3 TB disks from the Red line, which are energy-saving consumer disks ‘certified’ for 24/7 operation — a very good choice, according to technical reviews and Backblaze’s data. You might want to look into these Red line models yourself; they come at a very good price. Until now, I had assumed that the differences in HDD quality on the consumer market are small. Backblaze’s data disproves this assumption, and although Seagate models surely are fine for many types of applications, I will not blindly buy Seagate anymore. It is impressive to see certain models being much more reliable than others (on average) under heavy-load conditions.

I am looking forward to seeing more conclusive data from Backblaze over the next couple of years. This kind of data is usually kept secret within the manufacturers, and normal tech reviews simply cannot provide it for lack of statistics.

(Free)NAS: simple auto-shutdown revisited

There are several blog posts and forum entries out there about automatically shutting down a NAS machine depending on changes in its environment. The solutions I have seen are either too simplistic or needlessly complex.

The use case requires a simple solution that behaves predictably and remains solid and stable under real-world conditions. The shutdown condition I want implemented: shut down (or suspend, it does not matter at this point) the NAS machine if no host from a pre-defined list of hosts is reachable for a given time period (a few minutes). Obviously, within this time period, the reachability of the hosts must be tested periodically. The whole check should be re-run every 15 minutes or so via cron job. Each test run is to be aborted as soon as at least one host responds. Each individual responsiveness test is to be executed by the ping program, whereby the target machine must miss more than one ICMP echo request and stay silent for at least a few seconds before it is considered unreachable. Reachability is evaluated via ping‘s return code, a reliable measure: it always exits with code 0 if the host has responded at least once (other solutions parse ping‘s standard output, which only adds uncertainty).
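As a minimal sketch of this criterion, the following Python helper wraps a single ping invocation and judges reachability purely by the exit code. The flag values are example choices (on Linux iputils ping, `-c` is the request count and `-W` a per-reply timeout in seconds; BSD ping differs), and the `runner` parameter exists only so the logic can be exercised without a network:

```python
import subprocess

def host_reachable(host, count=2, timeout_s=3, runner=subprocess.run):
    """Return True if `host` answered at least one ICMP echo request.

    ping exits with status 0 as soon as at least one reply came in,
    so the return code alone decides reachability; no output parsing.
    """
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), host]
    result = runner(cmd,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL)
    return result.returncode == 0
```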

Implementing this the right way is simple, but it is still serious engineering and not just a scripting exercise. The system needs to be reliable and maintainable, and it needs a solid logging facility. While one could do this in any shell language, I think it is difficult to get the edge cases right and not to create nasty traps one did not think of. Python provides an ideal framework for fulfilling the above-mentioned criteria, and its control flow allows for writing highly predictable code. However, one needs to understand and properly deal with Popen objects, exceptions, logging, and ping, which, as always, requires careful study of the corresponding documentation.
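To make the idea concrete, here is a hedged sketch of such a check script, not a final implementation: the host list, attempt count, interval, and shutdown command are placeholders you would adapt (the `shutdown -p now` shown is FreeBSD/FreeNAS-style; Linux would typically use `shutdown -h now`). The probe function is injected into the loop so the control flow can be tested without touching the network:

```python
# Hypothetical auto-shutdown check, meant to be run periodically via cron.
# All concrete values (hosts, intervals, shutdown command) are placeholders.

import logging
import subprocess
import time

HOSTS = ["192.168.0.10", "192.168.0.11"]  # hypothetical client machines
ATTEMPTS = 5                 # reachability test runs before giving up
ATTEMPT_INTERVAL_S = 60      # pause between test runs
SHUTDOWN_CMD = ["shutdown", "-p", "now"]  # FreeBSD-style power-off

log = logging.getLogger("autoshutdown")


def ping_host(host):
    """One responsiveness test; True if ping exits with code 0."""
    rc = subprocess.call(["ping", "-c", "2", "-W", "3", host],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return rc == 0


def any_host_reachable(hosts, probe=ping_host):
    """Abort the test run as soon as at least one host responds."""
    for host in hosts:
        if probe(host):
            log.info("%s responded; NAS stays up.", host)
            return True
        log.info("%s did not respond.", host)
    return False


def main():
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s: %(message)s")
    for attempt in range(1, ATTEMPTS + 1):
        if any_host_reachable(HOSTS):
            return 0
        log.info("Attempt %d/%d: no host reachable.", attempt, ATTEMPTS)
        if attempt < ATTEMPTS:
            time.sleep(ATTEMPT_INTERVAL_S)
    log.warning("No host responded in %d attempts; shutting down.", ATTEMPTS)
    subprocess.call(SHUTDOWN_CMD)
    return 1
```

In the real script, a small `__main__` guard would invoke `main()`, and a crontab line along the lines of `*/15 * * * * /path/to/autoshutdown.py` would repeat the whole check every 15 minutes.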