Data Centers and Cosmic Rays


Cosmic Rays

A patent from Intel caught my attention recently. The patent covers a chip with the ability to detect cosmic rays, presumably to trigger error correction at the CPU level. While soft errors are not a huge problem for CPUs currently, Moore's law tells us that transistors will get much smaller than they are in 2008. As transistors shrink and millions of them are wired together in complex ways, soft errors become more likely. By a soft error, I mean that a perfectly good transistor switches state inadvertently.

Cosmic rays are not really rays, per se. They are energetic particles that originate from any of a number of places in outer space - from the sun directly, from a black hole, from a supernova, and from many other sources. The particles can be many orders of magnitude more energetic than anything the most powerful man-made particle accelerators can produce. They affect us in ways that we don't fully understand, but as an example, they account for more than 12% of the radiation exposure the average Australian receives annually. They are invisible. They are generally not very big. They are real. And they are very frequent.

A little less than 10% of the cosmic rays that come down at us are alpha particles (helium nuclei). These consist of two protons and two neutrons. The absence of electrons means that they are positively charged. In silicon, a positively charged particle like this can't travel very far (<30 microns) before it interacts with the silicon itself.

RAM is affected by Cosmic Rays now

When you look at CPUs, there is very little in them that is not constantly in energetic motion. Aside from registers and cache, the transistors within the silicon are constantly pushing data. Intel et al. have built a lot of redundancy and error checking into the logic paths, and the CPU also has the most reliable power curve on the motherboard. There are a lot of reasons CPUs don't see as many soft errors as RAM does.

RAM, on the other hand, is your data store. While your I/O processes will create a constant flow of data in and out of RAM, some of what is in memory is very stable - basic OS cache information, persistent data storage for high volume information, and all of the information your currently running programs rely on. The state of a RAM cell stays unchanged far longer than that of a transistor in a CPU because of the task RAM is tailored for. That is why cosmic rays present a problem for RAM now. A switched bit might not be very obvious, and in a lot of situations it might not matter, but it happens - and one of the most common reasons for it happening is cosmic rays.

It happens 10 times more frequently in Denver than it does at sea level. If you have a data center well above sea level, ECC RAM is that much more important. ECC detects and corrects soft errors transparently to everybody. It does so in hardware, using extra check bits stored alongside the data, without any involvement from the operating system or your applications.
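To make the detect-and-correct idea concrete, here is a minimal sketch of a Hamming(7,4) code in Python. It is a toy, not what any particular memory module implements: real ECC memory commonly protects 64 data bits with 8 check bits, and the function names (encode, correct, extract_data) are my own for illustration. The principle is the same though: the check bits let you work out which single bit was flipped and flip it back.

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 check bits.
# Illustrative only; function names and layout are assumptions for this sketch.

def encode(data):
    """Turn 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(codeword):
    """Return (repaired codeword, position of the flipped bit or 0 if none)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return c, position

def extract_data(codeword):
    """Pull the 4 data bits back out of a 7-bit codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]

if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    hit = list(stored)
    hit[4] ^= 1  # simulate a cosmic ray flipping the bit at position 5
    repaired, position = correct(hit)
    print("flipped position:", position)           # flipped position: 5
    print("data recovered:", extract_data(repaired))  # data recovered: [1, 0, 1, 1]
```

A code this small can only repair a single flipped bit per word. Two flips in the same word would be "repaired" incorrectly, which is why practical schemes add an extra parity bit so that double-bit errors are at least detected.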

So save yourself the headache of dealing with random unrepeatable issues caused by undetectable memory errors. And save yourself the conversation with your data center manager about how their lack of cosmic ray protection is costing you time and money. Spend the extra bucks on ECC RAM and make it all less of a worry.

How often does this really happen?

In the best case, errors occur about 6 times per gigabyte of RAM per year. This is based on some old data - studies that are now about 10 years old (summarized here) - and that figure reflects the best possible RAM architecture that was conceived at the time. More common memory architecture at the time sustained 180 soft errors per gig per year - primarily attributed to cosmic ray interactions. On a machine with 2 GB of RAM, that's 360 errors a year, or just about 1 RAM error per day. In Denver, at 10 times the sea-level rate, that would amount to 3600 errors annually - enough that it would be a surprise if you weren't regularly affected.
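If you want to plug in your own numbers, the arithmetic above is simple enough to sketch in a few lines of Python. The rates and the factor of 10 for Denver are just the figures quoted in this post, and the function name is made up for the example, so treat the output as a rough estimate rather than a measurement.

```python
# Back-of-the-envelope soft error estimate using the figures quoted in this post.
# The function name and constants are illustrative assumptions, not measured values.

BEST_CASE_RATE = 6     # errors per GB per year, best architecture of the era
COMMON_RATE = 180      # errors per GB per year, more common memory of the era
DENVER_FACTOR = 10     # Denver vs. sea level, as claimed above

def errors_per_year(ram_gb, rate_per_gb_year, altitude_factor=1):
    """Expected soft errors per year for a machine with ram_gb of RAM."""
    return ram_gb * rate_per_gb_year * altitude_factor

if __name__ == "__main__":
    sea_level = errors_per_year(2, COMMON_RATE)
    denver = errors_per_year(2, COMMON_RATE, DENVER_FACTOR)
    print(f"sea level: {sea_level} per year, {sea_level / 365:.2f} per day")
    # sea level: 360 per year, 0.99 per day
    print(f"Denver:    {denver} per year, {denver / 365:.2f} per day")
    # Denver:    3600 per year, 9.86 per day
```

Swap in your own RAM size to see how quickly it adds up. With the same assumed rates, a 16 GB box at sea level would come out at 2880 errors a year.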
