While reliability issues come in many flavors (reentrancy, exception handling,
deadlocks, etc.), the current most popular reliability issue to debate (at least inside
Microsoft) is out of memory, aka OOM.
In .NET 1.0 and 1.1 the CLR wasn't hardened to OOM. Basically, if you hit a hard out
of memory failure (the page file can't grow, running the GC doesn't free anything)
the CLR falls over. For Whidbey the CLR execution engine and a small percentage of
mscorlib have been "hardened" to OOM.
Ah, first, define harden.
One of the architects on the Indigo team had a great definition (I probably can't
reproduce his definition verbatim, but I'll try): Hardening means tolerating a
fault and leaving the component in a consistent state. That consistent state might
be unavailable, but the component is never left in a corrupted state.
In the overly simple example I did the other day, it was easy to make the component
hardened to any failures.
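To make that definition a bit more concrete, here is a minimal sketch (in Python purely
for brevity; the idea is not specific to any runtime). The Ledger class, its methods,
and the fault parameter are hypothetical names invented for this illustration, and the
fault is simulated rather than a real out of memory condition. It shows one way, not
the only way, to get both halves of the definition: an ordinary failure leaves the old
state untouched, and a memory failure leaves the component consistent but unavailable.

```python
class Ledger:
    """Hypothetical component used only to illustrate 'hardening'."""

    def __init__(self, balances):
        self._balances = dict(balances)
        self.available = True

    def balances(self):
        # Return a copy so callers cannot mutate internal state.
        return dict(self._balances)

    def transfer(self, src, dst, amount, fault=None):
        if not self.available:
            raise RuntimeError("ledger is unavailable")
        try:
            # Stage the change on a copy; the live state is not touched
            # until every step has succeeded.
            staged = dict(self._balances)
            staged[src] -= amount
            if fault is not None:
                raise fault  # simulated fault between the two updates
            staged[dst] += amount
        except MemoryError:
            # Assumption for this sketch: after a memory failure we stop
            # accepting work. The state is still the last consistent one.
            self.available = False
            raise
        # Commit with a single reference swap.
        self._balances = staged


ledger = Ledger({"a": 100, "b": 0})
ledger.transfer("a", "b", 30)

try:
    ledger.transfer("a", "b", 50, fault=MemoryError())
except MemoryError:
    pass

# The half-finished transfer left no trace, and the ledger is now offline.
print(ledger.balances(), ledger.available)
```

The design choice here is to never mutate live state in place: the fault can only ever
hit the staged copy. Note that the sketch dodges the hard part discussed below, since
making the copy is itself an allocation.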
Ah, next define failures.
Here is a great one - how should your component behave if the power cable is unplugged?
Power loss is a failure. Hard disk crashing is a failure. Running out of memory is
a failure. You see the pattern. There are extremely reliable systems out there. I
heard anecdotally about a banking system with a reliability policy that if a nuclear
bomb went off in one city, pending transactions would be automatically rerouted
to another city. That's a pretty high bar.
So, back to our little out of memory problem. Because of the dynamic nature of the
CLR - virtual method calls, dynamic JIT, boxing, etc - it is extremely hard to write
code that is guaranteed to never require an allocation. In unmanaged code this tends
to be easier (not easy, but easier) because allocations are always explicit and never
asynchronous. With that, hitting a hard memory failure (you can't allocate a byte)
means that you basically can't continue to run managed code.
Of course, nothing is ever so simple. In Whidbey the notion of "constrained execution
regions" (CERs) was introduced, along with reliability policy, which allows for writing
managed code that is in fact hardened to hard out of memory failures. But writing that
type of code is truly rocket science.
So what is a component to do?
Caveat: I'm talking here just about the problems in this space... there is no
need to panic.