@stephenw10:
I expect this machine will now fail! ::)
Excellent, you've taken your first steps towards reducing your exposure to unexpected outages.
My process goes something like the following:
1. Assume everything will fail at the absolute worst possible time.
2. Determine what the worst possible time for a failure is for each service.
3. Determine every way the systems a given service depends on can fail.
4. Determine what, if anything, I can do about each potential failure point.
5. For any potential failure I can't avoid, write up a plan of action and documentation for what I will do when it happens.
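
If it helps to keep track of all this, the process above can be sketched as a small inventory. This is only a rough illustration in Python, not a tool I'm claiming anyone uses: the class names, the `unplanned` helper, and the example firewall entries are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FailurePoint:
    component: str        # a system the service depends on
    mitigation: str = ""  # what can be done to avoid the failure, if anything
    runbook: str = ""     # documented plan for when it happens anyway


@dataclass
class Service:
    name: str
    worst_time: str       # when a failure would hurt the most
    failure_points: list = field(default_factory=list)


def unplanned(service):
    """Return components that have neither a mitigation nor a runbook."""
    return [
        fp.component
        for fp in service.failure_points
        if not fp.mitigation and not fp.runbook
    ]


# Hypothetical entries, purely for illustration.
firewall = Service(
    name="firewall",
    worst_time="while I am travelling",
    failure_points=[
        FailurePoint("power supply", mitigation="keep a spare on the shelf"),
        FailurePoint("boot disk", runbook="reinstall and restore the config backup"),
        FailurePoint("upstream link"),
    ],
)

print(unplanned(firewall))
```

Anything `unplanned` returns is a failure point that still needs either a fix or a written plan, which is the gap the last two steps are meant to close.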