Last Tuesday, Gmail users were unable to access their accounts through the Gmail web interface for nearly 2 hours. Google experts responded and then posted a statement explaining what happened. Reading through the statement, it is difficult to tell what caused what. Conducting a root cause analysis and building a Cause Map to document the problem visually makes it clearer why the incident happened and, more importantly, helps identify specific solutions to prevent it from happening again. For example, according to the explanation, a small fraction of servers were taken offline to perform routine upgrades. Normally, that shouldn't be a problem; however, Google had also underestimated the load that some recent changes placed on the request routers. Together, these two conditions caused the request routers to become overloaded. The relationships would look like this…
As you can see, both causes are required to produce the overload; neither would have been enough on its own. Google's response was to increase capacity by bringing additional request routers online. Now that the crisis has been contained and email access has returned to normal, the next step is to conduct a more thorough root cause analysis to make sure this type of event doesn't happen again. Stay tuned for a more detailed analysis of the incident.
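
For readers who think in code, the "both causes are required" relationship described above can be sketched as a small AND-style cause tree. This is only an illustration: the `Cause` class, the `overload_occurs` helper, and the labels are hypothetical names chosen here, not part of Google's statement or of any Cause Mapping tool.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """One box on a cause map (hypothetical structure for illustration)."""
    label: str
    present: bool = False
    # Contributing causes that must ALL occur for this effect to occur.
    requires_all: list["Cause"] = field(default_factory=list)

    def occurs(self) -> bool:
        # A leaf cause occurs if it was observed; an effect occurs only
        # when every one of its contributing causes occurs (AND logic).
        if not self.requires_all:
            return self.present
        return all(cause.occurs() for cause in self.requires_all)


def overload_occurs(servers_offline: bool, load_underestimated: bool) -> bool:
    """Evaluate the two-cause relationship taken from the explanation."""
    upgrades = Cause("Servers taken offline for routine upgrades", servers_offline)
    load = Cause("Load from recent changes underestimated", load_underestimated)
    overload = Cause("Request routers overloaded", requires_all=[upgrades, load])
    return overload.occurs()


if __name__ == "__main__":
    # Walk through every combination of the two contributing causes.
    for offline in (False, True):
        for underestimated in (False, True):
            result = overload_occurs(offline, underestimated)
            print(f"offline={offline}, underestimated={underestimated}: overload={result}")
```

Under this sketch, removing either contributing cause is enough to prevent the overload, which is why a cause map tends to surface more than one possible solution.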



