Compute Engine bug wrongly rerouted all incoming global traffic for 18 minutes
Less than a month after Google was bigging up its cloud prowess and announcing a significant data centre expansion, the company has had to publicly apologise to its customers for a global cloud outage earlier this week.
Google Cloud’s compute service spluttered offline in all regions for 18 minutes on Monday evening, Pacific Time.
At 19:09 Pacific Time, all inbound traffic to customers' Google Compute Engine instances was routed incorrectly, resulting in dropped connections and no way to reconnect.
“We recognise the severity of this outage, and we apologise to all of our customers for allowing it to occur,” Google said in a statement.
Google has taken the incident very seriously, offering 10-25 per cent discounts off affected customers' monthly cloud bills, more than it is contractually obliged to provide under its service level agreements.
Other Google cloud services, such as Google App Engine and Google Cloud Storage, were not affected.
Beyond the refunds, this kind of outage could prove even more costly for Google. As more customers move to the cloud, and others consider doing so, companies like Google must be seen to be working to prevent such outages, or risk losing business.
To that end, Google said it has already learned lessons from the slip-up, and has put into place an action plan to make sure it doesn’t happen again.
“It is our intent to enumerate all the lessons we can learn from this event, and then to implement all of the changes which appear useful,” said Google.
“As of the time of this writing in the evening of 12 April, there are already 14 distinct engineering changes planned spanning prevention, detection and mitigation, and that number will increase as our engineering teams review the incident with other senior engineers across Google in the coming week,” it said.