400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services

SUMMARY:

On Monday 7 December 2015, Google Container Engine customers could not 
create external load balancers for their services for a duration of 21 
hours and 38 minutes. If your service or application was affected, we 
apologize — this is not the level of quality and reliability we strive to 
offer you, and we have taken and are taking immediate steps to improve the 
platform’s performance and availability.


DETAILED DESCRIPTION OF IMPACT:

From Monday 7 December 2015 15:00 PST to Tuesday 8 December 2015 12:38 PST,
Google Container Engine customers could not create external load balancers
for their services. Affected customers saw HTTP 400 “invalid argument”
errors when creating load balancers in their Container Engine clusters.
6.7% of clusters experienced API errors due to this issue.

The issue also affected customers who deployed Kubernetes clusters in the 
Google Compute Engine environment.

The issue was confined to Google Container Engine and Kubernetes, with no 
effect on users of any other resource based on Google Compute Engine.

ROOT CAUSE:

Google Container Engine uses the Google Compute Engine API to manage 
computational resources. At about 15:00 PST on Monday 7 December, a minor
update to the Compute Engine API inadvertently changed the case-sensitivity
of the “sessionAffinity” enum variable in the target pool definition, and
this variation was not covered by testing. Google Container Engine was not
aware of this change and sent requests with incompatible case, causing the
Compute Engine API to return an error status.
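
The failure mode described above can be illustrated with a minimal sketch.
It is not the Compute Engine implementation: the validator functions, the
set of accepted values, and the client spelling "None" are all assumptions
chosen for illustration. It only shows how a switch from case-insensitive
to case-sensitive enum matching turns a previously accepted request into a
400 response.

```python
# Hypothetical set of accepted enum values, for illustration only.
ALLOWED_SESSION_AFFINITY = {"NONE", "CLIENT_IP"}


def validate_lenient(value: str) -> int:
    """Assumed behaviour before the update: case-insensitive match."""
    return 200 if value.upper() in ALLOWED_SESSION_AFFINITY else 400


def validate_strict(value: str) -> int:
    """Assumed behaviour after the update: exact, case-sensitive match."""
    return 200 if value in ALLOWED_SESSION_AFFINITY else 400


# Hypothetical spelling sent by a client that was unaware of the change.
request_value = "None"

print(validate_lenient(request_value))  # 200
print(validate_strict(request_value))   # 400
```

The request itself is unchanged in both calls; only the server-side
comparison differs, which is why the client saw errors without having
deployed anything new.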

REMEDIATION AND PREVENTION:

Google engineers re-enabled load balancer creation by rolling back the 
Google Compute Engine API to its previous version. This was complete by 8 
December 2015 12:38 PST.

On 8 December 2015 at 10:00 PST, Google engineers committed a fix to the
Kubernetes public open source repository.

Google engineers will increase the coverage of the Container Engine 
continuous integration system to detect compatibility issues of this kind. 
In addition, Google engineers will change the release process of the 
Compute Engine API to detect issues earlier to minimize potential negative 
impact.
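
One way such a compatibility check could look is sketched below. This is an
assumption about the general technique, not a description of Google's
actual test or release tooling: requests recorded from a dependent client
are replayed against the current and the candidate API version, and any
request that the current version accepts but the candidate rejects is
flagged. The names current_api, candidate_api, find_regressions, and the
recorded payloads are hypothetical.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]
Validator = Callable[[Request], int]  # returns an HTTP-style status code

ALLOWED = {"NONE", "CLIENT_IP"}  # hypothetical accepted values


def current_api(req: Request) -> int:
    # Stand-in for the deployed API version (assumed case-insensitive).
    return 200 if req["sessionAffinity"].upper() in ALLOWED else 400


def candidate_api(req: Request) -> int:
    # Stand-in for the release candidate (assumed case-sensitive).
    return 200 if req["sessionAffinity"] in ALLOWED else 400


def find_regressions(
    samples: List[Request], old: Validator, new: Validator
) -> List[Request]:
    """Return requests the old version accepts but the new one rejects."""
    return [r for r in samples if old(r) < 400 and new(r) >= 400]


# Hypothetical requests recorded from a dependent client.
recorded = [{"sessionAffinity": "NONE"}, {"sessionAffinity": "None"}]

regressions = find_regressions(recorded, current_api, candidate_api)
print(regressions)  # [{'sessionAffinity': 'None'}]
```

A non-empty result would block the release in this sketch, so the
incompatibility surfaces before customers send the affected requests.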
