Outage in Google Cloud

Global: Calico enabled GKE clusters’ pods may get stuck Terminating or Pending after upgrading to 1.22+

Resolved Minor
September 29, 2022 - Lasted 2 days

Outage Details

Summary: Global: Calico-enabled GKE clusters' pods may get stuck Terminating or Pending after upgrading to 1.22+

Description: The following GKE versions are vulnerable to a race condition when using the Calico Network Policy, resulting in pods stuck Terminating or Pending:

- All 1.22 GKE versions
- All 1.23 GKE versions
- 1.24 versions before 1.24.4-gke.800

Only a small number of GKE clusters have actually experienced stuck pods. Use of the cluster autoscaler can increase the chance of hitting the race condition.

A fix is available in GKE v1.24.4-gke.800 or later. The fix is also being made available in v1.23 and v1.22 as part of the next release, which has now started. Once it is available, customers can manually upgrade to the fixed version. Alternatively, clusters on the RAPID, REGULAR, or STABLE release channels running 1.22 or 1.23 will upgrade automatically over the coming weeks.

We will provide an update by Friday, 2022-09-30 15:00 US/Pacific with current details.

The issue was introduced in the Calico component, and GKE has been working closely with the Calico project to produce a fix. We apologize to all who are affected by the disruption.

Diagnosis: The Calico CNI plugin reports the following error when terminating Pods:

    Warning FailedKillPod 36m (x389 over 121m) kubelet error killing pod: failed to "KillPodSandbox" for "af9ab8f9-d6d6-4828-9b8c-a58441dd1f86" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod "myclient-pod-6474c76996" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"

Workaround: Customers currently experiencing the issue are requested to take one of the following actions:

1. [Recommended] Manually upgrade to GKE v1.24.4-gke.800 or later (if viable), or reach out to Google Cloud Support to have an internal patch applied.
2. Restart the kubelet and calico-node to get the pods unstuck.
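
The affected version ranges listed in the description can be expressed as a small check. The following is a minimal sketch, not an official tool: the function names parse_gke_version and is_affected are hypothetical, and it assumes version strings of the form 1.24.4-gke.800. It treats every 1.22 and 1.23 version as affected, as the notice stated at the time, so it would not recognise the patched 1.22 and 1.23 releases that the notice says were still rolling out.

```python
import re

# First fixed 1.24 release named in the notice: 1.24.4-gke.800.
FIXED_1_24 = (1, 24, 4, 800)


def parse_gke_version(version):
    """Split a string like '1.24.4-gke.800' into (major, minor, patch, gke_patch)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version):
    """Return True if the version falls in a range the notice lists as vulnerable."""
    parsed = parse_gke_version(version)
    major_minor = parsed[:2]
    # Assumption: all 1.22 and 1.23 versions are affected, per the notice as written.
    if major_minor in {(1, 22), (1, 23)}:
        return True
    # 1.24 is affected only before 1.24.4-gke.800.
    if major_minor == (1, 24):
        return parsed < FIXED_1_24
    # Versions before 1.22 predate the issue; later minors are at or past the fix.
    return False


if __name__ == "__main__":
    for v in ["1.21.14-gke.700", "1.23.8-gke.1900", "1.24.3-gke.2100", "1.24.4-gke.800"]:
        print(v, "affected" if is_affected(v) else "not affected")
```

The example version strings in the loop are illustrative inputs only and are not taken from the notice, apart from 1.24.4-gke.800.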