Outage in Google Cloud

Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers

Resolved Major
Started March 12, 2024 · Lasted about 2 hours

Outage Details

Summary: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers.

Description: We are investigating an issue with Google Kubernetes Engine affecting customers who use NVIDIA GPU drivers on Container-Optimized OS (COS). Affected nodes that are newly created or recreated have non-functional GPU drivers, which prevents workloads that use those drivers from running. Drivers for some GPU models are unaffected (P4, P100, V100, K80). Our engineering team continues to work toward resolving the driver fetching issue. We will provide more information by Tuesday, 2024-03-12 18:00 US/Pacific. We apologize to all who are affected by the disruption.

Diagnosis: When installing the GPU driver, GKE users will see an error message of this nature on the GPU node: "Failed to download GPU driver installer, status: 403 Forbidden".

Workaround: None at this time. However, the issue can be mitigated by avoiding recreation of existing nodes running GPUs. Note that GCP has halted automatic node recreation as a partial mitigation.

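The error text quoted in the Diagnosis can be used to check collected node logs for this failure. The following is a minimal sketch, not an official tool: it assumes the log lines have already been gathered into a list of strings, and the function name and the sample log lines are hypothetical. Only the quoted error string comes from the incident report.

```python
# Error text quoted in the incident's Diagnosis section.
ERROR_SIGNATURE = "Failed to download GPU driver installer, status: 403 Forbidden"


def find_driver_fetch_failures(log_lines):
    """Return the log lines that contain the 403 driver-download error.

    `log_lines` is any iterable of strings; how the logs are collected
    from the node is outside the scope of this sketch.
    """
    return [line for line in log_lines if ERROR_SIGNATURE in line]


if __name__ == "__main__":
    # Hypothetical sample lines; real node logs will be formatted differently.
    sample = [
        "node-a: starting GPU driver installation",
        "node-a: Failed to download GPU driver installer, status: 403 Forbidden",
        "node-b: GPU driver ready",
    ]
    for hit in find_driver_fetch_failures(sample):
        print(hit)
```

A match only indicates that a node hit the download failure described here; it does not by itself confirm which driver version or GPU model was involved.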
Components affected
Google Kubernetes Engine (us-west3)
Google Kubernetes Engine (us-west2)
Google Kubernetes Engine (europe-west3)
Google Kubernetes Engine (us-central1)
Google Kubernetes Engine (us-west4)
Google Kubernetes Engine (europe-west2)
Google Kubernetes Engine (us-east4)
Google Kubernetes Engine
Google Kubernetes Engine (me-west1)
Google Kubernetes Engine (northamerica-northeast1)
Google Kubernetes Engine (me-central2)
Google Kubernetes Engine (asia-south1)
Google Kubernetes Engine (me-central1)
Google Kubernetes Engine (asia-south2)
Google Kubernetes Engine (southamerica-east1)
Google Kubernetes Engine (asia-southeast2)
Google Kubernetes Engine (northamerica-northeast2)
Google Kubernetes Engine (asia-northeast3)
Google Kubernetes Engine (europe-west9)
Google Kubernetes Engine (us-south1)
Google Kubernetes Engine (australia-southeast2)
Google Kubernetes Engine (asia-east1)
Google Kubernetes Engine (asia-northeast2)
Google Kubernetes Engine (europe-central2)
Google Kubernetes Engine (asia-southeast1)
Google Kubernetes Engine (australia-southeast1)
Google Kubernetes Engine (europe-southwest1)
Google Kubernetes Engine (europe-west12)
Google Kubernetes Engine (asia-east2)
Google Kubernetes Engine (us-east5)
Google Kubernetes Engine (europe-north1)
Google Kubernetes Engine (europe-west4)
Google Kubernetes Engine (europe-west8)
Google Kubernetes Engine (us-east1)
Google Kubernetes Engine (europe-west6)
Google Kubernetes Engine (us-west1)
Google Kubernetes Engine (southamerica-west1)
Google Kubernetes Engine (asia-northeast1)
Google Kubernetes Engine (europe-west10)
Google Kubernetes Engine (europe-west1)
