Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers
Resolved
Major
Started March 12, 2024
Lasted about 2 hours
Outage Details
Summary: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers
Description: We are investigating an issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS). Affected nodes that are newly created or recreated have non-functional GPU drivers, which prevents workloads that use those drivers from running. Drivers for some GPU models are unaffected (P4, P100, V100, K80).
Our engineering team continues to work towards resolving the driver fetching issue.
We will provide more information by Tuesday, 2024-03-12 18:00 US/Pacific.
We apologize to all who are affected by the disruption.
Diagnosis: When the GPU driver is installed on a GPU node, GKE users will see an error message such as "Failed to download GPU driver installer, status: 403 Forbidden".
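For operators who want to confirm whether a node hit this issue, the following is a minimal sketch that scans already-collected log lines for the error text quoted in the diagnosis. It assumes the driver installer output has been saved as plain text lines. The function name and the sample log are hypothetical; only the error string comes from the report.

```python
import re

# Pattern built from the error message quoted in the diagnosis.
# The status code is captured so other HTTP failures can be told apart.
DRIVER_FETCH_ERROR = re.compile(
    r"Failed to download GPU driver installer, status: (\d{3})"
)


def find_driver_fetch_errors(log_lines):
    """Return (line_number, status_code) for each line with the download error."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = DRIVER_FETCH_ERROR.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


# Hypothetical sample log; real installer output will contain other lines.
sample = [
    "Starting GPU driver installation",
    "Failed to download GPU driver installer, status: 403 Forbidden",
]
print(find_driver_fetch_errors(sample))
```

A status code of 403 on a newly created or recreated node matches the symptom described in this incident; other codes would point to a different failure.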
Workaround: None at this time. However, the issue can be mitigated by not recreating existing nodes that run GPUs. Note that GCP has halted automatic node recreation as a partial mitigation.