Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers
Resolved
Major
March 12, 2024 - Lasted about 2 hours
Incident Report
Summary: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers
Description: We are investigating an issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS). Affected Nodes that are newly created or recreated have non-functional GPU drivers, which prevents workloads that depend on those drivers from running. Drivers for some GPUs are unaffected (P4, P100, V100, K80).
Our engineering team continues to work towards resolving the driver fetching issue.
We will provide more information by Tuesday, 2024-03-12 18:00 US/Pacific.
We apologize to all who are affected by the disruption.
Diagnosis: When the GPU driver is installed on a GPU node, GKE users will see error messages such as "Failed to download GPU driver installer, status: 403 Forbidden".
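To check whether a node's installer logs show this symptom, the logs can be searched for the quoted error text. The following is a minimal sketch, not an official tool: the function name and the sample log lines are hypothetical, and only the error string itself comes from the diagnosis above.

```python
import re

# Error text quoted in the Diagnosis above. The surrounding log format
# (timestamps, prefixes) is an assumption and is not matched here.
ERROR_PATTERN = re.compile(
    r"Failed to download GPU driver installer, status: 403 Forbidden"
)


def find_driver_fetch_errors(log_lines):
    """Return the log lines that contain the 403 driver-download error."""
    return [line for line in log_lines if ERROR_PATTERN.search(line)]


# Hypothetical sample log lines, for illustration only.
sample = [
    "installer starting",
    "Failed to download GPU driver installer, status: 403 Forbidden",
]
matches = find_driver_fetch_errors(sample)
```

A non-empty result suggests the node hit the driver fetching issue described in this report. How the log lines are collected from the node is left open here.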
Workaround: None at this time. However, the issue can be mitigated by avoiding recreation of existing Nodes running GPUs. Note that GCP has halted automatic Node recreation as a partial mitigation.