Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers
Resolved - Major
March 12, 2024 - Lasted about 2 hours
Incident Report
Summary: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers
Description: We are investigating an issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS). Affected Nodes that are newly created or recreated have non-functional GPU drivers, which prevents workloads that depend on those drivers from running. Drivers for some GPU models (P4, P100, V100, K80) are unaffected.
Our engineering team continues to work towards resolving the driver fetching issue.
We will provide more information by Tuesday, 2024-03-12 18:00 US/Pacific.
We apologize to all who are affected by the disruption.
Diagnosis: When the GPU driver is installed, GKE users will see an error message like the following on the GPU node: "Failed to download GPU driver installer, status: 403 Forbidden".
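To check whether a node hit this failure, the error string quoted in the Diagnosis can be searched for in whatever log output has been collected from the node. The following is a minimal sketch, not an official tool: the function name `find_driver_download_failures` and the sample log lines are hypothetical, and only the quoted error text comes from the report. How the logs are collected, and what surrounds the error text on each line, are assumptions.

```python
import re

# The error text is taken from the incident report. The surrounding log
# format (timestamps, prefixes) is an assumption, so we match the text
# anywhere in a line.
DRIVER_403 = re.compile(
    r"Failed to download GPU driver installer, status: 403 Forbidden"
)


def find_driver_download_failures(log_lines):
    """Return (line_number, line) pairs for lines containing the 403 error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if DRIVER_403.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical sample input; real node logs will look different.
    sample = [
        "starting GPU driver installation\n",
        "Failed to download GPU driver installer, status: 403 Forbidden\n",
    ]
    for number, line in find_driver_download_failures(sample):
        print(f"line {number}: {line}")
```

An empty result only means the string was not present in the lines that were passed in; it does not confirm that the driver installed correctly.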
Workaround: None at this time. However, the issue can be mitigated by not recreating existing Nodes that run GPUs. Note that GCP has halted automatic Node recreation as a partial mitigation.