Cooling Issue in US Central Region Data Center

Incident Report for GroqCloud

Resolved

This incident has been resolved.

A lightning strike caused cooling issues at one of our primary data centers in the US central region, which required an urgent power-down of all servers in the facility. Our global model scheduler was running on general-purpose compute in the impaired data center, and the decision was made to relocate it.

The actions taken resulted in all model instances globally being re-created, causing significant degradation of service between 02:08 UTC and 02:24 UTC. Unfortunately, the scheduler isn't currently optimized for rapid model bring-up in multiple regions at the same time, and manual intervention was required, which delayed the recovery. By 02:27 UTC error rates had dropped to ~3%, and by 03:34 UTC they had returned to standard levels.

Posted Jul 16, 2025 - 03:57 UTC

Update

We are seeing performance improvement across the impacted models as the team continues to remediate the issue.

Posted Jul 16, 2025 - 02:45 UTC

Update

We are continuing to work on a fix for this issue.

Posted Jul 16, 2025 - 02:17 UTC

Update

The team has identified the issue causing varying outages, elevated error rates, and increased latency across several models. They are working to implement a fix.

Posted Jul 16, 2025 - 02:02 UTC

Identified

The team is currently investigating an issue that may be causing errors across various models.

Posted Jul 16, 2025 - 01:31 UTC

This incident affected: API.