Cooling Issue in US Central Region Data Center
Resolved · Full outage

This incident has been resolved.

Following a lightning strike, one of our primary data centers in the US central region experienced cooling issues that forced an urgent power-down of all servers in the facility. Our global model scheduler was running on general-purpose compute in the impaired data center, and the decision was made to relocate it.

The actions taken caused all model instances globally to be re-created, which significantly degraded service between 02:08 UTC and 02:24 UTC. Unfortunately, the scheduler isn't currently optimized for rapid model bringup in multiple regions at the same time, and manual intervention was required, which delayed the recovery. By 02:27 UTC error rates had dropped to ~3%, and by 03:34 UTC they had returned to standard levels.

Wed, Jul 16, 2025, 03:57 AM
Affected components
API
Updates

Resolved

This incident has been resolved.

Wed, Jul 16, 2025, 03:57 AM

Identified

We are seeing performance improvement across the impacted models as the team continues to remediate the issue.

Wed, Jul 16, 2025, 02:45 AM (1 hour earlier)

Identified

We are continuing to work on a fix for this issue.

Wed, Jul 16, 2025, 02:17 AM (27 minutes earlier)

Identified

The team has identified the issue causing outages of varying severity, elevated error rates, and increased latency across several models. They are working to implement a fix.

Wed, Jul 16, 2025, 02:02 AM (15 minutes earlier)

Identified

The team is currently investigating an issue that may be causing errors across various models.

Wed, Jul 16, 2025, 01:31 AM (30 minutes earlier)