On July 18th and 19th, 2017, the Bot Framework service experienced degraded service in two of our datacenters. The service has since been restored to full health, and details are included below. The root cause was a misconfigured maintenance job that drove high CPU usage across several servers and datacenters.
Scope of Impact
The maintenance job misconfiguration affected certain Azure Web Apps, including Bot Framework servers in Singapore and Dublin. Bots communicating with Direct Line or Channels hosted in those datacenters received HTTP 503 (Service Unavailable) status codes, indicating that the bot was unavailable.
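For bot developers hit by transient 503s like these, client-side retries can smooth over brief outages. Purely as an illustration (the `send` callable below is a hypothetical stand-in for a Direct Line POST, not part of any SDK), a minimal retry-with-exponential-backoff sketch in Python:

```python
import time
import random

def post_with_retry(send, max_attempts=4, base_delay=0.5):
    """Retry a callable that returns an HTTP status code.

    `send` is a hypothetical stand-in for posting an activity to
    Direct Line. Retries on 503 with exponential backoff plus a
    little jitter; returns the final status code observed.
    """
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Backoff: 0.5s, 1s, 2s, ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status
```

A retry budget like this would have masked short 503 bursts, though not the sustained outage described below.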
The datacenters were affected at different times and for different durations, detailed below. In the case of the Dublin datacenter, the impact lasted long enough that we removed the datacenter from our routing tables. This resolved the errors but increased latency for traffic that would normally have been routed to Dublin and instead had to be served by another datacenter.
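Conceptually, pulling a datacenter out of rotation amounts to routing each request to the first preferred endpoint that still passes health checks. A toy sketch of that idea (not the actual Azure Traffic Manager implementation; endpoint names are illustrative):

```python
def pick_endpoint(endpoints, healthy):
    """Pick the first endpoint currently marked healthy.

    `endpoints` is an ordered list (preferred first); `healthy` maps
    endpoint name -> bool, e.g. derived from recent health-check results.
    Returns None if nothing is healthy. Illustrative only.
    """
    for ep in endpoints:
        if healthy.get(ep, False):
            return ep
    return None

# Marking Dublin unhealthy shifts its traffic to the next datacenter:
# pick_endpoint(["dublin", "singapore"], {"dublin": False, "singapore": True})
# -> "singapore"
```

The trade-off noted above follows directly: traffic still gets served, but by a datacenter farther from the user, so latency rises.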
(All times UTC)
- 2017-07-18 13:14 – Multiple health check requests to services in our Dublin datacenter started failing with HTTP 503 (Service Unavailable)
- 2017-07-18 13:35 – We initiated a restart of the services
- 2017-07-18 14:42 – 503s persist, another restart is requested
- 2017-07-18 14:51 – All of our services in the Dublin datacenter are removed from Traffic Manager
- 2017-07-19 08:02 – Health check requests to our Direct Line cluster in our Singapore datacenter started failing with HTTP 503 (Service Unavailable). Health checks flapped between success and failure every few minutes, and latencies of POSTs to Direct Line increased to 20–30 seconds
- 2017-07-19 15:17 – 503s stopped and health checks returned to normal
- 2017-07-19 17:50 – Services in Dublin re-added to Traffic Manager, customer impact ends
Analysis and Remedies
The Bot Framework services rely on several other services, both inside and outside Azure. These services provide great value, allowing us to focus on building bots rather than worrying too much about infrastructure. However, systems occasionally fail, so we constantly monitor the technologies we depend on, our partner channels, authentication endpoints, and our own code.
Our system health tests did detect a problem; however, internal system logs didn't indicate any particular issue with the Bot Framework servers or any unusual behavior. We therefore suspected that the issue was upstream of our services and began working with our partner teams in Azure.
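Health tests of this kind boil down to probing an HTTP endpoint and treating anything outside the 2xx range, or any network failure, as unhealthy. A minimal sketch using only the standard library (the health-endpoint URL would be an assumption; none is shown here):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def probe(url, timeout=5):
    """Return the HTTP status of a health endpoint, or None on network error."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as e:
        return e.code        # e.g. 503 from an unhealthy service
    except URLError:
        return None          # DNS failure, connect timeout, etc.

def is_healthy(status):
    """A 2xx response counts as healthy; errors and non-2xx do not."""
    return status is not None and 200 <= status < 300
```

Run periodically from several regions, a probe like this catches exactly the failure mode in this incident: the service answers, but with 503s.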
The Web Apps team informed us that the impacted servers were hitting File Server Resource Manager (FSRM) quotas on anonymized logs, causing extra CPU usage. This was initially resolved by deploying a script to clean up those log files. However, a code bug in the throttle (which kicks in when CPU is high) caused a few sites to continue returning 503s after the initial issue was mitigated. We are working with the Web Apps team to better understand how we can more quickly diagnose issues at all levels of the stack and limit the impact to our shared customers.
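We don't have the Web Apps team's actual cleanup script, but the general shape of such a mitigation is straightforward: delete old log files so the directory stays under its quota. A hypothetical sketch (directory layout, extension, and retention window are all assumptions):

```python
import os
import time

def prune_logs(log_dir, max_age_days=7):
    """Delete .log files older than max_age_days to stay under a disk quota.

    A simplified, hypothetical stand-in for the kind of cleanup script
    described above, not the actual mitigation that was deployed.
    Returns the names of the files it removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if name.endswith(".log") and os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```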
Thank you for your patience.
Vince Curley from the Bot Framework team.