Bot Framework outage on July 25, 2017


On July 25, Bot Framework services experienced an outage that affected many bots. The outage has been resolved, and details are included below. The root cause was the unexpected expiration of a Bot Framework registration within Azure Active Directory. The outage lasted approximately 3 hours and 10 minutes.

Scope of impact

This outage affected bots using the v3.1 Bot Framework authentication protocol. The outage prevented affected bots from requesting login tokens to be used against Bot Framework channels. Without these tokens, bots were unable to respond to users.

Timeline

(all times UTC)

  • 17:58 - Bot Framework’s registration within Azure Active Directory expired

  • 18:15 - API requests for tokens began failing. Most bots cache tokens for 1 hour, so this outage progressively affected bots from 0% at 18:15 to 100% at 19:15.

  • 18:46 - the problem was identified and we engaged the team to resolve the issue

  • 20:10 - completed building a script to repair the issue

  • 21:08 - the script to repair the issue completed, restoring functionality to all datacenters
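The progressive impact described at 18:15 can be illustrated with a small model. This is only a sketch: it assumes that cached tokens expire at times spread evenly across the one-hour cache window, which the timeline implies but does not state, and the function and constant names are invented for illustration.

```python
from datetime import datetime, timedelta

# Values taken from the timeline above (UTC).
FAILURE_START = datetime(2017, 7, 25, 18, 15)  # token requests began failing
CACHE_TTL = timedelta(hours=1)                 # typical token cache lifetime


def affected_fraction(now, failure_start=FAILURE_START, cache_ttl=CACHE_TTL):
    """Estimate the share of bots whose cached token has expired by `now`.

    Assumption: cache expiries are spread uniformly over one TTL window,
    so the affected share grows linearly from 0.0 to 1.0.
    """
    elapsed = (now - failure_start) / cache_ttl  # timedelta / timedelta -> float
    return min(1.0, max(0.0, elapsed))


# Total outage length, from registration expiry to completed repair.
outage = datetime(2017, 7, 25, 21, 8) - datetime(2017, 7, 25, 17, 58)
```

Under this assumption, roughly half of all bots would have been affected by 18:45, and `outage` works out to 3 hours and 10 minutes, matching the duration reported above.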

Analysis and remedies

The primary cause of the outage was that the Bot Framework authentication registration had been erroneously marked with an expiration date. This registration has now been repaired and restored without an expiration. Further, we are auditing our other assets to ensure they are marked without expirations or, in the case of SSL certificates, are properly registered in our alerting system.
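An audit of this kind could be sketched as follows. This is a hypothetical illustration, not the team's actual tooling: the inventory format, asset names, dates, and the 30-day warning window are all assumptions.

```python
from datetime import datetime, timedelta


def find_expiring(assets, now, warn_within=timedelta(days=30)):
    """Return the names of assets that have expired or expire within `warn_within`.

    `assets` maps an asset name to its expiry datetime, or to None when the
    asset has no expiration.
    """
    flagged = []
    for name, expires_at in assets.items():
        if expires_at is None:
            continue  # no expiration set, nothing to alert on
        if expires_at - now <= warn_within:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical inventory; names and dates are made up for illustration.
inventory = {
    "auth-registration": None,
    "ssl-cert-a": datetime(2017, 8, 10),
    "ssl-cert-b": datetime(2018, 6, 1),
}
```

Run against this sample inventory on July 25, 2017, only `ssl-cert-a` falls inside the warning window. A check like this is only useful if it runs on a schedule and feeds an alerting system.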

Additionally, we completed a thorough team-wide review of the incident and have implemented changes to reduce the time it takes us to investigate and repair problems like this.

Lastly, we received many questions about the state of the service while the outage was occurring. We know the importance of transparency and communication, and we’re accelerating our plans to deliver a status page that will keep you informed about Bot Framework services and any related interruptions.

Thank you for your patience during this outage. We value you as a customer, and your commitment to our platform.

Dan Driscoll and Jim Lewallen from the Bot Framework team.