On Tuesday, June 6th, browser access to HelloSign and API integrations stopped working due to an expired SSL certificate on the HelloSign application. The outage lasted from 11:27am to 11:53am PDT (2:27pm - 2:53pm EDT, 20:27 - 20:53 CEST). We’d like to talk about how this happened, and share how we plan to prevent it from happening again.
This failure did not expose any information; it prevented access to that information. Nor was it a complex failure.
It had three contributing causes:
- We separated our web site (www.hellosign.com) from our app (app.hellosign.com) a few months ago.
- Our traditional, manual way of tracking certificate expirations.
- Procedure adjustments related to compliance.
Traditionally, we used a calendar or ticket system to manage certificate expirations. We only had two subdomains, www.hellosign.com and api.hellosign.com, and both used a wildcard certificate for *.hellosign.com.
When checking for expiration dates, we checked ‘www’, since our browsers would tell us the expiration date. Earlier this year, we moved the web application off of www.hellosign.com and onto app.hellosign.com, and moved ‘www’ to a new certificate. The app and the API stayed on the wildcard certificate, so ‘www’ no longer told us anything about them.
Finally, the new procedures that came with our SOC 2 Type 1 report changed the duties of our DevOps team in ways that we’re still getting used to; there are some things we used to be able to ‘just do’ that we now have to document and get permission for first, in accordance with compliance requirements.
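To illustrate why checking ‘www’ stopped being a reliable stand-in, here is a minimal Python sketch that asks each hostname for its own certificate and reads the expiration date from it. It is not the tooling we used; the function names `parse_not_after`, `certificate_expiry`, and `report` are made up for this example, and the only library calls are from Python's standard `ssl` and `socket` modules.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after: str) -> datetime:
    """Convert a certificate 'notAfter' string into an aware UTC datetime."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def certificate_expiry(hostname: str, port: int = 443, timeout: float = 5.0) -> datetime:
    """Connect to hostname and return the expiry of the certificate it serves.

    The default context verifies the certificate, so an already expired
    certificate raises an ssl.SSLError here instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return parse_not_after(cert["notAfter"])


def report(hostnames):
    """Return (hostname, expiry) pairs, soonest expiry first."""
    pairs = [(host, certificate_expiry(host)) for host in hostnames]
    return sorted(pairs, key=lambda pair: pair[1])
```

Calling `report(["www.hellosign.com", "app.hellosign.com", "api.hellosign.com"])` would list every hostname with the expiry of the certificate it actually serves, so a hostname that has moved to a different certificate shows up with a different date instead of being assumed to match.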
Putting it Together
- A DevOps engineer knew our *.hellosign.com certificate would be expiring soon.
- Per tradition, the engineer checked the expiration of www.hellosign.com, not remembering that it was on a different certificate now.
- Per procedure, the engineer created a ticket to swap out the certificate and set the due date to the expiration of the wrong certificate.
- Coincidentally, a second DevOps engineer worked on another certificate renewal two days before the expiration incident. While doing that renewal, they saw that *.hellosign.com would expire soon and went to make a ticket. Under our pre-compliance practices, they would have just updated the *.hellosign.com certificate while they were in there. Post-compliance, this requires a separate change ticket.
- The second engineer found the existing change ticket for the *.hellosign.com renewal, assigned to a different engineer, but didn’t double-check the due date. Then they promptly forgot about it, assuming it was well in hand.
- The *.hellosign.com certificate expired, causing our API customers’ integrations to start throwing errors. Minutes later, Support started getting calls from users of our web application.
Doing Better Next Time
SSL certificates are a critical component of doing business on the modern internet and should have automated expiration tracking pointed at them. The calendar and ticket system had worked well for us and our small number of certificates, but didn’t scale to our current company size.
We will be implementing automation to track expirations, create update tickets, and – if the expiration date is soon enough – trigger on-call alarms.
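As a rough sketch of the decision described above, the Python function below turns a certificate's expiration date into one of three actions: do nothing, open an update ticket, or page the on-call engineer. The 30-day and 7-day windows and the names `TICKET_WINDOW`, `PAGE_WINDOW`, and `expiry_action` are assumptions for illustration, not our actual thresholds or tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds for illustration only.
TICKET_WINDOW = timedelta(days=30)  # open an update ticket inside this window
PAGE_WINDOW = timedelta(days=7)     # page on-call inside this window


def expiry_action(expires_at: datetime, now: datetime) -> str:
    """Return "page", "ticket", or "ok" for a certificate expiring at expires_at."""
    remaining = expires_at - now
    if remaining <= PAGE_WINDOW:
        # Also covers certificates that have already expired (negative remaining).
        return "page"
    if remaining <= TICKET_WINDOW:
        return "ticket"
    return "ok"
```

Because the action depends only on the certificate's real expiration date, a ticket with the wrong due date can no longer hide an upcoming expiration: the alarm fires regardless of what the ticket says.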
Automation to detect expiration, request renewal, and update the live certificates would be better, and the tech-industry as a whole is moving closer to making this an easy thing to set up. We’re not quite there yet.
While this expiration shouldn’t have happened, it did provide proof of one of the other technical controls we have in place. We make use of a technology called HTTP Strict Transport Security (HSTS) to tell browsers that the only way to connect to us is by HTTPS.
If you’ve visited app.hellosign.com in the last couple of months, your browser knows we have a policy in place that will:
- Prevent you from connecting to app.hellosign.com over an unencrypted connection.
- Prevent you from clicking past the scary security warning.
This protects our users from malicious actors forcing their connection to HelloSign onto an unencrypted channel in order to grab their credentials and documents, and makes us a slightly harder target to impersonate for phishing purposes.
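For readers unfamiliar with HSTS, the policy is delivered as a single HTTP response header, `Strict-Transport-Security`, whose `max-age` directive tells the browser how many seconds to remember it. The Python sketch below builds that header; the helper name `hsts_header` and the 180-day value are assumptions for illustration and do not reflect the exact policy we serve.

```python
from typing import Tuple


def hsts_header(max_age_seconds: int, include_subdomains: bool = False) -> Tuple[str, str]:
    """Build a Strict-Transport-Security header as a (name, value) pair."""
    if max_age_seconds < 0:
        raise ValueError("max-age must be a non-negative number of seconds")
    value = f"max-age={max_age_seconds}"
    if include_subdomains:
        # Optional directive: apply the policy to every subdomain as well.
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


# Illustrative value only: remember the policy for 180 days.
EXAMPLE_HEADER = hsts_header(180 * 24 * 60 * 60)
```

A browser that has seen this header over a valid HTTPS connection will, until `max-age` runs out, refuse plain HTTP for the host and treat certificate errors as fatal, which is why the warning could not be clicked past during the outage.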