We launched HelloSign for Gmail yesterday. We saw an excellent traffic spike and received a lot of great feedback from Lifehacker, TechCrunch, a top Hacker News post, Twitter, and, most importantly, our users. Because you rely on us for your most important documents, we prioritize transparency. We want to talk about the huge traffic influx, the intermittent timeouts it produced, and how we quickly resolved things.
In the meantime, here were our favorite comments:
- "I'm already a heavy user of hellofax/hellosign but this falls into the "F****** Awesome" category of new things for me. This is just WAY better than having to download and drag and drop into hellofax then email from there which I honestly already thought was pretty easy..." (HN user zbruhnke)
- "A service that plugs into Gmail and lets you sign and send documents. FINALLY. May the fax machine die..." (@mgamini)
- "I use HelloSign every single day, this is huge for my productivity. Thanks for giving me an hour of my life back each day guys" (HN user dmor)
And a couple comments that made the Ops team worry:
- "Looks promising, sorely needed. However, installation did not work for me. I presume that you are under a heavy load..." (HN user dirtyaura)
- "Today, the HelloSign team released a new Gmail plugin that gives you the power to sign documents without having to download, print, and scan them...Due to the influx of new users, we kept getting errors while trying to install it through the website. Thankfully, it worked when we tried installing it via the Chrome Web Store..." (digitaltrends.com)
With all the inbound traffic, we needed to resolve these issues as quickly as possible. Here are some of the details.
The short answer is, MySQL kept returning "Too many connections". More precisely, a few queries in our code needed optimization.
Before the user spike that came with HelloSign for Gmail, the problematic queries were a minor annoyance with no noteworthy impact on overall site functionality. Then we saw a traffic spike unlike anything in our historical hit counts.
The beginning of January was relatively consistent for HelloSign, until the Gmail Plugin launch:
Graph of HelloSign traffic during January 2013.
We have solid systems for a typical load and a typical spike, but we weren't prepared for a level of traffic that was an order of magnitude higher.
For reference, here are our DB Connections for the last week (the spike here is pretty much equal to our max_connections setting):
Weekly graph of DB connections on the HelloSign database.
This graph and the DB Connection graph below show the period of intermittent timeouts, covering around 70 minutes (around noon PST).
It is worth noting that our traffic spike continued well beyond the scope of this graph. The problematic queries caused too many requests to run for too long on the database server, so the number of active connections continued to grow until it reached our max_connections setting.
After we fixed the problematic queries, the database was able to support the increased traffic while maintaining the much smaller number of concurrent database connections you see on the far right of the graph.
Fixing the issue
Luckily, slow queries were already being logged on our db instance. First, we disabled a few administrative services with particularly complex queries. Then we checked the slow query log and identified a few problematic queries that needed revising. There were a couple of trends in our long-running queries. The longest-running ones retrieved random values using something like this:
SELECT column FROM table ORDER BY RAND() LIMIT 10;
The problem with ORDER BY RAND() is that MySQL generates a random value for each record in the table, then orders the rows based on those values. For small tables this does not take particularly long, but for large tables the overhead of ORDER BY RAND() is huge. Assuming you are using an auto_increment integer primary key, a decent alternative is to pull the max id and generate your random values programmatically. Then you have something more like:
SELECT MAX(id) FROM table;
# Generate random values based on the maximum id
SELECT id, column FROM table WHERE id IN (rand_00, .. , rand_09);
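To make the idea concrete, here is a rough sketch of the max-id approach. The table and column names are made up, and we've used SQLite via Python's sqlite3 module rather than MySQL so the example is self-contained; the same pattern applies to either engine. One caveat: if your id column has gaps (from deletes), some sampled ids may not exist, so in practice you'd oversample or retry.

```python
import random
import sqlite3

# Hypothetical table for illustration purposes only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item_{i}",) for i in range(1000)])

# Instead of ORDER BY RAND(), fetch MAX(id) and pick random ids in application code.
max_id = conn.execute("SELECT MAX(id) FROM items").fetchone()[0]
random_ids = random.sample(range(1, max_id + 1), 10)

# IN with a handful of literal ids is an indexed lookup, not a full-table sort.
placeholders = ",".join("?" * len(random_ids))
rows = conn.execute(
    f"SELECT id, name FROM items WHERE id IN ({placeholders})", random_ids
).fetchall()
```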
Another issue we found was caused by queries making use of subqueries. MySQL has a lot of great features, but support for dependent subqueries is unfortunately not one of them. Depending on your query structure, running two separate queries or using a join will usually let you remove the subquery.
For queries populating IN values:
SELECT column FROM table WHERE id IN (SELECT column FROM table WHERE other_column=1);
MySQL's IN is quite fast, even with very large lists of values. Run the inner query on its own, then write the full query with the subquery's output:
SELECT column FROM table WHERE other_column=1;
SELECT column FROM table WHERE id IN (id_1, id_2, .. , id_n);
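The two-step version looks like this in application code. Again, the schema and values are invented for illustration, and SQLite in Python stands in for MySQL as a self-contained sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col INTEGER, other_column INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i, i * 10, i % 2) for i in range(1, 9)])

# Step 1: run the former subquery on its own.
ids = [row[0] for row in conn.execute("SELECT id FROM t WHERE other_column = 1")]

# Step 2: feed its output into the outer query as a literal IN list.
placeholders = ",".join("?" * len(ids))
rows = conn.execute(f"SELECT col FROM t WHERE id IN ({placeholders})", ids).fetchall()
```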
For more complex sub queries, performance is usually improved by rewriting the query to use a join instead.
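As a sketch of the join rewrite (with made-up tables, using SQLite in Python rather than MySQL), here is a dependent subquery and an equivalent join that returns the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
INSERT INTO users VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO orders VALUES (1, 1, 50), (2, 1, 75), (3, 3, 20);
""")

# Dependent subquery: may be re-evaluated once per outer row.
sub = conn.execute("""
    SELECT name FROM users
    WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id)
""").fetchall()

# Equivalent join: a single pass that the optimizer handles much better.
joined = conn.execute("""
    SELECT DISTINCT users.name FROM users
    JOIN orders ON orders.user_id = users.id
""").fetchall()
```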
After deploying these query optimizations, our CTO selflessly unlocked the 'Query-Bane' achievement by killing any and all SELECT statements that had been running for longer than 3 minutes on our db (poor queries, was it really their fault?).
And now some CPU and DB connection metrics. CPU for the last 24 hours:
Database CPU usage after making updates.
DB Connections for the last 24 hours:
HelloSign DB connections after updates.
Much better. After the updates we continued to see traffic growth, while CPU usage and concurrent connection levels at the database were back down to our expected levels.
We were already logging slow queries and rotating the slow_log table with periodic notifications. We are refining these notifications to be more explicit about problems and to include dynamic updates when thresholds are met. We are also formalizing our periodic review of the slow_log table contents.
We reviewed the problematic queries, and the team is more acutely aware of the taxing aspects of the queries in question. This helps us avoid putting similar queries in place during future updates.
In order to catch these optimization needs sooner, we are also enhancing our monitoring to detail small increases in the database CPU usage and connection metrics. Now even if problematic queries make it to production, we will be aware of them sooner.
Dealing with problems that arise from an influx of users is a great problem to have, and we are pumped to keep making updates that allow us to support more users.
Everyone here was excited about HelloSign for Gmail, and we are ecstatic that our users share our excitement. On the Ops side, having database issues on launch day was extremely frustrating, but we are glad that they were exposed and addressed.
While HelloSign services remained available, database connection errors definitely degraded experiences for some of our users. As such, we wanted to be as transparent as possible in relaying our analysis of the sporadic service during this event. We have always been attentive to scalability, and this issue was the first of its kind for HelloSign. We are keeping this experience in mind as we move forward and continue implementing new features for our users.
Now, on to the next feature!