What Caused Today's Search Performance Issues In Europe and Why It Will Not Happen Again

17 Mar 2014

For a few hours on March 17th, you may have noticed longer response times for some of the queries sent by your users.

Figure: Slower-than-average search performance. Average latency for one of our European clusters on March 17th.

As you can see above, our slowest average response time (measured from the user's browser to our servers and back to the user's browser) on one of our European clusters peaked at 858ms. On a normal day, this peak is usually no higher than 55ms.

This was clearly not normal behavior for our API, so we investigated.

How indexing and search calls share the resource

Each cluster handles two kinds of calls on our REST API: the ones to build and modify the indexes (Writes) and the ones to answer users' queries (Search). The resources of each cluster are shared between these two uses. As Write operations are far more expensive than Search calls, we designed our API so that indexing should never use more than 10% of these resources.

Until now, we limited the rate of Writes per HTTP connection. There was no such limit for queries (Search); we simply limited Write calls to preserve search quality. To avoid reaching the Write rate limit too quickly, we recommended that users batch their Writes, with up to 1GB of operations per call, rather than send them one by one. (A batch could, for example, add 1M products to an index in a single network call.) A loophole in this recommendation was the origin of yesterday's issues.

Yesterday, on one of our European clusters, one customer pushed so many unbatched indexing calls from different HTTP connections that they massively outnumbered the search calls of the other users on the cluster.

This eventually increased the average response time for queries on this cluster, degrading our usual search performance.

The Solution

As of today, we set the rate limit of Writes per account rather than per HTTP connection. This prevents anyone from using multiple connections to bypass the Write rate limit. It also means that customers who want to push a lot of operations in a short time simply need to send their calls in batches.
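To illustrate why the key used for rate limiting matters, here is a minimal sketch in Python. It is not our production code: the `WriteRateLimiter` class, the fixed-window counting strategy, the limit of 10 Writes per window and the `acct-1` / `conn-N` identifiers are all assumptions made for the example. It only shows that the same limiter lets one account multiply its quota when keyed by connection, and does not when keyed by account.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass
class WriteRateLimiter:
    """Hypothetical fixed-window limiter: at most N Writes per key per window."""

    max_writes_per_window: int
    window_seconds: float = 1.0
    # key -> (window start time, Writes accepted in that window)
    _windows: Dict[str, Tuple[float, int]] = field(default_factory=dict, init=False)

    def allow(self, key: str, now: float) -> bool:
        """Return True if a Write for `key` at time `now` is within the limit."""
        window_start, count = self._windows.get(key, (now, 0))
        if now - window_start >= self.window_seconds:
            # The previous window has expired: start a fresh one.
            window_start, count = now, 0
        if count >= self.max_writes_per_window:
            self._windows[key] = (window_start, count)
            return False
        self._windows[key] = (window_start, count + 1)
        return True


def accepted_writes(limiter: WriteRateLimiter, keys: Iterable[str], now: float = 0.0) -> int:
    """Count how many Writes are accepted when each call is limited by its key."""
    return sum(1 for key in keys if limiter.allow(key, now))


# One account opens 5 connections and sends 10 unbatched Writes on each,
# all within the same window.
calls = [("acct-1", f"conn-{c}") for c in range(5) for _ in range(10)]

per_connection = WriteRateLimiter(max_writes_per_window=10)
per_account = WriteRateLimiter(max_writes_per_window=10)

# Keyed by connection: every connection gets its own quota.
accepted_by_connection = accepted_writes(per_connection, (conn for _, conn in calls))
# Keyed by account: all connections share a single quota.
accepted_by_account = accepted_writes(per_account, (acct for acct, _ in calls))
```

With these assumed numbers, the per-connection limiter accepts all 50 Writes, while the per-account limiter accepts only the first 10, however many connections the account opens.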

How would you batch your calls? The explanation is in our documentation. See here for an example with our Ruby client: https://github.com/algolia/algoliasearch-client-ruby#batch-writes
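If you want a feel for the idea before opening the client documentation, the sketch below groups operations into batches on the client side. It is written in Python and deliberately does not call any real client library: `send_batch` is a hypothetical callback standing in for whatever batch method your API client exposes, and the batch size of 1000 is an arbitrary choice for the example.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(operations: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `batch_size` operations."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for operation in operations:
        batch.append(operation)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Send the final, partially filled batch as well.
        yield batch


def push_in_batches(
    operations: Iterable[dict],
    send_batch: Callable[[List[dict]], None],
    batch_size: int = 1000,
) -> int:
    """Send operations through `send_batch` (hypothetical) and return the call count."""
    calls = 0
    for batch in chunked(operations, batch_size):
        send_batch(batch)  # one network call per batch instead of one per operation
        calls += 1
    return calls


# Usage: 2,500 product records sent in batches of 1,000.
products = [{"objectID": str(i), "name": f"product {i}"} for i in range(2500)]
sent_batch_sizes: List[int] = []
call_count = push_in_batches(products, lambda batch: sent_batch_sizes.append(len(batch)))
```

Here the 2,500 operations go out in 3 calls rather than 2,500, which is what keeps a large import well under a per-account Write rate limit.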
