
How we tamed Sphinx Search

It is no secret that Spil Games is a heavy user of Sphinx Search. We use it in many ways, including game search, profile search and, since a few months ago, even our category and subcategory listings. In all cases we do not use it as an extension of MySQL, but rather as a standalone daemon that serves listings of (document) identifiers.

As 2013 progressed towards X-mas we saw the utilization of our category/subcategory Sphinx cluster skyrocket, which caused response times to increase heavily. During peak hours we performed about 500 queries per second with response times in milliseconds, while sometimes, all of a sudden, the application's response times would climb towards a full second. We quickly added response time capturing inside the application and compared it against the load spikes on the Sphinx hosts:

Sphinx response time vs load

One of the major contributors to the load increase was the indexing process. Just like the Sphinx search daemon, the indexer is multi-threaded, which means it will suck up all idle cpu time on all cpu cores. Coincidentally this accounted for about 80% of the load spikes. For the first time ever we had to fight with a genuinely multi-threaded application.

Our first attempt was to adjust the indexer with nice (to +19), but this did not help at all, as nice will only make the process consume as much idle cpu as possible, which it apparently already did. The second tool we tried was cpulimit, but it has two drawbacks: it can only limit the percentage of idle cpu to be used, and it can only be attached after the indexer has started. Then we found taskset, which allowed us to run the indexing process on one single cpu core. This is what our cron looks like:
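The original cron entry is not shown here, but it boils down to pinning the indexer to a single core with taskset. A sketch (the schedule, user, index selection and paths are assumptions):

```shell
# /etc/cron.d/sphinx-indexer (hypothetical schedule and paths)
# taskset -c 0 pins the indexer to cpu core 0, leaving the
# remaining cores free for the searchd daemon.
*/5 * * * * sphinx taskset -c 0 /usr/bin/indexer --all --rotate --quiet
```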

Restricting the indexing process to a single core kept all the other cores available for the Sphinx search daemon to respond, and this got rid of about 80% of our load spikes.

As you can also see, we use the --quiet flag to suppress the output from the indexer (so it does not spam us all the time), but that output actually contains valuable information about the indexing process. So we made a parser that sends this information via StatsD to Graphite:
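The parser itself is not reproduced here; a minimal sketch of how such a script could look, assuming bash, StatsD on localhost and the typical indexer summary lines (the matched line formats and metric names are assumptions):

```shell
#!/bin/bash
# Hypothetical sketch (not the original Spil Games parser): read `indexer`
# output from STDIN, turn the summary lines into StatsD gauges and send
# them over UDP.

PREFIX="sphinx.indexer"
WRITE="/dev/udp/localhost/8125"   # StatsD listens on UDP 8125 by default

parse() {
  while read -r line; do
    # e.g. "total 999 docs, 1048576 bytes"
    if [[ $line =~ ^total\ ([0-9]+)\ docs,\ ([0-9]+)\ bytes ]]; then
      echo "${PREFIX}.docs:${BASH_REMATCH[1]}|g"
      echo "${PREFIX}.bytes:${BASH_REMATCH[2]}|g"
    # e.g. "total 1.234 sec, 849000 bytes/sec, 809.5 docs/sec"
    elif [[ $line =~ ^total\ ([0-9.]+)\ sec ]]; then
      echo "${PREFIX}.seconds:${BASH_REMATCH[1]}|g"
    fi
  done
}

# To write straight to Graphite instead, set WRITE='nc -w1 <host> 2003'
# and change the > below to a |.
parse > "$WRITE"
```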

This script reads from STDIN, so you can pipe the output of your indexer to it and it will send the metrics via StatsD to Graphite. If you don't run StatsD on localhost, you can substitute the write command with WRITE='nc -w1 <yourgraphitehost> 2003' and change the > to a |.

The last (but not least) contributor to the load increase on the Sphinx nodes was an altered mysql_statsd daemon we ran to keep an eye on Sphinx. We thought: if it works fine for MySQL, why not for Sphinx? It has a MySQL-compatible command line, right? So we can use SphinxQL with SHOW STATUS, right? Well, it turned out to be more intrusive than we thought, so we stopped the daemon and now use this script via cron:
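The script itself is not reproduced here; a minimal sketch under stated assumptions: poll searchd over SphinxQL via the regular mysql client, run SHOW STATUS and forward the numeric counters to StatsD as gauges. The host, metric prefix and StatsD address are assumptions; searchd speaks SphinxQL on port 9306 by default.

```shell
#!/bin/bash
# Hypothetical sketch (not the original Spil Games script): turn
# "SHOW STATUS" output into StatsD gauges.

PREFIX="sphinx.status"

to_statsd() {
  # "mysql --batch" prints SHOW STATUS rows as "<counter>\t<value>";
  # skip values that are not plain numbers (e.g. "OFF" or dashes)
  while IFS=$'\t' read -r counter value; do
    if [[ $value =~ ^[0-9.]+$ ]]; then
      echo "${PREFIX}.${counter}:${value}|g"
    fi
  done
}

mysql -h 127.0.0.1 -P 9306 --batch --skip-column-names -e "SHOW STATUS" \
  2>/dev/null | to_statsd > /dev/udp/localhost/8125
```

Run it from cron at whatever interval suits your Graphite retention; since it is a point-in-time snapshot rather than a daemon sampling continuously, the granularity is bounded by the cron interval.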

It will give roughly the same metrics with a lot less granularity, but it will do for now…