
Follow up on FOSDEM

After my presentation at FOSDEM I got a few questions regarding our Galera implementation and why we did things the way we did.
First of all, the slides:

Second of all, the questions I got:
Q: Why first copy all the data to a new MySQL server using innobackupex and then perform the mysqldump?
This is a question regarding the consolidation of multiple existing asynchronous replicated clusters to a new Galera cluster.
In the slides I showed that we use an active-inactive Master-Master setup where one of the MySQL masters receives all write traffic while the inactive master receives the read traffic. If we were to perform the mysqldump on the inactive master, we would either have to drain it of read traffic and stop replication, or the dump would lock the tables and we would not have a frozen snapshot.
A related question was why we do not use innobackupex to feed the backup to the Galera cluster and create the cluster from that as a starting point. That could be done for the first node; however, since we wish to consolidate multiple clusters into one Galera cluster, we have to ensure the data keeps being replicated into the new cluster while it remains online. Therefore mysqldump is the only viable solution here.
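
As an illustration, here is a minimal sketch of feeding such a dump into the new cluster. The hostnames, credentials and options are placeholders, not our actual commands; it only shows the idea of streaming a consistent dump into one Galera node, which then replicates it to the rest.

```python
# Hypothetical sketch only: stream a consistent dump of an existing master
# into one node of the new Galera cluster. --master-data=2 records the
# binlog position so asynchronous replication into the cluster can be set
# up afterwards to keep the data flowing in online.
import subprocess

dump = subprocess.Popen(
    ["mysqldump", "--host", "active-master.example.com",
     "--single-transaction", "--master-data=2", "--all-databases"],
    stdout=subprocess.PIPE)

load = subprocess.Popen(
    ["mysql", "--host", "galera-node-1.example.com"],
    stdin=dump.stdout)

dump.stdout.close()  # let mysqldump receive SIGPIPE if the load side exits early
load.communicate()
```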

Q: Why are you using MMM?
This is a choice we made five years ago. It worked well enough for us and we stuck with it until today. We do know it is flawed (some say by design) and that it has a lot of drawbacks, and it is actually one of the drivers for starting to use Galera. ;)

Q: Why don’t you expect clashes when writing the same data twice at the same time?
In our sharded environment (the Spil Storage Platform) we will never write the same data twice, as every piece of data, which is sharded by user, function and location, has its own owner process in this platform. This means there will never be a second process writing the same piece of data. In other words: our environment allows us to isolate writes and we never expect clashes (a rough sketch of the idea follows below).
In our other (current) environments the number of writes is low, so the chance of a clash will be low.
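
To illustrate the ownership model, here is a sketch only; the function name and pool size below are made up and this is not Spil Storage Platform code. Every write for a given shard key is routed to a single owner process, so writes to the same data are serialised by construction.

```python
# Illustrative only: deterministically map a shard key (user, function,
# location) to one owner process so a given piece of data has one writer.
import hashlib

NUM_OWNER_PROCESSES = 64  # hypothetical pool size

def owner_for(user_id, function, location):
    key = "%s:%s:%s" % (user_id, function, location)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_OWNER_PROCESSES

# All writes for this key are handled by the same owner process,
# so there is never a second process writing the same piece of data.
print(owner_for(42, "highscores", "eu-west"))
```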

If you have any other questions, don’t hesitate to reach out to us or leave a comment below.

Functional testing of Puppet modules and Docker

In this article I will describe our way of testing Puppet modules and how features of Docker (and lxc containers) are used in our testing framework.

At Spil Games we were early adopters of Puppet. In 2013 the decision was made to update our Puppet infrastructure to version 3.* Of course we decided to follow all the best practices and do it agile :-) . While the official documentation provides a more or less clear overview of the basic components (modules, hiera, node classification), we found there was no optimal (ready-to-use) way of testing the functionality of the modules. That’s why we came up with our own testing solution based on lxc containers. This solution, in combination with Gerrit and Jenkins, gives us a very solid and fast framework for testing module functionality.
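
As a rough sketch of the approach (this is not our actual framework; the image name, the module and the check are assumptions), a functional test boils down to applying the module inside a throwaway container and asserting on the result:

```python
# Hypothetical sketch: apply a Puppet module inside a disposable container
# and verify the result. Assumes an image with Puppet preinstalled.
import subprocess

def run(cmd):
    # check_output raises on a non-zero exit code, which acts as our assertion
    return subprocess.check_output(cmd).decode("utf-8")

container = run(["docker", "run", "-d",
                 "-v", "/srv/puppet/modules:/etc/puppet/modules:ro",
                 "puppet-test-base", "sleep", "3600"]).strip()
try:
    # Apply the module under test inside the container.
    run(["docker", "exec", container, "puppet", "apply", "-e", "include ntp"])
    # Functional check: the module should actually have installed the package.
    run(["docker", "exec", container, "rpm", "-q", "ntp"])
    print("module 'ntp' passed")
finally:
    run(["docker", "rm", "-f", container])
```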



How we tamed Sphinx Search

It is no secret that Spil Games is a heavy user of Sphinx Search. We use it in many ways, including game search, profile search and, since a few months ago, even to build our category and subcategory listings. In all cases we do not use it as an extension of MySQL, but rather as a standalone daemon facilitating listings of (document) identifiers.

As 2013 progressed towards X-mas we saw the utilization of our category/subcategory Sphinx cluster sky-rocket, which caused the response times to increase heavily. During peak hours we performed about 500 queries per second with response times in the order of milliseconds, while sometimes, all of a sudden, the application response times would climb towards a full second. We quickly added response time capturing inside the application and compared it against the load spikes on the Sphinx hosts:

Graph: Sphinx response time vs. load
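
The capture itself did not need to be fancy. Roughly, the idea looks like the sketch below; this is not our production code, and the StatsD host and metric name are made up for the example.

```python
# Illustrative sketch: time every Sphinx call in the application and ship
# the measurement as a StatsD timer ("metric:value|ms") over UDP.
import socket
import time

STATSD_ADDR = ("statsd.example.com", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def timed_sphinx_query(run_query, *args, **kwargs):
    start = time.time()
    result = run_query(*args, **kwargs)   # e.g. a SphinxQL SELECT
    elapsed_ms = (time.time() - start) * 1000.0
    sock.sendto(("sphinx.category_listing.query:%.1f|ms" % elapsed_ms).encode(),
                STATSD_ADDR)
    return result
```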

One of the major contributors to the load increase was the indexing process. Just like the Sphinx search daemon, the indexing process is multi-threaded, which means it will suck up all idle CPU time on all CPU cores. Coincidentally this covered about 80% of the load spikes. This meant that for the first time ever we had to fight with a genuinely multi-threaded application.


MySQL StatsD project on Github

Done is better than perfect.

That is one of Spil Games Engineering’s principles and it is really something I embrace. Almost one and a half years ago, Engineering started to use StatsD and Graphite for collecting and graphing performance metrics, and one year ago I got fed up that there was, except for a couple of abandoned projects, no real drop-in solution for a daemon that sends performance metrics via StatsD into Graphite. So I created my own daemon in Python on a Friday afternoon, and it was indeed done and not perfect.
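
The core of such a daemon fits in a handful of lines. Below is a simplified sketch of the idea only, not the actual project code; the connection details, the polled counter and the metric prefix are assumptions.

```python
# Simplified sketch: poll a MySQL status counter and push it to StatsD
# over UDP as a gauge ("metric:value|g") every few seconds.
import socket
import time
import MySQLdb  # MySQL-python

STATSD_ADDR = ("statsd.example.com", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
db = MySQLdb.connect(host="127.0.0.1", user="stats", passwd="secret")

while True:
    cursor = db.cursor()
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    name, value = cursor.fetchone()
    cursor.close()
    sock.sendto(("mysql.status.%s:%s|g" % (name.lower(), value)).encode(),
                STATSD_ADDR)
    time.sleep(10)
```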


OpenStack Swift & many small files

At Spil Games we have been running Swift for more than two years now and are hosting over 400 million files with an average size of about 50 KB per object. We have a replica count of three, so there are 1.2 billion files to be stored on the object servers. Generally speaking, Swift has turned out to be a solid object storage system. We did, however, run into some performance issues. Below we’ll describe how we analyzed and solved them.


Making RUM actionable

Real User Measurement (or Monitoring) has seen a steep increase in use in the past year. Companies see it as the next Holy Grail in web performance monitoring and are shifting their attention to it. With current RUM techniques, it is still impossible to pinpoint the cause of a slowdown. How do you get insights from the RUM data and how fast can you notice shifts in trends? This second post in our series about RUM dives into exactly that. Missed the first post, a general introduction to RUM? Click here.


Big data processing with Disco

Those who deal with big data probably know about Disco: a distributed computing framework that provides a MapReduce platform for Python applications processing big data. We are proud to say that we are one of the largest users of Disco in the Netherlands.

As the owner of multiple high-traffic portals with lots of content served by CDN providers, we want to ensure that the data on our portals loads fast. We recently rolled out a Disco-based solution that helps us deal with this and continuously monitors the availability and load performance of our content.

We put the basics of the MapReduce paradigm, the key points of writing MapReduce jobs with Disco and the highlights of our solution together into one workshop, and showed participants how easy it is to develop their own big data applications that process a billion samples per day.
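
To give a flavour of what such a job looks like, here is essentially the word-count example from the Disco tutorial (not one of our production jobs):

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit a count of 1 for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Group the shuffled (word, count) pairs by word and sum the counts.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print("%s: %d" % (word, count))
```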

If the topic sounds interesting to you, you’re welcome to have a look at the slides we used during the workshop: http://spil.com/discoworkshop2013

Database TCO presentation

Speaking of MySQL UG NL Meetups: during the Q1 Meetup in February (hosted by Spil Games), one of our DBAs gave a presentation on database TCO (Total Cost of Ownership). We would like to share it here.

Apart from covering how to determine database maintenance costs, the presentation also included some ideas for possible improvements as well as the findings of our own small case study. The slides can be found here: http://spil.com/databasetco

Feel free to ask questions and leave comments.