MySQL StatsD release 0.1.3

About one and a half years ago we created MySQL StatsD to get insight into any MySQL server, using a local daemon that frequently queries the MySQL server and pushes its data to a local StatsD. In the past year we have seen usage of the MySQL StatsD project increase, with more and more people happily using it.

Of course there is always something lacking with projects like this, so in our latest release (0.1.3) we included/fixed the following:

  • Multiple bufferpool support (get your stats on the individual bufferpools)
  • New stats type called Delta (this time documented and included in the example config)
  • Fixed issue 30: MySQL StatsD crashed when replication broke
  • Added documentation on configuration and its directives

If you feel the urge to fix something or contribute to the project, you are more than welcome to help us out!

You can find the project here: MySQL StatsD @ Github

The kernel panics from 2014 on our OpenStack Hypervisors

Having a bit of time on this last day of the year, I decided to document two of our more challenging issues of 2014:
two cases where hypervisors spontaneously started to lock up or reboot.
Note that we are using Scientific Linux 6.5 and OpenStack Icehouse from RDO.

Issue 1: XFS kernel panic / deadlock


These lockups, sometimes resulting in a kernel panic, started to occur when we began running bigger ELK instances (Elasticsearch, Logstash, Kibana) on our hypervisors.
When a hypervisor or instance locked up the following could be found in the kernel log of the hypervisor:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)


The message pointed to memory allocation issues. However, over 100GB of free memory was present on the hypervisor.
With some help from the XFS IRC channel the culprit was found:
the RAW backing file for the instance consisted of too many extents.
Because the data in the ELK instances grew in very small increments (compared to e.g. MySQL, which allocates big blocks of data at once), the RAW backing file had lots and lots of extents.
Apparently XFS started to have issues keeping track of all those extents
(the xfs_bmap command took quite some time to complete…).


We changed to pre-allocated RAW files for this instance type.
Pre-allocated files are allocated in one go, so you end up with just a few big extents instead of the many small extents created whenever the file grows.
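The difference is easy to see with standard tools. A minimal sketch (filenames are invented for illustration; the extent list itself only shows up on an XFS volume):

```shell
# Sparse file: blocks are allocated piecemeal as data is written,
# which on XFS leads to many small extents over time.
truncate -s 100M sparse.img

# Pre-allocated file: all blocks are reserved in one go -> few big extents.
fallocate -l 100M prealloc.img

# %b = number of 512-byte blocks actually allocated on disk
stat -c '%n: %b blocks allocated' sparse.img prealloc.img

# On an XFS filesystem, `xfs_bmap -v <file>` lists the extents themselves.
```

The sparse file has (almost) no blocks allocated yet, while the pre-allocated file is fully backed by disk blocks from the start.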

Issue 2: Bridge/netfilter kernel panic


The issue started without an apparent reason: a random hypervisor rebooted once every few days. There was nothing to see in the kernel log, and our machines reboot automatically after a kernel panic.


Finding this issue first required capturing some more information.
We enabled kernel crash dumps (just enable the kdump service), and the next crash gave us a kernel log with the important part:
<4>RIP: 0010:[<ffffffffa048893d>]  [<ffffffffa048893d>] br_nf_pre_routing_finish+0x18d/0x350 [bridge]
<1>RIP  [<ffffffffa048893d>] br_nf_pre_routing_finish+0x18d/0x350 [bridge]

This points to a bridge / netfilter issue.
This specific message is not documented, but other people have had kernel panics when information (e.g. bridge info) is missing from a packet.


Dropping all traffic not specifically allowed by Neutron in the iptables FORWARD chain fixed the issue. Although this theoretically should not have made a difference (no packets should hit this rule), we have not had any reboots since the rule was applied.
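In iptables-save notation the resulting chain layout looks roughly like this (a sketch only; `neutron-filter-top` and `neutron-openvswi-FORWARD` are the usual jumps the Neutron OVS agent installs, and the final rule is the catch-all we added):

```
*filter
-A FORWARD -j neutron-filter-top
-A FORWARD -j neutron-openvswi-FORWARD
# catch-all: drop anything Neutron's rules did not explicitly accept
-A FORWARD -j DROP
COMMIT
```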
Note that we do not use namespaces and it is quite possible that using namespaces also prevents this issue from happening.

OpenStack Operator tool: Novacek


As OpenStack operators we missed some tooling for troubleshooting and status checking.
To fill this gap we created Novacek, a Python tool that uses the OpenStack libraries to get the necessary information.


Typical things we use this tool for:

  • Get the status of all instances when you login to a hypervisor
  • Mail all people who have an instance on a hypervisor (very basic currently, and it needs the email address in the tenant description field)
  • Check if the hypervisor setup (VLANs) is correct (this also requires the easyovs library)


Although the tool is fully functional, we still have some things we would like to add.
To avoid others duplicating this functionality, we chose to publish the tool at an early stage.
Feel free to extend and improve the functionality by submitting a merge request.

Piqi in Spil Games – unifying our interface definitions

Piqi is a language we’ve been using to define our internal and external interfaces at Spil Games. We’ve benefited from its versatility as a project, and this blog post will try to show how.

Background (how Spil renders portals)

Simplified request flow example


  1. Browser requests a page.
  2. Request goes through a CDN and load balancers, and ends up in the Widget Platform. The Widget Platform is a proprietary system (written in Erlang; find out more in this talk from the Erlang User Conference 2013) which renders HTML. It gets all the information it needs from a RESTful Spil API (HTTP/JSON).
  3. Spil API receives the request, does the authentication/authorization, and passes it further to backend services. Spil API is connected to backend services via native Erlang; spapi-router is used for service discovery and routing. Google protocol buffers is the payload format.

You can recognize that this is a typical service-oriented architecture (more
in a talk from Erlang Factory 2013). We have a few dozen independent services, each exposing an API that other services (or even browsers from the front-end) can call. The way interfaces are defined influences many things: coding a client, coding a server, even the architectural design in the early stages of service planning.

Why is this important?

Having many services poses a few challenges:

  • Every service has its own interface, which, in order to be efficiently used, has to be clearly documented:
    1. Input, output types with examples.
    2. Error conditions.
    3. Human-readable remarks.
  • Since a lot of data is being sent all the time, it’s critical to minimize the data encoding overhead. JSON is good enough for user-facing systems (nicely compatible with web browsers and very easy to use and understand). Protocol buffers provides many advantages for internal traffic: type checking, compatibility, very efficient packing.
  • At service level, input and output should be type-checked according to the definition; that avoids lots of silly bugs. It also ensures documentation is kept up to date, because the interface definition is the documentation.

Technically our requirements for the interface definition are:

  1. Specify functions and input/output types in machine-understandable format.
  2. Input and output verification should be automatic – developer should not care.
  3. Generate documentation from interface definition.
  4. Generate boilerplate, so developers can focus on implementing the business side of the service.
  5. Compatibility is critical. It must be possible to evolve schema and reason about it.


  • Use the same language (and its tools) for external and internal services. It is easier to maintain one system than two. This has a lot of advantages:
    1. Once you establish a front-end protocol, you can use it as a base for a back-end protocol. This increases operational efficiency and reduces learning time for developers.
    2. “Internal” and “external” definitions can be shared. For example, both can have common date/time types.
    3. Code generators, documentation generators, etc. can be shared, supplying knowledge to both realms.
  • To minimise demand on CPU and network, use a more efficient protocol (i.e. binary) for internal services. For example, Google protocol buffers, Apache Thrift, Cap’n Proto.

Quick Piqi example

We treat Piqi as Interface Definition Language (IDL), even though it’s much more than that. It helps us define interfaces and work with them. Piqi acts as a language bridge, which can “talk” in different protocols, namely JSON, XML and Google protocol buffers.

Below is an example: an echo service definition and its implementation:


With just these few lines of code, we automatically gain the ability to call this service using HTTP/JSON, HTTP/XML, HTTP/protobuf and native Erlang/protobuf. Example query using HTTP/JSON:
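For a hypothetical echo function that takes and returns a string `message`, such a query and its response could look like this (URL and field name are invented for illustration):

```
POST /echo-service/echo HTTP/1.1
Content-Type: application/json

{"message": "hello"}

HTTP/1.1 200 OK
Content-Type: application/json

{"message": "hello"}
```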

And a corresponding .proto file, which is used to construct the payload sent directly
to the function:

The output size in compact JSON is 109 bytes, whereas the size of the equivalent payload in protocol buffers is 23 bytes: a space saving of more than a factor of 4.

Calling this function is just as easy using Erlang and protocol buffers directly, without adding a single extra line of code. This also creates automatic interface documentation:


Automatically generated documentation for Echo Service

Piqi vs Piqi-rpc

Piqi is a language-agnostic tool: it can work with any language that supports Google protocol buffers.

Piqi-rpc is a library for Erlang applications that lets you create web services (like the example above); it is therefore Erlang-specific.

That being said, Piqi has found its uses mostly in the Erlang and OCaml worlds, but it is important to know that it can be useful in a much wider range of environments.


Piqi is the language we use to define interfaces for all our internal and external services. Each service is a piqi-rpc application, exporting its functionality via both JSON/HTTP and Erlang/protobuf. Each service contains its own definition, which can be reused by other services. The generated documentation is something people across the stack (from front-end JavaScript to back-end developers) find very convenient.

Piqi and Spil Games

We started adopting Piqi somewhere in mid-2011. Piqi and piqi-rpc have been running in production since mid-2012, cheerfully generating and parsing millions of queries per day, and they have helped make our work a lot easier and more efficient.

Interested in finding out more? We’re hiring!


Spil Games hackday: research MaxScale


At Spil Games we organise a monthly hackday (actually the last Thursday/Friday of the month) where several teams try to hack something together in a limited amount of time. The only rules for the hackday are: make something cool, keep it simple, document it, learn something and have fun. In general this means time is too short, so we really have to go for it. 😉

Last week the August hackday(s) took place on Thursday and Friday, and my team and I decided (last minute) to join in with a research project on MaxScale.

What was there to hack?


In our current database connectivity strategy we (as in Spil Games) use load balancers and routers to connect to our new MySQL database clusters. This has the advantage that we can control per host whether it is allowed to access the database or not. At the same time it has the drawback that a lot of administration is necessary on the router to make this happen, and since the load balancers mask the original client host from the database, we can't use the normal MySQL grants to filter at host level. Also, on the client host the administration of databases has to be done in the application configuration file, for both database server and user. Config management at Spil is not the easiest, so if we could simplify the database configuration it would help a lot with deployments and with setting up similar environments for any project.


We want to research the feasibility of using the MaxScale MySQL proxy inside Spil Games instead of the current load balancer+router approach. MaxScale can be set up either as a central proxy or as a local(host) proxy, and we would like to test both. MaxScale also loads the MySQL grants at startup, and this has to be tested with multiple databases at the same time. The security needs to be tested as well: it could be dangerous if MaxScale is too permissive (e.g. allowing access to resources it should not).

For the central proxy we need to find out how to set it up as a highly available solution (floating IPs, Pacemaker, etc.).

For the local proxy we need to find a way to automatically configure MaxScale (via Puppet) to access the right resources, and we also need to test its maintainability through its API (telnet and HTTP access).
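As a sketch of what we set out to test, a minimal local MaxScale configuration for a Galera cluster could look roughly like this (server names, credentials and ports are placeholders, and the exact directives should be checked against the MaxScale documentation of the version used):

```ini
[maxscale]
threads=4

[Galera Monitor]
type=monitor
module=galeramon
servers=db1,db2
user=maxscale_monitor
passwd=monitor_secret

[Galera Service]
type=service
router=readconnroute
router_options=synced
servers=db1,db2
user=maxscale
passwd=service_secret

[Galera Listener]
type=listener
service=Galera Service
protocol=MySQLClient
port=4306

[db1]
type=server
address=db1.example.com
port=3306
protocol=MySQLBackend

[db2]
type=server
address=db2.example.com
port=3306
protocol=MySQLBackend
```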

More information on MaxScale can be found here:


The hackday team:

  • Art van Scheppingen
  • Jaakko Pesonen (joined day 1)
  • Jing Rao (joined day 2)

How to use MaxScale?


Using Ceilometer with Graphite

At Spil Games we love OpenStack and we love metrics.
We tried to run Ceilometer in the past but experienced performance issues. We heavily use Graphite to store metrics, so we thought it would be a good idea to push Ceilometer metrics into Graphite. The data is sent directly from the compute node to the Graphite backend, so there are no bottlenecks. The quick and dirty proof-of-concept code provided here works great in our environment 😉 Note that this solution ONLY offers some compute graphs and does nothing more than that.

What you get:

Per-VM graphs, e.g. CPU usage of all machines on a single hypervisor:


This installation is tested/based on SL 6 and the Icehouse RDO packages.
These steps need to be done on every OpenStack hypervisor you want graphs from.

  • Install the openstack-ceilometer-compute package.
  • Configure ceilometer.conf:
    – Make sure to have the rabbitmq and keystone settings configured.
    – Add the graphite settings: prefix and append_hostname


  • Add a publisher to the graphite entry points.
    e.g. /usr/lib/python2.6/site-packages/ceilometer-2014.1.1-py2.6.egg-info/entry_points.txt

  • Clone our git repo for the example pipeline.yaml and graphite publisher.
  • Copy the pipeline.yaml to /etc/ceilometer/pipeline.yaml
    – make sure you edit the yaml publishers to send it to the correct graphite server.

  • Install the graphite publisher.
    – Copy the publisher file to /usr/lib/python2.6/site-packages/ceilometer/publisher/
  • Restart the openstack-ceilometer-compute agent
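Pieced together, the fragments above could look roughly like this. The `[graphite]` section name and the publisher entry-point/class names are assumptions for illustration; `prefix` and `append_hostname` are the options mentioned above, and the graphite host is a placeholder:

```ini
# /etc/ceilometer/ceilometer.conf (fragment)
[graphite]
prefix = ceilometer.
append_hostname = true

# entry_points.txt (fragment) -- register the publisher
[ceilometer.publisher]
graphite = ceilometer.publisher.graphite:GraphitePublisher

# /etc/ceilometer/pipeline.yaml (fragment) -- publishers per sink
#     publishers:
#         - graphite://graphite.example.com:2003
```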

You should start seeing graphs now.

Getting it upstream

Since we already have some code, we decided to put it on the web.
There is a blueprint here to get things officially upstream, but there is still some discussion going on there.


How I mastered WebRTC

WebRTC, short for ‘Web Real Time Communication’, is an open source API that supports voice/video chat and peer-to-peer connections in browsers without any plugins. The project started at the beginning of 2011. Now, a couple of years later, the project is getting more mature and with that more useful. Although not all browsers support WebRTC, it is supported by three major ones: Chrome, Firefox and Opera.



Follow up on FOSDEM

After my presentation at FOSDEM I got a few questions regarding our Galera implementation and why we did things the way we did.
First of all, the slides:

Second of all, the questions I got:
Q: Why first copy all the data to a new MySQL server using innobackupex and then perform the mysqldump?
This is a question regarding the consolidation of multiple existing asynchronous replicated clusters to a new Galera cluster.
In the slides I showed that we use an active-inactive master-master setup, where one of the MySQL masters receives all write traffic while the inactive master receives read traffic. If we performed the mysqldump on the inactive master, we would either have to drain it of read traffic and stop replication, or it would lock the tables and we would not have a frozen snapshot.
A related question was why we do not use innobackupex to feed the backup to the Galera cluster and then create the cluster from this as a starting point. That could be done for the first node; however, since we wish to consolidate multiple clusters into one Galera cluster, we have to ensure the data gets replicated into the new cluster online. Therefore mysqldump is the only viable solution here.

Q: Why are you using MMM?
This is a choice we made five years ago. It worked well enough for us and we stuck with it until today. We do know it is flawed (some say by design) and has a lot of drawbacks; that is actually one of the drivers to start using Galera. 😉

Q: Why don’t you expect clashes when writing the same data twice at the same time?
In our sharded environment (the Spil Storage Platform) we will never write the same data twice, as every piece of data (sharded by user, function and location) has its own owner process in the platform. This means there will never be a second process writing the same piece of data. In other words: our environment allows us to isolate writes, and we never expect clashes.
In our other (current) environments the number of writes is low, so the chance of a clash is low.

If you have any other questions, don’t hesitate to reach out to us or leave a comment below.

Functional testing of Puppet modules and Docker

In this article I will describe our way of testing Puppet modules and how features of Docker (and lxc containers) are used in our testing framework.

At Spil Games we were early adopters of Puppet. In 2013 the decision was made to update our Puppet infrastructure to version 3.x. Of course we decided to follow all the best practices and do it agile 🙂 . While the official documentation provides a more or less clear overview of the basic components (modules, hiera, node classification), we found there is no optimal (ready-to-use) way of testing the functionality of the modules. That’s why we came up with our own testing solution based on lxc containers. This solution, in combination with Gerrit and Jenkins, gives us a very solid and fast framework for testing module functionality.



How we tamed Sphinx Search

It is no secret that Spil Games is a heavy user of Sphinx Search. We use it in many ways, including game search, profile search and, since a few months ago, even building our category and subcategory listings. In all cases we do not use it as an extension of MySQL but rather as a standalone daemon facilitating listings of (document) identifiers.

As 2013 progressed towards X-mas we saw the utilization of our category/subcategory Sphinx cluster sky-rocket, which caused response times to increase heavily. During peak hours we performed about 500 queries per second with response times in milliseconds, while sometimes, all of a sudden, the application response times would go up to near a second. We quickly added response time capturing inside the application and compared it against the load spikes on the Sphinx hosts:

Sphinx response time vs load

One of the major contributors to the load increase was the indexing process. Just like the Sphinx search daemon, this indexing process is multi-threaded, which means it will suck up all idle CPU time on all cores. Coincidentally this covered about 80% of the load spikes. This meant that for the first time ever we had to fight with a genuinely multi-threaded application.
