Using Ceilometer with Graphite

At Spil Games we love OpenStack and we love metrics.
We tried to run Ceilometer in the past but ran into performance issues. Since we already use Graphite heavily to store metrics, we thought it would be a good idea to push Ceilometer metrics into Graphite as well. The data is sent directly from the compute node to the Graphite backend, so there are no bottlenecks. The quick and dirty proof-of-concept code provided here works great in our environment ;) Note that this solution ONLY offers some compute graphs and does nothing more than that.

What you get:

Per-VM graphs,
e.g. CPU usage of all machines on a single hypervisor:

graphite.spilgames.com

Installation

This installation is tested on and based on Scientific Linux (SL) 6 and the Icehouse RDO packages.
These steps need to be performed on every OpenStack hypervisor you want graphs from.

  • Install the openstack-ceilometer-compute package.
  • Configure ceilometer.conf:
    - Make sure the RabbitMQ and Keystone settings are configured.
    - Add the Graphite settings: prefix and append_hostname

Example:
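(A sketch of the Graphite-related part of ceilometer.conf. The [graphite] section name and the exact option names below are assumptions based on the settings mentioned above; check the publisher code in our repo for the real names.)

    [graphite]
    # assumed section and option names; verify against the publisher code
    prefix = ceilometer.
    append_hostname = true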

  • Add the graphite publisher to the Ceilometer entry points (as shown below).
    e.g. /usr/lib/python2.6/site-packages/ceilometer-2014.1.1-py2.6.egg-info/entry_points.txt
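For example, the publisher section of entry_points.txt would get one extra line along these lines; the entry point name (graphite) and the class path are assumptions and have to match the publisher file you install in the later steps:

    [ceilometer.publisher]
    ...existing publishers...
    graphite = ceilometer.publisher.graphite:GraphitePublisher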

  • Clone our git repo for the example pipeline.yaml and graphite publisher.
  • Copy the pipeline.yaml to /etc/ceilometer/pipeline.yaml
    - make sure you edit the publishers in the YAML so they point to the correct Graphite server (see the example below).
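Whatever the exact pipeline layout in the example pipeline.yaml, the part to edit is the publishers list of each pipeline. A minimal sketch, assuming the publisher is addressed with a graphite:// URL (the scheme has to match the entry point name; host and port are placeholders):

    publishers:
        - graphite://graphite.example.com:2003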

  • Install the graphite publisher (a simplified sketch of such a publisher follows these steps).
    - Copy the graphite.py to /usr/lib/python2.6/site-packages/ceilometer/publisher/graphite.py
  • Restart the openstack-ceilometer-compute agent
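For reference, below is a heavily simplified sketch of what such a graphite.py publisher could look like. It assumes the Icehouse publisher interface (PublisherBase.publish_samples) and Graphite's plaintext protocol, and it leaves out the prefix/append_hostname handling and the error handling that the real publisher in our repo takes care of.

    # Simplified sketch of a Ceilometer-to-Graphite publisher, not the real
    # graphite.py from our repo; names and defaults here are assumptions.
    import socket
    import time

    from ceilometer import publisher


    class GraphitePublisher(publisher.PublisherBase):

        def __init__(self, parsed_url):
            # parsed_url comes from the publisher entry in pipeline.yaml,
            # e.g. graphite://graphite.example.com:2003 (placeholder host).
            self.host = parsed_url.hostname or 'localhost'
            self.port = parsed_url.port or 2003
            self.prefix = 'ceilometer.'  # would normally come from ceilometer.conf

        def publish_samples(self, context, samples):
            # Render every sample in Graphite's plaintext protocol:
            #   <metric path> <value> <unix timestamp>
            now = int(time.time())
            lines = []
            for sample in samples:
                path = '%s%s.%s' % (self.prefix, sample.resource_id, sample.name)
                lines.append('%s %s %d' % (path, sample.volume, now))
            payload = '\n'.join(lines) + '\n'

            # Send straight from the compute node to the carbon line receiver,
            # so no intermediate collector becomes a bottleneck.
            sock = socket.create_connection((self.host, self.port), timeout=5)
            try:
                sock.sendall(payload)
            finally:
                sock.close()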

You should start seeing graphs now.

Getting it upstream

Since we already have some code, we decided to put it on the web.
There is a blueprint here to get things officially upstream, but there is still some discussion going on there.

 

How I mastered WebRTC

WebRTC, short for ‘Web Real Time Communication’, is an open source API that supports voice/video chats and peer-to-peer connections for browsers without any plugins. The project started at the beginning of 2011. Now, a couple of years later, the project is getting more mature and with that more useful. Although not all browsers support WebRTC yet, it is supported by three major browsers: Chrome, Firefox and Opera.

Continue reading

FOSDEM 2014 logo

Follow up on FOSDEM

After my presentation at FOSDEM I got a few questions regarding our Galera implementation and why we did things the way we did.
First of all, the slides:

Second of all, the questions I got:
Q: Why first copy all the data to a new MySQL server using innobackupex and then perform the mysqldump?
This is a question regarding the consolidation of multiple existing asynchronously replicated clusters into a new Galera cluster.
In the slides I showed that we use an active-inactive Master-Master setup where one of the MySQL masters receives all write traffic while the inactive master receives read traffic. If we performed the mysqldump on the inactive master we would either have to drain the inactive master of read traffic and stop replication, or it would lock the tables and we would not have a frozen snapshot.
A related question was why we do not use innobackupex to feed the backup to the Galera cluster and then create the cluster from that as a starting point. That could be done for the first node; however, since we wish to consolidate multiple clusters into one Galera cluster, we have to ensure the data gets replicated into the new cluster online. Therefore mysqldump is the only viable solution here.

Q: Why are you using MMM?
This is a choice we made five years ago. It worked well enough for us and we have stuck with it until today. We do know it is flawed (some say by design) and that it has a lot of drawbacks; it is actually one of the drivers to start using Galera. ;)

Q: Why don’t you expect clashes when writing the same data twice at the same time?
In our sharded environment (the Spil Storage Platform) we will never write the same data twice, as every piece of data, sharded by user, function and location, has its own owner process in this platform. This means there will never be a second process writing the same piece of data. In other words: our environment allows us to isolate writes and we never expect clashes.
In our other (current) environments the number of writes is low, so the chance of a clash is low.

If you have any other questions, don’t hesitate to reach out to us or leave a comment below.

Functional testing of Puppet modules and Docker

In this article I will describe our way of testing Puppet modules and how features of Docker (and lxc containers) are used in our testing framework.

At Spil Games we were early adopters of Puppet. In 2013 the decision was made to upgrade our Puppet infrastructure to version 3.* Of course we decided to follow all the best practices and do it agile :-) . While the official documentation provides a more or less clear overview of the basic components (modules, hiera, node classification), we found there is no optimal (ready to use) way of testing the functionality of the modules. That is why we came up with our own testing solution based on LXC containers. This solution, in combination with Gerrit and Jenkins, gives us a very solid and fast framework for testing module functionality.

Continue reading

Sphinx Search logo

How we tamed Sphinx Search

It is no secret that Spil Games is a heavy user of Sphinx Search. We use it in many ways, including game search, profile search and, since a few months ago, even to build our category and subcategory listings. In all cases we do not use it as an extension of MySQL but rather as a standalone daemon facilitating listings of (document) identifiers.

As 2013 progressed towards X-mas we saw the utilization of our category/subcategory Sphinx cluster skyrocket, which caused response times to increase heavily. During peak hours we performed about 500 queries per second with response times in milliseconds, while sometimes, all of a sudden, the application response times would climb towards a full second. We quickly added response time capturing inside the application and compared it against the load spikes on the Sphinx hosts:

Sphinx response time vs load

One of the major contributors to the load increase was the indexing process. Just like the Sphinx search daemon, this indexing process is multi-threaded, which means it will suck up all idle CPU time on all CPU cores. Coincidentally this covered about 80% of the load spikes. This meant that for the first time ever we had to fight with a genuinely multi-threaded application.

Continue reading

MySQL StatsD project on Github

Done is better than perfect.

That is one of Spil Games Engineering’s principles and it is really something I embrace. Almost one and a half years ago, Engineering started to use StatsD and Graphite for collecting and graphing performance metrics, and one year ago I got fed up that there was, apart from a couple of abandoned projects, no real drop-in solution for a daemon that sends performance metrics via StatsD into Graphite. So I created my own daemon in Python on a Friday afternoon, and it was indeed done and not perfect.

Continue reading

OpenStack Swift & many small files

At Spil Games we have been running Swift for more than two years now and are hosting over 400 million files with an average file size of about 50 KB per object. We have a replica count of three so there are 1.2 billion files to be stored on the object servers. Generally speaking, Swift has turned out to be a solid object storage system. We did however run into some performance issues. Below we’ll describe how we analyzed and solved them.

Continue reading

Making RUM actionable

Real User Measurements (or Monitoring) has seen a steep increase in use over the past year. Companies see it as the next Holy Grail in web performance monitoring and are shifting their attention to it. With current RUM techniques, however, it’s still impossible to pinpoint the cause of a slowdown. How do you get insights from the RUM data, and how fast can you notice shifts in trends? This second post in our series about RUM dives into exactly that. Missed the first post, a general introduction to RUM? Click here.

Continue reading