MaxScale

Spil Games hackday: research MaxScale

Spil Games hackday

At Spil Games we organise a monthly hackday (actually two days: the last Thursday and Friday of the month) where several teams try to hack something together in a limited amount of time. The only rules for the hackday are: make something cool, keep it simple, document it, learn something and have fun. In general it means time is too short, so we really have to go for it. ;)

Last week the August hackday(s) happened on Thursday and Friday, and my team and I decided (last minute) to join in with a research project on MaxScale.

What was there to hack?

Context

In our current database connectivity strategy we (as in Spil Games) use load balancers and routers to connect to our new MySQL database clusters. This has the advantage that we can control per host whether it is allowed to access the database or not. At the same time it has the drawback that a lot of administration is necessary on the router to make this happen, and because the load balancers mask the original client host from the database we can’t use the normal MySQL grants to filter on host level. Also, on the client host the administration of databases has to be done in the application configuration file, for both database server and user. Config management at Spil is not the easiest, so if we could simplify this by making the database configurations easier it would help a lot with deployments and with setting up similar environments for any project.

Hack

We want to research the feasibility of using the MaxScale MySQL proxy inside Spil Games instead of the current load balancer+router approach. MaxScale can be set up either as a central proxy or as a local(host) proxy, and we would like to test both. MaxScale also features loading the MySQL grants on start up, and this has to be tested with multiple databases at the same time. The security needs to be tested as well: it could be dangerous if MaxScale is too permissive (e.g. allowing access to resources that should not be accessible).

For the central proxy we need to find out how to set it up as a highly available solution. (floating ips, pacemaker, etc)

For the local proxy we need to find a way to automatically configure MaxScale (puppet) to access the right resources, and we also need to test its maintainability through its API, telnet and HTTP access.

More information on MaxScale can be found here:

https://mariadb.com/resources/maxscale

https://github.com/skysql/MaxScale

Who?

  • Art van Scheppingen
  • Jaakko Pesonen (joined day 1)
  • Jing Rao (joined day 2)

How to use MaxScale?

First of all, this is the current situation for our Multi-Master setups:

Multi-Master setup used at Spil Games

The Multi-Master setup means we have two MySQL masters that slave from each other (see it as circular replication), where we keep one master active and the other inactive. MMM monitors the two masters and determines which one is supposed to be the active one. This works okay-ish as long as MMM is able to reach these servers.

Here the servers running an application can connect directly to a floating ip address on the MySQL servers, where MMM maintains the state and the ip addresses (with a DNS entry for naming them). In this setup it is easy to determine the master, as it is a single ip address (db-something), and a reader can also be found quickly by choosing a random node from the readers: db-something-r1 and db-something-r2.

There are a couple of drawbacks on this design:

  • Usage of a lot of floating ip addresses
  • Choosing a (random) reader address could still be a connection to the active master
  • Configuration of applications differs between dev/staging/production

MaxScale can be used in this scenario in two ways: as a centralised clustered proxy or as a local proxy on application servers.

The clustered solution would look like this:

MaxScale as a clustered proxy

The biggest advantage is that there is only one set of hosts that needs to be maintained, and it can be placed inside a DMZ. Connectivity can be arranged from each and every network segment by using our internal DNS to resolve to the same set of hosts, while inter-network connectivity is not allowed.

There are some big drawbacks to this scenario: clustering needs to be set up and we need to have a couple of failover scenarios ready. It will also concentrate network activity on the MaxScale proxies, and given the bandwidth of our Openstack environment this could become a bottleneck if we host these proxies in Openstack; once network saturation happens it would also affect other nodes on the same hypervisors.

Also, if we have 3 proxies running and one of them fails, the load would fall back on only one of the other proxies. This could potentially take down that proxy as well, with the cascading effect that it could take down the entire MySQL infrastructure. So if we are clustering MaxScale, the cluster size should be large enough to allow at least one node to fail.

MaxScale can also be used as a localhost proxy:

MaxScale as a local proxy

For us this is the preferred solution, as it makes things more transparent for the developers at Spil Games (their database is always on localhost! ;) ) and we don’t have to worry too much about any single MaxScale instance failing: it would become immediately apparent on the application server and it could not take down the entire MySQL infrastructure. It also allows us to make the various environments (dev/staging/production) more alike.

For our future databases we will use Galera replication instead of the asynchronous replication used in the example(s) above. MaxScale can be used with Galera as well, where we could use its capability to do read/write splitting based upon the queries sent to it; a rough configuration sketch follows below.
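To give an idea of what that would look like, here is a minimal sketch of a Galera read/write splitting setup in MaxScale.cnf; the server names, addresses and credentials are placeholders, not our actual configuration:

    # Galera-aware monitor: keeps track of which nodes are synced and usable
    [Galera Monitor]
    type=monitor
    module=galeramon
    servers=galera1,galera2,galera3
    user=maxmon
    passwd=monitorpassword

    # Read/write splitting service: writes go to one node, reads are spread out
    [Galera RW Service]
    type=service
    router=readwritesplit
    servers=galera1,galera2,galera3
    user=maxuser
    passwd=servicepassword

    # Listener the application connects to (localhost:3306 in the local proxy scenario)
    [Galera RW Listener]
    type=listener
    service=Galera RW Service
    protocol=MySQLClient
    port=3306

    # One server section per Galera node (galera2 and galera3 are defined the same way)
    [galera1]
    type=server
    address=192.168.0.11
    port=3306
    protocol=MySQLBackend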

Questions

These are the questions we tried to answer during the hackday:

  • Are we going for the local proxy on every node or centralised clustered HA proxy?
    • Centralised clustered HA proxy:
    • Local proxy
      • How to easily configure the instance
      • How to control them remotely?
  • How does the authentication/authorization work
    • Per cluster?
    • Per shard?
    • On local proxies?
    • Caching of the auth entries:
      • How frequently are they reloaded?
      • And how to flush them manually?
  • What interfaces are there and how (easy) can we use them?
  • How do we add multiple clusters?

After we have answered most of these questions we can make a choice on the product and on how we will configure/control them.

Worklog Day 1

We installed MaxScale on a test node in our Openstack database test tenant using the downloadable rpm from the SkySQL website.

Installing through yum went like a breeze and it installed the tool in /usr/local/skysql/maxscale.

Starting MaxScale without any config did not give an error (the service simply failed to start), and its logfiles were written in the MaxScale location and not in the standard /var/log location or via syslog.

In other words: MaxScale only logs by default in its own directory. Need to change this in the config if possible.

Note: It looks promising that authentication can also be done through LDAP and Keystone. We will look into this in the future…

We created a config file in /usr/local/skysql/maxscale/etc/MaxScale.cnf according to the MaxScale config documentation, but it gave errors on the password.
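For reference, a minimal MaxScale.cnf for one Multi-Master pair behind a readconnrouter looks roughly like the sketch below (based on the documentation; server names, addresses and credentials are placeholders, not our actual configuration):

    # Global settings
    [maxscale]
    threads=2

    # Monitor that determines which backend is master and which is slave
    [MySQL Monitor]
    type=monitor
    module=mysqlmon
    servers=dbserver1,dbserver2
    user=maxmon
    passwd=monitorpassword

    # The two Multi-Master nodes
    [dbserver1]
    type=server
    address=192.168.0.1
    port=3306
    protocol=MySQLBackend

    [dbserver2]
    type=server
    address=192.168.0.2
    port=3306
    protocol=MySQLBackend

    # Connection router that only uses the node currently flagged as master
    [Write Service]
    type=service
    router=readconnroute
    router_options=master
    servers=dbserver1,dbserver2
    user=maxuser
    passwd=servicepassword

    # Listener the clients connect to
    [Write Listener]
    type=listener
    service=Write Service
    protocol=MySQLClient
    port=3306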

Day 1 ended after only 2 hours of hackday. (Hackdays where people plan half a day off in advance are not very productive… :( )

Worklog Day 2

As MaxScale does not support syslog yet, we moved the log directory to /var/log/maxscale and symlinked this back to /usr/local/skysql/maxscale/log.

MaxScale now only complains about not being able to find the .secrets file, so we created one manually following the documentation.

This means we will have to handle the log location and the creation of the secrets file through puppet after installation of the rpm.

Getting it to work

It took us a while to get the first readconnrouter configured. After finally getting it to work, we learned the following:

  • A monitor on the nodes is necessary to determine which host is master and which one is slave (not necessary for Galera though). If hosts are not in the monitor, MaxScale does not know their status and refuses to use them. This obviously makes sense.
  • All our MySQL servers are configured to accept connections only from specific host ip addresses. As MaxScale caches the MySQL authentication/authorization locally, it matches against the ip address the connection comes from. Authentication for the localhost proxy therefore has to be done via the external ip address, as otherwise MaxScale doesn’t know where the traffic originates from and whether it is allowed to connect (otherwise all it sees is 127.0.0.1).
  • To monitor connections, the debug and telnet services have to be enabled (see the config sketch after this list).
Note: Lots of our users have wrongly set up hostnames (no discrete ip addresses), which results in them being ignored when loading the users:

    2014 08/29 08:14:44 Error : getaddrinfo failed for [xxx.xxx.xxx.xxx/32] due [Name or service not known]
    2014 08/29 08:14:44 139849716033504 [getUsers()] setipaddress failed: user someuser@xxx.xxx.xxx.xxx/32 not added
    2014 08/29 08:14:44 Error : getaddrinfo failed for [xxx.xxx.xxx.%] due [Name or service not known]
    2014 08/29 08:14:44 139849716033504 [getUsers()] setipaddress failed: user someotheruser@xxx.xxx.xxx.% not added
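The debug/telnet and cli interfaces mentioned in the list above are just extra services in the same MaxScale.cnf. Something along these lines should enable them (a sketch following the example config shipped with MaxScale; the ports are the defaults, binding to localhost is our own choice):

    # Debug interface, reachable with telnet on port 4442
    [Debug Service]
    type=service
    router=debugcli

    [Debug Listener]
    type=listener
    service=Debug Service
    protocol=telnetd
    address=localhost
    port=4442

    # CLI interface used by maxadmin on port 6603
    [CLI Service]
    type=service
    router=cli

    [CLI Listener]
    type=listener
    service=CLI Service
    protocol=maxscaled
    address=localhost
    port=6603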

Authentication

For us, the localhost authentication/authorization especially is an issue when we want to have a localhost proxy. After a small brainstorm session we concluded that double administration with 127.0.0.1 could work, as nodes that are not allowed are still denied access even if the user manages to authenticate to MaxScale on localhost. Suppose the following grants:

MaxScale localhost authentication

In this case server-1 and server-2 are allowed to access db1, but server-3 is not allowed to access this database. MaxScale will load the grants from the MySQL servers and keep them in memory. Due to the localhost entry the MaxScale proxy will allow the client to establish a connection and then use the proxy to set up the connection to the database; however, since server-3 is not allowed to access this database, it will be denied access. The same applies to server-1 if it tries to do anything with db2.

The workaround for the localhost vs external ip address issue is to create two sets of records:

    create user 'maxtest'@'127.0.0.1' identified by 'test1234';
    create user 'maxtest'@'xxx.xxx.xxx.xxx' identified by 'test1234';
    grant select on somedb.* to 'maxtest'@'127.0.0.1';
    grant select on somedb.* to 'maxtest'@'xxx.xxx.xxx.xxx';
Note: MaxScale itself only checks whether 127.0.0.1 is allowed and authenticates against that. If there isn’t a matching record for the external ip address, the connection will not be allowed after all. This gives errors like:

    ERROR 2006 (HY000) at line 1: MySQL server has gone away
    ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)

Multiple clusters

At the same time we tried to configure multiple clusters in MaxScale. On one of the nodes in Openstack we configured this successfully, while on the other one it failed. Apparently we forgot to add the additional nodes to the monitor, and hence MaxScale denied these servers. Changing the config and reloading the monitor did not help: the monitor still showed only two hosts.

Only after restarting MaxScale did the monitor show the correct nodes.
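For the record, adding a second cluster comes down to adding a second set of server, monitor, service and listener sections to the same MaxScale.cnf, with the new listener on its own port. A rough sketch (names, addresses and the port are placeholders), including the lesson learned above that the new servers must also be covered by a monitor:

    # Servers of the second cluster
    [cluster2-db1]
    type=server
    address=192.168.1.1
    port=3306
    protocol=MySQLBackend

    [cluster2-db2]
    type=server
    address=192.168.1.2
    port=3306
    protocol=MySQLBackend

    # The new servers need to be monitored as well, otherwise MaxScale refuses to use them
    [Cluster2 Monitor]
    type=monitor
    module=mysqlmon
    servers=cluster2-db1,cluster2-db2
    user=maxmon
    passwd=monitorpassword

    [Cluster2 Service]
    type=service
    router=readconnroute
    router_options=master
    servers=cluster2-db1,cluster2-db2
    user=maxuser
    passwd=servicepassword

    # The second cluster gets its own listener on a separate port
    [Cluster2 Listener]
    type=listener
    service=Cluster2 Service
    protocol=MySQLClient
    port=3307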

Issues we found that need to be verified/fixed:

  1. Loading of dbusers failed for subnets/wildcards. We need to verify why and find a solution to overcome this (apart from specifying all 200+ hosts separately).
  2. The cli interface (port 6603) did not appear after a config reload. Not critical, but it surprised us.
  3. Switching services around (3306 to 3307) and renaming them worked fine during a reload of the config, but MaxScale crashed once “show services” was issued in the telnet interface.
  4. Reloading configs or restarting monitors doesn't really seem to work well. This needs additional testing.
  5. Master-master topology does not seem to be supported. We raised the question on Google Groups and there is hope: https://groups.google.com/forum/#!topic/maxscale/26GD_okOfLw

Conclusion

Answers to the questions

These were the questions we tried to answer during the hackday, together with the answers we found:

  • Are we going for the local proxy on every node or centralised clustered HA proxy?
    • Centralised clustered HA proxy:
      • Unanswered, as we focused on getting it to work on localhost.
    • Local proxy
      • How to easily configure the instance
        • Configuration is easy and we can easily puppetize the config using our CMDB to determine the relationship between clusters/databases and the worker nodes that get MaxScale installed.
      • How to control them remotely?
        • Either the maxadmin or telnet interface will work nicely.
        • Reloading the configuration works but port assignment does not
        • Reloading users is easy
        • Health checks can easily be implemented using the telnet or httpd interface. (we did not test the httpd interface due to lack of time)
  • How does the authentication/authorization work
    • Per cluster?
    • Per shard?
    • On local proxies?
    • Caching of the auth entries:
  • What interfaces are there and how (easy) can we use them?
    • telnet, cli and httpd.
    • Telnet and cli have been tested, httpd is still on the todo list.
  • How do we add multiple clusters?
    • Adding multiple clusters is easily doable by creating the readconnrouter and readconnlistener for the new cluster.

The verdict

The final verdict is that MaxScale has matured from the 0.3 alpha to the current 1.0 beta. The beta still has issues, but nothing that can’t be resolved. SkySQL does advise against using it in production as it is still a beta.

It took us about 3 hours to get MaxScale to work, and most of the delays were due to our own misconfigurations. If we had taken an existing config example we would have been done in maybe 30 minutes, but doing it from scratch using the documentation made us understand the internals of MaxScale a lot better.

Are we going to use MaxScale to proxy actual traffic?

I think it is a no-brainer that MaxScale would help us simplify our infrastructure and solve some of our resource discovery issues. Once the Multi-Master issue has been resolved we could already start migrating our staging environment to MaxScale and learn from that.