OpenStack Swift & many small files

At Spil Games we have been running Swift for more than two years now and are hosting over 400 million files with an average size of about 50 KB per object. We have a replica count of three, so 1.2 billion files have to be stored on the object servers. Generally speaking, Swift has turned out to be a solid object storage system. We did, however, run into some performance issues. Below we describe how we analyzed and solved them.

The problem

When you start to put many small files into Swift you will notice that things slow down, especially on PUT requests. At some point we maxed out at 50 PUT requests per second.
The slowness we see is not caused by Swift components but by the filesystem on the object servers.
On some of our nodes we currently have disks with over 60 million inodes in use, and running ls on /srv/node/disk/objects/xxxxxxxx/ can take up to 30 seconds.
The issue seems to be that the inode tree no longer fits in memory.
When this happens you get a lot of inode cache misses, which result in extra reads from the disks to fetch the inode information.

Digging into the issue: xfs xs_dir_lookup, xs_ig_missed and slabtop

To see how many inode cache misses you have, you can look at some XFS statistics (descriptions below are from the XFS runtime stats wiki):

xs_dir_lookup
This is a count of the number of file name directory lookups in XFS filesystems. It counts only those lookups which miss in the operating system’s directory name lookup cache and must search the real directory structure for the name in question. The count is incremented once for each level of a pathname search that results in a directory lookup.

xs_ig_missed
This is the number of times the operating system looked for an XFS inode in the inode cache and the inode was not there. The further this count is from the ig_attempts count, the better.
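
Both counters come from /proc/fs/xfs/stat. As a quick hedged check (the field order below follows the XFS runtime stats documentation, so verify it on your own kernel), xs_dir_lookup is the first value on the "dir" line and xs_ig_missed is the fourth value on the "ig" line:

    # Dump the raw directory-lookup and inode-cache counter lines
    grep -E '^(dir|ig) ' /proc/fs/xfs/stat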

We collect these values with collectd, using a small collection script.
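
A minimal sketch of such a script for collectd's exec plugin could look like the one below; the identifiers, field positions and interval handling are assumptions for illustration, not our exact script.

    #!/bin/bash
    # Hedged sketch: report xs_dir_lookup and xs_ig_missed to collectd via PUTVAL.
    HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
    INTERVAL="${COLLECTD_INTERVAL:-60}"
    while sleep "$INTERVAL"; do
        dir_lookup=$(awk '/^dir /{print $2}' /proc/fs/xfs/stat)
        ig_missed=$(awk '/^ig /{print $5}' /proc/fs/xfs/stat)
        echo "PUTVAL \"$HOST/exec-xfs/derive-xs_dir_lookup\" interval=$INTERVAL N:$dir_lookup"
        echo "PUTVAL \"$HOST/exec-xfs/derive-xs_ig_missed\" interval=$INTERVAL N:$ig_missed"
    done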

 

There is also a way to look at how much memory is used for caching inodes:

Slabtop:
Displays detailed kernel slab cache information in real time. It shows a listing of the top caches sorted by one of the available sort criteria. Relevant caches:
xfs_inode: inode cache
dentry: directory entry cache
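
For a quick one-off look from the shell, something like this shows the current size of those two caches (slabtop's -o prints a single snapshot and -s c sorts by cache size):

    # Snapshot the slab caches and pick out the inode and dentry entries
    slabtop -o -s c | grep -E 'xfs_inode|dentry'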

Memory recommendations

On average on our system, an inode takes up 1066 bytes.
To find out how much memory is used for the inode cache we set the
sysctl vm.vfs_cache_pressure=0 and run a find command on the object server.
This puts all inodes into memory.
On a machine with 300 million inodes, 320GB of memory is used to cache everything, which comes down to 1066 bytes per inode.
Those inodes consist of 100 million files and 200 million directories.
We had to extrapolate this figure since our nodes only have 256GB of memory:
we ran the find until 230GB of memory was in use, looked at how long that took, and scaled the result using the total time the find command needed to run.
Looking at the real-world usage of our machines running with vm.vfs_cache_pressure=1,
128GB for 100 million files (plus 200 million directories) seems to be enough for our use case.
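
For reference, the measurement boils down to something like the sketch below; the disk path is a placeholder, and the cache-pressure value of 0 is for this one-off measurement only (see the warning in the next section).

    # Stop the kernel from reclaiming inode/dentry cache during the measurement
    sysctl -w vm.vfs_cache_pressure=0
    # Walk one data disk so every inode ends up in the cache
    find /srv/node/disk0 -xdev > /dev/null
    # 320 GB for 300 million inodes works out to roughly 1066 bytes per inode
    echo $((320 * 10**9 / (300 * 10**6)))   # -> 1066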

Tuning the OS and Swift

sysctl

To have the least amount of inode cache misses we want as much memory as possible assigned to the inode/dentry cache.
The only way to influence how much memory is used for this cache is the sysctl variable vm.vfs_cache_pressure, which controls the tendency of the kernel to reclaim the memory used for caching directory and inode objects.

Note that you should never set this value to zero! With a value of zero the inode cache is never cleaned. A rebalance in Swift will shuffle files around and will make sure you run out of memory sooner or later, triggering the OOM killer as a result.
Another thing to keep in mind is that a low value comes at the cost of regular caching.
E.g. if you use the keep_cache_size setting in Swift this could actually cost performance.
Finding the right vm.vfs_cache_pressure setting can only be done through testing.
We do no caching within Swift because we use CDNs and reverse proxies, so caching on the Swift nodes does not make sense.
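
As an illustration, applying and persisting the low but non-zero value of 1 that our machines run with could be done like this (assuming settings live in /etc/sysctl.conf):

    # Apply immediately
    sysctl -w vm.vfs_cache_pressure=1
    # Persist across reboots
    echo 'vm.vfs_cache_pressure = 1' >> /etc/sysctl.conf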

swift

Running replicators and auditors has a big impact on the load of the nodes. We run only one of each on a node to reduce the impact.
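
One way to keep their footprint small is through the concurrency and rate-limit options in /etc/swift/object-server.conf; the sketch below uses standard option names but illustrative values, not our exact production config.

    [object-replicator]
    # a single replication worker per node
    concurrency = 1

    [object-auditor]
    # throttle the auditor so its full-disk scans stay gentle on the spindles
    files_per_second = 5
    bytes_per_second = 5000000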

(io)nice

ionice can give certain processes preference when issuing I/O calls.
We run a cronjob to set the priority of the processes that impact Swift performance.
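
A hypothetical /etc/cron.d entry along these lines would do it; the schedule, priorities and process selection are assumptions for illustration.

    # Every 10 minutes, push replicator and auditor processes to idle I/O priority
    # and the lowest CPU priority so client traffic wins.
    */10 * * * * root pgrep -f 'swift-object-(replicator|auditor)' | xargs -r -n1 ionice -c3 -p
    */10 * * * * root pgrep -f 'swift-object-(replicator|auditor)' | xargs -r -n1 renice 19 -p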

inode size

According to the mailing list, recent kernel developments no longer require an inode size of 1 KB. However, this does not yet seem to be the case for the Red Hat 6 kernel, so we still use an inode size of 1 KB.
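
The inode size is set when the filesystem is created; a hedged example with a placeholder device:

    # Format an object disk with 1 KB inodes
    mkfs.xfs -i size=1024 /dev/sdb1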

Container size

It is advisable to keep the number of files per container below one million.
This is due to the performance of the SQLite database behind each container.
This number does not appear to be a hard limit if you use SSDs for the container servers, but we stuck to it anyway.
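
Checking the count is easy with the swift client (assuming credentials are already set in the environment; the container name is a placeholder):

    # The "Objects:" line shows how many objects the container holds
    swift stat images | grep -E 'Objects|Bytes'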

What we use at Spil Games

Running a cluster with so many files turned out to be more challenging than expected.
This is a description of our current setup and what we changed in our hardware setup over the years.

Performance

To give an indication of the scale we are talking about, this is our current usage:

GET:
200 requests per second, average latency of 20 ms

PUT:
5 requests per second, average latency of 170 ms

Maximum PUT:
We benchmarked the write requests and max out at about 200 new files per second.

Proxy nodes

Proxy nodes provide access to data through the API.
There are three nodes behind a load balancer, and we can easily turn off two machines without seriously impacting performance.
These machines just need a bit of CPU power and not much more.
Since the machines are mostly idle we decided to run an NGINX reverse proxy on the same machines to cache the Swift GET requests.
We have a 50% hit ratio on that cache; after caching we are doing about 200 requests per second to the Swift proxy nodes.
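
A stripped-down sketch of such an NGINX cache in front of a local Swift proxy is shown below; the paths, cache zone size, validity and proxy port are assumptions, not our production config.

    proxy_cache_path /var/cache/nginx/swift levels=1:2 keys_zone=swift_cache:100m max_size=50g;

    server {
        listen 80;
        location / {
            proxy_pass http://127.0.0.1:8080;   # local Swift proxy, assumed port
            proxy_cache swift_cache;
            proxy_cache_valid 200 10m;          # only GET/HEAD responses are cached by default
        }
    }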

Account and container nodes

We run these on a cluster of three nodes, each with a RAID 10 of 4 x 120 GB SSDs.
If you store a lot of objects you need SSDs to keep the SQLite databases fast.
At this time the machines are not a bottleneck and are doing about 1,600 write IOPS.
This is not a workload you want to run on spinning disks!

Object nodes

The Object nodes store the actual data.
These nodes are by far the most challenging to get right and we learned that the hard way.
Most people tend to buy machines with the biggest disks they can find, and so did we.
This works fine, unless you mostly have lots of small objects.
We started with this config:

Config 1.0:
6 x 2TB, 3.5 inch, 7200 RPM
24GB Memory

We soon noticed that the machines were not performing as expected, especially on writes.
We first mitigated the issue a bit by doing the performance tuning described above. That helped somewhat, but not enough, so we decided to modify the hardware and add memory plus flashcache:

Config 1.1:
6 x 2 TB, 3.5 inch, 7200 RPM
1 x SSD for OS and flashcache, flashcache of 10GB per disk
48GB memory

Although we did see an improvement, it became clear that the machines still could not cope very well. The load was still very high and it was obvious that the sheer number of files on a node was the issue, so we decided on a new config:

Config 2.0:
10 x 1TB, 2.5 inch, 7200RPM
2 x SSD for OS and flashcache, flashcache of 10GB per disk
256GB memory

These machines are not taxed in any way yet, since we run a mixed config and the old nodes are the bottleneck:
7 x config 1.1
4 x config 2.0
It looks like the machines will also run fine with 128GB of memory, so we will change that when we order new nodes.

Future plans

Re-balance:
We currently do not need faster PUT performance, but if we did, we would rebalance the cluster:
Currently a 2TB disk gets twice the number of files of a 1TB disk. We would change this so it gets as many files as a 1TB disk. Of course this means losing quite a bit of capacity, so we would need another three 2.0 nodes.
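
In ring terms that means giving every device the same weight regardless of its size, for example (the device id and weight are placeholders):

    # Give a 2TB device the same weight as the 1TB devices, then rebuild the ring
    swift-ring-builder object.builder set_weight d42 1000
    swift-ring-builder object.builder rebalance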

Add more big files:
The easiest way out of the issue is getting more big files into Swift. Having an average object size of 500 KB instead of 50 KB would require ten times the number of Swift nodes, which would greatly reduce all the issues described above.
We already added Swift as a backend for Glance and are looking at multiple ways of adding more data, e.g. Hadoop.
