Having a bit of time on this last day of the year, I decided to document two of the more challenging issues we had in 2014.
In both cases, hypervisors spontaneously started to lock up or reboot.
Note that we are using Scientific Linux 6.5 and OpenStack Icehouse from RDO.
Issue 1: XFS kernel panic / deadlock
These lockups, sometimes resulting in a kernel panic, began when we started running bigger ELK instances (Elasticsearch, Logstash, Kibana) on our hypervisors.
When a hypervisor or instance locked up, the following could be found in the kernel log of the hypervisor:
XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
The message pointed to memory allocation issues. However, the hypervisor still had over 100 GB of free memory.
With some help from the XFS IRC channel, the culprit was found:
The RAW backing file for the instance consisted of too many extents.
Because the data in the ELK instances grew in very small increments (compared to, e.g., MySQL, which allocates big blocks of data at once), the RAW backing file had accumulated lots and lots of extents.
Apparently XFS started to have issues keeping track of the extents.
(The xfs_bmap command took quite some time to complete…)
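To get a feel for how fragmented a backing file is, the extents listed by xfs_bmap can be counted. The following is a minimal Python sketch that parses the plain text output of xfs_bmap; the function name and the sample text are made up for illustration and are not output captured from the affected hypervisors.

```python
import re

# Matches extent lines such as "\t0: [0..7]: 96..103" in xfs_bmap output.
# The first capture group is the block range, or the word "hole".
EXTENT_LINE = re.compile(r"^\s*\d+:\s*\[\d+\.\.\d+\]:\s*(\S+)")


def count_extents(bmap_output: str) -> int:
    """Count allocated extents (holes excluded) in xfs_bmap text output."""
    count = 0
    for line in bmap_output.splitlines():
        match = EXTENT_LINE.match(line)
        if match and match.group(1) != "hole":
            count += 1
    return count


# Illustrative sample only (hypothetical file name and block numbers).
SAMPLE = """disk:
\t0: [0..7]: 96..103
\t1: [8..15]: hole
\t2: [16..31]: 200..215
"""

if __name__ == "__main__":
    print(count_extents(SAMPLE))
```

In practice the text would come from running xfs_bmap against the backing file, for example through subprocess.run with capture_output=True. A count that keeps growing rapidly for a single file is a hint that the file is being extended in many small steps.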
We changed to pre-allocated RAW files for this instance type.
Because a pre-allocated file is allocated in one go, you end up with just a few big extents instead of many small extents that are added whenever the file grows.
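As an illustration of what pre-allocation means at the file level, here is a minimal Python sketch that reserves the full size of a file up front using os.posix_fallocate. This is not necessarily how the RAW files were pre-allocated in our setup; the function name and the file name are hypothetical.

```python
import os
import tempfile


def preallocate(path: str, size_bytes: int) -> None:
    """Create path (if needed) and reserve size_bytes of disk space up front."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate the whole range now, instead of
        # growing the file piecemeal on later writes.
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "disk.raw")  # hypothetical file name
        preallocate(target, 16 * 1024 * 1024)
        print(os.path.getsize(target))
```

Since the filesystem sees the full size in a single request, it can lay the file out in a small number of large extents where free space allows.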
Issue 2: Bridge/netfilter kernel panic
This issue started without any apparent reason: a random hypervisor would reboot once every few days. There was nothing to see in the kernel log, and our hypervisors reboot automatically after a kernel panic.
Finding the cause required us to first capture some more information.
We enabled kernel crash dumps (just enable the kdump service), which gave us a kernel log containing the important part:
<4>RIP: 0010:[<ffffffffa048893d>] [<ffffffffa048893d>] br_nf_pre_routing_finish+0x18d/0x350 [bridge]
<1>RIP [<ffffffffa048893d>] br_nf_pre_routing_finish+0x18d/0x350 [bridge]
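When going through several crash dumps, it can help to pull the faulting function and module out of the RIP lines automatically. The following Python sketch does that for lines shaped like the two above; the function name and the returned dictionary layout are assumptions made for this example.

```python
import re

# Matches the "symbol+offset/size [module]" part of a RIP line, e.g.
# "br_nf_pre_routing_finish+0x18d/0x350 [bridge]". The module is optional
# because symbols built into the kernel have no module suffix.
SYMBOL = re.compile(
    r"(?P<func>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?"
)


def parse_rip(line: str):
    """Return function, offset, size and module from a RIP line, or None."""
    if "RIP" not in line:
        return None
    match = SYMBOL.search(line)
    if match is None:
        return None
    return {
        "func": match.group("func"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),
    }


if __name__ == "__main__":
    line = "<1>RIP [<ffffffffa048893d>] br_nf_pre_routing_finish+0x18d/0x350 [bridge]"
    print(parse_rip(line))
```

Grouping crashes by the function and module fields makes it easy to see whether every reboot ends in the same place.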
This points to a bridge / netfilter issue.
This specific message is not documented, but other people have had kernel panics when information (e.g. bridge information) is missing from a packet.
Dropping all traffic not specifically allowed by Neutron in the iptables FORWARD chain fixed the issue. Although this theoretically should not have made a difference (no packets should hit this rule), we have not had any reboots since this rule was applied.
Note that we do not use namespaces, and it is quite possible that using namespaces also prevents this issue from happening.