Openstack Neutron L3 failover issues
In the Cloud Services team at the Wikimedia Foundation we use Openstack Neutron to build our virtual network and, in particular, we rely on the neutron-l3-agent for implementing all the L3 connectivity, topology and policing. This includes basic packet firewalling and NAT.
As of this writing, we are using Openstack version Train. We run the neutron-l3-agent on standard Linux hardware servers with 10G NICs, and in general it works really well. Our setup is rather simple: we have a couple of servers for redundancy (note: upstream recommends having 3) and each server runs an instance of neutron-l3-agent. We don't use DVR, so all ingress/egress network traffic (or north-south traffic) flows through these servers. Today we use a flat network topology in our cloud, which means that all of our virtual machines share the same router gateway. Therefore, we only have one software-defined router.
Neutron does a very smart thing: each software-defined router is implemented as a Linux network namespace (netns). Each router lives in its own netns, and that namespace contains all of its IP addresses, routes, interfaces, netfilter firewalling rules, NAT configuration, etc.
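To make that a bit more tangible, here is a minimal sketch (assuming the usual qrouter-<uuid> naming and root access on an agent node; this is just an illustration, not part of our tooling) of how to peek inside these router namespaces:

```python
# Minimal sketch: list Neutron router namespaces and show what each one contains.
# Assumes the default "qrouter-<uuid>" naming; run as root on a neutron-l3-agent node.
import subprocess

out = subprocess.run(["ip", "netns", "list"], capture_output=True, text=True, check=True).stdout
routers = [line.split()[0] for line in out.splitlines() if line.startswith("qrouter-")]

for ns in routers:
    print(f"=== {ns} ===")
    # Every namespace has its own interfaces, addresses, routes and netfilter/NAT rules.
    subprocess.run(["ip", "netns", "exec", ns, "ip", "-brief", "address"], check=True)
    subprocess.run(["ip", "netns", "exec", ns, "ip", "route"], check=True)
    subprocess.run(["ip", "netns", "exec", ns, "iptables", "-t", "nat", "-S"], check=True)
```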
Additionally, we configure the agents and software-defined routers to be deployed in a high availability fashion. Neutron implements this by running an instance of keepalived (VRRP) inside each router netns. The gateway IP is therefore a virtual address that can move between the two instances of the neutron-l3-agent.
In our setup we rely very heavily on IPv4 NAT: we use it for egress traffic (SNAT) and also for Openstack floating IPs (SNAT/DNAT). Neutron uses Netfilter rules for configuring the whole NAT setup. There is, however, no apparent mechanism in the neutron-l3-agent to ensure continuity of NATed TCP connections when a failover happens. If all NATed TCP connections are flowing through neutron-l3-agent node A and a failover happens, node B won't be able to pick up the open TCP streams, and therefore all connections will need to be re-established. This happens because the conntrack information that Netfilter uses to perform NAT is not present in node B when the failover happens. NAT is, in general, a stateful thing and should be treated that way. It seems some folks upstream are aware of this, and there are even some blueprints to introduce conntrackd alongside keepalived. The actual implementation hasn't happened so far, apparently.
I wonder, are we the only cloud deployment suffering from this issue? Our users have been experiencing annoying connection cuts every time we fail over the neutron-l3-agent for whatever reason: an upgrade, a server reboot, etc. After numerous incidents I opened a ticket in our phabricator instance and started working on it. I've been trying to improve this situation by following a couple of strategies:
- running conntrackd by hand inside the software-defined router netns. This deserves its own blog post and I won't comment more on it today.
- setting additional sysctl configuration on the affected netns. This is the point I will be covering in this post.
To help the conntrack engine deal with NATed TCP connections I decided to try setting the sysctl key net.netfilter.nf_conntrack_tcp_be_liberal=1. But it turns out this sysctl key is not inherited from the main netns when a software-defined netns is created (i.e., it is per-netns). I needed a mechanism to detect netns events (creation) and then inject the sysctl setting. Neutron apparently doesn't have any facility to hook/call commands at netns creation time.
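As an aside, a quick way to confirm that the key really is per-netns is to compare its value in the main namespace with the value inside a router namespace. A small sketch, where qrouter-<uuid> is just a placeholder for a real namespace name:

```python
# Sketch: compare a conntrack sysctl in the main netns and inside a router netns.
# "qrouter-<uuid>" is a placeholder; substitute a real name from `ip netns list`.
# Requires the nf_conntrack module to be loaded.
import subprocess

KEY = "net.netfilter.nf_conntrack_tcp_be_liberal"
NETNS = "qrouter-<uuid>"

main_value = subprocess.run(
    ["sysctl", "-n", KEY], capture_output=True, text=True, check=True
).stdout.strip()
netns_value = subprocess.run(
    ["ip", "netns", "exec", NETNS, "sysctl", "-n", KEY],
    capture_output=True, text=True, check=True
).stdout.strip()

# The two values are independent: setting the key in the main netns does not
# propagate to namespaces created afterwards.
print(f"main netns: {main_value}  |  {NETNS}: {netns_value}")
```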
This felt like a very fun coding challenge, so I did it!
I learned that each netns has a /var/run/netns/<netns> entry, so I thought it would be easy to listen to file system events and react to them. I wanted to use some Python, so I decided to go with the pyinotify library.
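pyinotify makes this quite compact. The snippet below is not the actual daemon, just a minimal sketch of the watching part: monitor /var/run/netns and react when a namespace entry appears or disappears.

```python
# Minimal pyinotify sketch: react to network namespaces appearing in /var/run/netns.
# This is not the real netns-events.py, only the basic watching mechanism.
import pyinotify

NETNS_DIR = "/var/run/netns"

class NetnsEventHandler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        # event.name is the netns name, e.g. "qrouter-<uuid>"
        print(f"netns created: {event.name}")

    def process_IN_DELETE(self, event):
        print(f"netns removed: {event.name}")

wm = pyinotify.WatchManager()
wm.add_watch(NETNS_DIR, pyinotify.IN_CREATE | pyinotify.IN_DELETE)
pyinotify.Notifier(wm, NetnsEventHandler()).loop()
```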
I ended up creating a simple netns-events.py daemon/script. By using regexes it allows matching on netns names and running arbitrary commands in certain situations:
- immediately when the daemon itself starts, if a matching netns already exists (daemon_startup_actions)
- on inotify events (inotify_actions), such as IN_CREATE.
The daemon reads a YAML configuration file like this:
```yaml
---
# $NETNS env var is provided by the runner daemon
- netns_regex: ^qrouter-.*
  daemon_startup_actions:
    - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1
    - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_loose=1
  inotify_actions:
    - IN_CREATE:
        - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1
        - ip netns exec $NETNS sysctl net.netfilter.nf_conntrack_tcp_loose=1
```
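The core logic needed for this is small. The sketch below is not the real netns-events.py, only an approximation of the idea against the same configuration format (the /etc/netns-events.yaml path is made up): match the netns name against netns_regex, export it as $NETNS and run the configured commands, covering the startup case and IN_CREATE events.

```python
# Sketch of the daemon logic (not the real netns-events.py): load the YAML config,
# run daemon_startup_actions for already-existing matching netns, then run the
# IN_CREATE inotify_actions whenever a new matching netns shows up.
import os
import re
import subprocess

import pyinotify
import yaml

NETNS_DIR = "/var/run/netns"
CONFIG_FILE = "/etc/netns-events.yaml"  # hypothetical path

with open(CONFIG_FILE) as f:
    config = yaml.safe_load(f)

def run_actions(netns, commands):
    env = dict(os.environ, NETNS=netns)  # the configured commands reference $NETNS
    for command in commands or []:
        subprocess.run(command, shell=True, env=env, check=False)

def handle_netns(netns, key):
    for entry in config:
        if re.match(entry["netns_regex"], netns):
            if key == "daemon_startup_actions":
                run_actions(netns, entry.get("daemon_startup_actions"))
            else:
                for item in entry.get("inotify_actions", []):
                    run_actions(netns, item.get(key))

# Startup: act on namespaces that already exist.
for netns in os.listdir(NETNS_DIR):
    handle_netns(netns, "daemon_startup_actions")

# Then keep reacting to new namespaces.
class Handler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        handle_netns(event.name, "IN_CREATE")

wm = pyinotify.WatchManager()
wm.add_watch(NETNS_DIR, pyinotify.IN_CREATE)
pyinotify.Notifier(wm, Handler()).loop()
```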
We run this daemon as a systemd service, side by side with the neutron-l3-agent systemd service. Apparently the nf_conntrack_tcp_be_liberal sysctl key is only read either at module load time or at netns creation time, so it is really important that the netns-events.py daemon runs before neutron-l3-agent starts doing its thing.
Anyway, this probably won't solve all of our problems. I'm still in a massive rabbit hole with conntrackd, which I'll leave for another day. Other than this small script to set the sysctl keys, I have more questions than answers. Again, I wonder what others do, or if there is something fundamentally wrong with our cloud network architecture.
I'll keep researching how to deploy neutron-l3-agent in a more reliable fashion. If you have more hints, are in the same situation, or have any other insight, please let me know!