This is my report from the Netfilter Workshop 2022. The event was held on 2022-10-20/2022-10-21 in Seville, and the venue was the offices of Zevenet. We started on Thursday with Pablo Neira (head of the project) giving a short welcome / opening speech. The previous iteration of this event was in virtual fashion in 2020, two years ago. In the year 2021 we were unable to meet either in person or online.
This year, the number of participants was just eight people, and this allowed the setup to be a bit more informal. We had kind of an un-conference style meeting, in which whoever had something prepared just went ahead and opened a topic for debate.
In the opening speech, Pablo did a quick recap on the legal problems the Netfilter project had a few years ago, a topic that was settled for good some months ago, in January 2022. There were no news in this front, which was definitely a good thing.
Moving into the technical topics, the workshop proper, Pablo started to comment on the recent developments to instrument a way to perform inner matching for tunnel protocols. The current implementation supports VXLAN, IPIP, GRE and GENEVE. Using nftables you can match packet headers that are encapsulated inside these protocols. He mentioned the design and the goals, that was to have a kernel space setup that allows adding more protocols by just patching userspace. In that sense, more tunnel protocols will be supported soon, such as IP6IP, UDP, and ESP. Pablo requested our opinion on whether if nftables should generate the matching dependencies. For example, if a given tunnel is UDP-based, a dependency match should be there otherwise the rule won’t work as expected. The agreement was to assist the user in the setup when possible and if not, print clear error messages. By the way, this inner thing is pure stateless packet filtering. Doing inner-conntracking is an open topic that will be worked on in the future.
Pablo continued with the next topic: nftables automatic ruleset optimizations. The times of linear ruleset evaluation are over, but some people have a hard time understanding / creating rulesets that leverage maps, sets, and concatenations. This is where the ruleset optimizations kick in: it can transform a given ruleset to be more optimal by using such advanced data structures. This is purely about optimizing the ruleset, not about validating the usefulness of it, which could be another interesting project. There were a couple of problems mentioned, however. The ruleset optimizer can be slow, O(n!) in worst case. And the user needs to use nested syntax. More improvements to come in the future.
Next was Stefano Brivio’s turn (Red Hat engineer). He had been involved lately in a couple of migrations to nftables, in particular libvirt and KubeVirt. We were pointed to https://libvirt.org/firewall.html, and Stefano walked us through the 3 or 4 different virtual networks that libvirt can create. He evaluated some options to generate efficient rulesets in nftables to instrument such networks, and commented on a couple of ideas: having a “null” matcher in nftables set expression. Or perhaps having kind of subsets, something similar to a ‘view’ in a SQL database. The room spent quite a bit of time debating how the nft_lookup API could be extended to support such new search operations. We also discussed if having intermediate facilities such as firewalld could provide the abstraction levels that could make developers more comfortable. Using firewalld also may have the advantage that coordination between different system components writing ruleset to nftables is handled by firewalld itself and developers are freed of the responsibility of doing it right.
Next was Fernando F. Mancera (Red Hat engineer). He wanted to improve error reporting when deleting table/chain/rules
with nftables. In general, there are some inconsistencies on how tables can be deleted (or flushed). And there seems
to be no correct way to make a single table go away with all its content in a single command.
The room agreed in that the commands
destroy table and
delete table should be defined consistently, with
the following meanings:
- destroy: nuke the table, don’t fail if it doesn’t exist
- delete: delete the table, but the command will fail if it doesn’t exist
This topic diverted into another: how to reload/replace a ruleset but keep stateful information (such as counters).
Next was Phil Sutter (Netfilter coreteam member and Red Hat engineer). He was interested in discussing options to make iptables-nft backward compatible. The use case he brought was simple: What happens if a container running iptables 1.8.7 creates a ruleset with features not supported by 1.8.6. A later container running 1.8.6 may fail to operate. Phil’s first approach was to attach additional metadata into rules to assist older iptables-nft in decoding and printing the ruleset. But in general, there are no obvious or easy solutions to this problem. Some people are mixing different tooling version, and there is no way all cases can be predicted/covered. iptables-nft already refuses to work in some of the most basic failure scenarios.
An other way to approach the issue could be to introduce some kind of support to print raw expressions in
-m nft xyz. Which feels ugly, but may work. We also explored playing with the semantics of
release version numbers. And another idea: store strings in the nft rule userdata area with the equivalent
matching information for older iptables-nft.
In fact, what Phil may have been looking for is not backwards but forward compatibility. Phil was undecided which path to follow, but perhaps the most common-sense approach is to fall back to a major release version bump (2.x.y) and declaring compatibility breakage with older iptables 1.x.y.
That was pretty much it for the first day. We had dinner together and went to sleep for the next day.
The second day was opened by Florian Westphal (Netfilter coreteam member and Red Hat engineer). Florian has been trying to improve nftables performance in kernels with RETPOLINE mitigations enabled. He commented that several workarounds have been collected over the years to avoid the performance penalty of such mitigations. The basic strategy is to avoid function indirect calls in the kernel.
Florian also described how BPF programs work around this more effectively. And actually, Florian tried translating
nf_hook_slow() to BPF. Some preliminary benchmarks results were showed, with about 2% performance improvement in
MB/s and PPS. The flowtable infrastructure is specially benefited from this approach. The software
flowtable infrastructure already offers a 5x performance improvement with regards the classic forwarding path, and the
change being researched by Florian would be an addition on top of that.
We then moved into discussing the meeting Florian had with Alexei in Zurich. My personal opinion was that Netfilter offers interesting user-facing interfaces and semantics that BPF does not. Whereas BPF may be more performant in certain scenarios. The idea of both things going hand in hand may feel natural for some people. Others also shared my view, but no particular agreement was reached in this topic. Florian will probably continue exploring options on that front.
The next topic was opened by Fernando. He wanted to discuss Netfilter involvement in Google Summer of Code and Outreachy. Pablo had some personal stuff going on last year that prevented him from engaging in such projects. After all, GSoC is not fundamental or a priority for Netfilter. Also, Pablo mentioned the lack of support from others in the project for mentoring activities. There was no particular decision made here. Netfilter may be present again in such initiatives in the future, perhaps under the umbrella of other organizations.
Again, Fernando proposed the next topic: nftables JSON support. Fernando shared his plan of going over all features and introduce programmatic tests from them. He also mentioned that the nftables wiki was incomplete and couldn’t be used as a reference for missing tests. Phil suggested running the nftables python test-suite in JSON mode, which should complain about missing features. The py test suite should cover pretty much all statements and variations on how the nftables expression are invoked.
Next, Phil commented on nftables xtables support. This is, supporting legacy xtables extensions in nftables.
The most prominent problem was that some translations had some corner cases that resulted in a listed ruleset that
couldn’t be fed back into the kernel. Also, iptables-to-nftables translations can be sloppy, and the resulting
rule won’t work in some cases. In general,
nft list ruleset | nft -f may fail in rulesets created by iptables-nft
and there is no trivial way to solve it.
Phil also commented on potential iptables-tests.py speed-ups. Running the test suite may take very long time depending on the hardware. Phil will try to re-architect it, so it runs faster. Some alternatives had been explored, including collecting all rules into a single iptables-restore run, instead of hundreds of individual iptables calls.
Next topic was about documentation on the nftables wiki. Phil is interested in having all nftables code-flows documented, and presented some improvements in that front. We are trying to organize all developer-oriented docs on a mediawiki portal, but the extension was not active yet. Since I worked at the Wikimedia Foundation, all the room stared at me, so at the end I kind of committed to exploring and enabling the mediawiki portal extension. Note to self: is this perhaps https://www.mediawiki.org/wiki/Portals ?
Next presentation was by Pablo. He had a list of assorted topics for quick review and comment.
- We discussed nftables accept/drop semantics. People that gets two or more rulesets from different software are requesting additional semantics here. A typical case is fail2ban integration. One option is quick accept (no further evaluation if accepted) and the other is lazy drop (don’t actually drop the packet, but delay decision until the whole ruleset has been evaluated). There was no clear way to move forward with this.
- A debate on nft userspace memory usage followed. Some people are running nftables on low end devices with very
little memory (such as 128 MB). Pablo was exploring a potential solution: introducing
struct constant_expr, which can reduce 12.5% mem usage.
- Next we talked about repository licensing (or better, relicensing to GPLv2+). Pablo went over a list of files in the nftables tree which had diverging licenses. All people in the room agreed on this relicensing effort. A mention to the libreadline situation was made.
- Another quick topic: a bogus EEXIST in nft_rbtree. Pablo & Stefano to work in a patch.
- Next one was conntrack early drop in flowtable. Pablo is studying use cases for some legitimate UDP unidirectional flows (like RTP traffic).
- Pablo and Stefano discussed pipapo not being atomic on updates. Stefano already looked into it, and one of the ideas was to introduce a new commit API for sets.
- The last of the quick topics was an idea to have a global table in nftables. Or some global items, like sets. Folk in the community keep asking for this. Some ideas were discussed, like perhaps adding a family agnostic family. But then there would be a challenge: nftables would need to generate byte code that works in any of the hooks. There was no immediate way of addressing this. The idea of having templated tables/sets circulated again as a way of reusing data across namespaces/families.
Following this, a new topic was introduced by Stefano. He wanted to talk about nft_set_pipapo, documentation, what to do next, etc. He did a nice explanation of how the pipapo algorithm works for element inserts, lookups, and deletion. The source code is pretty well documented, by the way. He showed performance measurements of different data types being stored in the structure. After some lengthly debate on how to introduce changes without breaking usage for users, he declared some action items: writing more docs, addressing problems with non-atomic set reloads and a potential rework of nft_rbtree.
After that, the next topic was ‘kubernetes & netfilter’, also by Stefano. Actually, this topic was very similar to what we already discussed regarding libvirt. Developers want to reduce packet matching effort, but also often don’t leverage nftables most performant features, like sets, maps or concatenations.
Some Red Hat developers are already working on replacing everything with native nftables & firewalld integrations. But some rules generators are very bad. Kubernetes (kube-proxy) is a known case. Developers simply won’t learn how to code better ruleset generators. There was a good question floating around: What are people missing on first encounter with nftables?
The Netfilter project doesn’t have a training or marketing department or something like that. We cannot force-educate developers on how to use nftables in the right way. Perhaps we need to create a set of dedicated guidelines, or best practices, in the wiki for app developers that rely on nftables. Jozsef Kadlecsik (Netfilter coreteam) supported this idea, and suggested going beyond: such documents should be written exclusively from the nftables point of view: stop approaching the docs as a comparison to the old iptables semantics.
Related to that last topic, next was Laura García (Zevenet engineer, and venue host). She shared the same information as she presented in the Kubernetes network SIG in August 2020. She walked us through nftlb and kube-nftlb, a proof-of-concept replacement for kube-proxy based on nftlb that can outperform it. For whatever reason, kube-nftlb wasn’t adopted by the upstream kubernetes community.
She also covered latest changes to nftlb and some missing features, such as integration with nftables egress. nftlb is being extended to be a full proxy service and a more robust overall solution for service abstractions. In a nutshell, nftlb uses a templated ruleset and only adds elements to sets, which is exactly the right usage of the nftables framework. Some other projects should follow its example. The performance numbers are impressive, and from the early days it was clear that it was outperforming classical LVS-DSR by 10x.
I used this opportunity to bring a topic that I wanted to discuss. I’ve seen some SRE coworkers talking about katran as a replacement for traditional LVS setups. This software is a XDP/BPF based solution for load balancing. I was puzzled about what this software had to offer versus, for example, nftlb or any other nftables-based solutions. I commented on the highlighs of katran, and we discussed the nftables equivalents. nftlb is a simple daemon which does everything using a JSON-enabled REST API. It is already packaged into Debian, ready to use, whereas katran feels more like a collection of steps that you need to run in a certain order to get it working. All the hashing, caching, HA without state sharing, and backend weight selection features of katran are already present in nftlb.
To work on a pure L3/ToR datacenter network setting, katran uses IPIP encapsulation. They can’t just mangle the
MAC address as in traditional DSR because the backend server is on a different L3 domain. It turns out nftables
nft_tunnel expression that can do this encapsulation for complete feature parity. It is only available in
the kernel, but it can be made available easily on the userspace utility too.
Also, we discussed some limitations of katran, for example, inability to handle IP fragmentation, IP options, and potentially others not documented anywhere. This seems to be common with XDP/BPF programs, because handling all possible network scenarios would over-complicate the BPF programs, and at that point you are probably better off by using the normal Linux network stack and nftables.
In summary, we agreed that nftlb can pretty much offer the same as katran, in a more flexible way.
Finally, after many interesting debates over two days, the workshop ended. We all agreed on the need for extending it to 3 days next time, since 2 days feel too intense and too short for all the topics worth discussing.
That’s all on my side! I really enjoyed this Netfilter workshop round.