netdev - Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180315161321.kqcnkzf52gu55yku@ast-mbp.dhcp.thefacebook.com>
Date:   Thu, 15 Mar 2018 09:13:23 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Edward Cree <ecree@...arflare.com>
Cc:     Florian Westphal <fw@...len.de>,
        Daniel Borkmann <daniel@...earbox.net>, netdev@...r.kernel.org,
        ast@...nel.org, pablo@...filter.org,
        "David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate
 layer

On Tue, Mar 06, 2018 at 06:18:04PM +0000, Edward Cree wrote:
> On 06/03/18 18:03, Florian Westphal wrote:
> > I don't know.  I suspect we should go for naive algorithm only,
> > but I would defer such decision to Alexei/Daniel.
> >
> > f.e. i don't know if using llvm is a good idea or not,
> Yeah, I wondered about that too.  I think it's probably not a good idea,
>  since it's a big project with a complex interface and the tail would
>  likely wag the dog.
> 
> > I did not
> > intend to turn proposed imr into full blown compiler in any case,
> > I only want to avoid code duplication for iptables/nftables -> ebpf
> > translator.
> Well, I think anything that calling code does by hand will entail limits
>  on what the imr->ebpf step can do.  E.g. if imr uses registers by name
>  then the backend can't expand any highlevel constructs that need to use
>  scratch registers, unless some regs are reserved for the backend and
>  can't be used by the imr.
> 
> Ultimately I think the imr will have to become practically a compiler
>  back-end, because any jobs it punts to the caller will necessarily be
>  duplicated between iptables and nftables.

few years back we had a prototype of nft->bpf converter.
It was reading nft netlink, generating really dumb C code,
and feeding into clang/llvm. Performance of generated code was
surprisingly pretty good. llvm did particularly well combining
the same loads from multiple rules into single load and
combining all 'if' conditions into a few.
So full blown compiler approach known to work, but ...
we cannot use it here.
We cannot have bpfilter umh module depend on llvm.
The converter have to be fully contained within kernel source tree.
It's probably fine to start simple with compiler-like thingy and
later improve it with sophisticated optimizations and fancy
register allocator, but I suspect we can achieve better performance
and easier to maintain code if we specialize the converters.

The way this IMR defined today looks pretty much like nft and
it feels a bit too low level than iptable conversion would need.
Converting iptable to this IMR is like going old days of gcc
when C/C++ front-ends were generating RTL and all optimizations
were done on RTL representation. Eventually gcc had to adopt
gimple (variant of ssa) format to do optimizations that were
very hard to do with RTL.
I suspect if we go iptable->nft (same as this imr) the history
will repeat itself. It will work, but will be difficult to apply
interesting optimizations.

Similarly I don't quite see why this particular IMR is needed
for nft->bpf either.
I think it would be simpler to have user space only extensions
and opcodes added to bpf for the purpose of the translation.
Like there is no bpf instruction called 'load from IP header',
but we can make one. Just extend extended bpf with an instruction
like this and on the first pass do full conversion of nft
directly into this 'extended extended bpf'.
Then do second pass of lowering 'extended extended bpf' into eBPF
that kernel actually understands.
That would be similar to what llvm does when it lowers higher level
represenation into lower one.
The higher level IR doesn't need to be completely different
from lower level IR. Often lower level IR is a subset of higher level.

For iptable->bpf I think it's easier to construct decision tree
out of all iptable rules, optimize it as a tree and then
convert it into a set of map lookups glued with a bit of bpf code.
Converting iptable rules one by one isn't going to scale.
It was fine for prototype, but now we're talking production.
If there are 1k iptables rules and we generate just 4 bpf insns
for every rule, we will already hit 4k insn limit.
We can easily bump the limit, but even if it's 128k insns
it's still the limit. All iptables rules should be converted together.