[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230208022511.qhkyqio2b2jvcaid@macbook-pro-6.dhcp.thefacebook.com>
Date: Tue, 7 Feb 2023 18:25:11 -0800
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>,
Joanne Koong <joannelkoong@...il.com>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...nel.org>,
Alexei Starovoitov <ast@...nel.org>,
Network Development <netdev@...r.kernel.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>,
Kernel Team <kernel-team@...com>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH v9 bpf-next 3/5] bpf: Add skb dynptrs
On Fri, Feb 03, 2023 at 01:37:46PM -0800, Andrii Nakryiko wrote:
> On Thu, Feb 2, 2023 at 3:43 AM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Wed, Feb 1, 2023 at 5:21 PM Andrii Nakryiko
> > <andrii.nakryiko@...il.com> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
> > > <alexei.starovoitov@...il.com> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > > > > >
> > > > > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > > > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > > > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > > > > could have been avoided with extra rd_only flag.
> > > > >
> > > > > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > > > > copy on reads is unfortunate. ro/rw flag or have separate
> > > > > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
> > > >
> > > > Either flag or two kfuncs sound good to me.
> > >
> > > Would it make sense to make bpf_dynptr_slice() as read-only variant,
> > > and bpf_dynptr_slice_rw() for read/write? I think the common case is
> > > read-only, right? And if users mistakenly use bpf_dynptr_slice() for
> > > r/w case, they will get a verifier error when trying to write into the
> > > returned pointer. While if we make bpf_dynptr_slice() as read-write,
> > > users won't realize they are paying a performance penalty for
> > > something that they don't actually need.
> >
> > Makes sense and it matches skb_header_pointer() usage in the kernel
> > which is read-only. Since there is no verifier the read-only-ness
> > is not enforced, but we can do it.
> >
> > Looks like we've converged on bpf_dynptr_slice() and bpf_dynptr_slice_rw().
> > The question remains what to do with bpf_dynptr_data() backed by skb/xdp.
> > Should we return EINVAL to discourage its usage?
> > Of course, we can come up with sensible behavior for bpf_dynptr_data(),
> > but it will have quirks that will be not easy to document.
> > Even with extensive docs the users might be surprised by the behavior.
>
> I feel like having bpf_dynptr_data() working in the common case for
> skb/xdp would be nice (e.g., so basically at least work in cases when
> we don't need to pull).
>
> But we've been discussing bpf_dynptr_slice() with Joanne today, and we
> came to the conclusion that bpf_dynptr_slice()/bpf_dynptr_slice_rw()
> should work for any kind of dynptr (LOCAL, RINGBUF, SKB, XDP). So
> generic code that wants to work with any dynptr would be able to just
> use bpf_dynptr_slice, even for LOCAL/RINGBUF, even though buffer won't
> ever be filled for LOCAL/RINGBUF.
great
> In application, though, if I know I'm working with LOCAL or RINGBUF
> (or MALLOC, once we have it), I'd use bpf_dynptr_data() to fill out
> fixed parts, of course. bpf_dynptr_slice() would be cumbersome for
> such cases (especially if I have some huge fixed part that I *know* is
> available in RINGBUF/MALLOC case).
bpf_dynptr_data() for local and ringbuf is fine, of course.
It already exists and has to continue working.
bpf_dynptr_data() for xdp is probably ok as well,
but bpf_dynptr_data() for skb is problematic.
data/data_end concept looked great back in 2016 when it was introduced
and lots of programs were written, but we underestimated the impact
of driver's copybreak on programs.
Network parsing progs consume headers one by one and would typically
be written as:
if (header > data_end)
return DROP;
Some drivers copybreak fixed number of bytes. Others try to be smart
and copy only headers into linear part of skb.
The drivers also change. At one point we tried to upgrade the kernel
and suddenly bpf firewall started blocking valid traffic.
Turned out the driver copybreak heuristic was changed in that kernel.
The bpf prog was converted to use skb_load_bytes() and the fire was extinguished.
It was a hard lesson.
Others learned the danger of data/data_end the hard way as well.
Take a look at cloudflare's progs/test_cls_redirect.c.
It's a complicated combination of data/data_end and skb_load_bytes().
It's essentially implementing skb_header_pointer.
I wish we could use bpf_dynptr_slice only and remove data/data_end,
but we cannot, since it's uapi.
But we shouldn't repeat the same mistake. If we do bpf_dynptr_data()
that returns linear part people will be hitting the same fragility and
difficult to debug bugs.
bpf_dynptr_data() for XDP is ok-ish, since most of XDP is still
page-per-packet, but patches to split headers in HW are starting to appear.
So even for XDP data/data_end concept may bite us.
Hence my preference is to EINVAL in bpf_dynptr_data() at least for skb,
since bpf_dynptr_slice() is a strictly better alternative.
Powered by blists - more mailing lists