Message-ID: <CAEf4BzY+_+r9gyRCKhROPqEKtQ=f0CycRgv9c6b2zisV9XHO7Q@mail.gmail.com>
Date: Fri, 23 May 2025 10:27:52 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alan Maguire <alan.maguire@...cle.com>, Tony Ambardar <tony.ambardar@...il.com>,
Andrii Nakryiko <andrii@...nel.org>, Arnd Bergmann <arnd@...db.de>, Alexei Starovoitov <ast@...nel.org>,
bpf <bpf@...r.kernel.org>, Daniel Borkmann <daniel@...earbox.net>, Eduard <eddyz87@...il.com>,
Hao Luo <haoluo@...gle.com>, John Fastabend <john.fastabend@...il.com>,
Jiri Olsa <jolsa@...nel.org>, KP Singh <kpsingh@...nel.org>,
linux-arch <linux-arch@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>,
"open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>, Lorenz Bauer <lmb@...valent.com>,
Martin KaFai Lau <martin.lau@...ux.dev>, Mykola Lysenko <mykolal@...com>,
Stanislav Fomichev <sdf@...ichev.me>, Shuah Khan <shuah@...nel.org>, Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>
Subject: Re: vmlinux BTF as a module (was Re: [PATCH bpf-next v4 0/3] Allow
mmap of /sys/kernel/btf/vmlinux)
On Thu, May 22, 2025 at 6:04 PM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Wed, May 21, 2025 at 8:00 AM Alan Maguire <alan.maguire@...cle.com> wrote:
> >
> > > Hi Alan,
> > >
> > > Thanks for taking a look at this. I've been following your related effort
> > > to allow /sys/kernel/btf/vmlinux as a module in support of small systems
> > > with kernel-size constraints, and wondered how this series might affect
> > > that work? Such support would be well-received in the embedded space when
> > > it happens, so am keen to understand.
> > >
> > > Thanks,
> > > Tony
> >
> > hi Tony
> >
> > I had something nearly working a few months back but there are a bunch
> > of complications that made it a bit trickier than I'd first anticipated.
> > One challenge for example is that we want /sys/kernel/btf to behave just
> > as it would if vmlinux BTF was not a module. My original hope was to
> > just have the vmlinux BTF module forceload early, but the request module
> > approach won't work since the vmlinux_btf.ko module would have to be
> > part of the initrd image. A question for you on this - I presume that's
> > what you want to avoid, right? So I'm assuming that we need to extract
> > the .BTF section out of the vmlinu[xz] binary and out of initrd into a
> > later-loading vmlinux_btf.ko module for small-footprint systems. Is that
> > correct?
> >
> > The reason I ask is having a later-loading vmlinux_btf.ko is a bit of a
> > pain since we need to walk the set of kernel modules and load their BTF,
> > relocate it and do kfunc registration. If we can simplify things via a
> > shared module dependency on vmlinux_btf.ko that would be great, but I'd
> > like to better understand the constraints from the small system
> > perspective first. Thanks!
>
> We cannot require other modules to depend on vmlinux_btf.ko.
> Some of them might load during the boot. So adding to the dependency
> will defeat the point of vmlinux_btf.ko.
> The only option I see is to let modules load and ignore their BTFs
> if vmlinux BTF is not present.
> Later vmlinux_btf.ko can be loaded and modules loaded after that
> time will succeed in loading their BTFs too.
> So some modules will have their BTF and some don't.
> I don't think it's an issue.
>
> If an admin loads a module with kfuncs and vmlinux_btf.ko is not loaded yet
> the kfunc registration will fail, of course. It's an issue,
> but I don't think we need to fix it right now by messing with depmod.
>
> The bigger issue is how to split vmlinux_btf.ko itself.
> The kernel has a bunch of kfuncs and they need BTF ids for protos
> and for all types they reference, so vmlinux BTF cannot be empty.
> minimize_btf() can probably help.
> So before we proceed with vmlinux_btf.ko we need to see data on
> how big the mandatory part of vmlinux BTF will be vs
> the rest of BTF in vmlinux_btf.ko.
I think there is a way to avoid all these problems by switching kfunc
registration to a lazy validation model. I'll explain what I mean.

1) vmlinux_btf.ko isn't loaded by default, but the kernel is aware that
vmlinux BTF is available if needed.
2) when user-space tries to access /sys/kernel/btf/vmlinux, we
automatically try to load vmlinux_btf.ko; similarly, if the kernel
internally needs vmlinux BTF information, we provide that
transparently through automatic loading of vmlinux_btf.ko.
3a) if a kernel module is loaded and it needs to register kfuncs, we
allow that, but instead of eagerly validating each kfunc's associated
BTF information for correctness, we just record the fact that there is
a kfunc registered, with name ABC and associated BTF ID XYZ.
3b) when a user tries to verify a BPF program that needs to use kfunc
ABC from that module, that's the time when we load vmlinux_btf.ko and
validate the kfunc's BTF information for correctness. If that
information is broken, we report an error and maybe log it to dmesg.
If not, we are golden (and that's the expected outcome) and we proceed
with verification just like today.
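
To make steps 1) through 3b) concrete, here is a rough userspace model
in Python. It is only a sketch of the idea, not kernel code: the class,
its methods, and the dict standing in for the contents of
vmlinux_btf.ko are all hypothetical, and it assumes a single flat BTF
ID space for simplicity. It also caches validated kfuncs so that the
check runs once per kfunc.

```python
# Userspace model only; every name here is hypothetical, not a kernel API.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LazyKfuncRegistry:
    # Stand-in for the contents of vmlinux_btf.ko: BTF type ID -> type kind.
    btf_module_image: dict
    vmlinux_btf: Optional[dict] = None              # None until loaded on demand
    load_count: int = 0                             # how many times we "loaded" it
    pending: dict = field(default_factory=dict)     # kfunc name -> BTF ID
    validated: dict = field(default_factory=dict)   # cache of checked kfuncs

    def get_vmlinux_btf(self):
        """Step 2: load vmlinux BTF transparently on first access."""
        if self.vmlinux_btf is None:
            self.vmlinux_btf = dict(self.btf_module_image)
            self.load_count += 1
        return self.vmlinux_btf

    def register_kfunc(self, name, btf_id):
        """Step 3a: only record (name, BTF ID); BTF itself is not touched."""
        self.pending[name] = btf_id

    def resolve_kfunc(self, name):
        """Step 3b: validate on first use by a BPF program, then cache."""
        if name in self.validated:
            return self.validated[name]
        btf_id = self.pending[name]
        kind = self.get_vmlinux_btf().get(btf_id)
        if kind != "FUNC":
            raise ValueError(f"kfunc {name}: BTF ID {btf_id} is not a function")
        self.validated[name] = btf_id
        return btf_id


reg = LazyKfuncRegistry(btf_module_image={7: "FUNC", 9: "STRUCT"})
reg.register_kfunc("kfunc_abc", 7)     # module load: no BTF needed yet
btf_id = reg.resolve_kfunc("kfunc_abc")  # first use: load BTF and validate
```

Registration leaves load_count at zero; only the first resolve_kfunc()
call pulls in the BTF data, and later calls hit the cache.
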
The key observation here is that with BTF there is no direct pointer
involved. It's all just stable integer IDs, so it doesn't really
matter whether we have instantiated BTF information at kernel module
load time or not. We can always (later) access this data through its
BTF ID.
The biggest change is handling of kernel modules with broken kfuncs.
Right now we reject the load, because registration fails. In the new
lazy model, this will be delayed until the very first use of that
kfunc. And if no one ever uses that kfunc, it, technically, doesn't
matter. It's basically the same approach as with BPF CO-RE and dead
code elimination in the verifier: if there is unknown/unsupported
code, but it's guaranteed to never execute, it's OK from the
verifier's POV.
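
A toy Python comparison of the two behaviours for a module carrying
one good and one broken kfunc. Again this is just an illustration
under assumptions: the function names, the BTF dict, and the "a kfunc
is valid if its BTF ID is a FUNC" check are made up for the example and
are much simpler than the real validation.

```python
# Toy comparison; function and variable names are hypothetical.
BTF = {1: "FUNC", 2: "STRUCT"}   # stand-in for vmlinux BTF: type ID -> kind


def is_valid_kfunc(btf, btf_id):
    """Simplified check: the recorded BTF ID must refer to a function."""
    return btf.get(btf_id) == "FUNC"


def load_module_eager(btf, kfuncs):
    """Current behaviour: any broken kfunc rejects the whole module load."""
    for name, btf_id in kfuncs.items():
        if not is_valid_kfunc(btf, btf_id):
            raise ValueError(f"module load rejected: bad kfunc {name}")
    return dict(kfuncs)


def load_module_lazy(kfuncs):
    """Lazy model: accept the module and just remember (name, BTF ID)."""
    return dict(kfuncs)


def use_kfunc_lazy(btf, registered, name):
    """Validation happens here, at the first use of this one kfunc."""
    btf_id = registered[name]
    if not is_valid_kfunc(btf, btf_id):
        raise ValueError(f"verification failed: bad kfunc {name}")
    return btf_id


module_kfuncs = {"good_kfunc": 1, "broken_kfunc": 2}
```

With these definitions, load_module_eager(BTF, module_kfuncs) raises at
load time, while load_module_lazy(module_kfuncs) succeeds and only
use_kfunc_lazy(..., "broken_kfunc") fails. A program that only ever
uses good_kfunc never sees the error.
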
I think that's an acceptable tradeoff, because a broken module like
that is really not an expected, typical situation. On the other hand,
we don't need to complicate and extend BTF itself to accommodate this;
it will all work as is and will keep working in the future.

P.S. And of course all this can/should be cached, so we don't redo all
this validation, but that's just an optimization.