netdev - Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT

Message-ID: <CALOAHbCvWzOJ169fPTCp1KsFpkEVukKgGnH4mDeYGOEv6hsEpQ@mail.gmail.com>
Date:   Tue, 14 Jun 2022 20:37:55 +0800
From:   Yafang Shao <laoar.shao@...il.com>
To:     Stanislav Fomichev <sdf@...gle.com>
Cc:     Quentin Monnet <quentin@...valent.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Harsh Modi <harshmodi@...gle.com>,
        Paul Chaignon <paul@...ium.io>,
        netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode
 instead of RLIMIT_MEMLOCK"

On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@...gle.com> wrote:
>
> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@...valent.com> wrote:
> >
> > 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@...gle.com>
> > > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@...valent.com> wrote:
> > >>
> > >> 2022-06-10 09:07 UTC-0700 ~ sdf@...gle.com
> > >>> On 06/10, Quentin Monnet wrote:
> > >>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> > >>>
> > >>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> > >>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> > >>>> kernel has switched to memcg-based memory accounting. Thanks to the
> > >>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
> > >>>> with other systems and ask libbpf to raise the limit for us if
> > >>>> necessary.
> > >>>
> > >>>> How do we know if memcg-based accounting is supported? There is a probe
> > >>>> in libbpf to check this. But this probe currently relies on the
> > >>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> > >>>> landed in the same kernel version as the memory accounting change. This
> > >>>> works in the generic case, but it may fail, for example, if the helper
> > >>>> function has been backported to an older kernel. This has been observed
> > >>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
> > >>>> available but rlimit is still in use. The probe succeeds, the rlimit is
> > >>>> not raised, and probing features with bpftool, for example, fails.
> > >>>
> > >>>> A patch was submitted [0] to update this probe in libbpf, based on what
> > >>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
> > >>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> > >>>> some hard-to-debug flakiness if another process starts, or the current
> > >>>> application is killed, while the rlimit is reduced, and the approach was
> > >>>> discarded.
> > >>>
> > >>>> As a workaround to ensure that the rlimit bump does not depend on the
> > >>>> availability of a given helper, we restore the unconditional rlimit bump
> > >>>> in bpftool for now.
> > >>>
> > >>>> [0]
> > >>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
> > >>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> > >>>
> > >>>> Cc: Yafang Shao <laoar.shao@...il.com>
> > >>>> Signed-off-by: Quentin Monnet <quentin@...valent.com>
> > >>>> ---
> > >>>>   tools/bpf/bpftool/common.c     | 8 ++++++++
> > >>>>   tools/bpf/bpftool/feature.c    | 2 ++
> > >>>>   tools/bpf/bpftool/main.c       | 6 +++---
> > >>>>   tools/bpf/bpftool/main.h       | 2 ++
> > >>>>   tools/bpf/bpftool/map.c        | 2 ++
> > >>>>   tools/bpf/bpftool/pids.c       | 1 +
> > >>>>   tools/bpf/bpftool/prog.c       | 3 +++
> > >>>>   tools/bpf/bpftool/struct_ops.c | 2 ++
> > >>>>   8 files changed, 23 insertions(+), 3 deletions(-)
> > >>>
> > >>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> > >>>> index a45b42ee8ab0..a0d4acd7c54a 100644
> > >>>> --- a/tools/bpf/bpftool/common.c
> > >>>> +++ b/tools/bpf/bpftool/common.c
> > >>>> @@ -17,6 +17,7 @@
> > >>>>   #include <linux/magic.h>
> > >>>>   #include <net/if.h>
> > >>>>   #include <sys/mount.h>
> > >>>> +#include <sys/resource.h>
> > >>>>   #include <sys/stat.h>
> > >>>>   #include <sys/vfs.h>
> > >>>
> > >>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> > >>>>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> > >>>>   }
> > >>>
> > >>>> +void set_max_rlimit(void)
> > >>>> +{
> > >>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> > >>>> +
> > >>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> > >>>
> > >>> Do you think it might make sense to print to stderr some warning if
> > >>> we actually happen to adjust this limit?
> > >>>
> > >>> struct rlimit cur;
> > >>>
> > >>> if (!getrlimit(RLIMIT_MEMLOCK, &cur) && cur.rlim_cur != RLIM_INFINITY) {
> > >>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
> > >>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> > >>> }
> > >>>
> > >>> ?
> > >>>
> > >>> Because while it's nice that we automatically do this, this might still
> > >>> lead to surprises for some users. OTOH, not sure whether people
> > >>> actually read those warnings? :-/
> > >>
> > >> I'm not strictly opposed to a warning, but I'm not completely sure this
> > >> is desirable.
> > >>
> > >> Bpftool has raised the rlimit for a long time, it changed only in April,
> > >> so I don't think it would come up as a surprise for people who have used
> > >> it for a while. I think this is also something that several other
> > >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> > >> have been doing too.
> > >
> > > In this case ignore me and let's continue doing that :-)
> > >
> > > Btw, eventually we'd still like to stop doing that I'd presume?
> >
> > Agreed. I was thinking either finding a way to improve the probe in
> > libbpf, or waiting for some more time until 5.11 gets old, but this may
> > take years :/
> >
> > > Should
> > > we at some point follow up with something like:
> > >
> > > if (kernel_version >= 5.11) { don't touch memlock; }
> > >
> > > ?
> > >
> > > I guess we care only about <5.11 because of the backports, but 5.11+
> > > kernels are guaranteed to have memcg.
> >
> > You mean from uname() and parsing the release? Yes I suppose we could do
> > that, can do as a follow-up.
>
> Yeah, uname-based, I don't think we can do better? Given that probing
> is problematic as well :-(
> But idk, up to you.
>

I agree with the uname-based solution. Another option would be to
probe for the 'memcg' member in struct bpf_map, which would also cover
kernels where memcg-based memory accounting has been backported, but
that would be somewhat over-engineered. The uname-based solution is
simple and should work.
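
For illustration only, a minimal sketch of what the uname-based check
could look like. The helper names release_at_least() and
kernel_has_memcg_accounting() are hypothetical and not part of bpftool
or libbpf. The sketch assumes that the leading "major.minor" of
utsname.release is enough to decide, and that 5.11+ kernels always have
memcg-based accounting, as stated earlier in this thread.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: return true if a release string such as
 * "5.15.0-91-generic" is at least want_major.want_minor.
 * Anything after "major.minor" is ignored.
 */
bool release_at_least(const char *release, unsigned int want_major,
		      unsigned int want_minor)
{
	unsigned int major, minor;

	/* Unparseable release: be conservative and report "too old". */
	if (sscanf(release, "%u.%u", &major, &minor) != 2)
		return false;

	return major > want_major ||
	       (major == want_major && minor >= want_minor);
}

/*
 * Hypothetical helper: true if the running kernel is 5.11 or newer,
 * in which case the caller could skip the RLIMIT_MEMLOCK bump.
 */
bool kernel_has_memcg_accounting(void)
{
	struct utsname uts;

	/* If uname() fails, fall back to the old behaviour (bump). */
	if (uname(&uts))
		return false;

	return release_at_least(uts.release, 5, 11);
}
```

A caller would keep the unconditional bump for anything older, e.g.
"if (!kernel_has_memcg_accounting()) set_max_rlimit();". An older
kernel with memcg accounting backported would still get the bump, which
is harmless, and the COS case (helper backported, rlimit still in use)
would be handled correctly because the check does not depend on any
helper.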

-- 
Regards
Yafang