Message-ID: <CAKH8qBsFyakQRd1q6XWggdv4F5+HrHoC4njg9jQFDOfq+kRBCQ@mail.gmail.com>
Date:   Fri, 10 Jun 2022 10:17:31 -0700
From:   Stanislav Fomichev <sdf@...gle.com>
To:     Quentin Monnet <quentin@...valent.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Yafang Shao <laoar.shao@...il.com>,
        Harsh Modi <harshmodi@...gle.com>,
        Paul Chaignon <paul@...ium.io>, netdev@...r.kernel.org,
        bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode
 instead of RLIMIT_MEMLOCK"

On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@...valent.com> wrote:
>
> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@...gle.com>
> > On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@...valent.com> wrote:
> >>
> >> 2022-06-10 09:07 UTC-0700 ~ sdf@...gle.com
> >>> On 06/10, Quentin Monnet wrote:
> >>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> >>>
> >>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> >>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> >>>> kernel has switched to memcg-based memory accounting. Thanks to the
> >>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK flag, we attempted to keep
> >>>> compatibility with other systems by asking libbpf to raise the limit
> >>>> for us if necessary.
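
For reference, a minimal sketch of how an application opts into that
behaviour, assuming a libbpf recent enough (0.7+, if I recall correctly) to
provide libbpf_set_strict_mode() and the LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK
flag; with the flag set, libbpf bumps RLIMIT_MEMLOCK before loading programs
or maps only when its probe says memcg-based accounting is absent:

#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
    /* Set once at startup, before creating or loading any BPF object. */
    if (libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK)) {
        fprintf(stderr, "failed to enable libbpf strict mode\n");
        return 1;
    }

    /* ... bpf_object__open()/load() etc. as usual ... */
    return 0;
}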
> >>>
> >>>> How do we know if memcg-based accounting is supported? There is a probe
> >>>> in libbpf to check this. But this probe currently relies on the
> >>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> >>>> landed in the same kernel version as the memory accounting change. This
> >>>> works in the generic case, but it may fail, for example, if the helper
> >>>> function has been backported to an older kernel. This has been observed
> >>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
> >>>> available but rlimit is still in use. The probe succeeds, the rlimit is
> >>>> not raised, and probing features with bpftool, for example, fails.
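
The check boils down to probing for one helper. A rough sketch of that idea
using libbpf's public probing API (libbpf_probe_bpf_helper(), present in
newer libbpf releases; libbpf's internal probe is separate but tests the
same helper), with the caveat described above that helper availability is
only a proxy for memcg accounting:

#include <stdbool.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

/* Proxy check: is bpf_ktime_get_coarse_ns() known to this kernel? */
static bool memcg_accounting_assumed(void)
{
    int ret = libbpf_probe_bpf_helper(BPF_PROG_TYPE_SOCKET_FILTER,
                                      BPF_FUNC_ktime_get_coarse_ns, NULL);

    /* 1: helper available, 0: not available, <0: probe error. */
    return ret > 0;
}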
> >>>
> >>>> A patch was submitted [0] to update this probe in libbpf, based on what
> >>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
> >>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> >>>> some hard-to-debug flakiness if another process starts, or the current
> >>>> application is killed, while the rlimit is reduced, so the approach was
> >>>> discarded.
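
A sketch of that discarded approach, assuming that creating a tiny map is
enough to exercise memory accounting (bpf_map_create() stands in here for
the object load mentioned above, and needs libbpf 0.7+):

#include <stdbool.h>
#include <unistd.h>
#include <sys/resource.h>
#include <bpf/bpf.h>

static bool memcg_accounting_probe(void)
{
    struct rlimit old, zero;
    bool ret = false;
    int fd;

    if (getrlimit(RLIMIT_MEMLOCK, &old))
        return false;

    zero.rlim_cur = 0;              /* lower only the soft limit */
    zero.rlim_max = old.rlim_max;
    if (setrlimit(RLIMIT_MEMLOCK, &zero))
        return false;

    /* On pre-memcg kernels this fails with EPERM under a zero limit. */
    fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, sizeof(int),
                        sizeof(int), 1, NULL);
    if (fd >= 0) {
        close(fd);
        ret = true;
    }

    /* Restore the previous limit; this window is the flaky part. */
    setrlimit(RLIMIT_MEMLOCK, &old);
    return ret;
}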
> >>>
> >>>> As a workaround to ensure that the rlimit bump does not depend on the
> >>>> availability of a given helper, we restore the unconditional rlimit bump
> >>>> in bpftool for now.
> >>>
> >>>> [0]
> >>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
> >>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> >>>
> >>>> Cc: Yafang Shao <laoar.shao@...il.com>
> >>>> Signed-off-by: Quentin Monnet <quentin@...valent.com>
> >>>> ---
> >>>>   tools/bpf/bpftool/common.c     | 8 ++++++++
> >>>>   tools/bpf/bpftool/feature.c    | 2 ++
> >>>>   tools/bpf/bpftool/main.c       | 6 +++---
> >>>>   tools/bpf/bpftool/main.h       | 2 ++
> >>>>   tools/bpf/bpftool/map.c        | 2 ++
> >>>>   tools/bpf/bpftool/pids.c       | 1 +
> >>>>   tools/bpf/bpftool/prog.c       | 3 +++
> >>>>   tools/bpf/bpftool/struct_ops.c | 2 ++
> >>>>   8 files changed, 23 insertions(+), 3 deletions(-)
> >>>
> >>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> >>>> index a45b42ee8ab0..a0d4acd7c54a 100644
> >>>> --- a/tools/bpf/bpftool/common.c
> >>>> +++ b/tools/bpf/bpftool/common.c
> >>>> @@ -17,6 +17,7 @@
> >>>>   #include <linux/magic.h>
> >>>>   #include <net/if.h>
> >>>>   #include <sys/mount.h>
> >>>> +#include <sys/resource.h>
> >>>>   #include <sys/stat.h>
> >>>>   #include <sys/vfs.h>
> >>>
> >>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> >>>>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> >>>>   }
> >>>
> >>>> +void set_max_rlimit(void)
> >>>> +{
> >>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >>>> +
> >>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>
> >>> Do you think it might make sense to print to stderr some warning if
> >>> we actually happen to adjust this limit?
> >>>
> >>> struct rlimit rold;
> >>>
> >>> if (!getrlimit(RLIMIT_MEMLOCK, &rold) && rold.rlim_max != RLIM_INFINITY) {
> >>>     fprintf(stderr, "Warning: resetting RLIMIT_MEMLOCK to infinity!\n");
> >>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>> }
> >>>
> >>> ?
> >>>
> >>> Because while it's nice that we automatically do this, this might still
> >>> lead to surprises for some users. OTOH, not sure whether people
> >>> actually read those warnings? :-/
> >>
> >> I'm not strictly opposed to a warning, but I'm not completely sure this
> >> is desirable.
> >>
> >> Bpftool has raised the rlimit for a long time; it changed only in April,
> >> so I don't think it would come as a surprise for people who have used
> >> it for a while. I think this is also something that several other
> >> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> >> have been doing too.
> >
> > In this case ignore me and let's continue doing that :-)
> >
> > Btw, eventually we'd still like to stop doing that I'd presume?
>
> Agreed. I was thinking either finding a way to improve the probe in
> libbpf, or waiting for some more time until 5.11 gets old, but this may
> take years :/
>
> > Should
> > we at some point follow up with something like:
> >
> > if (kernel_version >= 5.11) { don't touch memlock; }
> >
> > ?
> >
> > I guess we care only about <5.11 because of the backports, but 5.11+
> > kernels are guaranteed to have memcg.
>
> You mean from uname() and parsing the release? Yes I suppose we could do
> that, can do as a follow-up.

Yeah, uname-based, I don't think we can do better? Given that probing
is problematic as well :-(
But idk, up to you.
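
Something along these lines, presumably; the release parsing here is an
assumption, and anything that fails to parse would conservatively keep the
bump:

#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if the running kernel release is at least major.minor. */
static int kernel_release_at_least(unsigned int major, unsigned int minor)
{
    struct utsname uts;
    unsigned int kmaj, kmin;

    if (uname(&uts) || sscanf(uts.release, "%u.%u", &kmaj, &kmin) != 2)
        return 0;   /* be conservative and keep bumping */
    return kmaj > major || (kmaj == major && kmin >= minor);
}

/* e.g.: if (!kernel_release_at_least(5, 11)) set_max_rlimit(); */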

> > I'm not sure whether memlock is used out there in the distros (and
> > especially for root/bpf_capable), so I'm also not sure whether we
> > really care or not.
>
> Not sure either. For what it's worth, I've never seen complaints so far
> from users about the rlimit being raised (from bpftool or other BPF apps).
