netdev - Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT

Message-ID: <CAKH8qBvvq0f+D8BXChw_8krH896J_cYg0yhRfnDOSO_U1n394w@mail.gmail.com>
Date:   Fri, 10 Jun 2022 09:46:09 -0700
From:   Stanislav Fomichev <sdf@...gle.com>
To:     Quentin Monnet <quentin@...valent.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Yafang Shao <laoar.shao@...il.com>,
        Harsh Modi <harshmodi@...gle.com>,
        Paul Chaignon <paul@...ium.io>, netdev@...r.kernel.org,
        bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode
 instead of RLIMIT_MEMLOCK"

On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@...valent.com> wrote:
>
> 2022-06-10 09:07 UTC-0700 ~ sdf@...gle.com
> > On 06/10, Quentin Monnet wrote:
> >> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> >
> >> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> >> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> >> kernel has switched to memcg-based memory accounting. Thanks to the
> >> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
> >> with other systems and ask libbpf to raise the limit for us if
> >> necessary.
> >
> >> How do we know if memcg-based accounting is supported? There is a probe
> >> in libbpf to check this. But this probe currently relies on the
> >> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> >> landed in the same kernel version as the memory accounting change. This
> >> works in the generic case, but it may fail, for example, if the helper
> >> function has been backported to an older kernel. This has been observed
> >> for Google Cloud's Container-Optimized OS (COS), where the helper is
> >> available but rlimit is still in use. The probe succeeds, the rlimit is
> >> not raised, and probing features with bpftool, for example, fails.
> >
> >> A patch was submitted [0] to update this probe in libbpf, based on what
> >> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
> >> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> >> some hard-to-debug flakiness if another process starts, or the current
> >> application is killed, while the rlimit is reduced, and the approach was
> >> discarded.
> >
> >> As a workaround to ensure that the rlimit bump does not depend on the
> >> availability of a given helper, we restore the unconditional rlimit bump
> >> in bpftool for now.
> >
> >> [0]
> >> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
> >> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> >
> >> Cc: Yafang Shao <laoar.shao@...il.com>
> >> Signed-off-by: Quentin Monnet <quentin@...valent.com>
> >> ---
> >>   tools/bpf/bpftool/common.c     | 8 ++++++++
> >>   tools/bpf/bpftool/feature.c    | 2 ++
> >>   tools/bpf/bpftool/main.c       | 6 +++---
> >>   tools/bpf/bpftool/main.h       | 2 ++
> >>   tools/bpf/bpftool/map.c        | 2 ++
> >>   tools/bpf/bpftool/pids.c       | 1 +
> >>   tools/bpf/bpftool/prog.c       | 3 +++
> >>   tools/bpf/bpftool/struct_ops.c | 2 ++
> >>   8 files changed, 23 insertions(+), 3 deletions(-)
> >
> >> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> >> index a45b42ee8ab0..a0d4acd7c54a 100644
> >> --- a/tools/bpf/bpftool/common.c
> >> +++ b/tools/bpf/bpftool/common.c
> >> @@ -17,6 +17,7 @@
> >>   #include <linux/magic.h>
> >>   #include <net/if.h>
> >>   #include <sys/mount.h>
> >> +#include <sys/resource.h>
> >>   #include <sys/stat.h>
> >>   #include <sys/vfs.h>
> >
> >> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> >>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> >>   }
> >
> >> +void set_max_rlimit(void)
> >> +{
> >> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >> +
> >> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
> >
> > Do you think it might make sense to print to stderr some warning if
> > we actually happen to adjust this limit?
> >
> > if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
> >     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
> >     setrlimit(RLIMIT_MEMLOCK, &rinf);
> > }
> >
> > ?
> >
> > Because while it's nice that we automatically do this, this might still
> > lead to surprises for some users. OTOH, not sure whether people
> > actually read those warnings? :-/
>
> I'm not strictly opposed to a warning, but I'm not completely sure this
> is desirable.
>
> Bpftool has raised the rlimit for a long time; it changed only in April,
> so I don't think it would come up as a surprise for people who have used
> it for a while. I think this is also something that several other
> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> have been doing too.

In this case ignore me and let's continue doing that :-)

Btw, eventually we'd still like to stop doing that I'd presume? Should
we at some point follow up with something like:

if (kernel_version >= 5.11) { don't touch memlock; }

?

I guess we care only about <5.11 because of the backports, but 5.11+
kernels are guaranteed to have memcg-based accounting.

I'm not sure whether memlock is used out there in the distros (and
especially for root/bpf_capable), so I'm also not sure whether we
really care or not.
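
To make the version-gated idea above concrete, here is a rough sketch,
not a tested patch. It assumes the running kernel release can be read
with uname(2) and parsed as "major.minor"; release_at_least() and
set_max_rlimit_if_needed() are hypothetical names, not existing bpftool
functions.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: return true if a release string such as
 * "5.10.0-8-amd64" denotes a kernel at or above major.minor.
 */
static bool release_at_least(const char *release, int major, int minor)
{
	int maj = 0, min = 0;

	/* Unparsable release string: report "older" so the caller keeps bumping. */
	if (sscanf(release, "%d.%d", &maj, &min) != 2)
		return false;

	return maj > major || (maj == major && min >= minor);
}

/* Hypothetical variant of set_max_rlimit(): skip the bump on 5.11+. */
static void set_max_rlimit_if_needed(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
	struct utsname uts;

	/* 5.11+ is assumed to use memcg-based accounting: don't touch memlock. */
	if (uname(&uts) == 0 && release_at_least(uts.release, 5, 11))
		return;

	/* Older kernel, or uname() failed: keep the unconditional bump. */
	setrlimit(RLIMIT_MEMLOCK, &rinf);
}
```

Call sites would use set_max_rlimit_if_needed() in place of
set_max_rlimit(). A version check does not depend on any helper being
present, so a backported helper cannot fool it; on any failure it falls
back to raising the limit.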

> For new users, I agree the warning may be helpful. But then the message
> is likely to appear the very first time a user runs the command - likely
> as root - and I fear this might worry people not familiar with rlimits,
> who would wonder if they just broke something on their system? Maybe
> with a different phrasing.
>
> Alternatively we could document it in the relevant man pages (not that
> people would see it better, but at least it would be mentioned somewhere
> if people take the time to read the docs)? What do you think?
>
> Quentin
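
For reference, the warning variant of set_max_rlimit() discussed earlier
in the thread could look roughly like the sketch below. It is only an
illustration: memlock_needs_bump() and set_max_rlimit_warn() are
hypothetical names, and the message wording is a placeholder given the
concern about phrasing.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical predicate: would raising to infinity change this limit? */
static int memlock_needs_bump(const struct rlimit *cur)
{
	return cur->rlim_cur != RLIM_INFINITY ||
	       cur->rlim_max != RLIM_INFINITY;
}

/* Hypothetical variant of set_max_rlimit() that warns before adjusting. */
static void set_max_rlimit_warn(void)
{
	struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
	struct rlimit cur;

	/* Already unlimited: nothing to do, nothing to print. */
	if (getrlimit(RLIMIT_MEMLOCK, &cur) == 0 && !memlock_needs_bump(&cur))
		return;

	/* Placeholder wording; the thread left the exact phrasing open. */
	fprintf(stderr, "Warning: raising RLIMIT_MEMLOCK to infinity\n");
	if (setrlimit(RLIMIT_MEMLOCK, &rinf))
		fprintf(stderr, "Warning: failed to raise RLIMIT_MEMLOCK\n");
}
```

The message is printed only when the limit actually changes, so users
whose limit is already unlimited see no extra output.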