Message-ID: <CALOAHbCTJOm8wLAeuhnjnN96_9c+hB5YjLGO_4NsgiB=8_QQOA@mail.gmail.com>
Date: Thu, 16 Jun 2022 22:54:12 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Quentin Monnet <quentin@...valent.com>
Cc: Stanislav Fomichev <sdf@...gle.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Harsh Modi <harshmodi@...gle.com>,
Paul Chaignon <paul@...ium.io>,
netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode
instead of RLIMIT_MEMLOCK"
On Thu, Jun 16, 2022 at 9:59 PM Quentin Monnet <quentin@...valent.com> wrote:
>
> 2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@...il.com>
> > On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@...gle.com> wrote:
> >>
> >> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@...il.com> wrote:
> >>>
> >>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@...valent.com> wrote:
> >>>>
> >>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@...il.com>
> >>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@...gle.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@...valent.com> wrote:
> >>>>>>>
> >>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@...gle.com>
> >>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@...valent.com> wrote:
> >>>>>>>>>
> >>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@...gle.com
> >>>>>>>>>> On 06/10, Quentin Monnet wrote:
> >>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
> >>>>>>>>>>
> >>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
> >>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
> >>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the
> >>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK mode, we attempted to keep
> >>>>>>>>>>> compatibility with other systems by asking libbpf to raise the limit
> >>>>>>>>>>> for us if necessary.
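
For readers less familiar with the libbpf 1.0 mode mentioned above, a
minimal sketch of how an application opts into this behavior
(illustration only, not part of this patch):

#include <bpf/libbpf.h>

int main(void)
{
        /* Ask libbpf to probe for memcg-based accounting and to bump
         * RLIMIT_MEMLOCK itself only when the kernel appears to still
         * account BPF memory against the rlimit.
         */
        libbpf_set_strict_mode(LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK);

        /* ... open and load BPF objects as usual ... */
        return 0;
}

Note that bpftool enables the whole LIBBPF_STRICT_ALL set, which
includes this flag.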
> >>>>>>>>>>
> >>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe
> >>>>>>>>>>> in libbpf to check this. But this probe currently relies on the
> >>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
> >>>>>>>>>>> landed in the same kernel version as the memory accounting change. This
> >>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper
> >>>>>>>>>>> function has been backported to an older kernel. This has been observed
> >>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
> >>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
> >>>>>>>>>>> not raised, and probing features with bpftool, for example, fails.
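
For illustration, the helper-availability check this heuristic relies on
can be reproduced from user space with libbpf's feature-probing API; a
rough sketch, only to show why a backported helper defeats it:

#include <stdio.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
        /* Returns 1 if a trivial program calling the helper loads, i.e.
         * bpf_ktime_get_coarse_ns() exists. On a kernel that backported
         * the helper but not memcg-based accounting, this still returns
         * 1, so the heuristic wrongly concludes no rlimit bump is needed.
         */
        int ret = libbpf_probe_bpf_helper(BPF_PROG_TYPE_SOCKET_FILTER,
                                          BPF_FUNC_ktime_get_coarse_ns, NULL);

        printf("bpf_ktime_get_coarse_ns available: %d\n", ret);
        return 0;
}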
> >>>>>>>>>>
> >>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what
> >>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
> >>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
> >>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current
> >>>>>>>>>>> application is killed, while the rlimit is reduced, so the approach was
> >>>>>>>>>>> discarded.
> >>>>>>>>>>
> >>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the
> >>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump
> >>>>>>>>>>> in bpftool for now.
> >>>>>>>>>>
> >>>>>>>>>>> [0]
> >>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
> >>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
> >>>>>>>>>>
> >>>>>>>>>>> Cc: Yafang Shao <laoar.shao@...il.com>
> >>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@...valent.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>> tools/bpf/bpftool/common.c | 8 ++++++++
> >>>>>>>>>>> tools/bpf/bpftool/feature.c | 2 ++
> >>>>>>>>>>> tools/bpf/bpftool/main.c | 6 +++---
> >>>>>>>>>>> tools/bpf/bpftool/main.h | 2 ++
> >>>>>>>>>>> tools/bpf/bpftool/map.c | 2 ++
> >>>>>>>>>>> tools/bpf/bpftool/pids.c | 1 +
> >>>>>>>>>>> tools/bpf/bpftool/prog.c | 3 +++
> >>>>>>>>>>> tools/bpf/bpftool/struct_ops.c | 2 ++
> >>>>>>>>>>> 8 files changed, 23 insertions(+), 3 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> >>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
> >>>>>>>>>>> --- a/tools/bpf/bpftool/common.c
> >>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c
> >>>>>>>>>>> @@ -17,6 +17,7 @@
> >>>>>>>>>>> #include <linux/magic.h>
> >>>>>>>>>>> #include <net/if.h>
> >>>>>>>>>>> #include <sys/mount.h>
> >>>>>>>>>>> +#include <sys/resource.h>
> >>>>>>>>>>> #include <sys/stat.h>
> >>>>>>>>>>> #include <sys/vfs.h>
> >>>>>>>>>>
> >>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
> >>>>>>>>>>> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> >>>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>> +void set_max_rlimit(void)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
> >>>>>>>>>>> +
> >>>>>>>>>>> + setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>>>>>>
> >>>>>>>>>> Do you think it might make sense to print to stderr some warning if
> >>>>>>>>>> we actually happen to adjust this limit?
> >>>>>>>>>>
> >>>>>>>>>> struct rlimit rcur;
> >>>>>>>>>>
> >>>>>>>>>> if (!getrlimit(RLIMIT_MEMLOCK, &rcur) &&
> >>>>>>>>>>     rcur.rlim_max != RLIM_INFINITY) {
> >>>>>>>>>>         fprintf(stderr,
> >>>>>>>>>>                 "Warning: resetting MEMLOCK rlimit to infinity!\n");
> >>>>>>>>>>         setrlimit(RLIMIT_MEMLOCK, &rinf);
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> ?
> >>>>>>>>>>
> >>>>>>>>>> Because while it's nice that we automatically do this, this might still
> >>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people
> >>>>>>>>>> actually read those warnings? :-/
> >>>>>>>>>
> >>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
> >>>>>>>>> is desirable.
> >>>>>>>>>
> >>>>>>>>> Bpftool has raised the rlimit for a long time; it changed only in April,
> >>>>>>>>> so I don't think it would come as a surprise to people who have used
> >>>>>>>>> it for a while. I think this is also something that several other
> >>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
> >>>>>>>>> have been doing too.
> >>>>>>>>
> >>>>>>>> In this case ignore me and let's continue doing that :-)
> >>>>>>>>
> >>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume?
> >>>>>>>
> >>>>>>> Agreed. I was thinking of either finding a way to improve the probe in
> >>>>>>> libbpf, or waiting some more time until 5.11 gets old, but this may
> >>>>>>> take years :/
> >>>>>>>
> >>>>>>>> Should
> >>>>>>>> we at some point follow up with something like:
> >>>>>>>>
> >>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; }
> >>>>>>>>
> >>>>>>>> ?
> >>>>>>>>
> >>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+
> >>>>>>>> kernels are guaranteed to have memcg.
> >>>>>>>
> >>>>>>> You mean from uname() and parsing the release? Yes, I suppose we could
> >>>>>>> do that; I can do it as a follow-up.
> >>>>>>
> >>>>>> Yeah, uname-based, I don't think we can do better? Given that probing
> >>>>>> is problematic as well :-(
> >>>>>> But idk, up to you.
> >>>>>>
> >>>>>
> >>>>> Agreed with the uname-based solution. Another possible solution is to
> >>>>> probe the member 'memcg' in struct bpf_map, in case someone backports
> >>>>> memcg-based memory accounting, but that would be a little
> >>>>> over-engineered. The uname-based solution is simple and can work.
> >>>>>
> >>>>
> >>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to
> >>>> user space, and BTF is not a hard dependency for bpftool. I'll work on
> >>>> the uname-based test as a follow-up to this set.
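
A uname-based check could look something like the sketch below
(hypothetical helper, made-up name, just to illustrate skipping the
bump on 5.11+):

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical: return true if the running kernel is >= 5.11, i.e. it
 * uses memcg-based accounting for BPF memory, so bpftool would not
 * need to touch RLIMIT_MEMLOCK at all.
 */
static bool kernel_at_least_5_11(void)
{
        struct utsname uts;
        unsigned int major, minor;

        if (uname(&uts) || sscanf(uts.release, "%u.%u", &major, &minor) != 2)
                return false; /* be conservative: keep bumping the rlimit */

        return major > 5 || (major == 5 && minor >= 11);
}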
> >>>>
> >>>
> >>> On second thought, the uname-based test may not work, because
> >>> CONFIG_MEMCG_KMEM can be disabled.
> >>
> >> Does it matter? Regardless of whether there is memcg or not, we
> >> shouldn't touch ulimit on 5.11+.
> >> If there is no memcg, there is no bpf memory enforcement.
> >
> > Right, rlimit-based accounting is completely removed; that is not what
> > I thought before. I assumed it would fall back to rlimit-based
> > accounting if kmemcg were disabled.
>
> Agreed, and so I've got a patch ready for the uname-based probe.
>
> But talking about this with Daniel, we were wondering if it would make
> sense instead to have the probe I had initially submitted (lower the
> rlimit to 0, attempt to load a program, reset rlimit - see [0]), but
> only for bpftool instead of libbpf? My understanding is that the memlock
> rlimit is per-process, right? So this shouldn't affect any other
> process, and because bpftool is not multithreaded, nothing other than
> probing would happen while the rlimit is at zero?
Makes sense.
It is safe to do the probe within bpftool.
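
A rough sketch of what such a probe inside bpftool could look like
(hypothetical helper name, minimal error handling, just to illustrate
the idea of lowering the soft limit and trying to load):

#include <stdbool.h>
#include <unistd.h>
#include <sys/resource.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

/* Hypothetical probe: with the soft RLIMIT_MEMLOCK lowered to 0, a
 * successful program load means the kernel no longer charges BPF
 * memory against the rlimit, i.e. memcg-based accounting is in use
 * and no rlimit bump is needed.
 */
static bool probe_memcg_accounting(void)
{
        struct bpf_insn insns[] = {
                { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0 },
                { .code = BPF_JMP | BPF_EXIT },
        };
        struct rlimit rold, rnew;
        int fd;

        if (getrlimit(RLIMIT_MEMLOCK, &rold))
                return false;
        rnew = rold;
        rnew.rlim_cur = 0;
        if (setrlimit(RLIMIT_MEMLOCK, &rnew))
                return false;

        /* Load a trivial "r0 = 0; exit" program under the zero limit. */
        fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL",
                           insns, 2, NULL);
        if (fd >= 0)
                close(fd);

        /* Restore the original soft limit before doing anything else. */
        setrlimit(RLIMIT_MEMLOCK, &rold);

        return fd >= 0;
}

bpftool could then call set_max_rlimit() only when such a probe
returns false.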
> Or is it just simpler,
> if less accurate, to stick to the uname probe?
>
> Quentin
>
> [0]
> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
--
Regards
Yafang