netdev - Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <68e789f7-aa68-cc6a-fb33-b686d14f80a5@isovalent.com>
Date:   Thu, 16 Jun 2022 14:59:25 +0100
From:   Quentin Monnet <quentin@...valent.com>
To:     Yafang Shao <laoar.shao@...il.com>,
        Stanislav Fomichev <sdf@...gle.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Harsh Modi <harshmodi@...gle.com>,
        Paul Chaignon <paul@...ium.io>,
        netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode
 instead of RLIMIT_MEMLOCK"

2022-06-16 00:05 UTC+0800 ~ Yafang Shao <laoar.shao@...il.com>
> On Wed, Jun 15, 2022 at 11:52 PM Stanislav Fomichev <sdf@...gle.com> wrote:
>>
>> On Wed, Jun 15, 2022 at 6:23 AM Yafang Shao <laoar.shao@...il.com> wrote:
>>>
>>> On Tue, Jun 14, 2022 at 10:20 PM Quentin Monnet <quentin@...valent.com> wrote:
>>>>
>>>> 2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@...il.com>
>>>>> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@...gle.com> wrote:
>>>>>>
>>>>>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@...valent.com> wrote:
>>>>>>>
>>>>>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@...gle.com>
>>>>>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@...valent.com> wrote:
>>>>>>>>>
>>>>>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@...gle.com
>>>>>>>>>> On 06/10, Quentin Monnet wrote:
>>>>>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
>>>>>>>>>>
>>>>>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
>>>>>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
>>>>>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the
>>>>>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
>>>>>>>>>>> with other systems and ask libbpf to raise the limit for us if
>>>>>>>>>>> necessary.
>>>>>>>>>>
>>>>>>>>>>> How do we know if memcg-based accounting is supported? There is a probe
>>>>>>>>>>> in libbpf to check this. But this probe currently relies on the
>>>>>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
>>>>>>>>>>> landed in the same kernel version as the memory accounting change. This
>>>>>>>>>>> works in the generic case, but it may fail, for example, if the helper
>>>>>>>>>>> function has been backported to an older kernel. This has been observed
>>>>>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
>>>>>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
>>>>>>>>>>> not raised, and probing features with bpftool, for example, fails.
>>>>>>>>>>
>>>>>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what
>>>>>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
>>>>>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
>>>>>>>>>>> some hard-to-debug flakiness if another process starts, or the current
>>>>>>>>>>> application is killed, while the rlimit is reduced, and the approach was
>>>>>>>>>>> discarded.
>>>>>>>>>>
>>>>>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the
>>>>>>>>>>> availability of a given helper, we restore the unconditional rlimit bump
>>>>>>>>>>> in bpftool for now.
>>>>>>>>>>
>>>>>>>>>>> [0]
>>>>>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
>>>>>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>>>>>>>>>>
>>>>>>>>>>> Cc: Yafang Shao <laoar.shao@...il.com>
>>>>>>>>>>> Signed-off-by: Quentin Monnet <quentin@...valent.com>
>>>>>>>>>>> ---
>>>>>>>>>>>   tools/bpf/bpftool/common.c     | 8 ++++++++
>>>>>>>>>>>   tools/bpf/bpftool/feature.c    | 2 ++
>>>>>>>>>>>   tools/bpf/bpftool/main.c       | 6 +++---
>>>>>>>>>>>   tools/bpf/bpftool/main.h       | 2 ++
>>>>>>>>>>>   tools/bpf/bpftool/map.c        | 2 ++
>>>>>>>>>>>   tools/bpf/bpftool/pids.c       | 1 +
>>>>>>>>>>>   tools/bpf/bpftool/prog.c       | 3 +++
>>>>>>>>>>>   tools/bpf/bpftool/struct_ops.c | 2 ++
>>>>>>>>>>>   8 files changed, 23 insertions(+), 3 deletions(-)
>>>>>>>>>>
>>>>>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>>>>>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
>>>>>>>>>>> --- a/tools/bpf/bpftool/common.c
>>>>>>>>>>> +++ b/tools/bpf/bpftool/common.c
>>>>>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>>>>>   #include <linux/magic.h>
>>>>>>>>>>>   #include <net/if.h>
>>>>>>>>>>>   #include <sys/mount.h>
>>>>>>>>>>> +#include <sys/resource.h>
>>>>>>>>>>>   #include <sys/stat.h>
>>>>>>>>>>>   #include <sys/vfs.h>
>>>>>>>>>>
>>>>>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
>>>>>>>>>>>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>>> +void set_max_rlimit(void)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>>>>>>>>> +
>>>>>>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>>>>>>
>>>>>>>>>> Do you think it might make sense to print to stderr some warning if
>>>>>>>>>> we actually happen to adjust this limit?
>>>>>>>>>>
>>>>>>>>>> if (getrlimit(MEMLOCK) != RLIM_INFINITY) {
>>>>>>>>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to
>>>>>>>>>>     infinity!\n");
>>>>>>>>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> ?
>>>>>>>>>>
>>>>>>>>>> Because while it's nice that we automatically do this, this might still
>>>>>>>>>> lead to surprises for some users. OTOH, not sure whether people
>>>>>>>>>> actually read those warnings? :-/
>>>>>>>>>
>>>>>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
>>>>>>>>> is desirable.
>>>>>>>>>
>>>>>>>>> Bpftool has raised the rlimit for a long time, it changed only in April,
>>>>>>>>> so I don't think it would come up as a surprise for people who have used
>>>>>>>>> it for a while. I think this is also something that several other
>>>>>>>>> BPF-related applications (BCC I think?, bpftrace, Cilium come to mind)
>>>>>>>>> have been doing too.
>>>>>>>>
>>>>>>>> In this case ignore me and let's continue doing that :-)
>>>>>>>>
>>>>>>>> Btw, eventually we'd still like to stop doing that I'd presume?
>>>>>>>
>>>>>>> Agreed. I was thinking either finding a way to improve the probe in
>>>>>>> libbpf, or waiting for some more time until 5.11 gets old, but this may
>>>>>>> take years :/
>>>>>>>
>>>>>>>> Should
>>>>>>>> we at some point follow up with something like:
>>>>>>>>
>>>>>>>> if (kernel_version >= 5.11) { don't touch memlock; }
>>>>>>>>
>>>>>>>> ?
>>>>>>>>
>>>>>>>> I guess we care only about <5.11 because of the backports, but 5.11+
>>>>>>>> kernels are guaranteed to have memcg.
>>>>>>>
>>>>>>> You mean from uname() and parsing the release? Yes I suppose we could do
>>>>>>> that, can do as a follow-up.
>>>>>>
>>>>>> Yeah, uname-based, I don't think we can do better? Given that probing
>>>>>> is problematic as well :-(
>>>>>> But idk, up to you.
>>>>>>
>>>>>
>>>>> Agreed with the uname-based solution. Another possible solution is to
>>>>> probe the member 'memcg' in struct bpf_map, in case someone may
>>>>> backport memcg-based  memory accounting, but that will be a little
>>>>> over-engineering. The uname-based solution is simple and can work.
>>>>>
>>>>
>>>> Thanks! Yes, memcg would be more complex: the struct is not exposed to
>>>> user space, and BTF is not a hard dependency for bpftool. I'll work on
>>>> the uname-based test as a follow-up to this set.
>>>>
>>>
>>> After a second thought, the uname-based test may not work, because
>>> CONFIG_MEMCG_KMEM can be disabled.
>>
>> Does it matter? Regardless of whether there is memcg or not, we
>> shouldn't touch ulimit on 5.11+
>> If there is no memcg, there is no bpf memory enforcement.
> 
> Right, rlimit-based accounting is totally removed, that is not the
> same with what I thought before, while I thought it will fallback to
> rlimit-based if kmemcg is disabled.

Agreed, and so I've got a patch ready for the uname-based probe.

But talking about this with Daniel, we were wondering if it would make
sense instead to have the probe I had initially submitted (lower the
rlimit to 0, attempt to load a program, reset rlimit - see [0]), but
only for bpftool instead of libbpf? My understanding is that the memlock
rlimit is per-process, right? So this shouldn't affect any other
process, and because bpftool is not multithreaded, nothing other than
probing would happen while the rlimit is at zero? Or is it just simpler,
if less accurate, to stick to the uname probe?

Quentin

[0]
https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/