netdev - Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT

Message-ID: <e9aa57d2-4ce7-23f2-0ba1-ea58f3254353@isovalent.com>
Date:   Tue, 14 Jun 2022 15:20:03 +0100
From:   Quentin Monnet <quentin@...valent.com>
To:     Yafang Shao <laoar.shao@...il.com>,
        Stanislav Fomichev <sdf@...gle.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Harsh Modi <harshmodi@...gle.com>,
        Paul Chaignon <paul@...ium.io>,
        netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] Revert "bpftool: Use libbpf 1.0 API mode
 instead of RLIMIT_MEMLOCK"

2022-06-14 20:37 UTC+0800 ~ Yafang Shao <laoar.shao@...il.com>
> On Sat, Jun 11, 2022 at 1:17 AM Stanislav Fomichev <sdf@...gle.com> wrote:
>>
>> On Fri, Jun 10, 2022 at 10:00 AM Quentin Monnet <quentin@...valent.com> wrote:
>>>
>>> 2022-06-10 09:46 UTC-0700 ~ Stanislav Fomichev <sdf@...gle.com>
>>>> On Fri, Jun 10, 2022 at 9:34 AM Quentin Monnet <quentin@...valent.com> wrote:
>>>>>
>>>>> 2022-06-10 09:07 UTC-0700 ~ sdf@...gle.com
>>>>>> On 06/10, Quentin Monnet wrote:
>>>>>>> This reverts commit a777e18f1bcd32528ff5dfd10a6629b655b05eb8.
>>>>>>
>>>>>>> In commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of
>>>>>>> RLIMIT_MEMLOCK"), we removed the rlimit bump in bpftool, because the
>>>>>>> kernel has switched to memcg-based memory accounting. Thanks to the
>>>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK, we attempted to keep compatibility
>>>>>>> with other systems and ask libbpf to raise the limit for us if
>>>>>>> necessary.
>>>>>>
>>>>>>> How do we know if memcg-based accounting is supported? There is a probe
>>>>>>> in libbpf to check this. But this probe currently relies on the
>>>>>>> availability of a given BPF helper, bpf_ktime_get_coarse_ns(), which
>>>>>>> landed in the same kernel version as the memory accounting change. This
>>>>>>> works in the generic case, but it may fail, for example, if the helper
>>>>>>> function has been backported to an older kernel. This has been observed
>>>>>>> for Google Cloud's Container-Optimized OS (COS), where the helper is
>>>>>>> available but rlimit is still in use. The probe succeeds, the rlimit is
>>>>>>> not raised, and probing features with bpftool, for example, fails.
>>>>>>
>>>>>>> A patch was submitted [0] to update this probe in libbpf, based on what
>>>>>>> the cilium/ebpf Go library does [1]. It would lower the soft rlimit to
>>>>>>> 0, attempt to load a BPF object, and reset the rlimit. But it may induce
>>>>>>> some hard-to-debug flakiness if another process starts, or the current
>>>>>>> application is killed, while the rlimit is reduced, and the approach was
>>>>>>> discarded.
>>>>>>
>>>>>>> As a workaround to ensure that the rlimit bump does not depend on the
>>>>>>> availability of a given helper, we restore the unconditional rlimit bump
>>>>>>> in bpftool for now.
>>>>>>
>>>>>>> [0]
>>>>>>> https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/
>>>>>>> [1] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>>>>>>
>>>>>>> Cc: Yafang Shao <laoar.shao@...il.com>
>>>>>>> Signed-off-by: Quentin Monnet <quentin@...valent.com>
>>>>>>> ---
>>>>>>>   tools/bpf/bpftool/common.c     | 8 ++++++++
>>>>>>>   tools/bpf/bpftool/feature.c    | 2 ++
>>>>>>>   tools/bpf/bpftool/main.c       | 6 +++---
>>>>>>>   tools/bpf/bpftool/main.h       | 2 ++
>>>>>>>   tools/bpf/bpftool/map.c        | 2 ++
>>>>>>>   tools/bpf/bpftool/pids.c       | 1 +
>>>>>>>   tools/bpf/bpftool/prog.c       | 3 +++
>>>>>>>   tools/bpf/bpftool/struct_ops.c | 2 ++
>>>>>>>   8 files changed, 23 insertions(+), 3 deletions(-)
>>>>>>
>>>>>>> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
>>>>>>> index a45b42ee8ab0..a0d4acd7c54a 100644
>>>>>>> --- a/tools/bpf/bpftool/common.c
>>>>>>> +++ b/tools/bpf/bpftool/common.c
>>>>>>> @@ -17,6 +17,7 @@
>>>>>>>   #include <linux/magic.h>
>>>>>>>   #include <net/if.h>
>>>>>>>   #include <sys/mount.h>
>>>>>>> +#include <sys/resource.h>
>>>>>>>   #include <sys/stat.h>
>>>>>>>   #include <sys/vfs.h>
>>>>>>
>>>>>>> @@ -72,6 +73,13 @@ static bool is_bpffs(char *path)
>>>>>>>       return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>>>>>>>   }
>>>>>>
>>>>>>> +void set_max_rlimit(void)
>>>>>>> +{
>>>>>>> +    struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>>>>>>> +
>>>>>>> +    setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>>
>>>>>> Do you think it might make sense to print to stderr some warning if
>>>>>> we actually happen to adjust this limit?
>>>>>>
>>>>>> struct rlimit cur;
>>>>>>
>>>>>> if (!getrlimit(RLIMIT_MEMLOCK, &cur) && cur.rlim_cur != RLIM_INFINITY) {
>>>>>>     fprintf(stderr, "Warning: resetting MEMLOCK rlimit to infinity!\n");
>>>>>>     setrlimit(RLIMIT_MEMLOCK, &rinf);
>>>>>> }
>>>>>>
>>>>>> ?
>>>>>>
>>>>>> Because while it's nice that we automatically do this, this might still
>>>>>> lead to surprises for some users. OTOH, not sure whether people
>>>>>> actually read those warnings? :-/
>>>>>
>>>>> I'm not strictly opposed to a warning, but I'm not completely sure this
>>>>> is desirable.
>>>>>
>>>>> Bpftool has raised the rlimit for a long time; this changed only in April,
>>>>> so I don't think it would come as a surprise to people who have used
>>>>> it for a while. I think several other BPF-related applications (BCC, I
>>>>> think? bpftrace and Cilium come to mind) have been doing this too.
>>>>
>>>> In this case ignore me and let's continue doing that :-)
>>>>
>>>> Btw, eventually we'd still like to stop doing that I'd presume?
>>>
>>> Agreed. I was thinking either finding a way to improve the probe in
>>> libbpf, or waiting for some more time until 5.11 gets old, but this may
>>> take years :/
>>>
>>>> Should
>>>> we at some point follow up with something like:
>>>>
>>>> if (kernel_version >= 5.11) { don't touch memlock; }
>>>>
>>>> ?
>>>>
>>>> I guess we care only about <5.11 because of the backports, but 5.11+
>>>> kernels are guaranteed to have memcg.
>>>
>>> You mean from uname() and parsing the release? Yes I suppose we could do
>>> that, can do as a follow-up.
>>
>> Yeah, uname-based, I don't think we can do better? Given that probing
>> is problematic as well :-(
>> But idk, up to you.
>>
> 
> Agreed with the uname-based solution. Another possible solution is to
> probe for the member 'memcg' in struct bpf_map, in case someone has
> backported memcg-based memory accounting, but that would be a bit
> over-engineered. The uname-based solution is simple and can work.
> 

Thanks! Yes, memcg would be more complex: the struct is not exposed to
user space, and BTF is not a hard dependency for bpftool. I'll work on
the uname-based test as a follow-up to this set.
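
For illustration, here is a minimal sketch of what a uname-based check could look like. It is not the follow-up patch itself. The helper names (parse_release, release_has_memcg_accounting, needs_memlock_bump) are hypothetical, and the 5.11 threshold is the one discussed above. The sketch assumes the release string starts with "major.minor" and falls back to bumping the rlimit whenever it cannot tell.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Parse "major.minor" from a release string such as "5.15.0-56-generic". */
static bool parse_release(const char *release, int *major, int *minor)
{
	return sscanf(release, "%d.%d", major, minor) == 2;
}

/* True if the release string denotes kernel 5.11 or later. */
static bool release_has_memcg_accounting(const char *release)
{
	int major, minor;

	if (!parse_release(release, &major, &minor))
		return false;	/* unknown format: assume rlimit is still in use */
	return major > 5 || (major == 5 && minor >= 11);
}

/* Hypothetical helper: should the tool still raise RLIMIT_MEMLOCK? */
static bool needs_memlock_bump(void)
{
	struct utsname uts;

	if (uname(&uts))
		return true;	/* be conservative if uname() fails */
	return !release_has_memcg_accounting(uts.release);
}
```

With something like this, set_max_rlimit() would only call setrlimit() when needs_memlock_bump() returns true. On an older kernel with memcg-based accounting backported, the check would still bump the rlimit, which is unnecessary there but should be harmless.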

Quentin