netdev - Re: [PATCH bpf-next] bpftool: Probe for memcg-based accounting before bumping rlimit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKH8qBvzZvHoUpkVPXN-v=XrvdPQ-1tEJOcd=PrGosHbY7+KdA@mail.gmail.com>
Date:   Tue, 28 Jun 2022 10:53:45 -0700
From:   Stanislav Fomichev <sdf@...gle.com>
To:     Quentin Monnet <quentin@...valent.com>
Cc:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Yafang Shao <laoar.shao@...il.com>, netdev@...r.kernel.org,
        bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next] bpftool: Probe for memcg-based accounting before
 bumping rlimit

On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet <quentin@...valent.com> wrote:
>
> Bpftool used to bump the memlock rlimit to make sure to be able to load
> BPF objects. After the kernel has switched to memcg-based memory
> accounting [0] in 5.11, bpftool has relied on libbpf to probe the system
> for memcg-based accounting support and for raising the rlimit if
> necessary [1]. But this was later reverted, because the probe would
> sometimes fail, resulting in bpftool not being able to load all required
> objects [2].
>
> Here we add a more efficient probe, in bpftool itself. We first lower
> the rlimit to 0, then we attempt to load a BPF object (and finally reset
> the rlimit): if the load succeeds, then memcg-based memory accounting is
> supported.
>
> This approach was earlier proposed for the probe in libbpf itself [3],
> but given that the library may be used in multithreaded applications,
> the probe could have undesirable consequences if one thread attempts to
> lock kernel memory while memlock rlimit is at 0. Since bpftool is
> single-threaded and the rlimit is process-based, this is fine to do in
> bpftool itself.
>
> This probe was inspired by the similar one from the cilium/ebpf Go
> library [4].
>
> [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'")
> [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
> [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"")
> [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/t/#u
> [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>
> Cc: Stanislav Fomichev <sdf@...gle.com>
> Cc: Yafang Shao <laoar.shao@...il.com>
> Suggested-by: Daniel Borkmann <daniel@...earbox.net>
> Signed-off-by: Quentin Monnet <quentin@...valent.com>
> ---
>  tools/bpf/bpftool/common.c   | 71 ++++++++++++++++++++++++++++++++++--
>  tools/include/linux/kernel.h |  5 +++
>  2 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index a0d4acd7c54a..e07769802f76 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -13,14 +13,17 @@
>  #include <stdlib.h>
>  #include <string.h>
>  #include <unistd.h>
> -#include <linux/limits.h>
> -#include <linux/magic.h>
>  #include <net/if.h>
>  #include <sys/mount.h>
>  #include <sys/resource.h>
>  #include <sys/stat.h>
>  #include <sys/vfs.h>
>
> +#include <linux/filter.h>
> +#include <linux/limits.h>
> +#include <linux/magic.h>
> +#include <linux/unistd.h>
> +
>  #include <bpf/bpf.h>
>  #include <bpf/hashmap.h>
>  #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
> @@ -73,11 +76,73 @@ static bool is_bpffs(char *path)
>         return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
>  }
>
> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> + * memcg-based memory accounting for BPF maps and programs. This was done in
> + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory
> + * accounting'"), in Linux 5.11.
> + *
> + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does
> + * so by checking for the availability of a given BPF helper and this has
> + * failed on some kernels with backports in the past, see commit 6b4384ff1088
> + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"").
> + * Instead, we can probe by lowering the process-based rlimit to 0, trying to
> + * load a BPF object, and resetting the rlimit. If the load succeeds then
> + * memcg-based accounting is supported.
> + *
> + * This would be too dangerous to do in the library, because multithreaded
> + * applications might attempt to load items while the rlimit is at 0. Given
> + * that bpftool is single-threaded, this is fine to do here.
> + */
> +static bool known_to_need_rlimit(void)
> +{
> +       const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);

nit:
Any specific reason you're hard coding this sz via offseofend? Why not
use sizeof(bpf_attr) directly as a syscall/memset size?
The kernel should handle all these cases where bpftool has extra zero
padding, right?

> +       struct bpf_insn insns[] = {
> +               BPF_EXIT_INSN(),
> +       };
> +       struct rlimit rlim_init, rlim_cur_zero = {};
> +       size_t insn_cnt = ARRAY_SIZE(insns);
> +       union bpf_attr attr;
> +       int prog_fd, err;
> +
> +       memset(&attr, 0, prog_load_attr_sz);
> +       attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> +       attr.insns = ptr_to_u64(insns);
> +       attr.insn_cnt = insn_cnt;
> +       attr.license = ptr_to_u64("GPL");
> +
> +       if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
> +               return false;
> +
> +       /* Drop the soft limit to zero. We maintain the hard limit to its
> +        * current value, because lowering it would be a permanent operation
> +        * for unprivileged users.
> +        */
> +       rlim_cur_zero.rlim_max = rlim_init.rlim_max;
> +       if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
> +               return false;
> +
> +       /* Do not use bpf_prog_load() from libbpf here, because it calls
> +        * bump_rlimit_memlock(), interfering with the current probe.
> +        */
> +       prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, prog_load_attr_sz);
> +       err = errno;
> +
> +       /* reset soft rlimit to its initial value */
> +       setrlimit(RLIMIT_MEMLOCK, &rlim_init);
> +
> +       if (prog_fd < 0)
> +               return err == EPERM;
> +
> +       close(prog_fd);
> +       return false;
> +}
> +
>  void set_max_rlimit(void)
>  {
>         struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>
> -       setrlimit(RLIMIT_MEMLOCK, &rinf);
> +       if (known_to_need_rlimit())
> +               setrlimit(RLIMIT_MEMLOCK, &rinf);
>  }
>
>  static int
> diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
> index 4b0673bf52c2..5c90d65cc2d3 100644
> --- a/tools/include/linux/kernel.h
> +++ b/tools/include/linux/kernel.h
> @@ -24,6 +24,11 @@
>  #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
>  #endif
>
> +#ifndef offsetofend
> +# define offsetofend(TYPE, FIELD) \
> +       (offsetof(TYPE, FIELD) + sizeof(((TYPE *)0)->FIELD))
> +#endif
> +
>  #ifndef container_of
>  /**
>   * container_of - cast a member of a structure out to the containing structure
> --
> 2.34.1
>