Message-ID: <CAKH8qBvzZvHoUpkVPXN-v=XrvdPQ-1tEJOcd=PrGosHbY7+KdA@mail.gmail.com>
Date: Tue, 28 Jun 2022 10:53:45 -0700
From: Stanislav Fomichev <sdf@...gle.com>
To: Quentin Monnet <quentin@...valent.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Yafang Shao <laoar.shao@...il.com>, netdev@...r.kernel.org,
bpf@...r.kernel.org
Subject: Re: [PATCH bpf-next] bpftool: Probe for memcg-based accounting before bumping rlimit
On Tue, Jun 28, 2022 at 9:45 AM Quentin Monnet <quentin@...valent.com> wrote:
>
> Bpftool used to bump the memlock rlimit to make sure to be able to load
> BPF objects. After the kernel has switched to memcg-based memory
> accounting [0] in 5.11, bpftool has relied on libbpf to probe the system
> for memcg-based accounting support and for raising the rlimit if
> necessary [1]. But this was later reverted, because the probe would
> sometimes fail, resulting in bpftool not being able to load all required
> objects [2].
>
> Here we add a more reliable probe, in bpftool itself. We first lower
> the rlimit to 0, then we attempt to load a BPF object (and finally reset
> the rlimit): if the load succeeds, then memcg-based memory accounting is
> supported.
>
> This approach was earlier proposed for the probe in libbpf itself [3],
> but given that the library may be used in multithreaded applications,
> the probe could have undesirable consequences if one thread attempts to
> lock kernel memory while memlock rlimit is at 0. Since bpftool is
> single-threaded and the rlimit is process-based, this is fine to do in
> bpftool itself.
>
> This probe was inspired by the similar one from the cilium/ebpf Go
> library [4].
>
> [0] commit 97306be45fbe ("Merge branch 'switch to memcg-based memory accounting'")
> [1] commit a777e18f1bcd ("bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK")
> [2] commit 6b4384ff1088 ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"")
> [3] https://lore.kernel.org/bpf/20220609143614.97837-1-quentin@isovalent.com/t/#u
> [4] https://github.com/cilium/ebpf/blob/v0.9.0/rlimit/rlimit.go#L39
>
> Cc: Stanislav Fomichev <sdf@...gle.com>
> Cc: Yafang Shao <laoar.shao@...il.com>
> Suggested-by: Daniel Borkmann <daniel@...earbox.net>
> Signed-off-by: Quentin Monnet <quentin@...valent.com>
> ---
> tools/bpf/bpftool/common.c | 71 ++++++++++++++++++++++++++++++++++--
> tools/include/linux/kernel.h | 5 +++
> 2 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
> index a0d4acd7c54a..e07769802f76 100644
> --- a/tools/bpf/bpftool/common.c
> +++ b/tools/bpf/bpftool/common.c
> @@ -13,14 +13,17 @@
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> -#include <linux/limits.h>
> -#include <linux/magic.h>
> #include <net/if.h>
> #include <sys/mount.h>
> #include <sys/resource.h>
> #include <sys/stat.h>
> #include <sys/vfs.h>
>
> +#include <linux/filter.h>
> +#include <linux/limits.h>
> +#include <linux/magic.h>
> +#include <linux/unistd.h>
> +
> #include <bpf/bpf.h>
> #include <bpf/hashmap.h>
> #include <bpf/libbpf.h> /* libbpf_num_possible_cpus */
> @@ -73,11 +76,73 @@ static bool is_bpffs(char *path)
> return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
> }
>
> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
> + * memcg-based memory accounting for BPF maps and programs. This was done in
> + * commit 97306be45fbe ("Merge branch 'switch to memcg-based memory
> + * accounting'"), in Linux 5.11.
> + *
> + * Libbpf also offers to probe for memcg-based accounting vs rlimit, but does
> + * so by checking for the availability of a given BPF helper and this has
> + * failed on some kernels with backports in the past, see commit 6b4384ff1088
> + * ("Revert "bpftool: Use libbpf 1.0 API mode instead of RLIMIT_MEMLOCK"").
> + * Instead, we can probe by lowering the process-based rlimit to 0, trying to
> + * load a BPF object, and resetting the rlimit. If the load succeeds then
> + * memcg-based accounting is supported.
> + *
> + * This would be too dangerous to do in the library, because multithreaded
> + * applications might attempt to load items while the rlimit is at 0. Given
> + * that bpftool is single-threaded, this is fine to do here.
> + */
> +static bool known_to_need_rlimit(void)
> +{
> + const size_t prog_load_attr_sz = offsetofend(union bpf_attr, attach_btf_obj_fd);
nit:
Any specific reason you're hard-coding this size via offsetofend? Why
not use sizeof(union bpf_attr) directly as the syscall/memset size?
The kernel should handle the case where bpftool passes extra zero
padding, right?
> + struct bpf_insn insns[] = {
> + BPF_EXIT_INSN(),
> + };
> + struct rlimit rlim_init, rlim_cur_zero = {};
> + size_t insn_cnt = ARRAY_SIZE(insns);
> + union bpf_attr attr;
> + int prog_fd, err;
> +
> + memset(&attr, 0, prog_load_attr_sz);
> + attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
> + attr.insns = ptr_to_u64(insns);
> + attr.insn_cnt = insn_cnt;
> + attr.license = ptr_to_u64("GPL");
> +
> + if (getrlimit(RLIMIT_MEMLOCK, &rlim_init))
> + return false;
> +
> + /* Drop the soft limit to zero. We maintain the hard limit to its
> + * current value, because lowering it would be a permanent operation
> + * for unprivileged users.
> + */
> + rlim_cur_zero.rlim_max = rlim_init.rlim_max;
> + if (setrlimit(RLIMIT_MEMLOCK, &rlim_cur_zero))
> + return false;
> +
> + /* Do not use bpf_prog_load() from libbpf here, because it calls
> + * bump_rlimit_memlock(), interfering with the current probe.
> + */
> + prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, prog_load_attr_sz);
> + err = errno;
> +
> + /* reset soft rlimit to its initial value */
> + setrlimit(RLIMIT_MEMLOCK, &rlim_init);
> +
> + if (prog_fd < 0)
> + return err == EPERM;
> +
> + close(prog_fd);
> + return false;
> +}
> +
> void set_max_rlimit(void)
> {
> struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
>
> - setrlimit(RLIMIT_MEMLOCK, &rinf);
> + if (known_to_need_rlimit())
> + setrlimit(RLIMIT_MEMLOCK, &rinf);
> }
>
> static int
> diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
> index 4b0673bf52c2..5c90d65cc2d3 100644
> --- a/tools/include/linux/kernel.h
> +++ b/tools/include/linux/kernel.h
> @@ -24,6 +24,11 @@
> #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
> #endif
>
> +#ifndef offsetofend
> +# define offsetofend(TYPE, FIELD) \
> + (offsetof(TYPE, FIELD) + sizeof(((TYPE *)0)->FIELD))
> +#endif
> +
> #ifndef container_of
> /**
> * container_of - cast a member of a structure out to the containing structure
> --
> 2.34.1
>