Message-Id: <20230814-devcg_guard-v1-0-654971ab88b1@aisec.fraunhofer.de>
Date: Mon, 14 Aug 2023 16:26:08 +0200
From: Michael Weiß <michael.weiss@...ec.fraunhofer.de>
To: Alexander Mikhalitsyn <alexander@...alicyn.com>,
Christian Brauner <brauner@...nel.org>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...gle.com>,
Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Quentin Monnet <quentin@...valent.com>,
Alexander Viro <viro@...iv.linux.org.uk>
Cc: bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, gyroidos@...ec.fraunhofer.de,
Michael Weiß <michael.weiss@...ec.fraunhofer.de>
Subject: [PATCH RFC 0/4] bpf: cgroup device guard for non-initial user
namespace
Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD,
which marks a cgroup device program as a device guard.
A device guard may be used to guard actions on device nodes in a
non-initial userns, e.g., mknod.
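
For illustration, the decision a cgroup device program makes can be modeled
in plain user space C. This is a minimal sketch, not code from this series:
the DEVCG_* macros and struct dev_ctx are local mirrors of the
BPF_DEVCG_* constants and struct bpf_cgroup_dev_ctx from <linux/bpf.h>,
and the policy (only /dev/null, char 1:3) is a hypothetical example.

```c
#include <stdint.h>

/* Local mirrors of the BPF_DEVCG_* UAPI constants (illustration only). */
#define DEVCG_ACC_MKNOD (1u << 0)
#define DEVCG_ACC_READ  (1u << 1)
#define DEVCG_ACC_WRITE (1u << 2)
#define DEVCG_DEV_BLOCK (1u << 0)
#define DEVCG_DEV_CHAR  (1u << 1)

/* Same field layout as struct bpf_cgroup_dev_ctx. */
struct dev_ctx {
	uint32_t access_type; /* (DEVCG_ACC_* << 16) | DEVCG_DEV_* */
	uint32_t major;
	uint32_t minor;
};

/*
 * Hypothetical policy: permit mknod/read/write on /dev/null (char 1:3)
 * and nothing else. Returns 1 to allow and 0 to deny, following the
 * return convention of a cgroup device program.
 */
static int devcg_allow(const struct dev_ctx *ctx)
{
	uint32_t dev_type = ctx->access_type & 0xFFFF;
	uint32_t access = ctx->access_type >> 16;
	const uint32_t allowed =
		DEVCG_ACC_MKNOD | DEVCG_ACC_READ | DEVCG_ACC_WRITE;

	if (dev_type != DEVCG_DEV_CHAR)
		return 0;
	if (ctx->major != 1 || ctx->minor != 3)
		return 0;
	/* Deny if any access bit outside the allowed set is requested. */
	return (access & ~allowed) == 0;
}
```

A real program would carry this logic in a BPF_PROG_TYPE_CGROUP_DEVICE
program; with this series, the container manager would additionally pass
BPF_F_CGROUP_DEVICE_GUARD when loading it.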
If a container manager restricts its unprivileged (user-namespaced)
children with a device cgroup, it is no longer necessary to deny
mknod. User space applications may then map devices to different
locations in the file system by calling mknod() inside the container.
A use case for this, which we also rely on in GyroidOS, is running
virsh for VMs inside an unprivileged container. virsh creates device
nodes, e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which currently
fails in a non-initial userns, even if a cgroup device whitelist with
the corresponding major and minor of /dev/null exists. Thus, in this
case the usual bind mounts or pre-populated device nodes under /dev
are not sufficient.
To lift this limitation, we allow mknod() in the VFS if a BPF cgroup
device guard is enabled for the current task; in that case CAP_MKNOD
is checked against the current user namespace instead of the init
userns. To avoid unusable device nodes on file systems mounted in a
non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
flag for cgroup device guarded tasks.
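
The two checks described above can be summarized as a user space model.
This is only a sketch of the intended semantics, not the kernel code of
this series: struct task_model and both function names are hypothetical,
and the model assumes that the MNT_NODEV mount flag is still honored,
since only SB_I_NODEV is relaxed for guarded tasks.

```c
#include <stdbool.h>

/* Hypothetical model of the relevant task state; not a kernel struct. */
struct task_model {
	bool device_guard;         /* cgroup device guard enabled for task */
	bool cap_mknod_init_ns;    /* CAP_MKNOD in the initial userns */
	bool cap_mknod_current_ns; /* CAP_MKNOD in the task's own userns */
};

/*
 * Creating a char or block device node: a guarded task needs CAP_MKNOD
 * only in its current userns, an unguarded task needs it in the init
 * userns.
 */
static bool model_may_mknod_dev(const struct task_model *t)
{
	if (t->device_guard)
		return t->cap_mknod_current_ns;
	return t->cap_mknod_init_ns;
}

/*
 * Opening a device node: MNT_NODEV always blocks the open (assumption),
 * SB_I_NODEV blocks it only for tasks without a device guard.
 */
static bool model_may_open_dev(const struct task_model *t,
			       bool mnt_nodev, bool sb_i_nodev)
{
	if (mnt_nodev)
		return false;
	if (sb_i_nodev && !t->device_guard)
		return false;
	return true;
}
```

The model leaves out the verdict of the guard program itself, which
still has to permit the requested major and minor.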
Tested with a GyroidOS container created by cmld, using the
following user space patch: https://github.com/gyroidos/cml/pull/394
I discussed this internally with Christian in the UAPI group earlier.
I am posting it to the public list now, since the LXC/LXD folks have
also expressed interest in this.
This series applies to the latest mainline v6.5-rc6 tag.
Signed-off-by: Michael Weiß <michael.weiss@...ec.fraunhofer.de>
---
Michael Weiß (4):
bpf: add cgroup device guard to flag a cgroup device prog
bpf: provide cgroup_device_guard in bpf_prog_info to user space
device_cgroup: wrapper for bpf cgroup device guard
fs: allow mknod in non-initial userns using cgroup device guard
fs/namei.c | 19 ++++++++++++++++---
include/linux/bpf-cgroup.h | 7 +++++++
include/linux/bpf.h | 1 +
include/linux/device_cgroup.h | 7 +++++++
include/uapi/linux/bpf.h | 8 +++++++-
kernel/bpf/cgroup.c | 30 ++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 6 +++++-
security/device_cgroup.c | 10 ++++++++++
tools/bpf/bpftool/prog.c | 2 ++
tools/include/uapi/linux/bpf.h | 8 +++++++-
10 files changed, 92 insertions(+), 6 deletions(-)
---
base-commit: 2ccdd1b13c591d306f0401d98dedc4bdcd02b421
change-id: 20230814-devcg_guard-5398ef84bf7b
Best regards,
--
Michael Weiß <michael.weiss@...ec.fraunhofer.de>