linux-kernel - Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup device prog

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230904-harfe-haargenau-4c6cb31c304a@brauner>
Date:   Mon, 4 Sep 2023 13:44:20 +0200
From:   Christian Brauner <brauner@...nel.org>
To:     Alexander Mikhalitsyn <alexander@...alicyn.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Michael Weiß <michael.weiss@...ec.fraunhofer.de>
Cc:     Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
        Quentin Monnet <quentin@...valent.com>,
        Alexander Viro <viro@...iv.linux.org.uk>, bpf@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        gyroidos@...ec.fraunhofer.de, paul@...l-moore.com,
        Miklos Szeredi <miklos@...redi.hu>,
        Amir Goldstein <amir73il@...il.com>
Subject: Re: [PATCH RFC 1/4] bpf: add cgroup device guard to flag a cgroup
 device prog

On Tue, Aug 29, 2023 at 03:35:46PM +0200, Alexander Mikhalitsyn wrote:
> On Fri, Aug 18, 2023 at 12:11 AM Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 10:59:22AM +0200, Christian Brauner wrote:
> > > On Mon, Aug 14, 2023 at 04:26:09PM +0200, Michael Weiß wrote:
> > > > Introduce the BPF_F_CGROUP_DEVICE_GUARD flag for BPF_PROG_LOAD
> > > > which allows to set a cgroup device program to be a device guard.
> > >
> > > Currently we block access to devices unconditionally in may_open_dev().
> > > Anything that's mounted by an unprivileged containers will get
> > > SB_I_NODEV set in s_i_flags.
> > >
> > > Then we currently mediate device access in:
> > >
> > > * inode_permission()
> > >   -> devcgroup_inode_permission()
> > > * vfs_mknod()
> > >   -> devcgroup_inode_mknod()
> > > * blkdev_get_by_dev() // sget()/sget_fc(), other ways to open block devices and friends
> > >   -> devcgroup_check_permission()
> > > * drivers/gpu/drm/amd/amdkfd // weird restrictions on showing gpu info afaict
> > >   -> devcgroup_check_permission()
> > >
> > > All your new flag does is to bypass that SB_I_NODEV check afaict and let
> > > it proceed to the devcgroup_*() checks for the vfs layer.
> > >
> > > But I don't get the semantics yet.
> > > Is that a flag which is set on BPF_PROG_TYPE_CGROUP_DEVICE programs or
> > > is that a flag on random bpf programs? It looks like it would be the
> > > latter but design-wise I would expect this to be a property of the
> > > device program itself.
> >
> > Looks like patch 4 attemps to bypass usual permission checks with:
> > @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
> >         if (error)
> >                 return error;
> >
> > -       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> > -           !capable(CAP_MKNOD))
> > -               return -EPERM;
> > +       /*
> > +        * In case of a device cgroup restirction allow mknod in user
> > +        * namespace. Otherwise just check global capability; thus,
> > +        * mknod is also disabled for user namespace other than the
> > +        * initial one.
> > +        */
> > +       if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> > +               if (devcgroup_task_is_guarded(current)) {
> > +                       if (!ns_capable(current_user_ns(), CAP_MKNOD))
> > +                               return -EPERM;
> > +               } else if (!capable(CAP_MKNOD))
> > +                       return -EPERM;
> > +       }
> >
> 
> Dear colleagues,
> 
> > which pretty much sounds like authoritative LSM that was brought up in the past
> > and LSM folks didn't like it.
> 
> Thanks for pointing this out, Alexei!
> I've searched through the LKML archives and found a thread about this:
> https://lore.kernel.org/all/CAEf4BzaBt0W3sWh_L4RRXEFYdBotzVEnQdqC7BO+PNWtD7eSUA@mail.gmail.com/
> 
> As far as I understand, disagreement here is about a practice of
> skipping kernel-built capability checks based
> on LSM hooks, right?
> 
> +CC Paul Moore <paul@...l-moore.com>
> 
> >
> > If vfs folks are ok with this special bypass of permissions in vfs_mknod()
> > we can talk about kernel->bpf api details.
> > The way it's done with BPF_F_CGROUP_DEVICE_GUARD flag is definitely no go,
> > but no point going into bpf details now until agreement on bypass is made.

Afaiu the original concern was specifically about an LSM allowing to
bypass other LSMs or DAC permissions. But this wouldn't be the case
here. The general inode access LSM permission mediation is separate from
specific device access management: the security_inode_permission() LSM
hook would still be called and thus LSMs restrictions would continue to
apply exactly as they do now.

For cgroup v1 device access management was a cgroup controller with
management interface through files. It then was ported to an eBPF
program attachable to cgroups for cgroup v2. Arguably, it should
probably have been ported to an LSM hook or a separate LSM and untied
from cgroups completely. The confusion here seems to indicate that that
would have been the right way to go.

Because right now device access management seems its own form of
mandatory access control.