linux-kernel - Re: [PATCH 0/2] capability: Introduce CAP_BLOCK

Message-ID: <60f6f1f0-4918-5fea-9827-9bf9d1e496e3@linux.alibaba.com>
Date:   Tue, 23 May 2023 11:05:38 +0800
From:   Tianjia Zhang <tianjia.zhang@...ux.alibaba.com>
To:     Casey Schaufler <casey@...aufler-ca.com>,
        Serge Hallyn <serge@...lyn.com>,
        Paul Moore <paul@...l-moore.com>,
        Stephen Smalley <stephen.smalley.work@...il.com>,
        Eric Paris <eparis@...isplace.org>,
        Frederick Lawler <fred@...udflare.com>,
        Jens Axboe <axboe@...nel.dk>,
        Joseph Qi <joseph.qi@...ux.alibaba.com>,
        linux-security-module@...r.kernel.org, selinux@...r.kernel.org,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        louxiao.lx@...baba-inc.com
Subject: Re: [PATCH 0/2] capability: Introduce CAP_BLOCK_ADMIN



On 5/23/23 3:13 AM, Casey Schaufler wrote:
> On 5/21/2023 7:53 PM, Tianjia Zhang wrote:
>> Hi Casey,
>>
>> On 5/18/23 8:01 AM, Casey Schaufler wrote:
>>> On 5/16/2023 5:05 AM, Tianjia Zhang wrote:
>>>> Hi Casey,
>>>>
>>>> On 5/12/23 12:17 AM, Casey Schaufler wrote:
>>>>> On 5/11/2023 12:05 AM, Tianjia Zhang wrote:
>>>>>> Separate the fine-grained capability CAP_BLOCK_ADMIN from CAP_SYS_ADMIN.
>>>>>> For backward compatibility, the CAP_BLOCK_ADMIN capability is
>>>>>> included within CAP_SYS_ADMIN.
>>>>>>
>>>>>> Some database products rely on shared storage to complete the
>>>>>> write-once-read-multiple and write-multiple-read-multiple functions.
>>>>>> When HA occurs, they rely on the PR (Persistent Reservations) protocol
>>>>>> provided by the storage layer to manage block device permissions to
>>>>>> ensure data correctness.
>>>>>>
>>>>>> The PR implementation for block devices in the Linux kernel currently
>>>>>> requires CAP_SYS_ADMIN, which carries too many sensitive permissions
>>>>>> and may lead to risks such as container escape. The kernel needs to
>>>>>> provide more fine-grained permission management, like CAP_NET_ADMIN,
>>>>>> so that online products do not have to rely directly on root to run.
>>>>>>
>>>>>> CAP_BLOCK_ADMIN can also provide support for other block device
>>>>>> operations that require CAP_SYS_ADMIN capabilities in the future,
>>>>>> ensuring that applications run with least privilege.
>>>>>
>>>>> Can you demonstrate that there are cases where a program that needs
>>>>> CAP_BLOCK_ADMIN does not also require CAP_SYS_ADMIN for other
>>>>> operations?
>>>>> How much of what's allowed by CAP_SYS_ADMIN would be allowed by
>>>>> CAP_BLOCK_ADMIN? If use of a new capability is rare it's difficult to
>>>>> justify.
>>>>>
>>>>
>>>> For the previous non-container scenarios, the block device is a shared
>>>> device, because the business-system generally operates the file system
>>>> on the block. Therefore, directly operating the block device has a high
>>>> probability of affecting other processes on the same host, and it is a
>>>> reasonable requirement to need the CAP_SYS_ADMIN capability.
>>>>
>>>> But for a database running in a container scenario, especially a
>>>> container scenario on the cloud, it is likely that a container
>>>> exclusively occupies a block device. That is to say, for a container,
>>>> its access to the block device will not affect other processes, so
>>>> there is no need to obtain the broader CAP_SYS_ADMIN capability.
>>>
>>> If I understand correctly, you're saying that the process that requires
>>> CAP_BLOCK_ADMIN in the container won't also require CAP_SYS_ADMIN for
>>> other operations.
>>>
>>> That's good, but it isn't clear how a process on bare metal would
>>> require CAP_SYS_ADMIN while the same process in a container wouldn't.
>>>
>>>>
>>>> For a file system similar to distributed write-once-read-many,
>>>> recovery must be correct: when recovery occurs, it is necessary to
>>>> ensure that no in-flight I/O completes after recovery.
>>>>
>>>> This can be guaranteed by performing operations such as SCSI/NVME
>>>> Persistent Reservations on block devices on the distributed file
>>>> system.
>>>
>>> Does your cloud based system always run "real" devices? My
>>> understanding is that cloud based deployment usually uses
>>> virtual machines and virtio or other simulated devices.
>>> A container deployment in the cloud seems unlikely to be able
>>> to take advantage of block administration. But I can't say
>>> I know the specifics of your environment.
>>>
>>>> Therefore, at present, it is only necessary to have the relevant
>>>> permission support of the control command of such container-exclusive
>>>> block devices.
>>>
>>> This looks like an extremely special case in which breaking out
>>> block management would make sense.
>>>
>> Our scenario is like this. In simple terms, a distributed database has
>> a read-write instance and one or more read-only instances. Each instance
>> runs in an isolated container. All containers share the same block
>> device.
>>
>> In addition to the database instance, there is also a control program
>> running on the control plane in the container. The database ensures
>> the correctness of the data through the PR (Persistent Reservations)
>> of the block device. This operation is also the only operation in the
>> container that requires CAP_SYS_ADMIN privileges.
>>
>> For this system as a whole, there is little difference between running
>> on a VM and running on bare metal.
>>
>> In order to support the PR of block devices, we need to grant
>> CAP_SYS_ADMIN permissions to the container, which not only greatly
>> increases the risk of container escape, but also forces us to configure
>> the permissions of the container very carefully. Many container escapes
>> that have occurred were caused by exactly these issues.
>>
>> This is essentially a problem of permission isolation. We hope to
>> split off the smallest possible set of permissions from CAP_SYS_ADMIN
>> to support the necessary operations, and to avoid granting CAP_SYS_ADMIN
>> to containers as much as possible.
> 
> Your use case is interesting, but not compelling. While you may have
> come up with a specific case where you can completely break CAP_BLOCK_ADMIN
> out from CAP_SYS_ADMIN, it's hardly general.
> 

That is a pity. Thanks for your reply; we will try to provide support
through self-developed patches first.

Kind regards,
Tianjia
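
For readers following the capability discussion above, here is a minimal sketch of how a process inside a container could check whether it currently holds CAP_SYS_ADMIN in its effective set. It is not part of the patch series. The helper names has_cap and read_cap_eff are hypothetical. The bit numbers come from include/uapi/linux/capability.h. The script assumes a Linux /proc filesystem.

```python
# Illustrative sketch only; helper names are hypothetical and not from the patches.

# Capability bit numbers as defined in include/uapi/linux/capability.h.
CAP_NET_ADMIN = 12
CAP_SYS_ADMIN = 21


def has_cap(mask_hex: str, cap: int) -> bool:
    """Return True if capability bit `cap` is set in a hex mask such as CapEff."""
    return bool(int(mask_hex, 16) & (1 << cap))


def read_cap_eff(status_path: str = "/proc/self/status") -> str:
    """Return the CapEff hex string of the current process (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff line not found in " + status_path)


if __name__ == "__main__":
    try:
        eff = read_cap_eff()
    except (OSError, RuntimeError) as exc:
        print("could not read effective capabilities:", exc)
    else:
        print("CapEff:", eff)
        print("CAP_SYS_ADMIN held:", has_cap(eff, CAP_SYS_ADMIN))
        print("CAP_NET_ADMIN held:", has_cap(eff, CAP_NET_ADMIN))
```

Run inside a container, this shows whether the runtime has granted CAP_SYS_ADMIN, the capability the thread says the PR operation needs. It only inspects the mask and does not issue any block device ioctl.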