Message-ID: <40ca86a1-ea36-0185-1ba5-c69005e46d3f@colorfullife.com>
Date: Tue, 4 Jan 2022 21:48:54 +0100
From: Manfred Spraul <manfred@...orfullife.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: Alexey Gladkov <legion@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Linux Containers <containers@...ts.linux.dev>,
Alexander Mikhalitsyn <alexander.mikhalitsyn@...tuozzo.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Christian Brauner <christian.brauner@...ntu.com>,
Daniel Walsh <dwalsh@...hat.com>,
Davidlohr Bueso <dbueso@...e.de>,
Kirill Tkhai <ktkhai@...tuozzo.com>,
Serge Hallyn <serge@...lyn.com>,
Varad Gautam <varad.gautam@...e.com>,
Vasily Averin <vvs@...tuozzo.com>,
kernel test robot <lkp@...el.com>
Subject: Re: [PATCH v2] ipc: Store mqueue sysctls in the ipc namespace

Hello Eric,
On 1/4/22 19:42, Eric W. Biederman wrote:
> Manfred Spraul <manfred@...orfullife.com> writes:
>> Hi Alexey,
>> On 1/4/22 12:51, Alexey Gladkov wrote:
>>> Right now, the mqueue sysctls take ipc namespaces into account in a
>>> rather hacky way. This works in most cases, but does not respect the
>>> user namespace.
>>>
>>> Within the user namespace, the user cannot change the /proc/sys/fs/mqueue/*
>>> parameters. This poses a problem in rootless containers.
>>>
>>> To solve this I changed the implementation of the mqueue sysctls just
>>> like some other sysctls.
>>>
>>> Before this change:
>>>
>>> $ echo 5 | unshare -r -U -i tee /proc/sys/fs/mqueue/msg_max
>>> tee: /proc/sys/fs/mqueue/msg_max: Permission denied
>>> 5
>> Could you crosscheck that all (relevant) allocations in ipc/mqueue.c
>> use GFP_KERNEL_ACCOUNT?
> They are not.
>
>> We should not allow normal users to use up all memory.
>>
>> Otherwise:
>> The idea is good: the limits do not really prevent using up all
>> memory; _ACCOUNT is the better approach.
>> And with _ACCOUNT, it doesn't hurt that the namespace root is able to
>> set limits.
> Saying that the cgroup kernel memory limit is the only thing that
> works, and that it is always better, is silly.
>
>
> First, the cgroup kernel memory limits applied via ACCOUNT allocations
> are not acceptable on several kernel hot paths because they are so
> expensive.
I was not aware that ACCOUNT allocations are very expensive.
OTOH, adding ACCOUNT resolved various out-of-memory crashes, IIRC in
ipc/sem.c and/or ipc/msg.c. But we also do not have an RLIMIT for
ipc/sem.c or ipc/msg.c.
Let me rephrase my question:
When we allow non-root users to write to /proc/sys/fs/mqueue/msg_max,
are there any _relevant_ allocations that bypass _all_ limits?
As you write, we have RLIMIT_MSGQUEUE.
And several allocations in ipc/mqueue.c already use ACCOUNT:
- the messages themselves, via load_msg()/alloc_msg().
- the inodes, via the mqueue_inode_cachep slab cache (SLAB_ACCOUNT).
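For reference, this is roughly what that looks like in the current tree
(paraphrased from ipc/msgutil.c and ipc/mqueue.c, not a verbatim copy):

	/* ipc/msgutil.c, alloc_msg(): payload charged to the memcg */
	msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT);

	/* ipc/mqueue.c, init_mqueue_fs(): inode slab marked SLAB_ACCOUNT */
	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
				sizeof(struct mqueue_inode_info), 0,
				SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT,
				init_once);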
> Further, the memory cgroup kernel memory limit is not always delegated
> to non-root users, which precludes using it in many situations.
>
>
> The RLIMIT_MSGQUEUE limit definitely works, and if I read the kernel
> source correctly it defaults to MQ_BYTES_MAX aka 819200. A limit of
> 800 KiB should prevent using up all of system memory, except on very
> low memory machines.
I'd agree that 800 kB per user is not a relevant amount. But we need to
be certain that there are no loopholes.
I do not see anything relevant, e.g. 0-byte messages should be covered
by mq_maxmsg, since the per-slot overhead is charged regardless of the
payload size (see the sketch below). But perhaps I am overlooking
something.
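For the record, the worst-case footprint charged at queue creation time
is computed roughly like this (paraphrased from mqueue_get_inode() in
ipc/mqueue.c, details trimmed):

	/* per-slot tree overhead, charged even for 0-byte messages */
	mq_treesize = info->attr.mq_maxmsg * sizeof(struct msg_msg) +
		min_t(unsigned int, info->attr.mq_maxmsg, MQ_PRIO_MAX) *
		sizeof(struct posix_msg_tree_node);

	mq_bytes = mq_treesize +
		info->attr.mq_maxmsg * info->attr.mq_msgsize;

mq_bytes is then charged against RLIMIT_MSGQUEUE, so mq_maxmsg bounds
the footprint even when mq_msgsize is 0.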
> So please let's not confuse apples and oranges, and let's use the tools
> in the kernel where they work, and not set them up in contest with each
> other.
>
> Rlimits with generous but real limits in general are good at catching
> when a program misbehaves. The cgroups are better at setting a total
> memory cap. In this case the rlimit cap is low enough that it simply
> should not matter.
>
> What has been fixed with the ucount rlimits is that (barring
> implementation bugs) it is now not possible to create a user namespace
> and escape your rlimits by using multiple users.
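For readers following along: the charge in mqueue_get_inode() now goes
through the namespace-aware ucount counters, roughly like this
(paraphrased, error unwinding trimmed):

	info->ucounts = get_ucounts(current_ucounts());
	if (info->ucounts) {
		long msgqueue;

		spin_lock(&mq_lock);
		msgqueue = inc_rlimit_ucounts(info->ucounts,
				UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
		if (msgqueue == LONG_MAX ||
		    msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
			/* over the limit: undo the charge, fail (-EMFILE) */
			dec_rlimit_ucounts(info->ucounts,
					UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
			ret = -EMFILE;
		}
		spin_unlock(&mq_lock);
	}

The counter is hierarchical over user namespaces, so creating a new
user namespace does not reset the charge.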
I'll try to check the patch in detail in the next few days.
--
Manfred