linux-kernel - Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <6f44ddf9-6755-4120-be8b-7a62f0abc0e0@www.fastmail.com>
Date:   Tue, 12 Apr 2022 14:27:52 -0700
From:   "Andy Lutomirski" <luto@...nel.org>
To:     "Jason Gunthorpe" <jgg@...pe.ca>,
        "David Hildenbrand" <david@...hat.com>
Cc:     "Sean Christopherson" <seanjc@...gle.com>,
        "Chao Peng" <chao.p.peng@...ux.intel.com>,
        "kvm list" <kvm@...r.kernel.org>,
        "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        "Linux API" <linux-api@...r.kernel.org>, qemu-devel@...gnu.org,
        "Paolo Bonzini" <pbonzini@...hat.com>,
        "Jonathan Corbet" <corbet@....net>,
        "Vitaly Kuznetsov" <vkuznets@...hat.com>,
        "Wanpeng Li" <wanpengli@...cent.com>,
        "Jim Mattson" <jmattson@...gle.com>,
        "Joerg Roedel" <joro@...tes.org>,
        "Thomas Gleixner" <tglx@...utronix.de>,
        "Ingo Molnar" <mingo@...hat.com>, "Borislav Petkov" <bp@...en8.de>,
        "the arch/x86 maintainers" <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        "Hugh Dickins" <hughd@...gle.com>,
        "Jeff Layton" <jlayton@...nel.org>,
        "J . Bruce Fields" <bfields@...ldses.org>,
        "Andrew Morton" <akpm@...ux-foundation.org>,
        "Mike Rapoport" <rppt@...nel.org>,
        "Steven Price" <steven.price@....com>,
        "Maciej S . Szmigiero" <mail@...iej.szmigiero.name>,
        "Vlastimil Babka" <vbabka@...e.cz>,
        "Vishal Annapurve" <vannapurve@...gle.com>,
        "Yu Zhang" <yu.c.zhang@...ux.intel.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        "Nakajima, Jun" <jun.nakajima@...el.com>,
        "Dave Hansen" <dave.hansen@...el.com>,
        "Andi Kleen" <ak@...ux.intel.com>,
        "Eric W. Biederman" <ebiederm@...ssion.com>,
        "Linux API" <linux-api@...r.kernel.org>
Subject: Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against
 RLIMIT_MEMLOCK

On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered int he
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worth than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations how
> MELOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> iouring is per-user and users a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open for alternative to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert seretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable.  It might even be swappable (in the potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

memory.unmoveable
memory.locked

These can be set to an actual number or they can be set to the special value ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate movable or locked memory with this (and potentially other) new APIs.  If it's 0, then they can't.  If it's another value, then the memory can be allocated, charged to the cgroup, up to the limit, with no particular capability needed.  The default at boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do so.  This avoids introducing a DoS, makes it easy to run tests without configuring cgroup, and lets serious users set up their cgroups.

Nothing is charge per mm.

To make this fully sensible, we need to know what the backend is for the private memory before allocating any so that we can charge it accordingly.