Message-ID: <ffa66565-d546-a2cf-1748-38b9992fd5b8@redhat.com>
Date:   Mon, 22 Nov 2021 22:56:24 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Jens Axboe <axboe@...nel.dk>,
        Andrew Dona-Couch <andrew@...acou.ch>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Drew DeVault <sir@...wn.com>
Cc:     Ammar Faizi <ammarfaizi2@...weeb.org>,
        linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
        io_uring Mailing List <io-uring@...r.kernel.org>,
        Pavel Begunkov <asml.silence@...il.com>, linux-mm@...ck.org
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB

On 22.11.21 21:44, Jens Axboe wrote:
> On 11/22/21 1:08 PM, David Hildenbrand wrote:
>> On 22.11.21 20:53, Jens Axboe wrote:
>>> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>>>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>>>> Forgive me for jumping in to an already overburdened thread.  But can
>>>>> someone pushing back on this clearly explain the issue with applying
>>>>> this patch?
>>>>
>>>> It will allow unprivileged users to easily and even "accidentally"
>>>> allocate more unmovable memory than they should in some environments. Such
>>>> limits exist for a reason. And there are ways for admins/distros to
>>>> tweak these limits if they know what they are doing.
>>>
>>> But that's entirely the point, the cases where this change is needed are
>>> already screwed by a distro and the user is the administrator. This is
>>> _exactly_ the case where things should just work out of the box. If
>>> you're managing farms of servers, yeah you have competent administration
>>> and you can be expected to tweak settings to get the best experience and
>>> performance, but the kernel should provide a sane default. 64K isn't a
>>> sane default.
>>
>> 0.1% of RAM isn't either.
> 
> No default is perfect, but 0.1% will solve 99% of the problem. And most
> likely solve 100% of the problems for the important case, which is where
> you want things to Just Work on your distro without doing any
> administration.  If you're aiming for perfection, it doesn't exist.

... and my Fedora is already at 16 MiB *sigh*.
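
For reference, a minimal sketch of how the memlock limit in effect for a
process can be inspected from userspace. It assumes Python's standard
`resource` module on Linux; the helper names `memlock_limits` and `fmt`
are illustrative only and not part of the patch or of any tool mentioned
in this thread.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def fmt(limit):
    """Render a limit in KiB, or 'unlimited' for RLIM_INFINITY."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    # The soft limit is the one enforced; the hard limit is the ceiling
    # an unprivileged process may raise its soft limit to.
    print(f"RLIMIT_MEMLOCK soft={fmt(soft)} hard={fmt(hard)}")
```

The shell builtin `ulimit -l` reports the same soft limit.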

And I'm not aiming for perfection, I'm aiming for as few
FOLL_LONGTERM users as possible ;)

> 
>>>> This is not a step in the right direction. This is all just trying to
>>>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>>>> unprivileged users.
>>>>
>>>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>>>> memlock limit in io_uring altogether, for example, by using mmu
>>>> notifiers. But I'm no expert on the io_uring code.
>>>
>>> You can't use mmu notifiers without impacting the fast path. This isn't
>>> just about io_uring, there are other users of memlock right now (like
>>> bpf) which just makes it even worse.
>>
>> 1) Do we have a performance evaluation? Did someone try and come up with
>> a conclusion on how bad it would be?
> 
> I honestly don't remember the details, I took a look at it about a year
> ago due to some unrelated reasons. These days it just pertains to
> registered buffers, so it's less of an issue than back then when it
> dealt with the rings as well. Hence might be feasible, I'm certainly not
> against anyone looking into it. Easy enough to review and test for
> performance concerns.

That at least sounds promising.

> 
>> 2) Could we provide an mmu variant to ordinary users that's just good
>> enough but maybe not as fast as what we have today? And limit
>> FOLL_LONGTERM to special, privileged users?
> 
> If it's not as fast, then it's most likely not good enough though...

There is always a compromise of course.

See, FOLL_LONGTERM is *the worst* kind of memory allocation thingy you
could possibly do to your MM subsystem. It's absolutely the worst thing
you can do to swap and compaction.

I really don't want random feature X to be next and say "well, io_uring
uses it, so I can just use it for max performance and we'll adjust the
memlock limit, who cares!".

> 
>> 3) Just because there are other memlock users is not an excuse. For
>> example, VFIO/VDPA have to use it for a reason, because there is no way
>> not to use FOLL_LONGTERM.
> 
> It's not an excuse, the statement merely means that the problem is
> _worse_ as there are other memlock users.

Yes, and it will keep getting worse every time we introduce more
FOLL_LONGTERM users that really shouldn't be FOLL_LONGTERM users unless
really required. Again, VFIO/VDPA/RDMA are prime examples, because the
HW forces us to do it. And these are privileged features either way.

> 
>>>
>>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>> like what was suggested, if that will help move things forward. IMHO the
>>> 32MB machine is mostly a theoretical case, but whatever.
>>
>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>
>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>> system completely and deeply mess up the MM. Oh my.
> 
> We're talking per-user limits here. But if you want to talk hyperbole,
> then 64K multiplied by some other random number will also allow
> everything to be pinned, potentially.
> 

Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
the worst case.
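
The arithmetic behind the numbers traded above can be sketched as
follows. This is only an illustration under stated assumptions: the
proposal is read as "0.1% of RAM with a 64 KiB floor" (that is, max()
rather than the min() literally written in the quoted text, since
otherwise the 32MB machine would not be a special case), and the
function names `proposed_limit` and `users_to_pin_all` are hypothetical.

```python
KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB


def proposed_limit(ram_bytes):
    """0.1% of RAM, but never below the current 64 KiB default.

    Assumption: the quoted proposal intends a floor, i.e. max().
    """
    return max(ram_bytes // 1000, 64 * KiB)


def users_to_pin_all(ram_bytes, per_user_limit):
    """Users needed, each at their limit, to cover all of RAM (rounded up)."""
    return -(-ram_bytes // per_user_limit)


if __name__ == "__main__":
    for ram in (32 * MiB, 8 * GiB, 16 * GiB):
        limit = proposed_limit(ram)
        print(
            f"RAM {ram // MiB:>6} MiB: "
            f"limit {limit / MiB:.2f} MiB, "
            f"users to pin everything: {users_to_pin_all(ram, limit)} "
            f"(vs {users_to_pin_all(ram, 64 * KiB)} at 64 KiB)"
        )
```

Under these assumptions, roughly a thousand users at their limit cover
all of RAM with the 0.1% rule regardless of machine size, whereas with a
fixed 64 KiB limit the number of users required grows with RAM.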

-- 
Thanks,

David / dhildenb