linux-kernel - Re: [PATCH] mm: disable `vm.max_map

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5ca7d54b-5ae4-646d-f3a0-9b85129c9ccf@nvidia.com>
Date:   Tue, 28 Nov 2017 21:14:23 -0800
From:   John Hubbard <jhubbard@...dia.com>
To:     Michal Hocko <mhocko@...nel.org>
CC:     Mikael Pettersson <mikpelinux@...il.com>, <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        <linux-fsdevel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

On 11/28/2017 12:12 AM, Michal Hocko wrote:
> On Mon 27-11-17 15:26:27, John Hubbard wrote:
> [...]
>> Let me add a belated report, then: we ran into this limit while implementing 
>> an early version of Unified Memory[1], back in 2013. The implementation
>> at the time depended on tracking that assumed "one allocation == one vma".
> 
> And you tried hard to make those VMAs really separate? E.g. with
> prot_none gaps?

We didn't do that, and in fact I'm probably failing to grasp the underlying 
design idea that you have in mind there...hints welcome...

What we did was to hook into the mmap callbacks in the kernel driver, after 
userspace mmap'd a region (via a custom allocator API). And we had an ioctl
in there, to connect up other allocation attributes that couldn't be passed
through via mmap. Again, this was for regions of memory that were to be
migrated between CPU and device (GPU).

> 
>> So, with only 64K vmas, we quickly ran out, and changed the design to work
>> around that. (And later, the design was *completely* changed to use a separate
>> tracking system altogether). exag
>>
>> The existing limit seems rather too low, at least from my perspective. Maybe
>> it would be better, if expressed as a function of RAM size?
> 
> Dunno. Whenever we tried to do RAM scaling it turned out a bad idea
> after years when memory grown much more than the code author expected.
> Just look how we scaled hash table sizes... But maybe you can come up
> with something clever. In any case tuning this from the userspace is a
> trivial thing to do and I am somehow skeptical that any early boot code
> would trip over the limit.
> 

I agree that this is not a limit that boot code is likely to hit. And maybe 
tuning from userspace really is the right approach here, considering that
there is a real cost to going too large. 

Just philosophically here, hard limits like this seem a little awkward if they 
are set once in, say, 1999 (gross exaggeration here, for effect) and then not
updated to stay with the times, right? In other words, one should not routinely 
need to tune most things. That's why I was wondering if something crude and silly
would work, such as just a ratio of RAM to vma count. (I'm more just trying to
understand the "rules" here, than to debate--I don't have a strong opinion 
on this.)

The fact that this apparently failed with hash tables is interesting, I'd
love to read more if you have any notes or links. I spotted a 2014 LWN article
( https://lwn.net/Articles/612100 ) about hash table resizing, and some commits
that fixed resizing bugs, such as

    12311959ecf8a ("rhashtable: fix shift by 64 when shrinking")

...was it just a storm of bugs that showed up?

thanks,
John Hubbard