[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <1b5ce269-bb54-2f65-6919-8b2bb453c09a@redhat.com>
Date: Tue, 27 Jun 2023 12:18:58 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lstoakes@...il.com>
Cc: Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Rapoport <rppt@...nel.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>
Subject: Re: [PATCH] mm/mprotect: allow unfaulted VMAs to be unaccounted on
mprotect()
[...]
>>
>> Yeah, and that needs time and you have to motivate me :)
>>
>
> Beer? ;)
Oh, that always works :)
>
>>> Well the motivator for the initial investigation was rppt playing with
>>> R[WO]X (this came from an #mm irc conversation), however in his case he
>>> will be mapping pages between the two.
>>
>> And that's the scenario I think we care about in practice (actually
>> accessing memory).
[...]
>>> In real-use scenarios, yes fuzzers are a thing, but what comes to mind more
>>> immediately is a process that maps a big chunk of virtual memory PROT_NONE
>>> and uses that as part of an internal allocator.
>>>
>>> If the process then allocates memory from this chunk (mprotect() ->
>>> PROT_READ | PROT_WRITE), which then gets freed without being used
>>> (mprotect() -> PROT_NONE) we hit the issue. For OVERCOMMIT_NEVER this could
>>> become quite an issue more so than the VMA fragmentation.
>>
>> Using mprotect() when allocating/freeing memory in an allocator is already
>> horribly harmful for performance (well, and the #VMAs), so I don't think
>> that scenario is relevant in practice.
>
> Chrome for instance maintains vast memory ranges as PROT_NONE. I've not dug
> into what they're doing, but surely to make use of them they'd need to
> mprotect() or mmap()/mremap() (which maybe is what the intent is)
I suspect they are doing something similar than glibc (and some other
allocators like jemalloc IIRC), because they want to minimze the #VMAs.
>
> But fair point. However I can't imagine m[re]map'ing like this would be
> cheap either, as you're doing the same kind of expensive operations, so the
> general _approach_ seems like it's used in some way in practice.
Usually people access memory and not play mprotect() games for fun :)
>
>>
>> What some allocators (iirc even glibc) do is reserve a bigger area with
>> PROT_NONE and grow the accessible part slowly on demand, discarding freed
>> memory using MADV_DONTNEED. So you essentially end up with two VMAs -- one
>> completely accessible, one completely inaccessible.
>>
>> They don't use mprotect() because:
>> (a) It's bad for performance
>> (b) It might increase the #VMAs
>>
>> There is efence, but I remember it simply does mmap()+munmap() and runs into
>> VMA limits easily just by relying on a lot of mappings.
>>
>>
>>>
>>> In addition, I think a user simply doing the artificial test above would
>>> find the split remaining quite confusing, and somebody debugging some code
>>> like this would equally wonder why it happened, so there is benefit in
>>> clarity too (they of course observing the VMA fragmentation from the
>>> perspective of /proc/$pid/[s]maps).
>>
>> My answer would have been "memory gets commited the first time we allow
>> write access, and that wasn't the case for all memory in that range".
>>
>>
>> Now, take your example above and touch the memory.
>>
>>
>> ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
>> mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
>> *(ptr + page_size) = 1;
>> mprotect(ptr + page_size, page_size, PROT_READ);
>>
>>
>> And we'll not merge the VMAs.
>>
>> Which, at least to me, makes existing handling more consistent.
>
> Indeed, but I don't think it's currently consistent at all.
>
> The 'correct' solution would be to:-
>
> 1. account for the block when it becomes writable
> 2. unaccount for any pages not used when it becomes unwritable
>
I've been messing with something related (but slightly different) for a
while now in my mind, and I'm not at the point where I can talk about my
work/idea yet.
But because I've been messing with it, I can comment on some existing
oddities. Just imagine:
* userfaultfd() can place anon pages even in PROT_NONE areas
* ptrace can place anon pages in PROT_READ areas
* "fun" like the forbidden shared zeropage on s390x in some VMAs can
place anon pages into PROT_READ areas.
It's all far from "correct" when talking about memory accounting. But it
seems to get the job done for the most case for now.
> However since we can't go from vma -> folios for anon pages without some
> extreme effort this is not feasible.
>
> Therefore the existing code hacks it and just keep things accountable.
>
> The patch reduces the hacking so we get halfway to the correct approach.
>
> So before: "if you ever make this read/write, we account it forever"
> After: "if you ever make this read/write and USE IT, we account it forever"
>
"USE" is probably the wrong word. Maybe "MODIFIED", but there are other
cases (MADV_POPULATE_WRITE)
> To me it is more consistent. Of course this is subjective...
>
You made the conditional more complicated to make it consistent, won't
argue with that :)
>>
>> And users could rightfully wonder "why isn't it getting merged". And the
>> answer would be the same: "memory gets commited the first time we allow
>> write access, and that wasn't the case for all memory in that range".
>>
>
> Yes indeed, a bigger answer is that we don't have fine-grained accounting
> for pages for anon_vma.
Yes, VM_ACCOUNT is all-or nothing, which makes a lot of sense in many
cases (not in all, though).
[...]
>>
>>>>> So in practice programs will likely do the PROT_WRITE in order to actually
>>>>> populate the area, so this won't trigger as I commented above. But it can
>>>>> still help in some cases and is cheap to do, so:
>>>>
>>>> IMHO we should much rather look into getting hugetlb ranges merged. Mt
>>>> recollection is that we'll never end up merging hugetlb VMAs once split.
>>>
>>> I'm not sure how that's relevant to fragmented non-hugetlb VMAs though?
>>
>> It's a VMA merging issue that can be hit in practice, so I raised it.
>>
>>
>> No strong opinion from my side, just my 2 cents reading the patch
>> description and wondering "why do we even invest time thinking about this
>> case" -- and eventually make handling less consistent IMHO (see above).
>
> Hmm it seems ilke you have quite a strong opinion :P but this is why I cc-d
> you, as you are a great scrutiniser.
I might make it sound like a strong opinion (because I am challenging
the motivation), but there is no nak :)
>
> Yeah, the time investment was just by accident, the patch was originally a
> throwaway thing to prove the point :]
>
> I very much appreciate your time though! And I owe you at least one beer now.
>
> I would ask that while you might question the value, whether you think it
> so harmful as not to go in, so Andrew can know whether this debate = don't
> take?
>
> An Ack-with-meh would be fine. But also if you want to nak, it's also
> fine. I will buy you the beer either way ;)
It's more a "no nak" -- I don't see the real benefit but I also don't
see the harm (as long as VMA locking is not an issue). If others see the
benefit, great, so I'll let these decide.
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists