Message-ID: <f63146c0-0316-4182-9a62-48dfc39f27b7@redhat.com>
Date: Wed, 5 Mar 2025 20:35:36 +0100
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Matthew Wilcox <willy@...radead.org>, SeongJae Park <sj@...nel.org>,
"Liam R. Howlett" <howlett@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Shakeel Butt <shakeel.butt@...ux.dev>, Vlastimil Babka <vbabka@...e.cz>,
kernel-team@...a.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED
and MADV_FREE
On 05.03.25 20:26, Lorenzo Stoakes wrote:
> On Wed, Mar 05, 2025 at 08:19:41PM +0100, David Hildenbrand wrote:
>> On 05.03.25 19:56, Matthew Wilcox wrote:
>>> On Wed, Mar 05, 2025 at 10:15:55AM -0800, SeongJae Park wrote:
>>>> For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
>>>> can happen for each vma of the given address ranges. Because such tlb
>>>> flushes are for address ranges of the same process, doing them in a
>>>> batch is more efficient while still being safe. Modify the madvise()
>>>> and process_madvise() entry-level code paths to do such batched tlb
>>>> flushes, while the internal unmap logic only gathers the tlb entries
>>>> to flush.
>>>
>>> Do real applications actually do madvise requests that span multiple
>>> VMAs? It just seems weird to me. Like, each vma comes from a separate
>>> call to mmap [1], so why would it make sense for an application to
>>> call madvise() across a VMA boundary?
>>
>> I had the same question. If this happens in an app, I would assume that a
>> single MADV_DONTNEED call would usually not span multiple VMAs, and if it
>> does, not so many (and not so often) that we would really care about it.
>>
>> OTOH, optimizing tlb flushing when using a vectored MADV_DONTNEED version
>> would make more sense to me. I don't recall if process_madvise() allows for
>> that already, and if it does, is this series primarily tackling that
>> optimization?
>
> Yeah it's weird, but people can get caught out by unexpected failures to merge
> if they do fun stuff with mremap().
>
> Then again mremap() itself _mandates_ that you only span a single VMA (or part
> of one) :)
Maybe some garbage-collection use cases that shuffle individual pages and
later free larger chunks using MADV_DONTNEED. That doesn't sound unlikely.
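
As for the vectored process_madvise() variant I mentioned above, I'd
expect a call to look roughly like the following. Untested sketch; it
assumes <sys/syscall.h> defines SYS_pidfd_open/SYS_process_madvise, and a
kernel that accepts MADV_DONTNEED via process_madvise() at all (that's
exactly the part I don't recall; without support you'll just get EINVAL):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2 * 1024 * 1024;
        char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct iovec iov[2] = {
                { .iov_base = a, .iov_len = len },
                { .iov_base = b, .iov_len = len },
        };
        /* pidfd for ourselves; self-madvise needs no extra privileges. */
        int pidfd = syscall(SYS_pidfd_open, getpid(), 0);
        ssize_t ret;

        if (a == MAP_FAILED || b == MAP_FAILED || pidfd < 0) {
                perror("setup");
                return 1;
        }

        /* One syscall covering both (unrelated) ranges. */
        ret = syscall(SYS_process_madvise, pidfd, iov, 2, MADV_DONTNEED, 0);
        if (ret < 0)
                perror("process_madvise");
        else
                printf("advised %zd bytes\n", ret);
        return 0;
}

That would be the case where batching the TLB flushes across ranges
obviously pays off.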
>
> Can we talk about the _true_ horror show - that you can span multiple VMAs
> _with gaps_ and it'll allow you to, only it'll return -ENOMEM at the end?
>
> In madvise_walk_vmas():
>
> for (;;) {
>         ...
>
>         if (start < vma->vm_start) {
>                 unmapped_error = -ENOMEM;
>                 start = vma->vm_start;
>                 ...
>         }
>
>         ...
>
>         error = visit(vma, &prev, start, tmp, arg);
>         if (error)
>                 return error;
>
>         ...
> }
>
> return unmapped_error;
>
> So, you have no idea if that -ENOMEM is due to a gap, or due to the
> operation returning an -ENOMEM?
>
> I mean, can we just drop this? Does anybody in their right mind rely on
> this? Or is it intentional, to somehow deal with a racing unmap?
>
> But, no, we hold an mmap lock so that's not it.
Races could still happen if user space did this from separate threads.
But then, who would prevent user space from doing another mmap() and
getting pages zapped ... so that sounds unlikely.
>
> Yeah OK so can we drop this madness? :) or am I missing some very important
> detail about why we allow this?
I stumbled over that myself a while ago. It's well documented behavior
in the man page :(
At that point I stopped caring, because apparently somebody else cared
enough to document that clearly in the man page :)
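
For reference, the documented behavior is easy to reproduce from user
space. Minimal, untested sketch (punch a hole into an anonymous mapping,
then MADV_DONTNEED across the whole thing):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long ps = sysconf(_SC_PAGESIZE);
        size_t len = 16 * ps;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0xaa, len);

        /* Unmap the middle: mapped / gap / mapped. */
        if (munmap(p + 4 * ps, 4 * ps)) {
                perror("munmap");
                return 1;
        }

        /* Spans a gap: returns -ENOMEM, but the mapped parts get zapped. */
        if (madvise(p, len, MADV_DONTNEED))
                printf("madvise: %s\n", strerror(errno));
        printf("first byte afterwards: %d (0 means it was zapped)\n", p[0]);
        return 0;
}

So that -ENOMEM only tells you that there was a gap somewhere (or, as
noted above, that the operation itself failed); nothing gets undone.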
>
> I guess we _have_ to leave in spanning multiple VMAs, because plausibly
> there are users of that out there?
Spanning multiple VMAs can probably happen easily. At least in QEMU I
know some sane ways to trigger it on guest memory. But those are all
corner cases, so nothing relevant for performance.
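
FWIW, even without QEMU, a single mmap()ed range trivially ends up as
multiple VMAs. Untested sketch, purely to illustrate (not the QEMU case):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long ps = sysconf(_SC_PAGESIZE);
        size_t len = 16 * ps;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /* Different protection in the middle splits the area into 3 VMAs. */
        mprotect(p + 4 * ps, 4 * ps, PROT_READ);

        /* A single call that now crosses two VMA boundaries. */
        if (madvise(p, len, MADV_DONTNEED))
                perror("madvise");
        return 0;
}

Whether something like that ever shows up in a profile is a different
question, of course.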
--
Cheers,
David / dhildenb