linux-kernel - Re: [PATCH v2 2/5] mm: add PTE_MARKER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <caf95a99-e975-4f3d-a94b-298a5fc88b5a@suse.cz>
Date: Mon, 21 Oct 2024 16:54:06 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Suren Baghdasaryan <surenb@...gle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, "Paul E . McKenney"
 <paulmck@...nel.org>, Jann Horn <jannh@...gle.com>,
 David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Muchun Song <muchun.song@...ux.dev>,
 Richard Henderson <richard.henderson@...aro.org>,
 Ivan Kokshaysky <ink@...assic.park.msu.ru>, Matt Turner
 <mattst88@...il.com>, Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
 "James E . J . Bottomley" <James.Bottomley@...senpartnership.com>,
 Helge Deller <deller@....de>, Chris Zankel <chris@...kel.net>,
 Max Filippov <jcmvbkbc@...il.com>, Arnd Bergmann <arnd@...db.de>,
 linux-alpha@...r.kernel.org, linux-mips@...r.kernel.org,
 linux-parisc@...r.kernel.org, linux-arch@...r.kernel.org,
 Shuah Khan <shuah@...nel.org>, Christian Brauner <brauner@...nel.org>,
 linux-kselftest@...r.kernel.org, Sidhartha Kumar
 <sidhartha.kumar@...cle.com>, Jeff Xu <jeffxu@...omium.org>,
 Christoph Hellwig <hch@...radead.org>, linux-api@...r.kernel.org,
 John Hubbard <jhubbard@...dia.com>
Subject: Re: [PATCH v2 2/5] mm: add PTE_MARKER_GUARD PTE marker

On 10/21/24 16:33, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
>> On 10/20/24 18:20, Lorenzo Stoakes wrote:
>> > Add a new PTE marker that results in any access causing the accessing
>> > process to segfault.
>> >
>> > This is preferable to PTE_MARKER_POISONED, which results in the same
>> > handling as hardware poisoned memory, and is thus undesirable for cases
>> > where we simply wish to 'soft' poison a range.
>> >
>> > This is in preparation for implementing the ability to specify guard pages
>> > at the page table level, i.e. ranges that, when accessed, should cause
>> > process termination.
>> >
>> > Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
>> > function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
>> > purpose was simply incorrect.
>> >
>> > We then reuse the same logic to determine whether a zap should clear a
>> > guard entry - this should only be performed on teardown and never on
>> > MADV_DONTNEED or the like.
>>
>> Since I would have personally put MADV_FREE among "or the like" here, it's
>> surprising to me that it in fact it's tearing down the guard entries now. Is
>> that intentional? It should be at least mentioned very explicitly. But I'd
>> really argue against it, as MADV_FREE is to me a weaker form of
>> MADV_DONTNEED - the existing pages are not zapped immediately but
>> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
>> shouldn't MADV_FREE too?
> 
> That is not, as I understand it, what MADV_FREE is, semantically. From the
> man pages:
> 
>        MADV_FREE (since Linux 4.5)
> 
>               The application no longer requires the pages in the range
>               specified by addr and len.  The kernel can thus free these
>               pages, but the freeing could be delayed until memory pressure
>               occurs.
> 
>        MADV_DONTNEED
> 
>               Do not expect access in the near future.  (For the time
>               being, the application is finished with the given range, so
>               the kernel can free resources associated with it.)
> 
> MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
> don't expect to use it in the near future'.

I think the description gives a wrong impression. What I think matters it
what happens (limited to anon private case as MADV_FREE doesn't support any
other)

MADV_DONTNEED - pages discarded immediately, further access gives new
zero-filled pages

MADV_FREE - pages prioritized for discarding, if that happens before next
write, it gets zero-filled page on next access, but a write done soon enough
 can cancel the upcoming discard.

In that sense, MADV_FREE is a weaker form of DONTNEED, no?

>>
>> Seems to me rather currently an artifact of MADV_FREE implementation - if it
>> encounters hwpoison entries it will tear them down because why not, we have
>> detected a hw memory error and are lucky the program wants to discard the
>> pages and not access them, so best use the opportunity and get rid of the
>> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
>> could).
> 
> Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
> which matches the description in the manpages that the user _might_ come
> back to the range, whereas MADV_FREE means they are truly done but just
> don't want the overhead of actually unmapping at this point.

But it's also defined what happens if user comes back to the range after a
MADV_FREE. I think the overhead saved happens in the case of actually coming
back soon enough to prevent the discard. With MADV_DONTNEED its immediate
and unconditional.

> Seems to be this is moreso that MADV_FREE is a not-really-as-efficient
> version of what Rik wants to do with his MADV_LAZYFREE thing.

I think that further optimizes MADV_FREE, which is already more optimized
than MADV_DONTNEED.

>>
>> But to extend this to guard PTEs which are result of an explicit userspace
>> action feels wrong, unless the semantics is the same for MADV_DONTEED. The
>> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
>> the same?
> 
> My understanding from the above is that MADV_FREE is a softer version of
> munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
> 'revert state to when I first mapped this stuff because I'm done with it
> for now but might use it later'.

>From the implementation I get the opposite understanding. Neither tears down
the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately,
MADV_FREE effectively too but with a delay depending on memory pressure.