linux-kernel - Re: [PATCH v2 2/5] mm: add PTE_MARKER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cb0e49be-7b4e-4760-884c-8f4bf74ec1e1@redhat.com>
Date: Mon, 21 Oct 2024 18:00:20 +0200
From: David Hildenbrand <david@...hat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Suren Baghdasaryan <surenb@...gle.com>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Matthew Wilcox <willy@...radead.org>, "Paul E . McKenney"
 <paulmck@...nel.org>, Jann Horn <jannh@...gle.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Muchun Song <muchun.song@...ux.dev>,
 Richard Henderson <richard.henderson@...aro.org>,
 Ivan Kokshaysky <ink@...assic.park.msu.ru>, Matt Turner
 <mattst88@...il.com>, Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
 "James E . J . Bottomley" <James.Bottomley@...senpartnership.com>,
 Helge Deller <deller@....de>, Chris Zankel <chris@...kel.net>,
 Max Filippov <jcmvbkbc@...il.com>, Arnd Bergmann <arnd@...db.de>,
 linux-alpha@...r.kernel.org, linux-mips@...r.kernel.org,
 linux-parisc@...r.kernel.org, linux-arch@...r.kernel.org,
 Shuah Khan <shuah@...nel.org>, Christian Brauner <brauner@...nel.org>,
 linux-kselftest@...r.kernel.org, Sidhartha Kumar
 <sidhartha.kumar@...cle.com>, Jeff Xu <jeffxu@...omium.org>,
 Christoph Hellwig <hch@...radead.org>, linux-api@...r.kernel.org,
 John Hubbard <jhubbard@...dia.com>
Subject: Re: [PATCH v2 2/5] mm: add PTE_MARKER_GUARD PTE marker

On 21.10.24 17:33, Lorenzo Stoakes wrote:
> On Mon, Oct 21, 2024 at 04:54:06PM +0200, Vlastimil Babka wrote:
>> On 10/21/24 16:33, Lorenzo Stoakes wrote:
>>> On Mon, Oct 21, 2024 at 04:13:34PM +0200, Vlastimil Babka wrote:
>>>> On 10/20/24 18:20, Lorenzo Stoakes wrote:
>>>>> Add a new PTE marker that results in any access causing the accessing
>>>>> process to segfault.
>>>>>
>>>>> This is preferable to PTE_MARKER_POISONED, which results in the same
>>>>> handling as hardware poisoned memory, and is thus undesirable for cases
>>>>> where we simply wish to 'soft' poison a range.
>>>>>
>>>>> This is in preparation for implementing the ability to specify guard pages
>>>>> at the page table level, i.e. ranges that, when accessed, should cause
>>>>> process termination.
>>>>>
>>>>> Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the
>>>>> function checks the ZAP_FLAG_DROP_MARKER flag so naming it for this single
>>>>> purpose was simply incorrect.
>>>>>
>>>>> We then reuse the same logic to determine whether a zap should clear a
>>>>> guard entry - this should only be performed on teardown and never on
>>>>> MADV_DONTNEED or the like.
>>>>
>>>> Since I would have personally put MADV_FREE among "or the like" here, it's
>>>> surprising to me that it in fact it's tearing down the guard entries now. Is
>>>> that intentional? It should be at least mentioned very explicitly. But I'd
>>>> really argue against it, as MADV_FREE is to me a weaker form of
>>>> MADV_DONTNEED - the existing pages are not zapped immediately but
>>>> prioritized for reclaim. If MADV_DONTNEED leaves guard PTEs in place, why
>>>> shouldn't MADV_FREE too?
>>>
>>> That is not, as I understand it, what MADV_FREE is, semantically. From the
>>> man pages:
>>>
>>>         MADV_FREE (since Linux 4.5)
>>>
>>>                The application no longer requires the pages in the range
>>>                specified by addr and len.  The kernel can thus free these
>>>                pages, but the freeing could be delayed until memory pressure
>>>                occurs.
>>>
>>>         MADV_DONTNEED
>>>
>>>                Do not expect access in the near future.  (For the time
>>>                being, the application is finished with the given range, so
>>>                the kernel can free resources associated with it.)
>>>
>>> MADV_FREE is 'we are completely done with this range'. MADV_DONTNEED is 'we
>>> don't expect to use it in the near future'.
>>
>> I think the description gives a wrong impression. What I think matters it
>> what happens (limited to anon private case as MADV_FREE doesn't support any
>> other)
>>
>> MADV_DONTNEED - pages discarded immediately, further access gives new
>> zero-filled pages
>>
>> MADV_FREE - pages prioritized for discarding, if that happens before next
>> write, it gets zero-filled page on next access, but a write done soon enough
>>   can cancel the upcoming discard.
>>
>> In that sense, MADV_FREE is a weaker form of DONTNEED, no?
>>
>>>>
>>>> Seems to me rather currently an artifact of MADV_FREE implementation - if it
>>>> encounters hwpoison entries it will tear them down because why not, we have
>>>> detected a hw memory error and are lucky the program wants to discard the
>>>> pages and not access them, so best use the opportunity and get rid of the
>>>> PTE entries immediately (if MADV_DONTNEED doesn't do that too, it certainly
>>>> could).
>>>
>>> Right, but we explicitly do not tear them down in the case of MADV_DONTNEED
>>> which matches the description in the manpages that the user _might_ come
>>> back to the range, whereas MADV_FREE means they are truly done but just
>>> don't want the overhead of actually unmapping at this point.
>>
>> But it's also defined what happens if user comes back to the range after a
>> MADV_FREE. I think the overhead saved happens in the case of actually coming
>> back soon enough to prevent the discard. With MADV_DONTNEED its immediate
>> and unconditional.
>>
>>> Seems to be this is moreso that MADV_FREE is a not-really-as-efficient
>>> version of what Rik wants to do with his MADV_LAZYFREE thing.
>>
>> I think that further optimizes MADV_FREE, which is already more optimized
>> than MADV_DONTNEED.
>>
>>>>
>>>> But to extend this to guard PTEs which are result of an explicit userspace
>>>> action feels wrong, unless the semantics is the same for MADV_DONTEED. The
>>>> semantics chosen for MADV_DONTNEED makes sense, so MADV_FREE should behave
>>>> the same?
>>>
>>> My understanding from the above is that MADV_FREE is a softer version of
>>> munmap(), i.e. 'totally done with this range', whereas MADV_DONTNEED is a
>>> 'revert state to when I first mapped this stuff because I'm done with it
>>> for now but might use it later'.
>>
>>  From the implementation I get the opposite understanding. Neither tears down
>> the vma like a proper unmap(). MADV_DONTNEED zaps page tables immediately,
>> MADV_FREE effectively too but with a delay depending on memory pressure.
>>
> 
> OK so based on IRC chat I think the conclusion here is TL;DR yes we have to
> change this, you're right :)
> 
> To summarise for on-list:
> 
> * MADV_FREE, while ostensibly being a 'lazy free' mechanism, has the
>    ability to be 'cancelled' if you write to the memory. Also, after the
>    freeing is complete, you can write to the memory to reuse it, the mapping
>    is still there.
> 
> * For hardware poison markers it makes sense to drop them as you're
>    effectively saying 'I am done with this range that is now unbacked and
>    expect to get an empty page should I use it now'. UFFD WP I am not sure
>    about but presumably also fine.
> 
> * However, guard pages are different - if you 'cancel' and you are left
>    with a block of memory allocated to you by a pthread or userland
>    allocator implementation, you don't want to then no longer be protected
>    from overrunning into other thread memory.

Agreed. What happens on MADV_DONTNEED/MADV_FREE on guard pages? Ignored 
or error? It sounds like a usage "error" to me (in contrast to munmap()).

-- 
Cheers,

David / dhildenb