linux-kernel - Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <237fd94f-0062-124c-6317-76fc4207ccd7@redhat.com>
Date:   Fri, 14 Jul 2023 16:12:38 +0200
From:   David Hildenbrand <david@...hat.com>
To:     "Yin, Fengwei" <fengwei.yin@...el.com>,
        Yu Zhao <yuzhao@...gle.com>, Zi Yan <ziy@...dia.com>
Cc:     Minchan Kim <minchan@...nel.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
        willy@...radead.org, ryan.roberts@....com, shy828301@...il.com,
        "Vishal Moola (Oracle)" <vishal.moola@...il.com>
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range()
 support large folio

On 14.07.23 15:58, Yin, Fengwei wrote:
> 
> 
> On 7/14/2023 5:25 PM, David Hildenbrand wrote:
>>
>> (1) We're unmapping a single subpage, the compound_mapcount == 0
>>      and the total_mapcount > 0. If the subpage mapcount is now 0, add it
>>      to the deferred split queue.
>>
>> (2) We're unmapping a complete folio (PMD mapping / compound), the
>>      compound_mapcount is 0 and the total_mapcount > 0.
>>
>>   (a) If total mapcount < folio_nr_pages, add it
>>       to the deferred split queue.
>>
>>   (b) If total mapcount >= folio_nr_pages , we have to scan all subpage
>>       mapcounts. If any subpage mapcount == 0, add it to the deferred
>>       split queue.
>>
>>
>> (b) is a bit nasty. It would happen when we fork() with a PMD-mapped THP, the parent splits the THP due to COW, and then our child unmaps or splits the PMD-mapped THP (unmap easily happening during exec()). Fortunately, we'd only scan once when unmapping the PMD.
>>
>>
>> Getting rid of the subpage mapcount usage in (1) would mean that we have to do exactly what we do in (2). But then we'd need to ha handle (2) (B) differently as well.
>>
>> So, for 2 (b) we would either need some other heuristic, or we add it to the deferred split queue more frequently and let that one detect using an rmap walk "well, every subpage is still mapped, let's abort the split".
> 
> Or another option for 2 (b): don't add it to deferred split queue. We
> know the folio in deferred list is mainly scanned when system needs to
> reclaim memory.
> 
> Maybe it's better to let page reclaim choose the large folio here because
> page reclaiming will call folio_referenced() which does rmap_walk()->pvmw().
> And we can reuse rmap_walk() in folio_referenced() to detect whether there
> are pages of folio are not mapped. If it's the case, we can split it then.
> 
> Comparing to deferred list, the advantage is that folio_referenced() doesn't
> unmap page. While in deferred list, we need to add extra rmap_walk() to
> check whether there is page not mapped.

Right, I also came to the conclusion that the unmapping is undesirable. 
However, once benefit of the unmap is that we know when to stop scanning 
(due to page_mapped()). But maybe the temporary unmapping is actually 
counter-productive.

> 
> Just a thought. I could miss something here. Thanks.

Interesting idea.

I also had the thought that adding folios to the deferred split queue 
when removing the rmap is semantically questionable.

Yes, we remove the rmap when zapping/unmapping a pte/pmd. But we also 
(eventually only temporarily!) unmap when splitting a THP or when 
installing migration entries.

Maybe we can flag folios when zapping PTEs that they are a reasonable 
candidate to tell page reclaim code "this one was partially zapped, 
maybe there is some memory to reclaim there, now". Maybe that involves 
the deferred split queue.

-- 
Cheers,

David / dhildenb