Message-ID: <e4d167fe-cb1e-41d1-a144-00bfa14b7148@gmail.com>
Date: Fri, 7 Jun 2024 11:24:13 +0100
From: Usama Arif <usamaarif642@...il.com>
To: David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org,
shakeel.butt@...ux.dev, yosryahmed@...gle.com, willy@...radead.org
Cc: hannes@...xchg.org, nphamcs@...il.com, chengming.zhou@...ux.dev,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled
On 04/06/2024 13:43, David Hildenbrand wrote:
> On 04.06.24 14:30, David Hildenbrand wrote:
>> On 04.06.24 12:58, Usama Arif wrote:
>>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>>> Rather than reading/writing these pages to flash resulting
>>> in increased I/O and flash wear, the pte can be cleared for those
>>> addresses at unmap time while shrinking folio list. When this
>>> causes a page fault, do_pte_missing will take care of this page.
>>> With this patch, NVMe writes in Meta server fleet decreased
>>> by almost 10% with conventional swap setup (zswap disabled).
>>>
>>> [1]
>>> https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>>> ---
>>> include/linux/rmap.h | 1 +
>>> mm/rmap.c | 163
>>> ++++++++++++++++++++++---------------------
>>> mm/vmscan.c | 89 ++++++++++++++++-------
>>> 3 files changed, 150 insertions(+), 103 deletions(-)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index bb53e5920b88..b36db1e886e4 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -100,6 +100,7 @@ enum ttu_flags {
>>> * do a final flush if necessary */
>>> TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
>>> * caller holds it */
>>> + TTU_ZERO_FOLIO = 0x100,/* zero folio */
>>> };
>>> #ifdef CONFIG_MMU
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 52357d79917c..d98f70876327 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio
>>> *folio, struct vm_area_struct *vma,
>>> */
>>> dec_mm_counter(mm, mm_counter(folio));
>>> } else if (folio_test_anon(folio)) {
>>> - swp_entry_t entry = page_swap_entry(subpage);
>>> - pte_t swp_pte;
>>> - /*
>>> - * Store the swap location in the pte.
>>> - * See handle_pte_fault() ...
>>> - */
>>> - if (unlikely(folio_test_swapbacked(folio) !=
>>> - folio_test_swapcache(folio))) {
>>> + if (flags & TTU_ZERO_FOLIO) {
>>> + pte_clear(mm, address, pvmw.pte);
>>> + dec_mm_counter(mm, MM_ANONPAGES);
>>
>> Is there an easy way to reduce the code churn and highlight the added
>> code?
>>
>> Like
>>
>> } else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {
>>
>> } else if (folio_test_anon(folio)) {
>>
>>
>>
>> Also two concerns that I want to spell out:
>>
>> (a) what stops the page from getting modified in the meantime? The CPU
>> can write to it until the TLB is flushed.
>>
Thanks for pointing this out, David and Shakeel. This is a big issue in
this v2, and as Shakeel pointed out in [1], we would need to do a second
rmap walk. Looking at how KSM deals with this in try_to_merge_one_page(),
which calls write_protect_page() for each vma (i.e. basically an rmap
walk), this would be much more CPU-expensive and complicated than v1
[2], where the swap subsystem can handle all the complexities.

I will go back to my v1 solution for the next revision. It is much
simpler than what a valid v2 approach would look like, and its memory
usage is very low (0.003%), as pointed out by Johannes [3]. That
overhead would likely be offset by the memory saved by not having a
zswap_entry for zero-filled pages.
[1]
https://lore.kernel.org/all/nes73bwc5p6yhwt5tw3upxcqrn5kenn6lvqb6exrf4yppmz6jx@ywhuevpkxlvh/
[2]
https://lore.kernel.org/all/20240530102126.357438-1-usamaarif642@gmail.com/
[3] https://lore.kernel.org/all/20240530122715.GB1222079@cmpxchg.org/
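For illustration only, here is a minimal user-space sketch of the kind of
zero-filled check that either approach depends on. The names
sketch_page_is_zero_filled() and SKETCH_PAGE_SIZE are hypothetical and are
not taken from v1 or v2, and the 4096-byte page size is an assumption.
Note that the check by itself says nothing about concern (a): a page that
tests as zero-filled can still be written to until the TLB is flushed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed page size for this sketch; kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096

/*
 * Hypothetical helper: return true if every byte of the page is zero.
 * It only inspects the contents at the time of the call and provides
 * no protection against a concurrent writer.
 */
static bool sketch_page_is_zero_filled(const void *page)
{
	const unsigned char *byte = page;
	size_t i;

	assert(page != NULL);

	for (i = 0; i < SKETCH_PAGE_SIZE; i++) {
		if (byte[i] != 0)
			return false;	/* first non-zero byte ends the scan */
	}
	return true;
}
```

A byte-wise loop is used here to keep the sketch free of alignment
assumptions; a real implementation would likely compare wider words.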
>> (b) do you properly handle if the page is pinned (or just got pinned)
>> and we must not discard it?
>
> Oh, and I forgot, are you handling userfaultfd as expected? IIRC there
> are some really nasty side-effects with userfaultfd even when
> userfaultfd is currently not registered for a VMA [1].
>
> [1]
> https://lore.kernel.org/linux-mm/3a4b1027-df6e-31b8-b0de-ff202828228d@redhat.com/
>
> What should work is replacing all-zero anonymous pages by the shared
> zeropage iff the anonymous page is not pinned and we synchronize
> against GUP fast. Well, and we handle possible concurrent writes
> accordingly.
>
> KSM does essentially that when told to de-duplicate the shared
> zeropage, and I was thinking a while ago if we would want a
> zeropage-only KSM version that doesn't need stable trees and all that,
> but only deduplicates zero-filled pages into the shared zeropage in a
> safe way.
>
Thanks for the pointer to KSM code.