linux-kernel - Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled


Message-ID: <b458af5d-0f88-47d6-952a-8f69d41d1c80@redhat.com>
Date: Fri, 7 Jun 2024 13:16:03 +0200
From: David Hildenbrand <david@...hat.com>
To: Usama Arif <usamaarif642@...il.com>, akpm@...ux-foundation.org,
 shakeel.butt@...ux.dev, yosryahmed@...gle.com, willy@...radead.org
Cc: hannes@...xchg.org, nphamcs@...il.com, chengming.zhou@...ux.dev,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH v2 1/2] mm: clear pte for folios that are zero filled

On 07.06.24 12:24, Usama Arif wrote:
> 
> On 04/06/2024 13:43, David Hildenbrand wrote:
>> On 04.06.24 14:30, David Hildenbrand wrote:
>>> On 04.06.24 12:58, Usama Arif wrote:
>>>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>>>> Rather than reading/writing these pages to flash resulting
>>>> in increased I/O and flash wear, the pte can be cleared for those
>>>> addresses at unmap time while shrinking folio list. When this
>>>> causes a page fault, do_pte_missing will take care of this page.
>>>> With this patch, NVMe writes in Meta server fleet decreased
>>>> by almost 10% with conventional swap setup (zswap disabled).
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@...il.com>
>>>> ---
>>>>     include/linux/rmap.h |   1 +
>>>>     mm/rmap.c            | 163 ++++++++++++++++++++++---------------------
>>>>     mm/vmscan.c          |  89 ++++++++++++++++-------
>>>>     3 files changed, 150 insertions(+), 103 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index bb53e5920b88..b36db1e886e4 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -100,6 +100,7 @@ enum ttu_flags {
>>>>                          * do a final flush if necessary */
>>>>         TTU_RMAP_LOCKED        = 0x80,    /* do not grab rmap lock:
>>>>                          * caller holds it */
>>>> +    TTU_ZERO_FOLIO        = 0x100,/* zero folio */
>>>>     };
>>>>        #ifdef CONFIG_MMU
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 52357d79917c..d98f70876327 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1819,96 +1819,101 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>                  */
>>>>                 dec_mm_counter(mm, mm_counter(folio));
>>>>             } else if (folio_test_anon(folio)) {
>>>> -            swp_entry_t entry = page_swap_entry(subpage);
>>>> -            pte_t swp_pte;
>>>> -            /*
>>>> -             * Store the swap location in the pte.
>>>> -             * See handle_pte_fault() ...
>>>> -             */
>>>> -            if (unlikely(folio_test_swapbacked(folio) !=
>>>> -                    folio_test_swapcache(folio))) {
>>>> +            if (flags & TTU_ZERO_FOLIO) {
>>>> +                pte_clear(mm, address, pvmw.pte);
>>>> +                dec_mm_counter(mm, MM_ANONPAGES);
>>>
>>> Is there an easy way to reduce the code churn and highlight the added
>>> code?
>>>
>>> Like
>>>
>>> } else if (folio_test_anon(folio) && (flags & TTU_ZERO_FOLIO)) {
>>>
>>> } else if (folio_test_anon(folio)) {
>>>
>>>
>>>
>>> Also two concerns that I want to spell out:
>>>
>>> (a) what stops the page from getting modified in the meantime? The CPU
>>>        can write to it until the TLB is flushed.
>>>
> Thanks for pointing this out, David and Shakeel. This is a big issue in
> this v2, and as Shakeel pointed out in [1], we would need to do a second
> rmap walk. Looking at how ksm deals with this in try_to_merge_one_page,
> which calls write_protect_page for each vma (i.e. basically an rmap
> walk), this would be much more CPU expensive and complicated than v1
> [2], where the swap subsystem can handle all the complexities. I will go
> back to my v1 solution for the next revision, as it is much simpler
> than what a valid v2 approach would look like, and its memory overhead
> is very low (0.003%), as pointed out by Johannes [3]. That overhead
> would likely be offset by the memory saved from not having a
> zswap_entry for zero-filled pages.

Agreed.
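
For reference, the v1 direction discussed above depends on detecting
zero-filled pages before they are written to swap. Below is a minimal
userspace sketch of such a check. It is only an illustration under
stated assumptions: is_zero_filled() is a hypothetical helper name, and
this is not the code from either revision of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace helper (not kernel code): returns true if the
 * len bytes starting at buf are all zero.
 */
static bool is_zero_filled(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	unsigned long word;
	size_t i = 0;

	/* Compare one machine word at a time; memcpy avoids unaligned loads. */
	for (; i + sizeof(word) <= len; i += sizeof(word)) {
		memcpy(&word, p + i, sizeof(word));
		if (word != 0)
			return false;
	}

	/* Check any trailing bytes that do not fill a whole word. */
	for (; i < len; i++) {
		if (p[i] != 0)
			return false;
	}

	return true;
}
```

Such a check is only meaningful while nothing can still write to the
page, which is exactly concern (a) in the quoted review.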

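To illustrate the branch split suggested in the quoted review for
reducing code churn, here is a small userspace sketch of the dispatch
order. The flag value comes from the quoted diff; pick_unmap_path() and
the unmap_path enum are hypothetical names used only for illustration,
and the real branches in try_to_unmap_one() do far more than this.

```c
#include <stdbool.h>

/* Same value as the flag added to enum ttu_flags in the quoted patch. */
#define TTU_ZERO_FOLIO 0x100u

/* Hypothetical labels for the branch taken; not kernel identifiers. */
enum unmap_path {
	PATH_OTHER,     /* folio is not anonymous */
	PATH_ANON_ZERO, /* new branch: zero-filled anon folio */
	PATH_ANON_SWAP  /* existing branch: store a swap entry */
};

static enum unmap_path pick_unmap_path(bool folio_is_anon, unsigned int flags)
{
	if (folio_is_anon && (flags & TTU_ZERO_FOLIO))
		return PATH_ANON_ZERO; /* added code lives here only */
	else if (folio_is_anon)
		return PATH_ANON_SWAP; /* existing code stays unindented */

	return PATH_OTHER;
}
```

Putting the new condition in its own else-if arm leaves the existing
swap-entry code at its current indentation, so the diff would show only
the added lines.
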
-- 
Cheers,

David / dhildenb