linux-kernel - Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in swapcache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5b770715-7516-42a8-9ea0-3f61572d92af@redhat.com>
Date: Mon, 6 May 2024 15:16:09 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
 linux-mm@...ck.org, baolin.wang@...ux.alibaba.com, chrisl@...nel.org,
 hanchuanhua@...o.com, hannes@...xchg.org, hughd@...gle.com,
 kasong@...cent.com, linux-kernel@...r.kernel.org, surenb@...gle.com,
 v-songbaohua@...o.com, willy@...radead.org, xiang@...nel.org,
 ying.huang@...el.com, yosryahmed@...gle.com, yuzhao@...gle.com,
 ziy@...dia.com
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in
 swapcache

On 06.05.24 14:58, Barry Song wrote:
> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@...il.com> wrote:
>>
>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@...hat.com> wrote:
>>>
>>> On 04.05.24 01:23, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@....com> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Chuanhua Han <hanchuanhua@...o.com>
>>>>>>
>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>> for ARM64 MTE restoring are removed.
>>>>>>
>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@...o.com>
>>>>>> Co-developed-by: Barry Song <v-songbaohua@...o.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@...o.com>
>>>>>
>>>>> With the suggested changes below:
>>>>>
>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>>>>>
>>>>>> ---
>>>>>>    mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>>    1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>         pte_t pte;
>>>>>>         vm_fault_t ret = 0;
>>>>>>         void *shadow = NULL;
>>>>>> +     int nr_pages = 1;
>>>>>> +     unsigned long page_idx = 0;
>>>>>> +     unsigned long address = vmf->address;
>>>>>> +     pte_t *ptep;
>>>>>
>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>>>>> any logic between here and there makes an adjustment) and its clear that they
>>>>> are only to be used after that block (the compiler will warn if using an
>>>>> uninitialized value).
>>>>
>>>> right. I agree this will make the code more readable.
>>>>
>>>>>
>>>>>>
>>>>>>         if (!pte_unmap_same(vmf))
>>>>>>                 goto out;
>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>                 goto out_nomap;
>>>>>>         }
>>>>>>
>>>>>> +     ptep = vmf->pte;
>>>>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>> +             int nr = folio_nr_pages(folio);
>>>>>> +             unsigned long idx = folio_page_idx(folio, page);
>>>>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>> +             pte_t *folio_ptep;
>>>>>> +             pte_t folio_pte;
>>>>>> +
>>>>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>> +                     goto check_folio;
>>>>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>> +                     goto check_folio;
>>>>>> +
>>>>>> +             folio_ptep = vmf->pte - idx;
>>>>>> +             folio_pte = ptep_get(folio_ptep);
>>>>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>> +                     goto check_folio;
>>>>>> +
>>>>>> +             page_idx = idx;
>>>>>> +             address = folio_start;
>>>>>> +             ptep = folio_ptep;
>>>>>> +             nr_pages = nr;
>>>>>> +             entry = folio->swap;
>>>>>> +             page = &folio->page;
>>>>>> +     }
>>>>>> +
>>>>>> +check_folio:
>>>>>
>>>>> Is this still the correct label name, given the checks are now above the new
>>>>> block? Perhaps "one_page" or something like that?
>>>>
>>>> not quite sure about this, as the code after one_page can be multiple_pages.
>>>> On the other hand, it seems we are really checking folio after "check_folio"
>>>> :-)
>>>>
>>>>
>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>
>>>> /*
>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>> */
>>>> if (!folio_test_ksm(folio)) {
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>>         /*
>>>>>>          * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>>          * must never point at an anonymous page in the swapcache that is
>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>          * We're already holding a reference on the page but haven't mapped it
>>>>>>          * yet.
>>>>>>          */
>>>>>> -     swap_free_nr(entry, 1);
>>>>>> +     swap_free_nr(entry, nr_pages);
>>>>>>         if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>>                 folio_free_swap(folio);
>>>>>>
>>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>> +     folio_ref_add(folio, nr_pages - 1);
>>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>>         pte = mk_pte(page, vma->vm_page_prot);
>>>>>>
>>>>>>         /*
>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>          * exclusivity.
>>>>>>          */
>>>>>>         if (!folio_test_ksm(folio) &&
>>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>>>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>>>>
>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>> robustness sake. Just checking my understanding.
>>>>
>>>> This is the code showing we are reusing/(mkwrite) a folio either
>>>> 1. we meet a small folio and we are the only one hitting the small folio
>>>> 2. we meet a large folio and we are the only one hitting the large folio
>>>>
>>>> any corner cases besides the above two seems difficult. for example,
>>>>
>>>> while we hit a large folio in swapcache but if we can't entirely map it
>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
>>>> == nr_pages == 1
>>>
>>> No, there would be other references from the swapcache and
>>> folio_ref_count(folio) > 1. See my other reply.
>>
>> right. can be clearer by:
> 
> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?

I don't think that we will "need" it.

> 
> one case is that we have a large folio with 16 PTEs, and we unmap
> 15 swap PTE entries, thus we have only one swap entry left. Then
> we hit the large folio in swapcache.  but we have only one PTE thus we will
> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
> large folio for one PTE. with it, do_wp_page() will migrate the large
> folio to a small one?

We will set PAE bit and do_wp_page() will unconditionally reuse that page.

Note that this is the same as if we had pte_swp_exclusive() set and 
would have run into "exclusive=true" here.

If we'd want a similar "optimization" as we have in 
wp_can_reuse_anon_folio(), you'd want something like

exclusive || (folio_ref_count(folio) == 1 &&
	      (!folio_test_large(folio) || nr_pages > 1)

.. but I am not sure if that is really worth the complexity here.

> 
> 1AM, tired and sleepy. not quite sure I am correct.
> I look forward to seeing your reply tomorrow morning :-)

Heh, no need to dream about this ;)

-- 
Cheers,

David / dhildenb