Date: Tue, 7 May 2024 10:24:25 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
 linux-mm@...ck.org, baolin.wang@...ux.alibaba.com, chrisl@...nel.org,
 hanchuanhua@...o.com, hannes@...xchg.org, hughd@...gle.com,
 kasong@...cent.com, linux-kernel@...r.kernel.org, surenb@...gle.com,
 v-songbaohua@...o.com, willy@...radead.org, xiang@...nel.org,
 ying.huang@...el.com, yosryahmed@...gle.com, yuzhao@...gle.com,
 ziy@...dia.com
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in
 swapcache

On 07.05.24 00:58, Barry Song wrote:
> On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> On 06.05.24 14:58, Barry Song wrote:
>>> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@...il.com> wrote:
>>>>
>>>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@...hat.com> wrote:
>>>>>
>>>>> On 04.05.24 01:23, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@....com> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Chuanhua Han <hanchuanhua@...o.com>
>>>>>>>>
>>>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>>>> for ARM64 MTE restoring are removed.
>>>>>>>>
>>>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@...o.com>
>>>>>>>> Co-developed-by: Barry Song <v-songbaohua@...o.com>
>>>>>>>> Signed-off-by: Barry Song <v-songbaohua@...o.com>
>>>>>>>
>>>>>>> With the suggested changes below:
>>>>>>>
>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>>>>>>>
>>>>>>>> ---
>>>>>>>>     mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>>>>     1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>          pte_t pte;
>>>>>>>>          vm_fault_t ret = 0;
>>>>>>>>          void *shadow = NULL;
>>>>>>>> +     int nr_pages = 1;
>>>>>>>> +     unsigned long page_idx = 0;
>>>>>>>> +     unsigned long address = vmf->address;
>>>>>>>> +     pte_t *ptep;
>>>>>>>
>>>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (in case
>>>>>>> any logic between here and there makes an adjustment) and it's clear that they
>>>>>>> are only to be used after that block (the compiler will warn if using an
>>>>>>> uninitialized value).
>>>>>>
>>>>>> right. I agree this will make the code more readable.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>          if (!pte_unmap_same(vmf))
>>>>>>>>                  goto out;
>>>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>                  goto out_nomap;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> +     ptep = vmf->pte;
>>>>>>>> +     if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>>>> +             int nr = folio_nr_pages(folio);
>>>>>>>> +             unsigned long idx = folio_page_idx(folio, page);
>>>>>>>> +             unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>>>> +             unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>>>> +             pte_t *folio_ptep;
>>>>>>>> +             pte_t folio_pte;
>>>>>>>> +
>>>>>>>> +             if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>>>> +                     goto check_folio;
>>>>>>>> +             if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>>>> +                     goto check_folio;
>>>>>>>> +
>>>>>>>> +             folio_ptep = vmf->pte - idx;
>>>>>>>> +             folio_pte = ptep_get(folio_ptep);
>>>>>>>> +             if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>>>> +                 swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>>>> +                     goto check_folio;
>>>>>>>> +
>>>>>>>> +             page_idx = idx;
>>>>>>>> +             address = folio_start;
>>>>>>>> +             ptep = folio_ptep;
>>>>>>>> +             nr_pages = nr;
>>>>>>>> +             entry = folio->swap;
>>>>>>>> +             page = &folio->page;
>>>>>>>> +     }
>>>>>>>> +
>>>>>>>> +check_folio:
>>>>>>>
>>>>>>> Is this still the correct label name, given the checks are now above the new
>>>>>>> block? Perhaps "one_page" or something like that?
>>>>>>
>>>>>> Not quite sure about this, as the code after a "one_page" label could still be
>>>>>> handling multiple pages. On the other hand, it seems we really are checking the
>>>>>> folio after "check_folio" :-)
>>>>>>
>>>>>>
>>>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>>>
>>>>>> /*
>>>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>>>> */
>>>>>> if (!folio_test_ksm(folio)) {
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>>          /*
>>>>>>>>           * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>>>>           * must never point at an anonymous page in the swapcache that is
>>>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>           * We're already holding a reference on the page but haven't mapped it
>>>>>>>>           * yet.
>>>>>>>>           */
>>>>>>>> -     swap_free_nr(entry, 1);
>>>>>>>> +     swap_free_nr(entry, nr_pages);
>>>>>>>>          if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>>>>                  folio_free_swap(folio);
>>>>>>>>
>>>>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>>>> +     folio_ref_add(folio, nr_pages - 1);
>>>>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>>>>          pte = mk_pte(page, vma->vm_page_prot);
>>>>>>>>
>>>>>>>>          /*
>>>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>>           * exclusivity.
>>>>>>>>           */
>>>>>>>>          if (!folio_test_ksm(folio) &&
>>>>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>>>>>>> +         (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>>>> +                        folio_nr_pages(folio) == nr_pages))) {
>>>>>>>
>>>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>>>> robustness' sake. Just checking my understanding.
>>>>>>
>>>>>> This is the code deciding we can reuse (mkwrite) a folio, in either of two cases:
>>>>>> 1. we hit a small folio and we are the only one using it
>>>>>> 2. we hit a large folio and we are the only one using it
>>>>>>
>>>>>> Corner cases beyond those two seem difficult. For example,
>>>>>>
>>>>>> if we hit a large folio in the swapcache but can't map it entirely
>>>>>> (nr_pages == 1) due to a partial unmap, we will have folio_ref_count(folio)
>>>>>> == nr_pages == 1
>>>>>
>>>>> No, there would be other references from the swapcache and
>>>>> folio_ref_count(folio) > 1. See my other reply.
>>>>
>>>> right. can be clearer by:
>>>
>>> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
>>> folio_ref_count(folio) == 1 and move folio_ref_add(folio, nr_pages - 1)?
>>
>> I don't think that we will "need" it.
>>
>>>
>>> One case: we have a large folio with 16 PTEs and we unmap 15 of the swap
>>> PTE entries, so only one swap entry is left. Then we hit the large folio in
>>> the swapcache, but since we have only one PTE we will map only that one PTE.
>>> Without folio_nr_pages(folio) == nr_pages, we would reuse the large folio for
>>> that single PTE; with it, do_wp_page() will instead migrate the large folio
>>> to a small one?
>>
>> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
>>
>> Note that this is the same as if we had pte_swp_exclusive() set and
>> would have run into "exclusive=true" here.
>>
>> If we'd want a similar "optimization" as we have in
>> wp_can_reuse_anon_folio(), you'd want something like
>>
>> exclusive || (folio_ref_count(folio) == 1 &&
>>                (!folio_test_large(folio) || nr_pages > 1))
> 
> I feel like
> 
> A :   !folio_test_large(folio) || nr_pages > 1
> 
> equals
> 
> B:    folio_nr_pages(folio) == nr_pages
> 
> If the folio is small, folio_test_large(folio) is false, so both A and B are true;
> if the folio is large and we map the whole large folio, A is true because
> nr_pages > 1, and B is also true;
> if the folio is large and we map only a single PTE, A is false, and B is also
> false, because nr_pages == 1 but folio_nr_pages(folio) > 1.
> 
> right?

Let's assume a single subpage of a large folio is no longer mapped. 
Then, we'd have:

nr_pages == folio_nr_pages(folio) - 1.

You could simply map+reuse most of the folio without COWing.
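(Concretely, taking e.g. a 16-subpage folio as an illustration: nr_pages would be
15 while folio_nr_pages(folio) is 16, so a folio_nr_pages(folio) == nr_pages
check would rule that reuse out.)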

Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's 
just easy to detect that the folio is exclusive (folio_ref_count(folio) 
== 1 before mapping anything).

If you really want to mimic what do_wp_page() currently does, you should 
have:

exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

Personally, I think we should keep it simple here and use:

exclusive || folio_ref_count(folio) == 1

IMHO, that's as clear as it gets.
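In the shape of the hunk quoted above that would read as follows (a sketch only;
whether folio_ref_add(folio, nr_pages - 1) then has to move below the check is
the question raised earlier in the thread):

	if (!folio_test_ksm(folio) &&
	    (exclusive || folio_ref_count(folio) == 1)) {
		...
	}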

-- 
Cheers,

David / dhildenb

