Message-ID: <7dc2084e-d8b1-42f7-b854-38981839f82e@redhat.com>
Date: Tue, 7 May 2024 10:24:25 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, baolin.wang@...ux.alibaba.com, chrisl@...nel.org,
hanchuanhua@...o.com, hannes@...xchg.org, hughd@...gle.com,
kasong@...cent.com, linux-kernel@...r.kernel.org, surenb@...gle.com,
v-songbaohua@...o.com, willy@...radead.org, xiang@...nel.org,
ying.huang@...el.com, yosryahmed@...gle.com, yuzhao@...gle.com,
ziy@...dia.com
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in
swapcache

On 07.05.24 00:58, Barry Song wrote:
> On Tue, May 7, 2024 at 1:16 AM David Hildenbrand <david@...hat.com> wrote:
>>
>> On 06.05.24 14:58, Barry Song wrote:
>>> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@...il.com> wrote:
>>>>
>>>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@...hat.com> wrote:
>>>>>
>>>>> On 04.05.24 01:23, Barry Song wrote:
>>>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@....com> wrote:
>>>>>>>
>>>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>>>> From: Chuanhua Han <hanchuanhua@...o.com>
>>>>>>>>
>>>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>>>> for ARM64 MTE restoring are removed.
>>>>>>>>
>>>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@...o.com>
>>>>>>>> Co-developed-by: Barry Song <v-songbaohua@...o.com>
>>>>>>>> Signed-off-by: Barry Song <v-songbaohua@...o.com>
>>>>>>>
>>>>>>> With the suggested changes below:
>>>>>>>
>>>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>>>>>>>
>>>>>>>> ---
>>>>>>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>>>> 1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> pte_t pte;
>>>>>>>> vm_fault_t ret = 0;
>>>>>>>> void *shadow = NULL;
>>>>>>>> + int nr_pages = 1;
>>>>>>>> + unsigned long page_idx = 0;
>>>>>>>> + unsigned long address = vmf->address;
>>>>>>>> + pte_t *ptep;
>>>>>>>
>>>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (incase
>>>>>>> any logic between here and there makes an adjustment) and its clear that they
>>>>>>> are only to be used after that block (the compiler will warn if using an
>>>>>>> uninitialized value).
>>>>>>
>>>>>> right. I agree this will make the code more readable.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> if (!pte_unmap_same(vmf))
>>>>>>>> goto out;
>>>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> goto out_nomap;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + ptep = vmf->pte;
>>>>>>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>>>> + int nr = folio_nr_pages(folio);
>>>>>>>> + unsigned long idx = folio_page_idx(folio, page);
>>>>>>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>>>> + pte_t *folio_ptep;
>>>>>>>> + pte_t folio_pte;
>>>>>>>> +
>>>>>>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>>>> + goto check_folio;
>>>>>>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>>>> + goto check_folio;
>>>>>>>> +
>>>>>>>> + folio_ptep = vmf->pte - idx;
>>>>>>>> + folio_pte = ptep_get(folio_ptep);
>>>>>>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>>>> + goto check_folio;
>>>>>>>> +
>>>>>>>> + page_idx = idx;
>>>>>>>> + address = folio_start;
>>>>>>>> + ptep = folio_ptep;
>>>>>>>> + nr_pages = nr;
>>>>>>>> + entry = folio->swap;
>>>>>>>> + page = &folio->page;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> +check_folio:
>>>>>>>
>>>>>>> Is this still the correct label name, given the checks are now above the new
>>>>>>> block? Perhaps "one_page" or something like that?
>>>>>>
>>>>>> not quite sure about this, as the code after one_page can be multiple_pages.
>>>>>> On the other hand, it seems we are really checking folio after "check_folio"
>>>>>> :-)
>>>>>>
>>>>>>
>>>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>>>
>>>>>> /*
>>>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>>>> */
>>>>>> if (!folio_test_ksm(folio)) {
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> /*
>>>>>>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>>>> * must never point at an anonymous page in the swapcache that is
>>>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> * We're already holding a reference on the page but haven't mapped it
>>>>>>>> * yet.
>>>>>>>> */
>>>>>>>> - swap_free_nr(entry, 1);
>>>>>>>> + swap_free_nr(entry, nr_pages);
>>>>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>>>> folio_free_swap(folio);
>>>>>>>>
>>>>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>>>> + folio_ref_add(folio, nr_pages - 1);
>>>>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>>>> pte = mk_pte(page, vma->vm_page_prot);
>>>>>>>>
>>>>>>>> /*
>>>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>>>> * exclusivity.
>>>>>>>> */
>>>>>>>> if (!folio_test_ksm(folio) &&
>>>>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>>>>>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>>>> + folio_nr_pages(folio) == nr_pages))) {
>>>>>>>
>>>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>>>> robustness sake. Just checking my understanding.
>>>>>>
>>>>>> This is the code showing we are reusing/(mkwrite) a folio either
>>>>>> 1. we meet a small folio and we are the only one hitting the small folio
>>>>>> 2. we meet a large folio and we are the only one hitting the large folio
>>>>>>
>>>>>> any corner cases besides the above two seems difficult. for example,
>>>>>>
>>>>>> while we hit a large folio in swapcache but if we can't entirely map it
>>>>>> (nr_pages==1) due to partial unmap, we will have folio_ref_count(folio)
>>>>>> == nr_pages == 1
>>>>>
>>>>> No, there would be other references from the swapcache and
>>>>> folio_ref_count(folio) > 1. See my other reply.
>>>>
>>>> right. can be clearer by:
>>>
>>> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
>>> folio_ref_count(folio) == 1 and moving folio_ref_add(folio, nr_pages - 1)?
>>
>> I don't think that we will "need" it.
>>
>>>
>>> one case is that we have a large folio with 16 PTEs, and we unmap
>>> 15 swap PTE entries, thus we have only one swap entry left. Then
>>> we hit the large folio in swapcache. but we have only one PTE thus we will
>>> map only one PTE. lacking folio_nr_pages(folio) == nr_pages, we reuse the
>>> large folio for one PTE. with it, do_wp_page() will migrate the large
>>> folio to a small one?
>>
>> We will set PAE bit and do_wp_page() will unconditionally reuse that page.
>>
>> Note that this is the same as if we had pte_swp_exclusive() set and
>> would have run into "exclusive=true" here.
>>
>> If we'd want a similar "optimization" as we have in
>> wp_can_reuse_anon_folio(), you'd want something like
>>
>> exclusive || (folio_ref_count(folio) == 1 &&
>> (!folio_test_large(folio) || nr_pages > 1)
>
> I feel like
>
> A : !folio_test_large(folio) || nr_pages > 1
>
> equals
>
> B: folio_nr_pages(folio) == nr_pages
>
> if folio is small, folio_test_large(folio) is false, both A and B will be true;
> if folio is large, and we map the whole large folio, A will be true
> because of nr_pages > 1;
> B is also true;
> if folio is large, and we map single one PTE, A will be false;
> B is also false, because nr_pages == 1 but folio_nr_pages(folio) > 1;
>
> right?

Let's assume a single subpage of a large folio is no longer mapped.
Then, we'd have:

    nr_pages == folio_nr_pages(folio) - 1.

You could simply map+reuse most of the folio without COWing.

Once we support COW reuse of PTE-mapped THP we'd do the same. Here, it's
just easy to detect that the folio is exclusive (folio_ref_count(folio)
== 1 before mapping anything).
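
A minimal userspace sketch of the comparison above, not kernel code. The
helpers cond_a(), cond_b() and count_disagreements() are hypothetical and
model conditions A and B on plain integers, with folio_pages standing in
for folio_nr_pages(folio). It assumes nr_pages may take any value between
1 and the folio size, which is the hypothetical case discussed here; the
patch as posted only ever uses 1 or the full folio size.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition A: !folio_test_large(folio) || nr_pages > 1 */
static bool cond_a(int folio_pages, int nr_pages)
{
	bool large = folio_pages > 1;	/* models folio_test_large() */

	return !large || nr_pages > 1;
}

/* Condition B: folio_nr_pages(folio) == nr_pages */
static bool cond_b(int folio_pages, int nr_pages)
{
	return folio_pages == nr_pages;
}

/* Count the nr_pages values in [1, folio_pages] where A and B disagree. */
static int count_disagreements(int folio_pages)
{
	int nr_pages, n = 0;

	assert(folio_pages >= 1);
	for (nr_pages = 1; nr_pages <= folio_pages; nr_pages++)
		if (cond_a(folio_pages, nr_pages) != cond_b(folio_pages, nr_pages))
			n++;
	return n;
}
```

For a 16-page folio, A and B agree when nr_pages is 1 or 16 and disagree
for every value in between, including the 15-page case described above.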

If you really want to mimic what do_wp_page() currently does, you should
have:

    exclusive || (folio_ref_count(folio) == 1 && !folio_test_large(folio))

Personally, I think we should keep it simple here and use:

    exclusive || folio_ref_count(folio) == 1

IMHO, that's as clear as it gets.
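
To make the difference between the two candidate checks concrete, here is
a small userspace sketch, not kernel code. struct reuse_input and both
helper functions are hypothetical; they only model the boolean logic of
the two expressions above on plain data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical plain-data stand-in for the state the check looks at. */
struct reuse_input {
	bool exclusive;		/* models the local "exclusive" flag */
	int ref_count;		/* models folio_ref_count(folio) before mapping */
	int folio_pages;	/* models folio_nr_pages(folio) */
};

/* Variant mimicking do_wp_page() as described in this mail. */
static bool reuse_like_do_wp_page(const struct reuse_input *in)
{
	bool large = in->folio_pages > 1;	/* models folio_test_large() */

	assert(in->ref_count >= 1 && in->folio_pages >= 1);
	return in->exclusive || (in->ref_count == 1 && !large);
}

/* Simple variant: a sole reference is enough to treat the folio as exclusive. */
static bool reuse_simple(const struct reuse_input *in)
{
	assert(in->ref_count >= 1 && in->folio_pages >= 1);
	return in->exclusive || in->ref_count == 1;
}
```

The two variants differ only for a large folio with a single reference
and exclusive == false: the simple check reuses it, the do_wp_page()-like
check does not.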
--
Cheers,
David / dhildenb