Message-ID: <5b770715-7516-42a8-9ea0-3f61572d92af@redhat.com>
Date: Mon, 6 May 2024 15:16:09 +0200
From: David Hildenbrand <david@...hat.com>
To: Barry Song <21cnbao@...il.com>
Cc: Ryan Roberts <ryan.roberts@....com>, akpm@...ux-foundation.org,
linux-mm@...ck.org, baolin.wang@...ux.alibaba.com, chrisl@...nel.org,
hanchuanhua@...o.com, hannes@...xchg.org, hughd@...gle.com,
kasong@...cent.com, linux-kernel@...r.kernel.org, surenb@...gle.com,
v-songbaohua@...o.com, willy@...radead.org, xiang@...nel.org,
ying.huang@...el.com, yosryahmed@...gle.com, yuzhao@...gle.com,
ziy@...dia.com
Subject: Re: [PATCH v3 6/6] mm: swap: entirely map large folios found in
swapcache
On 06.05.24 14:58, Barry Song wrote:
> On Tue, May 7, 2024 at 12:38 AM Barry Song <21cnbao@...il.com> wrote:
>>
>> On Tue, May 7, 2024 at 12:07 AM David Hildenbrand <david@...hat.com> wrote:
>>>
>>> On 04.05.24 01:23, Barry Song wrote:
>>>> On Fri, May 3, 2024 at 6:50 PM Ryan Roberts <ryan.roberts@....com> wrote:
>>>>>
>>>>> On 03/05/2024 01:50, Barry Song wrote:
>>>>>> From: Chuanhua Han <hanchuanhua@...o.com>
>>>>>>
>>>>>> When a large folio is found in the swapcache, the current implementation
>>>>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>>>>> page faults. This patch opts to map the entire large folio at once to
>>>>>> minimize page faults. Additionally, redundant checks and early exits
>>>>>> for ARM64 MTE restoring are removed.
>>>>>>
>>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@...o.com>
>>>>>> Co-developed-by: Barry Song <v-songbaohua@...o.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@...o.com>
>>>>>
>>>>> With the suggested changes below:
>>>>>
>>>>> Reviewed-by: Ryan Roberts <ryan.roberts@....com>
>>>>>
>>>>>> ---
>>>>>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++-----------
>>>>>> 1 file changed, 48 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 22e7c33cc747..940fdbe69fa1 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -3968,6 +3968,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> pte_t pte;
>>>>>> vm_fault_t ret = 0;
>>>>>> void *shadow = NULL;
>>>>>> + int nr_pages = 1;
>>>>>> + unsigned long page_idx = 0;
>>>>>> + unsigned long address = vmf->address;
>>>>>> + pte_t *ptep;
>>>>>
>>>>> nit: Personally I'd prefer all these to get initialised just before the "if
>>>>> (folio_test_large()..." block below. That way it is clear they are fresh (in
>>>>> case any logic between here and there makes an adjustment) and it's clear that
>>>>> they are only to be used after that block (the compiler will warn if using an
>>>>> uninitialized value).
>>>>
>>>> right. I agree this will make the code more readable.
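>>>>
>>>> e.g. a rough, untested sketch of that placement (same names as in the
>>>> patch above, declarations staying at the top of do_swap_page()):
>>>>
>>>> 	int nr_pages;
>>>> 	unsigned long page_idx;
>>>> 	unsigned long address;
>>>> 	pte_t *ptep;
>>>> 	...
>>>> 	/* set up fresh, right before the large-folio mapping block */
>>>> 	nr_pages = 1;
>>>> 	page_idx = 0;
>>>> 	address = vmf->address;
>>>> 	ptep = vmf->pte;
>>>> 	if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>> 		...
>>>> 	}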
>>>>
>>>>>
>>>>>>
>>>>>> if (!pte_unmap_same(vmf))
>>>>>> goto out;
>>>>>> @@ -4166,6 +4170,36 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> goto out_nomap;
>>>>>> }
>>>>>>
>>>>>> + ptep = vmf->pte;
>>>>>> + if (folio_test_large(folio) && folio_test_swapcache(folio)) {
>>>>>> + int nr = folio_nr_pages(folio);
>>>>>> + unsigned long idx = folio_page_idx(folio, page);
>>>>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>>>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>>>>> + pte_t *folio_ptep;
>>>>>> + pte_t folio_pte;
>>>>>> +
>>>>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>>>>> + goto check_folio;
>>>>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>>>>> + goto check_folio;
>>>>>> +
>>>>>> + folio_ptep = vmf->pte - idx;
>>>>>> + folio_pte = ptep_get(folio_ptep);
>>>>>> + if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
>>>>>> + swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
>>>>>> + goto check_folio;
>>>>>> +
>>>>>> + page_idx = idx;
>>>>>> + address = folio_start;
>>>>>> + ptep = folio_ptep;
>>>>>> + nr_pages = nr;
>>>>>> + entry = folio->swap;
>>>>>> + page = &folio->page;
>>>>>> + }
>>>>>> +
>>>>>> +check_folio:
>>>>>
>>>>> Is this still the correct label name, given the checks are now above the new
>>>>> block? Perhaps "one_page" or something like that?
>>>>
>>>> not quite sure about this, as the code after "one_page" can also handle
>>>> multiple pages. On the other hand, it seems we really are checking the folio
>>>> after "check_folio" :-)
>>>>
>>>>
>>>> BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
>>>> BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
>>>>
>>>> /*
>>>> * Check under PT lock (to protect against concurrent fork() sharing
>>>> * the swap entry concurrently) for certainly exclusive pages.
>>>> */
>>>> if (!folio_test_ksm(folio)) {
>>>>
>>>>
>>>>>
>>>>>> +
>>>>>> /*
>>>>>> * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
>>>>>> * must never point at an anonymous page in the swapcache that is
>>>>>> @@ -4225,12 +4259,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> * We're already holding a reference on the page but haven't mapped it
>>>>>> * yet.
>>>>>> */
>>>>>> - swap_free_nr(entry, 1);
>>>>>> + swap_free_nr(entry, nr_pages);
>>>>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>> folio_free_swap(folio);
>>>>>>
>>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>>> + folio_ref_add(folio, nr_pages - 1);
>>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>>> pte = mk_pte(page, vma->vm_page_prot);
>>>>>>
>>>>>> /*
>>>>>> @@ -4240,34 +4275,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>> * exclusivity.
>>>>>> */
>>>>>> if (!folio_test_ksm(folio) &&
>>>>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>>>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>>>>> + folio_nr_pages(folio) == nr_pages))) {
>>>>>
>>>>> I think in practice there is no change here? If nr_pages > 1 then the folio is
>>>>> in the swapcache, so there is an extra ref on it? I agree with the change for
>>>>> robustness' sake. Just checking my understanding.
>>>>
>>>> This is the code showing we are reusing (mkwrite'ing) a folio when either
>>>> 1. we meet a small folio and we are the only one hitting that small folio, or
>>>> 2. we meet a large folio and we are the only one hitting that large folio.
>>>>
>>>> Any corner case besides the above two seems difficult. For example,
>>>>
>>>> if we hit a large folio in the swapcache but can't entirely map it
>>>> (nr_pages == 1) due to a partial unmap, we will have folio_ref_count(folio)
>>>> == nr_pages == 1
>>>
>>> No, there would be other references from the swapcache and
>>> folio_ref_count(folio) > 1. See my other reply.
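>>>
>>> As a rough illustration only (not code from the patch): a folio still
>>> sitting in the swapcache holds a reference for the swapcache itself, on
>>> top of the one we took when looking it up, so with nr_pages == 1:
>>>
>>> 	/*
>>> 	 * folio_ref_count(folio) >= 1 (our reference)
>>> 	 *                          + 1 (swapcache reference)
>>> 	 * so "folio_ref_count(folio) == nr_pages" cannot hold while the
>>> 	 * folio remains in the swapcache.
>>> 	 */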
>>
>> right. It can be made clearer by:
>
> Wait, do we still need folio_nr_pages(folio) == nr_pages even if we use
> folio_ref_count(folio) == 1 and move folio_ref_add(folio, nr_pages - 1)?
I don't think that we will "need" it.
>
> one case is that we have a large folio with 16 PTEs and unmap 15 of the
> swap PTE entries, so we have only one swap entry left. Then we hit the
> large folio in the swapcache, but we have only one PTE, thus we will map
> only one PTE. Lacking folio_nr_pages(folio) == nr_pages, we would reuse
> the large folio for that one PTE. With it, do_wp_page() will migrate the
> large folio to a small one?
We will set the PAE bit and do_wp_page() will unconditionally reuse that page.
Note that this is the same as if we had pte_swp_exclusive() set and
would have run into "exclusive=true" here.
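
As an illustration only (not code from the patch):

	/*
	 * nr_pages == 1 on a large folio, swap pte marked exclusive:
	 * exclusive == true, PAE gets set on the page, and a later write
	 * fault reuses the page in do_wp_page() purely because
	 * PageAnonExclusive() is set -- no refcount check involved.
	 */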
If we'd want a similar "optimization" as we have in
wp_can_reuse_anon_folio(), you'd want something like
	exclusive || (folio_ref_count(folio) == 1 &&
		      (!folio_test_large(folio) || nr_pages > 1))
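
In the context of the check in the patch, that would read something like
(untested):

	if (!folio_test_ksm(folio) &&
	    (exclusive || (folio_ref_count(folio) == 1 &&
			   (!folio_test_large(folio) || nr_pages > 1)))) {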
.. but I am not sure if that is really worth the complexity here.
>
> 1AM, tired and sleepy. not quite sure I am correct.
> I look forward to seeing your reply tomorrow morning :-)
Heh, no need to dream about this ;)
--
Cheers,
David / dhildenb