Message-ID: <99d3229a-17ce-4a41-ae11-ebfab138cf76@bytedance.com>
Date: Tue, 19 Nov 2024 18:03:34 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>
Cc: jannh@...gle.com, hughd@...gle.com, willy@...radead.org,
muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
peterx@...hat.com, mgorman@...e.de, catalin.marinas@....com,
will@...nel.org, dave.hansen@...ux.intel.com, luto@...nel.org,
peterz@...radead.org, x86@...nel.org, lorenzo.stoakes@...cle.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, zokeefe@...gle.com,
rientjes@...gle.com
Subject: Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()
On 2024/11/19 17:55, David Hildenbrand wrote:
> On 18.11.24 12:13, Qi Zheng wrote:
>>
>>
>> On 2024/11/18 18:59, David Hildenbrand wrote:
>>> On 18.11.24 11:56, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and then:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>>>>                           &skip_nr, rss, &force_flush,
>>>>>>>>>>>>>>                           &force_break);
>>>>>>>>>>>>>> if (can_reclaim_pt) {
>>>>>>>>>>>>>> none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>>>> none_nr += nr_skip;
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>>>> counting of
>>>>>>>>>>>>
>>>>>>>>>>>> Got it.
>>>>>>>>>>>>
>>>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>>>
>>>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>>>> empty PTE page.
>>>>>>>>>>>
>>>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>>>> entries" but
>>>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>>>
>>>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>>>> which
>>>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>>>> do_zap_pte_range?
>>>>>>>>>>>
>>>>>>>>>>> Because as soon as any entry was processed but not zapped, you
>>>>>>>>>>> can immediately give up on reclaiming that table.
>>>>>>>>>>
>>>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none()
>>>>>>>>>> entry is
>>>>>>>>>> found in count_pte_none().
>>>>>>>>>
>>>>>>>>> I'm not sure if we'll need count_pte_none(), but I'll have to take
>>>>>>>>> a look at your new patch to see how this fits together with doing
>>>>>>>>> the pte_none detection+skipping in do_zap_pte_range().
>>>>>>>>>
>>>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>>>> scanning and
>>>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>>>
>>>>>>>> Maybe we can return the information whether the zap was skipped
>>>>>>>> from
>>>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>>>> like I
>>>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>>>
>>>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>>>
>>>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>>>>     zapped, and the PTE page must be empty.
>>>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>>>> empty.
>>>>>>>>
>>>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>>>> (which is unacceptable), while 2) will at most cause empty PTE
>>>>>>>> pages to not be reclaimed.
>>>>>>>>
>>>>>>>> So the most reliable and efficient method may be:
>>>>>>>>
>>>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>>>> reclaim
>>>>>>>> the PTE page;
>>>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>>>> count_pte_none()
>>>>>>>
>>>>>>> Is there a need for count_pte_none() that I am missing?
>>>>>>
>>>>>> When any_skipped == false, we could at least add a VM_BUG_ON() to
>>>>>> recheck that the ptes are all none.
>>>>>>
>>>>>>>
>>>>>>> Assume we have
>>>>>>>
>>>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>>>
>>>>>>>
>>>>>>> If "nr" is the number of processed entries (including
>>>>>>> pte_none()), and
>>>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none
>>>>>>> entry, we
>>>>>>> can detect what we need, no?
>>>>>>>
>>>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>>>> entries. -> We can continue trying to reclaim
>>>>>>
>>>>>> I prefer that "nr" should not include pte_none().
>>>>>>
>>>>>
>>>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>>>> less, nothing more.
>>>>>
>>>>> Let's just keep it simple and avoid count_pte_none().
>>>>>
>>>>> I'm probably missing something important?
>>>>
>>>> As we discussed before, we should skip all consecutive none ptes,
>>>> pte and addr are already incremented before returning.
>>>
>>> It's probably best to send the resulting patch so I can either
>>> understand why count_pte_none() is required or comment on how to get rid
>>> of it.
>>
>> Something like this:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index bd9ebe0f4471f..e9bec3cd49d44 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>> return nr;
>> }
>>
>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>> +                                   struct vm_area_struct *vma, pte_t *pte,
>> +                                   unsigned long addr, unsigned long end,
>> +                                   struct zap_details *details, int *rss,
>> +                                   bool *force_flush, bool *force_break,
>> +                                   bool *any_skipped)
>> +{
>> + pte_t ptent = ptep_get(pte);
>> + int max_nr = (end - addr) / PAGE_SIZE;
>
> I'd do here:
>
> int nr = 0;
>
>> +
>> + /* Skip all consecutive pte_none(). */
>> + if (pte_none(ptent)) {
>> +             int nr;
>> +
>
> instead of the "int nr" here
>
>> + for (nr = 1; nr < max_nr; nr++) {
>> + ptent = ptep_get(pte + nr);
>> + if (!pte_none(ptent))
>> + break;
>> + }
>> + max_nr -= nr;
>> + if (!max_nr)
>> + return 0;
>> + pte += nr;
>> + addr += nr * PAGE_SIZE;
>> + }
>> +
>> +     if (pte_present(ptent))
>> +             return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>> +                                     addr, details, rss, force_flush,
>> +                                     force_break, any_skipped);
>
> and here:
>
> if (pte_present(ptent))
> nr += zap_present_ptes(...)
> else
> nr += zap_nonpresent_ptes();
> return nr;
>
> So you really return the number of processed items -- how much the
> caller should advance.
Got it, please ignore my previous stupid considerations. ;)
>
>> +
>> + return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
>> + details, rss, any_skipped);
>> +}
>> +
>> +static inline int count_pte_none(pte_t *pte, int nr)
>> +{
>> + int none_nr = 0;
>> +
>> + /*
>> + * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
>> + * re-installed, so we need to check pte_none() one by one.
>> + * Otherwise, checking a single PTE in a batch is sufficient.
>> + */
>> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
>> + for (;;) {
>> + if (pte_none(ptep_get(pte)))
>> + none_nr++;
>> + if (--nr == 0)
>> + break;
>> + pte++;
>> + }
>> +#else
>> + if (pte_none(ptep_get(pte)))
>> + none_nr = nr;
>> +#endif
>> + return none_nr;
>> +}
>> +
>> +
>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> int rss[NR_MM_COUNTERS];
>> spinlock_t *ptl;
>> pte_t *start_pte;
>> + bool can_reclaim_pt;
>> pte_t *pte;
>> int nr;
>>
>> @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> flush_tlb_batched_pending(mm);
>> arch_enter_lazy_mmu_mode();
>> do {
>> - pte_t ptent = ptep_get(pte);
>> - int max_nr;
>> -
>> - nr = 1;
>> - if (pte_none(ptent))
>> - continue;
>> + bool any_skipped;
>>
>> if (need_resched())
>> break;
>>
>> -         max_nr = (end - addr) / PAGE_SIZE;
>> -         if (pte_present(ptent)) {
>> -                 nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>> -                                       addr, details, rss, &force_flush,
>> -                                       &force_break);
>> - if (unlikely(force_break)) {
>> - addr += nr * PAGE_SIZE;
>> - break;
>> - }
>> - } else {
>> -                 nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
>> -                                          addr, details, rss);
>> + nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>> + rss, &force_flush, &force_break,
>> + &any_skipped);
>> + if (can_reclaim_pt) {
>> +                 VM_BUG_ON(!any_skipped && count_pte_none(pte, nr) != nr);
>
> Okay, so this is really only for debugging then? So we can safely drop
> that for now.
>
> I assume it would make sense to check, when reclaiming a page table with
> CONFIG_DEBUG_VM, that the table is actually all-pte_none instead?
Ah, make sense. Will change to it in the next version.
>
> (as a side note: no VM_BUG_ON, please :) )
Got it.
Thanks!
>