linux-kernel - Re: [PATCH v3 4/9] mm: introduce skip_none

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <99d3229a-17ce-4a41-ae11-ebfab138cf76@bytedance.com>
Date: Tue, 19 Nov 2024 18:03:34 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: David Hildenbrand <david@...hat.com>
Cc: jannh@...gle.com, hughd@...gle.com, willy@...radead.org,
 muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
 peterx@...hat.com, mgorman@...e.de, catalin.marinas@....com,
 will@...nel.org, dave.hansen@...ux.intel.com, luto@...nel.org,
 peterz@...radead.org, x86@...nel.org, lorenzo.stoakes@...cle.com,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, zokeefe@...gle.com,
 rientjes@...gle.com
Subject: Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()



On 2024/11/19 17:55, David Hildenbrand wrote:
> On 18.11.24 12:13, Qi Zheng wrote:
>>
>>
>> On 2024/11/18 18:59, David Hildenbrand wrote:
>>> On 18.11.24 11:56, Qi Zheng wrote:
>>>>
>>>>
>>>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and then:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>>>> &skip_nr,
>>>>>>>>>>>>>>                                rss, &force_flush, 
>>>>>>>>>>>>>> &force_break);
>>>>>>>>>>>>>>             if (can_reclaim_pt) {
>>>>>>>>>>>>>>                 none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>>>>                 none_nr += nr_skip;
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>>>> counting of
>>>>>>>>>>>>
>>>>>>>>>>>> Got it.
>>>>>>>>>>>>
>>>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>>>
>>>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>>>> empty PTE page.
>>>>>>>>>>>
>>>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>>>> entries" but
>>>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>>>
>>>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>>>> which
>>>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>>>> do_zap_pte_range?
>>>>>>>>>>>
>>>>>>>>>>> Because as soon as any entry was processed but  no zapped, 
>>>>>>>>>>> you can
>>>>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>>>>
>>>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() 
>>>>>>>>>> entry is
>>>>>>>>>> found in count_pte_none().
>>>>>>>>>
>>>>>>>>> I'm not sure if well need cont_pte_none(), but I'll have to take a
>>>>>>>>> look
>>>>>>>>> at your new patch to see how this fits together with doing the
>>>>>>>>> pte_none
>>>>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>>>>
>>>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>>>> scanning and
>>>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>>>
>>>>>>>> Maybe we can return the information whether the zap was skipped 
>>>>>>>> from
>>>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>>>> like I
>>>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>>>
>>>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>>>
>>>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>>>>         zap, and the PTE page must be empty.
>>>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>>>>         empty.
>>>>>>>>
>>>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>>>>> pages
>>>>>>>> to not be reclaimed.
>>>>>>>>
>>>>>>>> So the most reliable and efficient method may be:
>>>>>>>>
>>>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>>>> reclaim
>>>>>>>>         the PTE page;
>>>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>>>> count_pte_none()
>>>>>>>
>>>>>>> Is there a need for count_pte_none() that I am missing?
>>>>>>
>>>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none
>>>>>> ptes.
>>>>>>
>>>>>>>
>>>>>>> Assume we have
>>>>>>>
>>>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>>>
>>>>>>>
>>>>>>> If "nr" is the number of processed entries (including 
>>>>>>> pte_none()), and
>>>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none 
>>>>>>> entry, we
>>>>>>> can detect what we need, no?
>>>>>>>
>>>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>>>> entries. -> We can continue trying to reclaim
>>>>>>
>>>>>> I prefer that "nr" should not include pte_none().
>>>>>>
>>>>>
>>>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>>>> less, nothing more.
>>>>>
>>>>> Let's just keep it simple and avoid count_pte_none().
>>>>>
>>>>> I'm probably missing something important?
>>>>
>>>> As we discussed before, we should skip all consecutive none ptes,
>>>   > pte and addr are already incremented before returning.
>>>
>>> It's probably best to send the resulting patch so I can either
>>> understand why count_pte_none() is required or comment on how to get rid
>>> of it.
>>
>> Something like this:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index bd9ebe0f4471f..e9bec3cd49d44 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct
>> mmu_gather *tlb,
>>           return nr;
>>    }
>>
>> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
>> +                                  struct vm_area_struct *vma, pte_t 
>> *pte,
>> +                                  unsigned long addr, unsigned long end,
>> +                                  struct zap_details *details, int *rss,
>> +                                  bool *force_flush, bool *force_break,
>> +                                  bool *any_skipped)
>> +{
>> +       pte_t ptent = ptep_get(pte);
>> +       int max_nr = (end - addr) / PAGE_SIZE;
> 
> I'd do here:
> 
>        int nr = 0;
> 
>> +
>> +       /* Skip all consecutive pte_none(). */
>> +       if (pte_none(ptent)) {
>> +
> 
> instead of the "int nr" here
> 
>> +               for (nr = 1; nr < max_nr; nr++) {
>> +                       ptent = ptep_get(pte + nr);
>> +                       if (!pte_none(ptent))
>> +                               break;
>> +               }
>> +               max_nr -= nr;
>> +               if (!max_nr)
>> +                       return 0;
>> +               pte += nr;
>> +               addr += nr * PAGE_SIZE;
>> +       }
>> +
>> +       if (pte_present(ptent))
>> +               return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
>> +                                       addr, details, rss, force_flush,
>> +                                       force_break, any_skipped);
> 
> and here:
> 
> if (pte_present(ptent))
>      nr += zap_present_ptes(...)
> else
>      nr += zap_nonpresent_ptes();
> return nr;
> 
> So you really return the number of processed items -- how much the 
> caller should advance.

Got it, please ignore my previous stupid considerations. ;)

> 
>> +
>> +       return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
>> +                                  details, rss, any_skipped);
>> +}
>> +
>> +static inline int count_pte_none(pte_t *pte, int nr)
>> +{
>> +       int none_nr = 0;
>> +
>> +       /*
>> +        * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
>> +        * re-installed, so we need to check pte_none() one by one.
>> +        * Otherwise, checking a single PTE in a batch is sufficient.
>> +        */
>> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
>> +       for (;;) {
>> +               if (pte_none(ptep_get(pte)))
>> +                       none_nr++;
>> +               if (--nr == 0)
>> +                       break;
>> +               pte++;
>> +       }
>> +#else
>> +       if (pte_none(ptep_get(pte)))
>> +               none_nr = nr;
>> +#endif
>> +       return none_nr;
>> +}
>> +
>> +
>>    static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>                                   struct vm_area_struct *vma, pmd_t *pmd,
>>                                   unsigned long addr, unsigned long end,
>> @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct
>> mmu_gather *tlb,
>>           int rss[NR_MM_COUNTERS];
>>           spinlock_t *ptl;
>>           pte_t *start_pte;
>> +       bool can_reclaim_pt;
>>           pte_t *pte;
>>           int nr;
>>
>> @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct
>> mmu_gather *tlb,
>>           flush_tlb_batched_pending(mm);
>>           arch_enter_lazy_mmu_mode();
>>           do {
>> -               pte_t ptent = ptep_get(pte);
>> -               int max_nr;
>> -
>> -               nr = 1;
>> -               if (pte_none(ptent))
>> -                       continue;
>> +               bool any_skipped;
>>
>>                   if (need_resched())
>>                           break;
>>
>> -               max_nr = (end - addr) / PAGE_SIZE;
>> -               if (pte_present(ptent)) {
>> -                       nr = zap_present_ptes(tlb, vma, pte, ptent, 
>> max_nr,
>> -                                             addr, details, rss,
>> &force_flush,
>> -                                             &force_break);
>> -                       if (unlikely(force_break)) {
>> -                               addr += nr * PAGE_SIZE;
>> -                               break;
>> -                       }
>> -               } else {
>> -                       nr = zap_nonpresent_ptes(tlb, vma, pte, ptent,
>> max_nr,
>> -                                                addr, details, rss);
>> +               nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>> +                                     rss, &force_flush, &force_break,
>> +                                     &any_skipped);
>> +               if (can_reclaim_pt) {
>> +                       VM_BUG_ON(!any_skipped && count_pte_none(pte,
>> nr) == nr);
> 
> Okay, so this is really only for debugging then? So we can safely drop 
> that for now.
> 
> I assume it would make sense to check when reclaiming a page table with 
> CONFIG_DEBUG_VM, that the table is actually all-pte_none instead?

Ah, make sense. Will change to it in the next version.

> 
> (as a side note: no VM_BUG_ON, please :) )

Got it.

Thanks!

>