lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b5da4118-2b63-4383-8617-ac98d7e62c64@redhat.com>
Date: Tue, 19 Nov 2024 10:55:33 +0100
From: David Hildenbrand <david@...hat.com>
To: Qi Zheng <zhengqi.arch@...edance.com>
Cc: jannh@...gle.com, hughd@...gle.com, willy@...radead.org,
 muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
 peterx@...hat.com, mgorman@...e.de, catalin.marinas@....com,
 will@...nel.org, dave.hansen@...ux.intel.com, luto@...nel.org,
 peterz@...radead.org, x86@...nel.org, lorenzo.stoakes@...cle.com,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org, zokeefe@...gle.com,
 rientjes@...gle.com
Subject: Re: [PATCH v3 4/9] mm: introduce skip_none_ptes()

On 18.11.24 12:13, Qi Zheng wrote:
> 
> 
> On 2024/11/18 18:59, David Hildenbrand wrote:
>> On 18.11.24 11:56, Qi Zheng wrote:
>>>
>>>
>>> On 2024/11/18 18:41, David Hildenbrand wrote:
>>>> On 18.11.24 11:34, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 2024/11/18 17:29, David Hildenbrand wrote:
>>>>>> On 18.11.24 04:35, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/11/15 22:59, David Hildenbrand wrote:
>>>>>>>> On 15.11.24 15:41, Qi Zheng wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2024/11/15 18:22, David Hildenbrand wrote:
>>>>>>>>>>>>> *nr_skip = nr;
>>>>>>>>>>>>>
>>>>>>>>>>>>> and then:
>>>>>>>>>>>>>
>>>>>>>>>>>>> zap_pte_range
>>>>>>>>>>>>> --> nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
>>>>>>>>>>>>> &skip_nr,
>>>>>>>>>>>>>                                rss, &force_flush, &force_break);
>>>>>>>>>>>>>             if (can_reclaim_pt) {
>>>>>>>>>>>>>                 none_nr += count_pte_none(pte, nr);
>>>>>>>>>>>>>                 none_nr += nr_skip;
>>>>>>>>>>>>>             }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Right?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes. I did not look closely at the patch that adds the
>>>>>>>>>>>> counting of
>>>>>>>>>>>
>>>>>>>>>>> Got it.
>>>>>>>>>>>
>>>>>>>>>>>> pte_none though (to digest why it is required :) ).
>>>>>>>>>>>
>>>>>>>>>>> Because 'none_nr == PTRS_PER_PTE' is used in patch #7 to detect
>>>>>>>>>>> empty PTE page.
>>>>>>>>>>
>>>>>>>>>> Okay, so the problem is that "nr" would be "all processed
>>>>>>>>>> entries" but
>>>>>>>>>> there are cases where we "process an entry but not zap it".
>>>>>>>>>>
>>>>>>>>>> What you really only want to know is "was any entry not zapped",
>>>>>>>>>> which
>>>>>>>>>> could be a simple input boolean variable passed into
>>>>>>>>>> do_zap_pte_range?
>>>>>>>>>>
>>>>>>>>>> Because as soon as any entry was processed but  no zapped, you can
>>>>>>>>>> immediately give up on reclaiming that table.
>>>>>>>>>
>>>>>>>>> Yes, we can set can_reclaim_pt to false when a !pte_none() entry is
>>>>>>>>> found in count_pte_none().
>>>>>>>>
>>>>>>>> I'm not sure if well need cont_pte_none(), but I'll have to take a
>>>>>>>> look
>>>>>>>> at your new patch to see how this fits together with doing the
>>>>>>>> pte_none
>>>>>>>> detection+skipping in do_zap_pte_range().
>>>>>>>>
>>>>>>>> I was wondering if you cannot simply avoid the additional
>>>>>>>> scanning and
>>>>>>>> simply set "can_reclaim_pt" if you skip a zap.
>>>>>>>
>>>>>>> Maybe we can return the information whether the zap was skipped from
>>>>>>> zap_present_ptes() and zap_nonpresent_ptes() through parameters
>>>>>>> like I
>>>>>>> did in [PATCH v1 3/7] and [PATCH v1 4/7].
>>>>>>>
>>>>>>> In theory, we can detect empty PTE pages in the following two ways:
>>>>>>>
>>>>>>> 1) If no zap is skipped, it means that all pte entries have been
>>>>>>>         zap, and the PTE page must be empty.
>>>>>>> 2) If all pte entries are detected to be none, then the PTE page is
>>>>>>>         empty.
>>>>>>>
>>>>>>> In the error case, 1) may cause non-empty PTE pages to be reclaimed
>>>>>>> (which is unacceptable), while the 2) will at most cause empty PTE
>>>>>>> pages
>>>>>>> to not be reclaimed.
>>>>>>>
>>>>>>> So the most reliable and efficient method may be:
>>>>>>>
>>>>>>> a. If there is a zap that is skipped, stop scanning and do not
>>>>>>> reclaim
>>>>>>>         the PTE page;
>>>>>>> b. Otherwise, as now, detect the empty PTE page through
>>>>>>> count_pte_none()
>>>>>>
>>>>>> Is there a need for count_pte_none() that I am missing?
>>>>>
>>>>> When any_skipped == false, at least add VM_BUG_ON() to recheck none
>>>>> ptes.
>>>>>
>>>>>>
>>>>>> Assume we have
>>>>>>
>>>>>> nr = do_zap_pte_range(&any_skipped)
>>>>>>
>>>>>>
>>>>>> If "nr" is the number of processed entries (including pte_none()), and
>>>>>> "any_skipped" is set whenever we skipped to zap a !pte_none entry, we
>>>>>> can detect what we need, no?
>>>>>>
>>>>>> If any_skipped == false after the call, we now have "nr" pte_none()
>>>>>> entries. -> We can continue trying to reclaim
>>>>>
>>>>> I prefer that "nr" should not include pte_none().
>>>>>
>>>>
>>>> Why? do_zap_pte_range() should tell you how far to advance, nothing
>>>> less, nothing more.
>>>>
>>>> Let's just keep it simple and avoid count_pte_none().
>>>>
>>>> I'm probably missing something important?
>>>
>>> As we discussed before, we should skip all consecutive none ptes,
>>   > pte and addr are already incremented before returning.
>>
>> It's probably best to send the resulting patch so I can either
>> understand why count_pte_none() is required or comment on how to get rid
>> of it.
> 
> Something like this:
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index bd9ebe0f4471f..e9bec3cd49d44 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1657,6 +1657,66 @@ static inline int zap_nonpresent_ptes(struct
> mmu_gather *tlb,
>           return nr;
>    }
> 
> +static inline int do_zap_pte_range(struct mmu_gather *tlb,
> +                                  struct vm_area_struct *vma, pte_t *pte,
> +                                  unsigned long addr, unsigned long end,
> +                                  struct zap_details *details, int *rss,
> +                                  bool *force_flush, bool *force_break,
> +                                  bool *any_skipped)
> +{
> +       pte_t ptent = ptep_get(pte);
> +       int max_nr = (end - addr) / PAGE_SIZE;

I'd do here:

	  int nr = 0;

> +
> +       /* Skip all consecutive pte_none(). */
> +       if (pte_none(ptent)) {
> +

instead of the "int nr" here

> +               for (nr = 1; nr < max_nr; nr++) {
> +                       ptent = ptep_get(pte + nr);
> +                       if (!pte_none(ptent))
> +                               break;
> +               }
> +               max_nr -= nr;
> +               if (!max_nr)
> +                       return 0;
> +               pte += nr;
> +               addr += nr * PAGE_SIZE;
> +       }
> +
> +       if (pte_present(ptent))
> +               return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> +                                       addr, details, rss, force_flush,
> +                                       force_break, any_skipped);

and here:

if (pte_present(ptent))
	nr += zap_present_ptes(...)
else
	nr += zap_nonpresent_ptes();
return nr;

So you really return the number of processed items -- how much the 
caller should advance.

> +
> +       return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
> +                                  details, rss, any_skipped);
> +}
> +
> +static inline int count_pte_none(pte_t *pte, int nr)
> +{
> +       int none_nr = 0;
> +
> +       /*
> +        * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
> +        * re-installed, so we need to check pte_none() one by one.
> +        * Otherwise, checking a single PTE in a batch is sufficient.
> +        */
> +#ifdef CONFIG_PTE_MARKER_UFFD_WP
> +       for (;;) {
> +               if (pte_none(ptep_get(pte)))
> +                       none_nr++;
> +               if (--nr == 0)
> +                       break;
> +               pte++;
> +       }
> +#else
> +       if (pte_none(ptep_get(pte)))
> +               none_nr = nr;
> +#endif
> +       return none_nr;
> +}
> +
> +
>    static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                                   struct vm_area_struct *vma, pmd_t *pmd,
>                                   unsigned long addr, unsigned long end,
> @@ -1667,6 +1727,7 @@ static unsigned long zap_pte_range(struct
> mmu_gather *tlb,
>           int rss[NR_MM_COUNTERS];
>           spinlock_t *ptl;
>           pte_t *start_pte;
> +       bool can_reclaim_pt;
>           pte_t *pte;
>           int nr;
> 
> @@ -1679,28 +1740,22 @@ static unsigned long zap_pte_range(struct
> mmu_gather *tlb,
>           flush_tlb_batched_pending(mm);
>           arch_enter_lazy_mmu_mode();
>           do {
> -               pte_t ptent = ptep_get(pte);
> -               int max_nr;
> -
> -               nr = 1;
> -               if (pte_none(ptent))
> -                       continue;
> +               bool any_skipped;
> 
>                   if (need_resched())
>                           break;
> 
> -               max_nr = (end - addr) / PAGE_SIZE;
> -               if (pte_present(ptent)) {
> -                       nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
> -                                             addr, details, rss,
> &force_flush,
> -                                             &force_break);
> -                       if (unlikely(force_break)) {
> -                               addr += nr * PAGE_SIZE;
> -                               break;
> -                       }
> -               } else {
> -                       nr = zap_nonpresent_ptes(tlb, vma, pte, ptent,
> max_nr,
> -                                                addr, details, rss);
> +               nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
> +                                     rss, &force_flush, &force_break,
> +                                     &any_skipped);
> +               if (can_reclaim_pt) {
> +                       VM_BUG_ON(!any_skipped && count_pte_none(pte,
> nr) == nr);

Okay, so this is really only for debugging then? So we can safely drop 
that for now.

I assume it would make sense to check when reclaiming a page table with 
CONFIG_DEBUG_VM, that the table is actually all-pte_none instead?

(as a side note: no VM_BUG_ON, please :) )

-- 
Cheers,

David / dhildenb


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ