Message-ID: <8c27c1f8-9573-4777-8397-929a15e67f60@bytedance.com>
Date: Thu, 7 Nov 2024 15:54:47 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Jann Horn <jannh@...gle.com>
Cc: david@...hat.com, hughd@...gle.com, willy@...radead.org, mgorman@...e.de,
 muchun.song@...ux.dev, vbabka@...nel.org, akpm@...ux-foundation.org,
 zokeefe@...gle.com, rientjes@...gle.com, peterx@...hat.com,
 catalin.marinas@....com, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 x86@...nel.org
Subject: Re: [PATCH v2 1/7] mm: khugepaged: retract_page_tables() use
 pte_offset_map_rw_nolock()

Hi Jann,

On 2024/11/7 05:48, Jann Horn wrote:
> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@...edance.com> wrote:
>> In retract_page_tables(), we may modify the pmd entry after acquiring the
>> pml and ptl, so we should also check whether the pmd entry is stable.
> 
> Why does taking the PMD lock not guarantee that the PMD entry is stable?

The pmd entry may have changed before we took the pmd lock, so we need
to recheck it after taking the pmd or pte lock.

> 
>> Use pte_offset_map_rw_nolock() + pmd_same() to do this; then we can also
>> remove the call to pte_lockptr().
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@...edance.com>
>> ---
>>   mm/khugepaged.c | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 6f8d46d107b4b..6d76dde64f5fb 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1721,6 +1721,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>                  spinlock_t *pml;
>>                  spinlock_t *ptl;
>>                  bool skipped_uffd = false;
>> +               pte_t *pte;
>>
>>                  /*
>>                   * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
>> @@ -1756,11 +1757,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>>                                          addr, addr + HPAGE_PMD_SIZE);
>>                  mmu_notifier_invalidate_range_start(&range);
>>
>> +               pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
>> +               if (!pte) {
>> +                       mmu_notifier_invalidate_range_end(&range);
>> +                       continue;
>> +               }
>> +
>>                  pml = pmd_lock(mm, pmd);
> 
> I don't understand why you're mapping the page table before locking
> the PMD. Doesn't that just mean we need more error checking
> afterwards?

The main purpose is to obtain the pmdval. If we don't use
pte_offset_map_rw_nolock(), we have to be careful to recheck the pmd
entry before calling pte_lockptr(), like this:

pmdval = pmdp_get_lockless(pmd);
pmd_lock
recheck pmdval
pte_lockptr(mm, pmd)
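
Spelled out a bit more, that ordering would look roughly like the sketch
below (an illustration only, not actual patch code; it assumes the
surrounding retract_page_tables() loop context, i.e. mm, pmd, pml, ptl,
range and the continue target from the hunk quoted above):

pmd_t pmdval = pmdp_get_lockless(pmd);

pml = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
        /* pmd changed, e.g. the pte page was freed; bail out */
        spin_unlock(pml);
        mmu_notifier_invalidate_range_end(&range);
        continue;
}
/* only now is it safe to look up the pte lock for this pmd */
ptl = pte_lockptr(mm, pmd);
if (ptl != pml)
        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);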

Otherwise, it may cause the system to crash. Consider the following
situation:

     CPU 0              CPU 1

zap_pte_range
--> clear pmd entry
     free pte page (by RCU)

                       retract_page_tables
                       --> pmd_lock
                           pte_lockptr(mm, pmd)  <-- BOOM!!

So maybe calling pte_offset_map_rw_nolock() is more convenient.

Thanks,
Qi


> 
> 
>> -               ptl = pte_lockptr(mm, pmd);
>>                  if (ptl != pml)
>>                          spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>
>> +               if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>> +                       pte_unmap_unlock(pte, ptl);
>> +                       if (ptl != pml)
>> +                               spin_unlock(pml);
>> +                       mmu_notifier_invalidate_range_end(&range);
>> +                       continue;
>> +               }
>> +               pte_unmap(pte);
>> +
>>                  /*
>>                   * Huge page lock is still held, so normally the page table
>>                   * must remain empty; and we have already skipped anon_vma
>> --
>> 2.20.1
>>
