Message-ID: <da59b028-b472-4ac1-b893-2f17496fb384@bytedance.com>
Date: Thu, 12 Sep 2024 17:28:39 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Muchun Song <muchun.song@...ux.dev>
Cc: LKML <linux-kernel@...r.kernel.org>,
 Linux Memory Management List <linux-mm@...ck.org>,
 linux-arm-kernel@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org,
 David Hildenbrand <david@...hat.com>, hughd@...gle.com, willy@...radead.org,
 vbabka@...nel.org, akpm@...ux-foundation.org, rppt@...nel.org,
 vishal.moola@...il.com, peterx@...hat.com, ryan.roberts@....com,
 christophe.leroy2@...soprasteria.com
Subject: Re: [PATCH v3 01/14] mm: pgtable: introduce
 pte_offset_map_{ro|rw}_nolock()

Hi Muchun,

On 2024/9/6 15:20, Muchun Song wrote:
> 
> 
> On 2024/9/4 16:40, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the
>>    PTE page. In this case, the RCU lock in pte_offset_map_nolock() will
>>    ensure that the PTE page will not be freed, and there is no need to
>>    worry about whether the pmd entry is modified.
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>     time, we need to ensure that the pmd entry has not been modified
>>     concurrently.
>>
>> To more clearly distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. It is applicable for may-write cases where any
>> modification operations to the page table may happen after the
>> corresponding spinlock is held. But the users should make sure the page
>> table is stable, for example by checking pte_same() or by checking
>> pmd_same() against the output pmdval, before performing the write
>> operations.
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be read-only/read-write protected.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@...edance.com>
>> ---
>>   Documentation/mm/split_page_table_lock.rst |  7 +++
>>   include/linux/mm.h                         |  5 +++
>>   mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
>>   3 files changed, 62 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>>    - pte_offset_map_nolock()
>>       maps PTE, returns pointer to PTE with pointer to its PTE table
>>       lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_ro_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken) and the value of its pmd entry, or returns NULL
>> +    if no PTE table;
>>    - pte_offset_map()
>>       maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>>    - pte_unmap()
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a7c74a840249a..1fde9242231c9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>>   pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>               unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp);
>>   #define pte_unmap_unlock(pte, ptl)    do {        \
>>       spin_unlock(ptl);                \
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index a78a4adf711ac..262b7065a5a2e 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>       return pte;
>>   }
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    return pte;
>> +}
>> +
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    VM_WARN_ON_ONCE(!pmdvalp);
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    *pmdvalp = pmdval;
>> +    return pte;
>> +}
>> +
>>   /*
>>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>>    * __pte_offset_map_lock() below, is usually called with the pmd pointer for
>> @@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>    * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>>    * either the mmap_lock for write, or pte_same() check on contents, is enough.
>>    *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> + * pte_offset_map(); but when successful, it also outputs a pointer to
>> + * the spinlock in ptlp - as pte_offset_map_lock() does, but in this case
>> + * without locking it.  This helps the caller to avoid a later
>> + * pte_lockptr(mm, *pmd), which might by that time act on a changed *pmd:
>> + * pte_offset_map_ro_nolock() provides the correct spinlock pointer for
>> + * the page table that it returns. Even after grabbing the spinlock, we
>> + * might be looking either at a page table that is still mapped or one
>> + * that was unmapped and is about to get freed. But for R/O access this
>> + * is sufficient. So it is only applicable for read-only cases where any
>> + * modification operations to the page table are not allowed even if the
>> + * corresponding spinlock is held afterwards.
>> + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> + * pmdval. It is applicable for may-write cases where any modification
>> + * operations to the page table may happen after the corresponding
>> + * spinlock is held. But the users should make sure the page table is
>> + * stable, for example by checking pte_same() or by checking pmd_same()
>> + * against the output pmdval, before performing the write operations.
> 
> Now we have two options to ensure the stability of the PTE for users
> of pte_offset_map_rw_nolock(). To ease this operation, how about
> proposing a new helper (or two, one for pmd_same, another for
> pte_same) like pte_lock_stability (I am not good at naming, maybe you
> can do better) which helps users 1) hold the PTL, 2) check if the PTE
> is stable, and 3) return true if the PTE is stable, otherwise return
> false?

I've been trying to do this for the past few days, but I found it is
not very convenient.

I introduced the following helpers:

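/*
 * Note: pte_spin_lock()/pte_spin_lock_nested() below are assumed to be
 * thin wrappers around spin_lock()/spin_lock_nested(); they are not
 * defined in this sketch.
 */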
#define __PTE_STABILITY(lock)						\
bool __pte_stability_##lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,	\
			    pte_t *orig_pte, spinlock_t *ptlp)		\
{									\
	pte_spin_##lock(ptlp);						\
	if (orig_pte) {							\
		VM_WARN_ON_ONCE(pte_none(*orig_pte));			\
		return pte_same(*orig_pte, ptep_get(pte));		\
	}								\
	if (orig_pmd) {							\
		VM_WARN_ON_ONCE(pmd_none(*orig_pmd));			\
		return pmd_same(*orig_pmd, pmdp_get_lockless(pmd));	\
	}								\
	VM_WARN_ON_ONCE(1);						\
	return false;							\
}
__PTE_STABILITY(lock)
__PTE_STABILITY(lock_nested)

static inline bool pte_stability_lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,
				      pte_t *orig_pte, spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock(pmd, orig_pmd, pte, orig_pte, ptlp);
}

#ifdef CONFIG_SPLIT_PTE_PTLOCKS
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock_nested(pmd, orig_pmd, pte, orig_pte, ptlp);
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
	__releases(ptlp)
{
	spin_unlock(ptlp);
}
#else
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
{
	return true;
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
{
}
#endif /* CONFIG_SPLIT_PTE_PTLOCKS */

and tried to use them with pte_offset_map_rw_nolock() in the following
functions:

1. collapse_pte_mapped_thp
2. handle_pte_fault
3. map_pte
4. move_pages_pte
5. walk_pte_range

For 1, 2 and 3, the conversion is relatively simple, but 2 actually
already does a pte_same() check, so it does not reduce the amount of
code much.
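
For those, the converted code would look roughly like this (a minimal
sketch with a hypothetical caller; mm, pmd and addr are assumed to be
in scope):

pmd_t pmdval;
spinlock_t *ptl;
pte_t *pte;

pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
if (!pte)
	return;		/* no PTE table */
if (!pte_stability_lock(pmd, &pmdval, pte, NULL, ptl)) {
	/* *pmd changed under us; the PTL is held, so unlock and bail */
	spin_unlock(ptl);
	pte_unmap(pte);
	return;
}
/* PTL is held and *pmd is stable: writes to the PTE page are allowed */
/* ... modify the page table ... */
pte_unmap_unlock(pte, ptl);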

For 4, the pte_same() checks have already been done, and it is not
easy to convert double_pt_lock() to use pte_stability_lock{_nested}().

For 5, it calls spin_trylock(), so we would need to introduce yet
another pte_stability_trylock() helper for it, which feels unnecessary.

There are not many places where pte_offset_map_rw_nolock() is called,
and some of them already do pte_same() checks, so maybe open-coding is
enough and there is no need to introduce more helper functions.
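
For comparison, the open-coded pmd_same() pattern is only a few lines,
something like this (again just a sketch, not taken from an actual
caller):

pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
if (!pte)
	return;
spin_lock(ptl);
if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
	/* *pmd changed under us: drop everything and let the caller retry */
	pte_unmap_unlock(pte, ptl);
	return;
}
/* PTL is held and *pmd is stable: writes to the PTE page are allowed */

which is hardly longer than the helper-based version.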

Thanks,
Qi

> 
> Thanks,
> Muchun.
> 
>> + *
>> + * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> + * will be read-only/read-write protected.
>> + *
>>    * Note that free_pgtables(), used after unmapping detached vmas, or when
>>    * exiting the whole mm, does not take page table lock before freeing a page
>>    * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
> 
