Message-ID: <da59b028-b472-4ac1-b893-2f17496fb384@bytedance.com>
Date: Thu, 12 Sep 2024 17:28:39 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Muchun Song <muchun.song@...ux.dev>
Cc: LKML <linux-kernel@...r.kernel.org>,
Linux Memory Management List <linux-mm@...ck.org>,
linux-arm-kernel@...ts.infradead.org, linuxppc-dev@...ts.ozlabs.org,
David Hildenbrand <david@...hat.com>, hughd@...gle.com, willy@...radead.org,
vbabka@...nel.org, akpm@...ux-foundation.org, rppt@...nel.org,
vishal.moola@...il.com, peterx@...hat.com, ryan.roberts@....com,
christophe.leroy2@...soprasteria.com
Subject: Re: [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Hi Muchun,
On 2024/9/6 15:20, Muchun Song wrote:
>
>
> On 2024/9/4 16:40, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the
>>    PTE page. In this case, the RCU lock in pte_offset_map_nolock() will
>>    ensure that the PTE page will not be freed, and there is no need to
>>    worry about whether the pmd entry is modified.
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>    time, we need to ensure that the pmd entry has not been modified
>>    concurrently.
>>
>> To more clearly distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. The latter is applicable for may-write cases
>> where modifications to the page table may happen after the corresponding
>> spinlock is taken. But the users should make sure the page table is
>> stable, e.g. by checking pte_same() or checking pmd_same() against the
>> output pmdval, before performing any write operations.
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be read-only/read-write protected.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@...edance.com>
>> ---
>>   Documentation/mm/split_page_table_lock.rst |  7 +++
>>   include/linux/mm.h                         |  5 +++
>>   mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
>>   3 files changed, 62 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>>   - pte_offset_map_nolock()
>> 	maps PTE, returns pointer to PTE with pointer to its PTE table
>> 	lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_ro_nolock()
>> +	maps PTE, returns pointer to PTE with pointer to its PTE table
>> +	lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> +	maps PTE, returns pointer to PTE with pointer to its PTE table
>> +	lock (not taken) and the value of its pmd entry, or returns NULL
>> +	if no PTE table;
>>   - pte_offset_map()
>> 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>>   - pte_unmap()
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a7c74a840249a..1fde9242231c9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>>   pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>   			     unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, pmd_t *pmdvalp,
>> +				spinlock_t **ptlp);
>>   #define pte_unmap_unlock(pte, ptl)	do {		\
>>   	spin_unlock(ptl);				\
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index a78a4adf711ac..262b7065a5a2e 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>   	return pte;
>>   }
>> +
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, spinlock_t **ptlp)
>> +{
>> +	pmd_t pmdval;
>> +	pte_t *pte;
>> +
>> +	pte = __pte_offset_map(pmd, addr, &pmdval);
>> +	if (likely(pte))
>> +		*ptlp = pte_lockptr(mm, &pmdval);
>> +	return pte;
>> +}
>> +
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +				unsigned long addr, pmd_t *pmdvalp,
>> +				spinlock_t **ptlp)
>> +{
>> +	pmd_t pmdval;
>> +	pte_t *pte;
>> +
>> +	VM_WARN_ON_ONCE(!pmdvalp);
>> +	pte = __pte_offset_map(pmd, addr, &pmdval);
>> +	if (likely(pte))
>> +		*ptlp = pte_lockptr(mm, &pmdval);
>> +	*pmdvalp = pmdval;
>> +	return pte;
>> +}
>> +
>>   /*
>>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>>    * __pte_offset_map_lock() below, is usually called with the pmd pointer for
>> @@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>    * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>>    * either the mmap_lock for write, or pte_same() check on contents, is enough.
>> *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
>> + * pte_offset_map(); but when successful, it also outputs a pointer to the
>> + * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
>> + * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
>> + * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
>> + * provides the correct spinlock pointer for the page table that it returns.
>> + * Even after grabbing the spinlock, we might be looking either at a page
>> + * table that is still mapped or one that was unmapped and is about to get
>> + * freed. But for R/O access this is sufficient. So it is only applicable
>> + * for read-only cases where any modification operations to the page table
>> + * are not allowed, even if the corresponding spinlock is held afterwards.
>> + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the
>> + * pmdval. It is applicable for may-write cases where modifications to the
>> + * page table may happen after the corresponding spinlock is taken. But the
>> + * users should make sure the page table is stable, e.g. by checking
>> + * pte_same() or checking pmd_same() against the output pmdval, before
>> + * performing any write operations.
>
> Now we have two options for ensuring the stability of the PTE for users
> of pte_offset_map_rw_nolock(). In order to ease this operation, how
> about proposing a new helper (or two: one for pmd_same(), another for
> pte_same()) like pte_lock_stability (I am not good at naming, maybe you
> can do better) which helps users 1) hold the PTL, 2) check whether the
> PTE is stable, and 3) return true if the PTE is stable, otherwise false.
I have been trying to do this over the past few days, but I found it is
not very convenient.
I introduced the following helpers:
#define __PTE_STABILITY(lock)						\
bool __pte_stability_##lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,	\
			    pte_t *orig_pte, spinlock_t *ptlp)		\
{									\
	pte_spin_##lock(ptlp);						\
	if (orig_pte) {							\
		VM_WARN_ON_ONCE(pte_none(*orig_pte));			\
		return pte_same(*orig_pte, ptep_get(pte));		\
	}								\
	if (orig_pmd) {							\
		VM_WARN_ON_ONCE(pmd_none(*orig_pmd));			\
		return pmd_same(*orig_pmd, pmdp_get_lockless(pmd));	\
	}								\
	VM_WARN_ON_ONCE(1);						\
	return false;							\
}
__PTE_STABILITY(lock)
__PTE_STABILITY(lock_nested)

static inline bool pte_stability_lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,
				      pte_t *orig_pte, spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock(pmd, orig_pmd, pte, orig_pte, ptlp);
}

#ifdef CONFIG_SPLIT_PTE_PTLOCKS
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock_nested(pmd, orig_pmd, pte, orig_pte, ptlp);
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
	__releases(ptlp)
{
	spin_unlock(ptlp);
}
#else
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
					     pte_t *pte, pte_t *orig_pte,
					     spinlock_t *ptlp)
{
	return true;
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
{
}
#endif /* CONFIG_SPLIT_PTE_PTLOCKS */
I then tried to use them together with pte_offset_map_rw_nolock() in the
following functions:
1. collapse_pte_mapped_thp
2. handle_pte_fault
3. map_pte
4. move_pages_pte
5. walk_pte_range
For 1, 2 and 3, the conversion is relatively simple, but 2 already does
a pte_same() check anyway, so the helpers do not reduce the amount of
code much.
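For illustration, a converted may-write call site would look roughly
like below (only a sketch, not compile-tested, using the draft helpers
above; error handling simplified):

	pmd_t pmdval;
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		goto abort;	/* no PTE table */

	/* take the PTL and recheck the pmd entry in one call */
	if (!pte_stability_lock(pmd, &pmdval, pte, NULL, ptl)) {
		/* the PTE page was repurposed under us, bail out */
		pte_unmap_unlock(pte, ptl);
		goto abort;
	}

	/* the page table is now stable, write operations are safe */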
For 4, the pte_same() checks have already been done, and it is not
easy to convert double_pt_lock() to use pte_stability_lock{_nested}().
For 5, it calls spin_trylock(), so we would need to introduce yet
another pte_stability_trylock() helper for it, which feels unnecessary.
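If we did add it, it would presumably have to look something like the
hypothetical sketch below; note that it could no longer return a simple
bool, because failing to take the lock and finding the page table
unstable need different handling by the caller:

/* hypothetical, only to illustrate the point; not part of this series */
static inline int pte_stability_trylock(pmd_t *pmd, pmd_t *orig_pmd,
					pte_t *pte, pte_t *orig_pte,
					spinlock_t *ptlp)
{
	if (!spin_trylock(ptlp))
		return -EBUSY;	/* PTL contended, caller decides */
	if (orig_pmd && !pmd_same(*orig_pmd, pmdp_get_lockless(pmd))) {
		spin_unlock(ptlp);
		return -EAGAIN;	/* pmd entry changed, must bail out */
	}
	/* the orig_pte/pte_same() case would be handled similarly */
	return 0;		/* locked and stable */
}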
There are not many places where pte_offset_map_rw_nolock() is called,
and some of them already do pte_same() checks, so maybe open coding is
enough and there is no need to introduce more helper functions.
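For comparison, the open-coded pmd_same() pattern at such a call site is
already quite short (again only a rough sketch, assuming a caller that
wants to modify pte entries):

	pmd_t pmdval;
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (!pte)
		goto abort;	/* no PTE table */
	spin_lock(ptl);
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* the PTE page was repurposed under us, bail out */
		pte_unmap_unlock(pte, ptl);
		goto abort;
	}
	/* *pmd is stable, write to the page table here */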
Thanks,
Qi
>
> Muchun,
> Thanks.
>
>> + *
>> + * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> + * will be read-only/read-write protected.
>> + *
>>   * Note that free_pgtables(), used after unmapping detached vmas, or when
>>   * exiting the whole mm, does not take page table lock before freeing a page
>>   * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
>