Message-ID: <73adae65-4429-41d7-bbb6-4c58156060d3@arm.com>
Date: Mon, 8 Apr 2024 10:22:51 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>, Huang Ying <ying.huang@...el.com>,
Gao Xiang <xiang@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Yang Shi <shy828301@...il.com>, Michal Hocko <mhocko@...e.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>, Barry Song <21cnbao@...il.com>,
Chris Li <chrisl@...nel.org>, Lance Yang <ioworker0@...il.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Andrew, it looks like there are a few niggles for me to address below, so please
don't let v6 (currently in mm-unstable) progress to mm-stable. I'll aim to get a
new version out (or a fix patch on top, depending on how the discussion lands)
in the next couple of days.
On 05/04/2024 11:13, David Hildenbrand wrote:
> On 03.04.24 13:40, Ryan Roberts wrote:
>> Now that we no longer have a convenient flag in the cluster to determine
>> if a folio is large, free_swap_and_cache() will take a reference and
>> lock a large folio much more often, which could lead to contention and
>> (e.g.) failure to split large folios, etc.
>>
>> Let's solve that problem by batch freeing swap and cache with a new
>> function, free_swap_and_cache_nr(), to free a contiguous range of swap
>> entries together. This allows us to first drop a reference to each swap
>> slot before we try to release the cache folio. This means we only try to
>> release the folio once, only taking the reference and lock once - much
>> better than the previous 512 times for the 2M THP case.
>>
>> Contiguous swap entries are gathered in zap_pte_range() and
>> madvise_free_pte_range() in a similar way to how present ptes are
>> already gathered in zap_pte_range().
>>
>> While we are at it, let's simplify by converting the return type of both
>> functions to void. The return value was used only by zap_pte_range() to
>> print a bad pte, and was ignored by everyone else, so the extra
>> reporting wasn't exactly guaranteed. We will still get the warning with
>> most of the information from get_swap_device(). With the batch version,
>> we wouldn't know which pte was bad anyway so could print the wrong one.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>> ---
>> include/linux/pgtable.h | 28 ++++++++++++++
>> include/linux/swap.h | 12 ++++--
>> mm/internal.h | 48 +++++++++++++++++++++++
>> mm/madvise.c | 12 ++++--
>> mm/memory.c | 13 ++++---
>> mm/swapfile.c | 86 ++++++++++++++++++++++++++++++++---------
>> 6 files changed, 167 insertions(+), 32 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index a3fc8150b047..0278259f7078 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
>> }
>> #endif
>> +#ifndef clear_not_present_full_ptes
>> +/**
>> + * clear_not_present_full_ptes - Clear consecutive not present PTEs.
>
>
> Consecutive only in the page table or also in some other sense?
>
> I suspect: just unrelated non-present entries of any kind (swp, nonswp) and any
> offset/pfn.
yes.
>
> Consider documenting that.
How about: "Clear multiple not present PTEs which are consecutive in the pgtable"?
>
>> + * @mm: Address space the ptes represent.
>> + * @addr: Address of the first pte.
>> + * @ptep: Page table pointer for the first entry.
>> + * @nr: Number of entries to clear.
>> + * @full: Whether we are clearing a full mm.
>> + *
>> + * May be overridden by the architecture; otherwise, implemented as a simple
>> + * loop over pte_clear_not_present_full().
>> + *
>> + * Context: The caller holds the page table lock. The PTEs are all not present.
>> + * The PTEs are all in the same PMD.
>> + */
>> +static inline void clear_not_present_full_ptes(struct mm_struct *mm,
>> + unsigned long addr, pte_t *ptep, unsigned int nr, int full)
>> +{
>> + for (;;) {
>> + pte_clear_not_present_full(mm, addr, ptep, full);
>> + if (--nr == 0)
>> + break;
>> + ptep++;
>> + addr += PAGE_SIZE;
>> + }
>> +}
>> +#endif
>> +
>> #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
>> extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
>> unsigned long address,
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index f6f78198f000..5737236dc3ce 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -471,7 +471,7 @@ extern int swap_duplicate(swp_entry_t);
>> extern int swapcache_prepare(swp_entry_t);
>> extern void swap_free(swp_entry_t);
>> extern void swapcache_free_entries(swp_entry_t *entries, int n);
>> -extern int free_swap_and_cache(swp_entry_t);
>> +extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
>> int swap_type_of(dev_t device, sector_t offset);
>> int find_first_swap(dev_t *device);
>> extern unsigned int count_swap_pages(int, int);
>> @@ -520,8 +520,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
>> #define free_pages_and_swap_cache(pages, nr) \
>> release_pages((pages), (nr));
>>
>
>
> [...]
>
>> +
>> +/**
>> + * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
>> + * @start_ptep: Page table pointer for the first entry.
>> + * @max_nr: The maximum number of table entries to consider.
>> + * @entry: Swap entry recovered from the first table entry.
>> + *
>> + * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
>> + * containing swap entries all with consecutive offsets and targeting the same
>> + * swap type.
>> + *
>
> Likely you should document that any swp pte bits are ignored? ()
Sorry, I don't understand this comment. I thought any non-none, non-present PTE
was always considered to contain only a "swap entry", and that a swap entry
consists of a "type" and an "offset" only (and it's a special "non-swap" swap
entry if type > SOME_CONSTANT). Are you saying there are additional fields in
the PTE that are not part of the swap entry?
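To make my assumption concrete, here is my mental model expressed as a
hypothetical helper (is_genuine_swap_pte() doesn't exist; this is just an
illustration of what I thought the PTE contains, and may well be wrong):

        static inline bool is_genuine_swap_pte(pte_t ptent)
        {
                /* ptent: a non-none, non-present PTE read under the PTL */
                swp_entry_t entry = pte_to_swp_entry(ptent);

                /*
                 * My assumption: the only information the PTE carries is the
                 * (swp_type(entry), swp_offset(entry)) pair packed into the
                 * swp_entry_t, and "non-swap" just means the type is outside
                 * the range used by real swap devices (migration, hwpoison,
                 * marker, ...), which is where swap_pte_batch() bails.
                 */
                return !non_swap_entry(entry);
        }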
>
>> + * max_nr must be at least one and must be limited by the caller so scanning
>> + * cannot exceed a single page table.
>> + *
>> + * Return: the number of table entries in the batch.
>> + */
>> +static inline int swap_pte_batch(pte_t *start_ptep, int max_nr,
>> + swp_entry_t entry)
>> +{
>> + const pte_t *end_ptep = start_ptep + max_nr;
>> + unsigned long expected_offset = swp_offset(entry) + 1;
>> + unsigned int expected_type = swp_type(entry);
>> + pte_t *ptep = start_ptep + 1;
>> +
>> + VM_WARN_ON(max_nr < 1);
>> + VM_WARN_ON(non_swap_entry(entry));
>> +
>> + while (ptep < end_ptep) {
>> + pte_t pte = ptep_get(ptep);
>> +
>> + if (pte_none(pte) || pte_present(pte))
>> + break;
>> +
>> + entry = pte_to_swp_entry(pte);
>> +
>> + if (non_swap_entry(entry) ||
>> + swp_type(entry) != expected_type ||
>> + swp_offset(entry) != expected_offset)
>> + break;
>> +
>> + expected_offset++;
>> + ptep++;
>> + }
>> +
>> + return ptep - start_ptep;
>> +}
>
> Looks very clean :)
>
> I was wondering whether we could similarly construct the expected swp PTE and
> only check pte_same.
>
> expected_pte = __swp_entry_to_pte(__swp_entry(expected_type, expected_offset));
>
> ... or have a variant to increase only the swp offset for an existing pte. But
> non-trivial due to the arch-dependent format.
>
> But then, we'd fail on mismatch of other swp pte bits.
Hmm, perhaps I have a misunderstanding regarding "swp pte bits"...
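That aside, if I've understood the suggestion, it would be roughly the below
(untested sketch; and as you say, incrementing the offset on an existing swap
PTE is non-trivial due to the arch-dependent format, so this just reconstructs
the expected PTE on each iteration):

        pte_t expected_pte = swp_entry_to_pte(swp_entry(expected_type,
                                                        expected_offset));

        while (ptep < end_ptep) {
                if (!pte_same(ptep_get(ptep), expected_pte))
                        break;
                expected_offset++;
                expected_pte = swp_entry_to_pte(swp_entry(expected_type,
                                                          expected_offset));
                ptep++;
        }

        return ptep - start_ptep;

And IIUC that's exactly where it would start failing if there are extra bits in
the swap PTE that swp_entry_to_pte() doesn't set.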
>
>
> On swapin, when reusing this function (likely!), we might need to make sure
> that the PTE bits match as well.
>
> See below regarding uffd-wp.
>
>
>> #endif /* CONFIG_MMU */
>> void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 1f77a51baaac..070bedb4996e 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>> struct folio *folio;
>> int nr_swap = 0;
>> unsigned long next;
>> + int nr, max_nr;
>> next = pmd_addr_end(addr, end);
>> if (pmd_trans_huge(*pmd))
>> @@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>> return 0;
>> flush_tlb_batched_pending(mm);
>> arch_enter_lazy_mmu_mode();
>> - for (; addr != end; pte++, addr += PAGE_SIZE) {
>> + for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
>> + nr = 1;
>> ptent = ptep_get(pte);
>> if (pte_none(ptent))
>> @@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>> entry = pte_to_swp_entry(ptent);
>> if (!non_swap_entry(entry)) {
>> - nr_swap--;
>> - free_swap_and_cache(entry);
>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> + max_nr = (end - addr) / PAGE_SIZE;
>> + nr = swap_pte_batch(pte, max_nr, entry);
>> + nr_swap -= nr;
>> + free_swap_and_cache_nr(entry, nr);
>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>> } else if (is_hwpoison_entry(entry) ||
>> is_poisoned_swp_entry(entry)) {
>> pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7dc6c3d9fa83..ef2968894718 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1637,12 +1637,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> folio_remove_rmap_pte(folio, page, vma);
>> folio_put(folio);
>> } else if (!non_swap_entry(entry)) {
>> - /* Genuine swap entry, hence a private anon page */
>> + max_nr = (end - addr) / PAGE_SIZE;
>> + nr = swap_pte_batch(pte, max_nr, entry);
>> + /* Genuine swap entries, hence a private anon pages */
>> if (!should_zap_cows(details))
>> continue;
>> - rss[MM_SWAPENTS]--;
>> - if (unlikely(!free_swap_and_cache(entry)))
>> - print_bad_pte(vma, addr, ptent, NULL);
>> + rss[MM_SWAPENTS] -= nr;
>> + free_swap_and_cache_nr(entry, nr);
>> } else if (is_migration_entry(entry)) {
>> folio = pfn_swap_entry_folio(entry);
>> if (!should_zap_folio(details, folio))
>> @@ -1665,8 +1666,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
>> WARN_ON_ONCE(1);
>> }
>> - pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> - zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
>> + clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>
> For zap_install_uffd_wp_if_needed(), the uffd-wp bit has to match.
>
> zap_install_uffd_wp_if_needed() will use the uffd-wp information in
> ptent->pteval to make a decision whether to place PTE_MARKER_UFFD_WP markers.
>
> On mixture, you either lose some or place too many markers.
What path are you concerned about here? I don't see how what you describe can
happen. swap_pte_batch() will only give me a batch of actual swap entries, and
actual swap entries don't contain uffd-wp info, IIUC. If the function gets to a
"non-swap" swap entry, it bails. I thought the uffd-wp info was populated based
on the VMA state at swap-in? I think you are telling me that it's persisted
per-PTE across the swap cycle?
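If it is persisted per-PTE, then presumably swap_pte_batch() would need an
extra check along these lines (sketch only; assuming pte_swp_uffd_wp() is the
right accessor for a swap PTE)?

        /* before the existing loop in swap_pte_batch(): */
        bool first_uffd_wp = pte_swp_uffd_wp(ptep_get(start_ptep));

        /* ...and inside the loop, once pte has been read with ptep_get(): */
        bool uffd_wp = pte_swp_uffd_wp(pte);

        /* only batch PTEs whose uffd-wp state matches the first one */
        if (uffd_wp != first_uffd_wp)
                break;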
>
> A simple workaround would be to disable any such batching if the VMA does have
> uffd-wp enabled.
>
>> + zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
>> } while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
>> add_mm_rss_vec(mm, rss);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 0d44ee2b4f9c..d059de6896c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
>> /* Reclaim the swap entry if swap is getting full*/
>> #define TTRS_FULL 0x4
>> -/* returns 1 if swap entry is freed */
>> +/*
>> + * returns number of pages in the folio that backs the swap entry. If positive,
>> + * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
>> + * folio was associated with the swap entry.
>> + */
>> static int __try_to_reclaim_swap(struct swap_info_struct *si,
>> unsigned long offset, unsigned long flags)
>> {
>> @@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>> ret = folio_free_swap(folio);
>> folio_unlock(folio);
>> }
>> + ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
>> folio_put(folio);
>> return ret;
>> }
>> @@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>> spin_lock(&si->lock);
>> /* entry was freed successfully, try to use this again */
>> - if (swap_was_freed)
>> + if (swap_was_freed > 0)
>> goto checks;
>> goto scan; /* check next one */
>> }
>> @@ -1572,32 +1577,75 @@ bool folio_free_swap(struct folio *folio)
>> return true;
>> }
>> -/*
>> - * Free the swap entry like above, but also try to
>> - * free the page cache entry if it is the last user.
>> - */
>> -int free_swap_and_cache(swp_entry_t entry)
>
> Can we have some documentation of what this function expects? How does nr
> relate to entry?
>
> i.e., offset range defined by [entry.offset, entry.offset + nr).
Yep, will add something along these lines.
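Something like this, perhaps (wording very much up for discussion, just a
sketch of the kerneldoc I have in mind):

        /**
         * free_swap_and_cache_nr() - Release reference on range of swap entries and
         *                            reclaim their cache if no more references remain.
         * @entry: First entry of range.
         * @nr: Number of entries in range.
         *
         * For each swap entry in the contiguous range, release a reference. If any
         * swap entries become free, try to reclaim their underlying folios, if
         * present. The offset range is defined by [entry.offset, entry.offset + nr).
         */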
>
>> +void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>> {
>> - struct swap_info_struct *p;
>
> It might be easier to get if you do
>
> const unsigned long start_offset = swp_offset(entry);
> const unsigned long end_offset = start_offset + nr;
OK will do this.
>
>> + unsigned long end = swp_offset(entry) + nr;
>> + unsigned int type = swp_type(entry);
>> + struct swap_info_struct *si;
>> + bool any_only_cache = false;
>> + unsigned long offset;
>> unsigned char count;
>> if (non_swap_entry(entry))
>> - return 1;
>> + return;
>> - p = get_swap_device(entry);
>> - if (p) {
>> - if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
>> - put_swap_device(p);
>> - return 0;
>> + si = get_swap_device(entry);
>> + if (!si)
>> + return;
>> +
>> + if (WARN_ON(end > si->max))
>> + goto out;
>> +
>> + /*
>> + * First free all entries in the range.
>> + */
>> + for (offset = swp_offset(entry); offset < end; offset++) {
>> + if (!WARN_ON(data_race(!si->swap_map[offset]))) {
>
> Ouch "!(!)). Confusing.
>
> I'm sure there is a better way to write that, maybe using more lines
>
> if (data_race(si->swap_map[offset])) {
> ...
> } else {
> WARN_ON_ONCE(1);
> }
OK, I thought it was kinda neat myself. I'll change it.
>
>
>> + count = __swap_entry_free(si, swp_entry(type, offset));
>> + if (count == SWAP_HAS_CACHE)
>> + any_only_cache = true;
>> }
>> + }
>> +
>> + /*
>> + * Short-circuit the below loop if none of the entries had their
>> + * reference drop to zero.
>> + */
>> + if (!any_only_cache)
>> + goto out;
>> - count = __swap_entry_free(p, entry);
>> - if (count == SWAP_HAS_CACHE)
>> - __try_to_reclaim_swap(p, swp_offset(entry),
>> + /*
>> + * Now go back over the range trying to reclaim the swap cache. This is
>> + * more efficient for large folios because we will only try to reclaim
>> + * the swap once per folio in the common case. If we do
>> + * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
>> + * latter will get a reference and lock the folio for every individual
>> + * page but will only succeed once the swap slot for every subpage is
>> + * zero.
>> + */
>> + for (offset = swp_offset(entry); offset < end; offset += nr) {
>> + nr = 1;
>> + if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>
> Here we use READ_ONCE() only, above data_race(). Hmmm.
Yes. I think this is correct.
READ_ONCE() is a "marked access" which KCSAN understands, so it won't complain
about it. So data_race() isn't required when READ_ONCE() (or WRITE_ONCE()) is
used. I believe READ_ONCE() is required here because we don't have a lock and we
want to make sure we read it in a non-tearing manner.
We don't need the READ_ONCE() above since we don't care about the exact value -
only that it's not 0 (because we should be holding a ref). So do a plain access
to give the compiler a bit more freedom. But we need to mark that with
data_race() to stop KCSAN from complaining.
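i.e. the intent, spelled out with the same accesses as the patch, just
annotated:

        /*
         * First loop: we only care that the count is non-zero (we should be
         * holding a reference), so a plain access is fine; data_race() just
         * tells KCSAN that we know the access is racy:
         */
        if (data_race(si->swap_map[offset]))
                count = __swap_entry_free(si, swp_entry(type, offset));

        /*
         * Second loop: the exact value is compared, and we don't hold a lock,
         * so use READ_ONCE() to get a single, non-torn read:
         */
        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE)
                nr = __try_to_reclaim_swap(si, offset,
                                           TTRS_UNMAPPED | TTRS_FULL);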
>
>> + /*
>> + * Folios are always naturally aligned in swap so
>> + * advance forward to the next boundary. Zero means no
>> + * folio was found for the swap entry, so advance by 1
>> + * in this case. Negative value means folio was found
>> + * but could not be reclaimed. Here we can still advance
>> + * to the next boundary.
>> + */
>> + nr = __try_to_reclaim_swap(si, offset,
>> TTRS_UNMAPPED | TTRS_FULL);
>> - put_swap_device(p);
>> + if (nr == 0)
>> + nr = 1;
>> + else if (nr < 0)
>> + nr = -nr;
>> + nr = ALIGN(offset + 1, nr) - offset;
>> + }
>
> Apart from that nothing jumped at me.
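One note for the archives on the boundary arithmetic in the second loop, since
it's a bit dense: taking the 2M THP case, the folio occupies (naturally
aligned) swap offsets [512, 1024), and suppose the range being freed starts
mid-folio at offset 515. __try_to_reclaim_swap() returns 512 (or -512 if the
folio couldn't be reclaimed), so nr = ALIGN(515 + 1, 512) - 515 = 1024 - 515 =
509, which advances offset exactly to the next folio boundary at 1024.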
Thanks for the review!