Message-ID: <82d32616-9135-4b30-9e91-b190a03bcb54@gmail.com>
Date: Thu, 24 Oct 2024 11:20:14 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Hugh Dickins <hughd@...gle.com>, Andrew Morton <akpm@...ux-foundation.org>
Cc: Yang Shi <shy828301@...il.com>, Wei Yang <richard.weiyang@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>, David Hildenbrand <david@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>, Barry Song <baohua@...nel.org>,
Kefeng Wang <wangkefeng.wang@...wei.com>, Ryan Roberts
<ryan.roberts@....com>, Nhat Pham <nphamcs@...il.com>,
Zi Yan <ziy@...dia.com>, Chris Li <chrisl@...nel.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH hotfix 1/2] mm/thp: fix deferred split queue not
partially_mapped
On 24/10/2024 05:10, Hugh Dickins wrote:
> Recent changes are putting more pressure on THP deferred split queues:
> under load revealing long-standing races, causing list_del corruptions,
> "Bad page state"s and worse (I keep BUGs in both of those, so usually
> don't get to see how badly they end up without). The relevant recent
> changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
> improved swap allocation, and underused THP splitting.
>
> The new unlocked list_del_init() in deferred_split_scan() is buggy.
> I gave bad advice: it looks plausible, since that's a local on-stack
> list, but the fact is that it can race with a third party freeing or
> migrating the preceding folio (properly unqueueing it with refcount 0
> while holding split_queue_lock), thereby corrupting the list linkage.
>
> The obvious answer would be to take split_queue_lock there: but it has
> a long history of contention, so I'm reluctant to add to that. Instead,
> make sure that there is always one safe (raised refcount) folio before,
> by delaying its folio_put(). (And of course I was wrong to suggest
> updating split_queue_len without the lock: leave that until the splice.)
>
Thanks for this; I imagine this was quite tricky to debug.
Avoiding taking the split_queue_lock by keeping a reference to the
preceding folio is really nice.
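For anyone following along, here is a minimal single-threaded
user-space sketch of that idea, with made-up names (node, node_put,
try_to_split) rather than the kernel API: the scan never drops its
reference on the most recent entry it has left queued, so the unlocked
removal of a later entry never sits next to links that a third party
could be rewriting.

/*
 * Minimal sketch (hypothetical names, not the kernel API) of the
 * "keep one safe predecessor" trick: while scanning a local list
 * whose entries a third party may unqueue once their refcount drops,
 * hold the reference of the last entry left queued and only release
 * it after the scan.
 */
#include <stdio.h>
#include <stdbool.h>

struct node {
	struct node *prev, *next;
	int refcount;		/* one base ref + one taken by the scan */
	bool partially_mapped;
	int id;
};

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* Stand-in for folio_put(): at refcount 0 a third party (not modelled
 * here) would be free to unqueue and reuse the node. */
static void node_put(struct node *n)
{
	if (--n->refcount == 0)
		printf("node %d: last reference gone\n", n->id);
}

/* Pretend the split succeeds for even ids only. */
static bool try_to_split(struct node *n)
{
	return (n->id % 2) == 0;
}

int main(void)
{
	struct node head, nodes[4];
	struct node *n, *next, *prev = NULL;
	int removed = 0;

	list_init(&head);
	for (int i = 0; i < 4; i++) {
		nodes[i] = (struct node){ .refcount = 2, .id = i,
					  .partially_mapped = (i == 3) };
		list_add_tail(&head, &nodes[i]);
	}

	for (n = head.next, next = n->next; n != &head;
	     n = next, next = n->next) {
		bool did_split = try_to_split(n);

		if (!did_split && !n->partially_mapped) {
			/* Unlocked removal: safe only because "prev"
			 * (if any) still holds a raised refcount, so
			 * nobody can rewrite the links behind us. */
			list_del_init(n);
			removed++;
		} else {
			/* Make this node the new safe predecessor and
			 * delay its put; the previous safe node is
			 * released just below. */
			struct node *tmp = prev;

			prev = n;
			n = tmp;
		}
		if (n)
			node_put(n);
	}
	/* In the kernel, the remaining entries are spliced back and
	 * split_queue_len adjusted under the lock at this point
	 * (not modelled here). */
	if (prev)
		node_put(prev);

	printf("removed %d entries, last safe node was %d\n",
	       removed, prev ? prev->id : -1);
	return 0;
}

The real loop achieves the same with swap(folio, prev), and the final
folio_put(prev) only happens after the survivors have been spliced
back under the lock.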
And I should have spotted in the original patch that split_queue_len
shouldn't be changed without holding split_queue_lock.
Acked-by: Usama Arif <usamaarif642@...il.com>
> And remove two over-eager partially_mapped checks, restoring those tests
> to how they were before: if uncharge_folio() or free_tail_page_prepare()
> finds _deferred_list non-empty, it's in trouble whether or not that folio
> is partially_mapped (and the flag was already cleared in the latter case).
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Signed-off-by: Hugh Dickins <hughd@...gle.com>
> ---
> mm/huge_memory.c | 21 +++++++++++++++++----
> mm/memcontrol.c | 3 +--
> mm/page_alloc.c | 5 ++---
> 3 files changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2fb328880b50..a1d345f1680c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3718,8 +3718,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
> unsigned long flags;
> LIST_HEAD(list);
> - struct folio *folio, *next;
> - int split = 0;
> + struct folio *folio, *next, *prev = NULL;
> + int split = 0, removed = 0;
>
> #ifdef CONFIG_MEMCG
> if (sc->memcg)
> @@ -3775,15 +3775,28 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> */
> if (!did_split && !folio_test_partially_mapped(folio)) {
> list_del_init(&folio->_deferred_list);
> - ds_queue->split_queue_len--;
> + removed++;
> + } else {
> + /*
> + * That unlocked list_del_init() above would be unsafe,
> + * unless its folio is separated from any earlier folios
> + * left on the list (which may be concurrently unqueued)
> + * by one safe folio with refcount still raised.
> + */
> + swap(folio, prev);
> }
> - folio_put(folio);
> + if (folio)
> + folio_put(folio);
> }
>
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> list_splice_tail(&list, &ds_queue->split_queue);
> + ds_queue->split_queue_len -= removed;
> spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>
> + if (prev)
> + folio_put(prev);
> +
> /*
> * Stop shrinker if we didn't split any page, but the queue is empty.
> * This can happen if pages were freed under us.
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c57..2703227cce88 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4631,8 +4631,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
> !folio_test_hugetlb(folio) &&
> - !list_empty(&folio->_deferred_list) &&
> - folio_test_partially_mapped(folio), folio);
> + !list_empty(&folio->_deferred_list), folio);
>
> /*
> * Nobody should be changing or seriously looking at
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8afab64814dc..4b21a368b4e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -961,9 +961,8 @@ static int free_tail_page_prepare(struct page *head_page, struct page *page)
> break;
> case 2:
> /* the second tail page: deferred_list overlaps ->mapping */
> - if (unlikely(!list_empty(&folio->_deferred_list) &&
> - folio_test_partially_mapped(folio))) {
> - bad_page(page, "partially mapped folio on deferred list");
> + if (unlikely(!list_empty(&folio->_deferred_list))) {
> + bad_page(page, "on deferred list");
> goto out;
> }
> break;