Message-ID: <ddf1f0f0-dfb9-4df0-9bfa-df9c5cf39ec5@linux.alibaba.com>
Date: Tue, 1 Jul 2025 09:57:05 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>, Matthew Wilcox <willy@...radead.org>,
Kemeng Shi <shikemeng@...weicloud.com>, Chris Li <chrisl@...nel.org>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>,
Barry Song <baohua@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
On 2025/7/1 02:19, Kairui Song wrote:
> On Mon, Jun 30, 2025 at 7:59 PM Baolin Wang
> <baolin.wang@...ux.alibaba.com> wrote:
>>
>>
>>
>> On 2025/6/30 18:06, Kairui Song wrote:
>>> On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang
>>> <baolin.wang@...ux.alibaba.com> wrote:
>>>> On 2025/6/30 17:16, Kairui Song wrote:
>>>>> On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
>>>>> <baolin.wang@...ux.alibaba.com> wrote:
>>>>>> On 2025/6/27 14:20, Kairui Song wrote:
>>>>>>> From: Kairui Song <kasong@...cent.com>
>>>>>>>
>>>>>>> Instead of keeping different paths of splitting the entry and
>>>>>>> recalculating the swap entry and index, do it in one place.
>>>>>>>
>>>>>>> Whenever swapin brings in a folio smaller than the entry, split the
>>>>>>> entry. And always recalculate the entry and index, in case swapin
>>>>>>> brought in a folio larger than the entry order. This removes
>>>>>>> duplicated code and function calls, and makes the code more robust.
>>>>>>>
>>>>>>> Signed-off-by: Kairui Song <kasong@...cent.com>
>>>>>>> ---
>>>>>>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
>>>>>>> 1 file changed, 41 insertions(+), 62 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>>>>> index f85a985167c5..5be9c905396e 100644
>>>>>>> --- a/mm/shmem.c
>>>>>>> +++ b/mm/shmem.c
>>>>>>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>>>>>>> swap_free_nr(swap, nr_pages);
>>>>>>> }
>>>>>>>
>>>>>>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>> - swp_entry_t swap, gfp_t gfp)
>>>>>>> +/*
>>>>>>> + * Split an existing large swap entry. @index should point to one sub mapping
>>>>>>> + * slot within the entry @swap; this sub slot will be split into order 0.
>>>>>>> + */
>>>>>>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>>>>>> + swp_entry_t swap, gfp_t gfp)
>>>>>>> {
>>>>>>> struct address_space *mapping = inode->i_mapping;
>>>>>>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>>>>>>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>>>>>> if (xas_error(&xas))
>>>>>>> return xas_error(&xas);
>>>>>>>
>>>>>>> - return entry_order;
>>>>>>> + return 0;
>>>>>>> }
>>>>>>>
>>>>>>> /*
>>>>>>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>>>> struct address_space *mapping = inode->i_mapping;
>>>>>>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>>>>>>> struct shmem_inode_info *info = SHMEM_I(inode);
>>>>>>> + int error, nr_pages, order, swap_order;
>>>>>>> struct swap_info_struct *si;
>>>>>>> struct folio *folio = NULL;
>>>>>>> bool skip_swapcache = false;
>>>>>>> swp_entry_t swap;
>>>>>>> - int error, nr_pages, order, split_order;
>>>>>>>
>>>>>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>>>>>> swap = radix_to_swp_entry(*foliop);
>>>>>>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>>>>> goto failed;
>>>>>>> }
>>>>>>>
>>>>>>> - /*
>>>>>>> - * Now swap device can only swap in order 0 folio, then we
>>>>>>> - * should split the large swap entry stored in the pagecache
>>>>>>> - * if necessary.
>>>>>>> - */
>>>>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>>>> - if (split_order < 0) {
>>>>>>> - error = split_order;
>>>>>>> - goto failed;
>>>>>>> - }
>>>>>>> -
>>>>>>> - /*
>>>>>>> - * If the large swap entry has already been split, it is
>>>>>>> - * necessary to recalculate the new swap entry based on
>>>>>>> - * the old order alignment.
>>>>>>> - */
>>>>>>> - if (split_order > 0) {
>>>>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>> -
>>>>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> - }
>>>>>>> -
>>>>>>> /* Here we actually start the io */
>>>>>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>>>>>> if (!folio) {
>>>>>>> error = -ENOMEM;
>>>>>>> goto failed;
>>>>>>> }
>>>>>>> - } else if (order > folio_order(folio)) {
>>>>>>> - /*
>>>>>>> - * Swap readahead may swap in order 0 folios into swapcache
>>>>>>> - * asynchronously, while the shmem mapping can still stores
>>>>>>> - * large swap entries. In such cases, we should split the
>>>>>>> - * large swap entry to prevent possible data corruption.
>>>>>>> - */
>>>>>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>>>>>> - if (split_order < 0) {
>>>>>>> - folio_put(folio);
>>>>>>> - folio = NULL;
>>>>>>> - error = split_order;
>>>>>>> - goto failed;
>>>>>>> - }
>>>>>>> -
>>>>>>> - /*
>>>>>>> - * If the large swap entry has already been split, it is
>>>>>>> - * necessary to recalculate the new swap entry based on
>>>>>>> - * the old order alignment.
>>>>>>> - */
>>>>>>> - if (split_order > 0) {
>>>>>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>> -
>>>>>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>>> - }
>>>>>>> - } else if (order < folio_order(folio)) {
>>>>>>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>>>>> }
>>>>>>>
>>>>>>> alloced:
>>>>>>> + /*
>>>>>>> + * We need to split an existing large entry if swapin brought in a
>>>>>>> + * smaller folio for various reasons.
>>>>>>> + *
>>>>>>> + * It is worth noting there is a special case: if there is a smaller
>>>>>>> + * cached folio that covers @swap, but not @index (it only covers the
>>>>>>> + * first few sub entries of the large entry, but @index points to
>>>>>>> + * later parts), the swap cache lookup will still see this folio,
>>>>>>> + * and we need to split the large entry here. Later checks will fail,
>>>>>>> + * as it can't satisfy the swap requirement, and we will retry
>>>>>>> + * the swapin from the beginning.
>>>>>>> + */
>>>>>>> + swap_order = folio_order(folio);
>>>>>>
>>>>>> Nit: 'swap_order' is confusing; can you just use folio_order() or a
>>>>>> better name?
>>>>>
>>>>> Good idea.
>>>>>
>>>>>>
>>>>>>> + if (order > swap_order) {
>>>>>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>>>>> + if (error)
>>>>>>> + goto failed_nolock;
>>>>>>> + }
>>>>>>> +
>>>>>>> + index = round_down(index, 1 << swap_order);
>>>>>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>>>>>
>>>>>> The round_down() of index and swap value here may cause
>>>>>> shmem_add_to_page_cache() to fail to insert a new folio, because the
>>>>>> swap value stored at that index in the shmem mapping does not match,
>>>>>> leading to another swapin page fault for correction.
>>>>>>
>>>>>> For example, shmem stores a large swap entry of order 4 covering
>>>>>> index 0-15. When a swapin fault occurs at index = 3, with swap.val =
>>>>>> 0x4000, if a split happens and this round_down() logic is applied, then
>>>>>> index = 3 and swap.val = 0x4000. However, the actual swap.val stored in
>>>>>> the shmem mapping is 0x4003. This would cause another swapin fault.
>>>>>
>>>>> Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
>>>>> above; it should re-calculate the swap value there. It's fixed in the
>>>>> final patch but left unhandled here. I'll update this part.
>>>>>
>>>>>>
>>>>>> I still prefer my original alignment method. Do you think this will
>>>>>> cause any issues?
>>>>>>
>>>>>> "
>>>>>> if (split_order > 0) {
>>>>>> pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>>
>>>>>> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>>> }
>>>>>> "
>>>>>
>>>>> It only fits the cached swapin and uncached swapin paths, not the
>>>>> cache hit case. A cache hit may see a larger folio so no split
>>>>> happened, but the round_down is still needed.
>>>>
>>>> IMO, this only fits the large swap entry split case.
>>>>
>>>>> And there is another racy case: before this patch, the split may
>>>>> happen first, but shmem_swapin_cluster could still bring in a large
>>>>> folio due to a race in the swap cache layer.
>>>>
>>>> shmem_swapin_cluster() can only allocate an order 0 folio, right?
>>>
>>> It can only allocate an order 0 folio, but it can hit a large folio:
>>> e.g. a parallel swapin/swapout happened, and the folio stays in the
>>> swap cache while we are handling a swapin here.
>>
>> Yes, I know this. This is exactly the issue that patch 1 wants to fix.
>
> Hmm, patch 1 prevents the hang but doesn't prevent things like a
> duplicated fault:
>
> Starts with:
> swap entry val = 0x400, order = 4, covering index 0 - 15, faulting index 3:
>
> Before this patch:
> CPU0 CPU1
> shmem_swapin_folio (see order = 4)
> shmem_swapin_folio (see order = 4)
> /* fallback to order 0 due to */
> /* mem pressure / temporary pin / etc */
> shmem_split_large_entry
> /* split to order 0 */
> /* interrupted */
> /* swapin done with order = 4 folio */
> /* swapout again, leave the large folio
> in cache temporarily */
> folio = swap_cluster_readahead(0x403)
> /* Gets folio order = 4, folio->swap = 0x400
> since swap_cluster_readahead uses swap cache */
> folio->swap.val != swap.val
> /* ! Above test failed ! */
> /* It shouldn't fail; the round_down is needed */
OK. Thanks for the explanation. Yes, this might cause a refault.
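(As a side note, just to convince myself about the numbers: a tiny
userspace sketch of the arithmetic, using the values from your example
above. The ROUND_DOWN macro here is my own, not the kernel helper.)

#include <stdio.h>

#define ROUND_DOWN(x, y)	((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long folio_swap = 0x400;  /* folio->swap.val of the cached order-4 folio */
	unsigned int folio_order = 4;
	unsigned long swap = 0x403;        /* entry value seen for faulting index 3 */
	unsigned long index = 3;

	/* The direct comparison fails: 0x403 != 0x400, hence the refault. */
	printf("direct match: %d\n", swap == folio_swap);

	/* With the round_down, both the entry and the index are aligned to
	 * the folio, so the whole range can be added back in one go. */
	swap = ROUND_DOWN(swap, 1UL << folio_order);
	index = ROUND_DOWN(index, 1UL << folio_order);
	printf("aligned: swap=0x%lx index=%lu match=%d\n",
	       swap, index, swap == folio_swap);
	return 0;
}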
> This patch moved the split after the swapin so it should be OK now,
Yes, I agree with moving the split after the swapin.
> but still the split_order could be unstable, see below:
>>>>> And I'm not sure if split_order is always reliable here; for example,
>>>>> a concurrent split may return an inaccurate value.
>>>>
>>>> We hold the xas lock to ensure the split is reliable. Even though
>>>> concurrent splits may occur, only one split can get the large
>>>> 'split_order'; the other will return 0 (since it will see the large
>>>> swap entry has already been split).
>>>
>>> Yes, it may return 0, so we can get a large folio here, but get
>>> `split_order = 0`?
>>
>> If a split happens, that means 'order' > folio_order(), right? How
>> can you get a large folio in this context?
>>
>>> And if a concurrent swapout/swapin happened, could `split_order` be a
>>> different value?
>>
>> What do you mean by a different value? The large swap entry can only be
>> split once, so the 'split_order' must be 0 or the original large order.
>
> Since d53c78fffe7ad, shmem_split_large_entry doesn't split every slot
> into order 0 IIUC, so things get complex if two CPUs are faulting on
> different indexes landing into two different splitting zones:
>
> Before this patch:
> swap entry val = 0x400, order = 9, covering index 0 - 511, faulting index 3:
>
> CPU0 CPU1
> shmem_swapin_folio (index = 3)
> shmem_swapin_folio (index = 510)
> /* Gets swap = 0x400 */ /* Gets swap = 0x400 */
> /* fallback to order 0 */ /* fallback to order 0 */
> split_order = shmem_split_large_entry
> /* get split_order = 9 */
> /* offset = 3 */
> split_order = shmem_split_large_entry
> /* get split_order = 0, but no split */
> /* map order is 8, offset = 0 */
> /* wrong offset */
> shmem_swapin_cluster(0x400)
> /* It should swapin 0x5fe */
Not true. IIUC, CPU1 will fail to split due to the
'swp_to_radix_entry(swap) != old' check in shmem_split_large_entry(), and
will retry to fix this race.
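(To illustrate the pattern I mean, here is a minimal userspace model,
my own sketch and not the kernel code: the snapshot taken before the
lock is revalidated under the lock, and the caller retries when it has
changed in the meantime.)

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long mapping_slot = 0x400;	/* stands in for the xarray slot */

/*
 * Split-like update: fails when the slot no longer holds the value the
 * caller saw earlier, mirroring the 'swp_to_radix_entry(swap) != old'
 * check; the caller is then expected to re-read the slot and retry.
 */
static int split_entry(unsigned long expected, unsigned long new_val)
{
	int ret = 0;

	pthread_mutex_lock(&lock);
	if (mapping_slot != expected)
		ret = -1;			/* lost the race, retry the fault */
	else
		mapping_slot = new_val;
	pthread_mutex_unlock(&lock);
	return ret;
}

int main(void)
{
	unsigned long old = mapping_slot;	/* lockless snapshot, may go stale */

	mapping_slot = 0x403;			/* simulate a concurrent split */
	printf("split: %d (negative means retry)\n", split_entry(old, 0x500));
	return 0;
}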
> After this patch (with the appended fix, which was mistakenly left to
> the latest patch) it won't swap in the wrong entry now, but
> shmem_split_large_entry may still return an outdated order.
Like I said above, I don't think we can get an outdated order, right?
> Those are two previous races I can come up with. They no longer exist
> after this patch. It's not a bug though, just redundant IO as far as I
> can see, because other checks will fall back, but it looks a bit
> fragile. And shmem_split_large_entry may still return an invalid order,
> just much less likely.
>
> I think the idea here is: because `order =
> shmem_confirm_swap(mapping, index, swap)` ensures the order is stable
> and corresponds to the entry value at one point in time, it is better
> to keep using that value (and so this patch does the offset calculation
> using the `order` retrieved there before issuing the swapin).
I don't think the 'order' obtained through shmem_confirm_swap() is
stable, because shmem_confirm_swap() is only protected by RCU.
However, I think the 'split_order' obtained from
shmem_split_large_entry() under the xas lock is stable.
> And after the swapin has brought a folio in, simply round down using
> the folio's order, which should ensure the folio can be added
> successfully in any case, as long as folio->swap and the index fit the
> shmem mapping.
>
>>>> Based on your current patch, would the following modifications be clearer?
>>>>
>>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>>> index 5be9c905396e..91c071fb7b67 100644
>>>> --- a/mm/shmem.c
>>>> +++ b/mm/shmem.c
>>>> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
>>>> if (xas_error(&xas))
>>>> return xas_error(&xas);
>>>>
>>>> - return 0;
>>>> + return split_order;
>>>> }
>>>>
>>>> /*
>>>> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>> error = shmem_split_swap_entry(inode, index, swap, gfp);
>>>> if (error)
>>>> goto failed_nolock;
>>>> - }
>>>>
>>>> - index = round_down(index, 1 << swap_order);
>>>> - swap.val = round_down(swap.val, 1 << swap_order);
>>>> + /*
>>>> + * If the large swap entry has already been split, it is
>>>> + * necessary to recalculate the new swap entry based on
>>>> + * the old order alignment.
>>>> + */
>>>> + if (split_order > 0) {
>
> The split_order could still be an outdated value, e.g. we may even get
> split_order = 0 with a large folio loaded here.
Ditto. I don't see how split_order can be an outdated value.
>>>> + pgoff_t offset = index - round_down(index, 1 << split_order);
>>>> +
>>>> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>> + }
>>>> + } else if (order < folio_order(folio)) {
>>>> + /*
>>>> + * TODO: explain the possible race...
>>>> + */
>>>> + swap.val = round_down(swap.val, 1 << folio_order(folio));
>>>> + }
>>
>> Sorry, your changes did not convince me. I still prefer my changes,
>> listing out the possible races that might require changes to the swap
>> value, as it would be clearer and more readable. Additionally, this is a
>> cleanup patch, so I hope there are no implicit functional changes.
>
> I was thinking that the fewer branches we have the better; I won't
> insist on this though.
But that holds only under the premise of keeping the code logic clear and
simple.
>>>> /* We have to do this with folio locked to prevent races */
>>>> folio_lock(folio);
>>>> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>> goto failed;
>>>> }
>>>>
>>>> - error = shmem_add_to_page_cache(folio, mapping, index,
>>>> + error = shmem_add_to_page_cache(folio, mapping,
>>>> + round_down(index, nr_pages),
>>>> swp_to_radix_entry(swap), gfp);
>>>> if (error)
>>>> goto failed;
>>>>
>>>>> So I wanted to simplify it: by rounding down to folio_order(folio) we
>>>>> simply get the index and swap value that will be covered by this
>>>>> specific folio. If the covered range still has the corresponding swap
>>>>> entries (checked and ensured by shmem_add_to_page_cache, which holds
>>>>> the xa_lock), then the folio will be inserted back safely and
>>>>> successfully.
>>>>
>>>
>>> I think adding the missing swap value fixup in the !SYNC_IO path
>>> should be good enough?
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 5be9c905396e..2620e4d1b56a 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> struct folio *folio = NULL;
>>> bool skip_swapcache = false;
>>> swp_entry_t swap;
>>> + pgoff_t offset;
>>>
>>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>>> swap = radix_to_swp_entry(*foliop);
>>> @@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> goto failed;
>>> }
>>>
>>> - /* Here we actually start the io */
>>> + /* Cached swapin currently only supports order 0 swapin */
>>> + /* It may hit a large folio but that's OK and handled below */
>>> + offset = index - round_down(index, 1 << order);
>>> + swap.val = swap.val + offset;
>
> BTW, this appended fix should use a different variable so 'swap' won't
> be changed, or it will cause a redundant fault again. My bad.
>
>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>> if (!folio) {
>>> error = -ENOMEM;
>>