Message-ID: <CAMgjq7CmK3g_QpepGJP9Cyu2aA7t+R4XN+NQDqouLCKKF+RJPA@mail.gmail.com>
Date: Mon, 30 Jun 2025 18:06:49 +0800
From: Kairui Song <ryncsn@...il.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>, Matthew Wilcox <willy@...radead.org>,
Kemeng Shi <shikemeng@...weicloud.com>, Chris Li <chrisl@...nel.org>,
Nhat Pham <nphamcs@...il.com>, Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
On Mon, Jun 30, 2025 at 5:53 PM Baolin Wang
<baolin.wang@...ux.alibaba.com> wrote:
> On 2025/6/30 17:16, Kairui Song wrote:
> > On Mon, Jun 30, 2025 at 2:34 PM Baolin Wang
> > <baolin.wang@...ux.alibaba.com> wrote:
> >> On 2025/6/27 14:20, Kairui Song wrote:
> >>> From: Kairui Song <kasong@...cent.com>
> >>>
> >>> Instead of keeping different paths of splitting the entry and
> >>> recalculating the swap entry and index, do it in one place.
> >>>
> >>> Whenever swapin brought in a folio smaller than the entry, split the
> >>> entry. And always recalculate the entry and index, in case it might
> >>> read in a folio that's larger than the entry order. This removes
> >>> duplicated code and function calls, and makes the code more robust.
> >>>
> >>> Signed-off-by: Kairui Song <kasong@...cent.com>
> >>> ---
> >>> mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
> >>> 1 file changed, 41 insertions(+), 62 deletions(-)
> >>>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index f85a985167c5..5be9c905396e 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> >>> swap_free_nr(swap, nr_pages);
> >>> }
> >>>
> >>> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> >>> - swp_entry_t swap, gfp_t gfp)
> >>> +/*
> >>> + * Split an existing large swap entry. @index should point to one sub-mapping
> >>> + * slot within the entry @swap; this sub slot will be split into order 0.
> >>> + */
> >>> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> >>> + swp_entry_t swap, gfp_t gfp)
> >>> {
> >>> struct address_space *mapping = inode->i_mapping;
> >>> XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> >>> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> >>> if (xas_error(&xas))
> >>> return xas_error(&xas);
> >>>
> >>> - return entry_order;
> >>> + return 0;
> >>> }
> >>>
> >>> /*
> >>> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>> struct address_space *mapping = inode->i_mapping;
> >>> struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> >>> struct shmem_inode_info *info = SHMEM_I(inode);
> >>> + int error, nr_pages, order, swap_order;
> >>> struct swap_info_struct *si;
> >>> struct folio *folio = NULL;
> >>> bool skip_swapcache = false;
> >>> swp_entry_t swap;
> >>> - int error, nr_pages, order, split_order;
> >>>
> >>> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> >>> swap = radix_to_swp_entry(*foliop);
> >>> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>> goto failed;
> >>> }
> >>>
> >>> - /*
> >>> - * Now swap device can only swap in order 0 folio, then we
> >>> - * should split the large swap entry stored in the pagecache
> >>> - * if necessary.
> >>> - */
> >>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> >>> - if (split_order < 0) {
> >>> - error = split_order;
> >>> - goto failed;
> >>> - }
> >>> -
> >>> - /*
> >>> - * If the large swap entry has already been split, it is
> >>> - * necessary to recalculate the new swap entry based on
> >>> - * the old order alignment.
> >>> - */
> >>> - if (split_order > 0) {
> >>> - pgoff_t offset = index - round_down(index, 1 << split_order);
> >>> -
> >>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>> - }
> >>> -
> >>> /* Here we actually start the io */
> >>> folio = shmem_swapin_cluster(swap, gfp, info, index);
> >>> if (!folio) {
> >>> error = -ENOMEM;
> >>> goto failed;
> >>> }
> >>> - } else if (order > folio_order(folio)) {
> >>> - /*
> >>> - * Swap readahead may swap in order 0 folios into swapcache
> >>> - * asynchronously, while the shmem mapping can still stores
> >>> - * large swap entries. In such cases, we should split the
> >>> - * large swap entry to prevent possible data corruption.
> >>> - */
> >>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
> >>> - if (split_order < 0) {
> >>> - folio_put(folio);
> >>> - folio = NULL;
> >>> - error = split_order;
> >>> - goto failed;
> >>> - }
> >>> -
> >>> - /*
> >>> - * If the large swap entry has already been split, it is
> >>> - * necessary to recalculate the new swap entry based on
> >>> - * the old order alignment.
> >>> - */
> >>> - if (split_order > 0) {
> >>> - pgoff_t offset = index - round_down(index, 1 << split_order);
> >>> -
> >>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >>> - }
> >>> - } else if (order < folio_order(folio)) {
> >>> - swap.val = round_down(swap.val, 1 << folio_order(folio));
> >>> }
> >>>
> >>> alloced:
> >>> + /*
> >>> + * We need to split an existing large entry if swapin brought in a
> >>> + * smaller folio for various reasons.
> >>> + *
> >>> + * Also worth noting is a special case: if there is a smaller
> >>> + * cached folio that covers @swap, but not @index (it only covers
> >>> + * the first few sub entries of the large entry, but @index points
> >>> + * to later parts), the swap cache lookup will still see this folio,
> >>> + * and we need to split the large entry here. Later checks will fail,
> >>> + * as it can't satisfy the swap requirement, and we will retry
> >>> + * the swapin from the beginning.
> >>> + */
> >>> + swap_order = folio_order(folio);
> >>
> >> Nit: 'swap_order' is confusing; can you just use folio_order() or a
> >> better name?
> >
> > Good idea.
> >
> >>
> >>> + if (order > swap_order) {
> >>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
> >>> + if (error)
> >>> + goto failed_nolock;
> >>> + }
> >>> +
> >>> + index = round_down(index, 1 << swap_order);
> >>> + swap.val = round_down(swap.val, 1 << swap_order);
> >>
> >> The round_down() of index and swap value here may cause
> >> shmem_add_to_page_cache() to fail to insert a new folio, because the
> >> swap value stored at that index in the shmem mapping does not match,
> >> leading to another swapin page fault for correction.
> >>
> >> For example, shmem stores a large swap entry of order 4 covering the
> >> index range 0-15. When a swapin fault occurs at index = 3, with swap.val =
> >> 0x4000, if a split happens and this round_down() logic is applied, then
> >> index = 3 and swap.val = 0x4000. However, the actual swap value stored at
> >> that index in the shmem mapping is 0x4003. This would cause another swapin fault.
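To make the arithmetic in that example concrete, here is a minimal
userspace sketch (round_down() is reimplemented here for illustration;
this is not kernel code, just the numbers from the example above):

#include <stdio.h>

/* Power-of-two round-down, mirroring the kernel macro's semantics. */
#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long index = 3;          /* faulting index */
	unsigned long swap_base = 0x4000; /* order-4 entry, covers indices 0-15 */
	unsigned int entry_order = 4;
	unsigned int folio_order = 0;     /* an order-0 folio was swapped in */

	/* The patch rounds down to the folio order (0), which changes nothing. */
	unsigned long looked_up = round_down(swap_base, 1UL << folio_order);

	/* After the split, the mapping holds swap_base + offset in each slot. */
	unsigned long offset = index - round_down(index, 1UL << entry_order);
	unsigned long stored = swap_base + offset;

	/* Prints: lookup 0x4000 vs stored 0x4003 -> mismatch, refault */
	printf("lookup 0x%lx vs stored 0x%lx -> %s\n", looked_up, stored,
	       looked_up == stored ? "match" : "mismatch, refault");
	return 0;
}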
> >
> > Oops, I missed a swap value fixup in the !SWP_SYNCHRONOUS_IO path
> > above; it should re-calculate the swap value there. It's fixed in the
> > final patch but left unhandled here. I'll update this part.
> >
> >>
> >> I still prefer my original alignment method; do you find this will
> >> cause any issues?
> >>
> >> "
> >> if (split_order > 0) {
> >> pgoff_t offset = index - round_down(index, 1 << split_order);
> >>
> >> swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> >> }
> >> "
> >
> > It only fits the cached and uncached swapin paths, not the cache hit
> > case. A cache hit may see a larger folio, so the split didn't happen,
> > but the round_down is still needed.
>
> IMO, this only fits the large swap entry split case.
>
> > And there is another racy case: before this patch, the split may
> > happen first, but shmem_swapin_cluster() brought in a large folio due
> > to a race in the swap cache layer.
>
> shmem_swapin_cluster() can only allocate an order 0 folio, right?
It can only allocate an order 0 folio, but it can still hit a large
folio: e.g. a parallel swapin/swapout happened and the folio stays in
the swap cache while we are handling a swapin here.
>
> > And I'm not sure if split_order is always reliable here; for example,
> > a concurrent split may return an inaccurate value.
>
> We've held the xas lock to ensure the split is reliable. Even though
> concurrent splits may occur, only one split can get the large
> 'split_order'; the other will return 0 (since it will see the large
> swap entry has already been split).
Yes, it may return 0, so we can get a large folio here but see
`split_order = 0`?
And if a concurrent swapout/swapin happened, `split_order` could be
a different value?
>
> Based on your current patch, would the following modifications be clearer?
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 5be9c905396e..91c071fb7b67 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2254,7 +2254,7 @@ static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> if (xas_error(&xas))
> return xas_error(&xas);
>
> - return 0;
> + return split_order;
> }
>
> /*
> @@ -2351,10 +2351,23 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = shmem_split_swap_entry(inode, index, swap, gfp);
> if (error)
> goto failed_nolock;
> - }
>
> - index = round_down(index, 1 << swap_order);
> - swap.val = round_down(swap.val, 1 << swap_order);
> + /*
> + * If the large swap entry has already been split, it is
> + * necessary to recalculate the new swap entry based on
> + * the old order alignment.
> + */
> + if (split_order > 0) {
> + pgoff_t offset = index - round_down(index, 1 << split_order);
> +
> + swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> + }
> + } else if (order < folio_order(folio)) {
> + /*
> + * TODO: explain the possible race...
> + */
> + swap.val = round_down(swap.val, 1 << folio_order(folio));
> + }
>
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> @@ -2382,7 +2395,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> goto failed;
> }
>
> - error = shmem_add_to_page_cache(folio, mapping, index,
> + error = shmem_add_to_page_cache(folio, mapping,
> + round_down(index, nr_pages),
> swp_to_radix_entry(swap), gfp);
> if (error)
> goto failed;
>
> > So I wanted to simplify it: by rounding down to folio_order(folio), we
> > simply get the index and swap value that will be covered by this
> > specific folio. If the covered range still has the corresponding swap
> > entries (checked and ensured by shmem_add_to_page_cache(), which holds
> > the xa_lock), then the folio will be inserted back safely and
> > successfully.
>
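A rough userspace sketch of the invariant that rationale relies on
(round_down() is reimplemented for illustration, and folio_covers() is
a made-up name for this sketch, not a kernel API):

#include <stdio.h>

#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

/*
 * A folio of order `fo` can only cover the aligned window of 1 << fo
 * slots around `index`; in the kernel, shmem_add_to_page_cache() then
 * verifies under the xa_lock that every slot in that window still
 * holds the matching swap entry before inserting the folio.
 */
static void folio_covers(unsigned long index, unsigned long swap_val,
			 unsigned int fo)
{
	unsigned long base_index = round_down(index, 1UL << fo);
	unsigned long base_swap = round_down(swap_val, 1UL << fo);

	printf("order-%u folio covers index [%lu, %lu), swap [0x%lx, 0x%lx)\n",
	       fo, base_index, base_index + (1UL << fo),
	       base_swap, base_swap + (1UL << fo));
}

int main(void)
{
	/* Fault at index 3 on entry 0x4003, satisfied by an order-2 folio:
	 * prints index [0, 4), swap [0x4000, 0x4004). */
	folio_covers(3, 0x4003, 2);
	return 0;
}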
I think adding the missing swap value fixup in the !SWP_SYNCHRONOUS_IO
path should be good enough?
diff --git a/mm/shmem.c b/mm/shmem.c
index 5be9c905396e..2620e4d1b56a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2276,6 +2276,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio *folio = NULL;
bool skip_swapcache = false;
swp_entry_t swap;
+ pgoff_t offset;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
@@ -2325,7 +2326,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- /* Here we actually start the io */
+ /* Cached swapin currently only supports order 0 swapin */
+ /* It may hit a large folio but that's OK and handled below */
+ offset = index - round_down(index, 1 << order);
+ swap.val = swap.val + offset;
folio = shmem_swapin_cluster(swap, gfp, info, index);
if (!folio) {
error = -ENOMEM;
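Plugging Baolin's earlier numbers into this fixup (again just a
userspace sketch of the arithmetic, not kernel code):

#include <stdio.h>

#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long order = 4, index = 3, swap_val = 0x4000;

	/* The fixup above: step swap.val forward by the sub-entry offset. */
	unsigned long offset = index - round_down(index, 1UL << order);

	swap_val += offset;
	/* Prints offset=3 swap.val=0x4003, matching the post-split entry
	 * stored at index 3, so shmem_add_to_page_cache() can succeed. */
	printf("offset=%lu swap.val=0x%lx\n", offset, swap_val);
	return 0;
}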