Message-ID: <8419f066-fdfb-4ccb-14cf-27a3139fe713@huaweicloud.com>
Date: Wed, 18 Jun 2025 16:08:55 +0800
From: Kemeng Shi <shikemeng@...weicloud.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
Hugh Dickins <hughd@...gle.com>, Baolin Wang
<baolin.wang@...ux.alibaba.com>, Matthew Wilcox <willy@...radead.org>,
Chris Li <chrisl@...nel.org>, Nhat Pham <nphamcs@...il.com>,
Baoquan He <bhe@...hat.com>, Barry Song <baohua@...nel.org>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process

on 6/18/2025 2:50 PM, Kairui Song wrote:
> On Wed, Jun 18, 2025 at 2:27 PM Kemeng Shi <shikemeng@...weicloud.com> wrote:
>> on 6/18/2025 2:35 AM, Kairui Song wrote:
>>> From: Kairui Song <kasong@...cent.com>
>>>
>>> Tidy up the mTHP swapin workflow. There should be no feature change, but
>>> consolidates the mTHP related check to one place so they are now all
>>> wrapped by CONFIG_TRANSPARENT_HUGEPAGE, and will be trimmed off by
>>> compiler if not needed.
>>>
>>> Signed-off-by: Kairui Song <kasong@...cent.com>
>>> ---
>>> mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
>>> 1 file changed, 78 insertions(+), 97 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 0ad49e57f736..46dea2fa1b43 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>
>> ...
>>
>>> @@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> /* Look it up and read it in.. */
>>> folio = swap_cache_get_folio(swap, NULL, 0);
>>> if (!folio) {
>>> - int nr_pages = 1 << order;
>>> - bool fallback_order0 = false;
>>> -
>>> /* Or update major stats only when swapin succeeds?? */
>>> if (fault_type) {
>>> *fault_type |= VM_FAULT_MAJOR;
>>> count_vm_event(PGMAJFAULT);
>>> count_memcg_event_mm(fault_mm, PGMAJFAULT);
>>> }
>>> -
>>> - /*
>>> - * If uffd is active for the vma, we need per-page fault
>>> - * fidelity to maintain the uffd semantics, then fallback
>>> - * to swapin order-0 folio, as well as for zswap case.
>>> - * Any existing sub folio in the swap cache also blocks
>>> - * mTHP swapin.
>>> - */
>>> - if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
>>> - !zswap_never_enabled() ||
>>> - non_swapcache_batch(swap, nr_pages) != nr_pages))
>>> - fallback_order0 = true;
>>> -
>>> - /* Skip swapcache for synchronous device. */
>>> - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>>> - folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
>>> + /* Try direct mTHP swapin bypassing swap cache and readahead */
>>> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
>>> + swap_order = order;
>>> + folio = shmem_swapin_direct(inode, vma, index,
>>> + swap, &swap_order, gfp);
>>> if (!IS_ERR(folio)) {
>>> skip_swapcache = true;
>>> goto alloced;
>>> }
>>> -
>>> - /*
>>> - * Fallback to swapin order-0 folio unless the swap entry
>>> - * already exists.
>>> - */
>>> + /* Fallback if order > 0 swapin failed with -ENOMEM */
>>> error = PTR_ERR(folio);
>>> folio = NULL;
>>> - if (error == -EEXIST)
>>> + if (error != -ENOMEM || !swap_order)
>>> goto failed;
>>> }
>>> -
>>> /*
>>> - * Now swap device can only swap in order 0 folio, then we
>>> - * should split the large swap entry stored in the pagecache
>>> - * if necessary.
>>> + * Try order 0 swapin using swap cache and readahead, it still
>>> + * may return order > 0 folio due to raced swap cache.
>>> */
>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>> - if (split_order < 0) {
>>> - error = split_order;
>>> - goto failed;
>>> - }
>>> -
>>> - /*
>>> - * If the large swap entry has already been split, it is
>>> - * necessary to recalculate the new swap entry based on
>>> - * the old order alignment.
>>> - */
>>> - if (split_order > 0) {
>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>> -
>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> - }
>>> -
>> For the order-0 fallback, we always called shmem_swapin_cluster() before,
>> but we will call shmem_swap_alloc_folio() now. It seems fine to me; just
>> pointing this out for others to recheck.
>
> It's an improvement; I forgot to mention that in the commit message.
> Readahead is a burden for SWP_SYNCHRONOUS_IO devices, so calling
> shmem_swap_alloc_folio() is better. I'll update the commit message.
>
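Agreed. Just to spell out the difference for other reviewers, the order-0
fallback on a SWP_SYNCHRONOUS_IO device roughly changes as sketched below
(simplified, not the actual patch code; in the patch the direct path is
driven by shmem_swapin_direct()):

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
		/*
		 * Synchronous device: allocate a folio and read it in
		 * directly, skipping swap cache and readahead.
		 */
		folio = shmem_swap_alloc_folio(inode, vma, index, swap,
					       0, gfp);
	} else {
		/* Other devices: go through swap cache and readahead. */
		folio = shmem_swapin_cluster(swap, gfp, info, index);
	}

Before this patch, the order-0 fallback on such devices still took the
shmem_swapin_cluster() path.
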
>>> - /* Here we actually start the io */
>>> folio = shmem_swapin_cluster(swap, gfp, info, index);
>>> if (!folio) {
>>> error = -ENOMEM;
>>> goto failed;
>>> }
>>> - } else if (order > folio_order(folio)) {
>>> - /*
>>> - * Swap readahead may swap in order 0 folios into swapcache
>>> - * asynchronously, while the shmem mapping can still stores
>>> - * large swap entries. In such cases, we should split the
>>> - * large swap entry to prevent possible data corruption.
>>> - */
>>> - split_order = shmem_split_large_entry(inode, index, swap, gfp);
>>> - if (split_order < 0) {
>>> - folio_put(folio);
>>> - folio = NULL;
>>> - error = split_order;
>>> - goto failed;
>>> - }
>>> -
>>> - /*
>>> - * If the large swap entry has already been split, it is
>>> - * necessary to recalculate the new swap entry based on
>>> - * the old order alignment.
>>> - */
>>> - if (split_order > 0) {
>>> - pgoff_t offset = index - round_down(index, 1 << split_order);
>>> -
>>> - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>> - }
>>> - } else if (order < folio_order(folio)) {
>>> - swap.val = round_down(swp_type(swap), folio_order(folio));
>>> }
>>> -
>>> alloced:
>>> + /*
>>> + * We need to split an existing large entry if swapin brought in a
>>> + * smaller folio due to various of reasons.
>>> + *
>>> + * And worth noting there is a special case: if there is a smaller
>>> + * cached folio that covers @swap, but not @index (it only covers
>>> + * first few sub entries of the large entry, but @index points to
>>> + * later parts), the swap cache lookup will still see this folio,
>>> + * And we need to split the large entry here. Later checks will fail,
>>> + * as it can't satisfy the swap requirement, and we will retry
>>> + * the swapin from beginning.
>>> + */
>>> + swap_order = folio_order(folio);
>>> + if (order > swap_order) {
>>> + error = shmem_split_swap_entry(inode, index, swap, gfp);
>>> + if (error)
>>> + goto failed_nolock;
>>> + }
>>> +
>>> + index = round_down(index, 1 << swap_order);
>>> + swap.val = round_down(swap.val, 1 << swap_order);
>>> +
>>
>> If the swap entry order is reduced but index and swap.val remain
>> unchanged, shmem_split_swap_entry() will still split the reduced large
>> swap entry successfully, but index and swap.val will be incorrect because
>> of the wrong swap_order. We can catch the unexpected order and swap entry
>> in shmem_add_to_page_cache() and retry the swapin, but this will
>> introduce extra cost.
>>
>> If we return the order of the entry that was actually split in
>> shmem_split_swap_entry() and update index and swap.val with the returned
>> order, we can avoid that extra cost for the mentioned racy case.
>
> The swap_order here is simply the folio's order, so doing this
> round_down will get the swap entry and index that will be covered by
> this folio (the later folio->swap.val != swap.val check ensures the
> values are valid here).
>
> Remember, as the previous patch mentioned, we may see the shmem
> mapping's entry get split while still seeing a large folio here. With
> the current design, shmem_add_to_page_cache can still succeed as it
> should, but if we round_down with the returned order of
> shmem_split_swap_entry, it would fail.
My bad... Thanks for the explanation. The calculation looks fine to me.
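
For other readers, here is how I read the rounding now, with made-up
numbers (hypothetical values, just to illustrate the arithmetic):

	/*
	 * Suppose the mapping holds an order-4 swap entry starting at
	 * index 16 / swap offset 0x100, the fault is at index 21 (swap
	 * offset 0x105), and the swap cache lookup returned an order-2
	 * folio whose folio->swap offset is 0x104.
	 */
	swap_order = folio_order(folio);		/* 2 */
	index = round_down(21, 1 << swap_order);	/* 21 -> 20 */
	/* Ignoring the swap type bits in swap.val for simplicity: */
	swap.val = round_down(0x105, 1 << swap_order);	/* 0x105 -> 0x104 */
	/*
	 * index and swap now match exactly what the folio covers
	 * (indexes 20..23, offsets 0x104..0x107), so the later
	 * folio->swap.val != swap.val check passes.
	 */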
>
> So I think it's better to keep it this way; besides, the next patch
> makes use of this for sanity checks and to reduce false positives of
> swap cache lookups.
>