Message-ID: <66C159D8-D267-4B3B-9384-1CE94533990E@nvidia.com>
Date: Mon, 24 Nov 2025 16:08:34 -0500
From: Zi Yan <ziy@...dia.com>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>, Nico Pache <npache@...hat.com>,
Ryan Roberts <ryan.roberts@....com>, Dev Jain <dev.jain@....com>,
Barry Song <baohua@...nel.org>, Lance Yang <lance.yang@...ux.dev>,
Miaohe Lin <linmiaohe@...wei.com>, Naoya Horiguchi <nao.horiguchi@...il.com>,
Wei Yang <richard.weiyang@...il.com>, Balbir Singh <balbirs@...dia.com>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 2/4] mm/huge_memory: replace can_split_folio() with
direct refcount calculation
On 24 Nov 2025, at 14:22, David Hildenbrand (Red Hat) wrote:
> On 11/24/25 18:05, Zi Yan wrote:
>> On 24 Nov 2025, at 5:41, David Hildenbrand (Red Hat) wrote:
>>
>>> On 11/22/25 03:55, Zi Yan wrote:
>>>> can_split_folio() is just a refcount comparison, making sure only the
>>>> split caller holds an extra pin. Open code it with
>>>> folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
>>>> used by folio_ref_freeze(), add folio_cache_references() to calculate it.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@...nel.org>
>>>> Signed-off-by: Zi Yan <ziy@...dia.com>
>>>> ---
>>>>  include/linux/huge_mm.h |  1 -
>>>>  mm/huge_memory.c        | 43 ++++++++++++++++-------------------------
>>>>  mm/vmscan.c             |  3 ++-
>>>>  3 files changed, 19 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 97686fb46e30..1ecaeccf39c9 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -369,7 +369,6 @@ enum split_type {
>>>>  	SPLIT_TYPE_NON_UNIFORM,
>>>>  };
>>>>
>>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>  		unsigned int new_order);
>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index c1f1055165dd..6c821c1c0ac3 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3455,23 +3455,6 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
>>>>  	}
>>>>  }
>>>>
>>>> -/* Racy check whether the huge page can be split */
>>>> -bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>> -{
>>>> -	int extra_pins;
>>>> -
>>>> -	/* Additional pins from page cache */
>>>> -	if (folio_test_anon(folio))
>>>> -		extra_pins = folio_test_swapcache(folio) ?
>>>> -				folio_nr_pages(folio) : 0;
>>>> -	else
>>>> -		extra_pins = folio_nr_pages(folio);
>>>> -	if (pextra_pins)
>>>> -		*pextra_pins = extra_pins;
>>>> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins -
>>>> -			caller_pins;
>>>> -}
>>>> -
>>>>  static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>>>  {
>>>>  	for (; nr_pages; page++, nr_pages--)
>>>> @@ -3776,17 +3759,26 @@ int folio_check_splittable(struct folio *folio, unsigned int new_order,
>>>>  	return 0;
>>>>  }
>>>>
>>>> +/* Number of folio references from the pagecache or the swapcache. */
>>>> +static unsigned int folio_cache_references(const struct folio *folio)
>>>> +{
>>>> +	if (folio_test_anon(folio) && !folio_test_swapcache(folio))
>>>> +		return 0;
>>>> +	return folio_nr_pages(folio);
>>>> +}
>>>> +
>>>>  static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int new_order,
>>>>  		struct page *split_at, struct xa_state *xas,
>>>>  		struct address_space *mapping, bool do_lru,
>>>>  		struct list_head *list, enum split_type split_type,
>>>> -		pgoff_t end, int *nr_shmem_dropped, int extra_pins)
>>>> +		pgoff_t end, int *nr_shmem_dropped)
>>>>  {
>>>>  	struct folio *end_folio = folio_next(folio);
>>>>  	struct folio *new_folio, *next;
>>>>  	int old_order = folio_order(folio);
>>>>  	int ret = 0;
>>>>  	struct deferred_split *ds_queue;
>>>> +	int extra_pins = folio_cache_references(folio);
>>>
>>> Can we just inline the call to folio_cache_references() and get rid of extra_pins?
>>> (which is a bad name either way)
>>>
>>>
>>> if (folio_ref_freeze(folio, folio_cache_references(folio) + 1)) {
>>>
>>>
>>> BTW, now that we have this helper, I wonder if we should then also use it,
>>> for clarification, on the unfreeze path:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 0acdc2f26ee0c..7cbcf61b7971d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3824,8 +3824,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>  		zone_device_private_split_cb(folio, new_folio);
>>>
>>> -		expected_refs = folio_expected_ref_count(new_folio) + 1;
>>> -		folio_ref_unfreeze(new_folio, expected_refs);
>>> +		folio_ref_unfreeze(new_folio, folio_cache_references(new_folio) + 1);
>>>
>>>  		if (do_lru)
>>>  			lru_add_split_folio(folio, new_folio, lruvec, list);
>>> @@ -3868,8 +3867,7 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
>>>  	 * Otherwise, a parallel folio_try_get() can grab @folio
>>>  	 * and its caller can see stale page cache entries.
>>>  	 */
>>> -	expected_refs = folio_expected_ref_count(folio) + 1;
>>> -	folio_ref_unfreeze(folio, expected_refs);
>>> +	folio_ref_unfreeze(folio, folio_cache_references(folio) + 1);
>>>
>>>  	if (do_lru)
>>>  		unlock_page_lruvec(lruvec);
>>>
>>>
>>
>> Both make sense to me. Will make the change.
>>
>> By comparing folio_cache_references() with folio_expected_ref_count(),
>> one difference is that folio_expected_ref_count() does not give the right
>> refcount for shmem in swapcache.
>
> Good point. Likely nobody runs into that right now because nobody can really do anything with these folios before they are re-added to the pagecache or mapped into page tables.
>
>>
>> This is the folio_expected_ref_count() code:
>>
>> 	if (folio_test_anon(folio)) {
>> 		/* One reference per page from the swapcache. */
>> 		ref_count += folio_test_swapcache(folio) << order;
>> 	} else {
>> 		/* One reference per page from the pagecache. */
>> 		ref_count += !!folio->mapping << order;
>> 		/* One reference from PG_private. */
>> 		ref_count += folio_test_private(folio);
>> 	}
>>
>> Shmem in swapcache means !folio_test_anon(folio) && folio_test_swapcache(folio).
>
> See below, it's actually
>
> folio_test_anon(folio) && folio_test_swapbacked(folio) && folio_test_swapcache(folio)

!folio_test_anon(folio) && folio_test_swapbacked(folio) &&
folio_test_swapcache(folio)
Right?
>
> I think ...
>
>> The above code gives 0, but folio_cache_references() gives folio_nr_pages(folio).
>> It should not cause any issue, since IIUC shmem in swapcache only happens
>> while the folio holds an additional ref, so
>> folio_expected_ref_count() != folio_ref_count() anyway. For split, shmem
>> in swapcache is not supported yet,
>
> Right.
>
>> so folio_expected_ref_count() in split code does not
>> affect shmem in swapcache. But folio_expected_ref_count() should be
>> fixed, right?
>
> We should better handle it, agreed.
>
> Staring at the history of folio_expected_ref_count() once again: back when we had folio_expected_refs() in the migration code, we didn't seem to handle it either, I think.
>
> -static int folio_expected_refs(struct address_space *mapping,
> -		struct folio *folio)
> -{
> -	int refs = 1;
> -
> -	if (!mapping)
> -		return refs;
> -
> -	refs += folio_nr_pages(folio);
> -	if (folio_test_private(folio))
> -		refs++;
> -
> -	return refs;
> -}
>
>
> gup.c doesn't care, because the pages are still mapped.
>
> khugepaged.c similarly.
>
> memfd.c doesn't care because the pages are still in the pagecache.
>
> So I suspect nothing is broken, but the migration case needs a second look.
For migration, shmem in swapcache happens in shmem_writeout(), where an
additional ref is placed on the folio. And the migration caller places
a ref on the folio before migrating it. The folio then has 2 refs, which
is equal to neither folio_expected_ref_count() (returning 0) + 1 nor
folio_expected_refs() (returning 1), so migration bails out.
So it is safe.
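Spelled out with the "expected refs + 1" convention the migration code
uses (just a sketch of the accounting above, not the actual kernel code):

	/*
	 * Sketch: an unmapped shmem folio sitting in the swapcache.
	 * folio_expected_ref_count() takes the !anon branch, sees
	 * folio->mapping == NULL and no PG_private, and returns 0.
	 */
	int expected_count = folio_expected_ref_count(folio) + 1;	/* 0 + 1 */

	/* shmem_writeout()'s swapcache ref + the migration caller's ref */
	if (folio_ref_count(folio) != expected_count)			/* 2 != 1 */
		return -EAGAIN;

so the folio gets skipped rather than mis-migrated.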
>
>>
>> Like:
>>
>> if (folio_test_anon(folio)) {
>> /* One reference per page from the swapcache. */
>> ref_count += folio_test_swapcache(folio) << order;
>> } else {
>> /* One reference per page from shmem in the swapcache. */
>> ref_count += folio_test_swapcache(folio) << order;
>> /* One reference per page from the pagecache. */
>> ref_count += !!folio->mapping << order;
>> /* One reference from PG_private. */
>> ref_count += folio_test_private(folio);
>> }
>>
>> or simplified into
>>
>> if (!folio_test_anon(folio)) {
>> /* One reference per page from the pagecache. */
>> ref_count += !!folio->mapping << order;
>> /* One reference from PG_private. */
>> ref_count += folio_test_private(folio);
>> }
>> /* One reference per page from the swapcache (anon or shmem). */
>> ref_count += folio_test_swapcache(folio) << order;
>> ?
>
> That is incorrect, I think, due to the swapcache flag being able to give false positives (PG_owner_priv_1).
Got it. So it should be:
	if (folio_test_anon(folio)) {
		/* One reference per page from the swapcache. */
		ref_count += folio_test_swapcache(folio) << order;
	} else {
		/* One reference per page from shmem in the swapcache. */
		ref_count += (folio_test_swapbacked(folio) &&
			      folio_test_swapcache(folio)) << order;
		/* One reference per page from the pagecache. */
		ref_count += !!folio->mapping << order;
		/* One reference from PG_private. */
		ref_count += folio_test_private(folio);
	}
I wonder if we should have folio_test_shmem_in_swapcache() instead.
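Something like this (untested sketch, using the name from the previous
sentence):

	static inline bool folio_test_shmem_in_swapcache(const struct folio *folio)
	{
		return !folio_test_anon(folio) && folio_test_swapbacked(folio) &&
		       folio_test_swapcache(folio);
	}

Then the shmem line in the else branch above would just read

	ref_count += folio_test_shmem_in_swapcache(folio) << order;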
BTW, this page flag reuse is really confusing. I see PG_checked is
PG_owner_priv_1 too, and __folio_migrate_mapping() uses folio_test_swapcache()
to decide the number of i_pages entries. Wouldn't that cause an issue?
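The bit I mean looks roughly like this (quoting __folio_migrate_mapping()
from memory, so the details may be off):

	if (folio_test_swapbacked(folio)) {
		__folio_set_swapbacked(newfolio);
		if (folio_test_swapcache(folio)) {
			folio_set_swapcache(newfolio);
			newfolio->private = folio_get_private(folio);
			entries = nr;
		} else {
			entries = 1;
		}
	} else {
		VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
		entries = 1;
	}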
ext4's release_folio() refuses to release the folio when PG_checked is set,
and ubifs clears PG_checked in its release_folio(). I have not checked all
the other filesystems yet. Maybe later.
Best Regards,
Yan, Zi