Message-ID: <ndryzvkmrfidmjgj4tl27hk2kmspmb42mxl2smuwgmp5hyedzh@thggle3dhp5j>
Date: Thu, 18 Sep 2025 14:48:20 +0100
From: Kiryl Shutsemau <kirill@...temov.name>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>, 
	Yin Fengwei <fengwei.yin@...el.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, 
	David Hildenbrand <david@...hat.com>, Hugh Dickins <hughd@...gle.com>, 
	Matthew Wilcox <willy@...radead.org>, "Liam R. Howlett" <Liam.Howlett@...cle.com>, 
	Vlastimil Babka <vbabka@...e.cz>, Mike Rapoport <rppt@...nel.org>, 
	Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>, Rik van Riel <riel@...riel.com>, 
	Harry Yoo <harry.yoo@...cle.com>, Johannes Weiner <hannes@...xchg.org>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] mm/rmap: Improve mlock tracking for large folios

On Thu, Sep 18, 2025 at 02:10:05PM +0100, Lorenzo Stoakes wrote:
> On Thu, Sep 18, 2025 at 12:21:57PM +0100, kirill@...temov.name wrote:
> > From: Kiryl Shutsemau <kas@...nel.org>
> >
> > The kernel currently does not mlock large folios when adding them to
> > rmap, on the grounds that it is hard to confirm that the folio is
> > fully mapped and therefore safe to mlock. However, the caller now
> > passes the number of pages of the folio that are being mapped, which
> > makes it easy to check whether the entire folio is mapped to the VMA.
> >
> > Mlock the folio on rmap add if it is fully mapped to the VMA.
> >
> > Signed-off-by: Kiryl Shutsemau <kas@...nel.org>
> 
> The logic looks good to me, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> 
> But note the comments below.
> 
> > ---
> >  mm/rmap.c | 13 ++++---------
> >  1 file changed, 4 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 568198e9efc2..ca8d4ef42c2d 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1478,13 +1478,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> >  				 PageAnonExclusive(cur_page), folio);
> >  	}
> >
> > -	/*
> > -	 * For large folio, only mlock it if it's fully mapped to VMA. It's
> > -	 * not easy to check whether the large folio is fully mapped to VMA
> > -	 * here. Only mlock normal 4K folio and leave page reclaim to handle
> > -	 * large folio.
> > -	 */
> > -	if (!folio_test_large(folio))
> > +	/* Only mlock it if the folio is fully mapped to the VMA */
> > +	if (folio_nr_pages(folio) == nr_pages)
> 
> OK this is nice, as a partially mapped folio will have folio_nr_pages() !=
> nr_pages. So logically this must be correct.
> 
> >  		mlock_vma_folio(folio, vma);
> >  }
> >
> > @@ -1620,8 +1615,8 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
> >  	nr = __folio_add_rmap(folio, page, nr_pages, vma, level, &nr_pmdmapped);
> >  	__folio_mod_stat(folio, nr, nr_pmdmapped);
> >
> > -	/* See comments in folio_add_anon_rmap_*() */
> > -	if (!folio_test_large(folio))
> > +	/* Only mlock it if the folio is fully mapped to the VMA */
> > +	if (folio_nr_pages(folio) == nr_pages)
> >  		mlock_vma_folio(folio, vma);
> >  }
> >
> > --
> > 2.50.1
> >
> 
> I see in try_to_unmap_one():
> 
> 		if (!(flags & TTU_IGNORE_MLOCK) &&
> 		    (vma->vm_flags & VM_LOCKED)) {
> 			/* Restore the mlock which got missed */
> 			if (!folio_test_large(folio))
> 				mlock_vma_folio(folio, vma);
> 
> Do we care about this?
> 
> It seems folio_referenced_one() has some similar logic:
> 
> 		if (vma->vm_flags & VM_LOCKED) {
> 			if (!folio_test_large(folio) || !pvmw.pte) {
> 				/* Restore the mlock which got missed */
> 				mlock_vma_folio(folio, vma);
> 				page_vma_mapped_walk_done(&pvmw);
> 				pra->vm_flags |= VM_LOCKED;
> 				return false; /* To break the loop */
> 			}
> 
> ...
> 
> 	if ((vma->vm_flags & VM_LOCKED) &&
> 			folio_test_large(folio) &&
> 			folio_within_vma(folio, vma)) {
> 		unsigned long s_align, e_align;
> 
> 		s_align = ALIGN_DOWN(start, PMD_SIZE);
> 		e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
> 
> 		/* folio doesn't cross page table boundary and fully mapped */
> 		if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
> 			/* Restore the mlock which got missed */
> 			mlock_vma_folio(folio, vma);
> 			pra->vm_flags |= VM_LOCKED;
> 			return false; /* To break the loop */
> 		}
> 	}
> 
> So maybe we could do something similar in try_to_unmap_one()?

Hm. This seems buggy to me.

mlock_vma_folio() has to be called with the ptl taken, no? The ptl has
already been dropped by this point.
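
Roughly, the lock lifetime in folio_referenced_one() looks like this (a
simplified sketch, not the literal code):

	while (page_vma_mapped_walk(&pvmw)) {
		/* pvmw.ptl is held here; mlock_vma_folio() would be safe */
		...
		pra->mapcount--;
	}
	/* the walk unlocks pvmw.ptl before returning false */

	if ((vma->vm_flags & VM_LOCKED) && ...) {
		/* too late: mlock_vma_folio() runs without the PTE lock */
		mlock_vma_folio(folio, vma);
	}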

+Fengwei.

I think this has to be handled inside the loop once ptes reaches
folio_nr_pages(folio).

Maybe something like this (untested):

diff --git a/mm/rmap.c b/mm/rmap.c
index ca8d4ef42c2d..719f1c99470c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -858,17 +858,13 @@ static bool folio_referenced_one(struct folio *folio,
 		address = pvmw.address;
 
 		if (vma->vm_flags & VM_LOCKED) {
-			if (!folio_test_large(folio) || !pvmw.pte) {
-				/* Restore the mlock which got missed */
-				mlock_vma_folio(folio, vma);
-				page_vma_mapped_walk_done(&pvmw);
-				pra->vm_flags |= VM_LOCKED;
-				return false; /* To break the loop */
-			}
+			unsigned long s_align, e_align;
+
+			/* Small folio or PMD-mapped large folio */
+			if (!folio_test_large(folio) || !pvmw.pte)
+				goto restore_mlock;
+
 			/*
-			 * For large folio fully mapped to VMA, will
-			 * be handled after the pvmw loop.
-			 *
 			 * For large folio cross VMA boundaries, it's
 			 * expected to be picked  by page reclaim. But
 			 * should skip reference of pages which are in
@@ -878,7 +874,23 @@ static bool folio_referenced_one(struct folio *folio,
 			 */
 			ptes++;
 			pra->mapcount--;
-			continue;
+
+			/* Folio must be fully mapped to be mlocked */
+			if (ptes != folio_nr_pages(folio))
+				continue;
+
+			s_align = ALIGN_DOWN(start, PMD_SIZE);
+			e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
+
+			/* folio doesn't cross page table */
+			if (s_align != e_align)
+				continue;
+restore_mlock:
+			/* Restore the mlock which got missed */
+			mlock_vma_folio(folio, vma);
+			page_vma_mapped_walk_done(&pvmw);
+			pra->vm_flags |= VM_LOCKED;
+			return false; /* To break the loop */
 		}
 
 		/*
@@ -914,23 +926,6 @@ static bool folio_referenced_one(struct folio *folio,
 		pra->mapcount--;
 	}
 
-	if ((vma->vm_flags & VM_LOCKED) &&
-			folio_test_large(folio) &&
-			folio_within_vma(folio, vma)) {
-		unsigned long s_align, e_align;
-
-		s_align = ALIGN_DOWN(start, PMD_SIZE);
-		e_align = ALIGN_DOWN(start + folio_size(folio) - 1, PMD_SIZE);
-
-		/* folio doesn't cross page table boundary and fully mapped */
-		if ((s_align == e_align) && (ptes == folio_nr_pages(folio))) {
-			/* Restore the mlock which got missed */
-			mlock_vma_folio(folio, vma);
-			pra->vm_flags |= VM_LOCKED;
-			return false; /* To break the loop */
-		}
-	}
-
 	if (referenced)
 		folio_clear_idle(folio);
 	if (folio_test_clear_young(folio))
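
The idea is that mlock_vma_folio() now runs while the walk still holds
the ptl, and page_vma_mapped_walk_done() drops it afterwards.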
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
