linux-kernel - Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aD8Yy0xfJdvLKp4X@x1.local>
Date: Tue, 3 Jun 2025 11:46:19 -0400
From: Peter Xu <peterx@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: Oscar Salvador <osalvador@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Muchun Song <muchun.song@...ux.dev>,
	James Houghton <jthoughton@...gle.com>,
	Gavin Guo <gavinguo@...lia.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault
 and hugetlb_wp

On Tue, Jun 03, 2025 at 05:08:55PM +0200, David Hildenbrand wrote:
> On 03.06.25 16:57, Peter Xu wrote:
> > On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
> > > On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> > > > Right, and thanks for the git digging as usual.  I would agree hugetlb is
> > > > more challenge than many other modules on git archaeology. :)
> > > > 
> > > > Even if I mentioned the invalidate_lock, I don't think I thought deeper
> > > > than that. I just wished whenever possible we still move hugetlb code
> > > > closer to generic code, so if that's the goal we may still want to one day
> > > > have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
> > > > it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
> > > > normally at least allows concurrent fault, but that's currently what isn't
> > > > allowed in hugetlb anyway..
> > > > 
> > > > If we start to remove finer grained locks that work will be even harder,
> > > > and removing folio lock in this case in fault path also brings hugetlbfs
> > > > even further from other file systems.  That might be slightly against what
> > > > we used to wish to do, which is to make it closer to others.  Meanwhile I'm
> > > > also not yet sure the benefit of not taking folio lock all across, e.g. I
> > > > don't expect perf would change at all even if lock is avoided.  We may want
> > > > to think about that too when doing so.
> > > 
> > > Ok, I have to confess I was not looking things from this perspective,
> > > but when doing so, yes, you are right, we should strive to find
> > > replacements wherever we can for not using hugetlb-specific code.
> > > 
> > > I do not know about this case though, not sure what other options do we
> > > have when trying to shut concurrent faults while doing other operation.
> > > But it is something we should definitely look at.
> > > 
> > > Wrt. to the lock.
> > > There were two locks, old_folio (taken in hugetlb_fault) and
> > > pagecache_folio one.
> > 
> > There're actually three places this patch touched, the 3rd one is
> > hugetlb_no_page(), in which case I also think we should lock it, not only
> > because file folios normally does it (see do_fault(), for example), but
> > also that's exactly what James mentioned I believe on possible race of
> > !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the lines:
> > 
> > 		folio = alloc_hugetlb_folio(vma, vmf->address, false);
> >                  ...
> > 		folio_zero_user(folio, vmf->real_address);
> > 		__folio_mark_uptodate(folio);
> > 
> > > The thing was not about worry as how much perf we leave on the table
> > > because of these locks, as I am pretty sure is next to 0, but my drive
> > > was to understand what are protection and why, because as the discussion
> > > showed, none of us really had a good idea about it and it turns out that this
> > > goes back more than ~20 years ago.
> > > 
> > > Another topic for the lock (old_folio, so the one we copy from),
> > > when we compare it to generic code, we do not take the lock there.
> > > Looking at do_wp_page(), we do __get__ a reference on the folio we copy
> > > from, but not the lock, so AFAIU, the lock seems only to please
> > 
> > Yes this is a good point; for CoW path alone maybe we don't need to lock
> > old_folio.
> > 
> > > folio_move_anon_rmap() from hugetlb_wp.
> > > 
> > > Taking a look at do_wp_page()->wp_can_reuse_anon_folio() which also
> > > calls folio_move_anon_rmap() in case we can re-use the folio, it only
> > > takes the lock before the call to folio_move_anon_rmap(), and then
> > > unlocks it.
> > 
> > IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(), but
> > for checking swapcache/ksm stuff which needs to be serialized with folio
> > lock.
> > 
> > So I'm not 100% confident on the folio_move_anon_rmap(), but I _think_ it
> > deserves a data_race() and IIUC it only work not because of the folio lock,
> > but because of how anon_vma is managed as a tree as of now, so that as long
> > as WRITE_ONCE() even a race is benign (because the rmap walker will either
> > see a complete old anon_vma that includes the parent process's anon_vma, or
> > the child's).  What really protects the anon_vma should really be anon_vma
> > lock.. That can definitely be a separate topic.  I'm not sure whether you'd
> > like to dig this part out, but if you do I'd also be more than happy to
> > know whether my understanding needs correction here.. :)
> > 
> > In general, I still agree with you that if hugetlb CoW path can look closer
> > to do_wp_page then it's great.
> 
> 
> As stated elsewhere, the mapcount check + folio_move_anon_rmap need the
> folio lock.

Could you elaborate what would go wrong if we do folio_move_anon_rmap()
without folio lock here?  Just to make sure we're on the same page: we
already have pgtable lock held, and we decided to reuse an anonymous
hugetlb page.

Thanks,

-- 
Peter Xu