Message-ID: <fd10b2b3-064c-4ee2-ad7d-e30a0e194533@redhat.com>
Date: Tue, 3 Jun 2025 17:08:55 +0200
From: David Hildenbrand <david@...hat.com>
To: Peter Xu <peterx@...hat.com>, Oscar Salvador <osalvador@...e.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Muchun Song <muchun.song@...ux.dev>, James Houghton <jthoughton@...gle.com>,
 Gavin Guo <gavinguo@...lia.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault
 and hugetlb_wp

On 03.06.25 16:57, Peter Xu wrote:
> On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
>> On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
>>> Right, and thanks for the git digging as usual.  I would agree hugetlb is
>>> more challenging than many other modules for git archaeology. :)
>>>
>>> Even if I mentioned the invalidate_lock, I don't think I thought deeper
>>> than that.  I just wish that, whenever possible, we keep moving hugetlb
>>> code closer to generic code, so if that's the goal we may still want to
>>> one day have a closer look at whether hugetlb can also use
>>> invalidate_lock.  Maybe it isn't worthwhile in the end: invalidate_lock
>>> is currently a rwsem, which normally at least allows concurrent faults,
>>> but concurrent faults are exactly what hugetlb currently doesn't allow
>>> anyway..
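
[ For context, the generic invalidate_lock pattern looks roughly like the
  sketch below.  The filemap_invalidate_lock*() helpers are the real ones
  from include/linux/fs.h, but the surrounding flow is simplified and not
  any particular filesystem's actual fault path: ]

	/* Fault path: take mapping->invalidate_lock shared, so faults
	 * can run concurrently with each other. */
	filemap_invalidate_lock_shared(mapping);
	/* ... look up the page cache and map the folio ... */
	filemap_invalidate_unlock_shared(mapping);

	/* Truncate/hole-punch path: take it exclusive to shut out all
	 * concurrent faults on this mapping. */
	filemap_invalidate_lock(mapping);
	/* ... remove folios from the page cache ... */
	filemap_invalidate_unlock(mapping);
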
>>>
>>> If we start to remove finer-grained locks, that work will be even harder,
>>> and removing the folio lock from the fault path in this case also moves
>>> hugetlbfs even further away from other file systems.  That might be
>>> slightly against what we used to wish to do, which is to make it closer
>>> to the others.  Meanwhile, I'm also not yet sure of the benefit of not
>>> taking the folio lock throughout, e.g. I don't expect perf would change
>>> at all even if the lock is avoided.  We may want to think about that too
>>> when doing so.
>>
>> Ok, I have to confess I was not looking at things from this perspective,
>> but when doing so, yes, you are right: we should strive to find
>> replacements wherever we can so we stop using hugetlb-specific code.
>>
>> I do not know about this case though; I am not sure what other options we
>> have for shutting out concurrent faults while doing another operation.
>> But it is something we should definitely look at.
>>
>> Wrt. the locks:
>> There were two of them, the old_folio one (taken in hugetlb_fault) and
>> the pagecache_folio one.
> 
> There are actually three places this patch touched; the 3rd one is
> hugetlb_no_page(), in which case I also think we should lock it, not only
> because file folios normally do it (see do_fault(), for example), but
> also because that's exactly the possible race James mentioned, I believe,
> of a !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the
> lines of:
> 
> 		folio = alloc_hugetlb_folio(vma, vmf->address, false);
> 		...
> 		folio_zero_user(folio, vmf->real_address);
> 		__folio_mark_uptodate(folio);
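
[ To make the race concrete, here is a rough, hypothetical sketch of how
  holding the folio lock across initialization serializes against
  UFFDIO_CONTINUE.  The helpers are real kernel APIs, but the control flow
  is simplified and is not the actual hugetlb code: ]

	/* Fault side (hugetlb_no_page()-like): initialize under the lock. */
	folio = alloc_hugetlb_folio(vma, vmf->address, false);
	folio_lock(folio);
	/* ... insert the folio into the page cache ... */
	folio_zero_user(folio, vmf->real_address);
	__folio_mark_uptodate(folio);	/* publish only after zeroing */
	folio_unlock(folio);

	/* UFFDIO_CONTINUE side: filemap_lock_folio() takes the folio lock,
	 * so once it gets the folio it sees it fully initialized, and a
	 * !uptodate folio can be rejected instead of being mapped. */
	folio = filemap_lock_folio(mapping, idx);
	if (!IS_ERR(folio) && !folio_test_uptodate(folio)) {
		/* ... bail out or retry ... */
	}
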
> 
>> The thing was not about worrying how much perf we leave on the table
>> because of these locks, as I am pretty sure it is next to 0; my drive
>> was to understand what these locks protect and why, because, as the
>> discussion showed, none of us really had a good idea about it, and it
>> turns out that this goes back more than ~20 years.
>>
>> Another topic for the lock (old_folio's, so that of the folio we copy
>> from): when we compare it to generic code, we do not take that lock there.
>> Looking at do_wp_page(), we do __get__ a reference on the folio we copy
>> from, but not the lock, so AFAIU the lock seems to exist only to please
> 
> Yes, this is a good point; for the CoW path alone maybe we don't need to
> lock old_folio.
> 
>> folio_move_anon_rmap() from hugetlb_wp.
>>
>> Taking a look at do_wp_page()->wp_can_reuse_anon_folio(), which also
>> calls folio_move_anon_rmap() in case we can re-use the folio: it only
>> takes the lock right before the call to folio_move_anon_rmap(), and then
>> unlocks it.
> 
> IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(), but
> for checking the swapcache/ksm stuff, which needs to be serialized with
> the folio lock.
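
[ Heavily condensed from mm/memory.c's wp_can_reuse_anon_folio(); treat
  this as a sketch of the locking pattern rather than the exact current
  implementation: ]

	if (!folio_trylock(folio))
		return false;
	/* Under the folio lock the swapcache/ksm state and the refcount
	 * are stable, so the exclusivity check below cannot race. */
	if (folio_test_swapcache(folio))
		folio_free_swap(folio);
	if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
		folio_unlock(folio);
		return false;
	}
	/* We are the exclusive owner: point folio->mapping at this
	 * vma's anon_vma. */
	folio_move_anon_rmap(folio, vma);
	folio_unlock(folio);
	return true;
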
> 
> So I'm not 100% confident on the folio_move_anon_rmap(), but I _think_ it
> deserves a data_race(), and IIUC it works not because of the folio lock,
> but because of how the anon_vma is managed as a tree as of now, so that as
> long as the store is a WRITE_ONCE() even a race is benign (because the
> rmap walker will either see the complete old anon_vma that includes the
> parent process's anon_vma, or the child's).  What really protects the
> anon_vma should be the anon_vma lock.. That can definitely be a separate
> topic.  I'm not sure whether you'd like to dig this part out, but if you
> do I'd also be more than happy to know whether my understanding needs
> correction here.. :)
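
[ For reference, the body of folio_move_anon_rmap() in mm/rmap.c is
  roughly the snippet below; quoted from memory, so check the tree for
  the exact current form: ]

	void *anon_vma = vma->anon_vma;

	/* The caller must hold the folio lock. */
	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

	anon_vma += PAGE_MAPPING_ANON;
	/* A single WRITE_ONCE() publishes the new anon_vma pointer and the
	 * PAGE_MAPPING_ANON bit together, so a racing rmap walker observes
	 * either the old or the new mapping, never a torn value. */
	WRITE_ONCE(folio->mapping, anon_vma);
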
> 
> In general, I still agree with you that if the hugetlb CoW path can look
> closer to do_wp_page() then it's great.


As stated elsewhere, the mapcount check + folio_move_anon_rmap() need the
folio lock.

-- 
Cheers,

David / dhildenb

