linux-kernel - Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1cdf6a43-9599-4e34-8f70-29ed16295a60@redhat.com>
Date: Thu, 20 Jun 2024 18:51:58 +0200
From: David Hildenbrand <david@...hat.com>
To: Hugh Dickins <hughd@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Barry Song
 <baohua@...nel.org>, Baolin Wang <baolin.wang@...ux.alibaba.com>,
 willy@...radead.org, wangkefeng.wang@...wei.com, chrisl@...nel.org,
 ying.huang@...el.com, 21cnbao@...il.com, ryan.roberts@....com,
 shy828301@...il.com, ziy@...dia.com, ioworker0@...il.com,
 da.gomez@...sung.com, p.raghav@...sung.com, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem

On 20.06.24 18:27, Hugh Dickins wrote:
> On Thu, 20 Jun 2024, David Hildenbrand wrote:
> 
>>>
>>> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
>>> was all about optimizing a known-exclusive case, so it surprises me
>>> to see it being extended to non-exclusive; and I worry over how its
>>> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
>>> I've never caught up with David's exclusive changes, I'm out of date).
>>
>> We discussed that a while ago: if we wouldn't be holding the folio lock in the
>> "folio == swapcache" at that point (which we do for both do_swap_page() and
>> unuse_pte()) things would already be pretty broken.
> 
> You're thinking of the non-atomic-ness of atomic_set(): I agree that
> the folio lock makes that safe (against other adds; but not against
> concurrent removes, which could never occur with the old "_new" usage).
> 
> But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
> once the folio lock has been dropped, another non-exclusive add could
> come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
> it should be 2)?

That would mean that we have someone checking folio_test_anon() before 
doing the folio_add_new*, and *not* performing that check under folio 
lock such that we can race with another folio_add_new*() call. (or not 
checking for folio_test_anon() at all, which would be similarly broken)

When thinking about this in the past I concluded that such code would be 
inherently racy and wrong already and the existing VM_WARN_ON_FOLIO() 
would have revealed that race.

Whoever calls folio_add_new*() either (a) just allocated that thing and 
can be sure that nobody else does something concurrently -- no need for 
the folio lock; or (b) calls it only when !folio_test_anon(), and 
synchronizes against any other possible races using the folio lock.

swapin code falls into category (b), everything else we have in the tree 
into category (a).

So far my thinking :)

> 
>>
>> That's I added a while ago:
>>
>> if (unlikely(!folio_test_anon(folio))) {
>> 	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> 	/*
>> 	 * For a PTE-mapped large folio, we only know that the single
>> 	 * PTE is exclusive. Further, __folio_set_anon() might not get
>> 	 * folio->index right when not given the address of the head
>> 	 * page.
>> 	 */
>> 	...
>>
>> We should probably move that VM_WARN_ON_FOLIO() to folio_add_new_anon_rmap()
>> and document that it's required in the non-exclusive case.
>>
>>>
>>> But even if those are wrong, I'd expect them to tend towards a mapped
>>> page becoming unreclaimable, then "Bad page map" when munmapped,
>>> not to any of the double-free symptoms I've actually seen.)
>>
>> What's the first known-good commit?
> 
> I cannot answer that with any certainty: we're on the shifting sands
> of next and mm-unstable, and the bug itself is looking rather like
> something which gets amplified or masked by other changes - witness
> my confident arrival at Barry's series as introducing the badness,
> only for a longer run then to contradict that conclusion.
> 
> There was no sign of a problem in a 20-hour run of the same load on
> rc3-based next-2024-06-13 (plus my posted fixes); there has been no
> sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
> itself, mm.git needing the more urgent attention). mm-everything-
> 2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
> did show filemap_get_folio() hang. mm-everything-2024-06-18 (minus
> Baolin's mTHP shmem swap) is where I'm hunting it.

Good, as long as we suspect only something  very recent is affected.

There was this report against rc3 upstream:

https://lore.kernel.org/all/CA+G9fYvKmr84WzTArmfaypKM9+=Aw0uXCtuUKHQKFCNMGJyOgQ@mail.gmail.com/

It's only been observed once, and who knows what's happening in that NFS 
setup. I suspected NUMA hinting code, but I am not sure if that really 
explains the issue ...

-- 
Cheers,

David / dhildenb