linux-kernel - Re: [PATCH v2] mm/swap: fix race when skipping swapcache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6ad329f6-6e1b-411b-a5d7-be1c8fd89d96@redhat.com>
Date: Fri, 16 Feb 2024 17:16:31 +0100
From: David Hildenbrand <david@...hat.com>
To: Kairui Song <ryncsn@...il.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
 "Huang, Ying" <ying.huang@...el.com>, Chris Li <chrisl@...nel.org>,
 Minchan Kim <minchan@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
 Barry Song <v-songbaohua@...o.com>, SeongJae Park <sj@...nel.org>,
 Hugh Dickins <hughd@...gle.com>, Johannes Weiner <hannes@...xchg.org>,
 Matthew Wilcox <willy@...radead.org>, Michal Hocko <mhocko@...e.com>,
 Yosry Ahmed <yosryahmed@...gle.com>, stable@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache

>> (4) relock the folio. (we do that already, might not want to fail)
>>
>> (4) take the PTE lock. If the PTE did not change, turn it into a present
>> PTE entry. Otherwise, cleanup.
> 
> Very interesting idea!
> 
> I'm just not sure what actual benefit it brings. The only concern
> about reusing swapcache_prepare so far is repeated page faults that
> may hurt performance or statistics, this issue is basically gone after
> adding a schedule().

I think you know that slapping in random schedule() calls is not a 
proper way to wait for an event to happen :) It's pretty much 
unpredictable how long the schedule() will take and if there even is 
anything to schedule to!

With what I propose, just like with page migration, you really do wait 
for the event (page read and unlocked, only the PTE has to be updated) 
to happen before you try anything else.

Now, the difference is most likely that the case here happens much less 
frequently than page migration. Still, you could have all threads 
faulting one the same page and all would do the same dance here.

> 
> We can't drop all the operations around swap cache and map anyway. It
> doesn't know if it should skip the swapcache until swapcache lookup
> and swap count look up are all done. So I think it can be done more
> naturally here with a special value, making things simpler, robust,
> and improving performance a bit more.
> 

The issue will always be that someone can zap the PTE concurrently, 
which would free up the swap cache. With what I propose, that cannot 
happen in the sync swapin case we are discussing here.

If someone where to zap the PTE in the meantime, it would only remove 
the special non-swap entry, indicating to swapin code that concurrent 
zapping happened. The swapin code would handle the swapcache freeing 
instead, after finishing reading the content.

So the swapcache would not get freed concurrently anymore if I am not wrong.

At least the theory, I didn't prototype it yet.

> And in another series [1] I'm working on making shmem make use of
> cache bypassed swapin as well, following this approach I'll have to
> implement another shmem map based synchronization too.
> 

I'd have to look further into that, if something like that could 
similarly apply to shmem. But yes, it's no using PTEs, so a PTE-based 
sync mechanism does definitely not apply..

> After all it's only a rare race, I think a simpler solution might be better.

I'm not sure that simpler means better here. Simpler might be better for 
a backport, though.

The "use schedule() to wait" looks odd, maybe it's a common primitive 
that I simply didn't stumble over yet. (I doubt it but it could be possible)

-- 
Cheers,

David / dhildenb