linux-kernel - Re: [PATCH v4] mm/swap: fix race when skipping swapcache

Message-ID: <CAMgjq7DgBOJhDJStwGuD+C6-FNYZBp-cu6M_HAgRry3gBSf7GA@mail.gmail.com>
Date: Tue, 20 Feb 2024 11:42:07 +0800
From: Kairui Song <ryncsn@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm@...ck.org, "Huang, Ying" <ying.huang@...el.com>, Chris Li <chrisl@...nel.org>, 
	Minchan Kim <minchan@...nel.org>, Barry Song <v-songbaohua@...o.com>, Yu Zhao <yuzhao@...gle.com>, 
	SeongJae Park <sj@...nel.org>, David Hildenbrand <david@...hat.com>, Hugh Dickins <hughd@...gle.com>, 
	Johannes Weiner <hannes@...xchg.org>, Matthew Wilcox <willy@...radead.org>, Michal Hocko <mhocko@...e.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, Aaron Lu <aaron.lu@...el.com>, stable@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache

On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton <akpm@...ux-foundation.org> wrote:
>
> On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song <ryncsn@...il.com> wrote:
>
> > From: Kairui Song <kasong@...cent.com>
> >
> > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> > swapin the same entry at the same time, they get different pages (A, B).
> > Before one thread (T0) finishes the swapin and installs page (A)
> > to the PTE, another thread (T1) could finish swapin of page (B),
> > swap_free the entry, then swap out the possibly modified page
> > reusing the same entry. It breaks the pte_same check in (T0) because
> > PTE value is unchanged, causing ABA problem. Thread (T0) will
> > install a stalled page (A) into the PTE and cause data corruption.
> >
> > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (!folio) {
> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                   __swap_count(entry) == 1) {
> > +                     /*
> > +                      * Prevent parallel swapin from proceeding with
> > +                      * the cache flag. Otherwise, another thread may
> > +                      * finish swapin first, free the entry, and swapout
> > +                      * reusing the same entry. It's undetectable as
> > +                      * pte_same() returns true due to entry reuse.
> > +                      */
> > +                     if (swapcache_prepare(entry)) {
> > +                             /* Relax a bit to prevent rapid repeated page faults */
> > +                             schedule_timeout_uninterruptible(1);
>
> Well this is unpleasant.  How often can we expect this to occur?
>

The chance is very low. Using the current mainline kernel and ZRAM,
even with threads set to race on purpose using the reproducer I
provided, it occurred 1528 times in 647132 page faults (~0.2%).

If I run MySQL and sysbench with 128 threads and a 16G buffer pool, with
a 6G cgroup limit and 32G ZRAM, it occurred 1372 times in 40 minutes,
out of 109930201 page faults in total (~0.001%).
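
For reference, the percentages above are simply the number of times the
swapcache_prepare() retry path was hit divided by the total page faults.
A minimal user-space sketch of that arithmetic follows; the helper name
retry_rate_percent is hypothetical and not part of the patch:

```c
#include <stdint.h>

/*
 * Percentage of page faults that hit the retry path.
 * Hypothetical helper for illustration only; nothing like this
 * exists in the kernel patch under discussion.
 */
static double retry_rate_percent(uint64_t retries, uint64_t faults)
{
	/* Avoid dividing by zero when no faults were recorded. */
	if (faults == 0)
		return 0.0;

	return 100.0 * (double)retries / (double)faults;
}
```

Calling retry_rate_percent(1528, 647132) gives roughly 0.24 for the
reproducer, and retry_rate_percent(1372, 109930201) gives roughly 0.0012
for the MySQL/sysbench run, matching the rounded figures quoted above.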