linux-kernel - Re: [PATCH v2] mm/swap: fix race when skipping swapcache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7C4jWRB913Zmav5BWhL8VrQ7XOVZn62SdWBXrcUMpBEow@mail.gmail.com>
Date: Fri, 16 Feb 2024 03:07:44 +0800
From: Kairui Song <ryncsn@...il.com>
To: Minchan Kim <minchan@...nel.org>, Barry Song <21cnbao@...il.com>
Cc: Chris Li <chrisl@...nel.org>, "Huang, Ying" <ying.huang@...el.com>, linux-mm@...ck.org, 
	Andrew Morton <akpm@...ux-foundation.org>, Yu Zhao <yuzhao@...gle.com>, 
	Barry Song <v-songbaohua@...o.com>, SeongJae Park <sj@...nel.org>, Hugh Dickins <hughd@...gle.com>, 
	Johannes Weiner <hannes@...xchg.org>, Matthew Wilcox <willy@...radead.org>, Michal Hocko <mhocko@...e.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, David Hildenbrand <david@...hat.com>, stable@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache

On Thu, Feb 15, 2024 at 8:44 AM Minchan Kim <minchan@...nel.org> wrote:
>
> Hi Kairui,
>

Hi, Minchan,

>
> I also played the swap refcount stuff ideas and encoutered a lot
> problems(e.g., kernel crash and/or data missing).
>
> Now I realize your original solution would be best to fix under such a
> small change.
>
> Folks, please chime in if you have another idea.
>
> >
> > Instead if we add a schedule or schedule_timeout_uninterruptible(1),
>
> How much your workload is going If we use schedule instead of
> schedule_timeout_uninterruptible(1)? If that doesn't increase the
> statistics a lot, I prefer the schedule.
> (TBH, I didn't care much about the stat since it would be
> very rare).

I've tested both schedule() and schedule_timeout_uninterruptible(1)
locally using the reproducer, and some other cases, and looked all
good.

The reproducer I provided demonstrates an extreme case, so I think
schedule() is good enough already.

I currently don't have any more benchmarks or tests, as the change is
very small for most workloads. I'll use schedule() here in V3 if no
one else has other suggestions.

I remember Barry's series about large folio swapin may change the ZRAM
read time depending on folio size, and cause strange races that's
different from the reproducer. Not sure if I can test using some known
race cases. That could be helpful to verify this fix and if schedule()
or schedule_timeout_uninterruptible(1) is better here.

> > it seems quite enough to avoid repeated page faults. I did following
> > test with the reproducer I provided in the commit message:
> >
> > Using ZRAM instead of brd for more realistic workload:
> > $ perf stat -e minor-faults,major-faults -I 30000 ./a.out
> >
> > Unpatched kernel:
> > #           time             counts unit events
> >     30.000096504             33,089      minor-faults:u
> >     30.000096504            958,745      major-faults:u
> >     60.000375308             22,150      minor-faults:u
> >     60.000375308          1,221,665      major-faults:u
> >     90.001062176             24,840      minor-faults:u
> >     90.001062176          1,005,427      major-faults:u
> >    120.004233268             23,634      minor-faults:u
> >    120.004233268          1,045,596      major-faults:u
> >    150.004530091             22,358      minor-faults:u
> >    150.004530091          1,097,871      major-faults:u
> >
> > Patched kernel:
> > #           time             counts unit events
> >     30.000069753            280,094      minor-faults:u
> >     30.000069753          1,198,871      major-faults:u
> >     60.000415915            436,862      minor-faults:u
> >     60.000415915            919,860      major-faults:u
> >     90.000651670            359,176      minor-faults:u
> >     90.000651670            955,340      major-faults:u
> >    120.001088135            289,137      minor-faults:u
> >    120.001088135            992,317      major-faults:u
> >
> > Indeed there is a huge increase of minor page faults, which indicate
> > the raced path returned without handling the fault. This reproducer
> > tries to maximize the race chance so the reading is more terrifying
> > than usual.
> >
> > But after adding a schedule/schedule_timeout_uninterruptible(1) after
> > swapcache_prepare failed, still using the same test:
> >
> > Patched kernel (adding a schedule() before goto out):
> > #           time             counts unit events
> >     30.000335901             43,066      minor-faults:u
> >     30.000335901          1,163,232      major-faults:u
> >     60.034629687             35,652      minor-faults:u
> >     60.034629687            844,188      major-faults:u
> >     90.034941472             34,957      minor-faults:u
> >     90.034941472          1,218,174      major-faults:u
> >    120.035255386             36,241      minor-faults:u
> >    120.035255386          1,073,395      major-faults:u
> >    150.035535786             33,057      minor-faults:u
> >    150.035535786          1,171,684      major-faults:u
> >
> > Patched kernel (adding a schedule_timeout_uninterruptible(1) before goto out):
> > #           time             counts unit events
> >     30.000044452             26,748      minor-faults:u
> >     30.000044452          1,202,064      major-faults:u
> >     60.000317379             19,883      minor-faults:u
> >     60.000317379          1,186,152      major-faults:u
> >     90.000568779             18,333      minor-faults:u
> >     90.000568779          1,151,015      major-faults:u
> >    120.000787253             22,844      minor-faults:u
> >    120.000787253            887,531      major-faults:u
> >    150.001309065             18,900      minor-faults:u
> >    150.001309065          1,211,894      major-faults:u
> >
> > The minor faults are basically the same as before, or even lower since
> > other races are also reduced, with no observable performance
> > regression so far.
> > If the vague margin of this schedule call is still concerning, I think
> > another approach is just a new swap map value for parallel cache
> > bypassed swapping to loop on.
>
> Long term, the swap map vaule would be good but for now, I prefer
> your original solution with some delay with schedule.

Yes, that's also what I have in mind.

With a swap map value, actually I think the cache bypassed path may
even go faster as it can simplify some swap map value change process,
but that requires quite some extra work and change, could be
introduced later as an optimization.

> Thanks, Kairui!

Thanks for the comments!