linux-kernel - Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAEe=SxnWeK0pSfijPKJSTxBiMgD1Ev69fV3qSTCgWASk0b3vhA@mail.gmail.com>
Date:   Mon, 10 Jan 2022 11:41:23 -0800
From:   Tim Murray <timmurray@...gle.com>
To:     Waiman Long <longman@...hat.com>
Cc:     Christoph Hellwig <hch@...radead.org>,
        Jaegeuk Kim <jaegeuk@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-f2fs-devel@...ts.sourceforge.net,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

On Mon, Jan 10, 2022 at 7:02 AM Peter Zijlstra <peterz@...radead.org> wrote:
> Can we start by describing the actual problem?

Of course. Sorry, I didn't realize Jaegeuk was going to move so
quickly with this patch :)

We have a few thousand traces from internal Android devices where
large numbers of threads throughout the system end up blocked on
f2fs_issue_checkpoint(), which userspace code usually reaches via
fsync(). In the same traces, the f2fs-ckpt kthread is often blocked
around f2fs_write_checkpoint(), which does the actual fsync()
operation on behalf of some number of callers (coalescing multiple
fsync calls into a single checkpoint op). I think the problem is
something like:

1. f2fs-ckpt thread is running f2fs_write_checkpoint(), holding the
cp_rwsem write lock while doing so via f2fs_lock_all() in
block_operations().
2. Random very-low-priority thread A makes some other f2fs call that
tries to get the cp_rwsem read lock by atomically adding on the rwsem,
fails and deschedules in uninterruptible sleep. cp_rwsem now has a
non-zero reader count but is write-locked.
3. f2fs-ckpt thread releases the cp_rwsem write lock. cp_rwsem now has
a non-zero reader count and is not write-locked, so is reader-locked.
4. Other threads call fsync(), which requests checkpoints from
f2fs-ckpt, and block on a completion event that f2fs-ckpt dispatches.
cp_rwsem still has a non-zero reader count because the low-prio thread
A from (2) has not been scheduled again yet.
5. f2fs-ckpt wakes up to perform checkpoints, but it stalls on the
write lock via cmpxchg in block_operations() until the low-prio thread
A has run and released the cp_rwsem read lock. Because f2fs-ckpt can't
run, all fsync() callers are also effectively blocked by the
low-priority thread holding the read lock.

I think this is the rough shape of the problem (vs readers holding the
lock for too long or something like that) because the low-priority
thread is never run between when it is initially made runnable by
f2fs-ckpt and when it runs tens/hundreds of milliseconds later then
immediately unblocks f2fs-ckpt.

When there's a lot of oversubscription, threads running IO, etc, we
see f2fs-ckpt stalls of tens to hundreds of milliseconds per f2fs-ckpt
iteration. In one recent experiment of the immediate aftermath of
booting and unlocking a phone for the first time, over a 10s period,
we saw f2fs-ckpt blocked on the rwsem for 9.7s, running for <50ms over
110 separate slices (so probably ~110 fsyncs), and blocked on IO
operations for <100ms. The worst 10s period I can find in a similar
experiment with the unfair reader approach has f2fs-ckpt blocked on
the rwsem for 130ms, running for 20ms over 95 slices, blocked on IO
for 40ms, and sleeping the rest of the time.

Unfair rwsems are rarely appropriate, but I think the lack of fairness
makes sense here because of the relative importance of f2fs-ckpt vs
any particular reader of that rwsem. The one concern I have about
down_read_unfair() is whether the fairness should be a property of the
down_read operation or the rw_semaphore itself. In the f2fs case, it
seems like all read locks of the rw_semaphores should be unfair and
that adding a single fair down_read() could introduce a similar
regression, but duplicating rw_semaphore also seems bad. If it does
make sense to make the fairness a property of the semaphore, could
that property be flagged in an init_unfair_rwsem() macro and then
checked in lockdep?