Message-ID: <CAEe=Sx=6FCvrp_6x2Bqp3YTzep2s=aWdCmP29g7+sGCWkpNvkg@mail.gmail.com>
Date:   Wed, 12 Jan 2022 14:06:50 -0800
From:   Tim Murray <timmurray@...gle.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Waiman Long <longman@...hat.com>,
        Christoph Hellwig <hch@...radead.org>,
        Jaegeuk Kim <jaegeuk@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-f2fs-devel@...ts.sourceforge.net,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

On Wed, Jan 12, 2022 at 6:06 AM Peter Zijlstra <peterz@...radead.org> wrote:
> *urgh*... so you're making the worst case less likely but fundamentally
> you don't change anything.
>
> If one of those low prio threads manages to block while holding
> cp_rwsem your checkpoint thread will still block for a very long time.
>
> So while you improve the average case, the worst case doesn't improve
> much I think.

You're right about the worst case latency, but I think a lock
optimization is still useful. If I understand the rwsem implementation
correctly, if there's one pending reader while a writer holds the
rwsem and the reader is sleeping when the writer releases the lock,
the writer will mark the rwsem as reader-owned by incrementing the
reader count and then waking the reader in rwsem_mark_wake(). Assuming
that's correct, hitting this case guarantees that the lowest possible
latency of a second write lock is the sum of the reader's scheduling
delay, the reader's critical section duration, and the writer's
scheduling delay. What we see here is that the reader's scheduling
delay is orders of magnitude larger than the read lock duration when
the readers are low priority (tens of milliseconds of scheduling delay
when the device is heavily loaded vs tens of microseconds of holding
the read lock). The scheduling delay of f2fs-ckpt seems insignificant
compared to the scheduling delay of the reader (and is easily
addressed because it's a kthread vs arbitrary userspace code).
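
To put illustrative numbers on that lower bound (using the rough orders
of magnitude above, not measurements from any particular trace, and
just assuming a small scheduling delay for the f2fs-ckpt kthread):

    reader scheduling delay      ~20 ms  (low-prio reader, loaded device)
    reader critical section      ~20 us  (read lock hold time)
    writer scheduling delay      ~ 1 ms  (f2fs-ckpt kthread, assumed)
    ---------------------------------------------------------------
    best case for the 2nd write lock: ~21 ms

so the bound is dominated almost entirely by the reader's scheduling
delay, not by how long the read lock is actually held.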

If the stalls were due to preemption of a low-priority reader that
holds the cp_rwsem read lock, I would expect to see some traces like:

1. low-prio userspace thread is running
2. low-prio userspace thread grabs cp_rwsem read lock, is preempted,
switches back to runnable
3. f2fs-ckpt wakes up, blocks on cp_rwsem write lock
4. low-prio userspace thread switches to running and unblocks f2fs-ckpt

Specifically, the low-prio userspace thread would need to wake up
f2fs-ckpt without the userspace thread ending up in uninterruptible
sleep around that period. To my knowledge, we've never seen that in a
trace. The bad behavior we see always comes after the low-prio thread
was blocked acquiring the read lock and switches back to runnable from
uninterruptible sleep. I think the patches that Waiman mentioned don't
address the case we're seeing; they fix a different pattern, where new
readers keep a writer from becoming runnable. What we're seeing instead
is a long scheduling delay for a reader that a writer woke, which only
makes the next writer runnable once that reader finally runs.

With the existing rwsem usage in f2fs, f2fs-ckpt is guaranteed to wait
for the reader's scheduling delay plus the read lock time every time
there's a pending reader, which is the common case with f2fs-ckpt
(because the write lock is held for milliseconds of IO ops vs
microseconds for readers); that's why we often see hundreds of 10-100ms
checkpoints in a row. By moving to a reader-unfair rwsem, f2fs-ckpt
would only pay the latency of the reader's scheduling delay if the
reader is preempted while holding the lock. I'm sure that preemption
does happen and that the max latency is about the same as a result,
but given how briefly the read lock is held compared to the write lock,
and how much less likely a preempted reader is than a reader arriving
at the lock while the writer is doing IO, the reader-unfair approach
should be a dramatic improvement in 99th percentile (and greater) f2fs
checkpoint latency.
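
In case it helps, here's the rough shape of what I mean by
reader-unfair (a minimal sketch with made-up names, not the actual
patch): readers park on an ordinary waitqueue and retry
down_read_trylock() instead of joining the rwsem's wait list, so a
releasing writer never hands the lock to a sleeping reader ahead of a
waiting writer:

#include <linux/rwsem.h>
#include <linux/wait.h>

struct reader_unfair_rwsem {
        struct rw_semaphore     rwsem;          /* the real lock */
        wait_queue_head_t       read_waiters;   /* parked readers */
};

static inline void ru_init(struct reader_unfair_rwsem *sem)
{
        init_rwsem(&sem->rwsem);
        init_waitqueue_head(&sem->read_waiters);
}

static inline void ru_down_read(struct reader_unfair_rwsem *sem)
{
        /* sleep until trylock succeeds instead of queueing in the rwsem */
        wait_event(sem->read_waiters, down_read_trylock(&sem->rwsem));
}

static inline void ru_up_read(struct reader_unfair_rwsem *sem)
{
        up_read(&sem->rwsem);
}

static inline void ru_down_write(struct reader_unfair_rwsem *sem)
{
        down_write(&sem->rwsem);
}

static inline void ru_up_write(struct reader_unfair_rwsem *sem)
{
        up_write(&sem->rwsem);
        /* only now can readers get in again, so wake them to retry */
        wake_up_all(&sem->read_waiters);
}

The internal rwsem never has readers queued on it (readers only ever
trylock), so when a writer releases, the lock can go straight to the
next waiting writer instead of being handed to a sleeping low-prio
reader.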

> Also, given that this is a system wide rwsem, would percpu-rwsem not be
> 'better' ? Arguably with the same hack cgroups uses for it (see
> cgroup_init()) to lower the cost of percpu_down_write().

Funny you should mention this--the next thing on my todo list is
root-causing stalls around cgroup_threadgroup_rwsem (mostly in
cgroup_css_set_fork() hit from pthread_create in userspace).

If I'm reading percpu-rwsem.c correctly, it's not fair to readers,
much like this patch. It's probably okay to switch to
percpu_rwsem in f2fs given the distribution of readers to writers, but
I am concerned about giving up spin-on-owner given how often these
locks will be hit from different processes at very different
priorities. Does that seem overly paranoid?
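
Concretely, what I'd try is roughly this (illustrative sketch only; the
struct and function names here are stand-ins for struct f2fs_sb_info
and the existing f2fs_lock_op()/f2fs_unlock_op()/f2fs_lock_all()
wrappers around cp_rwsem):

#include <linux/percpu-rwsem.h>

struct f2fs_sb_info_sketch {
        struct percpu_rw_semaphore cp_rwsem;
};

static inline int sketch_init(struct f2fs_sb_info_sketch *sbi)
{
        /* allocates the per-cpu read counters; presumably also where the
         * cgroup_init()-style hack you mention for lowering the cost of
         * percpu_down_write() would go */
        return percpu_init_rwsem(&sbi->cp_rwsem);
}

/* hot read side (what f2fs_lock_op()/f2fs_unlock_op() would become) */
static inline void sketch_lock_op(struct f2fs_sb_info_sketch *sbi)
{
        percpu_down_read(&sbi->cp_rwsem);   /* per-cpu fast path */
}

static inline void sketch_unlock_op(struct f2fs_sb_info_sketch *sbi)
{
        percpu_up_read(&sbi->cp_rwsem);
}

/* write side, taken by the checkpoint path */
static inline void sketch_lock_all(struct f2fs_sb_info_sketch *sbi)
{
        percpu_down_write(&sbi->cp_rwsem);  /* slow: waits out in-flight readers */
}

static inline void sketch_unlock_all(struct f2fs_sb_info_sketch *sbi)
{
        percpu_up_write(&sbi->cp_rwsem);
}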

> Now, I'm not a filesystem developer and I'm not much familiar with the
> problem space, but this locking reads like a fairly big problem. I'm not
> sure optimizing the lock is the answer.

I'm not a filesystem developer either, but rewriting the locking was
my first suggestion too. It seems much harder than tweaking the lock.
Rewriting the locking might still be necessary if we see regular
preemption while the cp_rwsem read lock is held, but I don't have any
data showing that preemption is happening regularly right now.
