lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEe=Sxmcn5+YUXBQhxDpzZVJu_T6S6+EURDqrP9uUS-PHGyuSg@mail.gmail.com>
Date:   Tue, 22 Mar 2022 17:34:17 -0700
From:   Tim Murray <timmurray@...gle.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Waiman Long <longman@...hat.com>, Jaegeuk Kim <jaegeuk@...nel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux F2FS Dev Mailing List 
        <linux-f2fs-devel@...ts.sourceforge.net>
Subject: Re: [GIT PULL] f2fs for 5.18

Hi Linus,

On Tue, Mar 22, 2022 at 10:50 AM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> Even when people get the semantics and memory ordering right (which is
> not always the case, but at least the f2fs code uses real lock
> primitives - just oddly - and should thus be ok), it invariably tends
> to be a sign of something else being very wrong.
>
> And I can easily believe that in this case it's due to a rmsem issue
> that was already fixed long long ago as per Waiman.
>
> Can people please test with the actual modern rwsem code and with the
> odd reader-unfair locks disabled?

I did try the patch from Waiman (backported to a local Android device
running 5.10 without the unfair-rwsem patch) when we were first
discussing this, and as far as I could tell, his patch did not make a
difference for this problem. I think I can explain why it solves a
different problem, and it is related to how f2fs uses locking
primitives and worker threads. In this case, it's important to know
that f2fs's checkpoint thread coalesces checkpoints on behalf of all
userspace threads calling fsync(), userspace threads calling many f2fs
functions take the read lock on the rwsem, and the checkpoint thread
takes the write lock. The userspace threads can be any CFS priority
and may be throttled for various reasons (constrained cpuset or cpuctl
runtime limits).

Here's one example from a trace associated with a very slow app start
from a 5.10 device before the unfair-rwsem patch:

1. pid 6711, a random userspace thread, is running on a CPU at time 0ms.
2. After 2.6ms of running, pid 6711 enters state D, deschedules, and
is blocked in f2fs_new_inode according to sched_blocked_reason (a
tracepoint we added in the Android kernel that emits the function
where a thread is blocked in uninterruptible sleep).
3. The f2fs_checkpoint thread runs 5ms after pid 6711 deschedules.
f2fs_ckpt runs for 63us, makes pid 6711 runnable, and then deschedules
in state D blocked at rwsem_down_write_slowpath.
4. pid 6711 remains runnable for 341ms--it's a low priority thread in
a throttled process and the system is busy. Meanwhile, f2fs_ckpt
remains in D for 341ms.
5. pid 6711 finally runs and makes f2fs_ckpt runnable after 11us.

If the problem were related to optimistic spinning that would be
addressed by Waiman's patch, I'd expect to see pid 6711 running
concurrently with f2fs_ckpt. However, that's not what happens; there's
a 5ms gap in between pid 6711 blocking on the read lock and f2fs_ckpt
running. AFAICT, what's happening is that rwsem_down_read_slowpath
modifies sem->count to indicate that there's a pending reader while
f2fs_ckpt holds the write lock, and when f2fs_ckpt releases the write
lock, it wakes pending readers and hands the lock over to readers.
This means that any subsequent attempt to grab the write lock from
f2fs_ckpt will stall until the newly-awakened reader releases the read
lock, which depends on the readers' arbitrarily long scheduling
delays.

AFAICT, any solution for this problem that doesn't require an overhaul
of f2fs locking requires that the newly-awakened readers can't block
f2fs_ckpt until they are scheduled and grab the read lock. Either we
could do something trylock-related like this patch implemented, or we
could move f2fs to percpu_rwsem. However, moving to percpu_rwsem means
no optimistic spinning, which also seems bad given how short the
read-locked critical sections are. We have seen similar problems with
contention on the cgroup percpu_rwsem that spin-on-owner could have
alleviated, so I'm hesitant to get rid of optimistic spinning for f2fs
by switching to percpu_rwsem.

The most obvious objection to the patch is that f2fs_ckpt will still
stall if a reader deschedules while holding the lock. That is
absolutely true! However, in practice, we were hitting the worst case
every time because any attempt to grab the read lock while f2fs_ckpt
holds the write lock exposes f2fs_ckpt to arbitrary scheduling delay
from the reader. The reader-unfair rwsem patch prevents this problem.
We have evidence this is true in practice, too. I just ran a quick
comparison on 150 slow app start traces from our internal population
comparing kernels pre- and post-unfair-rwsem patch, and the percentage
of time the thread responsible for UI was stalled on f2fs locks went
from ~50% of all uninterruptible sleep time on that thread to less
than 5%.

I think this is a special case where fairness to readers is explicitly
harmful because any thread at any priority can block on the only
writer (f2fs_ckpt) but any low priority thread could be a reader
blocking the only writer. I don't know if it's worth explicit support
in rw_semaphore for this kind of use case, but I don't think an
existing rwsem patch addresses this issue because it is so unusual.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ