linux-kernel - Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <4a2352a9-42b4-56cd-423a-825faffcd801@redhat.com>
Date:   Tue, 11 Jan 2022 17:01:06 -0500
From:   Waiman Long <longman@...hat.com>
To:     Jaegeuk Kim <jaegeuk@...nel.org>
Cc:     Tim Murray <timmurray@...gle.com>,
        Christoph Hellwig <hch@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-f2fs-devel@...ts.sourceforge.net,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        Boqun Feng <boqun.feng@...il.com>
Subject: Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

On 1/11/22 12:07, Jaegeuk Kim wrote:
> On 01/11, Waiman Long wrote:
>> On 1/11/22 01:53, Tim Murray wrote:
>>> On Mon, Jan 10, 2022 at 8:15 PM Waiman Long <longman@...hat.com> wrote:
>>>> That is not how rwsem works. A reader which fails to get the lock
>>>> because it is write-locked will remove its reader count before going to
>>>> sleep. So the reader count will be zero eventually. Of course, there is
>>>> a short period of time where the reader count will be non-zero until the
>>>> reader removes its own reader count. So if a new writer comes in at that
>>>> time, it will fail its initial trylock and probably go to optimistic
>>>> spinning mode. If the writer that owns the lock release it at the right
>>>> moment, the reader may acquire the read lock.
>>> Thanks for the correction, that makes sense. I haven't spent too much
>>> time on rwsem internals and I'm not confident about when flags are set
>>> and cleared in sem->count; is there a case where sem->count after
>>> up_write() could be nonzero?
>>>
>>> An example from one trace:
>>>
>>> 1. Low-priority userspace thread 4764 is blocked in f2fs_unlink,
>>> probably at f2fs_lock_op, which is a wrapper around
>>> down_read(cp_rwsem).
>>> 2. f2fs-ckpt runs at t=0ms and wakes thread 4764, making it runnable.
>>> 3. At t=1ms, f2fs-ckpt enters uninterruptible sleep and blocks at
>>> rwsem_down_write_slowpath per sched_blocked_reason.
>>> 4. At t=26ms, thread 4764 runs for the first time since being made
>>> runnable. Within 40us, thread 4764 unblocks f2fs-ckpt and makes it
>>> runnable.
>>>
>>> Since thread 4764 is awakened by f2fs-ckpt but never runs before it
>>> unblocks f2fs-ckpt in down_write_slowpath(), the only idea I had is
>>> that cp_rwsem->count is nonzero after f2fs-ckpt's up_write() in step 2
>>> (maybe because of rwsem_mark_wake()?).
>>>
>>>> I do have a question about the number of readers in such a case compared
>>>> with the number of writers. Are there a large number of low priority
>>>> hanging around? What is an average read lock hold time?
>>>>
>>>> Blocking for 9.7s for a write lock is quite excessive and we need to
>>>> figure out how this happen.,
>>> Just to be 100% clear, it's not a single 9.7s stall, it's many smaller
>>> stalls of 10-500+ms in f2fs-ckpt that add up to 9.7s over that range.
>>>
>>> f2fs is not my area of expertise, but my understanding is that
>>> cp_rwsem in f2fs has many (potentially unbounded) readers and a single
>>> writer. Arbitrary userspace work (fsync, creating/deleting/truncating
>>> files, atomic writes) may grab the read lock, but assuming the
>>> merge_checkpoint option is enabled, only f2fs-ckpt will ever grab the
>>> write lock during normal operation. However, in this particular
>>> example, it looks like there may have been 5-10 threads blocked on
>>> f2fs-ckpt that were awakened alongside thread 4764 in step 2.
>>>
>>> I'll defer to the f2fs experts on the average duration that the read
>>> lock is held.
>> Thanks for the explanation.
>>
>> Another question that I have is whether the test result is based on the
>> latest upstream kernel or earlier kernel version. We used to allow reader
>> optimistic spinning which was then removed in later kernel. Reader
>> optimistic spinning may further increase writer wait time.
> It's on 5.10 kernel having all the upstream f2fs patches, and yes, we wanted
> to get higher priority on writer over many readers since the writer, checkpoint,
> is the  most latency-critical operation that can block all the other filesystem
> operations.

v5.10 kernel still have reader optimistic spinning enabled in rwsem 
which may have worsen the writer wait time. Could you try with a more 
up-to-date kernel or backport the relevant rwsem patches into your test 
kernel to see how much it can help?


>> Anyway, AFAICS, this patch keeps readers out of the rwsem wait queue and so
>> only writers can go into it. We can make an unfair rwsem to give preference
>> to writers in the wait queue and wake up readers only if there is no more
>> writers in the wait queue or even in the optimistic spinning queue. That
>> should achieve the same effect as this patch.
> Can we get a patch for the variant to test a bit? Meanwhile, I think we can
> merge this patch to add a wraper first and switches to it later?

Give me a week or so and I can make a RFC patch to support unfair rwsem 
for you to try out. You do need to try it on the latest kernel, though.

Cheers,
longman