Date:   Wed, 22 Aug 2018 15:05:56 -0600
From:   Jens Axboe <axboe@...nel.dk>
To:     Anchal Agarwal <anchalag@...zon.com>
Cc:     fllinden@...zon.com, sblbir@...zon.com,
        holger@...lied-asynchrony.com, msw@...zon.com,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue
 in wbt_wait

On 8/22/18 2:26 PM, Anchal Agarwal wrote:
> On Wed, Aug 22, 2018 at 11:30:40AM -0600, Jens Axboe wrote:
>> On 8/22/18 10:42 AM, van der Linden, Frank wrote:
>>> On 8/22/18 7:27 AM, Jens Axboe wrote:
>>>>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>>>> [...]
>>>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>>>> Those are for mainline, so should be OK :-)
>>>>> I'm just playing along at home but with those 3 I get repeatable
>>>>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>>>>> device; for inexplicable reasons some other devices with ext4/xfs flush
>>>>> properly. Yes, that surprised me too, but it's repeatable.
>>>>> Now this may or may not have something to do with some of my in-testing
>>>>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>>>>> is golden again. Not eager to repeat since it hangs sync & requires a
>>>>> hard reboot.. :(
>>>>> Just thought you'd like to know.
>>>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>>>
>>> I think it might be useful to kind of give a dump of what we discussed
>>> before this patch was sent, there was a little more than was in the
>>> description.
>>>
>>> We saw hangs and heavy lock contention in the wbt code under a specific
>>> workload, on XFS. Crash dump analysis showed the following issues:
>>>
>>> 1) wbt_done uses wake_up_all, which causes a thundering herd
>>> 2) __wbt_wait sets up a wait queue with the auto remove wake function
>>> (via DEFINE_WAIT), which caused two problems:
>>>    * combined with the use of wake_up_all, the wait queue would
>>> essentially be randomly reordered for tasks that did not get to run
>>>    * the waitqueue_active check in may_queue was not valid with the auto
>>> remove function, which could allow incoming tasks with requests to
>>> overtake existing requests
>>>
>>> 1) was fixed by using a plain wake_up
>>> 2) was fixed by keeping tasks on the queue until they could break out of
>>> the wait loop in __wbt_wait
>>>
>>>
>>> The random reordering, causing task starvation in __wbt_wait, was the
>>> main problem. Simply not using the auto remove wait function, e.g.
>>> *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
>>> default_wake_function), fixed the hang / task starvation issue in our
>>> tests. But there was still more lock contention than there should be, so
>>> we also changed wake_up_all to wake_up.
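>>>
>>> For reference, that minimal change is just the one line below (untested
>>> sketch against the old __wbt_wait; with default_wake_function the entry
>>> stays on the wait queue until finish_wait, so FIFO order is preserved):
>>>
>>> -	DEFINE_WAIT(wait);
>>> +	DEFINE_WAIT_FUNC(wait, default_wake_function);
>>>
>>> plus s/wake_up_all/wake_up/ in wbt_done for the lock contention part.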
>>>
>>> It might be useful to run your tests with only the DEFINE_WAIT change I
>>> describe above added to the original code to see if that still has any
>>> problems. That would give a good data point on whether any remaining issues
>>> are due to missed wakeups or not.
>>>
>>> There is the issue of making forward progress, or at least making it
>>> fast enough. With the changes as they stand now, you could come up with
>>> a scenario where the throttling limit is hit, but is then raised. Since
>>> there might still be a wait queue, you could end up putting each
>>> incoming task to sleep, even though it's not needed.
>>>
>>> One way to guarantee that the wait queue clears up as fast as possible,
>>> without resorting to wake_up_all, is to use wake_up_nr, where the number
>>> of tasks to wake up is (limit - inflight).
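>>>
>>> E.g. in wbt_done, something like this (untested sketch, reusing the
>>> limit and inflight values it already computes):
>>>
>>> 	if (waitqueue_active(&rqw->wait))
>>> 		wake_up_nr(&rqw->wait, limit - inflight);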
>>>
>>> Also, to avoid having tasks going back to sleep in the loop, you could
>>> do what you already proposed - always just sleep at most once, and
>>> unconditionally proceed after waking up. To avoid incoming tasks
>>> overtaking the ones that are being woken up, you could have wbt_done
>>> increment inflight, effectively reserving a spot for the tasks that are
>>> about to be woken up.
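>>>
>>> Roughly (sketch only, not tested -- the freed slot is handed straight
>>> to a waiter instead of going back into the pool):
>>>
>>> 	if (waitqueue_active(&rqw->wait)) {
>>> 		atomic_inc(&rqw->inflight);	/* reserve for the woken task */
>>> 		wake_up(&rqw->wait);
>>> 	}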
>>>
>>> Another thing I thought about was recording the number of waiters in the
>>> wait queue, and modify the check from (inflight < limit) to (inflight <
>>> (limit - nwaiters)), and no longer use any waitqueue_active checks.
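>>>
>>> E.g. (sketch; "nwaiters" would be a new atomic counter in struct
>>> rq_wait, incremented/decremented around the wait loop in __wbt_wait):
>>>
>>> 	if (atomic_read(&rqw->inflight) < limit - atomic_read(&rqw->nwaiters))
>>> 		return true;	/* ok to queue */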
>>>
>>> The condition checks are of course complicated by the fact that
>>> condition manipulation is not always done under the same lock (e.g.
>>> wbt_wait can be called with a NULL lock).
>>>
>>>
>>> So, these are just some of the things to consider here - maybe there's
>>> nothing in there that you hadn't already considered, but I thought it'd
>>> be useful to summarize them.
>>>
>>> Thanks for looking in to this!
>>
>> It turned out to be an unrelated problem with rq reordering in blk-mq;
>> mainline doesn't have it.
>>
>> So I think the above change is safe and fine, but we definitely still
>> want the extra change of NOT allowing a queue token for the initial loop
>> inside __wbt_wait() when we have current sleepers on the queue.
>> Without that, the initial check in __wbt_wait() is not useful at all.
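>>
>> I.e. the fast path in __wbt_wait() would only apply when nobody is
>> already waiting, roughly (sketch, not the exact patch):
>>
>> 	if (!wq_has_sleeper(&rqw->wait) && may_queue(rwb, rqw, &wait, rw))
>> 		return;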
>>
>>
>> -- 
>> Jens Axboe
>>
>>
> 
> Hi Jens,
> I tested your patches in my environment and they look good. There is no sudden increase in 
> lock contention either. Thanks for catching the corner case though.

Thanks for testing. Can I add your tested-by to the 3 patches?

-- 
Jens Axboe
