linux-kernel - Re: [PATCHSET][RFC] Make background writeback not suck

Message-ID: <56F1C130.8020200@fb.com>
Date:	Tue, 22 Mar 2016 16:03:28 -0600
From:	Jens Axboe <axboe@...com>
To:	Dave Chinner <david@...morbit.com>
CC:	<linux-kernel@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
	<linux-block@...r.kernel.org>
Subject: Re: [PATCHSET][RFC] Make background writeback not suck

On 03/22/2016 03:51 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
>> This patchset isn't so much a final solution as it is a demonstration
>> of what I believe is a huge issue. Since the dawn of time, our
>> background buffered writeback has sucked. When we do background
>> buffered writeback, it should have little impact on foreground
>> activity. That's the definition of background activity... But for as
>> long as I can remember, heavy buffered writers have not behaved like
>> that.
>
> Of course not. The IO scheduler is supposed to determine how we
> meter out bulk vs latency sensitive IO that is queued. That's what
> things like anticipatory scheduling for read requests were
> supposed to address....
>
> I'm guessing you're seeing problems like this because blk-mq has no
> IO scheduler infrastructure and so no way of prioritising,
> scheduling and/or throttling different types of IO? Would that be
> accurate?

It's not just that, but obviously the IO scheduler would be one place to
throttle it. In a sense, this is a way of scheduling the writeback writes
better. But most of the reports I get about writeback sucking aren't from
people using scsi/blk-mq; they end up being "classic" setups on things
like deadline.

>> For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop, and then try and start chrome, it basically won't start
>> before the buffered writeback is done. Or for server-oriented workloads
>> where installation of a big RPM (or similar) adversely impacts
>> database reads. When that happens, I get people yelling at me.
>>
>> A quick demonstration - a fio job that reads a file, while someone
>> else issues the above 'dd'. Run on a flash device, using XFS. The
>> vmstat output looks something like this:
>>
>> ------io------- ---system---- ------cpu------
>>     bi       bo     in     cs us sy  id wa st
>>    156     4648     58    151  0  1  98  1  0
>>      0        0     64     83  0  0 100  0  0
>>      0       32     76    119  0  0 100  0  0
>>  26616        0   7574  13907  7  0  91  2  0
>>  41992        0  10811  21395  0  2  95  3  0
>>  46040        0  11836  23395  0  3  94  3  0
>>  19376  1310736   5894  10080  0  4  93  3  0
>>    116  1974296   1858    455  0  4  93  3  0
>>    124  2020372   1964    545  0  4  92  4  0
>>    112  1678356   1955    620  0  3  93  3  0
>>   8560   405508   3759   4756  0  1  96  3  0
>>  42496        0  10798  21566  0  0  97  3  0
>>  42476        0  10788  21524  0  0  97  3  0
>
> So writeback is running at about 2GB/s, meaning the memory is
> cleaned in about 5s.

Correct, and at the same time destroying anything else that runs on the 
disk. For most use cases, not ideal. If we get in a tighter spot on 
memory or someone waits on it, yes, we should ramp up. But not for 
background cleaning.

>> The read starts out fine, but goes to shit when we start background
>> flushing. The reader experiences latency spikes in the seconds range.
>> On flash.
>>
>> With this set of patches applied, the situation looks like this instead:
>>
>> ------io------- ---system---- ------cpu------
>>     bi       bo     in     cs us sy  id wa st
>>  33544        0   8650  17204  0  1  97  2  0
>>  42488        0  10856  21756  0  0  97  3  0
>>  42032        0  10719  21384  0  0  97  3  0
>>  42544       12  10838  21631  0  0  97  3  0
>>  42620        0  10982  21727  0  3  95  3  0
>>  46392        0  11923  23597  0  3  94  3  0
>>  36268   512000   9907  20044  0  3  91  5  0
>>  31572   696324   8840  18248  0  1  91  7  0
>>  30748   626692   8617  17636  0  2  91  6  0
>>  31016   618504   8679  17736  0  3  91  6  0
>>  30612   648196   8625  17624  0  3  91  6  0
>>  30992   650296   8738  17859  0  3  91  6  0
>>  30680   604075   8614  17605  0  3  92  6  0
>>  30592   595040   8572  17564  0  2  92  6  0
>>  31836   539656   8819  17962  0  2  92  5  0
>
> And now it runs at ~600MB/s, slowing down the rate at which memory
> is cleaned by 60%.

Which is the point, correct... If we're not anywhere near being tight on 
memory AND nobody is waiting for this IO, then by definition, the 
foreground activity is the important one. For the case used here, that's 
the application doing reads.
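
For a rough sense of the trade-off in the two vmstat runs above, the sketch
below averages the bi and bo columns over the seconds where writeback was
active. It is an illustration only and not part of the patchset. It assumes
vmstat is reporting in 1 KiB units (consistent with the ~2GB/s reading
above), and which rows count as "writeback active" is a judgment call: bo
above 1,000,000 for the unpatched run, bo above 100,000 for the patched run.

```python
# Sketch only: compare read/write rates while writeback is active,
# using (bi, bo) pairs copied from the vmstat output quoted above.
# Assumption: bi/bo are in KiB per second.

# Unpatched kernel: rows where bo exceeds 1,000,000
UNPATCHED = [
    (19376, 1310736), (116, 1974296), (124, 2020372), (112, 1678356),
]

# Patched kernel: rows where bo exceeds 100,000
PATCHED = [
    (36268, 512000), (31572, 696324), (30748, 626692),
    (31016, 618504), (30612, 648196), (30992, 650296),
    (30680, 604075), (30592, 595040), (31836, 539656),
]


def summarize(rows):
    """Return (mean bi, mean bo) in KiB/s for a list of (bi, bo) samples."""
    reads = sum(bi for bi, _ in rows) / len(rows)
    writes = sum(bo for _, bo in rows) / len(rows)
    return reads, writes


unpatched_reads, unpatched_writes = summarize(UNPATCHED)
patched_reads, patched_writes = summarize(PATCHED)

# Fraction of writeback bandwidth given up, and how much the reader gains.
write_reduction = 1 - patched_writes / unpatched_writes
read_gain = patched_reads / unpatched_reads

print(f"unpatched: read {unpatched_reads:.0f} KiB/s, write {unpatched_writes:.0f} KiB/s")
print(f"patched:   read {patched_reads:.0f} KiB/s, write {patched_writes:.0f} KiB/s")
print(f"writeback rate reduced by {write_reduction:.0%}, reads up {read_gain:.1f}x")
```

The averages are sensitive to which rows are included (the ramp-up and
tail-off seconds pull the unpatched write rate down), so treat the output as
a ballpark rather than a measurement.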

> Given that background writeback is relied on by memory reclaim to
> clean memory faster than the LRUs are cycled, I suspect this is
> going to have a big impact on low memory behaviour and balance,
> which will then feed into IO breakdown problems caused by writeback
> being driven from the LRUs rather than the flusher threads.....

You're missing the part where the intent is to only throttle it heavily 
when it's pure background writeback. Of course, if we are low on memory 
and doing reclaim, we should get much closer to device bandwidth.

If I run the above dd without the reader running, I'm already at 90% of 
the device bandwidth - not quite all the way there, since I still want 
to quickly be able to inject reads (or other IO) without having to wait 
for the queues to purge thousands of requests.

>> The above was the why. The how is basically throttling background
>> writeback. We still want to issue big writes from the vm side of things,
>> so we get nice and big extents on the file system end. But we don't need
>> to flood the device with THOUSANDS of requests for background writeback.
>> For most devices, we don't need a whole lot to get decent throughput.
>
> Except, when the system is busy (e.g. CPU busy) and the writeback
> threads can be starved of CPU by other operations, the writeback
> queue depth needs to go way up so that we don't end up with idle
> devices because the flusher threads are starved of CPU....

Sure, writeback always needs to make stable progress.

>> This adds some simple blk-wb code that limits how much buffered
>> writeback we keep in flight on the device end. The default is pretty
>> low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
>> dirtying task ends up being throttled in balance_dirty_pages(), we up
>> the limit. Currently there are tunables associated with this, see the
>> last patch for descriptions of those.
>>
>> I welcome testing. The end goal here would be having much of this
>> auto-tuned, so that we don't lose substantial bandwidth for
>> background writes, while still maintaining decent non-wb
>> performance and latencies.
>
> Right, another layer of "writeback tunables" is not really a
> desirable outcome. We spent a lot of time making the dirty page
> cache flushing not need tunables (i.e. via careful design of closed
> loop feedback systems), so I think that if we're going to add a new
> layer of throttling, we need to do the same thing. i.e. it needs to
> adapt automatically and correctly to changing loads and workloads.

Fully agree, and that's what I stated as well. The current patchset is a
way to experiment with improving background writeback; that's stated both
in the very first paragraph of this email and in the blk-wb.c file. I'm
not a huge fan of tunables: nobody touches them, and we need to get it
right out of the box.

I've already removed one set of tunables from this posting compared to
what I had a week ago, so it's moving in that direction.
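
To make the idea discussed in this thread concrete, here is a minimal
user-space sketch of an in-flight limit for buffered writeback. It is not
the blk-wb code from the patchset: the class name, the method names, and
both limit values are hypothetical, chosen only to show the shape of the
policy (a low cap for pure background writeback, a higher cap when the
writeback is WB_SYNC_ALL or the dirtying task is being throttled in
balance_dirty_pages()).

```python
from dataclasses import dataclass


@dataclass
class WritebackLimiter:
    """Caps buffered-writeback requests in flight on one device.

    Hypothetical sketch; the limits below are made-up illustration values,
    not the defaults used by the patchset.
    """
    background_limit: int = 8   # assumed cap for pure background writeback
    raised_limit: int = 64      # assumed cap when someone is waiting on the IO
    inflight: int = 0

    def limit(self, sync_all: bool = False, dirtier_throttled: bool = False) -> int:
        # Raise the cap if the writeback is WB_SYNC_ALL or a dirtying task
        # is being throttled; otherwise keep the queue shallow.
        if sync_all or dirtier_throttled:
            return self.raised_limit
        return self.background_limit

    def try_submit(self, sync_all: bool = False, dirtier_throttled: bool = False) -> bool:
        # Refuse the request if we are already at the applicable cap; a real
        # implementation would sleep until a completion frees a slot.
        if self.inflight >= self.limit(sync_all, dirtier_throttled):
            return False
        self.inflight += 1
        return True

    def complete(self) -> None:
        # Called when a writeback request finishes on the device.
        if self.inflight > 0:
            self.inflight -= 1


# Usage: background writeback stalls at the low cap, leaving room for reads.
limiter = WritebackLimiter()
accepted_background = sum(limiter.try_submit() for _ in range(100))

# Once a dirtier is throttled, the same device accepts a deeper queue.
accepted_raised = sum(limiter.try_submit(dirtier_throttled=True) for _ in range(100))

print(accepted_background, accepted_raised, limiter.inflight)
```

Fixed caps like these are exactly the kind of tunable the discussion above
wants to avoid; an auto-tuned version would adjust the limits from observed
latencies rather than constants.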

-- 
Jens Axboe