linux-kernel - Re: [PATCH] bcache: consider the fragmentation when update the writeback rate

Message-ID: <3ca15755-9ad2-1d57-b86a-fb659f701cfb@suse.de>
Date:   Thu, 14 Jan 2021 21:31:25 +0800
From:   Coly Li <colyli@...e.de>
To:     Dongdong Tao <dongdong.tao@...onical.com>
Cc:     Kent Overstreet <kent.overstreet@...il.com>,
        "open list:BCACHE (BLOCK LAYER CACHE)" <linux-bcache@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>,
        Gavin Guo <gavin.guo@...onical.com>,
        Gerald Yang <gerald.yang@...onical.com>,
        Trent Lloyd <trent.lloyd@...onical.com>,
        Dominique Poulain <dominique.poulain@...onical.com>,
        Dongsheng Yang <dongsheng.yang@...ystack.cn>,
        Benjamin Allot <benjamin.allot@...onical.com>
Subject: Re: [PATCH] bcache: consider the fragmentation when update the
 writeback rate

On 1/14/21 8:22 PM, Dongdong Tao wrote:
> Hi Coly,
> 
> Why do you limit the iodepth to 8 and the IOPS to 150 on the cache device?
> For a cache device this limit is small. 150 IOPS with a 4KB block size
> means writing (150*4*60*60=2160000KB=) about 2GB of data every hour.
> For 35 hours it is only about 70GB.
> 
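The estimate quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and the numbers (150 IOPS, 4KB blocks, 35 hours) are the ones from this thread.

```python
def data_written_kb(iops, block_kb, seconds):
    """Data written by a steady stream of fixed-size writes, in KB."""
    return iops * block_kb * seconds


KB_PER_GIB = 1024 ** 2

# Values taken from the thread: 150 IOPS, 4KB writes, one hour.
per_hour_kb = data_written_kb(150, 4, 60 * 60)
per_hour_gib = per_hour_kb / KB_PER_GIB

# Same rate sustained over the 35-hour run.
total_35h_gib = data_written_kb(150, 4, 35 * 60 * 60) / KB_PER_GIB

print(f"per hour: {per_hour_kb} KB (about {per_hour_gib:.2f} GiB)")
print(f"35 hours: about {total_35h_gib:.1f} GiB")
```

So the "2GB per hour" and "70GB in 35 hours" figures are rounded-down versions of roughly 2.06 GiB and 72 GiB.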
> 
> What if the iodepth is 128 or 64, with no IOPS rate limit?
> -> There are two reasons why I limit the iodepth and the IOPS rate.
> 1. If I don't limit them, the dirty cache fills up very quickly,
>    within 20 minutes. It runs at almost NVMe speed before it reaches
>    the 70 cutoff_writeback_sync, and no kind of writeback can stop it
>    from filling up, given the huge throughput gap between NVMe and HDD.
>    I don't think there is anything we can do about that, and it should
>    only happen in a benchmark world, not in production.
>    The improvement I'm trying to make here is for normal production
>    workloads, not for this benchmark scenario.
>    I currently can't see any need to test this scenario; please
>    kindly let me know if I'm wrong.
> 
> 2. I set the iodepth to 8 and the IOPS to 150 based on what I have
>    observed in production envs, mostly ceph. ceph-osd has fewer than
>    10 threads (default setting) that send io to bcache in parallel.
>    But I'm not sure about other applications.
>    I agree that we can increase the iodepth to 64 or 128, and it's
>    doable. But we have to limit the IOPS; 150 IOPS is a reasonable
>    workload.
>    The busiest ceph-osd that I've seen does about 1000 IOPS, but the
>    average is still only about 600.
>    I can set the IOPS to a higher value like 600 and the iodepth to
>    128 for the later test if that makes sense to you?
> 

OK, now I know the reason with the extra information. Since the cache
device is filled up within 20 minutes, it is unnecessary to do the
faster testing on your side. Let me do it later on my hardware.


> Lastly, please allow me to clarify more about the production issue
> that this patch is trying to address:
> 
> In the production env that hit this issue, it usually takes a very
> long time (it may take days) for the cache_available_percent to drop
> to 30, and the dirty data mostly stays at a very low level (around
> 10 percent), which means that the bcache isn't being stressed very
> hard most of the time.
> There is no intention to save the cutoff_writeback_sync when the
> bcache is being stressed without limitation, hope the above makes
> sense :)
> 
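For anyone who wants to watch for the condition described above, a minimal sketch follows. It assumes the cache set exposes cache_available_percent under /sys/fs/bcache/<cset-uuid>/, as the kernel's bcache documentation describes; the function names and the default threshold of 30 (taken from this thread) are illustrative only, and the cache set UUID must be supplied by the caller.

```python
from pathlib import Path


def read_sysfs_int(path):
    """Read a sysfs attribute that holds a single integer."""
    return int(Path(path).read_text().strip())


def cache_available_percent(cset_uuid, sysfs_root="/sys/fs/bcache"):
    """Return cache_available_percent for one cache set.

    Assumption: cset_uuid is the UUID directory name of the cache set
    under sysfs_root; it is not something this thread specifies.
    """
    return read_sysfs_int(Path(sysfs_root) / cset_uuid / "cache_available_percent")


def below_threshold(available_percent, threshold=30):
    """True once the available cache has dropped to the threshold or lower."""
    return available_percent <= threshold
```

Calling below_threshold(cache_available_percent(uuid)) periodically would show when a cache set gets into the state discussed here; how often to poll is left to the reader.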

Yes, you explained it clearly previously. What I worried about was
whether a faster writeback might interfere with the throughput and
latency of regular I/Os.

From your current testing data it looks fine to me.


> By the way, my colleague and I are trying to gather some production
> bcache stats. I hope we can give you the performance numbers before
> and after applying the patch.

Yes that will be great.

And could you please gather all the current data charts into a single
email, and reference it in your patch via lore? Then people who don't
subscribe to the linux-bcache mailing list can find all the posted
performance data from your patch.

In general your testing data is convincing IMHO, and I will add your
updated patch for the 5.12 merge window.


Thanks.

Coly Li


> 
> 
> On Thu, Jan 14, 2021 at 6:05 PM Coly Li <colyli@...e.de> wrote:
>>
>> On 1/14/21 12:45 PM, Dongdong Tao wrote:
>>> Hi Coly,
>>>
>>> I've got the testing data for multiple threads with larger IO depth.
>>>
>>
>> Hi Dongdong,
>>
>> Thanks for the testing numbers.
>>
>>> *Here are the testing steps:*
>>> 1. make-bcache -B <> -C <> --writeback
>>>
>>> 2. Open two tabs and start a different fio task in each at the same time.
>>> Tab1 runs the fio command below:
>>> sudo fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio
>>> --iodepth=32 --rw=randrw --blocksize=64k,8k  --direct=1 --runtime=24000
>>>
>>> Tab2 runs the fio command below:
>>> sudo fio --name=random-writers2 --filename=/dev/bcache0
>>> --ioengine=libaio --iodepth=8 --rw=randwrite --bs=4k --rate_iops=150
>>> --direct=1 --write_lat_log=rw --log_avg_msec=20
>>>
>>
>>
>> Why do you limit the iodepth to 8 and the IOPS to 150 on the cache device?
>> For a cache device this limit is small. 150 IOPS with a 4KB block size
>> means writing (150*4*60*60=2160000KB=) about 2GB of data every hour.
>> For 35 hours it is only about 70GB.
>>
>>
>> What if the iodepth is 128 or 64, with no IOPS rate limit?
>>
>>
>>> Note
>>> - Tab1 fio will run for 24000 seconds; it is the one that causes the
>>>   fragmentation and makes the cache_available_percent drop below 40.
>>> - Tab2 fio is the one whose latency I'm capturing. I let it run for
>>>   about 35 hours, which is long enough for the
>>>   cache_available_percent to drop below 30.
>>> - This testing method uses the fio benchmark with a larger read block
>>>   size and a small write block size to cause high fragmentation.
>>>   In a real production env, there could be various reasons, or a
>>>   combination of reasons, for high fragmentation, but I believe it
>>>   should be ok to use any method to cause the fragmentation to verify
>>>   whether bcache with this patch responds better than the master in
>>>   this situation.
>>>
>>> *Below is the testing result:*
>>>
>>> The total run time is about 35 hours, and the chart for each run
>>> contains about 1.5 million latency points.
>>>
>>> Master:
>>> fio-lat-mater.png
>>>
>>> Master + patch:
>>> fio-lat-patch.png
>>> Combine them together:
>>> fio-lat-mix.png
>>>
>>> Now we can see the master is even worse when we increase the iodepth,
>>> which makes sense since the backing HDD is being stressed harder.
>>>
>>> *Below are the cache stats changing during the run:*
>>> Master:
>>> bcache-stats-master.png
>>>
>>> Master + the patch:
>>> bcache-stats-patch.png
>>>
>>> That's all the testing done with a 400GB NVMe with 512B block size.
>>>
>>> Coly, do you want me to continue the same testing on a 1TB NVMe with
>>> a different block size?
>>> Or is it ok to skip the 1TB testing and continue the test with the
>>> 400GB NVMe but with a different block size?
>>> Feel free to let me know any other test scenarios that we should
>>> cover here.
>>
>> Yes please, more testing is desirable for a performance improvement.
>> So far I have not seen performance numbers for a really high workload.
>>
>> Thanks.
>>
>> Coly Li
>>