Message-ID: <3fbadd9f-11dd-9043-11cf-f0839dcf30e1@opensource.wdc.com>
Date: Mon, 25 Apr 2022 12:24:59 +0900
From: Damien Le Moal <damien.lemoal@...nsource.wdc.com>
To: "yukuai (C)" <yukuai3@...wei.com>, axboe@...nel.dk,
bvanassche@....org, andriy.shevchenko@...ux.intel.com,
john.garry@...wei.com, ming.lei@...hat.com, qiulaibin@...wei.com
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com
Subject: Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load
On 4/24/22 11:43, yukuai (C) wrote:
> friendly ping ...
>
> 在 2022/04/15 18:10, Yu Kuai 写道:
>> Changes in v3:
>> - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>> in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while all
>> 'waiters_cnt' are 0, which would cause a dead loop.
>> - don't increment 'wait_index' on each loop iteration in patch 2
>> - fix that 'wake_index' might mismatch on the first wake-up in patch 3,
>> and improve the coding of the patch.
>> - add a detection in patch 4 in case io hang is triggered in corner
>> cases.
>> - make the detection of whether free tags are sufficient more flexible.
>> - fix a race in patch 8.
>> - fix some wording and add some comments.
>>
>> Changes in v2:
>> - use a new title
>> - add patches to fix waitqueues' unfairness - patches 1-3
>> - delete patch to add queue flag
>> - delete patch to split big io thoroughly
>>
>> In this patchset:
>> - patch 1-3 fix waitqueues' unfairness.
>> - patch 4,5 disable tag preemption on heavy load.
>> - patch 6 forces tag preemption for split bios.
>> - patch 7,8 improve large random io for HDD. We do hit the problem and
>> I'm trying to fix it at very low cost. However, if anyone still thinks
>> this is not a common case and not worth optimizing, I'll drop them.
>>
>> There is a defect in blk-mq compared to blk-sq: split io ends up
>> discontinuous if the device is under high io pressure, while split io
>> stays continuous in sq. This is because:
>>
>> 1) new io can preempt a tag even if lots of threads are waiting.
>> 2) split bios are issued one by one; if one bio can't get a tag, it
>> goes to wait.
>> 3) each time 8 (or wake_batch) requests complete, 8 waiters are woken
>> up. Thus a woken thread is unlikely to get multiple tags.
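[The one-tag-per-wakeup behaviour described in the three points above can be sketched with a toy model. This is plain Python, not kernel code; the thread count, fragment count, and round-robin wake model are made-up illustrative assumptions, not the actual sbitmap implementation:]

```python
from collections import deque

NUM_THREADS = 8      # hypothetical concurrent writers, each with one big bio
FRAGMENTS = 8        # each bio split into 8 fragments, one tag per fragment

def dispatch_one_tag_per_wakeup():
    # Modelled blk-mq behaviour: each wakeup grants a single tag, so a
    # thread issues one fragment and then goes back to the waitqueue tail.
    waiters = deque(range(NUM_THREADS))
    remaining = [FRAGMENTS] * NUM_THREADS
    order = []
    while waiters:
        t = waiters.popleft()
        order.append(t)          # issue one fragment of thread t
        remaining[t] -= 1
        if remaining[t]:
            waiters.append(t)    # rejoin the waitqueue for the next tag
    return order

def dispatch_batch_per_wakeup():
    # sq-like behaviour: a thread keeps allocating until its whole split
    # bio is issued, so its fragments stay contiguous on the wire.
    order = []
    for t in range(NUM_THREADS):
        order.extend([t] * FRAGMENTS)
    return order

def contiguous_pairs(order):
    # how many adjacent dispatches come from the same thread
    return sum(a == b for a, b in zip(order, order[1:]))

print(contiguous_pairs(dispatch_one_tag_per_wakeup()))  # 0: fully interleaved
print(contiguous_pairs(dispatch_batch_per_wakeup()))    # 56: sequential
```

[In this toy model the one-tag-per-wakeup policy produces a fully interleaved dispatch order (no two adjacent fragments from the same thread), which is the "discontinuous split io" symptom on a rotational disk.]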
>>
>> The problem was first found when upgrading the kernel from v3.10 to
>> v4.18; the test device is an HDD with 'max_sectors_kb' of 256, and the
>> test case issues 1m ios with high concurrency.
>>
>> Note that there is a precondition for this performance problem:
>> there is a certain gap between the bandwidth of a single io stream with
>> bs=max_sectors_kb and the disk's upper limit.
>>
>> During the test, I found that waitqueues can be extremely unbalanced
>> under heavy load. This is because 'wake_index' is not set properly in
>> __sbq_wake_up(); see details in patch 3.
>>
>> Test environment:
>> arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The
>> default 'max_sectors_kb' is 1280 (sorry that I was unable to test on
>> the machine where 'max_sectors_kb' is 256).
>>
>> The single io performance(randwrite):
>>
>> | bs | 128k | 256k | 512k | 1m | 1280k | 2m | 4m |
>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7 | 82.9 | 82.9 |
These results are extremely strange, unless you are running with the
device write cache disabled. If you have the device write cache enabled,
the problem you mention above would most likely be completely invisible,
which I guess is why nobody really noticed any issue until now.
Similarly, with reads, the device-side read-ahead may hide the problem,
albeit that depends on how "intelligent" the drive is at identifying
sequential accesses.
>>
>> It can be seen that 1280k io is already close to the upper limit, and
>> it would be hard to see differences with the default value, so I set
>> 'max_sectors_kb' to 128 in the following tests.
>>
>> Test cmd:
>> fio \
>> -filename=/dev/$dev \
>> -name=test \
>> -ioengine=psync \
>> -allow_mounted_write=0 \
>> -group_reporting \
>> -direct=1 \
>> -offset_increment=1g \
>> -rw=randwrite \
>> -bs=1024k \
>> -numjobs={1,2,4,8,16,32,64,128,256,512} \
>> -runtime=110 \
>> -ramp_time=10
>>
>> Test result: MiB/s
>>
>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>> | ------- | --------- | ----------------- |
>> | 1 | 67.7 | 67.7 |
>> | 2 | 67.7 | 67.7 |
>> | 4 | 67.7 | 67.7 |
>> | 8 | 67.7 | 67.7 |
>> | 16 | 64.8 | 65.6 |
>> | 32 | 59.8 | 63.8 |
>> | 64 | 54.9 | 59.4 |
>> | 128 | 49 | 56.9 |
>> | 256 | 37.7 | 58.3 |
>> | 512 | 31.8 | 57.9 |
Device write cache disabled?
Also, what is the max QD of this disk?
E.g., if it is SATA, the max QD is 32, so you will only get at most 64
scheduler tags. So for any of your tests with more than 64 threads, many
of the threads will be waiting for a scheduler tag for the BIO before
the bio_split problem you explain triggers. Given that the numbers you
show are the same before and after the patch for thread counts <= 64, I
am tempted to think that the problem is not really BIO splitting...
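[The tag arithmetic behind this point can be sketched as follows; the assumption that the scheduler tag depth defaults to twice the hardware queue depth is illustrative of the single-queue case, and the numbers are back-of-the-envelope only:]

```python
# Assumed: SATA NCQ depth of 32, and scheduler tag depth (nr_requests)
# defaulting to twice the hardware queue depth.
SATA_NCQ_DEPTH = 32
sched_tags = 2 * SATA_NCQ_DEPTH          # 64 scheduler tags

for numjobs in (32, 64, 128, 256, 512):
    # threads parked waiting for a scheduler tag before any splitting
    # effect can even come into play
    waiting = max(0, numjobs - sched_tags)
    print(f"numjobs={numjobs:3d} -> threads waiting on sched tags: {waiting}")
```

[Under these assumptions, every test with numjobs > 64 has most threads blocked on scheduler tag allocation, which is why the <= 64 rows look identical before and after the patch.]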
What about random read workloads ? What kind of results do you see ?
>>
>> Yu Kuai (8):
>> sbitmap: record the number of waiters for each waitqueue
>> blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
>> sbitmap: make sure waitqueues are balanced
>> blk-mq: don't preempt tag under heavy load
>> sbitmap: force tag preemption if free tags are sufficient
>> blk-mq: force tag preemption for split bios
>> blk-mq: record how many tags are needed for split bio
>> sbitmap: wake up the number of threads based on required tags
>>
>> block/blk-merge.c | 8 +-
>> block/blk-mq-tag.c | 49 +++++++++----
>> block/blk-mq.c | 54 +++++++++++++-
>> block/blk-mq.h | 4 +
>> include/linux/blk_types.h | 4 +
>> include/linux/sbitmap.h | 9 +++
>> lib/sbitmap.c | 149 +++++++++++++++++++++++++++-----------
>> 7 files changed, 216 insertions(+), 61 deletions(-)
>>
--
Damien Le Moal
Western Digital Research