linux-kernel - Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6957af40-8720-d74b-5be7-6bcdd9aa1089@huawei.com>
Date:   Mon, 25 Apr 2022 14:47:24 +0800
From:   "yukuai (C)" <yukuai3@...wei.com>
To:     Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
        <axboe@...nel.dk>, <bvanassche@....org>,
        <andriy.shevchenko@...ux.intel.com>, <john.garry@...wei.com>,
        <ming.lei@...hat.com>, <qiulaibin@...wei.com>
CC:     <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <yi.zhang@...wei.com>
Subject: Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load

在 2022/04/25 14:23, Damien Le Moal 写道:
> On 4/25/22 15:14, yukuai (C) wrote:
>> 在 2022/04/25 11:24, Damien Le Moal 写道:
>>> On 4/24/22 11:43, yukuai (C) wrote:
>>>> friendly ping ...
>>>>
>>>> 在 2022/04/15 18:10, Yu Kuai 写道:
>>>>> Changes in v3:
>>>>>     - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>>>>     in patch 1, in case __sbq_wake_up() see 'ws_active > 0' while
>>>>>     'waiters_cnt' are all 0, which will cause deap loop.
>>>>>     - don't add 'wait_index' during each loop in patch 2
>>>>>     - fix that 'wake_index' might mismatch in the first wake up in patch 3,
>>>>>     also improving coding for the patch.
>>>>>     - add a detection in patch 4 in case io hung is triggered in corner
>>>>>     cases.
>>>>>     - make the detection, free tags are sufficient, more flexible.
>>>>>     - fix a race in patch 8.
>>>>>     - fix some words and add some comments.
>>>>>
>>>>> Changes in v2:
>>>>>     - use a new title
>>>>>     - add patches to fix waitqueues' unfairness - path 1-3
>>>>>     - delete patch to add queue flag
>>>>>     - delete patch to split big io thoroughly
>>>>>
>>>>> In this patchset:
>>>>>     - patch 1-3 fix waitqueues' unfairness.
>>>>>     - patch 4,5 disable tag preemption on heavy load.
>>>>>     - patch 6 forces tag preemption for split bios.
>>>>>     - patch 7,8 improve large random io for HDD. We do meet the problem and
>>>>>     I'm trying to fix it at very low cost. However, if anyone still thinks
>>>>>     this is not a common case and not worth to optimize, I'll drop them.
>>>>>
>>>>> There is a defect for blk-mq compare to blk-sq, specifically split io
>>>>> will end up discontinuous if the device is under high io pressure, while
>>>>> split io will still be continuous in sq, this is because:
>>>>>
>>>>> 1) new io can preempt tag even if there are lots of threads waiting.
>>>>> 2) split bio is issued one by one, if one bio can't get tag, it will go
>>>>> to wail.
>>>>> 3) each time 8(or wake batch) requests is done, 8 waiters will be woken up.
>>>>> Thus if a thread is woken up, it will unlikey to get multiple tags.
>>>>>
>>>>> The problem was first found by upgrading kernel from v3.10 to v4.18,
>>>>> test device is HDD with 256 'max_sectors_kb', and test case is issuing 1m
>>>>> ios with high concurrency.
>>>>>
>>>>> Noted that there is a precondition for such performance problem:
>>>>> There is a certain gap between bandwidth for single io with
>>>>> bs=max_sectors_kb and disk upper limit.
>>>>>
>>>>> During the test, I found that waitqueues can be extremly unbalanced on
>>>>> heavy load. This is because 'wake_index' is not set properly in
>>>>> __sbq_wake_up(), see details in patch 3.
>>>>>
>>>>> Test environment:
>>>>> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default
>>>>> 'max_sectors_kb' is 1280(Sorry that I was unable to test on the machine
>>>>> where 'max_sectors_kb' is 256).>>
>>>>> The single io performance(randwrite):
>>>>>
>>>>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
>>>
>>> These results are extremely strange, unless you are running with the
>>> device write cache disabled ? If you have the device write cache enabled,
>>> the problem you mention above would be most likely completely invisible,
>>> which I guess is why nobody really noticed any issue until now.
>>>
>>> Similarly, with reads, the device side read-ahead may hide the problem,
>>> albeit that depends on how "intelligent" the drive is at identifying
>>> sequential accesses.
>>>
>>>>>
>>>>> It can be seen that 1280k io is already close to upper limit, and it'll
>>>>> be hard to see differences with the default value, thus I set
>>>>> 'max_sectors_kb' to 128 in the following test.
>>>>>
>>>>> Test cmd:
>>>>>            fio \
>>>>>            -filename=/dev/$dev \
>>>>>            -name=test \
>>>>>            -ioengine=psync \
>>>>>            -allow_mounted_write=0 \
>>>>>            -group_reporting \
>>>>>            -direct=1 \
>>>>>            -offset_increment=1g \
>>>>>            -rw=randwrite \
>>>>>            -bs=1024k \
>>>>>            -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>>>>            -runtime=110 \
>>>>>            -ramp_time=10
>>>>>
>>>>> Test result: MiB/s
>>>>>
>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>>>>> | ------- | --------- | ----------------- |
>>>>> | 1       | 67.7      | 67.7              |
>>>>> | 2       | 67.7      | 67.7              |
>>>>> | 4       | 67.7      | 67.7              |
>>>>> | 8       | 67.7      | 67.7              |
>>>>> | 16      | 64.8      | 65.6              |
>>>>> | 32      | 59.8      | 63.8              |
>>>>> | 64      | 54.9      | 59.4              |
>>>>> | 128     | 49        | 56.9              |
>>>>> | 256     | 37.7      | 58.3              |
>>>>> | 512     | 31.8      | 57.9              |
>>>
>>> Device write cache disabled ?
>>>
>>> Also, what is the max QD of this disk ?
>>>
>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler
>>> tags. So for any of your tests with more than 64 threads, many of the
>>> threads will be waiting for a scheduler tag for the BIO before the
>>> bio_split problem you explain triggers. Given that the numbers you show
>>> are the same for before-after patch with a number of threads <= 64, I am
>>> tempted to think that the problem is not really BIO splitting...
>>>
>>> What about random read workloads ? What kind of results do you see ?
>>
>> Hi,
>>
>> Sorry about the misleading of this test case.
>>
>> This testcase is high concurrency huge randwrite, it's just for the
>> problem that split bios won't be issued continuously, which is the
>> root cause of the performance degradation as the numjobs increases.
>>
>> queue_depth is 32, and numjobs is 64, thus when numjobs is not greater
>> than 8, performance is fine, because the ratio of sequential io should
>> be 7/8. However, as numjobs increases, performance is worse because
>> the ratio is lower. For example, when numjobs is 512, the ratio of
>> sequential io is about 20%.
> 
> But with 512 jobs, you will get only 64 jobs only with IOs in the queue.
> All other jobs will be waiting for a scheduler tag before being able to
> issue their large BIO. No ?

Hi,

It's right.

In fact, after this patchset, since each large io will need total 8
tags, only 8 jobs can be in the queue while others are waiting for
scheduler tag.

> 
> It sounds like the set of scheduler tags should be a bit more elastic:
> always allow BIOs from a split of a large BIO to be submitted (that is to
> get a scheduler tag) even if that causes a temporary excess of the number
> of requests beyond the default number of scheduler tags. Doing so, all
> fragments of a large BIOs can be queued immediately. From there, if the
> scheduler operates correctly, all the requests from the large BIOs split
> would be issued in sequence to the device.

This solution sounds feasible in theory, however, I'm not sure yet how
to implement that 'temporary excess'.

Thanks,
Kuai
> 
> 
>>
>> patch 6-8 will let split bios still be issued continuously under high
>> pressure.
>>
>> Thanks,
>> Kuai
>>
> 
>