Date:   Sun, 24 Apr 2022 10:43:22 +0800
From:   "yukuai (C)" <yukuai3@...wei.com>
To:     <axboe@...nel.dk>, <bvanassche@....org>,
        <andriy.shevchenko@...ux.intel.com>, <john.garry@...wei.com>,
        <ming.lei@...hat.com>, <qiulaibin@...wei.com>
CC:     <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <yi.zhang@...wei.com>
Subject: Re: [PATCH -next RFC v3 0/8] improve tag allocation under heavy load

friendly ping ...

On 2022/04/15 18:10, Yu Kuai wrote:
> Changes in v3:
>   - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>   in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while all
>   'waiters_cnt' are 0, which would cause a dead loop.
>   - don't add 'wait_index' during each loop in patch 2
>   - fix a possible 'wake_index' mismatch on the first wake-up in patch 3,
>   and clean up the code of that patch.
>   - add a check in patch 4 in case an io hang is triggered in corner
>   cases.
>   - make the check for whether free tags are sufficient more flexible.
>   - fix a race in patch 8.
>   - fix some words and add some comments.
> 
> Changes in v2:
>   - use a new title
>   - add patches to fix waitqueues' unfairness - patch 1-3
>   - delete patch to add queue flag
>   - delete patch to split big io thoroughly
> 
> In this patchset:
>   - patch 1-3 fix waitqueues' unfairness.
>   - patch 4,5 disable tag preemption on heavy load.
>   - patch 6 forces tag preemption for split bios.
>   - patch 7,8 improve large random io for HDD. We did hit this problem and
>   I'm trying to fix it at very low cost. However, if anyone still thinks
>   this is not a common case and not worth optimizing, I'll drop them.
> 
> There is a defect in blk-mq compared to blk-sq: split io will end up
> discontinuous if the device is under high io pressure, while split io
> will still be continuous in sq. This is because:
> 
> 1) new io can preempt tags even if there are lots of threads waiting.
> 2) split bios are issued one by one; if one bio can't get a tag, it will
> go to wait.
> 3) each time 8 (or wake batch) requests are done, 8 waiters will be woken
> up. Thus if a thread is woken up, it is unlikely to get multiple tags.
> 
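To make point 3 concrete, here is a toy model (not kernel code) of how a batch of freed tags spreads across the waiters that are woken. The helper `tags_per_waiter`, the even-split assumption, and the example of a thread needing 4 tags are illustrative assumptions; only the figure of 8 waiters woken per 8 completed requests comes from the text above.

```python
def tags_per_waiter(freed_tags, woken_waiters):
    """Spread freed tags over woken waiters as evenly as possible (toy model)."""
    if woken_waiters <= 0:
        raise ValueError("need at least one woken waiter")
    base, extra = divmod(freed_tags, woken_waiters)
    # The first `extra` waiters get one tag more than the rest.
    return [base + (1 if i < extra else 0) for i in range(woken_waiters)]


WAKE_BATCH = 8   # from the text: 8 requests done -> 8 waiters woken
NEEDED_TAGS = 4  # hypothetical: a thread whose io was split into 4 bios

# One waiter woken per freed tag: each thread can expect a single tag,
# so a thread needing several tags has to go back to waiting.
one_per_tag = tags_per_waiter(WAKE_BATCH, WAKE_BATCH)

# Hypothetical alternative: wake fewer waiters so each can take what it needs.
by_need = tags_per_waiter(WAKE_BATCH, max(1, WAKE_BATCH // NEEDED_TAGS))

print(one_per_tag)
print(by_need)
```

In this model the first policy leaves every woken thread with one tag, while the second lets two threads each take four. Real tag allocation is racy and also competes with new io (point 1), so this is only a picture of the imbalance, not a prediction.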
> The problem was first found when upgrading the kernel from v3.10 to
> v4.18; the test device is an HDD with 'max_sectors_kb' set to 256, and
> the test case issues 1m ios with high concurrency.
> 
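As a rough illustration of why a 1m io needs several tags in this setup: assuming each split bio is at most 'max_sectors_kb' in size and each bio needs its own tag (point 2 above), the count is a ceiling division. `bios_per_io` is a hypothetical helper, not a kernel function, and it ignores other split limits such as segment counts or alignment.

```python
def bios_per_io(io_kb, max_sectors_kb):
    """Number of bios (and thus tags) one io needs, by size limit alone."""
    if io_kb <= 0 or max_sectors_kb <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division without floating point.
    return -(-io_kb // max_sectors_kb)


print(bios_per_io(1024, 256))   # 1m io with max_sectors_kb=256 -> 4
print(bios_per_io(1024, 1280))  # io smaller than the limit -> 1
```

Under these assumptions a 1m io on the original machine needs 4 tags, which is why getting only one tag per wakeup hurts.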
> Note that there is a precondition for this performance problem:
> there is a certain gap between the bandwidth of single io with
> bs=max_sectors_kb and the disk's upper limit.
> 
> During the test, I found that waitqueues can be extremely unbalanced
> under heavy load. This is because 'wake_index' is not set properly in
> __sbq_wake_up(), see details in patch 3.
> 
> Test environment:
> arm64, 96 cores with 200 BogoMIPS; the test device is an HDD. The default
> 'max_sectors_kb' is 1280 (sorry, I was unable to test on the machine
> where 'max_sectors_kb' is 256).
> 
> The single io performance(randwrite):
> 
> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |
> 
> It can be seen that 1280k io is already close to the upper limit, so it
> would be hard to see differences with the default value; thus I set
> 'max_sectors_kb' to 128 in the following test.
> 
> Test cmd:
>          fio \
>          -filename=/dev/$dev \
>          -name=test \
>          -ioengine=psync \
>          -allow_mounted_write=0 \
>          -group_reporting \
>          -direct=1 \
>          -offset_increment=1g \
>          -rw=randwrite \
>          -bs=1024k \
>          -numjobs={1,2,4,8,16,32,64,128,256,512} \
>          -runtime=110 \
>          -ramp_time=10
> 
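The '-numjobs={...}' line reads as shorthand for one fio run per value. Below is a minimal sketch of how such a sweep could be generated, assuming that reading is right. `fio_argv` and the placeholder device name "sdX" are hypothetical; the options are copied from the command above. The sketch only builds argument lists and does not run fio.

```python
NUMJOBS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]


def fio_argv(dev, numjobs):
    """Build the fio argument list for one run (options from the quoted command)."""
    return [
        "fio",
        f"-filename=/dev/{dev}",
        "-name=test",
        "-ioengine=psync",
        "-allow_mounted_write=0",
        "-group_reporting",
        "-direct=1",
        "-offset_increment=1g",
        "-rw=randwrite",
        "-bs=1024k",
        f"-numjobs={numjobs}",
        "-runtime=110",
        "-ramp_time=10",
    ]


# "sdX" is a placeholder; substitute the device under test.
commands = [fio_argv("sdX", n) for n in NUMJOBS]
for argv in commands:
    print(" ".join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)`. Note that this workload writes directly to the block device, so it destroys any data on it.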
> Test result: MiB/s
> 
> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
> | ------- | --------- | ----------------- |
> | 1       | 67.7      | 67.7              |
> | 2       | 67.7      | 67.7              |
> | 4       | 67.7      | 67.7              |
> | 8       | 67.7      | 67.7              |
> | 16      | 64.8      | 65.6              |
> | 32      | 59.8      | 63.8              |
> | 64      | 54.9      | 59.4              |
> | 128     | 49        | 56.9              |
> | 256     | 37.7      | 58.3              |
> | 512     | 31.8      | 57.9              |
> 
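For reading the table, the relative gain per row can be computed directly from the quoted numbers. This is only arithmetic on the figures above; `gain_percent` is a hypothetical helper name.

```python
# numjobs -> (v5.18-rc1, v5.18-rc1-patched), MiB/s, copied from the table above
RESULTS = {
    1: (67.7, 67.7),
    2: (67.7, 67.7),
    4: (67.7, 67.7),
    8: (67.7, 67.7),
    16: (64.8, 65.6),
    32: (59.8, 63.8),
    64: (54.9, 59.4),
    128: (49.0, 56.9),
    256: (37.7, 58.3),
    512: (31.8, 57.9),
}


def gain_percent(base, patched):
    """Relative bandwidth change of patched over base, in percent."""
    return (patched - base) / base * 100.0


for numjobs, (base, patched) in RESULTS.items():
    gain = gain_percent(base, patched)
    print(f"{numjobs:>4}  {base:5.1f} -> {patched:5.1f}  {gain:+6.1f}%")
```

With these numbers the gain is zero up to 8 jobs, about 55% at 256 jobs, and about 82% at 512 jobs.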
> Yu Kuai (8):
>    sbitmap: record the number of waiters for each waitqueue
>    blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
>    sbitmap: make sure waitqueues are balanced
>    blk-mq: don't preempt tag under heavy load
>    sbitmap: force tag preemption if free tags are sufficient
>    blk-mq: force tag preemption for split bios
>    blk-mq: record how many tags are needed for splited bio
>    sbitmap: wake up the number of threads based on required tags
> 
>   block/blk-merge.c         |   8 +-
>   block/blk-mq-tag.c        |  49 +++++++++----
>   block/blk-mq.c            |  54 +++++++++++++-
>   block/blk-mq.h            |   4 +
>   include/linux/blk_types.h |   4 +
>   include/linux/sbitmap.h   |   9 +++
>   lib/sbitmap.c             | 149 +++++++++++++++++++++++++++-----------
>   7 files changed, 216 insertions(+), 61 deletions(-)
> 
