Message-ID: <c56e7e9c-90ca-29ca-2003-1a9a88d75fa6@huaweicloud.com>
Date: Mon, 29 May 2023 16:50:32 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Xiao Ni <xni@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: song@...nel.org, akpm@...l.org, neilb@...e.de,
linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next v2 7/7] md/raid1-10: limit the number of plugged bio
Hi,
On 2023/05/29 15:57, Xiao Ni wrote:
> On Mon, May 29, 2023 at 11:18 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>
>> Hi,
>>
>>> On 2023/05/29 11:10, Xiao Ni wrote:
>>> On Mon, May 29, 2023 at 10:20 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>> On 2023/05/29 10:08, Xiao Ni wrote:
>>>>> Hi Kuai
>>>>>
>>>>> Your test imposes a memory limitation, but in most situations
>>>>> customers won't set this. Can this change introduce a
>>>>> performance regression in other situations?
>>>>
>>>> Note that this limitation is just to trigger writeback as soon as
>>>> possible in the test; it's 100% sure that real workloads can trigger
>>>> dirty-page writeback asynchronously while continuing to produce new
>>>> dirty pages.
>>>
>>> Hi
>>>
>>> I'm confused here. If we want to trigger writeback quickly, shouldn't
>>> these two values be set to smaller numbers, rather than 0 and 60?
>>
>> 60 is not required, I'll remove this setting.
>>
>> 0 just means write back if there are any dirty pages.
>
> Hi Kuai
>
> Does 0 mean disabling write back? I tried to find the doc that
> describes the meaning when setting dirty_background_ratio to 0, but I
> didn't find it.
> The page https://www.kernel.org/doc/html/next/admin-guide/sysctl/vm.html
> doesn't describe this, but it does say the following:
>
> Note:
> dirty_background_bytes is the counterpart of dirty_background_ratio. Only
> one of them may be specified at a time. When one sysctl is written it is
> immediately taken into account to evaluate the dirty memory limits and the
> other appears as 0 when read.
>
> Maybe you can specify dirty_background_ratio to 1 if you want to
> trigger write back ASAP.
The purpose here is to trigger writeback ASAP. I'm not an expert here,
but based on the test results, 0 obviously doesn't mean writeback is
disabled. Setting dirty_background_bytes to a value causes
dirty_background_ratio to be set to 0 as well, which means
dirty_background_ratio is disabled. However, changing
dirty_background_ratio from its default value to 0 leaves both
dirty_background_ratio and dirty_background_bytes at 0, and based on
the related code below, I think 0 just means write back whenever there
are any dirty pages.
domain_dirty_limits():
    bg_bytes = dirty_background_bytes                          -> 0
    bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100      -> 0

    if (bg_bytes)
        bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
    else
        bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE; -> 0

    dtc->bg_thresh = bg_thresh;                                -> 0

balance_dirty_pages():
    nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);

    if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh &&
        !writeback_in_progress(wb))
        wb_start_background_writeback(wb);                     -> writeback ASAP
Thanks,
Kuai
>
>>>>
>>>> If not many bios are plugged, then it's the same as before; if a lot
>>>> of bios are plugged, note that before this patchset those bios would
>>>> spend quite a long time in the plug, and hence I think performance
>>>> should be better.
>>>
>>> Hmm, doesn't it depend on whether the I/O is sequential or not? If
>>> it's a big I/O request, could it miss the merge opportunity?
>>
>> The bios will still be merged into the underlying disks' requests (if
>> they are rq-based); an underlying disk won't flush its plug until the
>> number of requests exceeds the threshold.
>
> Thanks for this.
>
> Regards
> Xiao
>>
>> Thanks,
>> Kuai
>>>
>>> Regards
>>> Xiao
>>>
>>>>
>>>> Thanks,
>>>> Kuai
>>>>>
>>>>> Best Regards
>>>>> Xiao
>>>>>
>>>>> On Wed, Apr 26, 2023 at 4:24 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>>>
>>>>>> From: Yu Kuai <yukuai3@...wei.com>
>>>>>>
>>>>>> bios can be added to the plug without limit, and the following
>>>>>> writeback test can trigger a huge number of plugged bios:
>>>>>>
>>>>>> Test script:
>>>>>> modprobe brd rd_nr=4 rd_size=10485760
>>>>>> mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean
>>>>>> echo 0 > /proc/sys/vm/dirty_background_ratio
>>>>>> echo 60 > /proc/sys/vm/dirty_ratio
>>>>>> fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test
>>>>>>
>>>>>> Test result:
>>>>>> Monitoring /sys/block/md0/inflight shows that the inflight count
>>>>>> keeps increasing until fio finishes writing; after running for
>>>>>> about 2 minutes:
>>>>>>
>>>>>> [root@...ora ~]# cat /sys/block/md0/inflight
>>>>>> 0 4474191
>>>>>>
>>>>>> Fix the problem by limiting the number of plugged bio based on the number
>>>>>> of copies for original bio.
>>>>>>
>>>>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>>>>> ---
>>>>>> drivers/md/raid1-10.c | 9 ++++++++-
>>>>>> drivers/md/raid1.c | 2 +-
>>>>>> drivers/md/raid10.c | 2 +-
>>>>>> 3 files changed, 10 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
>>>>>> index 98d678b7df3f..35fb80aa37aa 100644
>>>>>> --- a/drivers/md/raid1-10.c
>>>>>> +++ b/drivers/md/raid1-10.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>> #define IO_MADE_GOOD ((struct bio *)2)
>>>>>>
>>>>>> #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)
>>>>>> +#define MAX_PLUG_BIO 32
>>>>>>
>>>>>> /* for managing resync I/O pages */
>>>>>> struct resync_pages {
>>>>>> @@ -31,6 +32,7 @@ struct resync_pages {
>>>>>> struct raid1_plug_cb {
>>>>>> struct blk_plug_cb cb;
>>>>>> struct bio_list pending;
>>>>>> + unsigned int count;
>>>>>> };
>>>>>>
>>>>>> static void rbio_pool_free(void *rbio, void *data)
>>>>>> @@ -127,7 +129,7 @@ static inline void md_submit_write(struct bio *bio)
>>>>>> }
>>>>>>
>>>>>> static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>>>>> - blk_plug_cb_fn unplug)
>>>>>> + blk_plug_cb_fn unplug, int copies)
>>>>>> {
>>>>>> struct raid1_plug_cb *plug = NULL;
>>>>>> struct blk_plug_cb *cb;
>>>>>> @@ -147,6 +149,11 @@ static inline bool md_add_bio_to_plug(struct mddev *mddev, struct bio *bio,
>>>>>>
>>>>>> plug = container_of(cb, struct raid1_plug_cb, cb);
>>>>>> bio_list_add(&plug->pending, bio);
>>>>>> + if (++plug->count / MAX_PLUG_BIO >= copies) {
>>>>>> + list_del(&cb->list);
>>>>>> + cb->callback(cb, false);
>>>>>> + }
>>>>>> +
>>>>>>
>>>>>> return true;
>>>>>> }
>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>> index 639e09cecf01..c6066408a913 100644
>>>>>> --- a/drivers/md/raid1.c
>>>>>> +++ b/drivers/md/raid1.c
>>>>>> @@ -1562,7 +1562,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>>>>> r1_bio->sector);
>>>>>> /* flush_pending_writes() needs access to the rdev so...*/
>>>>>> mbio->bi_bdev = (void *)rdev;
>>>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug)) {
>>>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid1_unplug, disks)) {
>>>>>> spin_lock_irqsave(&conf->device_lock, flags);
>>>>>> bio_list_add(&conf->pending_bio_list, mbio);
>>>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>> index bd9e655ca408..7135cfaf75db 100644
>>>>>> --- a/drivers/md/raid10.c
>>>>>> +++ b/drivers/md/raid10.c
>>>>>> @@ -1306,7 +1306,7 @@ static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
>>>>>>
>>>>>> atomic_inc(&r10_bio->remaining);
>>>>>>
>>>>>> - if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug)) {
>>>>>> + if (!md_add_bio_to_plug(mddev, mbio, raid10_unplug, conf->copies)) {
>>>>>> spin_lock_irqsave(&conf->device_lock, flags);
>>>>>> bio_list_add(&conf->pending_bio_list, mbio);
>>>>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>> --
>>>>>> 2.39.2
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>