linux-kernel - Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read

Message-ID: <813BAD45-4484-4B1E-BCD0-40C159DA62BA@intel.com>
Date: Tue, 27 Feb 2024 14:26:39 +0000
From: "Luse, Paul E" <paul.e.luse@...el.com>
To: Xiao Ni <xni@...hat.com>
CC: Yu Kuai <yukuai1@...weicloud.com>, Paul E Luse
	<paul.e.luse@...ux.intel.com>, "song@...nel.org" <song@...nel.org>,
	"neilb@...e.com" <neilb@...e.com>, "shli@...com" <shli@...com>,
	"linux-raid@...r.kernel.org" <linux-raid@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"yi.zhang@...wei.com" <yi.zhang@...wei.com>, "yangerkun@...wei.com"
	<yangerkun@...wei.com>, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in
 read_balance()



> On Feb 26, 2024, at 9:49 PM, Xiao Ni <xni@...hat.com> wrote:
> 
> On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>> 
>> Hi,
>> 
>> On 2024/02/27 10:23, Xiao Ni wrote:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>> 
>>>> From: Yu Kuai <yukuai3@...wei.com>
>>>> 
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
>>>> the 'choose next idle' case to read_balance():
>>>> 
>>>> read_balance:
>>>>  for_each_rdev
>>>>   if(next_seq_sect == this_sector || disk == 0)
>>>>   -> sequential reads
>>>>    best_disk = disk;
>>>>    if (...)
>>>>     choose_next_idle = 1
>>>>     continue;
>>>> 
>>>>  for_each_rdev
>>>>  -> iterate next rdev
>>>>   if (pending == 0)
>>>>    best_disk = disk;
>>>>    -> choose the next idle disk
>>>>    break;
>>>> 
>>>>   if (choose_next_idle)
>>>>    -> keep using this rdev if there is no other idle disk
>>>>    continue
>>>> 
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> removed this code:
>>>> 
>>>> -               /* If device is idle, use it */
>>>> -               if (pending == 0) {
>>>> -                       best_disk = disk;
>>>> -                       break;
>>>> -               }
>>>> 
>>>> Hence 'choose next idle' can never take effect now. Fix this problem as
>>>> follows:
>>>> 
>>>> 1) don't set best_disk in this case, read_balance() will choose the best
>>>>    disk after iterating all the disks;
>>>> 2) increment 'pending' so that another idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that if there is no other idle disk, and all disks
>>>>    are rotational, this disk will still be chosen;
>>>> 
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <paul.e.luse@...ux.intel.com>
>>>> Signed-off-by: Paul Luse <paul.e.luse@...ux.intel.com>
>>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>>> ---
>>>>  drivers/md/raid1.c | 21 ++++++++++++---------
>>>>  1 file changed, 12 insertions(+), 9 deletions(-)
>>>> 
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>         unsigned int min_pending;
>>>>         struct md_rdev *rdev;
>>>>         int choose_first;
>>>> -       int choose_next_idle;
>>>> 
>>>>         /*
>>>>          * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>         best_pending_disk = -1;
>>>>         min_pending = UINT_MAX;
>>>>         best_good_sectors = 0;
>>>> -       choose_next_idle = 0;
>>>>         clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>> 
>>>>         if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                         int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>>                         struct raid1_info *mirror = &conf->mirrors[disk];
>>>> 
>>>> -                       best_disk = disk;
>>>>                         /*
>>>>                          * If buffered sequential IO size exceeds optimal
>>>>                          * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                             mirror->next_seq_sect > opt_iosize &&
>>>>                             mirror->next_seq_sect - opt_iosize >=
>>>>                             mirror->seq_start) {
>>>> -                               choose_next_idle = 1;
>>>> -                               continue;
>>>> +                               /*
>>>> +                                * Add 'pending' to avoid choosing this disk if
>>>> +                                * there is other idle disk.
>>>> +                                * Set 'dist' to 0, so that if there is no other
>>>> +                                * idle disk and all disks are rotational, this
>>>> +                                * disk will still be chosen.
>>>> +                                */
>>>> +                               pending++;
>>>> +                               dist = 0;
>>>> +                       } else {
>>>> +                               best_disk = disk;
>>>> +                               break;
>>>>                         }
>>>> -                       break;
>>>>                 }
>>> 
>>> Hi Kuai
>>> 
>>> I noticed something. Patch 12cee5a8a29e sets best_disk if it's
>>> a sequential read. If there are no other idle disks, it will read from
>>> the sequential disk. With this patch, it reads from the
>>> best_pending_disk even if min_pending is not 0. That looks like wrong
>>> behaviour?
>> 
>> Yes, nice catch, I hadn't noticed this yet... So there is some hidden
>> logic: sequential IO has higher priority than the minimal 'pending'
>> selection, and it is only lower than 'choose_next_idle' when an idle
>> disk exists.
> 
> Yes.
> 
> 
>> 
>> Looks like if we want to keep this behaviour, we can add a 'sequential
>> disk':
>> 
>> if (is_sequential())
>>  if (!should_choose_next())
>>   return disk;
>>  ctl.sequential_disk = disk;
>> 
>> ...
>> 
>> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>>  return ctl.sequential_disk;
> 
> Agree with this, thanks :)
> 
> Best Regards
> Xiao

Yup, agree as well.  This will help for sure with the followup to this series for seq read improvements :) 
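
For anyone following along, the 'sequential disk' idea quoted above can be sketched in user space roughly as below. This is only an illustration, not the kernel code: struct disk_state, struct read_balance_ctl and pick_read_disk() are hypothetical names, the two bool flags stand in for the is_sequential() / should_choose_next() checks from the proposal, and bumping 'pending' on the sequential disk simply mirrors the pending++ in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-disk snapshot; not the kernel's struct raid1_info. */
struct disk_state {
	unsigned int pending;    /* in-flight requests on this disk */
	bool sequential;         /* this read continues the disk's last read */
	bool should_choose_next; /* sequential run exceeded the optimal IO size */
};

/* Hypothetical selection context, named after 'ctl' in the proposal. */
struct read_balance_ctl {
	int sequential_disk;
	int min_pending_disk;
	unsigned int min_pending;
};

/* Return the index of the disk to read from. */
static int pick_read_disk(const struct disk_state *disks, int ndisks)
{
	struct read_balance_ctl ctl = { -1, -1, UINT_MAX };

	assert(ndisks > 0);

	for (int disk = 0; disk < ndisks; disk++) {
		unsigned int pending = disks[disk].pending;

		if (disks[disk].sequential) {
			/* Plain sequential read: stay on this disk. */
			if (!disks[disk].should_choose_next)
				return disk;
			/*
			 * Remember the sequential disk and count it as
			 * busy so that another idle disk wins if present.
			 */
			ctl.sequential_disk = disk;
			pending++;
		}

		if (pending < ctl.min_pending) {
			ctl.min_pending = pending;
			ctl.min_pending_disk = disk;
		}
	}

	/* No idle disk found: sequential IO beats minimal 'pending'. */
	if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
		return ctl.sequential_disk;

	return ctl.min_pending_disk;
}
```

With two disks where disk 0 is the long-running sequential disk (pending 3) and disk 1 is busy but lighter (pending 1), this returns disk 0, which is the behaviour Xiao asked to keep. If disk 1 is idle instead, it returns disk 1.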

>> 
>> Thanks,
>> Kuai
>> 
>>> 
>>> Best Regards
>>> Xiao
>>>> 
>>>> -               if (choose_next_idle)
>>>> -                       continue;
>>>> -
>>>>                 if (min_pending > pending) {
>>>>                         min_pending = pending;
>>>>                         best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>> 