Message-Id: <EFB9C449-8E8B-4BBE-AC1E-11F82703E5F3@linaro.org>
Date: Tue, 4 Oct 2022 10:46:27 +0200
From: Paolo Valente <paolo.valente@...aro.org>
To: Damien Le Moal <Damien.LeMoal@....com>
Cc: Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
Arie van der Hoeven <arie.vanderhoeven@...gate.com>,
Tyler Erickson <tyler.erickson@...gate.com>,
Muhammad Ahmad <muhammad.ahmad@...gate.com>,
"linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Jan Kara <jack@...e.cz>,
"andrea.righi@...onical.com" <andrea.righi@...onical.com>,
"glen.valante@...aro.org" <glen.valante@...aro.org>,
"axboe@...nel.dk" <axboe@...nel.dk>,
Michael English <michael.english@...gate.com>,
Andrew Ring <andrew.ring@...gate.com>,
Varun Boddu <varunreddy.boddu@...gate.com>
Subject: Re: block: wrong return value by bio_end_sector?
> Il giorno 3 ott 2022, alle ore 00:33, Damien Le Moal <Damien.LeMoal@....com> ha scritto:
>
> On 2022/10/03 1:20, Paolo Valente wrote:
>>
>>
>>> Il giorno 1 ott 2022, alle ore 02:50, Damien Le Moal <damien.lemoal@...nsource.wdc.com> ha scritto:
>>>
>>> On 10/1/22 00:59, Paolo Valente wrote:
>>>> Hi Jens, Damien, all other possibly interested people, this is to raise
>>>> attention on a mistake that has emerged in a thread on a bfq extension
>>>> for multi-actuator drives [1].
>>>>
>>>> The mistake is apparently in the macro bio_end_sector (defined in
>>>> include/linux/bio.h), which seems to expand (incorrectly) to
>>>> sector+size, and not to sector+size-1.
>>>
>>> This has been like this for a long time, I think.
>>>
>>>>
>>>> For your convenience, I'm pasting a detailed description of the
>>>> problem, by Tyler (description taken from the above thread [1]).
>>>>
>>>> The drive reports the actuator ranges as a starting LBA and a count of
>>>> LBAs for the range. If the code reading the reported values simply does
>>>> startingLBA + range, this is an incorrect ending LBA for that actuator.
>>>
>>> Well, yes. LBA 0 + drive capacity is also an incorrect LBA. If the code
>>> assumes that it is, you have a classic off-by-one bug.
>>>
>>>> This is because LBAs are zero indexed and this simple addition is not
>>>> taking that into account. The proper way to get the endingLBA is
>>>> startingLBA + range - 1 to get the last LBA value for where to issue a
>>>> final IO read/write to account for LBA values starting at zero rather
>>>> than one.
>>>
>>> Yes. And ? Where is the issue ?
>>>
>>>>
>>>> Here is an example from the output in SeaChest/openSeaChest:
>>>>
>>>> ====Concurrent Positioning Ranges====
>>>>
>>>> Range#  #Elements   Lowest LBA    # of LBAs
>>>>   0         1                0  17578328064
>>>>   1         1      17578328064  17578328064
>>>>
>>>> If using the incorrect formula to get the final LBA for actuator 0, you
>>>> would get 17578328064, but this is the starting LBA reported by the
>>>> drive for actuator 1. So to be consistent for all ranges, the final LBA
>>>> for a given actuator should be calculated as starting LBA + range - 1.
>>>>
>>>> I had reached out to Seagate's T10 and T13 representatives for
>>>> clarification and verification. What is most likely causing the
>>>> error is a missing - 1 somewhere after getting the information
>>>> reported by the device. They agreed that the reporting from the
>>>> drive and the SCSI to ATA translation is correct.
>>>>
>>>> I'm not sure where this is being read and calculated, but it is not an
>>>> error in the low-level libata or sd level of the kernel. It may be in
>>>> bfq, or it may be in some other place after the sd layer. I know there
>>>> were some additions to read this and report it up the stack, but I did
>>>> not think those were wrong as they seemed to pass the drive reported
>>>> information up the stack.
>>>>
>>>> Jens, Damien, can you shed a light on this?
>>>
>>> I am not clear on what the problem is exactly. This all sounds like a
>>> simple off-by-one issue in bfq support code. No ?
>>
>> Yes, it's (also) in bfq code now; no, it's not only in bfq code.
>>
>> The involved bfq code is a replica of the following original function
>> (living in block/blk-ia-ranges.c):
>>
>> static struct blk_independent_access_range *
>> disk_find_ia_range(struct blk_independent_access_ranges *iars,
>>                    sector_t sector)
>> {
>> 	struct blk_independent_access_range *iar;
>> 	int i;
>>
>> 	for (i = 0; i < iars->nr_ia_ranges; i++) {
>> 		iar = &iars->ia_range[i];
>> 		if (sector >= iar->sector &&
>> 		    sector < iar->sector + iar->nr_sectors)
>> 			return iar;
>> 	}
>>
>> 	return NULL;
>> }
>>
>> bfq's replica also contains this warning, right before the return NULL:
>>
>> 	WARN_ONCE(true,
>> 		  "bfq_actuator_index: bio sector out of ranges: end=%llu\n",
>> 		  end);
>>
>> So, if this warning is triggered, then the sector does not belong to
>> any range.
>>
>> That warning does actually get triggered [1], for a sector number
>> that falls at or beyond the upper boundary (iar->sector +
>> iar->nr_sectors) of the last range in the iar data structure. The
>> offending value turned out to be simply the largest value accepted
>> by the iar data structure, plus one.
>>
>> So, yes, this is an off-by-one error. More precisely, there is a
>> mismatch between the above code (or the values stored in the iar data
>> structure) and the value of bio_end_sector (the latter is passed as
>> input to the above function). bio_end_sector does not seem to give
>> the end sector of a request (equal to begin+size-1), as apparently
>> expected by the above code, but the sector after the last one (namely,
>> begin+size).
>>
>> If I were to express an opinion on where the error is, I would say
>> that the mistake is in bio_end_sector. But I could be totally wrong,
>> as I'm not an expert on either of the involved parts. In addition,
>> bio_end_sector is already in use, with its current formula, by other
>> parts of the block layer. If you guys (Damien, Jens, Tyler?, ...)
>> give us some guidance on what to fix exactly, we will be happy to
>> make a fix.
>
> I do not think there is any error/problem anywhere with bio_end_sector(). But
> indeed, the name is slightly misleading as it does not return the last sector
> processed by the bio but the next one. The reason is that with this
> implementation, bio_end_sector() always gives you the first sector of the
> next bio in a multi-bio request.
>
> When you need the last sector processed by the bio, you need to use
> bio_end_sector(bio) - 1.
>
> So to find the access range that served a completed bio, you simply need to call:
>
> disk_find_ia_range(iars, bio_end_sector(bio) - 1);
>
Thank you very much, I got it now.
> You probably could add a couple of helper functions for getting an access range
> from a bio sector or end sector. Something like:
>
> static inline struct blk_independent_access_range *
> __bio_sector_access_range(struct bio *bio, sector_t sector)
> {
> 	struct gendisk *disk = bio->bi_bdev->bd_disk;
> 	struct blk_independent_access_range *iar;
>
> 	if (!disk->ia_ranges)
> 		return NULL;
>
> 	iar = disk_find_ia_range(disk->ia_ranges, sector);
> 	WARN_ONCE(!iar,
> 		  "bio sector out of ranges: sector=%llu\n",
> 		  sector);
>
> 	return iar;
> }
>
> static inline struct blk_independent_access_range *
> bio_sector_access_range(struct bio *bio)
> {
> 	return __bio_sector_access_range(bio, bio->bi_iter.bi_sector);
> }
>
> static inline struct blk_independent_access_range *
> bio_end_sector_access_range(struct bio *bio)
> {
> 	return __bio_sector_access_range(bio, bio_end_sector(bio) - 1);
> }
>
> or similar. Then your code will be simpler, with much less worry about
> off-by-one bugs.
>
For the moment, bio_end_sector is invoked in only one place (for the
range-computation stuff). So, for simplicity, I'd go for just
appending a -1 to that invocation. To give you credit, I'm adding
you in a Reviewed-by tag. If any of this is wrong, I'm willing
to change it. I'll send a V2 soon.
Thanks,
Paolo
>>
>> Thanks,
>> Paolo
>>
>>
>>
>>>
>>>>
>>>> Thanks, Paolo
>>>>
>>>> [1] https://www.spinics.net/lists/kernel/msg4507408.html
>>>
>>> --
>>> Damien Le Moal
>>> Western Digital Research
>>
>>
>
> --
> Damien Le Moal
> Western Digital Research