Date:   Sun, 2 Oct 2022 22:33:23 +0000
From:   Damien Le Moal <Damien.LeMoal@....com>
To:     Paolo Valente <paolo.valente@...aro.org>,
        Damien Le Moal <damien.lemoal@...nsource.wdc.com>
CC:     Arie van der Hoeven <arie.vanderhoeven@...gate.com>,
        Tyler Erickson <tyler.erickson@...gate.com>,
        Muhammad Ahmad <muhammad.ahmad@...gate.com>,
        "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Jan Kara <jack@...e.cz>,
        "andrea.righi@...onical.com" <andrea.righi@...onical.com>,
        "glen.valante@...aro.org" <glen.valante@...aro.org>,
        "axboe@...nel.dk" <axboe@...nel.dk>,
        Michael English <michael.english@...gate.com>,
        Andrew Ring <andrew.ring@...gate.com>,
        Varun Boddu <varunreddy.boddu@...gate.com>
Subject: Re: block: wrong return value by bio_end_sector?

On 2022/10/03 1:20, Paolo Valente wrote:
> 
> 
>> On 1 Oct 2022, at 02:50, Damien Le Moal <damien.lemoal@...nsource.wdc.com> wrote:
>>
>> On 10/1/22 00:59, Paolo Valente wrote:
>>> Hi Jens, Damien, all other possibly interested people, this is to call
>>> attention to a mistake that has emerged in a thread on a bfq extension
>>> for multi-actuator drives [1].
>>>
>>> The mistake is apparently in the macro bio_end_sector (defined in
>>> include/linux/bio.h), which seems to be (incorrectly) computed as
>>> sector+size, and not as sector+size-1.
>>
>> This has been like this for a long time, I think.
>>
>>>
>>> For your convenience, I'm pasting a detailed description of the 
>>> problem, by Tyler (description taken from the above thread [1]).
>>>
>>> The drive reports the actuator ranges as a starting LBA and a count of
>>> LBAs for the range. If the code reading the reported values simply does
>>> startingLBA + range, it gets an incorrect ending LBA for that actuator.
>>
>> Well, yes. LBA 0 + drive capacity is also an incorrect LBA. If the code
>> assumes that it is a valid LBA, you have a classic off-by-one bug.
>>
>>> This is because LBAs are zero indexed and this simple addition does not
>>> take that into account. The proper way to get the ending LBA is
>>> startingLBA + range - 1, which gives the last LBA at which a final IO
>>> read/write can be issued, accounting for LBA values starting at zero
>>> rather than one.
>>
>> Yes. And ? Where is the issue ?
>>
>>>
>>> Here is an example from the output in SeaChest/openSeaChest:
>>> ====Concurrent Positioning Ranges====
>>>
>>> Range#    #Elements    Lowest LBA      # of LBAs
>>> 0         1            0               17578328064
>>> 1         1            17578328064     17578328064
>>>
>>> If using the incorrect formula to get the final LBA for actuator 0, you
>>> would get 17578328064, but this is the starting LBA reported by the
>>> drive for actuator 1. So to be consistent for all ranges, the final LBA
>>> for a given actuator should be calculated as starting LBA + range - 1.
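
Just to spell out the arithmetic with the values above:

	actuator 0: last LBA = 0           + 17578328064 - 1 = 17578328063
	actuator 1: last LBA = 17578328064 + 17578328064 - 1 = 35156656127

so the last LBA of range 0 is indeed one less than the lowest LBA of range 1.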
>>>
>>> I had reached out to Seagate's T10 and T13 representatives for
>>> clarification and verification, and the most likely cause of the error
>>> is a missing - 1 somewhere after reading the information reported by
>>> the device. They agreed that the reporting from the drive and the SCSI
>>> to ATA translation are correct.
>>>
>>> I'm not sure where this is being read and calculated, but it is not an
>>> error in the low-level libata or sd layers of the kernel. It may be in
>>> bfq, or it may be in some other place above the sd layer. I know there
>>> were some additions to read this information and report it up the
>>> stack, but I did not think those were wrong, as they seemed to pass the
>>> drive-reported values along unmodified.
>>>
>>> Jens, Damien, can you shed a light on this?
>>
>> I am not clear on what the problem is exactly. This all sounds like a
>> simple off-by-one issue in the bfq support code. No ?
> 
> Yes, it's (also) in bfq code now; no, it's not only in bfq code. 
> 
> The involved bfq code is a replica of the following original function
> (living in block/blk-ia-ranges.c):
> 
> static struct blk_independent_access_range *
> disk_find_ia_range(struct blk_independent_access_ranges *iars,
> 		  sector_t sector)
> {
> 	struct blk_independent_access_range *iar;
> 	int i;
> 
> 	for (i = 0; i < iars->nr_ia_ranges; i++) {
> 		iar = &iars->ia_range[i];
> 		if (sector >= iar->sector &&
> 		    sector < iar->sector + iar->nr_sectors)
> 			return iar;
> 	}
> 
> 	return NULL;
> }
> 
> bfq's replica simply also contains this warning, right before the return NULL:
> 
> 	WARN_ONCE(true,
> 		  "bfq_actuator_index: bio sector out of ranges: end=%llu\n",
> 		  end);
> 
> So, if this warning is triggered, then the sector does not belong to
> any range.
> 
> That warning actually gets triggered [1], for a sector number that is
> larger than the largest extreme (iar->sector + iar->nr_sectors) in the
> iar data structure.  The offending value turned out to be simply the
> largest value accepted by the iar data structure, plus one.
> 
> So, yes, this is an off-by-one error.  More precisely, there is a
> mismatch between the above code (or the values stored in the iar data
> structure) and the value of bio_end_sector (the latter is passed as
> input to the above function).  bio_end_sector does not seem to give
> the end sector of a request (equal to begin+size-1), as apparently
> expected by the above code, but the sector after the last one (namely,
> begin+size).
> 
> If I had to express an opinion on where the error is, I would say that
> the mistake is in bio_end_sector.  But I could be totally wrong, as
> I'm not an expert on either of the involved parts.  In addition,
> bio_end_sector is already in use, with its current formula, by other
> parts of the block layer.  If you guys (Damien, Jens, Tyler?, ...)
> give us some guidance on what exactly to fix, we will be happy to make
> a fix.

I do not think there is any error/problem anywhere with bio_end_sector(). But
indeed, the name is slightly misleading as it does not return the last sector
processed by the bio but the one after it. The reason is that with this
implementation, bio_end_sector() always gives you the first sector of the next
bio within a multi-bio request.
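
For reference, the macro in include/linux/bio.h is essentially:

	#define bio_end_sector(bio)	(bio_sectors(bio) + (bio)->bi_iter.bi_sector)

that is, start + size, so the sectors covered by a bio form the half-open
interval [bio->bi_iter.bi_sector, bio_end_sector(bio)).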

When you need the last sector processed by the bio, you need to use
bio_end_sector(bio) - 1.

So to find the access range that served a completed bio, you simply need to call:

disk_find_ia_range(iars, bio_end_sector(bio) - 1);

You could probably add a couple of helper functions for getting an access range
from a bio's start sector or end sector. Something like:

static inline struct blk_independent_access_range *
__bio_sector_access_range(struct bio *bio, sector_t sector)
{
	struct gendisk *disk = bio->bi_bdev->bd_disk;
	struct blk_independent_access_range *iar;

	if (!disk->ia_ranges)
		return NULL;

	iar = disk_find_ia_range(disk->ia_ranges, sector);
	WARN_ONCE(!iar,
		  "%s: bio sector out of access ranges: sector=%llu\n",
		  __func__, (unsigned long long)sector);

	return iar;
}

static inline struct blk_independent_access_range *
bio_sector_access_range(struct bio *bio)
{
	return __bio_sector_access_range(bio, bio->bi_iter.bi_sector);
}

static inline struct blk_independent_access_range *
bio_end_sector_access_range(struct bio *bio)
{
	return __bio_sector_access_range(bio, bio_end_sector(bio) - 1);
}

or similar. Then your code will be simpler, with far fewer worries about
off-by-one bugs.
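
With these in place, the bfq side could then be something as simple as the
sketch below (completely untested, and the function signature is only a guess
at what the bfq code actually needs):

static unsigned int bfq_actuator_index(struct bio *bio)
{
	struct gendisk *disk = bio->bi_bdev->bd_disk;
	struct blk_independent_access_range *iar =
		bio_end_sector_access_range(bio);

	/* Single access range (or out-of-range sector): default to actuator 0. */
	if (!iar)
		return 0;

	/*
	 * The ranges live in an array, so the offset of the matching range
	 * within that array gives the actuator index.
	 */
	return iar - disk->ia_ranges->ia_range;
}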

> 
> Thanks,
> Paolo
> 
> 
> 
>>
>>>
>>> Thanks, Paolo
>>>
>>> [1] https://www.spinics.net/lists/kernel/msg4507408.html
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
> 
> 

-- 
Damien Le Moal
Western Digital Research
