linux-kernel - Re: [BUG 4.4.26] bio->bi_bdev == NULL in raid6 return

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Date:   Mon, 21 Nov 2016 18:32:48 +0300
From:   Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:     NeilBrown <neilb@...e.com>,
        Konstantin Khlebnikov <koct9i@...il.com>,
        Shaohua Li <shli@...nel.org>
Cc:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        linux-raid@...r.kernel.org, linux-block@...r.kernel.org,
        Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>
Subject: Re: [BUG 4.4.26] bio->bi_bdev == NULL in raid6 return_io()

On 21.11.2016 04:23, NeilBrown wrote:
> On Sun, Nov 20 2016, Konstantin Khlebnikov wrote:
>
>> On 07.11.2016 23:34, Konstantin Khlebnikov wrote:
>>> On Mon, Nov 7, 2016 at 10:46 PM, Shaohua Li <shli@...nel.org> wrote:
>>>> On Sat, Nov 05, 2016 at 01:48:45PM +0300, Konstantin Khlebnikov wrote:
>>>>> return_io() resolves request_queue even if trace point isn't active:
>>>>>
>>>>> static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>>>>> {
>>>>>       return bdev->bd_disk->queue;    /* this is never NULL */
>>>>> }
>>>>>
>>>>> static void return_io(struct bio_list *return_bi)
>>>>> {
>>>>>       struct bio *bi;
>>>>>       while ((bi = bio_list_pop(return_bi)) != NULL) {
>>>>>               bi->bi_iter.bi_size = 0;
>>>>>               trace_block_bio_complete(bdev_get_queue(bi->bi_bdev),
>>>>>                                        bi, 0);
>>>>>               bio_endio(bi);
>>>>>       }
>>>>> }
>>>>
>>>> I can't see how this could happen. What kind of tests/environment are these running?
>>>
>>> That was a random piece of production somewhere.
>>> Cording to time all crashes happened soon after reboot.
>>> There're several raids, probably some of them were still under resync.
>>>
>>> For now we have only few machines with this kernel. But I'm sure that
>>> I'll get much more soon =)
>>
>> I've added this debug patch for catching overflow of active stripes in bio
>>
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -164,6 +164,7 @@ static inline void raid5_inc_bi_active_stripes(struct bio *bio)
>>   {
>>          atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
>>          atomic_inc(segments);
>> +       BUG_ON(!(atomic_read(segments) & 0xffff));
>>   }
>>
>> And got this. Counter in %edx = 0x00010000
>>
>> So, looks like one bio (discard?) can cover more than 65535 stripes
>
> 65535 stripes - 256M.  I guess that is possible.  Christoph has
> suggested that now would be a good time to stop using bi_phys_segments
> like this.

Is it possible to fix this by limiting max_hw_sectors and
max_hw_discard_sectors for raid queue?

This should be much easier to backport into stable kernels.

I've found that setup also have dm/lvm on the top of md raid so
hat might be more complicated problem.
Because I cannot see how bio could be big enough to overflow that counter.
That was raid6 with 10 disks and 256k chunk. max_hw_discard_sectors and
max_hw_sectors cannot be bigger than UINT_MAX. Thus in this case bio
cannot cover more than 16384 data chunks, 20480 chunks including checksums.
Please fix me if I'm wrong.

>
> I have some patches which should fix this.  I'll post them shortly.  I'd
> appreciate it if you would test and confirm that they work (and don't
> break anything else)

Ok, I'll try to check that patchset.

-- 
Konstantin