Date:   Wed, 05 Apr 2017 07:52:13 +1000
From:   NeilBrown <neilb@...e.com>
To:     Michael Wang <yun.wang@...fitbricks.com>,
        "linux-kernel\@vger.kernel.org" <linux-kernel@...r.kernel.org>,
        linux-block@...r.kernel.org, linux-raid@...r.kernel.org
Cc:     Jens Axboe <axboe@...nel.dk>, Shaohua Li <shli@...nel.org>,
        Jinpu Wang <jinpu.wang@...fitbricks.com>
Subject: Re: [RFC PATCH] blk: reset 'bi_next' when bio is done inside request

On Tue, Apr 04 2017, Michael Wang wrote:

> On 04/04/2017 11:37 AM, NeilBrown wrote:
>> On Tue, Apr 04 2017, Michael Wang wrote:
> [snip]
>>>>
>>>> If sync_request_write() is using a bio that has already been used, it
>>>> should call bio_reset() and fill in the details again.
>>>> However I don't see how that would happen.
>>>> Can you give specific details on the situation that triggers the bug?
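>>>>
>>>> Roughly, "fill in the details again" would look something like this
>>>> (only a sketch, using the field names from the 4.x-era raid1 code;
>>>> the data fields such as bi_vcnt/bi_iter.bi_size would need restoring
>>>> as well):
>>>>
>>>> 	bio_reset(wbio);	/* clears bi_next, bi_flags, bi_iter, ... */
>>>> 	wbio->bi_bdev = conf->mirrors[i].rdev->bdev;
>>>> 	wbio->bi_iter.bi_sector = r1_bio->sector +
>>>> 		conf->mirrors[i].rdev->data_offset;
>>>> 	wbio->bi_end_io = end_sync_write;
>>>> 	wbio->bi_private = r1_bio;
>>>> 	wbio->bi_rw = WRITE;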
>>>
>>> On the storage side we map LVs through SCST to the server; on the
>>> server side we assemble them into multipath devices, and then assemble
>>> these dm devices into two raid1 arrays.
>>>
>>> The test first runs mkfs.ext4 on the raid1 and then starts fio on it.
>>> On the storage side we unmap all the LVs (this can happen during mkfs
>>> or fio), and then on the server side we hit the BUG (reproducible).
>> 
>> So I assume the initial resync is still happening at this point?
>> And you unmap *all* the lv's so you expect IO to fail?
>> I can see that the code would behave strangely if you have a
>> bad-block-list configured (which is the default).
>> Do you have a bbl?  If you create the array without the bbl, does it
>> still crash?
>
> The resync is at least happening concurrently in this case.  We try to
> simulate the case where all the connections are dropped; the IO does
> fail, and there is also a bunch of kernel log messages like:
>
> md: super_written gets error=-5
> blk_update_request: I/O error, dev dm-3, sector 64184
> md/raid1:md1: dm-2: unrecoverable I/O read error for block 46848
>
> We expect that to happen, but the server should not crash with a BUG.
>
> And we haven't done anything special regarding the bbl; the bad_blocks
> entries in sysfs are all empty.
>
>> 
>>>
>>> The path of the bio was confirmed by adding tracing: it is reused in
>>> sync_request_write() with 'bi_next' still chained from an earlier
>>> blk_attempt_plug_merge().
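>>>
>>> (For context, the chaining in question is the back-merge into a plugged
>>> request; simplified from bio_attempt_back_merge(), as an illustration
>>> rather than the exact code:)
>>>
>>> 	/* the bio is linked onto the request via bi_next ... */
>>> 	req->biotail->bi_next = bio;
>>> 	req->biotail = bio;
>>> 	req->__data_len += bio->bi_iter.bi_size;
>>> 	/* ... and nothing clears bi_next when the request later
>>> 	 * completes this bio, which is what the RFC patch addresses. */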
>> 
>> I still don't see why it is re-used.
>> I assume you didn't explicitly ask for a check/repair (i.e. didn't write
>> to .../md/sync_action at all?).  In that case MD_RECOVERY_REQUESTED is
>> not set.
>
> We just unmap the LVs on the storage side; there is no operation on the
> server side.
>
>> So sync_request() sends only one bio to generic_make_request():
>>    r1_bio->bios[r1_bio->read_disk];
>> 
>> then sync_request_write() *doesn't* send that bio again, but does send
>> all the others.
>> 
>> So where does it reuse a bio?
>
> If that's the design then it would be strange... the log does show the
> path of that bio going through sync_request(); we will do more
> investigation.
>
>> 
>>>
>>> We also tried resetting bi_next inside sync_request_write() before
>>> generic_make_request(), which also works (see the sketch below).
>>>
>>> The testing was done with 4.4, but we found that upstream also leaves
>>> bi_next chained after the bio is completed inside the request, thus we
>>> post this RFC.
>>>
>>> Regarding raid1, we haven't found the place on the path where the bio
>>> is reset... where is it supposed to be?
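>>>
>>> The workaround we tried is essentially (just a sketch, not the posted
>>> RFC patch):
>>>
>>> 	/* in sync_request_write(), before resubmitting the write bio */
>>> 	wbio->bi_next = NULL;	/* drop the stale link left from the
>>> 				 * earlier plug merge */
>>> 	generic_make_request(wbio);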
>> 
>> I'm not sure what you mean.
>> We only reset bios when they are being reused.
>> One place is in process_checks() where bio_reset() is called before
>> filling in all the details.
>> 
>> 
>> Maybe, in sync_request_write(), before
>> 
>> 	wbio->bi_rw = WRITE;
>> 
>> add something like
>>   if (wbio->bi_next)
>>      printk("bi_next!= NULL i=%d read_disk=%d bi_end_io=%pf\n",
>>           i, r1_bio->read_disk, wbio->bi_end_io);
>> 
>> that might help narrow down what is happening.
>
> It just triggered again in 4.4; the dmesg looks like:
>
> [  399.240230] md: super_written gets error=-5
> [  399.240286] md: super_written gets error=-5
> [  399.240286] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240300] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240312] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240323] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240334] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240341] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240349] md/raid1:md0: dm-0: unrecoverable I/O read error for block 204160
> [  399.240352] bi_next!= NULL i=0 read_disk=0 bi_end_io=end_sync_write [raid1]

It is very peculiar for r1_bio->bios[r1_bio->read_disk]->bi_end_io to be
end_sync_write.
I can only see that happening when sync_request_write() assigns it,
and this printk comes before that assignment.
That seems to suggest that sync_request_write() is being performed on
the same r1_bio twice, which is also very peculiar.
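
If it reproduces easily, one extra data point that would help is whether
the same r1_bio pointer really passes through sync_request_write() twice,
e.g. with something like (just a debugging sketch):

	pr_info("sync_request_write: r1_bio=%p sector=%llu state=%lx\n",
		r1_bio, (unsigned long long)r1_bio->sector, r1_bio->state);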

I might have to try to reproduce this myself and see what is happening.

Thanks,
NeilBrown
