linux-kernel - Re: [6.2][regression] after commit 947a629988f191807d2d22ba63ae18259bb645c5 btrfs volume periodical forced switch to readonly after a lot of disk writes

Message-ID: <fd0a0bfe-5c67-fd95-b17c-78a14c63bea6@gmx.com>
Date:   Wed, 28 Dec 2022 09:08:14 +0800
From:   Qu Wenruo <quwenruo.btrfs@....com>
To:     Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>,
        Qu Wenruo <wqu@...e.com>
Cc:     dsterba@...e.com, Btrfs BTRFS <linux-btrfs@...r.kernel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [6.2][regression] after commit
 947a629988f191807d2d22ba63ae18259bb645c5 btrfs volume periodical forced
 switch to readonly after a lot of disk writes

On 2022/12/27 21:11, Mikhail Gavrilov wrote:
> On Tue, Dec 27, 2022 at 4:03 PM Qu Wenruo <wqu@...e.com> wrote:
>>
>> I have a similar laptop (G14), only GPU is different (RTX3060), and I
>> failed to reproduce this so far...
>>
>> My gcc is only a small version behind (12.2.0).
>>
>> Thus none of the hardware seems suspicious at all...
>>
>> Anyway I have attached my last struggle for the weird problem.
>> For now, I have no idea why this can even happen...
> 
> The new Kernel log is attached.
> This time, the main difference was that the file system did not
> immediately switch to readonly.
> The Steam client stopped a couple of times with a write error, but
> after pressing the resume button, it resumed downloading. On the
> third or fourth attempt, it refused to download.
> 
I'm a total idiot.

From the very first dmesg with a call trace, it already shows that
submit_one_bio() is called from submit_extent_page(), which is the case
where a bio crosses a stripe boundary, and that path has no parent_check
populated at all.
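
To make the failure mode concrete, here is a toy model of two submission paths, one of which never fills in the expectations used for validation. It is only a sketch and not the kernel code: every `toy_*` name is hypothetical, and it assumes that an unpopulated check structure is zero-filled, so it silently "expects" level 0 (a leaf).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the expectations a caller attaches to a read. */
struct toy_parent_check {
    int level; /* expected tree level; 0 means leaf */
};

/* Hypothetical stand-in for a tree block that was read from disk. */
struct toy_tree_block {
    int level;
};

/* Returns true when the block matches the expectations attached to the read. */
static bool toy_validate(const struct toy_tree_block *eb,
                         const struct toy_parent_check *check)
{
    assert(eb != NULL && check != NULL);
    return eb->level == check->level;
}

/* Path A: the caller fills in the expectations before submitting. */
static bool toy_read_with_check(const struct toy_tree_block *eb,
                                int expected_level)
{
    struct toy_parent_check check = { .level = expected_level };

    return toy_validate(eb, &check);
}

/* Path B: submission happens before anything was filled in (assumed zeroed). */
static bool toy_read_without_check(const struct toy_tree_block *eb)
{
    struct toy_parent_check check;

    memset(&check, 0, sizeof(check));
    return toy_validate(eb, &check);
}
```

In this model a perfectly valid level-1 node passes on path A but is rejected on path B, while a leaf passes on both, which is why such a bug can look like intermittent corruption rather than a consistent failure.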

And since you're using RAID0 on two NVMe drives, it matches the symptom,
while most tests done here use a single device (DUP and SINGLE), and thus
hit no stripe boundary cases at all.
(In fact it should still be possible to trigger on SINGLE, but it is far
too hard to do so.)
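
For readers unfamiliar with the term, a stripe boundary check can be sketched as below. This is an illustration under stated assumptions, not the btrfs implementation: the `toy_*` names are hypothetical, the stripe length is assumed to be 64 KiB, and the address is treated as already relative to the start of the striped region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a fixed 64 KiB stripe length. */
#define TOY_STRIPE_LEN (64ULL * 1024)

/* Bytes left between @offset and the end of the stripe that contains it. */
static uint64_t toy_bytes_to_stripe_end(uint64_t offset)
{
    return TOY_STRIPE_LEN - (offset % TOY_STRIPE_LEN);
}

/* True when the range [offset, offset + len) spans more than one stripe. */
static bool toy_crosses_stripe(uint64_t offset, uint64_t len)
{
    assert(len > 0);
    return len > toy_bytes_to_stripe_end(offset);
}
```

With these assumptions, an 8 KiB range starting at offset 60 KiB crosses a boundary, because only 4 KiB remain in that stripe, whereas a 4 KiB range at the same offset does not. A submitter that must not cross the boundary has to send out what it has accumulated and start a new request at that point.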

With the proper root cause found, this version should mostly handle the
regression correctly.

This version should be mostly the formal one I'll send to the mailing
list later.

I cannot thank you enough for all the testing you have provided; it not
only pinned down the bug, but also proved I'm a total idiot...

Thanks,
Qu
View attachment "0001-btrfs-fix-the-false-alert-on-bad-tree-level.patch" of type "text/x-patch" (5723 bytes)