[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <de8fb932-2934-7ad2-112a-a9a2e63b532d@gmx.com>
Date: Thu, 29 Dec 2022 07:31:19 +0800
From: Qu Wenruo <quwenruo.btrfs@....com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
Cc: Qu Wenruo <wqu@...e.com>, dsterba@...e.com,
Btrfs BTRFS <linux-btrfs@...r.kernel.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [6.2][regression] after commit
947a629988f191807d2d22ba63ae18259bb645c5 btrfs volume periodical forced
switch to readonly after a lot of disk writes
On 2022/12/28 22:12, Mikhail Gavrilov wrote:
> On Wed, Dec 28, 2022 at 6:08 AM Qu Wenruo <quwenruo.btrfs@....com> wrote:
>>
>> From the very first dmesg with calltrack, it already shows the
>> submit_one_bio() is called from submit_extent_page(), which means cases
>> cross stripe boundary, and has no parent_check populated at all.
>>
>> And since you're using RAID0 on two NVMEs, it matches the symptom, while
>> most tests done here are using single device (DUP and SINGLE), thus no
>> stripe boundary cases at all.
>> (In fact it should still be possible to trigger on SINGLE, but way too
>> hard to trigger)
>>
>> With proper root cause found, this version should mostly handle the
>> regression correctly.
>>
>> This version should mostly be the formal one I'd later send to the
>> mailing list.
>>
>> I can not thank you more for all the testing you have provided, it not
>> only pinned down the bug, but also proves I'm a total idiot...
>
> I have already written over 1.6Tb of data to disk and there are no
> hints of errors.
> For me, this is a sign that the problem has been fixed.
> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
>
> ❯ dmesg | grep -i btrfs
> [ 0.000000] Linux version
> 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> (mikhail@...hail-laptop) (gcc (GCC) 12.2.1 20221121 (Red Hat
> 12.2.1-4), GNU ld version 2.39-6.fc38) #7 SMP PREEMPT_DYNAMIC Wed Dec
> 28 10:00:39 +05 2022
> [ 0.000000] Command line:
> BOOT_IMAGE=(hd0,gpt1)/@...t/boot/vmlinuz-6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> root=UUID=40e0b5d2-df54-46e0-b6f4-2f868296271d ro
> rootflags=subvol=@...t
> resume=UUID=db79988f-6b70-4b52-84f5-3e505471c85e log_buf_len=16M
> sysrq_always_enabled=1 nmi_watchdog=1
> amdgpu.lockup_timeout=-1,-1,-1,-1
> [ 0.154567] Kernel command line:
> BOOT_IMAGE=(hd0,gpt1)/@...t/boot/vmlinuz-6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> root=UUID=40e0b5d2-df54-46e0-b6f4-2f868296271d ro
> rootflags=subvol=@...t
> resume=UUID=db79988f-6b70-4b52-84f5-3e505471c85e log_buf_len=16M
> sysrq_always_enabled=1 nmi_watchdog=1
> amdgpu.lockup_timeout=-1,-1,-1,-1
> [ 0.154654] Unknown kernel command line parameters
> "BOOT_IMAGE=(hd0,gpt1)/@...t/boot/vmlinuz-6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+",
> will be passed to user space.
> [ 4.496766] usb usb2: Manufacturer: Linux
> 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+ xhci-hcd
> [ 4.498963] usb usb1: Manufacturer: Linux
> 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+ xhci-hcd
> [ 4.500665] usb usb3: Manufacturer: Linux
> 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+ xhci-hcd
> [ 4.501851] usb usb4: Manufacturer: Linux
> 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+ xhci-hcd
> [ 4.735212] Btrfs loaded, crc32c=crc32c-generic, assert=on,
> zoned=yes, fsverity=yes
> [ 5.223368]
> BOOT_IMAGE=(hd0,gpt1)/@...t/boot/vmlinuz-6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> [ 6.923453] BTRFS: device label fedora_localhost-live devid 2
> transid 652981 /dev/nvme1n1p1 scanned by systemd-udevd (448)
> [ 6.974412] BTRFS: device label fedora_localhost-live devid 1
> transid 652981 /dev/nvme0n1p3 scanned by systemd-udevd (484)
> [ 11.113437] CPU: 15 PID: 478 Comm: systemd-udevd Tainted: G
> L 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> #7
> [ 11.221359] CPU: 15 PID: 478 Comm: systemd-udevd Tainted: G
> W L 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> #7
> [ 13.731015] BTRFS info (device nvme0n1p3): using crc32c
> (crc32c-intel) checksum algorithm
> [ 13.731147] BTRFS info (device nvme0n1p3): using free space tree
> [ 14.328439] BTRFS info (device nvme0n1p3): enabling ssd optimizations
> [ 14.328469] BTRFS info (device nvme0n1p3): auto enabling async discard
> [ 16.592713] BTRFS info (device nvme0n1p3: state M): use zstd
> compression, level 1
> [11691.071176] CPU: 11 PID: 2068 Comm: gnome-shell Tainted: G W
> L 6.2.0-rc1-1b929c02afd37871d5afb9d498426f83432e71c2-btrfs-fix+
> #7
>
>
> <OFFTOPIC>
> As I mentioned at the first message I also have a computer where the
> btrfs partition is located on a slow HDD.
> When I update the container (podman pull), the system becomes
> unresposible for half an hour, which is how long it takes to update
> the container.
> I do not expect any super-speed from the HDD, I just would like to do
> something else with this computer. Yes, at least watching videos on
> youtube. Is there anything that can be done here or is there nothing
> that we can do?
Are you using qgroup? If so it may be worthy trying disabling qgroup.
But for newer kernel, qgroup hang should only happen when dropping large
snapshot, I don't know if podman pull would cause older snapshots to be
deleted...
Thanks,
Qu
> [46944.301588] INFO: task btrfs-transacti:1184 blocked for more than
> 122 seconds.
> [46944.301825] Tainted: G W L ------- ---
> 6.2.0-0.rc1.14.fc38.x86_64+debug #1
> [46944.301829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [46944.301832] task:btrfs-transacti state:D stack:12000 pid:1184
> ppid:2 flags:0x00004000
> [46944.301840] Call Trace:
> [46944.301843] <TASK>
> [46944.301851] __schedule+0x50c/0x1780
> [46944.301863] ? _raw_spin_unlock_irqrestore+0x30/0x60
> [46944.301876] schedule+0x5d/0xe0
> [46944.301881] wait_current_trans+0x110/0x170
> [46944.301888] ? __pfx_autoremove_wake_function+0x10/0x10
> [46944.301895] start_transaction+0x36c/0x680
> [46944.301904] transaction_kthread+0xb6/0x1b0
> [46944.301912] ? __pfx_transaction_kthread+0x10/0x10
> [46944.301916] kthread+0xf5/0x120
> [46944.301920] ? __pfx_kthread+0x10/0x10
> [46944.301926] ret_from_fork+0x2c/0x50
> [46944.301941] </TASK>
>
>
> I attached a full kernel log from this machine.
> I can start a separate thread if it makes sense.
> Sorry for oftop.
> </OFFTOPIC>
>
> --
> Best Regards,
> Mike Gavrilov.
Powered by blists - more mailing lists