linux-kernel - Re: LVM-on-LVM: error while submitting device barriers

Message-ID: <Ze2azGlb1WxVFv7Z@fedora>
Date: Sun, 10 Mar 2024 19:34:36 +0800
From: Ming Lei <ming.lei@...hat.com>
To: Patrick Plenefisch <simonpatp@...il.com>
Cc: Mike Snitzer <snitzer@...nel.org>,
	Goffredo Baroncelli <kreijack@...ind.it>,
	linux-kernel@...r.kernel.org, Alasdair Kergon <agk@...hat.com>,
	Mikulas Patocka <mpatocka@...hat.com>, Chris Mason <clm@...com>,
	Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
	regressions@...ts.linux.dev, dm-devel@...ts.linux.dev,
	linux-btrfs@...r.kernel.org, ming.lei@...hat.com
Subject: Re: LVM-on-LVM: error while submitting device barriers

On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> On Wed, Mar 6, 2024 at 11:00 AM Ming Lei <ming.lei@...hat.com> wrote:
> >
> > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > On Thu, Feb 29 2024 at  5:05P -0500,
> > > Goffredo Baroncelli <kreijack@...ind.it> wrote:
> > >
> > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli <kreijack@...ind.it> wrote:
> > > > > >
> > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > device, plus a 1.5TB, 3TB, and 3TB drives. Doing math on the actual
> > > > > > > extents, lowerVG/single spans (3TB+3TB), and
> > > > > > > lowerVG/lvmPool/lvm/brokenDisk spans (3TB+1.5TB). Both obviously have
> > > > > > > the other leg of raid1 on the 8TB drive, but my thought was that the
> > > > > > > jump across the 1.5+3TB drive gap was at least "interesting"
> > > > > >
> > > > > >
> > > > > > what about lowerVG/works ?
> > > > > >
> > > > >
> > > > > That one is only on two disks, it doesn't span any gaps
> > > >
> > > > Sorry, but re-reading the original email I found something that I missed before:
> > > >
> > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > >                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > tolerance is 0 for writable mount
> > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > failure (errors while submitting device barriers.)
> > > >
> > > > Looking at the code, it seems that if a FLUSH command fails, btrfs
> > > > considers the disk missing. Then it cannot mount the device RW.
> > > >
> > > > I would investigate with the LVM developers, if it properly passes
> > > > the flush/barrier command through all the layers, when we have an
> > > > lvm over lvm (raid1). The fact that the lvm is a raid1, is important because
> > > > a flush command to be honored has to be honored by all the
> > > > devices involved.
> >
> > Hello Patrick & Goffredo,
> >
> > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> >
> > If you can reproduce this issue easily, please collect a log with the
> > following bpftrace script. It may show where the flush failure occurs
> > and help narrow down the issue within the whole stack.
> >
> >
> > #!/usr/bin/bpftrace
> >
> > #ifndef BPFTRACE_HAVE_BTF
> > #include <linux/blkdev.h>
> > #endif
> >
> > kprobe:submit_bio_noacct,
> > kprobe:submit_bio
> > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > {
> >         $bio = (struct bio *)arg0;
> >         @submit_stack[arg0] = kstack;
> >         @tracked[arg0] = 1;
> > }
> >
> > kprobe:bio_endio
> > /@tracked[arg0] != 0/
> > {
> >         $bio = (struct bio *)arg0;
> >
> >         if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> >                 return;
> >         }
> >
> >         if ($bio->bi_status != 0) {
> >                 printf("dev %s bio failed %d, submitter %s completion %s\n",
> >                         $bio->bi_bdev->bd_disk->disk_name,
> >                         $bio->bi_status, @submit_stack[arg0], kstack);
> >         }
> >         delete(@submit_stack[arg0]);
> >         delete(@tracked[arg0]);
> > }
> >
> > END {
> >         clear(@submit_stack);
> >         clear(@tracked);
> > }
> >
> 
> Attaching 4 probes...
> dev dm-77 bio failed 10, submitter
>        submit_bio_noacct+5
>        __send_duplicate_bios+358
>        __send_empty_flush+179
>        dm_submit_bio+857
>        __submit_bio+132
>        submit_bio_noacct_nocheck+345
>        write_all_supers+1718
>        btrfs_commit_transaction+2342
>        transaction_kthread+345
>        kthread+229
>        ret_from_fork+49
>        ret_from_fork_asm+27
> completion
>        bio_endio+5
>        dm_submit_bio+955
>        __submit_bio+132
>        submit_bio_noacct_nocheck+345
>        write_all_supers+1718
>        btrfs_commit_transaction+2342
>        transaction_kthread+345
>        kthread+229
>        ret_from_fork+49
>        ret_from_fork_asm+27
> 
> dev dm-86 bio failed 10, submitter
>        submit_bio_noacct+5
>        write_all_supers+1718
>        btrfs_commit_transaction+2342
>        transaction_kthread+345
>        kthread+229
>        ret_from_fork+49
>        ret_from_fork_asm+27
> completion
>        bio_endio+5
>        clone_endio+295
>        clone_endio+295
>        process_one_work+369
>        worker_thread+635
>        kthread+229
>        ret_from_fork+49
>        ret_from_fork_asm+27
> 
> 
> For context, dm-86 is /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool

bi_status is 10 (BLK_STS_IOERR), and it is first produced in the submission
code path on /dev/dm-77 (/dev/lowerVG/lvmPool), so this looks like a device
mapper issue.

The error can only come from the following code:

static void __map_bio(struct bio *clone)

	...
	if (r == DM_MAPIO_KILL)
		dm_io_dec_pending(io, BLK_STS_IOERR);
	else
		dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
	break;
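
To spell out the two outcomes of that branch, here is a minimal userspace
sketch, not kernel code. All sketch_* names are hypothetical stand-ins for
the kernel symbols; only the value 10 for the I/O error status comes from
the trace above, and 11 is just a placeholder for the requeue status.

```c
/* Hypothetical stand-ins for the map results handled by the excerpt. */
enum sketch_mapio {
	SKETCH_MAPIO_KILL,
	SKETCH_MAPIO_REQUEUE
};

/* 10 matches the "bio failed 10" lines in the trace; 11 is a placeholder. */
enum sketch_status {
	SKETCH_STS_IOERR = 10,
	SKETCH_STS_DM_REQUEUE = 11
};

/* Mirrors the if/else above: KILL yields IOERR, anything else REQUEUE. */
enum sketch_status sketch_map_result_to_status(enum sketch_mapio r)
{
	if (r == SKETCH_MAPIO_KILL)
		return SKETCH_STS_IOERR;
	return SKETCH_STS_DM_REQUEUE;
}
```

So within this branch, a target either kills the bio outright or asks for a
requeue; no other status is passed to dm_io_dec_pending() here.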

Patrick, you mentioned that lvmPool is raid1. Can you explain how lvmPool is
built? Is it a dm-raid1 target, or does it sit on top of a plain raid1 device
that is built over /dev/lowerVG?

Mike, the logic in the following code didn't change from v5.18-rc2 to
v5.19, but I still can't understand why BLK_STS_IOERR is set in
dm_io_complete() in the case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend(), which
is not supposed to happen in Patrick's case.

dm_io_complete()
	...
	if (io->status == BLK_STS_DM_REQUEUE) {
	        unsigned long flags;
	        /*
	         * Target requested pushing back the I/O.
	         */
	        spin_lock_irqsave(&md->deferred_lock, flags);
	        if (__noflush_suspending(md) &&
	            !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
	                /* NOTE early return due to BLK_STS_DM_REQUEUE below */
	                bio_list_add_head(&md->deferred, bio);
	        } else {
	                /*
	                 * noflush suspend was interrupted or this is
	                 * a write to a zoned target.
	                 */
	                io->status = BLK_STS_IOERR;
	        }
	        spin_unlock_irqrestore(&md->deferred_lock, flags);
	}
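
To make the question concrete, the decision above can be modelled in
userspace roughly as follows. This is only a sketch under assumptions: the
model_* names are hypothetical, the two booleans stand in for
__noflush_suspending(md) and dm_is_zone_write(md, bio), and only the value
10 is taken from the trace (11 is a placeholder for the requeue status).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel status codes. */
enum model_status {
	MODEL_STS_OK = 0,
	MODEL_STS_IOERR = 10,
	MODEL_STS_DM_REQUEUE = 11
};

/* Final status plus whether the bio was put on the deferred list. */
struct model_result {
	enum model_status status;
	bool deferred;
};

/* Mirrors the requeue branch of the dm_io_complete() excerpt above. */
struct model_result model_io_complete(enum model_status status,
				      bool noflush_suspending,
				      bool zone_write)
{
	struct model_result res = { status, false };

	if (status == MODEL_STS_DM_REQUEUE) {
		if (noflush_suspending && !zone_write)
			res.deferred = true;	/* status stays REQUEUE */
		else
			res.status = MODEL_STS_IOERR;
	}
	return res;
}
```

With noflush_suspending false, a requeue request always ends up as status
10, the same value a killed bio would report, so the trace alone cannot
tell the two cases apart.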



thanks,
Ming