Message-ID: <17948.8871.997123.363344@notabene.brown>
Date:	Wed, 11 Apr 2007 09:49:59 +1000
From:	Neil Brown <neilb@...e.de>
To:	syrius.ml@...log.org
Cc:	linux-kernel@...r.kernel.org, Alasdair G Kergon <agk@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [dm-devel] bio too big device md1 (16 > 8)


This is difficult.

Summary of problem is:
  Filesystem on LVM on md/raid1
  Add an md/linear to the md/raid1 and fs dies with 'bio too big'.

Where to begin....

The fs level builds bios to send down to the device, and it queries
the device to find out how big the bio can be.
There are two ways it can query the device:
  1/ it can look at the max_sectors field in the queue structure.
  2/ it can call the merge_bvec_fn function in the queue structure.

It actually does both.  max_sectors sets an overall maximum, and
merge_bvec_fn - if defined - can ACK or NAK each page being added.
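
Very roughly, the per-page check looks like this - a simplified sketch of
what bio_add_page() does, not the exact code; the function name here is
made up and I'm ignoring the bi_max_vecs and segment-count limits:

    /* sketch: add one page to a bio, honouring the queue limits */
    static int add_page_to_bio(struct request_queue *q, struct bio *bio,
                               struct page *page, unsigned int len,
                               unsigned int offset)
    {
            struct bio_vec *bvec;

            /* 1/ the overall limit: max_sectors (len is bytes, >>9 gives sectors) */
            if (((bio->bi_size + len) >> 9) > q->max_sectors)
                    return 0;       /* caller must submit this bio and start a new one */

            bvec = &bio->bi_io_vec[bio->bi_vcnt];
            bvec->bv_page = page;
            bvec->bv_len = len;
            bvec->bv_offset = offset;

            /* 2/ the per-page veto: ask the device, if it cares */
            if (q->merge_bvec_fn && q->merge_bvec_fn(q, bio, bvec) < len) {
                    bvec->bv_page = NULL;   /* device NAKed this page, back it out */
                    return 0;
            }

            bio->bi_vcnt++;
            bio->bi_size += len;
            return len;
    }

If the device has no merge_bvec_fn, only step 1/ applies.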

It is currently very awkward for a stacked device to answer a
merge_bvec_fn query by calling the merge_bvec_fn of the underlying
devices.  So they don't.
I'm not sure what dm does, but md simply sets max_sectors to 1 page if
there is a merge_bvec_fn in an underlying device.
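
For md it is just a couple of lines when a component device is bound to
the array - paraphrased from memory of drivers/md, so treat the names as
approximate:

    /* sketch: clamp the array's queue if a component has a merge_bvec_fn */
    static void md_clamp_for_merge_bvec(struct request_queue *md_queue,
                                        struct block_device *component)
    {
            struct request_queue *child = component->bd_disk->queue;

            /* we cannot forward the per-page query to the component, so
             * restrict every bio to one page - a single-page bio is always
             * acceptable, whatever the component's merge_bvec_fn would say */
            if (child->merge_bvec_fn &&
                md_queue->max_sectors > (PAGE_SIZE >> 9))
                    blk_queue_max_sectors(md_queue, PAGE_SIZE >> 9);
    }

That is safe because a merge_bvec_fn is never allowed to reject a bio that
contains only a single page.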

So in this particular case: because md/linear defines a merge_bvec_fn, as
soon as the md/linear array is added to the md/raid1 array, the
max_sectors of the md/raid1 is set to 1 page.

This works ok when a filesystem is directly on an md/raid1, as the
filesystem checks max_sectors every time before building a bio.

However in your case, dm (aka LVM) is in there too.

dm doesn't know that md/raid1 has just changed max_sectors and there
is no convenient way for it to find out.  So when the filesystem tries
to get the max_sectors for the dm device, it gets the value that dm set
up when it was created, which was somewhat larger than one page.
When such a request gets down to the raid1 layer, it causes the
'bio too big' error.
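
The "bio too big" message itself comes from the size check in
generic_make_request(); from memory it is roughly this (the field actually
tested may be max_hw_sectors rather than max_sectors):

    /* in generic_make_request(), roughly: */
    if (bio_sectors(bio) > q->max_sectors) {
            char b[BDEVNAME_SIZE];

            printk("bio too big device %s (%u > %u)\n",
                   bdevname(bio->bi_bdev, b),
                   bio_sectors(bio), q->max_sectors);
            bio_endio(bio, bio->bi_size, -EIO);     /* fail the request */
            return;
    }

In your case the 16 is the bio dm built (16 sectors == 8K, within the limit
dm advertised to the filesystem) and the 8 is the one-page (8-sector) limit
raid1 imposed after the md/linear device was added.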

You can probably make your array work by adding the md/linear array to
the md/raid1 *before* enabling the LVM device on top of it.  However
that isn't really a good long-term solution.

We really need better ways for stacked devices to communicate.

Alasdair Kergon (dm developer) has some patches to make it practical
to recursively call merge_bvec_fn so they might be part of the
solution, but more discussion is needed before that becomes a reality.

Sorry I cannot be more helpful at this stage.

NeilBrown

