Message-ID: <d51d6f85b5728648fe9c584f9cb3acee12c4873f.camel@suse.com>
Date: Tue, 13 May 2025 10:00:57 +0200
From: Martin Wilck <mwilck@...e.com>
To: Kevin Wolf <kwolf@...hat.com>, Christoph Hellwig <hch@...radead.org>
Cc: dm-devel@...ts.linux.dev, hreitz@...hat.com, mpatocka@...hat.com,
snitzer@...nel.org, bmarzins@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active
paths
Hi Kevin,
On Mon, 2025-05-12 at 17:18 +0200, Kevin Wolf wrote:
> Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben:
> > Hello Kevin,
> >
> > [I'm sorry for the previous email. It seems that I clicked "send" in
> > the wrong window].
> >
> > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
> > > Multipath cannot directly provide failover for ioctls in the kernel
> > > because it doesn't know what each ioctl means and which result
> > > could indicate a path error. Userspace generally knows what the
> > > ioctl it issued means and if it might be a path error, but neither
> > > does it know which path the ioctl took nor does it necessarily have
> > > the privileges to fail a path using the control device.
> >
> > Thanks for this effort.
> >
> > I have some remarks about your approach. The most important one is
> > that the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous
> > command. It sends IO to possibly broken paths and waits for it to
> > complete. In the common error case of a device not responding, this
> > might cause the code to hang for a long time in the kernel ioctl code
> > path, in uninterruptible sleep (note that these probing commands will
> > be queued after other possibly hanging IO). In the past we have put a
> > lot of effort into other code paths in multipath-tools and elsewhere
> > to avoid this kind of hang to the extent possible. It seems to me
> > that your set re-introduces this risk.
>
> That's a fair point. I don't expect this code to be fully final;
> another thing that isn't really optimal (as mentioned in the commit
> message) is that we're not probing paths in parallel, potentially
> adding up timeouts from multiple paths.
>
> I don't think this is a problem of the interface, though, but we can
> improve the implementation while keeping the same interface.
> Maybe I'm missing something, but I think the reason why it has to be
> uninterruptible is just that submit_bio_wait() has the completion on
> the stack? So I suppose we could just kmalloc() some state, submit all
> the bios, let the completion callback free it, and then we can just
> stop waiting early (i.e. wait_for_completion_interruptible/killable)
> and let the requests complete on their own in the background.
>
> Would this address your concerns or is there another part to it?
It'd be an improvement. I'm not sure it would solve the problem
entirely, though.
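Just to make sure we mean the same thing, I suppose the pattern would
look roughly like this (a minimal sketch with invented names, not taken
from your patches):

#include <linux/bio.h>
#include <linux/completion.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct probe_state {
	struct completion all_done;
	atomic_t pending;	/* bios still in flight */
	refcount_t refs;	/* one per bio plus one for the waiter */
	blk_status_t status;	/* last error wins; racy, fine for a sketch */
};

static void probe_state_put(struct probe_state *ps)
{
	if (refcount_dec_and_test(&ps->refs))
		kfree(ps);
}

static void probe_end_io(struct bio *bio)
{
	struct probe_state *ps = bio->bi_private;

	if (bio->bi_status)
		ps->status = bio->bi_status;
	bio_put(bio);
	if (atomic_dec_and_test(&ps->pending))
		complete(&ps->all_done);
	probe_state_put(ps);	/* last bio drops the final reference */
}

	/* submit side (error handling omitted) */
	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
	if (!ps)
		return -ENOMEM;
	init_completion(&ps->all_done);
	atomic_set(&ps->pending, nr_paths);
	refcount_set(&ps->refs, nr_paths + 1);	/* bios + waiter */

	/* ... allocate and submit one bio per path, with
	 *     bio->bi_private = ps and bio->bi_end_io = probe_end_io ... */

	if (wait_for_completion_killable(&ps->all_done))
		ret = -EINTR;	/* give up; bios finish in the background */
	else
		ret = blk_status_to_errno(ps->status);
	probe_state_put(ps);	/* drop the waiter's reference */

That would at least let the caller give up on a fatal signal. The
remaining concern is that the probing bios themselves can still sit
behind other hanging IO for a long time; bailing out of the wait helps
the caller, but not the device.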
> > Apart from that, minor observations are that in patch 2/2 you don't
> > check whether the map is in queueing state, and I don't quite
> > understand why you only check paths in the map's active path group
> > without attempting a PG failover in the case where all paths in the
> > current PG fail.
>
> Ben already fixed up the missing check.
I missed that.
> Not attempting PG failover was also his suggestion, on the basis that
> it would be additional work for no real benefit when you can only
> submit requests for the current PG anyway. If userspace retries the
> SG_IO, it will already pick a different PG, so having the kernel do
> the same doesn't really improve anything.
Ok. But continuing along this line of reasoning, I'd conclude that it
would be even more useful to fail the path in device-mapper directly
after a failed IO request (given that we find an agreeable way to do
this, see below). Qemu could then retry, depending on the error code.
Paths would be failed one after the other, and eventually a PG failover
would occur.
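For illustration, the retry logic this implies on the qemu side would
be roughly the following (a userspace sketch, not actual qemu code; the
path-error check is deliberately crude and only looks at the host
status):

#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Reissue SG_IO a bounded number of times when the result looks like a
 * transport problem.  With the scheme above, dm-mpath would already
 * have failed the path that produced the error, so a retry goes out on
 * another path (or, eventually, another path group). */
static int sg_io_with_retry(int fd, struct sg_io_hdr *hdr, int max_retries)
{
	for (int i = 0; i <= max_retries; i++) {
		if (ioctl(fd, SG_IO, hdr) < 0)
			return -1;	/* the ioctl itself failed */
		if (hdr->host_status == 0)
			return 0;	/* target reached; caller inspects
					 * SCSI status and sense data */
		/* non-zero host_status: likely a path problem, retry */
	}
	return -1;			/* all retries exhausted */
}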
> > I am wondering whether the DM_MPATH_PROBE_PATHS_CMD ioctl is
> > necessary at all. Rather than triggering explicit path probing, you
> > could have dm-mpath "handle" IO errors by failing the active path
> > for "path errors". That would be similar to my patch set from 2021
> > [1], but it would avoid the extra IO and thus the additional risk of
> > hanging in the kernel.
> >
> > Contrary to my set, you wouldn't attempt retries in the kernel, but
> > leave this part to qemu instead, like in the current set. That would
> > avoid Christoph's main criticism that "Failing over SG_IO does not
> > make sense" [2].
>
> Maybe I misunderstood, but my understanding from the feedback you got
> back then was that no SCSI-specific code was wanted in device mapper.
> This is why we kept interpreting the status codes in userspace.
As I wrote in my reply to Mikulas already, my understanding was that
Christoph's main objection was against retrying SG_IO ioctls in the
kernel on different paths, not against inspecting the error code.
> While discussing the approaches with Mikuláš and Ben, one option that
> was briefly discussed was a pair of ioctls: One that wraps ioctls and
> returns which path the ioctl took, and another one to fail that path
> if inspection of the result in userspace comes to the conclusion that
> it's a path error. I didn't think this could be implemented without
> also allowing an unprivileged process to DoS the device by just
> failing all paths even if they are still good, so we didn't continue
> there.
This doesn't sound too promising IMO.
If we want to find a solution that's focused on user space, I'd suggest
using multipathd rather than additional ioctls. After all, all the
necessary information is available in multipathd.

Already today, qemu can use the multipathd socket to query the current
state of the paths in a map. I can imagine a new multipathd CLI command
that would instruct multipathd to check all paths of a given map
immediately (@Ben: set pp->tick to 1 for all paths of the map). This
command could be sent to multipathd in case of a suspected path failure.
It would have similar semantics to the DM_MPATH_PROBE_PATHS_CMD ioctl,
with some advantages: it would take the multipath configuration of the
specific storage array into account, and it would avoid blocking qemu.
multipathd would allow this command on non-root connections but, to
avoid DoS, accept it only once every N seconds.
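On the multipathd side, I imagine the handler for such a command could
be roughly like this (a sketch only; the handler signature and the
lookup helpers are approximated from memory, not copied from the
multipath-tools source):

/*
 * Hypothetical multipathd CLI handler: schedule an immediate checker
 * run for every path of the given map.
 */
static int cli_probe_map_paths(void *v, struct strbuf *reply, void *data)
{
	struct vectors *vecs = data;
	const char *mapname = get_keyparam(v, MAP);	/* assumed key lookup */
	struct multipath *mpp;
	struct pathgroup *pgp;
	struct path *pp;
	int i, j;

	mpp = find_mp_by_alias(vecs->mpvec, mapname);	/* assumed helper */
	if (!mpp)
		return 1;	/* no such map */

	/* pp->tick is the per-path checker countdown; 1 means "check on
	 * the next iteration of the checker loop" */
	vector_foreach_slot(mpp->pg, pgp, i)
		vector_foreach_slot(pgp->paths, pp, j)
			pp->tick = 1;

	return 0;
}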
> Anyway, if it actually were acceptable for the kernel to check SG_IO
> results for path errors and fail the path while still returning the
> result unchanged, then that would work for us for the specific case,
> of course. But I don't expect this really addresses Christoph's
> concerns.
For regular block IO, the "path error" logic is in blk_path_error() [1]
and scsi_result_to_blk_status() [2]. blk_path_error() can be called
from dm; the problem in the SG_IO code path is that we don't have a
blk_status_t to examine. Therefore, in my old code, I replicated the
logic of scsi_result_to_blk_status() in the dm layer. It works, but
it's admittedly not the cleanest possible approach. OTOH, SG_IO on dm
devices isn't the cleanest thing to do in general ;-)
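To make that concrete, the decision dm has to make in the SG_IO path
boils down to something like the following (a very rough sketch; the
real scsi_result_to_blk_status() also looks at the SCSI status byte and
distinguishes more host byte codes):

#include <scsi/scsi.h>		/* DID_* host byte codes */
#include <scsi/scsi_cmnd.h>	/* host_byte() -- the layering violation in a nutshell */

static bool sg_io_result_is_path_error(int result)
{
	switch (host_byte(result)) {
	case DID_OK:
		return false;	/* the target answered; sense data is not a path problem */
	case DID_NO_CONNECT:
	case DID_BUS_BUSY:
	case DID_TIME_OUT:
	case DID_TRANSPORT_DISRUPTED:
	case DID_TRANSPORT_FAILFAST:
		return true;	/* transport-level failure: fail the path */
	default:
		return false;	/* be conservative about everything else */
	}
}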
> If you think it does, is there another reason why you didn't try this
> before?
It didn't occur to me back then that we could fail paths without
retrying in the kernel.
Perhaps we could have the sg driver pass the blk_status_t (which is
available on the sg level) to device mapper somehow in the sg_io_hdr
structure? That way we could entirely avoid the layering violation
between SCSI and dm. Not sure if that would be acceptable to Christoph,
as blk_status_t is supposed to be exclusive to the kernel. Can we find
a way to make sure it's passed to DM, but not to user space?
Alternatively (if maintainers strictly object to the error code inspection
by dm), I wonder whether we could just _always_ fail the current path
in case of errors in the dm-mpath SG_IO code path, without inspecting
the error code. Rationale: a) most potential error conditions are
treated as "path error" in block IO code path, b) qemu can still
inspect the error code and avoid retrying for errors that aren't path
errors, and c) if multipathd is running and the path has been failed
mistakenly, it will reinstate that path after just a few seconds. In
the worst case, an unintentional failover would occur. That isn't as
bad as it used to be, as active-passive configurations with explicit
TPGS are less common than in the past.
> (And it wouldn't be generic, so would we potentially have to repeat
> the exercise for other ioctls in the future?)
I don't think so. Do you have anything specific in mind?
> > Doing the failover in qemu has the general disadvantage that qemu
> > has no notion about the number of available and/or healthy paths;
> > it can thus only guess the reasonable number of retries for any
> > given I/O. But that's unavoidable, given that the idea to do
> > kernel-level failover on SG_IO is rejected.
>
> Yes, it's a bit unfortunate, but we have to work with what we have.
> QEMU doesn't even necessarily know that it's dealing with a multipath
> device, so it just has to blindly try the ioctl and see if it works.
See the paragraph about multipathd above.
Regards
Martin
[1] https://elixir.bootlin.com/linux/v6.14.6/source/include/linux/blk_types.h#L185
[2] https://elixir.bootlin.com/linux/v6.14.6/source/drivers/scsi/scsi_lib.c#L685
PS: Yet another option that Christoph has suggested in the past would
be to move away from qemu's "scsi-block", and use the generic block
reservation mechanism to allow passthrough of reservation commands from
VMs to the host. Nobody seems to have explored this option seriously so
far.