linux-kernel - Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active paths

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aCOKyZU5T8HxGzM_@redhat.com>
Date: Tue, 13 May 2025 14:09:13 -0400
From: Benjamin Marzinski <bmarzins@...hat.com>
To: Hannes Reinecke <hare@...e.de>
Cc: Kevin Wolf <kwolf@...hat.com>, Martin Wilck <mwilck@...e.com>,
        dm-devel@...ts.linux.dev, hreitz@...hat.com, mpatocka@...hat.com,
        snitzer@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active
 paths

On Tue, May 13, 2025 at 08:30:55AM +0200, Hannes Reinecke wrote:
> On 5/12/25 17:18, Kevin Wolf wrote:
> > Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben:
> > > Hello Kevin,
> > > 
> > > [I'm sorry for the previous email. It seems that I clicked "send" in
> > > the wrong window].
> > > 
> > > On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
> > > > Multipath cannot directly provide failover for ioctls in the kernel
> > > > because it doesn't know what each ioctl means and which result could
> > > > indicate a path error. Userspace generally knows what the ioctl it
> > > > issued means and if it might be a path error, but neither does it
> > > > know
> > > > which path the ioctl took nor does it necessarily have the privileges
> > > > to
> > > > fail a path using the control device.
> > > 
> > > Thanks for this effort.
> > > 
> > > I have some remarks about your approach. The most important one is that
> > > the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command.
> > > It sends IO to possibly broken paths and waits for it to complete. In
> > > the common error case of a device not responding, this might cause the
> > > code to hang for a long time in the kernel ioctl code path, in
> > > uninterruptible sleep (note that these probing commands will be queued
> > > after other possibly hanging IO). In the past we have put a lot of
> > > effort into other code paths in multipath-tools and elsewhere to avoid
> > > this kind of hang to the extent possible. It seems to me that your set
> > > re-introduces this risk.
> > 
> > That's a fair point. I don't expect this code to be fully final, another
> > thing that isn't really optimal (as mentioned in the comment message) is
> > that we're not probing paths in parallel, potentially adding up timeouts
> > from multiple paths.
> > 
> > I don't think this is a problem of the interface, though, but we can
> > improve the implementation keeping the same interface.
> > 
> > Maybe I'm missing something, but I think the reason why it has to be
> > uninterruptible is just that submit_bio_wait() has the completion on the
> > stack? So I suppose we could just kmalloc() some state, submit all the
> > bios, let the completion callback free it, and then we can just stop
> > waiting early (i.e. wait_for_completion_interruptible/killable) and let
> > the requests complete on their own in the background.
> > 
> > Would this address your concerns or is there another part to it?
> > 
> Not really. There are two issues which ultimately made us move away
> from read I/O as path checkers:
> 
> - Spurious timeouts due to blocked or congested submission
> - Error handling delays and stalls
> 
> When using normal read/write functions for path checking you are
> subjected to normal I/O processing rules, ie I/O is inserted
> at the end of the submission queue. If the system is busy the
> path checker will time out prematurely without ever having reached
> the target. One could avoid that by extending the timeout, but that
> would make the path checker unreliable.

I get your complaint in general, but this interface is just giving users
the ability to probe the active pathgroup by sending normal IOs to its
paths.  Presumably, they won't use it if they are already sending normal
IOs to the multipath device, since this is just duplicating the effect
of those IOs. 
 
> But the real issue is error handling. Path checker are precisely there
> to check if the path is healthy, and as such _will_ be subjected
> to error handling (or might even trigger error handling itself).
> The thing about error handling is that you can only return affected
> commands once error handling is complete, as only then you can be
> sure that no DMA is pending on the buffers and one can free/re-use
> the associated pages.
> On SCSI error handling can be an _extremely_ lengthy affair
> (we had reports for over one hour), and the last thing you want is
> to delay your path checker for that time.

I can send a patch that will keep probe_active_paths() from holding the
work_mutex. This means that probe_active_paths() won't delay
multipath_message(), so the path checkers will not be effected by this.

> (And besides, I didn't even mention the third problem with I/O-based
> path checkers, namely that the check entirely the wrong thing; we
> are not interested whether we can do I/O, we are interested in whether
> we can send commands. In the light of eg ALUA these are two very
> different things.)

I actually disagree with this. There are issues with IO based checkers,
but in general, what we care about is paths being able to do IO, not
sending commands. All the improvements on the directio checker were made
because we still need it. For instance, some SCSI devices will claim
they are usable when you run a SCSI Test Unit Ready command, but they
aren't able to actually handle IO. These paths will have their state
ping-pong back and forth as multipathd restores them and the kernel uses
them and immeditately fails them again. The solution is often to switch
multipathd to the directio checker, since it verifies that the path can
actually handle IO. It's problematic when dealing with paths that need
to be initialized before accepting IO, but that's why this patch only
checks the paths in the active pathgroup. 
 
> > > Apart from that, minor observations are that in patch 2/2 you don't
> > > check whether the map is in queueing state, and I don't quite
> > > understand why you only check paths in the map's active path group
> > > without attempting a PG failover in the case where all paths in the
> > > current PG fail.
> > 
> > Ben already fixed up the missing check.
> > 
> > Not attempting PG failover was also his suggestion, on the basis that it
> > would be additional work for no real benefit when you can only submit
> > requests for the current PG anyway. If userspace retries the SG_IO, it
> > will already pick a different PG, so having the kernel do the same
> > doesn't really improve anything.
> > 
> Precisely.
> 
> I think the best way forward here is to have a 'post_ioctl' handler
> (much like we have a pre_ioctl handler nowadays).
> This would check the return code from the ioctl, and invalidate the
> current path if there was an I/O error.
> That way the user can resubmit the ioctl, until all paths are exhausted.
> For that to work we need to agree on two error codes, one for
> 'path failed, retry' and one for 'path failed, no retry possible'.
> Resetting path status will be delegated to multipathd, but then that's
> its main task anyway.

Isn't this basically the idea that Martin had, here:
https://lore.kernel.org/all/20210628151558.2289-1-mwilck@suse.com/

If this was acceptable, I would be all for it. But we often can't even
tell if an SG_IO has failed without decoding the scsi status/sense data,
and that got shot down when Martin posted patches to do it. Thus, we
have a general solution that has nothing to do with SG_IO ioctls and
will work for any type of path device, any time a user wants to verify
the state of the active paths.

-Ben

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@...e.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich