[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ecff74ca-f3a5-4848-8836-54f1d7810aca@suse.de>
Date: Tue, 13 May 2025 08:30:55 +0200
From: Hannes Reinecke <hare@...e.de>
To: Kevin Wolf <kwolf@...hat.com>, Martin Wilck <mwilck@...e.com>
Cc: dm-devel@...ts.linux.dev, hreitz@...hat.com, mpatocka@...hat.com,
snitzer@...nel.org, bmarzins@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] dm mpath: Interface for explicit probing of active
paths
On 5/12/25 17:18, Kevin Wolf wrote:
> Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben:
>> Hello Kevin,
>>
>> [I'm sorry for the previous email. It seems that I clicked "send" in
>> the wrong window].
>>
>> On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
>>> Multipath cannot directly provide failover for ioctls in the kernel
>>> because it doesn't know what each ioctl means and which result could
>>> indicate a path error. Userspace generally knows what the ioctl it
>>> issued means and if it might be a path error, but neither does it
>>> know
>>> which path the ioctl took nor does it necessarily have the privileges
>>> to
>>> fail a path using the control device.
>>
>> Thanks for this effort.
>>
>> I have some remarks about your approach. The most important one is that
>> the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command.
>> It sends IO to possibly broken paths and waits for it to complete. In
>> the common error case of a device not responding, this might cause the
>> code to hang for a long time in the kernel ioctl code path, in
>> uninterruptible sleep (note that these probing commands will be queued
>> after other possibly hanging IO). In the past we have put a lot of
>> effort into other code paths in multipath-tools and elsewhere to avoid
>> this kind of hang to the extent possible. It seems to me that your set
>> re-introduces this risk.
>
> That's a fair point. I don't expect this code to be fully final, another
> thing that isn't really optimal (as mentioned in the comment message) is
> that we're not probing paths in parallel, potentially adding up timeouts
> from multiple paths.
>
> I don't think this is a problem of the interface, though, but we can
> improve the implementation keeping the same interface.
>
> Maybe I'm missing something, but I think the reason why it has to be
> uninterruptible is just that submit_bio_wait() has the completion on the
> stack? So I suppose we could just kmalloc() some state, submit all the
> bios, let the completion callback free it, and then we can just stop
> waiting early (i.e. wait_for_completion_interruptible/killable) and let
> the requests complete on their own in the background.
>
> Would this address your concerns or is there another part to it?
>
Not really. There are two issues which ultimately made us move away
from read I/O as path checkers:
- Spurious timeouts due to blocked or congested submission
- Error handling delays and stalls
When using normal read/write functions for path checking you are
subjected to normal I/O processing rules, ie I/O is inserted
at the end of the submission queue. If the system is busy the
path checker will time out prematurely without ever having reached
the target. One could avoid that by extending the timeout, but that
would make the path checker unreliable.
But the real issue is error handling. Path checker are precisely there
to check if the path is healthy, and as such _will_ be subjected
to error handling (or might even trigger error handling itself).
The thing about error handling is that you can only return affected
commands once error handling is complete, as only then you can be
sure that no DMA is pending on the buffers and one can free/re-use
the associated pages.
On SCSI error handling can be an _extremely_ lengthy affair
(we had reports for over one hour), and the last thing you want is
to delay your path checker for that time.
(And besides, I didn't even mention the third problem with I/O-based
path checkers, namely that the check entirely the wrong thing; we
are not interested whether we can do I/O, we are interested in whether
we can send commands. In the light of eg ALUA these are two very
different things.)
>> Apart from that, minor observations are that in patch 2/2 you don't
>> check whether the map is in queueing state, and I don't quite
>> understand why you only check paths in the map's active path group
>> without attempting a PG failover in the case where all paths in the
>> current PG fail.
>
> Ben already fixed up the missing check.
>
> Not attempting PG failover was also his suggestion, on the basis that it
> would be additional work for no real benefit when you can only submit
> requests for the current PG anyway. If userspace retries the SG_IO, it
> will already pick a different PG, so having the kernel do the same
> doesn't really improve anything.
>
Precisely.
I think the best way forward here is to have a 'post_ioctl' handler
(much like we have a pre_ioctl handler nowadays).
This would check the return code from the ioctl, and invalidate the
current path if there was an I/O error.
That way the user can resubmit the ioctl, until all paths are exhausted.
For that to work we need to agree on two error codes, one for
'path failed, retry' and one for 'path failed, no retry possible'.
Resetting path status will be delegated to multipathd, but then that's
its main task anyway.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@...e.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
Powered by blists - more mailing lists