Message-ID: <8ac6cc96-8877-4ddc-b57a-2a096f446a4c@grimberg.me>
Date: Wed, 16 Apr 2025 02:35:02 +0300
From: Sagi Grimberg <sagi@...mberg.me>
To: Randy Jennings <randyj@...estorage.com>
Cc: Daniel Wagner <dwagner@...e.de>,
 Mohamed Khalfella <mkhalfella@...estorage.com>,
 Daniel Wagner <wagi@...nel.org>, Christoph Hellwig <hch@....de>,
 Keith Busch <kbusch@...nel.org>, Hannes Reinecke <hare@...e.de>,
 John Meneghini <jmeneghi@...hat.com>, linux-nvme@...ts.infradead.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout


>> What I meant was that the user can no longer set kato to be
>> arbitrarily long now that we introduce a failover dependency on it.
>>
>> We need to set a sane maximum value so that failover happens in a
>> reasonable timeframe. In other words, kato cannot be allowed to be
>> set by the user to 60 minutes. While we didn't care about that
>> before, it now means that failover may take 60+ minutes.
>>
>> Hence my request to clamp kato to a maximum absolute value in
>> seconds. My vote was 10 (2x the default), but we can also go with 30.
> Adding a maximum value for KATO makes a lot of sense to me.  This will
> help keep us away from a hung task timeout when the full delay is
> taken into account.  30 makes sense to me from the perspective that
> the maximum should be long enough to handle non-ideal situations
> functionally, but not a value that you expect people to use regularly.
>
> I think CQT should have a maximum allowed value for similar reasons.
> If we do clamp down on the CQT, we could be opening ourselves to the
> target not completely cleaning up, but it keeps us from a hung task
> timeout, and _any_ delay will help most of the time.

CQT comes from the controller, and if it is high, it effectively means
that the controller cannot handle faster failover reliably. So I think
we should leave it as is; that is the vendor's problem.
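For illustration, the kato clamp discussed above could look roughly like
the sketch below. This is not the actual patch: nvme_clamp_kato() and
NVME_KATO_MAX_SEC are hypothetical names, and the 10-second ceiling just
mirrors the "2x of the default" suggestion (the default kato being 5s).

```c
#include <assert.h>

/* Hypothetical ceiling: 2x the 5s default kato, per the suggestion above. */
#define NVME_KATO_MAX_SEC 10U

/*
 * Clamp a user-requested keep-alive timeout (in seconds) so that
 * failover, which now depends on kato expiring, stays bounded.
 */
static unsigned int nvme_clamp_kato(unsigned int requested_kato)
{
	if (requested_kato > NVME_KATO_MAX_SEC)
		return NVME_KATO_MAX_SEC;
	return requested_kato;
}
```

With a clamp like this, a user asking for a 60-minute kato would silently
(or, better, with a warning) be limited to 10s, keeping the worst-case
failover delay bounded.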

>
> CCR will not address arbitrarily long times for either because:
> 1. It is optional.
> 2. It may fail.
> 3. We still need a ceiling on the recovery time we can handle.

Yes, makes sense. If CCR fails, we need to wait until something
expires, which would be the CQT.
