netdev - Re: [PATCH] net/mlx5: Added cond_resched() to crdump collection

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZsjJwcKAaKRyEHuU@ceto>
Date: Fri, 23 Aug 2024 10:41:21 -0700
From: Mohamed Khalfella <mkhalfella@...estorage.com>
To: Moshe Shemesh <moshe@...dia.com>
Cc: Przemek Kitszel <przemyslaw.kitszel@...el.com>,
	Yuanyuan Zhong <yzhong@...estorage.com>,
	Saeed Mahameed <saeedm@...dia.com>,
	Leon Romanovsky <leon@...nel.org>, Tariq Toukan <tariqt@...dia.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Shay Drori <shayd@...dia.com>, netdev@...r.kernel.org,
	linux-rdma@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] net/mlx5: Added cond_resched() to crdump collection

On 2024-08-23 08:16:32 +0300, Moshe Shemesh wrote:
> 
> 
> On 8/23/2024 7:08 AM, Przemek Kitszel wrote:
> > 
> > On 8/22/24 19:08, Mohamed Khalfella wrote:
> >> On 2024-08-22 09:40:21 +0300, Moshe Shemesh wrote:
> >>>
> >>>
> >>> On 8/21/2024 1:27 AM, Mohamed Khalfella wrote:
> >>>>
> >>>> On 2024-08-20 12:09:37 +0200, Przemek Kitszel wrote:
> >>>>> On 8/19/24 23:42, Mohamed Khalfella wrote:
> > 
> > 
> >>>>
> >>>> Putting a cond_resched() every 16 register reads, similar to
> >>>> mlx5_vsc_wait_on_flag(), should be okay. With the numbers above, this
> >>>> will result in cond_resched() every ~0.56ms, which is okay IMO.
> >>>
> >>> Sorry for the late response, I just got back from vacation.
> >>> All your measures looks right.
> >>> crdump is the devlink health dump of mlx5 FW fatal health reporter.
> >>> In the common case since auto-dump and auto-recover are default for this
> >>> health reporter, the crdump will be collected on fatal error of the mlx5
> >>> device and the recovery flow waits for it and run right after crdump
> >>> finished.
> >>> I agree with adding cond_resched(), but I would reduce the frequency,
> >>> like once in 1024 iterations of register read.
> >>> mlx5_vsc_wait_on_flag() is a bit different case as the usleep there is
> >>> after 16 retries waiting for the value to change.
> >>> Thanks.
> >>
> >> Thanks for taking a look. Once in every 1024 iterations approximately
> >> translates to 35284.4ns * 1024 ~= 36.1ms, which is relatively long time
> >> IMO. How about any power-of-two <= 128 (~4.51ms)?
> 
> OK
> > 
> > Such tune-up would matter for interactive use of the machine with very
> > little cores, is that the case? Otherwise I see no point [to make it
> > overall a little slower, as that is the tradeoff].
> 
> Yes, as I see it, the point here is host with very few cores.

It should make a difference for systems with few cores. Add to that the
numbers above is what I was able to get from the lab. It has been seen
in the field that collecting crdump takes more than 5 seconds causing
issues. If this makes sense I will submit v2 with the updated commit
message and cond_resched() every 128 iterations of register read.