Message-ID: <Y7Etsai7gb5jv+LA@unreal>
Date: Sun, 1 Jan 2023 08:52:33 +0200
From: Leon Romanovsky <leon@...nel.org>
To: Saeed Mahameed <saeed@...nel.org>
Cc: "David S. Miller" <davem@...emloft.net>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Saeed Mahameed <saeedm@...dia.com>,
	netdev@...r.kernel.org,
	Tariq Toukan <tariqt@...dia.com>,
	Shay Drory <shayd@...dia.com>,
	Moshe Shemesh <moshe@...dia.com>
Subject: Re: [net 04/12] net/mlx5: Avoid recovery in probe flows

On Thu, Dec 29, 2022 at 10:29:58AM -0800, Saeed Mahameed wrote:
> On 29 Dec 08:33, Leon Romanovsky wrote:
> > On Wed, Dec 28, 2022 at 11:43:23AM -0800, Saeed Mahameed wrote:
> > > From: Shay Drory <shayd@...dia.com>
> > >
> > > Currently, recovery is done without considering whether the device is
> > > still in probe flow.
> > > This may lead to recovery before device have finished probed
> > > successfully. e.g.: while mlx5_init_one() is running. Recovery flow is
> > > using functionality that is loaded only by mlx5_init_one(), and there
> > > is no point in running recovery without mlx5_init_one() finished
> > > successfully.
> > >
> > > Fix it by waiting for probe flow to finish and checking whether the
> > > device is probed before trying to perform recovery.
> > >
> > > Fixes: 51d138c2610a ("net/mlx5: Fix health error state handling")
> > > Signed-off-by: Shay Drory <shayd@...dia.com>
> > > Reviewed-by: Moshe Shemesh <moshe@...dia.com>
> > > Signed-off-by: Saeed Mahameed <saeedm@...dia.com>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/health.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> > > index 86ed87d704f7..96417c5feed7 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> > > @@ -674,6 +674,12 @@ static void mlx5_fw_fatal_reporter_err_work(struct work_struct *work)
> > >  	dev = container_of(priv, struct mlx5_core_dev, priv);
> > >  	devlink = priv_to_devlink(dev);
> > >
> > > +	mutex_lock(&dev->intf_state_mutex);
> > > +	if (test_bit(MLX5_DROP_NEW_HEALTH_WORK, &health->flags)) {
> > > +		mlx5_core_err(dev, "health works are not permitted at this stage\n");
> > > +		return;
> > > +	}
> > <...>
> > Or another solution is to start health polling only when init complete.
> >
>
> Also very complex and very risky to do in rc.
> Health poll should be running on dynamic driver reloads,
> for example devlink reload, but not on first probe.. if we are going to
> start after probe then we will have to stop (sync) any
> health work before .remove, which is a locking nightmare.. we've been there
> before.

I'm afraid that my proposed solution distracted you. The real issue is
that this patch can't be correct. Let's focus on the
MLX5_DROP_NEW_HEALTH_WORK bit.

It is checked while holding different locks, so one of the locks is
wrong and not needed.

If the MLX5_DROP_NEW_HEALTH_WORK bit can't be changed after/during
queuing the work, the newly added check in
mlx5_fw_fatal_reporter_err_work will be redundant.

If the MLX5_DROP_NEW_HEALTH_WORK bit can be changed after queuing the
work, the check is racy and can have different results immediately
after releasing intf_state_mutex.

Thanks
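
The "racy check" case above can be shown with a minimal userspace sketch.
It assumes pthreads in place of the kernel mutex and bit operations, and
every name in it (state_mutex, drop_new_work, work_allowed,
forbid_new_work, snapshot_is_stale) is hypothetical, not taken from the
mlx5 driver. It only shows that a flag value read under a mutex says
nothing about the flag once the mutex is released.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for intf_state_mutex and the DROP bit. */
static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool drop_new_work;

/* Check the flag under the mutex, then drop the mutex. */
static bool work_allowed(void)
{
	bool allowed;

	pthread_mutex_lock(&state_mutex);
	allowed = !drop_new_work;
	pthread_mutex_unlock(&state_mutex);

	/* From here on, "allowed" is only a snapshot and may be stale. */
	return allowed;
}

/* Writer side: forbid new work, taking the same mutex. */
static void forbid_new_work(void)
{
	pthread_mutex_lock(&state_mutex);
	drop_new_work = true;
	pthread_mutex_unlock(&state_mutex);
}

/*
 * Take a snapshot, let a writer run in the window after the unlock
 * (modelled here by a plain call), and report whether the snapshot
 * still matches the current state.
 */
static bool snapshot_is_stale(void)
{
	bool snapshot = work_allowed();

	forbid_new_work();
	return snapshot != work_allowed();
}
```

In this sketch the writer is called sequentially to keep the result
deterministic. In a real concurrent setting the same window exists
between the unlock and any action taken on the snapshot, so the check
is only meaningful if the action itself runs under the lock.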