lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YXgpgr/BFpbdMLJp@unreal>
Date:   Tue, 26 Oct 2021 19:14:58 +0300
From:   Leon Romanovsky <leon@...nel.org>
To:     Ido Schimmel <idosch@...sch.org>
Cc:     "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Ido Schimmel <idosch@...lanox.com>,
        Jiri Pirko <jiri@...lanox.com>, linux-kernel@...r.kernel.org,
        netdev@...r.kernel.org,
        syzbot+93d5accfaefceedf43c1@...kaller.appspotmail.com
Subject: Re: [PATCH net-next] netdevsim: Register and unregister devlink
 traps on probe/remove device

On Tue, Oct 26, 2021 at 05:09:47PM +0300, Ido Schimmel wrote:
> On Tue, Oct 26, 2021 at 10:18:12AM +0300, Leon Romanovsky wrote:
> > On Tue, Oct 26, 2021 at 09:51:13AM +0300, Ido Schimmel wrote:
> > 
> > <...>
> > 
> > > > 
> > > > Can you please explain why is it so important to touch devlink SW
> > > > objects, reallocate them again and again on every reload in mlxsw?
> > > 
> > > Because that's how reload was defined and implemented. A complete
> > > reload. We are not changing the semantics 4 years later.
> > 
> > Please put your emotions aside and explain me technically why are you
> > must to do it?
> 
> Already did. The current semantics are "devlink-reload provides
> mechanism to reinit driver entities, applying devlink-params and
> devlink-resources new values. It also provides mechanism to activate
> firmware."

Right, it doesn't mean that devlink should reregister itself.

> 
> And this is exactly what netdevsim and mlxsw are doing. Driver entities
> are re-initialized. Your patch breaks that as entities are not
> re-initialized, which results in user space breakage. You simply cannot
> introduce such regressions.

Again, and again. I don't want to introduce regression, and I'll fix it.
However, let's try to reach a conclusion on how to fix the current regression
properly.

> 
> > 
> > The proposed semantics was broken for last 4 years, it can even seen as
> > dead on arrival,
> 
> Again with the bombastic statements. It was "dead on arrival" like the
> notifications were "impossible"?

No, it was dead-on-arrival, because of deadlocks and kernel panics.

My search in internal bug tracker shows more than 500 bugs opened
by various teams where devlink reload is involved. It is hard to tell
which are pure devlink reload related, but when I started to work on
this series, close to 20 such bugs were assigned to me.

> 
> > because it never worked for us in real production.
> 
> Who is "us"? mlx5 that apparently decided to do its own thing?

Yes, mlx5. I don't see why you think that us not calling devlink
recursively means "own thing".

> 
> We are using reload in mlxsw on a daily basis and users are using it to
> re-partition ASIC resources and activate firmware. 

Are you really compare mlx5 deployment scale and broad of use with mlxsw?

> There are tests over netdevsim implementation that anyone can run for
> testing purposes. We also made sure to integrate it into syzkaller:
> 
> https://github.com/google/syzkaller/commit/5b49e1f605a770e8f8fcdcbd1a8ff85591fc0c8e
> https://github.com/google/syzkaller/commit/04ca72cd45348daab9d896bbec8ea4c2d13455ac
> https://github.com/google/syzkaller/commit/6930bbef3b671ae21f74007f9e59efb9b236b93f
> https://github.com/google/syzkaller/commit/d45a4d69d83f40579e74fb561e1583db1be0e294
> https://github.com/google/syzkaller/commit/510951950dc0ee69cfdaf746061d3dbe31b49fd8
> 
> Which is why the regressions you introduced were discovered so quickly.

No, it was caught because I explicitly added assertions to find misuse.

> 
> > 
> > So I'm fixing bugs without relation to when they were introduced.
> 
> We all do

Great

> 
> > 
> > For example, this fix from Jiri [1] for basic design flow was merged almost
> > two years later after devlink reload was introduced [2], or this patch from
> > Parav [3] that fixed an issue introduced year before [4].
> 
> What is your point? That code has bugs?

My point is that devlink should be decoupled from mlxsw, so mlxsw won't
call recursively to devlink when executes devlink API.

This is incorrect and for me it is a bug.

> 
> By now I have spent more time arguing with you than you spent testing
> your patches and it's clear this discussion is not going anywhere.
> 
> Are you going to send a revert or I will? This is the fourth time I'm
> asking you.

I understand your temptation to send revert, at the end it is the
easiest solution. However, I prefer to finish this discussion with
decision on how the end result in mlxsw will look like.

Let's hear Jiri and Jakub before we are rushing to revert something that
is correct in my opinion. We have whole week till merge window, and
revert takes less than 5 minutes, so no need to rush and do it before
direction is clear.

Thanks

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ