Message-ID: <1752803672.0477452-1-xuanzhuo@linux.alibaba.com>
Date: Fri, 18 Jul 2025 09:54:32 +0800
From: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
To: Andrew Lunn <andrew@...n.ch>
Cc: netdev@...r.kernel.org,
Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Wen Gu <guwen@...ux.alibaba.com>,
Philo Lu <lulie@...ux.alibaba.com>,
Lorenzo Bianconi <lorenzo@...nel.org>,
Lukas Bulwahn <lukas.bulwahn@...hat.com>,
Parthiban Veerasooran <Parthiban.Veerasooran@...rochip.com>,
Geert Uytterhoeven <geert+renesas@...der.be>,
Alexander Duyck <alexanderduyck@...com>,
Dust Li <dust.li@...ux.alibaba.com>
Subject: Re: [PATCH net-next] eea: Add basic driver framework for Alibaba Elastic Ethernet Adaptor

On Thu, 17 Jul 2025 19:27:03 +0200, Andrew Lunn <andrew@...n.ch> wrote:
> > > That is not a very good explanation. Do you see any other system in
> > > Linux where the firmware works around bugs in Linux drivers using the
> > > kernel version?
> >
> > Actually, there is one, we noticed that the ena driver has a similar mechanism.
> >
> > struct ena_admin_host_info
> >
> > >
> > > You also need to think about enterprise kernels, like RedHat,
> > > Oracle. They don't give a truthful kernel version, they have thousands
> > > of patches on top fixing, and creating bugs. How will you handle that?
> > >
> > > Please drop all this, and just fix the bugs in the driver.
> >
> >
> > Fixing bugs in Linux is, of course, the necessary work. However, if certain bugs
> > already exist and customers are using such drivers, there is a risk involved. We
> > can record these buggy versions in the DPU, and notify users via dmesg when they
> > initialize the driver.
>
> This then references the next point. What does 5.4.296 actually mean?
> Is it mainline 5.4.296? Is it Debian 5.4.296 with just a few patches
> on top? Is it Redhat with 1000s of patches on top? Is it a vendor
> patch which broke it, or is mainline broken? If the vendor broke it,
> are you going to apply workarounds in your DPU for mainline which is
> not broken? Does your DPU tell the world it is applying a workaround,
> so somebody trying to debug the issue knows the DPU is working against
> them?
>
> As you pointed out, there might be one driver amongst hundreds which
> reports the kernel version to the firmware. Does ENA actually do
> anything with it? I don't know. But since less than 1% of drivers
> actually do this, it cannot be a useful feature, because others would
> already be doing it.

We have our own distribution, "Anolis". I think we can distinguish different
distributions based on utsname()->release, and by combining that with the kernel
version and the driver version, we can locate the actual driver implementation.
Actually, this is just a reserved mechanism, designed in advance for possible
future needs. If you want me to guarantee that this method will definitely work,
I can't do that, although I personally think the mechanism should work. As an
attempt, I don't think there is a big problem with it.
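
To make the idea concrete, here is a rough userspace sketch. It is not code
from the patch. The function name classify_release(), the tag table, and the
".an" marker for Anolis are assumptions made up for illustration; the release
string itself would be what utsname()->release holds in the kernel (userspace
normally sees the same string via uname(2)).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical table mapping a marker found in the release string to the
 * distribution it is assumed to indicate. Real release strings vary, so
 * this list is illustrative only.
 */
struct distro_tag {
	const char *marker;
	const char *name;
};

static const struct distro_tag distro_tags[] = {
	{ ".an", "anolis" },    /* assumed form: "5.10.134-16.an8.x86_64" */
	{ ".el", "rhel-like" }, /* e.g. "4.18.0-477.el8.x86_64" */
};

/*
 * Guess the distribution from a kernel release string. Returns "unknown"
 * when no marker matches, which is also what a plain mainline or stable
 * release such as "5.4.296" yields.
 */
static const char *classify_release(const char *release)
{
	size_t i;

	if (!release)
		return "unknown";

	for (i = 0; i < sizeof(distro_tags) / sizeof(distro_tags[0]); i++) {
		if (strstr(release, distro_tags[i].marker))
			return distro_tags[i].name;
	}

	return "unknown";
}
```

This only recovers what the release string claims. It says nothing about
which patches are actually applied, so the result would still have to be
combined with the driver version as described above.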
>
> > However, once we've identified the problem, we would prefer for the operation to
> > time out and exit, so that we can reload the new .ko module. In this process, we
> > may adjust the module parameters to reduce the originally large timeout value,
> > forcing it to exit faster. This use case is actually very helpful during our
> > development process and significantly improves our efficiency.
>
> No module parameters. You are doing development work, just use $EDITOR
> and change the timeout.

Our use case has already been explained. We will set a long timeout to help with
issue diagnosis, and once the problem is identified, we will immediately adjust
the timeout to let the driver exit quickly. Honestly, this is a very useful
feature for us during the development process. Of course, it seems that you are
strongly opposed to it, so we will remove it in the next version.
>
> > > So you will be submitting a patch for GregKH for every single stable
> > > kernel? That will be around 5 patches, every two weeks, for the next
> > > 30 years?
> >
> > Of course we won't be doing that. Our plan is that whenever we update the code
> > — for example, fixing a bug and updating the version from 1.0.0 to 1.0.1, or
> > introducing a new feature and bumping the version to 1.0.2 — then when this
> > change is backported to stable releases, the version should also be backported
> > accordingly.
>
> So the version is useless. This has long been agreed, and we have been
> NACKing such versions for years.

OK, we will change it in the next version.

Thanks.
>
> Andrew
>