Message-ID: <20250715055707.GC5882@unreal>
Date: Tue, 15 Jul 2025 08:57:07 +0300
From: Leon Romanovsky <leon@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Catalin Marinas <catalin.marinas@....com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
"H. Peter Anvin" <hpa@...or.com>,
Justin Stitt <justinstitt@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, linux-rdma@...r.kernel.org,
linux-s390@...r.kernel.org, llvm@...ts.linux.dev,
Ingo Molnar <mingo@...hat.com>, Bill Wendling <morbo@...gle.com>,
Nathan Chancellor <nathan@...nel.org>,
Nick Desaulniers <ndesaulniers@...gle.com>, netdev@...r.kernel.org,
Paolo Abeni <pabeni@...hat.com>,
Salil Mehta <salil.mehta@...wei.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
Yisen Zhuang <yisen.zhuang@...wei.com>,
Arnd Bergmann <arnd@...db.de>, linux-arch@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org,
Mark Rutland <mark.rutland@....com>,
Michael Guralnik <michaelgur@...lanox.com>, patches@...ts.linux.dev,
Niklas Schnelle <schnelle@...ux.ibm.com>,
Jijie Shao <shaojijie@...wei.com>, Will Deacon <will@...nel.org>
Subject: Re: [PATCH v3 6/6] IB/mlx5: Use __iowrite64_copy() for write
combining stores
On Mon, Jul 14, 2025 at 06:55:04PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2024 at 01:46:19PM -0300, Jason Gunthorpe wrote:
> > mlx5 has a built in self-test at driver startup to evaluate if the
> > platform supports write combining to generate a 64 byte PCIe TLP or
> > not. This has proven necessary because a lot of common scenarios end up
> > with broken write combining (especially inside virtual machines) and there
> > is no other way to learn this information.
> >
> > This self test has been consistently failing on new ARM64 CPU
> > designs (specifically with NVIDIA Grace's implementation of Neoverse
> > V2). The C loop around writeq() generates some pretty terrible ARM64
> > assembly, but historically this has worked on a lot of existing ARM64 CPUs
> > till now.
> >
> > We see it succeed about 1 time in 10,000 on the worst affected
> > systems. The CPU architects speculate that the load instructions
> > interspersed with the stores make the WC buffers statistically flush too
> > often and thus the generation of large TLPs becomes infrequent. This makes
> > the boot up test unreliable in that it indicates no write-combining,
> > however userspace would be fine since it uses a ST4 instruction.
>
> Hi Catalin,
>
> After a year of testing this in real systems it turns out that some
> systems are still not good enough with the unrolled 8 byte store loop.
> In my view the CPUs are quite bad here and this WC performance
> optimization is not working very well.
>
> There are only two more options to work around this issue, use the
> unrolled 16 byte STP or the single Neon instruction 64 byte store.
>
> Since STP was already rejected we've only tested the Neon version. It
> does make a huge improvement, but on rare occasions it still somehow
> fails to combine. The CPU is really bad at this :(
>
> So we want to make mlx5 use the single 64 byte neon store instruction
> like userspace has been using for a long time for this testing
> algorithm.
>
> It is simple enough, but the question has come up where to put the
> code. Do you want the neon option to live in the arch/arm64 code, or
> should we stick it in the driver under an #ifdef?
>
> The entry/exit from neon is slow enough I don't think any driver doing
> performance work would want to use neon instead of __iowrite64_copy(),
> so I do not think it should be hidden inside __iowrite64_copy(). Nor
> have I thought of a name for an arch generic function..
__iowrite64_slow_copy() ????
>
> Thanks,
> Jason
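
For readers following the thread, here is a minimal userspace sketch of the two
store strategies discussed above. It is not the mlx5 or arch/arm64 code. The
name wc_copy_64b() is hypothetical, and plain memory stands in for the
write-combining MMIO mapping. On arm64 it uses the Neon LD4/ST4 intrinsics from
<arm_neon.h>, so the 64 bytes typically go out as one ST4 store. Elsewhere it
falls back to eight unrolled 8 byte stores. An in-kernel version would also need
the Neon entry/exit bracketing mentioned in the message, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/*
 * Hypothetical helper (not a kernel or rdma-core API): copy one 64 byte
 * block from src to dst.
 */
static inline void wc_copy_64b(void *dst, const void *src)
{
	assert(dst != NULL && src != NULL);

#if defined(__aarch64__)
	/*
	 * LD4 de-interleaves into four 128-bit registers and ST4
	 * re-interleaves them, so the round trip is a plain 64 byte copy.
	 * Compilers typically emit a single ST4 for the store side.
	 */
	vst4q_u64((uint64_t *)dst, vld4q_u64((const uint64_t *)src));
#else
	/* Portable fallback: eight unrolled 8 byte stores. */
	volatile uint64_t *d = dst;
	uint64_t s[8];

	/* Load everything first so no loads sit between the stores. */
	memcpy(s, src, sizeof(s));
	d[0] = s[0];
	d[1] = s[1];
	d[2] = s[2];
	d[3] = s[3];
	d[4] = s[4];
	d[5] = s[5];
	d[6] = s[6];
	d[7] = s[7];
#endif
}
```

Both paths produce the same bytes at dst. Only the number and width of the
store instructions differ, which is what matters for whether the CPU combines
them into one 64 byte TLP.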