Message-ID: <18248cc6f411441c8a68a55f68416150@AcuMS.aculab.com>
Date: Fri, 23 Feb 2024 13:52:37 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Jason Gunthorpe' <jgg@...dia.com>
CC: 'Niklas Schnelle' <schnelle@...ux.ibm.com>, Alexander Gordeev
<agordeev@...ux.ibm.com>, Andrew Morton <akpm@...ux-foundation.org>,
Christian Borntraeger <borntraeger@...ux.ibm.com>, Borislav Petkov
<bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Gerald Schaefer
<gerald.schaefer@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>, "Heiko
Carstens" <hca@...ux.ibm.com>, "H. Peter Anvin" <hpa@...or.com>, Justin Stitt
<justinstitt@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Leon Romanovsky
<leon@...nel.org>, "linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"linux-s390@...r.kernel.org" <linux-s390@...r.kernel.org>,
"llvm@...ts.linux.dev" <llvm@...ts.linux.dev>, Ingo Molnar
<mingo@...hat.com>, Bill Wendling <morbo@...gle.com>, Nathan Chancellor
<nathan@...nel.org>, Nick Desaulniers <ndesaulniers@...gle.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>, Paolo Abeni
<pabeni@...hat.com>, Salil Mehta <salil.mehta@...wei.com>, Jijie Shao
<shaojijie@...wei.com>, Sven Schnelle <svens@...ux.ibm.com>, Thomas Gleixner
<tglx@...utronix.de>, "x86@...nel.org" <x86@...nel.org>, Yisen Zhuang
<yisen.zhuang@...wei.com>, Arnd Bergmann <arnd@...db.de>, Catalin Marinas
<catalin.marinas@....com>, Leon Romanovsky <leonro@...lanox.com>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, Mark Rutland <mark.rutland@....com>,
Michael Guralnik <michaelgur@...lanox.com>, "patches@...ts.linux.dev"
<patches@...ts.linux.dev>, Will Deacon <will@...nel.org>
Subject: RE: [PATCH 4/6] arm64/io: Provide a WC friendly __iowriteXX_copy()
From: Jason Gunthorpe
> Sent: 23 February 2024 13:03
>
> On Fri, Feb 23, 2024 at 12:19:24PM +0000, David Laight wrote:
>
> > Since writes get 'posted' all over the place.
> > How many writes do you need to do before write-combining makes a
> > difference?
>
> The issue is that the HW can optimize if the entire transaction is
> presented in one TLP, if it has to reassemble the transaction it takes
> a big slow path hit.
Ah, so you aren't optimising to reduce the number of TLPs for
(effectively) a write to a memory buffer, but have a PCIe slave
that really wants to see (for example) the writes for a ring buffer
entry in a single TLP?
So you really want something that (should) generate a 16 (or 32)
byte TLP, rather than abusing the function that is expected to
generate multiple 8 byte TLPs to generate larger TLPs?
I'm guessing that on arm64 the ldp/stp instructions will generate
a single 16 byte TLP regardless of write combining?
They would definitely help memcpy_fromio().
Are they enough for arm64?
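As a rough illustration of the stp idea above (a sketch only, not code
from the patch series): store_pair64() and copy_pairs64() are
hypothetical names, and whether a single stp really reaches the device
as one 16 byte TLP depends on the CPU and interconnect, which is
exactly the open question. On non-arm64 builds the helper falls back
to two plain 8 byte stores.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API: store two 64-bit values to
 * 16 consecutive bytes at dst (dst is assumed to be 16-byte aligned
 * when it points at device memory).
 */
static inline void store_pair64(volatile uint64_t *dst, uint64_t lo, uint64_t hi)
{
#if defined(__aarch64__)
	/* One stp instruction writes both registers. */
	__asm__ __volatile__("stp %x1, %x2, [%0]"
			     :
			     : "r"(dst), "r"(lo), "r"(hi)
			     : "memory");
#else
	/* Portable fallback: two separate 8-byte stores. */
	dst[0] = lo;
	dst[1] = hi;
#endif
}

/* Copy 'pairs' 16-byte units from normal memory at src to dst. */
static void copy_pairs64(volatile uint64_t *dst, const uint64_t *src, size_t pairs)
{
	size_t i;

	for (i = 0; i < pairs; i++)
		store_pair64(dst + 2 * i, src[2 * i], src[2 * i + 1]);
}
```

A 32 byte ring entry would then be written with copy_pairs64(entry, src, 2).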
Getting big TLPs on x86 is probably harder.
(Unless you use AVX512 registers and aligned accesses.)
It is rather a shame that there isn't an efficient way to get
access to a couple of large SIMD registers.
(eg save them on the stack and have the FPU code know where they are
for a lazy FPU switch.)
There is quite a bit of code that would benefit, but kernel_fpu_begin()
is just too expensive.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)