[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <83b60736-3856-4804-b25d-d0c8cbf9b185@app.fastmail.com>
Date: Fri, 31 Jan 2025 10:37:55 +0100
From: "Arnd Bergmann" <arnd@...db.de>
To: "Segher Boessenkool" <segher@...nel.crashing.org>,
"Julian Vetter" <julian@...er-limits.org>
Cc: "Madhavan Srinivasan" <maddy@...ux.ibm.com>,
"Michael Ellerman" <mpe@...erman.id.au>,
"Nicholas Piggin" <npiggin@...il.com>,
"Christophe Leroy" <christophe.leroy@...roup.eu>,
"Naveen N Rao" <naveen@...nel.org>, linuxppc-dev@...ts.ozlabs.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] powerpc: Remove eieio() in PowerPC IO functions
On Wed, Jan 29, 2025, at 12:14, Segher Boessenkool wrote:
> On Wed, Jan 29, 2025 at 10:45:10AM +0100, Julian Vetter wrote:
>> Remove the eieio() calls in IO functions for PowerPC. While other
>> architectures permit prefetching, combining, and reordering, the eieio()
>> calls on PowerPC prevent such optimizations.
>
> Yes, and it is crucial to prevent combining, it is part of the semantics
> of these functions. This is a much bigger problem on PowerPC than on
> architectures which optimise memory accesses much less. So most other
> archs can get away with it much easier (but it is still completely wrong
> there).
Unfortunately it's not well documented what the actual behavior is
supposed to be across architectures, so part of the work that Julian
is doing is to make them behave consistently. My impression is that
we actually do want combining (and reordering) here in memcpy_fromio,
unless there are specific drivers that depend on the non-combining
behavior. My earlier observations towards this are:
- memcpy_fromio() is almost always used on RAM behind a bus, not MMIO
registers.
- there has been a push towards using instruction sequences on arm64
make sure we get the most (write) combining for the I/O string functions
to get better performance
- There is only an eieio in powerpc memcpy_fromio() but no barrier
inside the memcpy_toio() or memset_io() loops. If it was necessary
to prevent combining, this would likely be for both load and store
loops here.
- I tried to trace the history of memcpy_fromio() and found that
it was originally just a loop around readl(), which at the time
was eieio() and a pointer dereference and byteswap. The readl()
definition has changed many times after that, but memcpy_fromio()
never changed again, which makes it most likely that this was
just forgotten in the conversion rather than intentional.
If you can think of callers of memcpy_fromio() that depend on the
eieio(), we probably need to fix other architectures as well and
go back to Julian's original idea of adding a new barrier into the
common code and have an architecture specific definition for that
barrier.
For the insb()/insw()/insl() case, I think you are correct that
the eieio is required on powerpc and likely others as well, since
the CPU combining multiple reads on a single address into a single
one would clearly break the concept.
On Arm and other architectures that prevent combining of MMIO
reads based on page table flags, we don't need a barrier here,
but there is a good chance that these barriers are in fact
missing on some alpha and some mips implementations of
readsl() etc: The alpha ioread32_rep() and insl() have barriers
inside, but it also uses the generic readsl() which does not.
> You are keeping the trap;isync things, which a) have a way bigger
> performance impact, and b) are merely a debugging aid (if some i/o
> access kills the system, it will be clear where that came from). And
> that isn't even the biggest thing of course, there is a heavyweight
> sync in there as well.
The barriers around memcpy_toio()/memcpy_fromio() are a separate
question that we should address as well. I somehow thought that
other architectures had the same barriers as readl/writel around
the string functions, but it does seem like it's only powerpc
at this point.
Intuitively I would have expected that a memcpy_toio() etc
has the same barriers as a writel() around it for consistency, on
the other hand it's only powerpc that has these, and if the
functions are indeed only meant to work on RAM behind MMIO,
they would not not have to serialize against DMA the same way
that writel does because there are no side-effects.
I'm still looking for more insights on that, but if we can
agree that memcpy_{from,to}io don't need the outer barriers,
it seems best to address this at the same time as the internal
eieio() in memcpy_fromio(), provided we agree on removing
those as well.
Arnd
Powered by blists - more mailing lists