Message-ID: <alpine.LRH.2.02.1808070939320.6020@file01.intranet.prod.int.rdu2.redhat.com>
Date: Tue, 7 Aug 2018 10:07:28 -0400 (EDT)
From: Mikulas Patocka <mpatocka@...hat.com>
To: David Laight <David.Laight@...LAB.COM>
cc: "'Ard Biesheuvel'" <ard.biesheuvel@...aro.org>,
Ramana Radhakrishnan <ramana.gcc@...glemail.com>,
Florian Weimer <fweimer@...hat.com>,
Thomas Petazzoni <thomas.petazzoni@...e-electrons.com>,
GNU C Library <libc-alpha@...rceware.org>,
Andrew Pinski <pinskia@...il.com>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will.deacon@....com>,
Russell King <linux@...linux.org.uk>,
LKML <linux-kernel@...r.kernel.org>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>
Subject: RE: framebuffer corruption due to overlapping stp instructions on
arm64
On Mon, 6 Aug 2018, David Laight wrote:
> From: Mikulas Patocka
> > Sent: 05 August 2018 15:36
> > To: David Laight
> ...
> > There's an instruction movntdqa (and vmovntdqa) that can actually do
> > prefetch on write-combining memory type. It's the only instruction that
> > can do it.
> >
> > If this instruction is used on a non-write-combining memory type, it
> > behaves like movdqa.
> >
> ...
> > I benchmarked it on a processor with ERMS - for writes to the framebuffer,
> > there's no difference between memcpy, 8-byte writes, rep stosb, rep stosq,
> > mmx, sse, avx - all these methods achieve 16-17 GB/s
>
> The combination of write-combining, posted writes and a fast PCIe slave
> are probably why there is little difference.
>
> > For reading from the framebuffer:
> > 323 MB/s - memcpy (using avx2)
> > 91 MB/s - explicit 8-byte reads
> > 249 MB/s - rep movsq
> > 307 MB/s - rep movsb
>
> You must be getting the ERMS hardware optimised 'rep movsb'.
>
> > 90 MB/s - mmx
> > 176 MB/s - sse
> > 4750 MB/s - sse movntdqa
> > 330 MB/s - avx
>
> avx512 is probably faster still.
>
> > 5369 MB/s - avx vmovntdqa
> >
> > So - it may make sense to introduce a function memcpy_from_framebuffer()
> > that uses movntdqa or vmovntdqa on CPUs that support it.
>
> For kernel space it ought to be just memcpy_fromio().
I meant for userspace. Unaccelerated scrolling is still painfully slow
even on modern computers because framebuffer reads are slow. If glibc
provided a function memcpy_from_framebuffer() that used movntdqa, and the
fbdev Xorg driver used it, it would help users who run unaccelerated
drivers for whatever reason.
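A userspace sketch of such a helper, assuming SSE4.1 is available, could
look like the following. The function name, the 16-byte alignment and
size requirements, and the fence choice are illustrative assumptions on
my part, not a proposed glibc API:

```c
#include <stddef.h>
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (movntdqa) */

/*
 * Hypothetical memcpy_from_framebuffer(): copy from a WC-mapped
 * framebuffer using streaming loads. On WC memory, movntdqa can fetch
 * a full write-combining buffer; on other memory types it behaves like
 * movdqa. Assumes dst and src are 16-byte aligned and n is a multiple
 * of 16.
 */
__attribute__((target("sse4.1")))
static void memcpy_from_framebuffer(void *dst, const volatile void *src,
                                    size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < n / 16; i++) {
        /* streaming (non-temporal) load */
        __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
        _mm_store_si128(&d[i], v);
    }
    _mm_mfence(); /* order the streaming loads against later accesses */
}
```

On ordinary cacheable memory this degrades gracefully to plain 16-byte
loads, so it is safe to call even when the mapping is not actually
write-combining.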
> Can you easily repeat the tests using a non-write-combining map of the
> same PCIe slave?
I mapped the framebuffer as uncached and these are the results:
reading from the framebuffer:
318 MB/s - memcpy
74 MB/s - explicit 8-byte reads
73 MB/s - rep movsq
11 MB/s - rep movsb
87 MB/s - mmx
173 MB/s - sse
173 MB/s - sse movntdqa
323 MB/s - avx
284 MB/s - avx vmovntdqa
zeroing the framebuffer:
19 MB/s - memset
154 MB/s - explicit 8-byte writes
152 MB/s - rep stosq
19 MB/s - rep stosb
152 MB/s - mmx
306 MB/s - sse
621 MB/s - avx
copying data to the framebuffer:
618 MB/s - memcpy (using avx2)
152 MB/s - explicit 8-byte writes
139 MB/s - rep movsq
17 MB/s - rep movsb
154 MB/s - mmx
305 MB/s - sse
306 MB/s - sse movntdqa
619 MB/s - avx
619 MB/s - avx movntdqa
> I can probably run the same measurements against our rather leisurely
> FPGA based PCIe slave.
> IIRC PCIe reads happen every 128 clocks of the card's 62.5MHz clock,
> increasing the size of the registers makes a significant difference.
> I've not tried mapping write-combining and using (v)movntdaq.
> I'm not sure what effect write-combining would have if the whole BAR
> were mapped that way - so I'll either have to map the physical addresses
> twice or add in another BAR.
>
> David
Mikulas