Message-ID: <20110225105552.GB3666@debian-wegner1.datadisplay.de>
Date:	Fri, 25 Feb 2011 11:55:52 +0100
From:	Wolfgang Wegner <ww-ml@....de>
To:	linux-kernel@...r.kernel.org
Subject: mmap()ed PCI write performance question/problem

Hi list,

I am having some problems getting decent write performance from an mmap()ed
PCI device. The device in question is an FPGA core connected to an ARM
Kirkwood CPU via an 88SB2211 PCIe->PCI bridge. To get an idea of what
is going on, I also tested some plain old graphics cards in an x86 PC, which
also showed "strange" (i.e. I have no explanation yet) behaviour.

My tests:
Load a simple, i.e. unaccelerated, frame buffer driver and use a very
basic test program which does nothing more than map the framebuffer
memory and write to it, measuring the time the write takes.
I used different write methods, but apart from the STMIA for ARM, which
forces a burst write, I could not see a difference.
Unfortunately I did not get the FPGA evaluation board to run in a PC,
not even using the bridge evaluation board; somehow the BIOS did not
like it at all and got stuck. Most PCI graphics cards I found had the
same performance (2 Matrox PCI, 1 Matrox PCIe with internal PCIe->PCI
bridge, one ATI Radeon); only a very old ATI card was much faster,
for no reason apparent to me. A really new 16-lane PCIe
card was also faster (around 165 MBytes/s), but this one obviously is
not really useful for comparison.

What puzzles me:
- why do different graphics cards behave so differently when there is
  no obvious difference in lspci and both use the same,
  framebuffer-internal, mmap function?
- why is it possible to write-combine(?) the single writes in the
  PC and not on the ARM platform?
- what does the old (I am tempted to call it stone-age) ATI card do
  to get this great write performance?
- or, last but not least, is my measurement crap?

Below are my tests. I hope I did not leave out any relevant information;
sorry it is so huge. As I did not see any difference in the lspci
parameters, I posted only one of the "slow" examples.
The values on the ARM are ideal values, obtained with fixed register values
to write. As soon as I have memory accesses in between to read "real"
data, the performance drops to around half in the case of STM.

(I already asked on the ARM kernel mailing list which gave me the
hint of using STM, but no background on what might happen on the
PCI, and I did not yet have the PC comparison at that time:
http://comments.gmane.org/gmane.linux.ports.arm.kernel/89561 )

Regards,
Wolfgang


Test program:
[open, mmap, ...]
	gettimeofday(&tv1, NULL);
#if 1
  for (i = MEMSIZE/32; i; i--) {
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
  }
#elif 0
  memset(fbp, 0xff, MEMSIZE);
#else
  /* ARM only! */
  {
        register long __r0 asm("r0") = fill_val;
        register long __r1 asm("r1") = fill_val;
        register long __r2 asm("r2") = fill_val;
        register long __r3 asm("r3") = fill_val;
        register long __r4 asm("r4") = fill_val;
        register long __r5 asm("r5") = fill_val;
        register long __r6 asm("r6") = fill_val;
        register long __r7 asm("r7") = fill_val;
        for (i = MEMSIZE/32; i; i--) {
                asm volatile(
                        "stmia %0!, {r0 - r7}"
                        : "+r" (fbp)
                        : "r" (__r0), "r" (__r1), "r" (__r2), "r" (__r3), "r" (__r4), "r" (__r5), "r" (__r6), "r" (__r7));
        }
  }
#endif
	gettimeofday(&tv2, NULL);
	diff = tv2.tv_sec - tv1.tv_sec;
	diff *= 1000000;
	diff += tv2.tv_usec - tv1.tv_usec;
	printf("wrote %d bytes in %d microseconds, %d Bytes/s\n",
		MEMSIZE, diff, (int)((double)MEMSIZE * 1000000.0 / (double)diff));
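
For anyone who wants to repeat the measurement, the unrolled-store variant
and the throughput arithmetic can be sketched as two small helpers. This is
only a sketch under assumptions: the names timed_fill() and
bytes_per_second() are made up for illustration, the pointer is assumed to
address 32-bit words, clock_gettime(CLOCK_MONOTONIC) is swapped in for the
gettimeofday() used above, and the open()/mmap() part is left out just as
in the excerpt. The volatile qualifier is an assumption too; it is there so
the compiler keeps the eight separate stores.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Throughput in bytes per second; returns 0 if no time was measured. */
static uint64_t bytes_per_second(uint64_t bytes, uint64_t usec)
{
    if (usec == 0)
        return 0;
    return bytes * 1000000ULL / usec;
}

/*
 * Fill `bytes` bytes at dst with fill_val using 32-bit stores, eight per
 * loop iteration (32 bytes, as in the test program above), and return
 * the elapsed time in microseconds.
 * Hypothetical helper: dst would be the mmap()ed framebuffer pointer.
 */
static uint64_t timed_fill(volatile uint32_t *dst, size_t bytes,
                           uint32_t fill_val)
{
    struct timespec t1, t2;
    int64_t ns;
    size_t i;

    assert(bytes % 32 == 0);    /* the loop writes 32 bytes per pass */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = bytes / 32; i; i--) {
        dst[0] = fill_val;
        dst[1] = fill_val;
        dst[2] = fill_val;
        dst[3] = fill_val;
        dst[4] = fill_val;
        dst[5] = fill_val;
        dst[6] = fill_val;
        dst[7] = fill_val;
        dst += 8;
    }
    clock_gettime(CLOCK_MONOTONIC, &t2);

    /* Work in signed nanoseconds so a tv_nsec wrap is handled. */
    ns = (int64_t)(t2.tv_sec - t1.tv_sec) * 1000000000LL
       + (int64_t)(t2.tv_nsec - t1.tv_nsec);
    return (uint64_t)(ns / 1000);
}
```

As a cross-check of the arithmetic, bytes_per_second(8388608, 201079)
evaluates to 41717971, which is the STM figure reported in the ARM results.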


Test results, first for ARM and then for PC (x86):

ARM (Kirkwood, OpenRD-base, Marvell 88SB2211 PCIe->PCI eval board,
     Lattice EC/ECP standard evaluation board Rev. B,
     kernel 2.6.36)
00:01.0 Class 0604: Device 11ab:2211 (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: 00001000-00004fff
        Memory behind bridge: e0000000-e3ffffff
        Secondary status: 66MHz+ FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity+ SERR- NoISA- VGA- MAbort+ >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
                Bridge: PM- B3+
        Capabilities: [48] Express (v1) PCI/PCI-X Bridge, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- BrConfRtry-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <256ns, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-

01:08.0 Class ff00: Device 1731:0101 (rev 10)
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B- ParErr+ DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Region 0: Memory at e0000000 (32-bit, non-prefetchable) [size=64M]
        Region 1: I/O ports at <unassigned> [disabled]
        Kernel driver in use: ArtistaNET-III frame buffer driver

Driver: simple framebuffer driver, using framebuffer's internal mmap()

STM: around 42 MBytes/s
	wrote 8388608 bytes in 201079 microseconds, 41717971 Bytes/s
memset: around 6.2 MBytes/s
	wrote 8388608 bytes in 1346533 microseconds, 6229782 Bytes/s
for (i = MEMSIZE/32; i; i--): around 6.2 MBytes/s
	wrote 8388608 bytes in 1346277 microseconds, 6230967 Bytes/s




PC (Gigabyte GA-P35-S3G, Intel Core2 Duo CPU E6850  @ 3.00GHz,
    Debian 6.0, kernel 2.6.32-5-686)

root@...-test:~# lspci -vvx -s 04:00.0
04:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage Pro 215GP (rev 5c) (prog-if 00 [VGA controller])
	Subsystem: ATI Technologies Inc Rage Pro Turbo
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32 (2000ns min), Cache Line Size: 32 bytes
	Region 0: Memory at f9000000 (32-bit, prefetchable) [size=16M]
	Region 1: I/O ports at d000 [size=256]
	Region 2: Memory at f8000000 (32-bit, non-prefetchable) [size=4K]
	[virtual] Expansion ROM at f7000000 [disabled] [size=128K]
	Kernel driver in use: atyfb
00: 02 10 50 47 87 00 80 02 5c 00 00 03 08 20 00 00
10: 08 00 00 f9 01 d0 00 00 00 00 00 f8 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 02 10 50 47
30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 08 00

memset: around 106 MBytes/s
	wrote 2097152 bytes in 19811 microseconds, 105857957 Bytes/s
for (i = MEMSIZE/32; i; i--): around 106 MBytes/s
	wrote 2097152 bytes in 19673 microseconds, 106600518 Bytes/s


root@...-test:~# lspci -vvx -s 04:00.0
04:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA 1064SG [Mystique] (rev 02) (prog-if 00 [VGA controller])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping+ SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 12
	Region 0: Memory at f3000000 (32-bit, non-prefetchable) [size=16K]
	Region 1: Memory at f6000000 (32-bit, prefetchable) [size=8M]
	Region 2: Memory at f4000000 (32-bit, non-prefetchable) [size=8M]
	[virtual] Expansion ROM at f6800000 [disabled] [size=64K]
	Kernel driver in use: matroxfb
00: 2b 10 1a 05 87 00 80 02 02 00 00 03 00 20 00 00
10: 00 00 00 f3 08 00 00 f6 00 00 00 f4 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff
30: 00 00 00 00 00 00 00 00 00 00 00 00 0c 01 00 00

memset: around 26.5 MBytes/s
	wrote 2097152 bytes in 78682 microseconds, 26653516 Bytes/s
for (i = MEMSIZE/32; i; i--): around 26.5 MBytes/s
	wrote 2097152 bytes in 78677 microseconds, 26655210 Bytes/s
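
One detail that can be read straight out of the hex dumps above is the
prefetchable bit of each BAR. On x86 a prefetchable memory region is a
candidate for a write-combining mapping; whether atyfb or matroxfb actually
sets one up here is not shown by these tests, so treat this only as a
pointer for where to look. The sketch below decodes a BAR from the first
32 bytes of config space as printed by lspci -x. The struct and function
names are made up for illustration; the bit layout (bit 0 = I/O space,
bits 2:1 = memory type, bit 3 = prefetchable) is the standard PCI one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First 0x20 bytes of config space, copied from the lspci -x dumps above. */
static const uint8_t rage_pro_cfg[0x20] = {
    0x02, 0x10, 0x50, 0x47, 0x87, 0x00, 0x80, 0x02,
    0x5c, 0x00, 0x00, 0x03, 0x08, 0x20, 0x00, 0x00,
    0x08, 0x00, 0x00, 0xf9, 0x01, 0xd0, 0x00, 0x00,
    0x00, 0x00, 0x00, 0xf8, 0x00, 0x00, 0x00, 0x00,
};
static const uint8_t mystique_cfg[0x20] = {
    0x2b, 0x10, 0x1a, 0x05, 0x87, 0x00, 0x80, 0x02,
    0x02, 0x00, 0x00, 0x03, 0x00, 0x20, 0x00, 0x00,
    0x00, 0x00, 0x00, 0xf3, 0x08, 0x00, 0x00, 0xf6,
    0x00, 0x00, 0x00, 0xf4, 0x00, 0x00, 0x00, 0x00,
};

/* Hypothetical helper type: the decoded fields of one BAR. */
struct bar_info {
    bool     is_io;         /* bit 0: I/O space instead of memory */
    bool     is_64bit;      /* memory BARs only: type field == 2  */
    bool     prefetchable;  /* memory BARs only: bit 3            */
    uint32_t base;          /* address with the flag bits masked  */
};

/* Decode BAR number n (0..5) from a little-endian config space dump. */
static struct bar_info decode_bar(const uint8_t *cfg, size_t len, unsigned n)
{
    size_t off = 0x10 + 4u * n;
    struct bar_info info = { false, false, false, 0 };
    uint32_t raw;

    assert(n < 6 && off + 4 <= len);    /* BAR must lie inside the dump */

    raw = (uint32_t)cfg[off]
        | (uint32_t)cfg[off + 1] << 8
        | (uint32_t)cfg[off + 2] << 16
        | (uint32_t)cfg[off + 3] << 24;

    if (raw & 0x1u) {
        info.is_io = true;
        info.base = raw & ~0x3u;
    } else {
        info.is_64bit = ((raw >> 1) & 0x3u) == 0x2u;
        info.prefetchable = (raw & 0x8u) != 0;
        info.base = raw & ~0xFu;
    }
    return info;
}
```

With the dumps above, BAR 0 of the Rage Pro decodes as prefetchable memory
at f9000000 and BAR 1 of the Mystique as prefetchable memory at f6000000,
matching the "Region" lines. The FPGA's region 0 on the ARM board is listed
as non-prefetchable, which may or may not be part of the difference.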


