linux-kernel - RE: latency

Message-ID: <3b914a515b1d4e749e58d3b46cf12b26@AcuMS.aculab.com>
Date:   Sat, 4 Dec 2021 14:34:04 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Subhashini Rao Beerisetty' <subhashbeerisetty@...il.com>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernelnewbies <kernelnewbies@...nelnewbies.org>
Subject: RE: latency

From: Subhashini Rao Beerisetty <subhashbeerisetty@...il.com>
> Sent: 03 December 2021 17:01
> 
>  [ Please keep me in CC as I'm not subscribed to the list]
> 
> Hi all,
> 
> We are using the Linux OS on an x86_64 machine. I need to measure the
> PCIe latency on my system. Does the kernel have any latency measurement
> module for the PCIe bus?

Slower than you expect :-)

Writes are asynchronous (posted), so they are really only limited by
the actual speed of the PCIe link and the rate at which the slave can
process them.
So the actual latency of writes doesn't matter and the throughput
is reasonable.

Reads are much more problematic.
While the PCIe bus allows multiple outstanding read requests, the
Intel x86 cpus I've tested will only generate one outstanding request
for each cpu core.
So buffer reads are particularly slow: each read has to wait for the
previous one to complete.

The delays between one read completing and the next read TLP being
sent are (probably) negligible compared to the other delays.
So the latency of a read is just the time the two TLPs (request and
completion) take to be transmitted over the wire (including delays
for PCIe bridges) plus the time the slave takes to generate the
response TLP.
On the fpga slaves we are using that is (from memory) about 128
cycles of the 62.5MHz clock (roughly 2us) - ie absolutely ages.

For reads you definitely need to use the largest register size
possible - each read instruction (even misaligned ones) generates
exactly one read TLP.
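
To put numbers on that, here is a small sketch of the arithmetic. It
assumes the only delay is the 128 cycles at 62.5MHz quoted above (link
and bridge delays would make things worse) and that only one read is
outstanding at a time. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Time the slave takes to generate the response TLP, in nanoseconds. */
static double slave_delay_ns(double cycles, double clock_mhz)
{
	assert(clock_mhz > 0.0);
	return cycles * 1000.0 / clock_mhz;
}

/*
 * Upper bound on read throughput in MB/s (10^6 bytes/s) when each read
 * must complete before the next one is issued.
 */
static double read_mbytes_per_sec(unsigned int bytes_per_read, double latency_ns)
{
	assert(latency_ns > 0.0);
	return bytes_per_read * 1000.0 / latency_ns;
}

/* Print the bound for 1, 2, 4 and 8 byte reads. */
static void print_read_throughput(void)
{
	double ns = slave_delay_ns(128, 62.5);	/* 2048 ns */
	unsigned int size;

	for (size = 1; size <= 8; size *= 2)
		printf("%u-byte reads: at most %.2f MB/s\n",
		       size, read_mbytes_per_sec(size, ns));
}
```

With those assumptions the bound scales linearly with the read size,
which is why the register width matters so much.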

If you are designing an interface for an fpga then consider using
writes from both sides for everything except bulk data.
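
One way to read "writes from both sides" is sketched below: the host
writes commands to a device register, and the fpga writes its status
into host memory, so the cpu polls ordinary memory instead of issuing
read TLPs. The structures, field names and sequence-number scheme are
all hypothetical, not a description of any real device. In a real
driver the register write would be writel() and the status block would
come from dma_alloc_coherent().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status block in host memory, written by the fpga. */
struct status_block {
	volatile uint32_t seq;		/* device bumps this after status */
	volatile uint32_t status;	/* device writes this first */
};

/* Hypothetical command registers in a device BAR (host writes only). */
struct cmd_regs {
	volatile uint32_t doorbell;
};

/* Host side: post a command. A write, so the cpu does not wait for it. */
static void post_command(struct cmd_regs *regs, uint32_t cmd)
{
	assert(regs);
	regs->doorbell = cmd;
}

/*
 * Host side: look for new status by reading host memory only.
 * Returns 1 and stores the status if seq has moved on, else 0.
 */
static int poll_status(const struct status_block *sb, uint32_t *last_seq,
		       uint32_t *status)
{
	uint32_t seq;

	assert(sb && last_seq && status);
	seq = sb->seq;
	if (seq == *last_seq)
		return 0;
	/* A real driver needs a read barrier here (dma_rmb() in the kernel). */
	*status = sb->status;
	*last_seq = seq;
	return 1;
}
```

The point of the design is that neither side ever waits for a read
round trip over the link for control traffic.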

You can (probably) measure the latency of your actual system using:
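	x = rdtsc();
	v = readl(addr);	/* addr: ioremap()ed device register */
	asm volatile("lfence" ::: "memory");	/* don't read the TSC until the load has completed */
	elapsed = rdtsc() - x;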
However, on cpus without an invariant TSC the TSC values depend on the
current cpu frequency (which will change 'randomly').
Or put the readl() into a loop and do enough reads that the high-res
system time deltas make sense.
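
A rough sketch of the loop approach follows. It is written as user-space
C so it can be compiled anywhere: the volatile pointer is only a
stand-in for the device register, and now_ns() and avg_read_ns() are
names invented for this example. In a driver the load would be a readl()
on an ioremap()ed address and the clock would be something like
ktime_get_ns().

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic high-res time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average time per read, in ns, over 'count' back-to-back reads of *reg.
 * The last value read is returned through 'last'.
 */
static double avg_read_ns(const volatile uint32_t *reg, unsigned int count,
			  uint32_t *last)
{
	uint64_t start, end;
	uint32_t v = 0;
	unsigned int i;

	assert(count > 0);
	start = now_ns();
	for (i = 0; i < count; i++)
		v = *reg;	/* volatile: one load per iteration */
	end = now_ns();
	*last = v;
	return (double)(end - start) / count;
}
```

If the cpu really does issue only one PCIe read at a time per core, the
average over a large count should be close to the latency of a single
read, and the cost of the two clock reads disappears into the noise.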

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)