Message-ID: <4E2764A0.90003@hp.com>
Date: Wed, 20 Jul 2011 16:28:32 -0700
From: Rick Jones <rick.jones2@...com>
To: netdev@...r.kernel.org
Subject: Just one more byte, it is wafer thin...
One of the netperf scripts I run from time to time is the
packet_byte_script (doc/examples/packet_byte_script in the netperf
source tree, though I tweaked it locally to use omni output selectors).
The goal of that script is to measure the incremental cost of sending
another byte and/or another TCP segment. Among other things, it runs RR
tests where the request or response size is incremented. It starts at 1
byte, doubles until it would exceed the MSS, then does 1MSS, 1MSS+1,
2MSS, 2MSS+1 and 3MSS, 3MSS+1.
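As an aside, the 4344/4345 request sizes that show up below are what
3MSS and 3MSS+1 work out to with a 1500-byte MTU when TCP timestamps
are in use - the timestamps part is my assumption rather than something
the script reports:

# echo $((1500 - 20 - 20 - 12))   # MTU - IP - TCP - timestamp option
1448
# echo $((3 * 1448)) $((3 * 1448 + 1))
4344 4345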
I recently ran it between a pair of dual-processor X5650-based systems
with NICs based on the Mellanox MT26438, running as 10GbE interfaces.
The kernel is 2.6.38-8-server (natty/Ubuntu 11.04) and the driver info is:
# ethtool -i eth2
driver: mlx4_en (HP_0200000003)
version: 1.5.1.6 (August 2010)
firmware-version: 2.7.9294
bus-info: 0000:05:00.0
(Yes, that HP_mumble does broach the possibility of a local fubar. I'd
try a pure upstream driver myself, but the systems at my disposal are
somewhat locked down; I'm hoping someone with a "pure" environment can
reproduce the result, or not.)
The full output can be seen at:
ftp://ftp.netperf.org/netperf/misc/sl390_NC543i_mlx4_en_1.5.1.6_Ubuntu_11.04_A5800_56C_to_same_pab_1500mtu_20110719.csv
I wasn't entirely sure what TSO and LRO/GRO would mean for the script;
at first I thought I wouldn't get the +1 trip down the stack. The
transaction rates all looked reasonably "sane" until the 3MSS to 3MSS+1
transition, when the transaction rate dropped by something like 70% -
and stayed there as the request size was increased further in other
testing.
I looked at tcpdump traces on the sending and receiving sides - on the
receiving side, LRO/GRO had coalesced the segments into the full request
size. On the sending side, though, I was seeing one segment of 3MSS and
one of a single byte. At first I thought that perhaps something was
fubar with cwnd, but looking at traces for 2MSS(+1) and 1MSS(+1) I saw
that this is just what TSO does - it only sends integer multiples of the
MSS in a single TSO send. So, while that does interesting things to the
service demand for a given transaction size, it probably wasn't the
culprit.
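(If anyone wants to look at the same thing, the captures were along
these lines - the interface, host and capture file name are obviously
ones to substitute for your own:

# tcpdump -i eth2 -s 96 -w rr_4345.pcap host mumble.3.21 and tcp

and then a plain tcpdump -n -r rr_4345.pcap shows the 3MSS TSO send
followed by the lone 1-byte segment on the sending side.)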
It would seem that adaptive RX coalescing was the culprit. Previously,
the coalescing settings on the receiver (netserver side) were:
# ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: on TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000
rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
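For what it's worth, my mental model of what adaptive RX does with those
knobs - and this is just my reading of the generic ethtool parameters,
not the mlx4_en code - is roughly:

# sampled_pps: whatever RX packet rate the driver measures over its
# sampling interval; the variable here is purely illustrative
if [ "$sampled_pps" -lt 400000 ]; then
        echo "use the _low settings (rx-usecs-low, 0 above)"
elif [ "$sampled_pps" -gt 450000 ]; then
        echo "use the _high settings (rx-usecs-high, 128 above)"
else
        echo "something in between"
fi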
With those settings in place, netperf would look like:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 4344 1 10.00 10030.37
16384 87380
16384 87380 4345 1 10.00 3406.62
16384 87380
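For the following runs I toggled things with ethtool -C. Something along
these lines - I won't swear to the exact invocations, and option support
can vary with driver and ethtool version - first to turn adaptive RX off
entirely, and later to turn it back on with a smaller (then zero)
rx-usecs-high:

# ethtool -C eth2 adaptive-rx off                    # second run
# ethtool -C eth2 adaptive-rx on rx-usecs-high 64    # third run
# ethtool -C eth2 rx-usecs-high 0                    # fourth run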
When I switched adaptive rx off via ethtool, the drop largely went away:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 4344 1 10.00 11167.48
16384 87380
16384 87380 4345 1 10.00 10460.02
16384 87380
Now, at 11000 transactions per second, even with the request being 4
packets, that is still < 55000 packets per second, so presumably
everything should have stayed at "_low", right? Just for grins, I turned
adaptive coalescing back on, set rx-usecs-high to 64, and ran those two
points again:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 4344 1 10.00 11143.07
16384 87380
16384 87380 4345 1 10.00 5790.48
16384 87380
and just to be completely pedantic about it, set rx-usecs-high to 0:
# HDR="-P 1";for r in 4344 4345; do netperf -H mumble.3.21 -t TCP_RR
$HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 4344 1 10.00 14274.03
16384 87380
16384 87380 4345 1 10.00 13697.11
16384 87380
and got a somewhat unexpected result - I've no idea why they both went
up at that point; perhaps it was sensing "high" occasionally even in the
4344-byte request case. Still, is this suggesting that perhaps the
adaptive bits are being a bit too aggressive about sensing "high"? Over
what interval is that measurement supposed to be happening?
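For scale, even the fastest run here should have been nowhere near
pkt-rate-low, let alone pkt-rate-high. Back-of-the-envelope, guessing at
roughly five inbound packets per transaction on the netserver side
(three MSS-sized segments plus the 1-byte tail of the request, and the
ACK of its response - that per-transaction count is my estimate, not
something I measured):

# echo $((14274 * 5))
71370

which is still well under the 400000 pkt-rate-low threshold.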
rick jones