netdev - RE: [RFC net-next] tcp: allow larger TSO to be built under overload


Message-ID: <218fd4946208411b90ac77cfcf7aa643@AcuMS.aculab.com>
Date:   Tue, 8 Mar 2022 22:12:06 +0000
From:   David Laight <David.Laight@...LAB.COM>
To:     'Eric Dumazet' <edumazet@...gle.com>
CC:     Jakub Kicinski <kuba@...nel.org>, netdev <netdev@...r.kernel.org>,
        "Willem de Bruijn" <willemb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        "Yuchung Cheng" <ycheng@...gle.com>
Subject: RE: [RFC net-next] tcp: allow larger TSO to be built under overload

From: Eric Dumazet
> Sent: 08 March 2022 19:54
..
> > Which is the common side of that max_t() ?
> > If it is min_tso_segs it might be worth avoiding the
> > divide by coding as:
> >
> >         return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
> >
> 
> I think the common case is when the divide must happen.
> Not sure if this really matters with current cpus.

The last document I looked at still quoted considerable latency
for integer divide on x86-64.
If you get a cmov then all the dependent instructions will just get
queued waiting for the divide to complete.
But a branch could easily get mispredicted.
That is likely to hit ppc - which I don't think has a cmov?

OTOH if the divide is in one arm of the ?: then probably nothing
depends on it for a while - so the latency won't matter.
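
For reference, a minimal sketch of the two forms being compared.
The function names are made up for illustration and are not the
kernel's; it assumes 32bit unsigned operands, a non-zero mss_now and
no overflow in mss_now * min_tso_segs. How the compiler actually
lowers either form (cmov or branch) depends on the compiler and
target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: always divide, then take the larger value.
 * This is what a max_t(u32, bytes / mss_now, min_tso_segs) does. */
uint32_t segs_divide_first(uint32_t bytes, uint32_t mss_now,
                           uint32_t min_tso_segs)
{
        uint32_t segs = bytes / mss_now;

        return segs > min_tso_segs ? segs : min_tso_segs;
}

/* Hypothetical helper: compare first, so the divide only sits on the
 * path where its result is used. */
uint32_t segs_compare_first(uint32_t bytes, uint32_t mss_now,
                            uint32_t min_tso_segs)
{
        return bytes > mss_now * min_tso_segs ? bytes / mss_now
                                              : min_tso_segs;
}

/* Check that both forms agree for one set of inputs. */
void check_same(uint32_t bytes, uint32_t mss_now, uint32_t min_tso_segs)
{
        assert(segs_divide_first(bytes, mss_now, min_tso_segs) ==
               segs_compare_first(bytes, mss_now, min_tso_segs));
}
```

The two agree because bytes > mss_now * min_tso_segs implies
bytes / mss_now >= min_tso_segs, and otherwise the quotient cannot
exceed min_tso_segs.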

The latest figures I have are for Skylake-X:
            u-ops   ports         latency  1/throughput
DIV   r8    10 10   p0 p1 p5 p6   23       6
DIV  r16    10 10   p0 p1 p5 p6   23       6
DIV  r32    10 10   p0 p1 p5 p6   26       6
DIV  r64    36 36   p0 p1 p5 p6   35-88    21-83
IDIV  r8    11 11   p0 p1 p5 p6   24       6
IDIV r16    10 10   p0 p1 p5 p6   23       6
IDIV r32    10 10   p0 p1 p5 p6   26       6
IDIV r64    57 57   p0 p1 p5 p6   42-95    24-90

Broadwell is a bit slower.
Note that 64bit divide is really horrid.

I think the divide here will be 32bit - so 'only' 26 clocks
latency.

AMD Ryzen is a lot better for 64bit divides:
               ops  latency  1/throughput
DIV   r8/m8    1    13-16    13-16
DIV  r16/m16   2    14-21    14-21
DIV  r32/m32   2    14-30    14-30
DIV  r64/m64   2    14-46    14-45
IDIV  r8/m8    1    13-16    13-16
IDIV r16/m16   2    13-21    14-22
IDIV r32/m32   2    14-30    14-30
IDIV r64/m64   2    14-47    14-45
But less pipelining for 32bit ones.

Quite how those tables actually affect real code
is another matter - but they are guidelines about
what is possible (if you can get the u-ops executed
on the right ports).

	David
