netdev - Re: [RFC net-next] tcp: allow larger TSO to be built under overload

Date:   Tue, 8 Mar 2022 14:26:01 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     Jakub Kicinski <kuba@...nel.org>, netdev <netdev@...r.kernel.org>,
        Willem de Bruijn <willemb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [RFC net-next] tcp: allow larger TSO to be built under overload

On Tue, Mar 8, 2022 at 2:12 PM David Laight <David.Laight@...lab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 08 March 2022 19:54
> ..
> > > Which is the common side of that max_t() ?
> > > If it is min_tso_segs it might be worth avoiding the
> > > divide by coding as:
> > >
> > >         return bytes > mss_now * min_tso_segs ? bytes / mss_now : min_tso_segs;
> > >
> >
> > I think the common case is when the divide must happen.
> > Not sure if this really matters with current cpus.
>
> Last document I looked at still quoted considerable latency
> for integer divide on x86-64.
> If you get a cmov then all the instructions will just get
> queued waiting for the divide to complete.
> But a branch could easily get mispredicted.
> That is likely to hit ppc - which I don't think has a cmov?
>
> OTOH if the divide is in the ?: bit nothing probably depends
> on it for a while - so the latency won't matter.
>
> Latest figures I have are for SkylakeX:
>            u-ops   ports        latency  1/throughput
> DIV   r8   10 10   p0 p1 p5 p6   23        6
> DIV  r16   10 10   p0 p1 p5 p6   23        6
> DIV  r32   10 10   p0 p1 p5 p6   26        6
> DIV  r64   36 36   p0 p1 p5 p6   35-88     21-83
> IDIV  r8   11 11   p0 p1 p5 p6   24        6
> IDIV r16   10 10   p0 p1 p5 p6   23        6
> IDIV r32   10 10   p0 p1 p5 p6   26        6
> IDIV r64   57 57   p0 p1 p5 p6   42-95     24-90
>
> Broadwell is a bit slower.
> Note that 64bit divide is really horrid.
>
> I think that one will be 32bit - so 'only' 26 clocks
> latency.
>
> AMD Ryzen is a lot better for 64bit divides:
>               ops  latency  1/throughput
> DIV   r8/m8    1   13-16    13-16
> DIV  r16/m16   2   14-21    14-21
> DIV  r32/m32   2   14-30    14-30
> DIV  r64/m64   2   14-46    14-45
> IDIV  r8/m8    1   13-16    13-16
> IDIV r16/m16   2   13-21    14-22
> IDIV r32/m32   2   14-30    14-30
> IDIV r64/m64   2   14-47    14-45
> But less pipelining for 32bit ones.
>
> Quite how those tables actually affect real code
> is another matter - but they are guidelines about
> what is possible (if you can get the u-ops executed
> on the right ports).
>

Thanks, I think I will make sure that we use the 32bit divide then,
because the compiler might not be smart enough to detect that both
operands are < ~0U.
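
For readers following the thread, here is a minimal sketch of the two forms under discussion plus one way to keep the divide 32 bits wide. It is an illustration under stated assumptions, not the kernel code: the function names are hypothetical, the parameter names are borrowed from the expression quoted above, mss_now is assumed to be nonzero, and the 64-bit byte budget in the last helper is an assumed scenario.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-alone helpers for illustration; not the kernel code. */

/* Form 1: always divide, then take the maximum (the max_t() shape). */
uint32_t segs_always_divide(uint32_t bytes, uint32_t mss_now,
                            uint32_t min_tso_segs)
{
    uint32_t segs = bytes / mss_now;    /* 32-bit divide; mss_now != 0 */

    return segs > min_tso_segs ? segs : min_tso_segs;
}

/* Form 2: the suggested ?: shape, dividing only when the quotient is used. */
uint32_t segs_skip_divide(uint32_t bytes, uint32_t mss_now,
                          uint32_t min_tso_segs)
{
    /* Widen the product so mss_now * min_tso_segs cannot wrap. */
    uint64_t threshold = (uint64_t)mss_now * min_tso_segs;

    return bytes > threshold ? bytes / mss_now : min_tso_segs;
}

/*
 * Assumption: the byte budget arrives as a 64-bit value. Clamping it to
 * 32 bits before dividing guarantees both operands are 32-bit, so the
 * compiler has no reason to emit a 64-bit divide.
 */
uint32_t segs_from_wide_budget(uint64_t bytes, uint32_t mss_now,
                               uint32_t min_tso_segs)
{
    uint32_t clamped = bytes > UINT32_MAX ? UINT32_MAX : (uint32_t)bytes;

    return segs_always_divide(clamped, mss_now, min_tso_segs);
}

/* Both forms should agree for any nonzero mss_now. */
void check_forms_agree(uint32_t bytes, uint32_t mss_now,
                       uint32_t min_tso_segs)
{
    assert(segs_always_divide(bytes, mss_now, min_tso_segs) ==
           segs_skip_divide(bytes, mss_now, min_tso_segs));
}
```

The two forms return the same value; which one is faster depends on how often the branch is taken and how well it predicts, as discussed above.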