Date:   Mon, 20 Mar 2017 07:51:01 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Herbert Xu <herbert@...dor.apana.org.au>,
        David Miller <davem@...emloft.net>, elena.reshetova@...el.com,
        keescook@...omium.org, netdev@...r.kernel.org,
        bridge@...ts.linux-foundation.org, linux-kernel@...r.kernel.org,
        kuznet@....inr.ac.ru, jmorris@...ei.org, kaber@...sh.net,
        stephen@...workplumber.org, ishkamiel@...il.com,
        dwindsor@...il.com, akpm@...ux-foundation.org
Subject: Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to
 refcount_t

On Mon, 2017-03-20 at 14:40 +0100, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> > On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> > >
> > > So what bench/setup do you want ran?
> > 
> > You can start by counting how many cycles an atomic op takes
> > vs. how many cycles this new code takes.
> 
> On what uarch?
> 
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
> 
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

If atomic ops are converted one by one, it is likely that results will
be noise.

Can we really start a global conversion without having a way to enable
the debugging selectively?

Then, adopting this fine infra would really not be a problem.

Some arches have an efficient atomic_inc() (no full barriers), while a
"load + test + atomic_cmpxchg() + test + loop" sequence is more expensive.
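
To make the comparison concrete, here is a minimal userspace sketch of the
two shapes using C11 <stdatomic.h>. It is not the kernel's refcount_t code:
plain_inc(), checked_inc() and demo() are hypothetical names, and the
seq_cst ordering on the compare-exchange is only an approximation of the
full ordering that the kernel's atomic_cmpxchg() implies.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Plain bump: a single atomic read-modify-write with no ordering implied. */
static void plain_inc(atomic_uint *v)
{
    atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
}

/*
 * Checked bump: load + test + cmpxchg + test + loop.
 * Refuses to increment a count that is already zero and stays pinned
 * once the count reaches UINT_MAX (assumed saturation policy).
 */
static bool checked_inc(atomic_uint *v)
{
    unsigned int old = atomic_load_explicit(v, memory_order_relaxed);

    do {
        if (old == 0)
            return false;   /* object already released */
        if (old == UINT_MAX)
            return true;    /* saturated: leave the value alone */
        /* On failure the CAS reloads 'old', so both tests run again. */
    } while (!atomic_compare_exchange_weak_explicit(v, &old, old + 1,
                                                    memory_order_seq_cst,
                                                    memory_order_relaxed));
    return true;
}

/* Small usage example: returns the final count. */
static unsigned int demo(void)
{
    atomic_uint ref = 1;
    bool ok;

    plain_inc(&ref);            /* 1 -> 2 */
    ok = checked_inc(&ref);     /* 2 -> 3 */
    assert(ok);
    return atomic_load(&ref);
}
```

The checked variant does strictly more work per call, and on an arch where
the compare-exchange carries full barriers that extra cost is paid on every
successful increment, not only under contention.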

PowerPC has no efficient atomic_inc() and this definitely shows on
network intensive workloads involving concurrent cores/threads.

atomic_cmpxchg() on PowerPC is horribly more expensive because of the
added two SYNC instructions.

Networking performance is quite poor on PowerPC as of today.

