Date:	Fri, 26 Sep 2014 13:06:35 +0200
From:	Hannes Frederic Sowa <hannes@...essinduktion.org>
To:	Eric Dumazet <eric.dumazet@...il.com>,
	Tom Herbert <therbert@...gle.com>
Cc:	Jesper Dangaard Brouer <brouer@...hat.com>,
	Linux Netdev List <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	Alexander Duyck <alexander.h.duyck@...el.com>,
	Toke Høiland-Jørgensen <toke@...e.dk>,
	Florian Westphal <fw@...len.de>,
	Jamal Hadi Salim <jhs@...atatu.com>,
	Dave Taht <dave.taht@...il.com>,
	John Fastabend <john.r.fastabend@...el.com>,
	Daniel Borkmann <dborkman@...hat.com>
Subject: Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus
 transactions

On Fri, Sep 26, 2014, at 08:04, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@...gle.com>
> 
> While doing a high throughput test on a BQL enabled NIC,
> I found a very high cost in ndo_start_xmit() when accessing BQL data.
> 
> It turned out the problem was caused by the compiler trying to be
> smart, resulting in a bad MESI transaction:
> 
>   0.05 │  mov    0xc0(%rax),%edi    // LOAD dql->num_queued
>   0.48 │  mov    %edx,0xc8(%rax)    // STORE dql->last_obj_cnt = count
>  58.23 │  add    %edx,%edi
>   0.58 │  cmp    %edi,0xc4(%rax)
>   0.76 │  mov    %edi,0xc0(%rax)    // STORE dql->num_queued += count
>   0.72 │  js     bd8
> 
> 
> I got an incredible 10 % gain [1] by making sure the cpu does not
> attempt to get the cache line in Shared mode, but directly issues a
> Request For Ownership.
> 
> New code:
> 	mov    %edx,0xc8(%rax)  // STORE dql->last_obj_cnt = count
> 	add    %edx,0xc0(%rax)  // RMW   dql->num_queued += count
> 	mov    0xc4(%rax),%ecx  // LOAD dql->adj_limit
> 	mov    0xc0(%rax),%edx  // LOAD dql->num_queued
> 	cmp    %edx,%ecx
> 
> The TX completion was running from another cpu, with a high interrupt
> rate.
> 
> Note that I am using barrier() as a soft hint, as mb() here would be
> too costly.
> 
> [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
> 
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
>  include/linux/dynamic_queue_limits.h |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
> index 5621547d631b..a4be70398ce1 100644
> --- a/include/linux/dynamic_queue_limits.h
> +++ b/include/linux/dynamic_queue_limits.h
> @@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
>  {
>  	BUG_ON(count > DQL_MAX_OBJECT);
>  
> -       dql->num_queued += count;
>  	dql->last_obj_cnt = count;
> +
> +       /* We want to force a write first, so that the cpu does not
> +        * attempt to get the cache line containing last_obj_cnt,
> +        * num_queued, adj_limit in Shared state, but directly does a
> +        * Request For Ownership. It is only a hint, so we use
> +        * barrier() only.
> +        */
> +       barrier();
> +
> +       dql->num_queued += count;
>  }
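
[For context: barrier() is only a compiler barrier, while mb() is a full
hardware memory barrier, which is why barrier() is the cheap "soft hint"
here. A simplified sketch of the two (the kernel's real definitions live
in compiler.h and the per-arch barrier headers; the mfence form shown is
the x86 one):

    /* Compiler barrier: emits no instruction; it only prevents the
     * compiler from reordering loads/stores across this point.
     */
    #define barrier()   __asm__ __volatile__("" : : : "memory")

    /* Full memory barrier on x86: a real mfence instruction that
     * orders all prior loads and stores -- far more expensive.
     */
    #define mb()        asm volatile("mfence" : : : "memory")

With barrier() the cpu remains free to reorder at runtime; the patch
only needs to stop the compiler from hoisting the num_queued load above
the last_obj_cnt store.]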

I thought prefetchw() would be the canonical way to avoid write stalls
on the cpu and to prepare the specific cache line for writing.
Interesting, thanks Eric.
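
[For illustration only, a hypothetical prefetchw() variant -- not the
posted patch -- might look roughly like the sketch below. prefetchw()
asks the hardware to fetch the line in exclusive (writable) state, but
unlike barrier() it does not constrain the compiler's ordering of the
subsequent accesses:

    #include <linux/prefetch.h>

    static inline void dql_queued(struct dql *dql, unsigned int count)
    {
            BUG_ON(count > DQL_MAX_OBJECT);

            /* Hypothetical: hint the cpu to pull the cache line into
             * exclusive (writable) state before we touch it.
             */
            prefetchw(&dql->num_queued);

            dql->last_obj_cnt = count;
            dql->num_queued += count;
    }
]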

Thanks,
Hannes