Message-Id: <1411729595.1776139.172032293.6FCC3EBB@webmail.messagingengine.com>
Date: Fri, 26 Sep 2014 13:06:35 +0200
From: Hannes Frederic Sowa <hannes@...essinduktion.org>
To: Eric Dumazet <eric.dumazet@...il.com>,
Tom Herbert <therbert@...gle.com>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>,
Linux Netdev List <netdev@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Toke Høiland-Jørgensen <toke@...e.dk>,
Florian Westphal <fw@...len.de>,
Jamal Hadi Salim <jhs@...atatu.com>,
Dave Taht <dave.taht@...il.com>,
John Fastabend <john.r.fastabend@...el.com>,
Daniel Borkmann <dborkman@...hat.com>
Subject: Re: [PATCH net-next] dql: dql_queued() should write first to reduce bus
transactions
On Fri, Sep 26, 2014, at 08:04, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> While doing high throughput test on a BQL enabled NIC,
> I found a very high cost in ndo_start_xmit() when accessing BQL data.
>
> It turned out the problem was caused by the compiler trying to be
> smart, resulting in a bad MESI transaction:
>
> 0.05 │ mov 0xc0(%rax),%edi // LOAD dql->num_queued
> 0.48 │ mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count
> 58.23 │ add %edx,%edi
> 0.58 │ cmp %edi,0xc4(%rax)
> 0.76 │ mov %edi,0xc0(%rax) // STORE dql->num_queued += count
> 0.72 │ js bd8
>
>
> I got an incredible 10% gain [1] by making sure the cpu does not
> attempt to get the cache line in Shared state, but directly issues
> a Request For Ownership.
>
> New code :
> mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count
> add %edx,0xc0(%rax) // RMW dql->num_queued += count
> mov 0xc4(%rax),%ecx // LOAD dql->adj_limit
> mov 0xc0(%rax),%edx // LOAD dql->num_queued
> cmp %edx,%ecx
>
> The TX completion was running from another cpu, with high interrupts
> rate.
>
> Note that I am using barrier() as a soft hint, as an mb() here would
> be too costly.
>
> [1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
> include/linux/dynamic_queue_limits.h | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/dynamic_queue_limits.h b/include/linux/dynamic_queue_limits.h
> index 5621547d631b..a4be70398ce1 100644
> --- a/include/linux/dynamic_queue_limits.h
> +++ b/include/linux/dynamic_queue_limits.h
> @@ -73,14 +73,22 @@ static inline void dql_queued(struct dql *dql, unsigned int count)
> {
> BUG_ON(count > DQL_MAX_OBJECT);
>
> - dql->num_queued += count;
> dql->last_obj_cnt = count;
> +
> + /* We want to force a write first, so that cpu do not attempt
> + * to get cache line containing last_obj_cnt, num_queued, adj_limit
> + * in Shared state, but directly does a Request For Ownership
> + * It is only a hint, we use barrier() only.
> + */
> + barrier();
> +
> + dql->num_queued += count;
> }
I thought that prefetchw() would be the canonical way to solve write
stalls in CPUs and prepare the specific cache line to be written to.
Interesting, thanks Eric.
Thanks,
Hannes