Message-ID: <063D6719AE5E284EB5DD2968C1650D6D1749E538@AcuExch.aculab.com>
Date: Fri, 26 Sep 2014 13:16:00 +0000
From: David Laight <David.Laight@...LAB.COM>
To: David Laight <David.Laight@...LAB.COM>,
'Eric Dumazet' <eric.dumazet@...il.com>,
Tom Herbert <therbert@...gle.com>
CC: Jesper Dangaard Brouer <brouer@...hat.com>,
Linux Netdev List <netdev@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>,
"Alexander Duyck" <alexander.h.duyck@...el.com>,
Toke Høiland-Jørgensen <toke@...e.dk>,
Florian Westphal <fw@...len.de>,
Jamal Hadi Salim <jhs@...atatu.com>,
Dave Taht <dave.taht@...il.com>,
John Fastabend <john.r.fastabend@...el.com>,
"Daniel Borkmann" <dborkman@...hat.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>
Subject: RE: [net-next PATCH 1/1 V4] qdisc: bulk dequeue support for qdiscs
with TCQ_F_ONETXQUEUE
From: David Laight
> From: Eric Dumazet
> > On Wed, 2014-09-24 at 19:12 -0700, Eric Dumazet wrote:
> ...
> > It turned out the problem I noticed was caused by compiler trying to be
> > smart, but involving a bad MESI transaction.
> >
> > 0.05 mov 0xc0(%rax),%edi // LOAD dql->num_queued
> > 0.48 mov %edx,0xc8(%rax) // STORE dql->last_obj_cnt = count
> > 58.23 add %edx,%edi
> > 0.58 cmp %edi,0xc4(%rax)
> > 0.76 mov %edi,0xc0(%rax) // STORE dql->num_queued += count
> > 0.72 js bd8
> >
> >
> > I get an incredible 10 % gain by making sure cpu wont get the cache line
> > in Shared mode.
>
> That is a stunning difference between requesting 'exclusive' access
> and upgrading 'shared' to exclusive.
> Stinks of a cpu bug?
>
> Or is the reported stall a side effect of waiting for the earlier
> 'cache line read' to complete in order to issue the 'upgrade to exclusive'.
> In which case gcc's instruction scheduler probably needs to be taught
> to schedule writes before reads.
Thinking further.
gcc is probably moving memory reads before writes on the assumption that
the cpu might stall waiting for the read to complete, whereas the write
can be buffered by the hardware.
That assumption is true for simple cpus (like the Nios2), but on x86,
with its multiple instructions in flight (etc), it probably makes little
difference if the memory is already in the cache.
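As a stand-alone illustration (only my guess at the shape of the code
from the disassembly above, not the actual kernel source; the struct
layout and function name are made up):

struct dql {
	unsigned int num_queued;	/* 0xc0 in the disassembly */
	unsigned int adj_limit;		/* 0xc4 */
	unsigned int last_obj_cnt;	/* 0xc8 */
};

/* Source order: a plain store, then a read-modify-write of the same
 * cache line.  Nothing stops gcc scheduling the load of num_queued
 * ahead of the store to last_obj_cnt, so the first access the cpu
 * issues to the line is a read and the line arrives Shared; the
 * later stores then have to upgrade it to Exclusive/Modified.
 */
void queued(struct dql *dql, unsigned int count)
{
	dql->last_obj_cnt = count;	/* store */
	dql->num_queued += count;	/* load + add + store */
}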
OTOH it looks as though there is a big hit if you read then write
a non-present cache line.
(This may even depend on which instructions get executed in parallel;
minor changes to the code could easily change that.)
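FWIW, two sketches of ways to make sure the first transaction on the
line is a request-for-ownership rather than a shared read (guesses only,
not necessarily what Eric's change does; names as in the sketch above):

struct dql { unsigned int num_queued, adj_limit, last_obj_cnt; };

/* 1) Compiler barrier (the user-space spelling of the kernel's
 *    barrier()) so the load cannot be scheduled ahead of the store. */
void queued_write_first(struct dql *dql, unsigned int count)
{
	dql->last_obj_cnt = count;
	asm volatile("" ::: "memory");
	dql->num_queued += count;
}

/* 2) Prefetch the line for writing before touching it, so it is
 *    fetched with ownership even if the load still comes first. */
void queued_prefetchw(struct dql *dql, unsigned int count)
{
	__builtin_prefetch(&dql->num_queued, 1);	/* 1 = prefetch for write */
	dql->last_obj_cnt = count;
	dql->num_queued += count;
}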
I wonder how easy it would be to modify gcc to remove (or even reverse)
that memory ordering 'optimisation'.
David