Message-ID: <20150210224601.GA6170@node.dhcp.inet.fi>
Date: Wed, 11 Feb 2015 00:46:01 +0200
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Wang, Yalin" <Yalin.Wang@...ymobile.com>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Gao, Neil" <Neil.Gao@...ymobile.com>
Subject: Re: [RFC V2] test_bit before clear files_struct bits

On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
> <akpm@...ux-foundation.org> wrote:
> >
> > The patch is good but I'm still wondering if any CPUs can do this
> > speedup for us. The CPU has to pull in the target word to modify the
> > bit and what it *could* do is to avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
>
> Sadly, no CPU I know of actually does this. Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM that has conditional writes).
>
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine is from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty state for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second state).

That all sounds reasonable.
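
For reference, the pattern from the patch is essentially a branch before
the store. A rough sketch, not the exact diff (fd and fdt->open_fds stand
in for whatever bit the patch touches):

	/*
	 * Read first; write, and thereby dirty the cache line,
	 * only if the bit is actually set.
	 */
	if (test_bit(fd, fdt->open_fds))
		__clear_bit(fd, fdt->open_fds);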

But I still fail to understand why my micro-benchmark is faster with a
branch before the store compared to a plain store:

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have an intermediate shared state, because there
is nobody else to share the cache line with. So with the branch we would
have the same E->M transition and write-back to memory as without the
branch. But that doesn't explain why the branch makes the code faster.
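
Simplified, the two variants look like this (a minimal sketch; map and
NBITS are placeholders, the real code is behind the link above):

	unsigned long map[NBITS / BITS_PER_LONG];
	int i;

	/* variant 1: plain store, dirties each word unconditionally */
	for (i = 0; i < NBITS; i++)
		__clear_bit(i, map);

	/* variant 2: branch before store, reads the word and writes
	 * only when the bit is actually set */
	for (i = 0; i < NBITS; i++)
		if (test_bit(i, map))
			__clear_bit(i, map);
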
Any ideas?
--
Kirill A. Shutemov