Date:	Wed, 11 Feb 2015 00:46:01 +0200
From:	"Kirill A. Shutemov" <kirill@...temov.name>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	"Wang, Yalin" <Yalin.Wang@...ymobile.com>,
	"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Gao, Neil" <Neil.Gao@...ymobile.com>
Subject: Re: [RFC V2] test_bit before clear files_struct bits

On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
> <akpm@...ux-foundation.org> wrote:
> >
> > The patch is good but I'm still wondering if any CPUs can do this
> > speedup for us.  The CPU has to pull in the target word to modify the
> > bit and what it *could* do is to avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
> 
> Sadly, no CPU I know of actually does this.  Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM that has conditional writes).
> 
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine is from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty state for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second state).

That all sounds reasonable.

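For reference, the pattern under discussion is essentially this (my
paraphrase of the RFC's idea, not the exact patch):

	/* Test the bit first and store only when it is actually set,
	 * so that clearing an already-clear bit never dirties the
	 * cacheline. */
	static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
	{
		if (test_bit(fd, fdt->open_fds))
			__clear_bit(fd, fdt->open_fds);
	}
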
But I still fail to understand why my micro-benchmark is faster with a
branch before the store than with a plain store.

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case there would be no intermediate shared state, because there is
nobody else to share the cache line with. So with the branch we would get
the same E->M transition and write-back to memory as without it. But that
doesn't explain why the branch makes the code faster.

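In outline, the comparison is something like this (a self-contained
userspace stand-in, not the original benchmark from the link; sizes and
names are made up). The bitmap is pre-filled with ones, so the branch is
always taken and both loops perform the same stores:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	#define NBITS	(64UL * 1024 * 1024)	/* 8 MiB bitmap */
	static uint64_t bitmap[NBITS / 64];

	/* Plain store: always write the cleared word back. */
	static void clear_plain(unsigned long bit)
	{
		bitmap[bit / 64] &= ~(1UL << (bit % 64));
	}

	/* Branch before store: write only when the bit is set. */
	static void clear_branch(unsigned long bit)
	{
		if (bitmap[bit / 64] & (1UL << (bit % 64)))
			bitmap[bit / 64] &= ~(1UL << (bit % 64));
	}

	static double run(void (*clear)(unsigned long))
	{
		struct timespec a, b;
		unsigned long i;

		memset(bitmap, 0xff, sizeof(bitmap));	/* every bit set */
		clock_gettime(CLOCK_MONOTONIC, &a);
		for (i = 0; i < NBITS; i++)
			clear(i);
		clock_gettime(CLOCK_MONOTONIC, &b);
		return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	}

	int main(void)
	{
		printf("plain store : %f s\n", run(clear_plain));
		printf("branch first: %f s\n", run(clear_branch));
		return 0;
	}
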
Any ideas?

-- 
 Kirill A. Shutemov
