linux-kernel - Re: CPA patchset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20080111071936.GA16175@elte.hu>
Date:	Fri, 11 Jan 2008 08:19:36 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Andi Kleen <ak@...e.de>
Cc:	linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Venki Pallipadi <venkatesh.pallipadi@...el.com>,
	suresh.b.siddha@...el.com, Arjan van de Ven <arjan@...radead.org>,
	Dave Jones <davej@...hat.com>
Subject: Re: CPA patchset

* Andi Kleen <ak@...e.de> wrote:

> > > > but that's not too smart: why dont they use WB plus cflush 
> > > > instead?
> > > 
> > > Because they need to access it WC for performance.
> > 
> > I think you have it fundamentally backwards: the best for 
> > performance is WB + cflush. What would WC offer for performance that 
> > cflush cannot do?
> 
> Cached requires the cache line to be read first before you can write 
> it.

nonsense, and you should know it. It is perfectly possible to construct 
fully written cachelines, without reading the cacheline first. MOVDQ is 
SSE1 so on basically in every CPU today - and it is 16 byte aligned and 
can generate full cacheline writes, _without_ filling in the cacheline 
first. Bulk ops (string ops, etc.) will do full cacheline writes too, 
without filling in the cacheline. Especially with high performance 3D 
ops we do _NOT_ need any funky reads from anywhere because 3D software 
can stream a lot of writes out: we construct a full frame or a portion 
of a frame, or upload vertices or shader scripts, textures, etc.

( also, _even_ when there is a cache fill pending on for a partially
  written cacheline, that might go on in parallel and it is not 
  necessarily holding up the CPU unless it has an actual data dependency 
  on that. )

but that's totally besides the point anyway. WC or WB accesses, if a 3D 
app or a driver does high-freq change_page_attr() calls, it will _lose_ 
the performance game:

> > also, it's irrelevant to change_page_attr() call frequency. Just map 
> > in everything from the card and use it. In graphics, if you remap 
> > anything on the fly and it's not a slowpath you've lost the 
> > performance game even before you began it.
> 
> The typical case would be lots of user space DRI clients supplying 
> their own buffers on the fly. There's not really a fixed pool in this 
> case, but it all varies dynamically. In some scenarios that could 
> happen quite often.

in what scenarios? Please give me in-tree examples of such high-freq 
change_page_attr() cases, where the driver authors would like to call it 
with high frequency but are unable to do it and see performance problems 
due to the WBINVD.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/