linux-kernel - Re: CPA patchset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20080111073314.GA24021@elte.hu>
Date:	Fri, 11 Jan 2008 08:33:15 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Andi Kleen <ak@...e.de>
Cc:	linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Venki Pallipadi <venkatesh.pallipadi@...el.com>,
	suresh.b.siddha@...el.com, Arjan van de Ven <arjan@...radead.org>,
	Dave Jones <davej@...hat.com>
Subject: Re: CPA patchset

* Ingo Molnar <mingo@...e.hu> wrote:

> > > I think you have it fundamentally backwards: the best for 
> > > performance is WB + cflush. What would WC offer for performance 
> > > that cflush cannot do?
> > 
> > Cached requires the cache line to be read first before you can write 
> > it.
> 
> nonsense, and you should know it. It is perfectly possible to 
> construct fully written cachelines, without reading the cacheline 
> first. MOVDQ is SSE1 so on basically in every CPU today - and it is 16 
> byte aligned and can generate full cacheline writes, _without_ filling 
> in the cacheline first. Bulk ops (string ops, etc.) will do full 
> cacheline writes too, without filling in the cacheline. Especially 
> with high performance 3D ops we do _NOT_ need any funky reads from 
> anywhere because 3D software can stream a lot of writes out: we 
> construct a full frame or a portion of a frame, or upload vertices or 
> shader scripts, textures, etc.
> 
> ( also, _even_ when there is a cache fill pending on for a partially
>   written cacheline, that might go on in parallel and it is not 
>   necessarily holding up the CPU unless it has an actual data dependency 
>   on that. )

btw., that's not to suggest that WC does not have its place - but it's 
not where you think it is.

Write-Combining can be very useful for devices that are behind a slow or 
a high-latency transport, such as PCI, and which are mapped UnCached 
today. Such as a networking card (or storage card) descriptor rings. New 
entries in the TX ring are typically constructed on the CPU side without 
knowing anything about their contents (and hence no need to read them, 
the drivers typically shadow the ring on the host side already, it is 
required for good UC performance today), so the new descriptors can be 
streamed out efficiently with WC, instead of generating lots of small UC 
accesses. But these descriptor rings have constant size, they have a 
fixed place in the aperture and are ioremap()ed once during device init 
and that's it. No high-frequency change_page_attributes() usage is done 
or desired here.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/