Message-ID: <20080111073314.GA24021@elte.hu>
Date: Fri, 11 Jan 2008 08:33:15 +0100
From: Ingo Molnar <mingo@...e.hu>
To: Andi Kleen <ak@...e.de>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"H. Peter Anvin" <hpa@...or.com>,
Venki Pallipadi <venkatesh.pallipadi@...el.com>,
suresh.b.siddha@...el.com, Arjan van de Ven <arjan@...radead.org>,
Dave Jones <davej@...hat.com>
Subject: Re: CPA patchset
* Ingo Molnar <mingo@...e.hu> wrote:
> > > I think you have it fundamentally backwards: the best for
> > > performance is WB + clflush. What would WC offer for performance
> > > that clflush cannot do?
> >
> > Cached requires the cache line to be read first before you can write
> > it.
>
> nonsense, and you should know it. It is perfectly possible to
> construct fully written cachelines, without reading the cacheline
> first. MOVDQ is SSE1, so it is in basically every CPU today - and it is 16
> byte aligned and can generate full cacheline writes, _without_ filling
> in the cacheline first. Bulk ops (string ops, etc.) will do full
> cacheline writes too, without filling in the cacheline. Especially
> with high performance 3D ops we do _NOT_ need any funky reads from
> anywhere because 3D software can stream a lot of writes out: we
> construct a full frame or a portion of a frame, or upload vertices or
> shader scripts, textures, etc.
>
> ( also, _even_ when there is a cache fill pending for a partially
> written cacheline, that might go on in parallel and it is not
> necessarily holding up the CPU unless it has an actual data dependency
> on that. )
btw., that's not to suggest that WC does not have its place - but it's
not where you think it is.

Write-Combining can be very useful for devices that sit behind a slow or
high-latency transport such as PCI and that are mapped UnCached today:
for example, the descriptor rings of a networking card (or storage
card). New entries in the TX ring are typically constructed on the CPU
side without needing to know anything about the ring's previous contents
(so there is no need to read them back: drivers typically shadow the
ring on the host side already, which is required for good UC performance
today), so the new descriptors can be streamed out efficiently with WC
instead of generating lots of small UC accesses. But these descriptor
rings have a constant size and a fixed place in the aperture, and they
are ioremap()ed once during device init and that's it. No high-frequency
change_page_attributes() usage is done or desired here.
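
To make the shadow-ring point concrete, here is a rough userspace sketch
of the pattern, not code from any real driver. The descriptor layout,
the names (struct tx_desc, tx_ring_post(), TX_RING_SIZE and so on) and
the ownership flag are all made up for illustration, and a plain array
stands in for the ring that a real driver would have mapped once at
device init:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TX_RING_SIZE 8u              /* hypothetical size, power of two */
#define TX_DESC_OWNED_BY_HW 0x1u     /* hypothetical ownership flag */

/* Hypothetical 16-byte descriptor; not the layout of any real device. */
struct tx_desc {
	uint64_t buf_addr;
	uint32_t len;
	uint32_t flags;
};

struct tx_ring {
	struct tx_desc shadow[TX_RING_SIZE]; /* host-side copy, ordinary cached memory */
	volatile struct tx_desc *hw;         /* stands in for the device-side ring */
	uint32_t tail;                       /* next slot to fill */
};

/* Set up once, mirroring the "mapped once during device init" case. */
static void tx_ring_init(struct tx_ring *r, volatile struct tx_desc *hw)
{
	assert(r != NULL && hw != NULL);
	memset(r->shadow, 0, sizeof(r->shadow));
	r->hw = hw;
	r->tail = 0;
}

/*
 * Build the descriptor in the shadow ring, then store every field of it
 * to the device ring. The device ring is only ever written here, never
 * read, which is what lets the stores be combined on a WC mapping.
 * A real driver would also need a write barrier and a doorbell write
 * after this; both are omitted from this sketch.
 */
static uint32_t tx_ring_post(struct tx_ring *r, uint64_t buf_addr, uint32_t len)
{
	uint32_t slot = r->tail & (TX_RING_SIZE - 1u);
	struct tx_desc *d = &r->shadow[slot];

	d->buf_addr = buf_addr;
	d->len = len;
	d->flags = TX_DESC_OWNED_BY_HW;

	/* Whole-descriptor write-out, in ascending address order. */
	r->hw[slot].buf_addr = d->buf_addr;
	r->hw[slot].len = d->len;
	r->hw[slot].flags = d->flags;

	r->tail++;
	return slot;
}
```

Anything the driver later wants to know about a posted descriptor it
reads from r->shadow[], so the slow device-side mapping stays
write-only on this path.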
Ingo