linux-kernel - Re: [PATCH] arch: Introduce read

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <54638003.3070801@redhat.com>
Date:	Wed, 12 Nov 2014 07:42:59 -0800
From:	Alexander Duyck <alexander.h.duyck@...hat.com>
To:	Will Deacon <will.deacon@....com>,
	Linus Torvalds <torvalds@...ux-foundation.org>
CC:	"alexander.duyck@...il.com" <alexander.duyck@...il.com>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Michael Neuling <mikey@...ling.org>,
	Tony Luck <tony.luck@...el.com>,
	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
	Peter Zijlstra <peterz@...radead.org>,
	Benjamin Herrenschmidt <benh@...nel.crashing.org>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Michael Ellerman <michael@...erman.id.au>,
	Geert Uytterhoeven <geert@...ux-m68k.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Martin Schwidefsky <schwidefsky@...ibm.com>,
	Russell King <linux@....linux.org.uk>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCH] arch: Introduce read_acquire()


On 11/12/2014 02:10 AM, Will Deacon wrote:
> On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote:
>> On Tue, Nov 11, 2014 at 10:57 AM,  <alexander.duyck@...il.com> wrote:
>>> On reviewing the documentation and code for smp_load_acquire() it occured
>>> to me that implementing something similar for CPU <-> device interraction
>>> would be worth while.  This commit provides just the load/read side of this
>>> in the form of read_acquire().
>> So I don't hate the concept, but. there's a couple of reasons to think
>> this is broken.
>>
>> One is just the name. Why do we have "smp_load_acquire()", but then
>> call the non-smp version "read_acquire()"? That makes very little
>> sense to me. Why did "load" become "read"?
> [...]
>
>> But we do have a very real difference between "smp_rmb()" (inter-cpu
>> cache coherency read barrier) and "rmb()" (full memory barrier that
>> synchronizes with IO).
>>
>> And your patch is very confused about this. In *some* places you use
>> "rmb()", and in other places you just use "smp_load_acquire()". Have
>> you done extensive verification to check that this is actually ok?
>> Because the performance difference you quote very much seems to be
>> about your x86 testing now akipping the IO-synchronizing "rmb()", and
>> depending on DMA being ordered even without it.
>>
>> And I'm pretty sure that's actually fine on x86. The real
>> IO-synchronizing rmb() (which translates into a lfence) is only needed
>> for when you have uncached accesses (ie mmio) on x86. So I don't think
>> your code is wrong, I just want to verify that everybody understands
>> the issues. I'm not even sure DMA can ever really have weaker memory
>> ordering (I really don't see how you'd be able to do a read barrier
>> without DMA stores being ordered natively), so maybe I worry too much,
>> but the ppc people in particular should look at this, because the ppc
>> memory ordering rules and serialization are some completely odd ad-hoc
>> black magic....
> Right, so now I see what's going on here. This isn't actually anything
> to do with acquire/release (I don't know of any architectures that have
> a read-barrier-acquire instruction), it's all about DMA to main memory.

Actually it is sort of, I just hadn't realized it until I read some of 
the explanations of the C11 acquire/release memory order specifics, but 
I believe most network drivers are engaged in acquire/release logic 
because we are usually using something such as a lockless descriptor 
ring to pass data back and forth between the device and the system.  The 
net win for device drivers is that we can remove some of the 
heavy-weight barriers that are having to be used by making use of 
lighter barriers or primitives such as lwsync vs sync in PowerPC or ldar 
vs dsb(ld) on arm64.

> If a device is DMA'ing data *and* control information (e.g. 'descriptor
> valid') to memory, then it must be maintaining order between those writes
> with respect to memory. In that case, using the usual MMIO barriers can
> be overkill because we really just want to enforce read-ordering on the CPU
> side. In fact, I think you could even do this with a fake address dependency
> on ARM (although I'm not actually suggesting we do that).
>
> In light of that, it actually sounds like we want a new set of barrier
> macros that apply only to DMA buffer accesses by the CPU -- they wouldn't
> enforce ordering against things like MMIO registers. I wonder whether any
> architectures would implement them differently to the smp_* flavours?

My concern would be the cost of the barriers vs the acquire/release 
primitives.  In the case of arm64 I am assuming there is a reason for 
wanting to use ldar vs dsb instructions.  I would imagine the devices 
drivers would want to get the same kind of advantages.

>> But anything with non-cache-coherent DMA is obviously very suspect too.
> I think non-cache-coherent DMA should work too (at least, on ARM), but
> only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable
> mapping).
>
> Will

For now my plan is to focus on coherent memory only with this. 
Specifically it is only really intended for use with dma_alloc_coherent.

Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/