linux-kernel - Re: Interacting with coherent memory on external devices

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 23 Apr 2015 12:22:46 -0400
From:	Jerome Glisse <j.glisse@...il.com>
To:	Christoph Lameter <cl@...ux.com>
Cc:	Benjamin Herrenschmidt <benh@...nel.crashing.org>,
	paulmck@...ux.vnet.ibm.com, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, jglisse@...hat.com, mgorman@...e.de,
	aarcange@...hat.com, riel@...hat.com, airlied@...hat.com,
	aneesh.kumar@...ux.vnet.ibm.com,
	Cameron Buschardt <cabuschardt@...dia.com>,
	Mark Hairgrove <mhairgrove@...dia.com>,
	Geoffrey Gerfin <ggerfin@...dia.com>,
	John McKenna <jmckenna@...dia.com>, akpm@...ux-foundation.org
Subject: Re: Interacting with coherent memory on external devices

On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> 
> > > There are hooks in glibc where you can replace the memory
> > > management of the apps if you want that.
> >
> > We don't control the app. Let's say we are doing a plugin for libfoo
> > which accelerates "foo" using GPUs.
> 
> There are numerous examples of malloc implementation that can be used for
> apps without modifying the app.

What about share memory pass btw process ? Or mmaped file ? Or
a library that is loaded through dlopen and thus had no way to
control any allocation that happen before it became active ?

> >
> > Now some other app we have no control on uses libfoo. So pointers
> > already allocated/mapped, possibly a long time ago, will hit libfoo (or
> > the plugin) and we need GPUs to churn on the data.
> 
> IF the GPU would need to suspend one of its computation thread to wait on
> a mapping to be established on demand or so then it looks like the
> performance of the parallel threads on a GPU will be significantly
> compromised. You would want to do the transfer explicitly in some fashion
> that meshes with the concurrent calculation in the GPU. You do not want
> stalls while GPU number crunching is ongoing.

You do not understand how GPU works. GPU have a pools of thread, and they
always try to have the pool as big as possible so that when a group of
thread is waiting for some memory access, there are others thread ready
to perform some operation. GPU are about hidding memory latency that's
what they are good at. But they only achieve that when they have more
thread in flight than compute unit. The whole thread scheduling is done
by hardware and barely control by the device driver.

So no having the GPU wait for a page fault is not as dramatic as you
think. If you use GPU as they are intended to use you might even never
notice the pagefault and reach close to the theoritical throughput of
the GPU nonetheless.

> 
> > The point I'm making is you are arguing against a usage model which has
> > been repeatedly asked for by large amounts of customer (after all that's
> > also why HMM exists).
> 
> I am still not clear what is the use case for this would be. Who is asking
> for this?

Everyone but you ? OpenCL 2.0 specific request it and have several level
of support about transparent address space. The lowest one is the one
implemented today in which application needs to use a special memory
allocator.

The most advance one imply integration with the kernel in which any
memory (mmaped file, share memory or anonymous memory) can be use by
the GPU and does not need to come from a special allocator.

Everyone in the industry is moving toward the most advance one. That
is the raison d'être of HMM, to provide this functionality on hw
platform that do not have things such as CAPI. Which is x86/arm.

So use case is all application using OpenCL or Cuda. So pretty much
everyone doing GPGPU wants this. I dunno how you can't see that.
Share address space is so much easier. Believe it or not most coders
do not have deep knowledge of how things work and if you can remove
the complexity of different memory allocation and different address
space from them they will be happy.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/