linux-kernel - RE: [PATCH v2 00/25] AMDKFD kernel driver

Message-ID: <D89D60253BB73A4E8C62F9FD18A939CA01066B4D@storexdag02.amd.com>
Date:	Wed, 23 Jul 2014 13:39:39 +0000
From:	"Bridgman, John" <John.Bridgman@....com>
To:	Christian König <deathsimple@...afone.de>,
	"Gabbay, Oded" <Oded.Gabbay@....com>,
	Jerome Glisse <j.glisse@...il.com>,
	David Airlie <airlied@...ux.ie>,
	Alex Deucher <alexdeucher@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Joerg Roedel <joro@...tes.org>,
	"Lewycky, Andrew" <Andrew.Lewycky@....com>,
	"Daenzer, Michel" <Michel.Daenzer@....com>,
	"Goz, Ben" <Ben.Goz@....com>,
	"Skidanov, Alexey" <Alexey.Skidanov@....com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
	linux-mm <linux-mm@...ck.org>, "Sellek, Tom" <Tom.Sellek@....com>
Subject: RE: [PATCH v2 00/25] AMDKFD kernel driver



>-----Original Message-----
>From: Christian König [mailto:deathsimple@...afone.de]
>Sent: Wednesday, July 23, 2014 3:04 AM
>To: Gabbay, Oded; Jerome Glisse; David Airlie; Alex Deucher; Andrew
>Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel;
>Goz, Ben; Skidanov, Alexey; linux-kernel@...r.kernel.org;
>dri-devel@...ts.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>Am 23.07.2014 08:50, schrieb Oded Gabbay:
>> On 22/07/14 14:15, Daniel Vetter wrote:
>>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@....com>
>>>>> wrote:
>>>>>>> Exactly, just prevent userspace from submitting more. And if you
>>>>>>> have misbehaving userspace that submits too much, reset the gpu
>>>>>>> and tell it that you're sorry but won't schedule any more work.
>>>>>>
>>>>>> I'm not sure how you intend to know if a userspace misbehaves or
>>>>>> not. Can you elaborate ?
>>>>>
>>>>> Well that's mostly policy, currently in i915 we only have a check
>>>>> for hangs, and if userspace hangs a bit too often then we stop it.
>>>>> I guess you can do that with the queue unmapping you've described in
>>>>> reply to Jerome's mail.
>>>>> -Daniel
>>>>>
>>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks
>>>> if a gpu job takes more than 2 seconds, I think, and if so,
>>>> terminates the job).
>>>
>>> Essentially yes. But we also have some hw features to kill jobs
>>> quicker, e.g. for media workloads.
>>> -Daniel
>>>
>>
>> Yeah, so this is what I'm talking about when I say that you and Jerome
>> come from a graphics POV and amdkfd comes from a compute POV, no
>> offense intended.
>>
>> For compute jobs, we simply can't use this logic to terminate jobs.
>> Graphics are mostly real-time, while compute jobs can take from a few
>> ms to a few hours!!! And I'm not talking about an entire application
>> runtime but about a single submission of jobs by the userspace app. We
>> have tests with jobs that take between 20 and 30 minutes to complete. In
>> theory, we can even imagine a compute job which takes 1 or 2 days (on
>> larger APUs).
>>
>> Now, I understand the question of how we prevent the compute job
>> from monopolizing the GPU, and internally here we have some ideas that
>> we will probably share in the next few days, but my point is that I
>> don't think we can terminate a compute job because it is running for
>> more than x seconds. That would be like terminating a CPU process
>> because it runs for more than x seconds.
>
>Yeah, that's why one of the first things I did was make the timeout
>configurable in the radeon module.
>
>But it doesn't necessarily need to be a timeout; we should also kill a running
>job submission if the CPU process associated with the job is killed.
>
>> I think this is a *very* important discussion (detecting a misbehaved
>> compute process) and I would like to continue it, but I don't think
>> moving the job submission from userspace control to kernel control
>> will solve this core problem.
>
>We need to get this topic solved, otherwise the driver won't make it
>upstream. Allowing userspace to monopolize resources, whether memory,
>CPU or GPU time, or special things like counters, is a strict no-go for a
>kernel module.
>
>I agree that moving the job submission from userspace to kernel wouldn't
>solve this problem. As Daniel and I have pointed out multiple times, it's
>rather easy to prevent further job submissions from userspace, in the
>worst case by unmapping the doorbell page.
>
>Moving it to an IOCTL would just make it a bit less complicated.

Hi Christian;

HSA uses usermode queues so that programs running on the GPU can dispatch work to themselves or to other GPUs, with a consistent dispatch mechanism for CPU and GPU code. We could potentially use s_msg and trap every GPU dispatch back through CPU code, but that gets slow and ugly very quickly.
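
As a rough illustration of the usermode queue model described above, here is a toy Python simulation. It is not amdkfd or HSA runtime code; every name in it (UserModeQueue, submit, drain, unmap_doorbell, DoorbellUnmapped) is hypothetical and chosen only for this sketch. It assumes a fixed-size ring buffer plus a doorbell value, shows a packet enqueueing follow-up work without any trip through the kernel, and shows how revoking the doorbell (as suggested earlier in the thread) stops further submissions.

```python
class DoorbellUnmapped(Exception):
    """Raised when a producer rings a doorbell that has been revoked."""


class UserModeQueue:
    """Toy model of a user-mode dispatch queue (hypothetical, not a real API)."""

    def __init__(self, size):
        self.ring = [None] * size    # fixed-size ring buffer shared by producers and consumer
        self.size = size
        self.write_index = 0         # advanced by producers (CPU or GPU code)
        self.read_index = 0          # advanced by the consumer (stands in for the hardware)
        self.doorbell = 0            # last write index announced to the consumer
        self.doorbell_mapped = True  # cleared by the "kernel" to block new submissions

    def submit(self, packet):
        """Enqueue one packet and ring the doorbell; no system call is modeled."""
        if not self.doorbell_mapped:
            raise DoorbellUnmapped("doorbell page was unmapped")
        if self.write_index - self.read_index == self.size:
            raise BufferError("queue full")
        self.ring[self.write_index % self.size] = packet
        self.write_index += 1
        self.doorbell = self.write_index

    def unmap_doorbell(self):
        """Model the kernel revoking the doorbell mapping."""
        self.doorbell_mapped = False

    def drain(self):
        """Consume packets up to the doorbell value.

        A callable packet models a running job: it receives the queue and
        may submit follow-up work itself before returning its own label.
        """
        done = []
        while self.read_index < self.doorbell:
            packet = self.ring[self.read_index % self.size]
            self.read_index += 1
            if callable(packet):
                packet = packet(self)
            done.append(packet)
        return done


def parent_job(queue):
    # The job dispatches more work through the same mechanism a CPU producer uses.
    queue.submit("child")
    return "parent"


if __name__ == "__main__":
    q = UserModeQueue(4)
    q.submit(parent_job)
    print(q.drain())        # prints ['parent', 'child']
    q.unmap_doorbell()
    try:
        q.submit("late")
    except DoorbellUnmapped as exc:
        print("rejected:", exc)
```

The point of the sketch is only that producer and consumer share one submission path; routing each dispatch through the CPU would add a round trip at the submit() step.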

>
>Christian.
>
>>
>>     Oded
