linux-kernel - Re: [PATCH] dma-buf: Fix missing excl fence waiting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <484ce316-55f2-b85e-a1e4-730db94f3fe3@amd.com>
Date:   Tue, 25 Feb 2020 20:11:51 +0100
From:   Christian König <christian.koenig@....com>
To:     "Pan, Xinhui" <Xinhui.Pan@....com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linaro-mm-sig@...ts.linaro.org" <linaro-mm-sig@...ts.linaro.org>,
        "dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
        "linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
        "sumit.semwal@...aro.org" <sumit.semwal@...aro.org>
Subject: Re: [PATCH] dma-buf: Fix missing excl fence waiting

Am 25.02.20 um 18:23 schrieb Daniel Vetter:
> On Sun, Feb 23, 2020 at 01:04:15PM +0100, Christian König wrote:
>> Am 23.02.20 um 12:56 schrieb Pan, Xinhui:
>>> If shared fence list is not empty, even we want to test all fences, excl fence is ignored.
>>> That is abviously wrong, so fix it.
>> Yeah that is a known issue and I completely agree with you, but other
>> disagree.
>>
>> See the shared fences are meant to depend on the exclusive fence. So all
>> shared fences must finish only after the exclusive one has finished as well.
>>
>> The problem now is that for error handling this isn't necessary true. In
>> other words when a shared fence completes with an error it is perfectly
>> possible that he does this before the exclusive fence is finished.
>>
>> I'm trying to convince Daniel that this is a problem for years :)
> I thought the consensus is that reasonable gpu schedulers and gpu reset
> code should try to make really, really sure it only completes stuff in
> sequence? That's at least my take away from the syncobj timeline
> discussion, where you convinced me we shouldn't just crash&burn.
>
> I think as long as your scheduler is competent and your gpu reset tries to
> limit damage (i.e. kill offending ctx terminally, mark everything else
> that didn't complete for re-running) we should end up with everything
> completing in sequence. I guess if you do kill a lot more stuff, then
> you'd have to push these through your scheduler as dummy jobs, i.e. they
> still wait for their dependencies, but then all they do is set the
> dma_fence error and complete it. Maybe something the common scheduler
> could do.

Yes, that's exactly how we currently implement it. But I still think 
that this is not necessary the best approach :)

Anyway Xinhui's problem turned out to be deeper. We somehow add an old 
stale fence to the dma_resv object sometimes and that can result in 
quite a bunch of problems.

I'm currently trying to hunt down what's going wrong here in more detail.

Regards,
Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: xinhui pan <xinhui.pan@....com>
>>> ---
>>>    drivers/dma-buf/dma-resv.c | 9 +++++----
>>>    1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
>>> index 4264e64788c4..44dc64c547c6 100644
>>> --- a/drivers/dma-buf/dma-resv.c
>>> +++ b/drivers/dma-buf/dma-resv.c
>>> @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence)
>>>     */
>>>    bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>    {
>>> -	unsigned seq, shared_count;
>>> +	unsigned int seq, shared_count, left;
>>>    	int ret;
>>>    	rcu_read_lock();
>>>    retry:
>>>    	ret = true;
>>>    	shared_count = 0;
>>> -	seq = read_seqcount_begin(&obj->seq);
>>> +	left = seq = read_seqcount_begin(&obj->seq);
>>>    	if (test_all) {
>>>    		unsigned i;
>>> @@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>    		struct dma_resv_list *fobj = rcu_dereference(obj->fence);
>>>    		if (fobj)
>>> -			shared_count = fobj->shared_count;
>>> +			left = shared_count = fobj->shared_count;
>>>    		for (i = 0; i < shared_count; ++i) {
>>>    			struct dma_fence *fence = rcu_dereference(fobj->shared[i]);
>>> @@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>    				goto retry;
>>>    			else if (!ret)
>>>    				break;
>>> +			left--;
>>>    		}
>>>    		if (read_seqcount_retry(&obj->seq, seq))
>>>    			goto retry;
>>>    	}
>>> -	if (!shared_count) {
>>> +	if (!left) {
>>>    		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>>>    		if (fence_excl) {