Message-ID: <484ce316-55f2-b85e-a1e4-730db94f3fe3@amd.com>
Date: Tue, 25 Feb 2020 20:11:51 +0100
From: Christian König <christian.koenig@....com>
To: "Pan, Xinhui" <Xinhui.Pan@....com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linaro-mm-sig@...ts.linaro.org" <linaro-mm-sig@...ts.linaro.org>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
"linux-media@...r.kernel.org" <linux-media@...r.kernel.org>,
"sumit.semwal@...aro.org" <sumit.semwal@...aro.org>
Subject: Re: [PATCH] dma-buf: Fix missing excl fence waiting
On 25.02.20 at 18:23, Daniel Vetter wrote:
> On Sun, Feb 23, 2020 at 01:04:15PM +0100, Christian König wrote:
>> On 23.02.20 at 12:56, Pan, Xinhui wrote:
>>> If the shared fence list is not empty, the exclusive fence is ignored
>>> even when we want to test all fences.
>>> That is obviously wrong, so fix it.
>> Yeah, that is a known issue and I completely agree with you, but others
>> disagree.
>>
>> See, the shared fences are meant to depend on the exclusive fence. So all
>> shared fences must finish only after the exclusive one has finished as well.
>>
>> The problem now is that for error handling this isn't necessarily true. In
>> other words, when a shared fence completes with an error it is perfectly
>> possible that it does so before the exclusive fence has finished.
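>>
>> (Purely as an illustration of that assumption, not code from this patch:
>> dma_fence_is_signaled() is the real API, the helper name is made up.)
>>
>> 	/* Assumed ordering: a shared fence may only signal after the
>> 	 * exclusive fence it was added on top of has signaled. Error
>> 	 * completion is exactly the case where this can be violated. */
>> 	static bool ordering_holds(struct dma_fence *excl,
>> 				   struct dma_fence *shared)
>> 	{
>> 		if (!excl || !dma_fence_is_signaled(shared))
>> 			return true;
>>
>> 		/* shared signaled => excl must have signaled first */
>> 		return dma_fence_is_signaled(excl);
>> 	}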
>>
>> I've been trying to convince Daniel for years that this is a problem :)
> I thought the consensus is that reasonable gpu schedulers and gpu reset
> code should try to make really, really sure they only complete stuff in
> sequence? That's at least my takeaway from the syncobj timeline
> discussion, where you convinced me we shouldn't just crash&burn.
>
> I think as long as your scheduler is competent and your gpu reset tries to
> limit damage (i.e. kill offending ctx terminally, mark everything else
> that didn't complete for re-running) we should end up with everything
> completing in sequence. I guess if you do kill a lot more stuff, then
> you'd have to push these through your scheduler as dummy jobs, i.e. they
> still wait for their dependencies, but then all they do is set the
> dma_fence error and complete it. Maybe something the common scheduler
> could do.
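>
> (A minimal sketch of such a dummy-job completion, assuming the scheduler
> has already waited for the job's dependencies; dma_fence_set_error() and
> dma_fence_signal() are the real helpers, the function name is made up.)
>
> 	/* Complete a fence without running the job: record the error
> 	 * first, then signal, so everything still finishes in sequence. */
> 	static void complete_as_dummy(struct dma_fence *fence, int error)
> 	{
> 		dma_fence_set_error(fence, error);
> 		dma_fence_signal(fence);
> 	}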
Yes, that's exactly how we currently implement it. But I still think
that this is not necessarily the best approach :)
Anyway, Xinhui's problem turned out to be deeper. We sometimes somehow add
an old, stale fence to the dma_resv object, and that can result in quite a
few problems.
I'm currently trying to hunt down what's going wrong here in more detail.
Regards,
Christian.
> -Daniel
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: xinhui pan <xinhui.pan@....com>
>>> ---
>>> drivers/dma-buf/dma-resv.c | 9 +++++----
>>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
>>> index 4264e64788c4..44dc64c547c6 100644
>>> --- a/drivers/dma-buf/dma-resv.c
>>> +++ b/drivers/dma-buf/dma-resv.c
>>> @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence)
>>>   */
>>>  bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>  {
>>> -	unsigned seq, shared_count;
>>> +	unsigned int seq, shared_count, left;
>>>  	int ret;
>>> 
>>>  	rcu_read_lock();
>>>  retry:
>>>  	ret = true;
>>>  	shared_count = 0;
>>> -	seq = read_seqcount_begin(&obj->seq);
>>> +	left = seq = read_seqcount_begin(&obj->seq);
>>> 
>>>  	if (test_all) {
>>>  		unsigned i;
>>> @@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>  		struct dma_resv_list *fobj = rcu_dereference(obj->fence);
>>> 
>>>  		if (fobj)
>>> -			shared_count = fobj->shared_count;
>>> +			left = shared_count = fobj->shared_count;
>>> 
>>>  		for (i = 0; i < shared_count; ++i) {
>>>  			struct dma_fence *fence = rcu_dereference(fobj->shared[i]);
>>> @@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all)
>>>  				goto retry;
>>>  			else if (!ret)
>>>  				break;
>>> +			left--;
>>>  		}
>>> 
>>>  		if (read_seqcount_retry(&obj->seq, seq))
>>>  			goto retry;
>>>  	}
>>> 
>>> -	if (!shared_count) {
>>> +	if (!left) {
>>>  		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
>>> 
>>>  		if (fence_excl) {