Message-ID: <1f0ffb6b-eb85-311b-90d1-702d36cb3826@amd.com>
Date: Fri, 14 Jul 2023 12:29:21 +0200
From: Christian König <christian.koenig@....com>
To: Asahi Lina <lina@...hilina.net>,
Luben Tuikov <luben.tuikov@....com>,
David Airlie <airlied@...il.com>,
Daniel Vetter <daniel@...ll.ch>,
Sumit Semwal <sumit.semwal@...aro.org>
Cc: Faith Ekstrand <faith.ekstrand@...labora.com>,
Alyssa Rosenzweig <alyssa@...enzweig.io>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
On 14.07.23 at 12:07, Asahi Lina wrote:
> On 14/07/2023 18.51, Christian König wrote:
>> On 14.07.23 at 11:44, Asahi Lina wrote:
>>> On 14/07/2023 17.43, Christian König wrote:
>>>> On 14.07.23 at 10:21, Asahi Lina wrote:
>>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>>> independently reference counted. Therefore, we can't reference the
>>>>> scheduler in the get_timeline_name() implementation.
>>>>>
>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@...hilina.net>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>> drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>> include/drm/gpu_scheduler.h              | 5 +++++
>>>>> 3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>> /*
>>>>> * Fence is from the same scheduler, only need to wait for
>>>>> - * it to be scheduled
>>>>> + * it to be scheduled.
>>>>> + *
>>>>> + * Note: s_fence->sched could have been freed and reallocated
>>>>> + * as another scheduler. This false positive case is okay, as if
>>>>> + * the old scheduler was freed all of its jobs must have
>>>>> + * signaled their completion fences.
>>>>
>>>> This is outright nonsense. As long as an entity for a scheduler exists,
>>>> it is not allowed to free up that scheduler.
>>>>
>>>> So this function can't be called like this.
>>>>
>>>>> */
>>>>> fence = dma_fence_get(&s_fence->scheduled);
>>>>> dma_fence_put(entity->dependency);
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> index ef120475e7c6..06a0eebcca10 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>>> static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>>>>> {
>>>>> struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>>>> - return (const char *)fence->sched->name;
>>>>> + return (const char *)fence->sched_name;
>>>>> }
>>>>> static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>>>>> unsigned seq;
>>>>> fence->sched = entity->rq->sched;
>>>>> + strlcpy(fence->sched_name, entity->rq->sched->name,
>>>>> + sizeof(fence->sched_name));
>>>>> seq = atomic_inc_return(&entity->fence_seq);
>>>>> dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>>>>> &fence->lock, entity->fence_context, seq);
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index e95b4837e5a3..4fa9523bd47d 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>>> * @lock: the lock used by the scheduled and the finished fences.
>>>>> */
>>>>> spinlock_t lock;
>>>>> + /**
>>>>> + * @sched_name: the name of the scheduler that owns this fence. We
>>>>> + * keep a copy here since fences can outlive their scheduler.
>>>>> + */
>>>>> + char sched_name[16];
>>>>
>>>> This just mitigates the problem, but doesn't fix it.
>>>
>>> Could you point out any remaining issues so we can fix them? Right now
>>> this absolutely *is* broken and this fixes the breakage I observed. If
>>> there are other bugs remaining, I'd like to know what they are so I
>>> can fix them.
>>>
>>>> The real issue is that the hw fence is kept around much longer than
>>>> that.
>>>
>>> As far as I know, the whole point of scheduler fences is to decouple
>>> the hw fences from the consumers.
>>
>> Well, yes and no. The decoupling is for the signaling; it doesn't
>> decouple the lifetime.
>
> When I spoke with Daniel I understood the intent was also to decouple
> the lifetime.
Yes, we discussed that a long, long time ago as well.
We *wanted* to decouple the dma_fence lifetime; it's just not done that
way because of problems with that approach.
>
>>> I already talked with Daniel about this. The current behavior is
>>> broken. These fences can live forever. It is broken to require that
>>> they outlive the driver that produced them.
>>>
>>>> In addition to that, I'm not willing to increase the scheduler fence
>>>> size like that just to decouple them from the scheduler.
>>>
>>> Did you read my explanation on the cover letter as to how this is just
>>> broken right now? We need to fix this. If you have a better suggestion
>>> I'll take it. Doing nothing is not an option.
>>
>> Well, this isn't broken at all. This works exactly as intended; you
>> just want to use it for something it wasn't made for.
>>
>> Changing scheduler fences to outlive the scheduler which issued them is
>> possible, but this is certainly a new requirement.
>>
>> Especially since we need to grab additional references to make sure that
>> the module isn't unloaded in such a case.
>
> Yes, that's a remaining issue. The fences need to grab a module
> reference to make sure drm_sched doesn't get unloaded until they're
> all really gone. I can add that in v2.
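For illustration, a rough sketch of what that module pinning could look
like -- the hook points here are an assumption, this is not part of the
posted patch:

	/* In drm_sched_fence_init(): pin the module that provides the
	 * fence ops, so they stay callable even after the scheduler is
	 * torn down. (Hypothetical placement.) */
	__module_get(THIS_MODULE);

	/* In the final free path, e.g. drm_sched_fence_free_rcu(): drop
	 * the reference once the fence itself is really gone. */
	module_put(THIS_MODULE);
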
You also need to come up with an idea to prevent races with the deadline
handling. See drm_sched_fence_set_deadline_finished().
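Roughly, the pattern that needs care there is of this shape (a simplified
sketch, details of the real function elided):

	static void drm_sched_fence_set_deadline_finished(struct dma_fence *f,
							  ktime_t deadline)
	{
		struct drm_sched_fence *fence = to_drm_sched_fence(f);
		struct dma_fence *parent;
		unsigned long flags;

		spin_lock_irqsave(&fence->lock, flags);
		fence->deadline = deadline;
		spin_unlock_irqrestore(&fence->lock, flags);

		/* This read of the hw fence races with whoever installs or
		 * drops fence->parent once the reference is no longer held
		 * for the fence's whole lifetime. */
		parent = smp_load_acquire(&fence->parent);
		if (parent)
			dma_fence_set_deadline(parent, deadline);
	}
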
>
> It would also be desirable to drop the hw fence as soon as it signals,
> instead of keeping a reference to it forever.
Yeah, agreed. The problem here, again, is that this is easier said than
done in a non-racy way.
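E.g., a hypothetical callback that drops the hw fence on signal (the
parent_cb field is made up here) immediately runs into exactly that:

	static void sched_fence_parent_signaled(struct dma_fence *hw_fence,
						struct dma_fence_cb *cb)
	{
		struct drm_sched_fence *fence =
			container_of(cb, struct drm_sched_fence, parent_cb);

		/* Clearing and putting the parent here is the easy part;
		 * every other reader of fence->parent (like the deadline
		 * path above) must now handle the pointer going NULL under
		 * it, via RCU or a lock. */
		dma_fence_put(xchg(&fence->parent, NULL));
	}
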
Christian.
>
> ~~ Lina
>