Message-ID: <1f0ffb6b-eb85-311b-90d1-702d36cb3826@amd.com>
Date: Fri, 14 Jul 2023 12:29:21 +0200
From: Christian König <christian.koenig@....com>
To: Asahi Lina <lina@...hilina.net>,
Luben Tuikov <luben.tuikov@....com>,
David Airlie <airlied@...il.com>,
Daniel Vetter <daniel@...ll.ch>,
Sumit Semwal <sumit.semwal@...aro.org>
Cc: Faith Ekstrand <faith.ekstrand@...labora.com>,
Alyssa Rosenzweig <alyssa@...enzweig.io>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH 2/3] drm/scheduler: Fix UAF in drm_sched_fence_get_timeline_name
On 14.07.23 at 12:07, Asahi Lina wrote:
> On 14/07/2023 18.51, Christian König wrote:
>> On 14.07.23 at 11:44, Asahi Lina wrote:
>>> On 14/07/2023 17.43, Christian König wrote:
>>>> On 14.07.23 at 10:21, Asahi Lina wrote:
>>>>> A signaled scheduler fence can outlive its scheduler, since fences are
>>>>> independently reference counted. Therefore, we can't reference the
>>>>> scheduler in the get_timeline_name() implementation.
>>>>>
>>>>> Fixes oopses on `cat /sys/kernel/debug/dma_buf/bufinfo` when shared
>>>>> dma-bufs reference fences from GPU schedulers that no longer exist.
>>>>>
>>>>> Signed-off-by: Asahi Lina <lina@...hilina.net>
>>>>> ---
>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 7 ++++++-
>>>>> drivers/gpu/drm/scheduler/sched_fence.c  | 4 +++-
>>>>> include/drm/gpu_scheduler.h              | 5 +++++
>>>>> 3 files changed, 14 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index b2bbc8a68b30..17f35b0b005a 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -389,7 +389,12 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>>>> /*
>>>>> * Fence is from the same scheduler, only need to wait for
>>>>> - * it to be scheduled
>>>>> + * it to be scheduled.
>>>>> + *
>>>>> + * Note: s_fence->sched could have been freed and reallocated
>>>>> + * as another scheduler. This false positive case is okay, as if
>>>>> + * the old scheduler was freed all of its jobs must have
>>>>> + * signaled their completion fences.
>>>>
>>>> This is outright nonsense. As long as an entity for a scheduler exists,
>>>> it is not allowed to free up that scheduler.
>>>>
>>>> So this function can't be called like this.
>>>>
>>>>> */
>>>>> fence = dma_fence_get(&s_fence->scheduled);
>>>>> dma_fence_put(entity->dependency);
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> index ef120475e7c6..06a0eebcca10 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>>>>> @@ -68,7 +68,7 @@ static const char *drm_sched_fence_get_driver_name(struct dma_fence *fence)
>>>>> static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
>>>>> {
>>>>> struct drm_sched_fence *fence = to_drm_sched_fence(f);
>>>>> - return (const char *)fence->sched->name;
>>>>> + return (const char *)fence->sched_name;
>>>>> }
>>>>> static void drm_sched_fence_free_rcu(struct rcu_head *rcu)
>>>>> @@ -216,6 +216,8 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>>>>> unsigned seq;
>>>>> fence->sched = entity->rq->sched;
>>>>> + strlcpy(fence->sched_name, entity->rq->sched->name,
>>>>> + sizeof(fence->sched_name));
>>>>> seq = atomic_inc_return(&entity->fence_seq);
>>>>> dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>>>>> &fence->lock, entity->fence_context, seq);
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index e95b4837e5a3..4fa9523bd47d 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -305,6 +305,11 @@ struct drm_sched_fence {
>>>>> * @lock: the lock used by the scheduled and the finished fences.
>>>>> */
>>>>> spinlock_t lock;
>>>>> + /**
>>>>> + * @sched_name: the name of the scheduler that owns this fence. We
>>>>> + * keep a copy here since fences can outlive their scheduler.
>>>>> + */
>>>>> + char sched_name[16];
>>>>
>>>> This just mitigates the problem, but doesn't fix it.
>>>
>>> Could you point out any remaining issues so we can fix them? Right now
>>> this absolutely *is* broken and this fixes the breakage I observed. If
>>> there are other bugs remaining, I'd like to know what they are so I
>>> can fix them.
>>>
>>>> The real issue is that the hw fence is kept around much longer than
>>>> that.
>>>
>>> As far as I know, the whole point of scheduler fences is to decouple
>>> the hw fences from the consumers.
>>
>> Well, yes and no. The decoupling is for the signaling; it doesn't
>> decouple the lifetime.
>
> When I spoke with Daniel I understood the intent was also to decouple
> the lifetime.
Yes, we discussed that a long, long time ago as well.
We *wanted* to decouple the dma_fence lifetime; it's just not done that
way because of problems with that approach.
>
>>> I already talked with Daniel about this. The current behavior is
>>> broken. These fences can live forever. It is broken to require that
>>> they outlive the driver that produced them.
>>>
>>>> In addition to that, I'm not willing to increase the scheduler fence
>>>> size like that just to decouple them from the scheduler.
>>>
>>> Did you read my explanation on the cover letter as to how this is just
>>> broken right now? We need to fix this. If you have a better suggestion
>>> I'll take it. Doing nothing is not an option.
>>
>> Well, this isn't broken at all. This works exactly as intended; you
>> just want to use it for something it wasn't made for.
>>
>> Changing scheduler fences to outlive the scheduler which issued them is
>> possible, but this is certainly a new requirement.
>>
>> Especially since we need to grab additional references to make sure that
>> the module isn't unloaded in such a case.
>
> Yes, that's a remaining issue. The fences need to grab a module
> reference to make sure drm_sched doesn't get unloaded until they're
> all really gone. I can add that in v2.
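For illustration, a rough sketch of what that module pinning could look
like -- the hook points here are an assumption, this is not part of the
posted patch:

	/* In drm_sched_fence_init(): pin the module that provides the
	 * fence ops, so they stay callable even after the scheduler is
	 * torn down. (Hypothetical placement.) */
	__module_get(THIS_MODULE);

	/* In the final free path, e.g. drm_sched_fence_free_rcu(): drop
	 * the reference once the fence itself is really gone. */
	module_put(THIS_MODULE);
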
You also need to come up with an idea to prevent races with the deadline
handling. See drm_sched_fence_set_deadline_finished().
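Roughly, the pattern that needs care there is of this shape (a simplified
sketch, details of the real function elided):

	static void drm_sched_fence_set_deadline_finished(struct dma_fence *f,
							  ktime_t deadline)
	{
		struct drm_sched_fence *fence = to_drm_sched_fence(f);
		struct dma_fence *parent;
		unsigned long flags;

		spin_lock_irqsave(&fence->lock, flags);
		fence->deadline = deadline;
		spin_unlock_irqrestore(&fence->lock, flags);

		/* This read of the hw fence races with whoever installs or
		 * drops fence->parent once the reference is no longer held
		 * for the fence's whole lifetime. */
		parent = smp_load_acquire(&fence->parent);
		if (parent)
			dma_fence_set_deadline(parent, deadline);
	}
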
>
> It would also be desirable to drop the hw fence as soon as it signals,
> instead of keeping a reference to it forever.
Yeah, agreed. The problem here, again, is that this is easier said than
done in a non-racy way.
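E.g., a hypothetical callback that drops the hw fence on signal (the
parent_cb field is made up here) immediately runs into exactly that:

	static void sched_fence_parent_signaled(struct dma_fence *hw_fence,
						struct dma_fence_cb *cb)
	{
		struct drm_sched_fence *fence =
			container_of(cb, struct drm_sched_fence, parent_cb);

		/* Clearing and putting the parent here is the easy part;
		 * every other reader of fence->parent (like the deadline
		 * path above) must now handle the pointer going NULL under
		 * it, via RCU or a lock. */
		dma_fence_put(xchg(&fence->parent, NULL));
	}
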
Christian.
>
> ~~ Lina
>