linux-kernel - Re: drm/vkms: deadlock between dev->event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87pm2lzsqi.ffs@tglx>
Date:   Wed, 13 Sep 2023 23:08:21 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc:     Rodrigo Siqueira <rodrigosiqueiramelo@...il.com>,
        Melissa Wen <melissa.srw@...il.com>,
        Maira Canal <mairacanal@...eup.net>,
        Haneen Mohammed <hamohammed.sa@...il.com>,
        Daniel Vetter <daniel@...ll.ch>,
        David Airlie <airlied@...il.com>,
        DRI <dri-devel@...ts.freedesktop.org>,
        syzkaller@...glegroups.com, LKML <linux-kernel@...r.kernel.org>,
        Hillf Danton <hdanton@...a.com>,
        Sanan Hasanov <Sanan.Hasanov@....edu>
Subject: Re: drm/vkms: deadlock between dev->event_lock and timer

On Wed, Sep 13 2023 at 09:47, Linus Torvalds wrote:
> On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa
> <penguin-kernel@...ove.sakura.ne.jp> wrote:
>>
>> Hello. A deadlock was reported in drivers/gpu/drm/vkms/ .
>> It looks like this locking pattern remains as of 6.6-rc1. Please fix.
>>
>> void drm_crtc_vblank_off(struct drm_crtc *crtc) {
>>   spin_lock_irq(&dev->event_lock);
>>   drm_vblank_disable_and_save(dev, pipe) {
>>     __disable_vblank(dev, pipe) {
>>       crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank {
>>         hrtimer_cancel(&out->vblank_hrtimer) { // Retries with dev->event_lock held until lock_hrtimer_base() succeeds.
>>           ret = hrtimer_try_to_cancel(timer) {
>>             base = lock_hrtimer_base(timer, &flags); // Fails forever because vkms_vblank_simulate() is in progress.
>
> Heh. Ok. This is clearly a bug, but it does seem to be limited to just
> the vkms driver, and literally only to the "simulate vblank" case.
>
> The worst part about it is that it's so subtle and not obvious.
>
> Some light grepping seems to show that amdgpu has almost the exact
> same pattern in its own vkms thing, except it uses
>
>         hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
>
> directly, which presumably fixes the livelock, but means that the
> cancel will fail if it's currently running.
>
> So just doing the same thing in the vkms driver probably fixes things.
>
> Maybe the vkms people need to add a flag to say "it's canceled" so
> that it doesn't then get re-enabled?  Or maybe it doesn't matter
> and/or already happens for some reason I didn't look into.

Maybe the VKMS people need to understand locking in the first place. The
first thing I saw in this code is:

static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
{
   ...
   mutex_unlock(&output->enabled_lock);

What?

Unlocking a mutex in the context of a hrtimer callback is simply
violating all mutex locking rules.

How has this code ever survived lock debugging without triggering a big
fat warning?

Thanks,

        tglx