lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 31 Jul 2020 14:36:26 -0300
From:   Leandro Ribeiro <leandro.ribeiro@...labora.com>
To:     Melissa Wen <melissa.srw@...il.com>,
        Sidong Yang <realwakka@...il.com>
Cc:     Haneen Mohammed <hamohammed.sa@...il.com>,
        Rodrigo Siqueira <rodrigosiqueiramelo@...il.com>,
        David Airlie <airlied@...ux.ie>,
        Rodrigo Siqueira <Rodrigo.Siqueira@....com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        kernel-usp@...glegroups.com
Subject: Re: [PATCH] drm/vkms: add missing drm_crtc_vblank_put to the get/put
 pair on flush

Hello everybody!

I'm currently working on a writeback connector screenshooter for Weston. 
In order
to test it, I'm using VKMS with Rodrigo's writeback connector patch:
https://lkml.org/lkml/2020/5/11/449

Here is the link with the MR in Weston with more details of how I've 
tested it:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/458

The reason why I'm writing this is that in the first writeback connector 
screenshot
VKMSgets stuck. And I believe (from what I've tried to debug) that what 
happens is
that thewriteback job gets stuck in the queue waiting for a vsync 
signal. Then from
the second screenshot on everything works fine. So I believe this is 
related to this
issue somehow.

Melissa's idea to add drm_crtc_vblank_put(crtc) made it work, although 
VKMS started
to print this warn message:

WARNING: CPU: 0 PID: 168 at drivers/gpu/drm/vkms/vkms_crtc.c:21 
vkms_vblank_simulate+0x101/0x110

I've decided to share this info with you, as it may help you somehow. 
I'm also
investigating to help understand what is happening.

Thanks,
Leandro Ribeiro

On 7/31/20 1:47 PM, Melissa Wen wrote:
> On 07/31, Sidong Yang wrote:
>> On Fri, Jul 31, 2020 at 11:08:34AM +0200, daniel@...ll.ch wrote:
>>> On Thu, Jul 30, 2020 at 07:09:25AM -0300, Melissa Wen wrote:
>>>> On 07/29, Daniel Vetter wrote:
>>>>> On Wed, Jul 29, 2020 at 9:09 PM Melissa Wen <melissa.srw@...il.com> wrote:
>>>>>> Melissa Wen
>>>>>>
>>>>>> On Sat, Jul 25, 2020 at 3:12 PM Daniel Vetter <daniel@...ll.ch> wrote:
>>>>>>> On Sat, Jul 25, 2020 at 7:45 PM Melissa Wen <melissa.srw@...il.com> wrote:
>>>>>>>> On 07/25, Daniel Vetter wrote:
>>>>>>>>> On Sat, Jul 25, 2020 at 5:12 AM Sidong Yang <realwakka@...il.com> wrote:
>>>>>>>>>> On Wed, Jul 22, 2020 at 05:17:05PM +0200, Daniel Vetter wrote:
>>>>>>>>>>> On Wed, Jul 22, 2020 at 4:06 PM Melissa Wen <melissa.srw@...il.com> wrote:
>>>>>>>>>>>> On 07/22, daniel@...ll.ch wrote:
>>>>>>>>>>>>> On Wed, Jul 22, 2020 at 08:04:11AM -0300, Melissa Wen wrote:
>>>>>>>>>>>>>> This patch adds a missing drm_crtc_vblank_put op to the pair
>>>>>>>>>>>>>> drm_crtc_vblank_get/put (inc/decrement counter to guarantee vblanks).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It clears the execution of the following kms_cursor_crc subtests:
>>>>>>>>>>>>>> 1. pipe-A-cursor-[size,alpha-opaque, NxN-(on-screen, off-screen, sliding,
>>>>>>>>>>>>>>     random, fast-moving])] - successful when running individually.
>>>>>>>>>>>>>> 2. pipe-A-cursor-dpms passes again
>>>>>>>>>>>>>> 3. pipe-A-cursor-suspend also passes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The issue was initially tracked in the sequential execution of IGT
>>>>>>>>>>>>>> kms_cursor_crc subtest: when running the test sequence or one of its
>>>>>>>>>>>>>> subtests twice, the odd execs complete and the pairs get stuck in an
>>>>>>>>>>>>>> endless wait. In the IGT code, calling a wait_for_vblank before the start
>>>>>>>>>>>>>> of CRC capture prevented the busy-wait. But the problem persisted in the
>>>>>>>>>>>>>> pipe-A-cursor-dpms and -suspend subtests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Checking the history, the pipe-A-cursor-dpms subtest was successful when,
>>>>>>>>>>>>>> in vkms_atomic_commit_tail, instead of using the flip_done op, it used
>>>>>>>>>>>>>> wait_for_vblanks. Another way to prevent blocking was wait_one_vblank when
>>>>>>>>>>>>>> enabling crtc. However, in both cases, pipe-A-cursor-suspend persisted
>>>>>>>>>>>>>> blocking in the 2nd start of CRC capture, which may indicate that
>>>>>>>>>>>>>> something got stuck in the step of CRC setup. Indeed, wait_one_vblank in
>>>>>>>>>>>>>> the crc setup was able to sync things and free all kms_cursor_crc
>>>>>>>>>>>>>> subtests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tracing and comparing a clean run with a blocked one:
>>>>>>>>>>>>>> - in a clean one, vkms_crtc_atomic_flush enables vblanks;
>>>>>>>>>>>>>> - when blocked, only in next op, vkms_crtc_atomic_enable, the vblanks
>>>>>>>>>>>>>> started. Moreover, a series of vkms_vblank_simulate flow out until
>>>>>>>>>>>>>> disabling vblanks.
>>>>>>>>>>>>>> Also watching the steps of vkms_crtc_atomic_flush, when the very first
>>>>>>>>>>>>>> drm_crtc_vblank_get returned an error, the subtest crashed. On the other
>>>>>>>>>>>>>> hand, when vblank_get succeeded, the subtest completed. Finally, checking
>>>>>>>>>>>>>> the flush steps: it increases counter to hold a vblank reference (get),
>>>>>>>>>>>>>> but there isn't a op to decreased it and release vblanks (put).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cc: Daniel Vetter <daniel@...ll.ch>
>>>>>>>>>>>>>> Cc: Rodrigo Siqueira <rodrigosiqueiramelo@...il.com>
>>>>>>>>>>>>>> Cc: Haneen Mohammed <hamohammed.sa@...il.com>
>>>>>>>>>>>>>> Signed-off-by: Melissa Wen <melissa.srw@...il.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>   drivers/gpu/drm/vkms/vkms_crtc.c | 1 +
>>>>>>>>>>>>>>   1 file changed, 1 insertion(+)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
>>>>>>>>>>>>>> index ac85e17428f8..a99d6b4a92dd 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/vkms/vkms_crtc.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
>>>>>>>>>>>>>> @@ -246,6 +246,7 @@ static void vkms_crtc_atomic_flush(struct drm_crtc *crtc,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>              spin_unlock(&crtc->dev->event_lock);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +           drm_crtc_vblank_put(crtc);
>>>>>>>>>>>>> Uh so I reviewed this a bit more carefully now, and I dont think this is
>>>>>>>>>>>>> the correct bugfix. From the kerneldoc of drm_crtc_arm_vblank_event():
>>>>>>>>>>>>>
>>>>>>>>>>>>>   * Caller must hold a vblank reference for the event @e acquired by a
>>>>>>>>>>>>>   * drm_crtc_vblank_get(), which will be dropped when the next vblank arrives.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So when we call drm_crtc_arm_vblank_event then the vblank_put gets called
>>>>>>>>>>>>> for us. And that's the only case where we successfully acquired a vblank
>>>>>>>>>>>>> interrupt reference since on failure of drm_crtc_vblank_get (0 indicates
>>>>>>>>>>>>> success for that function, failure negative error number) we directly send
>>>>>>>>>>>>> out the event.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So something else fishy is going on, and now I'm totally confused why this
>>>>>>>>>>>>> even happens.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We also have a pile of WARN_ON checks in drm_crtc_vblank_put to make sure
>>>>>>>>>>>>> we don't underflow the refcount, so it's also not that I think (except if
>>>>>>>>>>>>> this patch creates more WARNING backtraces).
>>>>>>>>>>>>>
>>>>>>>>>>>>> But clearly it changes behaviour somehow ... can you try to figure out
>>>>>>>>>>>>> what changes? Maybe print out the vblank->refcount at various points in
>>>>>>>>>>>>> the driver, and maybe also trace when exactly the fake vkms vblank hrtimer
>>>>>>>>>>>>> is enabled/disabled ...
>>>>>>>>>>>> :(
>>>>>>>>>>>>
>>>>>>>>>>>> I can check these, but I also have other suspicions. When I place the
>>>>>>>>>>>> drm_crct_vblank_put out of the if (at the end of flush), it not only solve
>>>>>>>>>>>> the issue of blocking on kms_cursor_crc, but also the WARN_ON on kms_flip
>>>>>>>>>>>> doesn't appear anymore (a total cleanup). Just after:
>>>>>>>>>>>>
>>>>>>>>>>>> vkms_output->composer_state = to_vkms_crtc_state(crtc->state);
>>>>>>>>>>>>
>>>>>>>>>>>> looks like there is something stuck around here.
>>>>>>>>>>> Hm do you have the full WARNING for this? Maybe this gives me an idea
>>>>>>>>>>> what's going wrong.
>>>>>>>>>>>
>>>>>>>>>>>> Besides, there is a lock at atomic_begin:
>>>>>>>>>>>>
>>>>>>>>>>>>    /* This lock is held across the atomic commit to block vblank timer
>>>>>>>>>>>>     * from scheduling vkms_composer_worker until the composer is updated
>>>>>>>>>>>>     */
>>>>>>>>>>>>    spin_lock_irq(&vkms_output->lock);
>>>>>>>>>>>>
>>>>>>>>>>>> that seems to be released on atomic_flush and make me suspect something
>>>>>>>>>>>> missing on the composer update.
>>>>>>>>>>> atomic_begin/atomic_flush are symmetric functions an always called
>>>>>>>>>>> around all the plane updates. So having the spin_lock in _begin and
>>>>>>>>>>> the spin_unlock in _flush should be symmetric and correct.
>>>>>>>>>>>
>>>>>>>>>>> If you want to make sure, recompile with CONFIG_PROVE_LOCKING, which
>>>>>>>>>>> should immmediately give you a huge splat in dmesg if there's anything
>>>>>>>>>>> unbalanced with locking.
>>>>>>>>>>>
>>>>>>>>>>>> I'll check all these things and come back with news (hope) :)
>>>>>>>>>>> Have fun chasing stuff :-)
>>>>>>>>>>>
>>>>>>>>>>> Cheers, Daniel
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Melissa
>>>>>>>>>>>>> I'm totally confused about what's going on here now.
>>>>>>>>>>>>> -Daniel
>>>>>>>>>> Hi Daniel, Melissa.
>>>>>>>>>> I found something about this problem.
>>>>>>>>>> I traced vblank->refcount that it's important in the problem.
>>>>>>>>>> In normal case, first test run calls commit_tail() and enable vblank in
>>>>>>>>>> atomic_flush(). in drm_vblank_get(), it enable vblank when refcount was zero.
>>>>>>>>>>
>>>>>>>>>> in first test run, it disable crtc for cleanup test. drm_crtc_vblank_off() was
>>>>>>>>>> called by atomic_disable. in this function vblank's refcount was increased for
>>>>>>>>>> prevent subsequent drm_vblank_get() from re-enabling the vblank interrupt.
>>>>>>>>>> and refcount goes one not zero for next test run.
>>>>>>>>>>
>>>>>>>>>> and next test run, drm_vblank_get() was called but it didn't enable vblank
>>>>>>>>>> because refcount was already one. drm_crtc_vblank_on() was called in next. but
>>>>>>>>>> it didn't enable vblank but just increase refcount only.
>>>>>>>>>>
>>>>>>>>>> I think this is why this problem happen. don't know how to fix this correctly.
>>>>>>>>>> should we force to enable vblank after enabling crtc?
>>>>>>>>> Hm, between drm_crtc_vblank_off and drm_crtc_vblank_on
>>>>>>>>> drm_crtc_vblank_get should fail (and leave the refcount unchanged).
>>>>>>>>> It's convoluted logic, but the check for vblank->enabled should catch
>>>>>>>>> that and return -EINVAL for this case. Does that not happen?
>>>>>>>>>
>>>>>>>>> It would indeed explain the bug (I think, I've been wrong way too many
>>>>>>>>> times with this).
>>>>>>>>> -Daniel
>>>>>>>>>
>>>>>>>> Hi Daniel and Sidong,
>>>>>>>>
>>>>>>>> I don't know if it will be confusing, but I will try to explain in a
>>>>>>>> little more detail (and newbie way) what I saw in this behavior of the
>>>>>>>> refcount (similar to what Sidong evaluated).
>>>>>>>>
>>>>>>>> 1. Starting with the loading of vkms is:
>>>>>>>> In vkms_init:
>>>>>>>> After drm_vblank_init (refcount=0), it calls:
>>>>>>>> vkms_modeset_init
>>>>>>>> --> vkms_output_init
>>>>>>>> ----> drm_mode_config_reset
>>>>>>>> -------> vkms_atomic_crtc_reset
>>>>>>>> (even more inside)--> drm_crtc_vblank_reset that bumps the refcount to
>>>>>>>> prevent vblank_get to enable vblank (refcount=1)
>>>>>>>>
>>>>>>>> 2. So, when we start a subtest, vblank is still disabled and in
>>>>>>>> commit_tail, commit_planes triggers a atomic_begin/flush->vblank_get that
>>>>>>>> return -EINVAL because !vblank->enabled (refcount ends 1) and send_vblank;
>>>>>>>> however the test fails before atomic_enable decrements refcount to 0 and
>>>>>>>> reset timestamp.
>>>>>>>> ** This warning also appears in this very first running:
>>>>>>>> WARNING: CPU: 0 PID: 708 at drivers/gpu/drm/vkms/vkms_crtc.c:91 vkms_get_vblank_timestamp+0x41/0x50 [vkms]
>>>>>>> Hm yeah I guess that's something we should paper over a bit, but maybe
>>>>>>> the bugfix will take care of that.
>>>>>>>
>>>>>>>> In the end, this sequence modeset_disable -> atomic_begin ->
>>>>>>>> atomic_flush: refcount going from 0 to 1 and than drm_vblank_enable
>>>>>>>> prepares to everything going well in the next subtest (because
>>>>>>>> atomic_disable is not called).
>>>>>>>>
>>>>>>>> 3. It could be nice, but in the next subtest (with success), as refcount +
>>>>>>>> vblank_enabled ok, after doind its job, it calls
>>>>>>>> atomic_disable->vblank_off and here refcount ends 1 and vblank disabled
>>>>>>>> (the problem returns).
>>>>>>>> So, we have a kind of good turn and bad turn.
>>>>>>>>
>>>>>>>> I tried different things, but the only relatively stable result was
>>>>>>>> putting the sequence modeset_disable + modeset_enables + commit_planes in
>>>>>>>> the commit_tail. That didn't convince me and then I keep trying things.
>>>>>>> This actually sounds like a good idea, I had the same one. Doing it
>>>>>>> this way should also resolve the WARNING you've pointed out I think?
>>>>>> Hi Daniel,
>>>>>>
>>>>>> My uncertainty in this idea was related to a subtest, the cursor-suspend.
>>>>>> Although the reordering solves most of the blocking in kms-cursor-crc, the
>>>>>> suspend subtest fails because when vkms suspends, it disables vblank,
>>>>>> and when it resumes, vblank is not enabled in time. In this subtest, there is
>>>>>> a pipe-crc-start and adding a igt_wait_for_vblank solves... but again,
>>>>>> I know it is not the real fix.
>>>>>>
>>>>>> Would be the case to develop a specific feature of suspend/resume in vkms?
>>>>>> I mean, something to enable vblank when resume. I am trying to figure out how
>>>>>> to develop it, but still without success.
>>>>> Hm since it's all software I expected that the hrtimer will simply
>>>>> continue to run as if nothing happened. For real hw we'd need to use
>>>>> drm_mode_config_helper_suspend/resume, but for vkms I dont think
>>>>> that's required. Is the vblank hrtimer not working after resume? Or is
>>>>> it simply reporting a garbage timestamp and that's why the testcase
>>>>> fails?
>>>> The testcase fails for the same timeout in waiting the first crc
>>>> (already applying the change in the commit_tail sequence):
>>>>
>>>> (kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: (power_dir = open("/sys/power", O_RDONLY)) >= 0
>>>> (kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: get_supported_suspend_states(power_dir) & (1 << state)
>>>> (kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: test == SUSPEND_TEST_NONE || faccessat(power_dir, "pm_test", R_OK | W_OK, 0) == 0
>>>> (kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: !(state == SUSPEND_STATE_DISK && !intel_get_total_swap_mb())
>>>> (kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: ret == 0
>>>> (kms_cursor_crc:732) igt_core-INFO: [cmd] rtcwake: wakeup from "mem" using /dev/rtc0 at Thu Jul 30 09:23:59 2020
>>>> (kms_cursor_crc:732) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
>>>> (kms_cursor_crc:732) igt_core-INFO: Timed out: Opening crc fd, and poll for first CRC.
>>>>
>>>> What I could check was, when suspend, vblanks are disabled (calling
>>>> vkms_disable_vblank), and when resume, the testcase fails and only after
>>>> the failure vblanks are enabled (vkms_enable_vblank) and
>>>> hrtimer_init/starts.
>>> Hm, what is disabling the vblank there? Can you grab a full backtrace for
>>> that? I have no idea why that's even happening ...
>>>
>>>> If I "force" enabling vblanks via testcase (adding a
>>>> igt_wait_for_vblank before igt_pipe_crc_start), things work fine.
>>>> This is why I thought about anticipating the restarting of activities by
>>>> placing a vblank "wakeup" in a resume function. Or perhaps prevent vblank
>>>> from being disabled when suspended, since that last attempt to place a
>>>> vblank_put at the end of the flush made this test case (suspend) work,
>>>> because it prevented the vblank from being disabled.
>>>>
>>>> This failure in suspend subtest is so closer to the previous ones, that I
>>>> was unsure if my attempt in fix by reordering commit_tail would be enough.
>>>> But maybe they are different situations that deserve different treats.
>>>> Do you think restarting a vblank in resume can make sense for vkms?
>>> tbh I'm just really confused what's going on :-/
>> Hi Daniel, Mellisa.
>> I made up the situation Melissa said, and thought about what's going on now.
>> What pipe-A-cursor-suspend different from size-change is that it has some
>> suspend command just before disabling cursor in test. usually vblank is enabled
>> by drm_vblank_get() and it's handled by vblank interrupt (vkms_vblank_simulate
>> in vkms). by calling drm_crtc_handle_vblank(), drm_vblank_put() is called and it
>> disable vblank with disable timer. the timer will disable vblank after 5 seconds
>> (drm_vblank_offdelay) later in default. that time is enough to execute crc
>> command in simple size-change test. but in suspending situation, if suspend and
>> wakup command takes more than 5 seconds, vblank will disabled and also crc
>> command is failed from polling crc file. the test are passed if I make my
>> environment wake up earlier.
> Oh, nice! I tested what you say extending the offdelay, and the test goes well.
> But what would be the right way to fix?
>
> Melissa
>
>> In the same way, if there is the code delaying crc command in igt test, it make
>> same problem even if it's in simple size-change.
>>
>> -Sidong
>>
>>> -Daniel
>>>
>>>> Melissa
>>>>> Not sure how to wire it up for fake drivers like vkms, but maybe doing
>>>>> the suspend/resume like for real drivers helps. I think ideally we'd
>>>>> try to attach a platform driver to our platform device we create (but
>>>>> not sure how to do that).
>>>>> -Daniel
>>>>>
>>>>>> Melissa
>>>>>>> But I'm still wondering why after step 3 we don't get -EINVAL from
>>>>>>> vblank_get() - after vblank_off() vblank->enabled should be false
>>>>>>> again, getting us back to the same state as after 1. Is that not
>>>>>>> happening?
>>>>>>> -Daniel
>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -Sidong
>>>>>>>>>>
>>>>>>>>>>>>>>              crtc->state->event = NULL;
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> 2.27.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Daniel Vetter
>>>>>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>>>>>> http://blog.ffwll.ch
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Daniel Vetter
>>>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>>>> http://blog.ffwll.ch
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> dri-devel mailing list
>>>>>>>>>>> dri-devel@...ts.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Daniel Vetter
>>>>>>>>> Software Engineer, Intel Corporation
>>>>>>>>> http://blog.ffwll.ch
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Daniel Vetter
>>>>>>> Software Engineer, Intel Corporation
>>>>>>> http://blog.ffwll.ch
>>>>>
>>>>>
>>>>> -- 
>>>>> Daniel Vetter
>>>>> Software Engineer, Intel Corporation
>>>>> http://blog.ffwll.ch
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
> _______________________________________________
> dri-devel mailing list
> dri-devel@...ts.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

Powered by blists - more mailing lists