[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56A696DD.2000004@gmail.com>
Date: Mon, 25 Jan 2016 22:42:53 +0100
From: Mario Kleiner <mario.kleiner.de@...il.com>
To: Ville Syrjälä <ville.syrjala@...ux.intel.com>,
Michel Dänzer <michel@...nzer.net>,
Alex Deucher <alexander.deucher@....com>,
Vlastimil Babka <vbabka@...e.cz>,
LKML <linux-kernel@...r.kernel.org>,
dri-devel@...ts.freedesktop.org,
Christian König <christian.koenig@....com>
Subject: Re: linux-4.4 bisected: kwin5 stuck on kde5 loading screen with
radeon
On 01/25/2016 09:32 PM, Daniel Vetter wrote:
> On Mon, Jan 25, 2016 at 08:30:14PM +0100, Mario Kleiner wrote:
>>
>>
>> On 01/25/2016 07:51 PM, Daniel Vetter wrote:
>>> On Mon, Jan 25, 2016 at 05:38:30PM +0100, Mario Kleiner wrote:
>>>> Readding Daniel, which somehow got dropped from the cc.
>>>>
>>>> On 01/25/2016 03:53 PM, Ville Syrjälä wrote:
>>>>> On Mon, Jan 25, 2016 at 02:44:53PM +0100, Mario Kleiner wrote:
>>>>>>
>>>>>>
>>>>>> On 01/25/2016 02:23 PM, Ville Syrjälä wrote:
>>>>>>> On Mon, Jan 25, 2016 at 02:16:45PM +0100, Mario Kleiner wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/25/2016 05:15 AM, Michel Dänzer wrote:
>>>>>>>>> On 23.01.2016 00:18, Ville Syrjälä wrote:
>>>>>>>>>> On Fri, Jan 22, 2016 at 12:06:00PM +0900, Michel Dänzer wrote:
>>>>>>>>>>>
>>>>>>>>>>> [ Trimming KDE folks from Cc ]
>>>>>>>>>>>
>>>>>>>>>>> On 21.01.2016 19:09, Daniel Vetter wrote:
>>>>>>>>>>>> On Thu, Jan 21, 2016 at 05:36:46PM +0900, Michel Dänzer wrote:
>>>>>>>>>>>>> On 21.01.2016 16:58, Daniel Vetter wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you please point me at the vblank on/off jump bug please?
>>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIR I originally reported it in response to
>>>>>>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2015-August/087841.html
>>>>>>>>>>>>> , but I can't find that in the archives, so maybe that was just on IRC.
>>>>>>>>>>>>> See
>>>>>>>>>>>>> http://lists.freedesktop.org/archives/dri-devel/2016-January/099122.html
>>>>>>>>>>>>> . Basically, I ran into the bug fixed by your patch because the counter
>>>>>>>>>>>>> jumped forward on every DPMS off, so it hit the 32-bit boundary after
>>>>>>>>>>>>> just a few days.
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, so just uncovered the overflow bug.
>>>>>>>>>>>
>>>>>>>>>>> Not sure what you mean by "just", but to be clear: The drm_vblank_on/off
>>>>>>>>>>> counter jumping bug (similar to the bug this thread is about), which
>>>>>>>>>>> exposed the overflow bug, is still alive and kicking in 4.5. It seems
>>>>>>>>>>> to happen when turning off the CRTC:
>>>>>>>>>>>
>>>>>>>>>>> [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=218104694, diff=0, hw=916 hw_last=916
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3
>>>>>>>>>>> [drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 0x7 p(2199,-45)@ 7304.307354 -> 7304.308006 [e 0 us, 0 rep]
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3
>>>>>>>>>>> [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=218104694, diff=16776301, hw=1 hw_last=916
>>>>>>>>>>
>>>>>>>>>> Not sure what bug we're talking about here, but here the hw counter
>>>>>>>>>> clearly jumps backwards.
>>>>>>>>>>
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 3
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 3
>>>>>>>>>>> [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=0, diff=0, hw=0 hw_last=0
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 3
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 3
>>>>>>>>>>> [drm:drm_update_vblank_count] updating vblank count on crtc 2: current=0, diff=0, hw=0 hw_last=0
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 3
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 3
>>>>>>>>>>> [drm:drm_update_vblank_count] updating vblank count on crtc 3: current=0, diff=0, hw=0 hw_last=0
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 1
>>>>>>>>>>> [drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 0x1 p(0,0)@ 7304.317140 -> 7304.317140 [e 0 us, 0 rep]
>>>>>>>>>>> [drm:radeon_get_vblank_counter_kms] Query failed! stat 1
>>>>>>>>>>> [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=234880995, diff=16777215, hw=0 hw_last=1
>>>>>>>>>>
>>>>>>>>>> Same here.
>>>>>>>>>
>>>>>>>>> At least one of the jumps is expected, because this is around turning
>>>>>>>>> off the CRTC for DPMS off. Don't know yet why there are two jumps back
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> These things just don't happen on i915 because drm_vblank_off() and
>>>>>>>>>> drm_vblank_on() are always called around the times when the hw counter
>>>>>>>>>> might get reset. Or at least that's how it should be.
>>>>>>>>>
>>>>>>>>> Which is of course the idea of Daniel's patch (which is what I'm getting
>>>>>>>>> the above with) or Mario's patch as well, but clearly something's still
>>>>>>>>> wrong. It's certainly possible that it's something in the driver, but
>>>>>>>>> since calling drm_vblank_pre/post_modeset from the same places seems to
>>>>>>>>> work fine (ignoring the regression discussed in this thread)... Do
>>>>>>>>> drm_vblank_on/off require something else to handle this correctly?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> I suspect it is because vblank_disable_and_save calls
>>>>>>>> drm_update_vblank_count() unconditionally, even if vblank irqs are
>>>>>>>> already off.
>>>>>>>>
>>>>>>>> So on a manual display disable -> reenable you get something like
>>>>>>>>
>>>>>>>> At disable:
>>>>>>>>
>>>>>>>> Call to dpms-off --> atombios_crtc_dpms(DPMS_OFF) --> drm_vblank_off ->
>>>>>>>> vblank_disable_and_save -> irqs off, drm_update_vblank_count() computes
>>>>>>>> final count.
>>>>>>>>
>>>>>>>> Then the crtc is shut down and its hw counter resets to zero.
>>>>>>>>
>>>>>>>> At reenable:
>>>>>>>>
>>>>>>>> Modesetting -> drm_crtc_helper_set_mode -> crtc_funcs->prepare(crtc) ->
>>>>>>>> atombios_crtc_prepare() -> atombios_crtc_dpms(DPMS_OFF) ->
>>>>>>>> drm_vblank_off -> vblank_disable_and_save -> A pointless
>>>>>>>> drm_update_vblank_count() while the hw counter is already reset to zero
>>>>>>>> --> Unwanted counter jump.
>>>>>>>>
>>>>>>>>
>>>>>>>> The problem doesn't happen on a pure modeset to a different video
>>>>>>>> resolution/refresh rate, as then we only have one call into
>>>>>>>> atombios_crtc_dpms(DPMS_OFF).
>>>>>>>>
>>>>>>>> I think the fix is to fix vblank_disable_and_save() to only call
>>>>>>>> drm_update_vblank_count() if vblank irqs get actually disabled, not on
>>>>>>>> no-op calls. I will try that now.
>>>>>>>
>>>>>>> It does that on purpose. Otherwise the vblank counter would appear to
>>>>>>> have stalled while the interrupt was off.
>>>>>>>
>>>>>>
>>>>>> Ok, that's what the comments there say, although i don't see atm. why
>>>>>> that perceived stall would be a big problem. I checked all callers of
>>>>>> vblank_disable_and_save(). They are all careful to not call that
>>>>>> function if vblanks are already disabled. The only exception is
>>>>>> drm_vblank_off(). If drm_vblank_off/on is supposed to protect kms
>>>>>> drivers which have resetting hw counters or other problematic behaviour
>>>>>> during modesets etc. then this will break. E.g., calling the vblank
>>>>>> timestamping stuff is also not safe/well-defined during modesets when
>>>>>> the timestamping constants are not (yet) updated to reflect the new mode
>>>>>> timing of the modeset in progress.
>>>>>
>>>>> The idea is to maintain the appearance that the counter ticks all the
>>>>> time as long as the crtc is active. While that may not be really
>>>>> required in case if no one is currently interested in the vblank
>>>>> counter, I think it's a nice thing to have just to make the behaviour
>>>>> of the counter consistent.
>>>>>
>>>>> As far as calling drm_vblank_off() after the hw counter got reset, well,
>>>>> that not correct. It should be called before the reset.
>>>>
>>>> What radeon does is calling drm_vblank_off at beginning of DPMS_OFF. The
>>>> first call to DMPS_OFF will call drm_vblank_off() and really disable
>>>> vblank-irqs if they were running, updating the counts/ts a last time. But
>>>> then the dpms off will reset the hw counter to zero. When one reenables the
>>>> display, a second call to DPMS_OFF will now call drm_vblank_off again when
>>>> it apparently shouldn't.
>>>>
>>>> I just tested this patch, which fixes the counter jumps on radeon-kms with
>>>> my or Daniel's drm_vblank_off patches to radeon:
>>>
>>> This might be due to the legacy helpers, which just love to redundantly
>>> disable stuff that's off already. The problem I see with no-oping these
>>> out is that for atomic drivers (which really should get this right) this
>>> might paper over bugs: E.g. when you forget to call _off() when disabling
>>> the crtc, then calling _on() twice in a row is indeed a serious bug.
>>> Similar when you forget to call _on() and have multiple _off() calls in a
>>> row.
>>>
>>> So not sure what to do here.
>>> -Daniel
>>>
>>
>> Yes, the legacy helpers cause two calls to dpms off if one disables a
>> display. First during display disable as intended. Then when one reenables
>> the display during modesetting as part of crtc_funcs->prepare() - at least
>> on radeon.
>>
>> Maybe the minimum thing that would help is to just check for
>> vblank->inmodeset in drm_vblank_off(). If that would be the case we'd know
>> it is a redundant call and could no-op it and do a
>> WARN_ON(vblank->inmodeset)?
>
> I have that here locally, blows up all over the place on radoen. And also
> would blow up everywhere else.
>
You mean the WARN_ON causes ugliness? The no-op on redundant calls to
drm_vblank_off would hopefully not blow up anything but prevent blow ups?
The problem is that there should not ever be a call to the
drm_update_vblank_count() function once a crtc is in modeset/dpms off
etc., not only because of the hw vblank counters being reset to zero,
but also because vblank timestamps computed may be wrong, going
backwards etc. That could again confuse clients.
> I was thinking of adding the vblank->inmodeset check to radeon/amdgpu,
> with a note why it's needed (legacy crtc helpers just suck).
Maybe you could do that check in radeon/amdgpu, but still also leave it
in drm_vblank_off()? If all kms drivers properly avoid redundant calls
as part of legacy paths then the WARN_ON and no-op in drm_vblank_off
should not ever trigger, unless there is a real bug, right? In which
case it should hopefully prevent worse things like a hanging composited
desktop, or login, and instead just make noise in the kernel log?
>
>> drm_vblank_on() i don't know how to treat, but that one calls
>> drm_reset_vblank_timestamp() which should be less problematic if called
>> redundantly.
>
> I think even legacy crtc helpers don't enable stuff again if it's not been
> disabled before. So on drm_vblank_on() we might be able to put a WARN_ON
> in place ...
Hm, logging thist stuff here during modesets and display dis/enable i
see lots of drm_vblank_on that come in "pairs" only about two dozen
msecs apart.
>
>> Now the patch i want to try next to fix the drm_vblank_pre/post_modeset
>> regression in Linux 4.4/4.5 is to add a ...
>>
>> if ((diff > 1) && vblank->inmodeset) diff = 1;
>>
>> ... to the bottom of drm_update_vblank_count(). That should hopefully
>> restore the pre/post_modeset behavior as close to the original behavior as
>> possible. As a side effect it would also prevent the counter jump caused by
>> redundant calls to drm_vblank_off().
>
> Hm, can we just frob pre/post_modeset only with some checks? I'd like to
> not put that kind of "I have no idea about my hw state" hacks into the new
> helpers. Otherwise not even atomic drivers can start to gain WARN_ONs to
> enforce correct usage, which would be a real bummer imo.
> -Daniel
>
We could check for only (vblank->inmodeset & 0x2) to only apply it to
the legacy pre/post path, trusting that the drm_vblank_off/on path will
be made robust in a different way, e.g., by the stuff discussed above
and careful implementation in each kms driver that uses those. Atm.
radeon doesn't use off/on, so your enablement patch set can make sure it
does the right thing from the beginning.
rockchip-kms may need similar treatment to radeon to avoid redundant calls.
Btw. how the patch to drm_update_vblank_count() close to the bottom
would actually look is more like:
if ((diff > 1) &&
((vblank->inmodeset & 0x2) || (flags & DRM_CALLED_FROM_VBLIRQ)))
diff = 1;
Another problem we have is that the implementation of the vblank
timestamp doublebuffer with our custom sequence locking is only save
against concurrent readers as long as the increment for store_vblank()
is +1, ie. diff = 1 in that code.
Other diff values like 2 would cause us to write to the other timestamp
slot that is potentially read by some concurrent readers at the same
time and they would get corrupted timestamps.
Now outside vblank irq drm_update_vblank_count() is only supposed to be
called while there aren't any concurrent readers ie., vblank->refcount
== 0, and the way drm_vblank_on/off are implemented this seems to be the
case, so anything goes.
if called from drm_handle_vblank() aka (flags & DRM_CALLED_FROM_VBLIRQ),
we have to assume that there are concurrent readers, because vblank irqs
are usually only kept active if vblank->refcount is > 0. This means only
+1 increments are allowed. In the past this was always the case, but
with the new implementation since Linux 4.4, it could happen that we get
diff > 1 if the vblank irq gets deferred by more than 1 video refresh
cycle, e.g., due to long irq off periods, maybe preemption on a
RT_PREEMPT kernel, long held locks, firmware triggered SMI's etc. That
would cause bad corruption of timestamps.
So unless or until we also rewrite the timestamp caching, we need that
extra protection against diff > 1 in vblank irq.
-mario
Powered by blists - more mailing lists