[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e122f276-5d19-4c88-bf41-4a4633ea2c46@intel.com>
Date: Thu, 29 Jan 2026 09:02:58 -0800
From: Jacob Keller <jacob.e.keller@...el.com>
To: Thomas Zimmermann <tzimmermann@...e.de>, Dave Airlie <airlied@...hat.com>,
Jocelyn Falempe <jfalempe@...hat.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Simona Vetter <simona@...ll.ch>
CC: Pasi Vaananen <pvaanane@...hat.com>, <dri-devel@...ts.freedesktop.org>,
<linux-kernel@...r.kernel.org>, <stable@...r.kernel.org>
Subject: Re: [PATCH] drm/mgag200: sleep instead of busy wait for BMC
On 1/29/2026 12:15 AM, Thomas Zimmermann wrote:
> Hi,
>
> thanks for the patch. If I'm not mistaken, you have fixed a long-
> standing bug. See my comments below.
>
> Please double check my comments and, if valid, integrate them into the
> commit description.
>
> Am 28.01.26 um 21:41 schrieb Jacob Keller:
>> The mgag200_bmc_stop_scanout() function is called by
>> the .atomic_disable()
>> handler for the MGA G200 VGA BMC encoder. This function performs a few
>> register writes to inform the BMC of an upcoming mode change, and then
>> polls to wait until the BMC actually stops.
>>
>> The polling is implemented using a busy loop with udelay() and an
>> iteration
>> timeout of 300, resulting in the function blocking for 300 milliseconds.
>>
>> The function gets called ultimately by the output_poll_execute work
>> thread
>> for the DRM output change polling thread of the mgag200 driver:
>>
>> kworker/0:0-mm_ 3528 [000] 4555.315364:
>> ffffffffaa0e25b3 delay_halt.part.0+0x33
>> ffffffffc03f6188 mgag200_bmc_stop_scanout+0x178
>> ffffffffc087ae7a disable_outputs+0x12a
>> ffffffffc087c12a drm_atomic_helper_commit_tail+0x1a
>> ffffffffc03fa7b6
>> mgag200_mode_config_helper_atomic_commit_tail+0x26
>> ffffffffc087c9c1 commit_tail+0x91
>> ffffffffc087d51b drm_atomic_helper_commit+0x11b
>> ffffffffc0509694 drm_atomic_commit+0xa4
>> ffffffffc05105e8 drm_client_modeset_commit_atomic+0x1e8
>> ffffffffc0510ce6 drm_client_modeset_commit_locked+0x56
>> ffffffffc0510e24 drm_client_modeset_commit+0x24
>> ffffffffc088a743
>> __drm_fb_helper_restore_fbdev_mode_unlocked+0x93
>> ffffffffc088a683 drm_fb_helper_hotplug_event+0xe3
>> ffffffffc050f8aa drm_client_dev_hotplug+0x9a
>> ffffffffc088555a output_poll_execute+0x29a
>> ffffffffa9b35924 process_one_work+0x194
>> ffffffffa9b364ee worker_thread+0x2fe
>> ffffffffa9b3ecad kthread+0xdd
>> ffffffffa9a08549 ret_from_fork+0x29
>>
>> On a server running ptp4l with the mgag200 driver loaded, we found that
>> ptp4l would sometimes get blocked from execution because of this busy
>> waiting loop.
>>
>> Every so often, approximately once every 20 minutes though with large
>> variance, the output_poll_execute() thread would detect some sort of
>> change
>> that required performing a hotplug event which results in attempting to
>> stop the BMC scanout, resulting in a 300msec delay on one CPU.
>>
>> On this system, ptp4l was pinned to a single CPU. When the
>> output_poll_execute() thread ran on that CPU, it blocked ptp4l from
>> executing for its 300millisecond duration.
>>
>> This resulted in PTP service disruptions such as failure to send a SYNC
>> message on time, failure to handle ANNOUNCE messages on time, and clock
>> check warnings from the application. All of this despite the application
>> being configured with FIFO_RT and a higher priority than the background
>> workqueue tasks. (However, note that the kernel did not use
>> CONFIG_PREEMPT...)
>>
>> It is unclear if the event is due to a faulty VGA connection, another
>> bug,
>> or actual events causing a change in the connection. At least on the
>> system
>> under test it is not a one-time event and consistently causes
>> disruption to
>> the time sensitive applications.
>
> I'm not surprised. The drivers for some of the server chipsets have been
> imported from old Xorg's user-space code, so that there's something to
> work with in the kernel. We've since been fixing them to better
> integrate with the kernel.
>
> Not busy waiting for the BMC is also a little improvement to any other
> workload.
>
Right.
>>
>> Reading through other DRM driver implementations, it does not appear that
>> the .atomic_enable or .atomic_disable handlers need to delay instead of
>> sleep. For example, the ast_astdp_encoder_helper_atomic_disable()
>> function
>> calls ast_dp_set_phy_sleep() which uses msleep(). The "atomic" in the
>> name
>> is referring to the atomic modesetting support, which is the support to
>> enable atomic configuration from userspace, and not to the "atomic
>> context"
>> of the kernel.
>
> Yes. "Atomic" means "apply all changes to hardware, or none".
>
>>
>> Replace the busy wait with a sleeping loop based on read_poll_timeout().
>> This ensures that other time sensitive operations are not blocked from
>> executing while the work thread is waiting for the BMC hardware.
>>
>> Note the RREG_DAC is implemented using a statement expression to allow
>> working properly with the read_poll_timeout family of functions. The
>> other
>> RREG_<TYPE> macros ought to be cleaned up to have better semantics, and
>> several places in the mgag200 driver could make use of RREG_DAC or
>> similar
>> RREG_* macros should likely be cleaned up for better semantics as
>> well, but
>> that task has been left as a future cleanup for a non-bugfix.
>>
>> Fixes: 414c45310625 ("mgag200: initial g200se driver (v2)")
>> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
>
> Reviewed-by: Thomas Zimmermann <tzimmermann@...e.de>
>
>> ---
>> We still do not know if the reconfiguration is caused by a different
>> bug or by a faulty VGA connector or something else. However, there is no
>> reason that this function should be spinning instead of sleeping while
>> waiting for the BMC scan to stop.
>>
>> It is known that removing the mgag200 module avoids the issue. It is also
>> likely that use of CONFIG_PREEMPT (or CONFIG_PREEMPT_RT) could allow the
>> high priority process to preempt the kernel thread even while it is
>> delaying. However, it is better to let the process sleep() so that other
>> tasks can execute even if these steps are not taken.
>>
>> There are multiple other udelay() which likely could safely be
>> converted to
>> usleep_range(). However they are all short, and I felt that the smallest
>> targeted fix made the most sense. They could perhaps be cleaned up in a
>> non-fix commit or series along with other improvements like fixing the
>> other RREG_* macros.
>> ---
>> drivers/gpu/drm/mgag200/mgag200_drv.h | 6 ++++++
>> drivers/gpu/drm/mgag200/mgag200_bmc.c | 27 ++++++++++-----------------
>> 2 files changed, 16 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.h b/drivers/gpu/drm/
>> mgag200/mgag200_drv.h
>> index f4bf40cd7c88..a875c4bf8cbe 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_drv.h
>> +++ b/drivers/gpu/drm/mgag200/mgag200_drv.h
>> @@ -111,6 +111,12 @@
>> #define DAC_INDEX 0x3c00
>> #define DAC_DATA 0x3c0a
>> +#define RREG_DAC(reg) \
>> + ({ \
>> + WREG8(DAC_INDEX, reg); \
>> + RREG8(DAC_DATA); \
>> + }) \
>> +
>> #define WREG_DAC(reg, v) \
>> do { \
>> WREG8(DAC_INDEX, reg); \
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_bmc.c b/drivers/gpu/drm/
>> mgag200/mgag200_bmc.c
>> index a689c71ff165..599b710bab9b 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_bmc.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_bmc.c
>> @@ -1,6 +1,7 @@
>> // SPDX-License-Identifier: GPL-2.0-only
>> #include <linux/delay.h>
>> +#include <linux/iopoll.h>
>> #include <drm/drm_atomic_helper.h>
>> #include <drm/drm_edid.h>
>> @@ -12,7 +13,7 @@
>> void mgag200_bmc_stop_scanout(struct mga_device *mdev)
>> {
>> u8 tmp;
>> - int iter_max;
>> + int ret;
>> /*
>> * 1 - The first step is to inform the BMC of an upcoming mode
>> @@ -44,28 +45,20 @@ void mgag200_bmc_stop_scanout(struct mga_device
>> *mdev)
>> * 3a- The third step is to verify if there is an active scan.
>> * We are waiting for a 0 on remhsyncsts <XSPAREREG<0>).
>> */
>
> Either these comments or the original test seems incorrect.
>
> The test below is supposed to detect whether the BMC is scanning out
> from the framebuffer. While it reads a horizontal scanline the bit
> should be 0. That's what the test is for, but it gets the condition wrong.
>
Ahh... You're right, I didn't read the change carefully enough.
>> - iter_max = 300;
>> - while (!(tmp & 0x1) && iter_max) {
>> - WREG8(DAC_INDEX, MGA1064_SPAREREG);
>> - tmp = RREG8(DAC_DATA);
>> - udelay(1000);
>> - iter_max--;
>> - }
>> + ret = read_poll_timeout(RREG_DAC, tmp, !(tmp & 0x1),
>> + 1000, 300000, false,
>> + MGA1064_SPAREREG);
>
> The original while loop ran as long as "!(tmp & 0x1)". And now the test
> stops if "!(tmp & 0x1)" AFAICT. This (accidentally?) fixes the test and
> makes the comment correct.
>
Right, but that changes the functionality. I'd need to know whether that
swap is actually correct.
>
>> + if (ret == -ETIMEDOUT)
>> + return;
>> /*
>> * 3b- This step occurs only if the remove is actually
>
> Since you're at it, maybe fix this comment to say
>
> '... only if the remote BMC is ...'
>
>> * scanning. We are waiting for the end of the frame which is
>> * a 1 on remvsyncsts (XSPAREREG<1>)
>> */
>> - if (iter_max) {
>> - iter_max = 300;
>> - while ((tmp & 0x2) && iter_max) {
>> - WREG8(DAC_INDEX, MGA1064_SPAREREG);
>> - tmp = RREG8(DAC_DATA);
>> - udelay(1000);
>> - iter_max--;
>> - }
>> - }
>> + (void)read_poll_timeout(RREG_DAC, tmp, (tmp & 0x2),
>> + 1000, 300000, false,
>> + MGA1064_SPAREREG);
>
> Again, the comment and original code disagree and the original test
> condition appears to be inverted. It whats to test of the BMC has
> finished scanning out the final frame. The bit should turn 1. Instead it
> tests if the bit is already 1, which is likely true. Hence that's
> probably where your 300 msec delays comes from.
Either that or the first case was checking for a 0, and looping the
entire time because it was already 0.
I'll need to double check which is true and make a v2. I'd be more
comfortable keeping the logic identical, which is my mistake on not
inverting the condition with the change to read_poll_timeout...
>
> Best regards
> Thomas
>
Thanks for the careful review.
>> }
>> void mgag200_bmc_start_scanout(struct mga_device *mdev)
>>
>> ---
>> base-commit: e535c23513c63f02f67e3e09e0787907029efeaf
>> change-id: 20260127-jk-mgag200-fix-bad-udelay-409133777e3a
>>
>> Best regards,
>> --
>> Jacob Keller <jacob.e.keller@...el.com>
>>
>
Powered by blists - more mailing lists