[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <338ff7cf-1c7d-48da-b1b8-37aac440fed0@suse.de>
Date: Thu, 29 Jan 2026 09:15:51 +0100
From: Thomas Zimmermann <tzimmermann@...e.de>
To: Jacob Keller <jacob.e.keller@...el.com>, Dave Airlie
<airlied@...hat.com>, Jocelyn Falempe <jfalempe@...hat.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Simona Vetter <simona@...ll.ch>
Cc: Pasi Vaananen <pvaanane@...hat.com>, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] drm/mgag200: sleep instead of busy wait for BMC
Hi,
thanks for the patch. If I'm not mistaken, you have fixed a
long-standing bug. See my comments below.
Please double check my comments and, if valid, integrate them into the
commit description.
Am 28.01.26 um 21:41 schrieb Jacob Keller:
> The mgag200_bmc_stop_scanout() function is called by the .atomic_disable()
> handler for the MGA G200 VGA BMC encoder. This function performs a few
> register writes to inform the BMC of an upcoming mode change, and then
> polls to wait until the BMC actually stops.
>
> The polling is implemented using a busy loop with udelay() and an iteration
> timeout of 300, resulting in the function blocking for 300 milliseconds.
>
> The function gets called ultimately by the output_poll_execute work thread
> for the DRM output change polling thread of the mgag200 driver:
>
> kworker/0:0-mm_ 3528 [000] 4555.315364:
> ffffffffaa0e25b3 delay_halt.part.0+0x33
> ffffffffc03f6188 mgag200_bmc_stop_scanout+0x178
> ffffffffc087ae7a disable_outputs+0x12a
> ffffffffc087c12a drm_atomic_helper_commit_tail+0x1a
> ffffffffc03fa7b6 mgag200_mode_config_helper_atomic_commit_tail+0x26
> ffffffffc087c9c1 commit_tail+0x91
> ffffffffc087d51b drm_atomic_helper_commit+0x11b
> ffffffffc0509694 drm_atomic_commit+0xa4
> ffffffffc05105e8 drm_client_modeset_commit_atomic+0x1e8
> ffffffffc0510ce6 drm_client_modeset_commit_locked+0x56
> ffffffffc0510e24 drm_client_modeset_commit+0x24
> ffffffffc088a743 __drm_fb_helper_restore_fbdev_mode_unlocked+0x93
> ffffffffc088a683 drm_fb_helper_hotplug_event+0xe3
> ffffffffc050f8aa drm_client_dev_hotplug+0x9a
> ffffffffc088555a output_poll_execute+0x29a
> ffffffffa9b35924 process_one_work+0x194
> ffffffffa9b364ee worker_thread+0x2fe
> ffffffffa9b3ecad kthread+0xdd
> ffffffffa9a08549 ret_from_fork+0x29
>
> On a server running ptp4l with the mgag200 driver loaded, we found that
> ptp4l would sometimes get blocked from execution because of this busy
> waiting loop.
>
> Every so often, approximately once every 20 minutes though with large
> variance, the output_poll_execute() thread would detect some sort of change
> that required performing a hotplug event which results in attempting to
> stop the BMC scanout, resulting in a 300msec delay on one CPU.
>
> On this system, ptp4l was pinned to a single CPU. When the
> output_poll_execute() thread ran on that CPU, it blocked ptp4l from
> executing for its 300millisecond duration.
>
> This resulted in PTP service disruptions such as failure to send a SYNC
> message on time, failure to handle ANNOUNCE messages on time, and clock
> check warnings from the application. All of this despite the application
> being configured with FIFO_RT and a higher priority than the background
> workqueue tasks. (However, note that the kernel did not use
> CONFIG_PREEMPT...)
>
> It is unclear if the event is due to a faulty VGA connection, another bug,
> or actual events causing a change in the connection. At least on the system
> under test it is not a one-time event and consistently causes disruption to
> the time sensitive applications.
I'm not surprised. The drivers for some of the server chipsets have been
imported from old Xorg's user-space code, so that there's something to
work with in the kernel. We've since been fixing them to better
integrate with the kernel.
Not busy waiting for the BMC is also a little improvement to any other
workload.
>
> Reading through other DRM driver implementations, it does not appear that
> the .atomic_enable or .atomic_disable handlers need to delay instead of
> sleep. For example, the ast_astdp_encoder_helper_atomic_disable() function
> calls ast_dp_set_phy_sleep() which uses msleep(). The "atomic" in the name
> is referring to the atomic modesetting support, which is the support to
> enable atomic configuration from userspace, and not to the "atomic context"
> of the kernel.
Yes. "Atomic" means "apply all changes to hardware, or none".
>
> Replace the busy wait with a sleeping loop based on read_poll_timeout().
> This ensures that other time sensitive operations are not blocked from
> executing while the work thread is waiting for the BMC hardware.
>
> Note the RREG_DAC is implemented using a statement expression to allow
> working properly with the read_poll_timeout family of functions. The other
> RREG_<TYPE> macros ought to be cleaned up to have better semantics, and
> several places in the mgag200 driver could make use of RREG_DAC or similar
> RREG_* macros should likely be cleaned up for better semantics as well, but
> that task has been left as a future cleanup for a non-bugfix.
>
> Fixes: 414c45310625 ("mgag200: initial g200se driver (v2)")
> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
Reviewed-by: Thomas Zimmermann <tzimmermann@...e.de>
> ---
> We still do not know if the reconfiguration is caused by a different
> bug or by a faulty VGA connector or something else. However, there is no
> reason that this function should be spinning instead of sleeping while
> waiting for the BMC scan to stop.
>
> It is known that removing the mgag200 module avoids the issue. It is also
> likely that use of CONFIG_PREEMPT (or CONFIG_PREEMPT_RT) could allow the
> high priority process to preempt the kernel thread even while it is
> delaying. However, it is better to let the process sleep() so that other
> tasks can execute even if these steps are not taken.
>
> There are multiple other udelay() which likely could safely be converted to
> usleep_range(). However they are all short, and I felt that the smallest
> targeted fix made the most sense. They could perhaps be cleaned up in a
> non-fix commit or series along with other improvements like fixing the
> other RREG_* macros.
> ---
> drivers/gpu/drm/mgag200/mgag200_drv.h | 6 ++++++
> drivers/gpu/drm/mgag200/mgag200_bmc.c | 27 ++++++++++-----------------
> 2 files changed, 16 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.h b/drivers/gpu/drm/mgag200/mgag200_drv.h
> index f4bf40cd7c88..a875c4bf8cbe 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_drv.h
> +++ b/drivers/gpu/drm/mgag200/mgag200_drv.h
> @@ -111,6 +111,12 @@
> #define DAC_INDEX 0x3c00
> #define DAC_DATA 0x3c0a
>
> +#define RREG_DAC(reg) \
> + ({ \
> + WREG8(DAC_INDEX, reg); \
> + RREG8(DAC_DATA); \
> + }) \
> +
> #define WREG_DAC(reg, v) \
> do { \
> WREG8(DAC_INDEX, reg); \
> diff --git a/drivers/gpu/drm/mgag200/mgag200_bmc.c b/drivers/gpu/drm/mgag200/mgag200_bmc.c
> index a689c71ff165..599b710bab9b 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_bmc.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_bmc.c
> @@ -1,6 +1,7 @@
> // SPDX-License-Identifier: GPL-2.0-only
>
> #include <linux/delay.h>
> +#include <linux/iopoll.h>
>
> #include <drm/drm_atomic_helper.h>
> #include <drm/drm_edid.h>
> @@ -12,7 +13,7 @@
> void mgag200_bmc_stop_scanout(struct mga_device *mdev)
> {
> u8 tmp;
> - int iter_max;
> + int ret;
>
> /*
> * 1 - The first step is to inform the BMC of an upcoming mode
> @@ -44,28 +45,20 @@ void mgag200_bmc_stop_scanout(struct mga_device *mdev)
> * 3a- The third step is to verify if there is an active scan.
> * We are waiting for a 0 on remhsyncsts <XSPAREREG<0>).
> */
Either these comments or the original test seems incorrect.
The test below is supposed to detect whether the BMC is scanning out
from the framebuffer. While it reads a horizontal scanline the bit
should be 0. That's what the test is for, but it gets the condition wrong.
> - iter_max = 300;
> - while (!(tmp & 0x1) && iter_max) {
> - WREG8(DAC_INDEX, MGA1064_SPAREREG);
> - tmp = RREG8(DAC_DATA);
> - udelay(1000);
> - iter_max--;
> - }
> + ret = read_poll_timeout(RREG_DAC, tmp, !(tmp & 0x1),
> + 1000, 300000, false,
> + MGA1064_SPAREREG);
The original while loop ran as long as "!(tmp & 0x1)". And now the test
stops if "!(tmp & 0x1)" AFAICT. This (accidentally?) fixes the test and
makes the comment correct.
> + if (ret == -ETIMEDOUT)
> + return;
>
> /*
> * 3b- This step occurs only if the remove is actually
Since you're at it, maybe fix this comment to say
'... only if the remote BMC is ...'
> * scanning. We are waiting for the end of the frame which is
> * a 1 on remvsyncsts (XSPAREREG<1>)
> */
> - if (iter_max) {
> - iter_max = 300;
> - while ((tmp & 0x2) && iter_max) {
> - WREG8(DAC_INDEX, MGA1064_SPAREREG);
> - tmp = RREG8(DAC_DATA);
> - udelay(1000);
> - iter_max--;
> - }
> - }
> + (void)read_poll_timeout(RREG_DAC, tmp, (tmp & 0x2),
> + 1000, 300000, false,
> + MGA1064_SPAREREG);
Again, the comment and original code disagree and the original test
condition appears to be inverted. It whats to test of the BMC has
finished scanning out the final frame. The bit should turn 1. Instead it
tests if the bit is already 1, which is likely true. Hence that's
probably where your 300 msec delays comes from.
Best regards
Thomas
> }
>
> void mgag200_bmc_start_scanout(struct mga_device *mdev)
>
> ---
> base-commit: e535c23513c63f02f67e3e09e0787907029efeaf
> change-id: 20260127-jk-mgag200-fix-bad-udelay-409133777e3a
>
> Best regards,
> --
> Jacob Keller <jacob.e.keller@...el.com>
>
--
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstr. 146, 90461 Nürnberg, Germany, www.suse.com
GF: Jochen Jaser, Andrew McDonald, Werner Knoblich, (HRB 36809, AG Nürnberg)
Powered by blists - more mailing lists