linux-kernel - Re: [PATCH v2] drm/mgag200: fix mgag200_bmc_stop

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <a274b913-b41f-4fe5-bc56-b45ea030c2b7@redhat.com>
Date: Tue, 3 Feb 2026 09:35:19 +0100
From: Jocelyn Falempe <jfalempe@...hat.com>
To: Jacob Keller <jacob.e.keller@...el.com>, Dave Airlie
 <airlied@...hat.com>, Thomas Zimmermann <tzimmermann@...e.de>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Simona Vetter <simona@...ll.ch>
Cc: Pasi Vaananen <pvaanane@...hat.com>, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH v2] drm/mgag200: fix mgag200_bmc_stop_scanout()

Thanks, it looks good to me.

Reviewed-by: Jocelyn Falempe <jfalempe@...hat.com>

-- 

Jocelyn

On 03/02/2026 01:16, Jacob Keller wrote:
> The mgag200_bmc_stop_scanout() function is called by the .atomic_disable()
> handler for the MGA G200 VGA BMC encoder. This function performs a few
> register writes to inform the BMC of an upcoming mode change, and then
> polls to wait until the BMC actually stops.
> 
> The polling is implemented using a busy loop with udelay() and an iteration
> timeout of 300, resulting in the function blocking for 300 milliseconds.
> 
> The function gets called ultimately by the output_poll_execute work thread
> for the DRM output change polling thread of the mgag200 driver:
> 
> kworker/0:0-mm_    3528 [000]  4555.315364:
>          ffffffffaa0e25b3 delay_halt.part.0+0x33
>          ffffffffc03f6188 mgag200_bmc_stop_scanout+0x178
>          ffffffffc087ae7a disable_outputs+0x12a
>          ffffffffc087c12a drm_atomic_helper_commit_tail+0x1a
>          ffffffffc03fa7b6 mgag200_mode_config_helper_atomic_commit_tail+0x26
>          ffffffffc087c9c1 commit_tail+0x91
>          ffffffffc087d51b drm_atomic_helper_commit+0x11b
>          ffffffffc0509694 drm_atomic_commit+0xa4
>          ffffffffc05105e8 drm_client_modeset_commit_atomic+0x1e8
>          ffffffffc0510ce6 drm_client_modeset_commit_locked+0x56
>          ffffffffc0510e24 drm_client_modeset_commit+0x24
>          ffffffffc088a743 __drm_fb_helper_restore_fbdev_mode_unlocked+0x93
>          ffffffffc088a683 drm_fb_helper_hotplug_event+0xe3
>          ffffffffc050f8aa drm_client_dev_hotplug+0x9a
>          ffffffffc088555a output_poll_execute+0x29a
>          ffffffffa9b35924 process_one_work+0x194
>          ffffffffa9b364ee worker_thread+0x2fe
>          ffffffffa9b3ecad kthread+0xdd
>          ffffffffa9a08549 ret_from_fork+0x29
> 
> On a server running ptp4l with the mgag200 driver loaded, we found that
> ptp4l would sometimes get blocked from execution because of this busy
> waiting loop.
> 
> Every so often, approximately once every 20 minutes -- though with large
> variance -- the output_poll_execute() thread would detect some sort of
> change that required performing a hotplug event which results in attempting
> to stop the BMC scanout, resulting in a 300msec delay on one CPU.
> 
> On this system, ptp4l was pinned to a single CPU. When the
> output_poll_execute() thread ran on that CPU, it blocked ptp4l from
> executing for its 300 millisecond duration.
> 
> This resulted in PTP service disruptions such as failure to send a SYNC
> message on time, failure to handle ANNOUNCE messages on time, and clock
> check warnings from the application. All of this despite the application
> being configured with FIFO_RT and a higher priority than the background
> workqueue tasks. (However, note that the kernel did not use
> CONFIG_PREEMPT...)
> 
> It is unclear if the event is due to a faulty VGA connection, another bug,
> or actual events causing a change in the connection. At least on the system
> under test it is not a one-time event and consistently causes disruption to
> the time sensitive applications.
> 
> The function has some helpful comments explaining what steps it is
> attempting to take. In particular, step 3a and 3b are explained as such:
> 
>    3a - The third step is to verify if there is an active scan. We are
>         waiting on a 0 on remhsyncsts (<XSPAREREG<0>.
> 
>    3b - This step occurs only if the remove is actually scanning. We are
>         waiting for the end of the frame which is a 1 on remvsyncsts
>         (<XSPAREREG<1>).
> 
> The actual steps 3a and 3b are implemented as while loops with a
> non-sleeping udelay(). The first step iterates while the tmp value at
> position 0 is *not* set. That is, it keeps iterating as long as the bit is
> zero. If the bit is already 0 (because there is no active scan), it will
> iterate the entire 300 attempts which wastes 300 milliseconds in total.
> This is opposite of what the description claims.
> 
> The step 3b logic only executes if we do not iterate over the entire 300
> attempts in the first loop. If it does trigger, it is trying to check and
> wait for a 1 on the remvsyncsts. However, again the condition is actually
> inverted and it will loop as long as the bit is 1, stopping once it hits
> zero (rather than the explained attempt to wait until we see a 1).
> 
> Worse, both loops are implemented using non-sleeping waits which spin
> instead of allowing the scheduler to run other processes. If the kernel is
> not configured to allow arbitrary preemption, it will waste valuable CPU
> time doing nothing.
> 
> There does not appear to be any documentation for the BMC register
> interface, beyond what is in the comments here. It seems more probable that
> the comment here is correct and the implementation accidentally got
> inverted from the intended logic.
> 
> Reading through other DRM driver implementations, it does not appear that
> the .atomic_enable or .atomic_disable handlers need to delay instead of
> sleep. For example, the ast_astdp_encoder_helper_atomic_disable() function
> calls ast_dp_set_phy_sleep() which uses msleep(). The "atomic" in the name
> is referring to the atomic modesetting support, which is the support to
> enable atomic configuration from userspace, and not to the "atomic context"
> of the kernel. There is no reason to use udelay() here if a sleep would be
> sufficient.
> 
> Replace the while loops with a read_poll_timeout() based implementation
> that will sleep between iterations, and which stops polling once the
> condition is met (instead of looping as long as the condition is met). This
> aligns with the commented behavior and avoids blocking on the CPU while
> doing nothing.
> 
> Note the RREG_DAC is implemented using a statement expression to allow
> working properly with the read_poll_timeout family of functions. The other
> RREG_<TYPE> macros ought to be cleaned up to have better semantics, and
> several places in the mgag200 driver could make use of RREG_DAC or similar
> RREG_* macros should likely be cleaned up for better semantics as well, but
> that task has been left as a future cleanup for a non-bugfix.
> 
> Fixes: 414c45310625 ("mgag200: initial g200se driver (v2)")
> Suggested-by: Thomas Zimmermann <tzimmermann@...e.de>
> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
> ---
> We still do not know if the reconfiguration is caused by a different
> bug or by a faulty VGA connector or something else. However, there is no
> reason that this function should be spinning instead of sleeping while
> waiting for the BMC scan to stop.
> 
> It is known that removing the mgag200 module avoids the issue. It is also
> likely that use of CONFIG_PREEMPT (or CONFIG_PREEMPT_RT) could allow the
> high priority process to preempt the kernel thread even while it is
> delaying. However, it is better to let the process sleep() so that other
> tasks can execute even if these steps are not taken.
> 
> There are multiple other udelay() which likely could safely be converted to
> usleep_range(). However they are all short, and I felt that the smallest
> targeted fix made the most sense. They could perhaps be cleaned up in a
> non-fix commit or series along with other improvements like fixing the
> other RREG_* macros.
> 
> Thanks to Thomas Zimmermann for catching the originally unintended flipping
> of the loop condition, and for helping determine this seems to actually be
> correct. It seems likely that we are blocking for 300 milliseconds every
> time unintentionally because we loop until there is an active scan instead
> of looping until there is no more active scan.
> ---
> Changes in v2:
> - Update the description after the insights from Thomas, and the testing
>    from Jocelyn.
> - Fix some minor typos in the comments.
> - No functional change from v1, though we now explain why we're changing
>    the conditions in the commit message properly.
> - Link to v1: https://patch.msgid.link/20260128-jk-mgag200-fix-bad-udelay-v1-1-db02e04c343d@intel.com
> ---
>   drivers/gpu/drm/mgag200/mgag200_drv.h |  6 ++++++
>   drivers/gpu/drm/mgag200/mgag200_bmc.c | 31 ++++++++++++-------------------
>   2 files changed, 18 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.h b/drivers/gpu/drm/mgag200/mgag200_drv.h
> index f4bf40cd7c88..a875c4bf8cbe 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_drv.h
> +++ b/drivers/gpu/drm/mgag200/mgag200_drv.h
> @@ -111,6 +111,12 @@
>   #define DAC_INDEX 0x3c00
>   #define DAC_DATA 0x3c0a
>   
> +#define RREG_DAC(reg)						\
> +	({							\
> +		WREG8(DAC_INDEX, reg);				\
> +		RREG8(DAC_DATA);				\
> +	})							\
> +
>   #define WREG_DAC(reg, v)					\
>   	do {							\
>   		WREG8(DAC_INDEX, reg);				\
> diff --git a/drivers/gpu/drm/mgag200/mgag200_bmc.c b/drivers/gpu/drm/mgag200/mgag200_bmc.c
> index a689c71ff165..bbdeb791c5b3 100644
> --- a/drivers/gpu/drm/mgag200/mgag200_bmc.c
> +++ b/drivers/gpu/drm/mgag200/mgag200_bmc.c
> @@ -1,6 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0-only
>   
>   #include <linux/delay.h>
> +#include <linux/iopoll.h>
>   
>   #include <drm/drm_atomic_helper.h>
>   #include <drm/drm_edid.h>
> @@ -12,7 +13,7 @@
>   void mgag200_bmc_stop_scanout(struct mga_device *mdev)
>   {
>   	u8 tmp;
> -	int iter_max;
> +	int ret;
>   
>   	/*
>   	 * 1 - The first step is to inform the BMC of an upcoming mode
> @@ -42,30 +43,22 @@ void mgag200_bmc_stop_scanout(struct mga_device *mdev)
>   
>   	/*
>   	 * 3a- The third step is to verify if there is an active scan.
> -	 * We are waiting for a 0 on remhsyncsts <XSPAREREG<0>).
> +	 * We are waiting for a 0 on remhsyncsts (<XSPAREREG<0>).
>   	 */
> -	iter_max = 300;
> -	while (!(tmp & 0x1) && iter_max) {
> -		WREG8(DAC_INDEX, MGA1064_SPAREREG);
> -		tmp = RREG8(DAC_DATA);
> -		udelay(1000);
> -		iter_max--;
> -	}
> +	ret = read_poll_timeout(RREG_DAC, tmp, !(tmp & 0x1),
> +				1000, 300000, false,
> +				MGA1064_SPAREREG);
> +	if (ret == -ETIMEDOUT)
> +		return;
>   
>   	/*
> -	 * 3b- This step occurs only if the remove is actually
> +	 * 3b- This step occurs only if the remote BMC is actually
>   	 * scanning. We are waiting for the end of the frame which is
>   	 * a 1 on remvsyncsts (XSPAREREG<1>)
>   	 */
> -	if (iter_max) {
> -		iter_max = 300;
> -		while ((tmp & 0x2) && iter_max) {
> -			WREG8(DAC_INDEX, MGA1064_SPAREREG);
> -			tmp = RREG8(DAC_DATA);
> -			udelay(1000);
> -			iter_max--;
> -		}
> -	}
> +	(void)read_poll_timeout(RREG_DAC, tmp, (tmp & 0x2),
> +				1000, 300000, false,
> +				MGA1064_SPAREREG);
>   }
>   
>   void mgag200_bmc_start_scanout(struct mga_device *mdev)
> 
> ---
> base-commit: e535c23513c63f02f67e3e09e0787907029efeaf
> change-id: 20260127-jk-mgag200-fix-bad-udelay-409133777e3a
> 
> Best regards,
> --
> Jacob Keller <jacob.e.keller@...el.com>
>