linux-kernel - Re: [PATCH] drm/mgag200: sleep instead of busy wait for BMC

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e122f276-5d19-4c88-bf41-4a4633ea2c46@intel.com>
Date: Thu, 29 Jan 2026 09:02:58 -0800
From: Jacob Keller <jacob.e.keller@...el.com>
To: Thomas Zimmermann <tzimmermann@...e.de>, Dave Airlie <airlied@...hat.com>,
	Jocelyn Falempe <jfalempe@...hat.com>, Maarten Lankhorst
	<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
	Simona Vetter <simona@...ll.ch>
CC: Pasi Vaananen <pvaanane@...hat.com>, <dri-devel@...ts.freedesktop.org>,
	<linux-kernel@...r.kernel.org>, <stable@...r.kernel.org>
Subject: Re: [PATCH] drm/mgag200: sleep instead of busy wait for BMC



On 1/29/2026 12:15 AM, Thomas Zimmermann wrote:
> Hi,
> 
> thanks for the patch. If I'm not mistaken, you have fixed a long- 
> standing bug. See my comments below.
> 
> Please double check my comments and, if valid, integrate them into the 
> commit description.
> 
> Am 28.01.26 um 21:41 schrieb Jacob Keller:
>> The mgag200_bmc_stop_scanout() function is called by 
>> the .atomic_disable()
>> handler for the MGA G200 VGA BMC encoder. This function performs a few
>> register writes to inform the BMC of an upcoming mode change, and then
>> polls to wait until the BMC actually stops.
>>
>> The polling is implemented using a busy loop with udelay() and an 
>> iteration
>> timeout of 300, resulting in the function blocking for 300 milliseconds.
>>
>> The function gets called ultimately by the output_poll_execute work 
>> thread
>> for the DRM output change polling thread of the mgag200 driver:
>>
>> kworker/0:0-mm_    3528 [000]  4555.315364:
>>          ffffffffaa0e25b3 delay_halt.part.0+0x33
>>          ffffffffc03f6188 mgag200_bmc_stop_scanout+0x178
>>          ffffffffc087ae7a disable_outputs+0x12a
>>          ffffffffc087c12a drm_atomic_helper_commit_tail+0x1a
>>          ffffffffc03fa7b6 
>> mgag200_mode_config_helper_atomic_commit_tail+0x26
>>          ffffffffc087c9c1 commit_tail+0x91
>>          ffffffffc087d51b drm_atomic_helper_commit+0x11b
>>          ffffffffc0509694 drm_atomic_commit+0xa4
>>          ffffffffc05105e8 drm_client_modeset_commit_atomic+0x1e8
>>          ffffffffc0510ce6 drm_client_modeset_commit_locked+0x56
>>          ffffffffc0510e24 drm_client_modeset_commit+0x24
>>          ffffffffc088a743 
>> __drm_fb_helper_restore_fbdev_mode_unlocked+0x93
>>          ffffffffc088a683 drm_fb_helper_hotplug_event+0xe3
>>          ffffffffc050f8aa drm_client_dev_hotplug+0x9a
>>          ffffffffc088555a output_poll_execute+0x29a
>>          ffffffffa9b35924 process_one_work+0x194
>>          ffffffffa9b364ee worker_thread+0x2fe
>>          ffffffffa9b3ecad kthread+0xdd
>>          ffffffffa9a08549 ret_from_fork+0x29
>>
>> On a server running ptp4l with the mgag200 driver loaded, we found that
>> ptp4l would sometimes get blocked from execution because of this busy
>> waiting loop.
>>
>> Every so often, approximately once every 20 minutes though with large
>> variance, the output_poll_execute() thread would detect some sort of 
>> change
>> that required performing a hotplug event which results in attempting to
>> stop the BMC scanout, resulting in a 300msec delay on one CPU.
>>
>> On this system, ptp4l was pinned to a single CPU. When the
>> output_poll_execute() thread ran on that CPU, it blocked ptp4l from
>> executing for its 300millisecond duration.
>>
>> This resulted in PTP service disruptions such as failure to send a SYNC
>> message on time, failure to handle ANNOUNCE messages on time, and clock
>> check warnings from the application. All of this despite the application
>> being configured with FIFO_RT and a higher priority than the background
>> workqueue tasks. (However, note that the kernel did not use
>> CONFIG_PREEMPT...)
>>
>> It is unclear if the event is due to a faulty VGA connection, another 
>> bug,
>> or actual events causing a change in the connection. At least on the 
>> system
>> under test it is not a one-time event and consistently causes 
>> disruption to
>> the time sensitive applications.
> 
> I'm not surprised. The drivers for some of the server chipsets have been 
> imported from old Xorg's user-space code, so that there's something to 
> work with in the kernel. We've since been fixing them to better 
> integrate with the kernel.
> 
> Not busy waiting for the BMC is also a little improvement to any other 
> workload.
> 

Right.

>>
>> Reading through other DRM driver implementations, it does not appear that
>> the .atomic_enable or .atomic_disable handlers need to delay instead of
>> sleep. For example, the ast_astdp_encoder_helper_atomic_disable() 
>> function
>> calls ast_dp_set_phy_sleep() which uses msleep(). The "atomic" in the 
>> name
>> is referring to the atomic modesetting support, which is the support to
>> enable atomic configuration from userspace, and not to the "atomic 
>> context"
>> of the kernel.
> 
> Yes. "Atomic" means "apply all changes to hardware, or none".
> 
>>
>> Replace the busy wait with a sleeping loop based on read_poll_timeout().
>> This ensures that other time sensitive operations are not blocked from
>> executing while the work thread is waiting for the BMC hardware.
>>
>> Note the RREG_DAC is implemented using a statement expression to allow
>> working properly with the read_poll_timeout family of functions. The 
>> other
>> RREG_<TYPE> macros ought to be cleaned up to have better semantics, and
>> several places in the mgag200 driver could make use of RREG_DAC or 
>> similar
>> RREG_* macros should likely be cleaned up for better semantics as 
>> well, but
>> that task has been left as a future cleanup for a non-bugfix.
>>
>> Fixes: 414c45310625 ("mgag200: initial g200se driver (v2)")
>> Signed-off-by: Jacob Keller <jacob.e.keller@...el.com>
> 
> Reviewed-by: Thomas Zimmermann <tzimmermann@...e.de>
> 
>> ---
>> We still do not know if the reconfiguration is caused by a different
>> bug or by a faulty VGA connector or something else. However, there is no
>> reason that this function should be spinning instead of sleeping while
>> waiting for the BMC scan to stop.
>>
>> It is known that removing the mgag200 module avoids the issue. It is also
>> likely that use of CONFIG_PREEMPT (or CONFIG_PREEMPT_RT) could allow the
>> high priority process to preempt the kernel thread even while it is
>> delaying. However, it is better to let the process sleep() so that other
>> tasks can execute even if these steps are not taken.
>>
>> There are multiple other udelay() which likely could safely be 
>> converted to
>> usleep_range(). However they are all short, and I felt that the smallest
>> targeted fix made the most sense. They could perhaps be cleaned up in a
>> non-fix commit or series along with other improvements like fixing the
>> other RREG_* macros.
>> ---
>>   drivers/gpu/drm/mgag200/mgag200_drv.h |  6 ++++++
>>   drivers/gpu/drm/mgag200/mgag200_bmc.c | 27 ++++++++++-----------------
>>   2 files changed, 16 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.h b/drivers/gpu/drm/ 
>> mgag200/mgag200_drv.h
>> index f4bf40cd7c88..a875c4bf8cbe 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_drv.h
>> +++ b/drivers/gpu/drm/mgag200/mgag200_drv.h
>> @@ -111,6 +111,12 @@
>>   #define DAC_INDEX 0x3c00
>>   #define DAC_DATA 0x3c0a
>> +#define RREG_DAC(reg)                        \
>> +    ({                            \
>> +        WREG8(DAC_INDEX, reg);                \
>> +        RREG8(DAC_DATA);                \
>> +    })                            \
>> +
>>   #define WREG_DAC(reg, v)                    \
>>       do {                            \
>>           WREG8(DAC_INDEX, reg);                \
>> diff --git a/drivers/gpu/drm/mgag200/mgag200_bmc.c b/drivers/gpu/drm/ 
>> mgag200/mgag200_bmc.c
>> index a689c71ff165..599b710bab9b 100644
>> --- a/drivers/gpu/drm/mgag200/mgag200_bmc.c
>> +++ b/drivers/gpu/drm/mgag200/mgag200_bmc.c
>> @@ -1,6 +1,7 @@
>>   // SPDX-License-Identifier: GPL-2.0-only
>>   #include <linux/delay.h>
>> +#include <linux/iopoll.h>
>>   #include <drm/drm_atomic_helper.h>
>>   #include <drm/drm_edid.h>
>> @@ -12,7 +13,7 @@
>>   void mgag200_bmc_stop_scanout(struct mga_device *mdev)
>>   {
>>       u8 tmp;
>> -    int iter_max;
>> +    int ret;
>>       /*
>>        * 1 - The first step is to inform the BMC of an upcoming mode
>> @@ -44,28 +45,20 @@ void mgag200_bmc_stop_scanout(struct mga_device 
>> *mdev)
>>        * 3a- The third step is to verify if there is an active scan.
>>        * We are waiting for a 0 on remhsyncsts <XSPAREREG<0>).
>>        */
> 
> Either these comments or the original test seems incorrect.
> 
> The test below is supposed to detect whether the BMC is scanning out 
> from the framebuffer. While it reads a horizontal scanline the bit 
> should be 0. That's what the test is for, but it gets the condition wrong.
> 

Ahh... You're right, I didn't read the change carefully enough.

>> -    iter_max = 300;
>> -    while (!(tmp & 0x1) && iter_max) {
>> -        WREG8(DAC_INDEX, MGA1064_SPAREREG);
>> -        tmp = RREG8(DAC_DATA);
>> -        udelay(1000);
>> -        iter_max--;
>> -    }
>> +    ret = read_poll_timeout(RREG_DAC, tmp, !(tmp & 0x1),
>> +                1000, 300000, false,
>> +                MGA1064_SPAREREG);
> 
> The original while loop ran as long as "!(tmp & 0x1)".  And now the test 
> stops if "!(tmp & 0x1)" AFAICT.  This (accidentally?) fixes the test and 
> makes the comment correct.
> 

Right, but that changes the functionality. I'd need to know whether that 
swap is actually correct.

> 
>> +    if (ret == -ETIMEDOUT)
>> +        return;
>>       /*
>>        * 3b- This step occurs only if the remove is actually
> 
> Since you're at it, maybe fix this comment to say
> 
> '... only if the remote BMC is ...'
> 
>>        * scanning. We are waiting for the end of the frame which is
>>        * a 1 on remvsyncsts (XSPAREREG<1>)
>>        */
>> -    if (iter_max) {
>> -        iter_max = 300;
>> -        while ((tmp & 0x2) && iter_max) {
>> -            WREG8(DAC_INDEX, MGA1064_SPAREREG);
>> -            tmp = RREG8(DAC_DATA);
>> -            udelay(1000);
>> -            iter_max--;
>> -        }
>> -    }
>> +    (void)read_poll_timeout(RREG_DAC, tmp, (tmp & 0x2),
>> +                1000, 300000, false,
>> +                MGA1064_SPAREREG);
> 
> Again, the comment and original code disagree and the original test 
> condition appears to be inverted. It whats to test of the BMC has 
> finished scanning out the final frame. The bit should turn 1. Instead it 
> tests if the bit is already 1, which is likely true. Hence that's 
> probably where your 300 msec delays comes from.

Either that or the first case was checking for a 0, and looping the 
entire time because it was already 0.

I'll need to double check which is true and make a v2. I'd be more 
comfortable keeping the logic identical, which is my mistake on not 
inverting the condition with the change to read_poll_timeout...

> 
> Best regards
> Thomas
> 

Thanks for the careful review.

>>   }
>>   void mgag200_bmc_start_scanout(struct mga_device *mdev)
>>
>> ---
>> base-commit: e535c23513c63f02f67e3e09e0787907029efeaf
>> change-id: 20260127-jk-mgag200-fix-bad-udelay-409133777e3a
>>
>> Best regards,
>> -- 
>> Jacob Keller <jacob.e.keller@...el.com>
>>
>