Message-ID: <CABXGCsMmtqzBfUykT-JgyhZn-7ZXtftHL35znDdYuTnUOpGnoQ@mail.gmail.com>
Date: Sat, 20 Jul 2024 22:08:43 +0500
From: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To: amd-gfx list <amd-gfx@...ts.freedesktop.org>, 
	dri-devel <dri-devel@...ts.freedesktop.org>, 
	Christian König <christian.koenig@....com>, 
	"Deucher, Alexander" <alexander.deucher@....com>, mukul.joshi@....com, 
	Linux List Kernel Mailing <linux-kernel@...r.kernel.org>, 
	Linux regressions mailing list <regressions@...ts.linux.dev>
Subject: 6.10/bisected/regression - Since commit e356d321d024 in the kernel
 log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)"
 which were never seen before

Hi,
Since 6.10-rc5 I have been seeing "MES failed to respond to msg=MISC
(WAIT_REG_MEM)" messages in my kernel log.
This message is usually followed by "[drm:amdgpu_mes_reg_write_reg_wait
[amdgpu]] *ERROR* failed to reg_write_reg_wait".

[ 8972.590502] input: Noble FoKus Mystique (AVRCP) as
/devices/virtual/input/input21
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748433] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748434] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748494] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748493] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748476] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748479] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748477] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748478] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748661] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9964.748770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9977.224893] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[ 9980.347061] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.347077] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349857] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349858] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349859] amdgpu 0000:03:00.0: amdgpu: MES failed to respond to
msg=MISC (WAIT_REG_MEM)
[ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349868] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349870] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349890] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349865] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349866] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349867] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349869] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[ 9980.349871] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR*
failed to reg_write_reg_wait
[10037.250083] Bluetooth: hci0: ACL packet for unknown connection handle 3837
[12054.238867] workqueue: gc_worker [nf_conntrack] hogged CPU for
>10000us 1027 times, consider switching to WQ_UNBOUND
[12851.087896] fossilize_repla (45968) used greatest stack depth:
17440 bytes left
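
For anyone triaging a similar log, here is a minimal Python sketch that counts how many "MES failed to respond" records land in each second of uptime. It is only an illustration, not part of the report: the function names (`unwrap`, `count_mes_failures`) are hypothetical, and it assumes the log is mail-wrapped as above, so that any line not starting with a "[seconds.micros]" stamp continues the previous record.

```python
import re
from collections import Counter

# Start of a dmesg record, e.g. "[ 9964.748433] amdgpu 0000:03:00.0: ..."
RECORD_START = re.compile(r"^\[\s*(\d+)\.(\d+)\]\s?(.*)$")


def unwrap(lines):
    """Join mail-wrapped continuation lines back onto their dmesg record.

    Returns a list of (uptime_seconds, message_text) tuples.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        m = RECORD_START.match(line)
        if m:
            records.append([int(m.group(1)), m.group(3)])
        elif records and line.strip():
            # Assumption: a line without a timestamp is a wrapped tail.
            records[-1][1] += " " + line.strip()
    return [(sec, text) for sec, text in records]


def count_mes_failures(
    lines, needle="MES failed to respond to msg=MISC (WAIT_REG_MEM)"
):
    """Count records containing `needle`, grouped by whole second of uptime."""
    counts = Counter()
    for sec, text in unwrap(lines):
        if needle in text:
            counts[sec] += 1
    return dict(counts)
```

Feeding it the lines of a saved dmesg excerpt (for example `count_mes_failures(open("dmesg.txt"))`, where the file name is just a placeholder) gives a quick view of how bursty the failures are.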

Unfortunately, it is not easily reproducible.
It usually appears after I have played the game "STAR WARS Jedi:
Survivor" for several hours.
That is why the bisection took so long.

git bisect start
# status: waiting for both good and bad commits
# bad: [f2661062f16b2de5d7b6a5c42a9a5c96326b8454] Linux 6.10-rc5
git bisect bad f2661062f16b2de5d7b6a5c42a9a5c96326b8454
# good: [50736169ecc8387247fe6a00932852ce7b057083] Merge tag
'for-6.10-rc4-tag' of
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good 50736169ecc8387247fe6a00932852ce7b057083
# bad: [d4ba3313e84dfcdeb92a13434a2d02aad5e973e1] Merge tag
'loongarch-fixes-6.10-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
git bisect bad d4ba3313e84dfcdeb92a13434a2d02aad5e973e1
# good: [264efe488fd82cf3145a3dc625f394c61db99934] Merge tag
'ovl-fixes-6.10-rc5' of
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
git bisect good 264efe488fd82cf3145a3dc625f394c61db99934
# bad: [35bb670d65fc0f80c62383ab4f2544cec85ac57a] Merge tag
'scsi-fixes' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad 35bb670d65fc0f80c62383ab4f2544cec85ac57a
# good: [f0d576f840153392d04b2d52cf3adab8f62e8cb6] drm/amdgpu: fix
UBSAN warning in kv_dpm.c
git bisect good f0d576f840153392d04b2d52cf3adab8f62e8cb6
# bad: [07e06189c5ea7ffe897d12b546c918380d3bffb1] Merge tag
'amd-drm-fixes-6.10-2024-06-19' of
https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
git bisect bad 07e06189c5ea7ffe897d12b546c918380d3bffb1
# bad: [ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc] drm/amdgpu: init TA
fw for psp v14
git bisect bad ed5a4484f074aa2bfb1dad99ff3628ea8da4acdc
# bad: [e356d321d0240663a09b139fa3658ddbca163e27] drm/amdgpu: cleanup
MES11 command submission
git bisect bad e356d321d0240663a09b139fa3658ddbca163e27
# first bad commit: [e356d321d0240663a09b139fa3658ddbca163e27]
drm/amdgpu: cleanup MES11 command submission

Author: Christian König <christian.koenig@....com>
Date:   Fri May 31 10:56:00 2024 +0200

    drm/amdgpu: cleanup MES11 command submission

    The approach of having a separate WB slot for each submission doesn't
    really work well and for example breaks GPU reset.

    Use a status query packet for the fence update instead since those
    should always succeed we can use the fence of the original packet to
    signal the state of the operation.

    While at it cleanup the coding style.

    Fixes: eef016ba8986 ("drm/amdgpu/mes11: Use a separate fence per transaction")
    Reviewed-by: Mukul Joshi <mukul.joshi@....com>
    Signed-off-by: Christian König <christian.koenig@....com>
    Signed-off-by: Alex Deucher <alexander.deucher@....com>
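
To make the bisect log above easier to scan, here is a small Python sketch that extracts the good/bad verdicts and the first bad commit from `git bisect log` style text. It is a hedged illustration only: `parse_bisect_log` is a hypothetical helper, and it assumes the mail-wrapped layout shown above, where a wrapped subject continues on lines that start with neither "#" nor "git bisect".

```python
import re

# "# good: [<40-hex sha>] subject" or "# bad: [<40-hex sha>] subject"
VERDICT = re.compile(r"^# (good|bad): \[([0-9a-f]{40})\]\s*(.*)$")
# "# first bad commit: [<40-hex sha>] subject"
FIRST_BAD = re.compile(r"^# first bad commit: \[([0-9a-f]{40})\]\s*(.*)$")


def parse_bisect_log(text):
    """Return (steps, first_bad) parsed from bisect log text.

    steps is a list of dicts with keys "verdict", "sha", "subject";
    first_bad is a dict with "sha" and "subject", or None if absent.
    """
    steps, first_bad = [], None
    current = None  # record that wrapped continuation lines attach to
    for line in text.splitlines():
        verdict = VERDICT.match(line)
        culprit = FIRST_BAD.match(line)
        if verdict:
            current = {
                "verdict": verdict.group(1),
                "sha": verdict.group(2),
                "subject": verdict.group(3),
            }
            steps.append(current)
        elif culprit:
            current = {"sha": culprit.group(1), "subject": culprit.group(2)}
            first_bad = current
        elif line.startswith("git bisect") or line.startswith("#"):
            current = None  # a command or other comment ends the record
        elif current is not None and line.strip():
            # Assumption: anything else is a mail-wrapped subject tail.
            current["subject"] = (current["subject"] + " " + line.strip()).strip()
    return steps, first_bad
```

Printing `first_bad["sha"][:12]` and `first_bad["subject"]` from the result gives the short form usually quoted in regression reports.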

And I can confirm that after reverting e356d321d024 I played for the
whole day, and the "MES failed to respond" error message no longer
appeared.

My hardware specs are: https://linux-hardware.org/?probe=78d8c680db

Christian, can you look into it, please?

-- 
Best Regards,
Mike Gavrilov.

Download attachment "dmesg.zip" of type "application/zip" (57513 bytes)

Download attachment ".config.zip" of type "application/zip" (66515 bytes)
