Message-ID: <CY5PR11MB6366B9836B306AD2A72BDDFEEDC6A@CY5PR11MB6366.namprd11.prod.outlook.com>
Date: Sun, 2 Nov 2025 10:12:56 +0000
From: "Usyskin, Alexander" <alexander.usyskin@...el.com>
To: Guenter Roeck <linux@...ck-us.net>,
Marek Marczykowski-Górecki
<marmarek@...isiblethingslab.com>
CC: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, "Abliyev, Reuven"
<reuven.abliyev@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: [char-misc-next] mei: hook mei_device on class device
> Subject: Re: [char-misc-next] mei: hook mei_device on class device
>
> Hi,
>
> On Tue, Aug 26, 2025 at 03:56:17PM +0300, Alexander Usyskin wrote:
> > mei_device lifetime was managed by devm procedure of parent device.
> > But such memory is freed on device_del.
> > Mei_device object is used by client object that may be alive after
> > parent device is removed.
> > It may lead to use-after-free if discrete graphics driver unloads
> > mei_gsc auxiliary device while user-space holds open handle to mei
> > character device.
> >
> > Connect mei_device structure lifetime to mei class device lifetime
> > by adding mei_device free to class device remove callback.
> >
> > Move existing parent device pointer to separate field in mei_device
> > to avoid misuse.
> >
> > Allocate character device dynamically and allow to control its own
> > lifetime as it may outlive mei_device structure while character
> > device closes after parent device is removed from the system.
> >
> > Leave power management on parent device as we overwrite pci runtime
> > pm procedure and user-space is expecting it there.
> >
> > Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14201
> > Signed-off-by: Alexander Usyskin <alexander.usyskin@...el.com>
>
> This patch results in:
>
> [ 20.260342] mei mei0: wait hw ready failed
> [ 20.264530] mei mei0: hw_start failed ret = -62 fw status = 00070355
> 002F0006 00000000 00000000 00000000 00000000
> [ 22.308353] mei mei0: wait hw ready failed
> [ 22.312489] mei mei0: hw_start failed ret = -62 fw status = 00070355
> 002F0006 00000000 00000000 00000000 00000000
> [ 24.356433] mei mei0: wait hw ready failed
> [ 24.360577] mei mei0: hw_start failed ret = -62 fw status = 00070355
> 002F0006 00000000 00000000 00000000 00000000
> [ 24.370911] mei mei0: reset: reached maximal consecutive resets: disabling
> the device
> [ 24.378787] mei mei0: reset failed ret = -19
> [ 24.383079] mei mei0: link layer initialization failed.
> [ 24.388329] mei_me 0000:00:16.0: init hw failure.
> [ 51.219835] watchdog: BUG: soft lockup - CPU#0 stuck for 26s!
> [kworker/0:4:838]
> [ 79.219833] watchdog: BUG: soft lockup - CPU#0 stuck for 52s!
> [kworker/0:4:838]
> [ 107.219832] watchdog: BUG: soft lockup - CPU#0 stuck for 78s!
> [kworker/0:4:838]
> [ 135.219831] watchdog: BUG: soft lockup - CPU#0 stuck for 104s!
> [kworker/0:4:838]
>
> when trying to run v6.18-rc1/2/3 on a Skylake server with non-functional
> MEI support.
> The problem is only seen if various debug options are enabled. Reverting the
> MEI patches since 6.17 fixes the problem (reverting this patch alone is not
> possible).
>
> Bisect log and more verbose backtrace attached for reference.
>
> Please let me know if there is anything I can do to help tracking down the
> problem.
>
> Thanks,
> Guenter
>
> ----
> Backtrace:
>
> [ 93.187907] watchdog: BUG: soft lockup - CPU#0 stuck for 52s!
> [kworker/0:2:834]
> [ 93.187909] Modules linked in:
> [ 93.187909] irq event stamp: 523739110
> [ 93.187910] hardirqs last enabled at (523739109): [<ffffffffa43975d7>]
> work_grab_pending+0x1a7/0x360
> [ 93.187912] hardirqs last disabled at (523739110): [<ffffffffa5203b9e>]
> sysvec_apic_timer_interrupt+0xe/0x90
> [ 93.187914] softirqs last enabled at (523364938): [<ffffffffa4378d9b>]
> __irq_exit_rcu+0x6b/0x140
> [ 93.187916] softirqs last disabled at (523364933): [<ffffffffa4378d9b>]
> __irq_exit_rcu+0x6b/0x140
> [ 93.187918] CPU: 0 UID: 0 PID: 834 Comm: kworker/0:2 Tainted: G L
> 6.18.0-dbg-DEV #2 NONE
> [ 93.187920] Tainted: [L]=SOFTLOCKUP
> [ 93.187920] Hardware name: Google LLC Indus/Indus_QC_03, BIOS
> 30.116.4 08/29/2025
> [ 93.187922] Workqueue: events work_for_cpu_fn
> [ 93.187923] RIP: 0010:work_grab_pending+0x36/0x360
> [ 93.187925] Code: 41 54 53 48 83 ec 38 49 89 d7 89 34 24 48 89 fb 65 48
> 8b 05 cc eb 66 03 48 89 44 24 30 48 8d 47 48 48 89 44 24 08 eb 02 f3 90
> <9c> 8f 44 24 28 48 8b 44 24 28 fa 49 89 07 a9 00 02 00 00 74 05 e8
> [ 93.187927] RSP: 0000:ffffb51d9b59fc60 EFLAGS: 00000202
> [ 93.187928] RAX: 94f926a301ea6700 RBX: ffff9afc8231b418 RCX:
> 0000000000000002
> [ 93.187928] RDX: 0000000000000006 RSI: ffff9afc81b70b20 RDI:
> ffffffffa43975d7
> [ 93.187929] RBP: 0000000000000002 R08: 00000000000e00c8 R09:
> ffffffffffffffff
> [ 93.187930] R10: ffffffffa43974b2 R11: 0000000000000000 R12:
> ffffffffa66e0848
> [ 93.187930] R13: ffffffffa439de1d R14: ffffffffa43974b2 R15:
> ffffb51d9b59fcd0
> [ 93.187931] FS: 0000000000000000(0000) GS:ffff9b59d8812000(0000)
> knlGS:0000000000000000
> [ 93.187932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 93.187932] CR2: ffff9bba9ffff000 CR3: 0000005c0b44c001 CR4:
> 00000000007706f0
> [ 93.187933] PKRU: 55555554
> [ 93.187934] Call Trace:
> [ 93.187934] <TASK>
> [ 93.187936] ? lock_release+0x100/0x340
> [ 93.187939] ? process_scheduled_works+0x26d/0x630
> [ 93.187941] __cancel_work+0x29/0xf0
> [ 93.187945] cancel_work_sync+0x18/0x80
> [ 93.187947] mei_cancel_work+0x19/0x40
> [ 93.187950] mei_me_probe+0x2bf/0x330
> [ 93.187952] ? process_scheduled_works+0x26d/0x630
> [ 93.187954] local_pci_probe+0x45/0x90
> [ 93.187957] work_for_cpu_fn+0x1b/0x30
> [ 93.187959] process_scheduled_works+0x2d3/0x630
> [ 93.187965] worker_thread+0x1e8/0x2f0
> [ 93.187969] kthread+0x21f/0x240
> [ 93.187972] ? __pfx_worker_thread+0x10/0x10
> [ 93.187974] ? lock_release+0x100/0x340
> [ 93.187975] ? ret_from_fork+0x2d/0x2b0
> [ 93.187979] ? __pfx_kthread+0x10/0x10
> [ 93.187982] ret_from_fork+0x197/0x2b0
> [ 93.187984] ? __pfx_kthread+0x10/0x10
> [ 93.187986] ret_from_fork_asm+0x1a/0x30
> [ 93.187994] </TASK>
>
> ---
> Bisect log:
>
> # bad: [3a8660878839faadb4f1a6dd72c3179c1df56787] Linux 6.18-rc1
> # good: [e5f0a698b34ed76002dc5cff3804a61c80233a7a] Linux 6.17
> git bisect start 'HEAD' 'v6.17'
> # good: [58809f614e0e3f4e12b489bddf680bfeb31c0a20] Merge tag 'drm-
> next-2025-10-01' of https://gitlab.freedesktop.org/drm/kernel
> git bisect good 58809f614e0e3f4e12b489bddf680bfeb31c0a20
> # good: [bed0653fe2aacb0ca8196075cffc9e7062e74927] Merge tag 'iommu-
> updates-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
> git bisect good bed0653fe2aacb0ca8196075cffc9e7062e74927
> # bad: [6a74422b9710e987c7d6b85a1ade7330b1e61626] Merge tag
> 'mips_6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
> git bisect bad 6a74422b9710e987c7d6b85a1ade7330b1e61626
> # good: [b66451723c45b791fd2824d1b8f62fe498989e23] Merge tag
> 'platform-drivers-x86-v6.18-1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
> git bisect good b66451723c45b791fd2824d1b8f62fe498989e23
> # good: [59697e061f6aec86d5738cd4752e16520f1d60dc] Merge tag
> 'staging-6.18-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
> git bisect good 59697e061f6aec86d5738cd4752e16520f1d60dc
> # good: [561285d048053fec8a3d6d1e3ddc60df11c393a0] MAINTAINERS:
> Support ROHM BD79112 ADC
> git bisect good 561285d048053fec8a3d6d1e3ddc60df11c393a0
> # bad: [eafedbc7c050c44744fbdf80bdf3315e860b7513] rust_binder: add
> Rust Binder driver
> git bisect bad eafedbc7c050c44744fbdf80bdf3315e860b7513
> # good: [0c82fd9609a1e4bf1db84b0fd56bc3b2773da179] ibmasm: Replace
> kzalloc() + copy_from_user() with memdup_user_nul()
> git bisect good 0c82fd9609a1e4bf1db84b0fd56bc3b2773da179
> # bad: [ef509269d93d9832b366005f9626b44e38cc0ca7] Merge tag
> 'counter-updates-for-6.18' of
> ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/wbg/counter into char-
> misc-next
> git bisect bad ef509269d93d9832b366005f9626b44e38cc0ca7
> # bad: [ae0de6333368fd8c4535f5dbdfe1b2660438e089] slimbus:
> messaging: fix "transfered"->"transferred"
> git bisect bad ae0de6333368fd8c4535f5dbdfe1b2660438e089
> # bad: [7704e6be4ed2835832c445807cdcb2d56d8a8430] mei: hook
> mei_device on class device
> git bisect bad 7704e6be4ed2835832c445807cdcb2d56d8a8430
> # good: [ceda408c0d1d41094ad125332c6fb1d488e61c0c] misc: remove
> ineffective WARN_ON() check from misc_deregister()
> git bisect good ceda408c0d1d41094ad125332c6fb1d488e61c0c
> # good: [76254bc489d39dae9a3427f0984fe64213d20548] cdx: Fix device
> node reference leak in cdx_msi_domain_init
> git bisect good 76254bc489d39dae9a3427f0984fe64213d20548
> # first bad commit: [7704e6be4ed2835832c445807cdcb2d56d8a8430] mei:
> hook mei_device on class device
It seems I missed the error flow in probe (my test machines always have the ME in a good state).
The patch below should fix the problem; can you confirm?
From c58f311df60f26df2efe1e0f9fc523bfa4b93936 Mon Sep 17 00:00:00 2001
From: Alexander Usyskin <alexander.usyskin@...el.com>
Date: Sun, 2 Nov 2025 10:57:22 +0200
Subject: [PATCH] mei: fix error flow in probe
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Dismantle the class device last in the probe error flow to avoid accessing freed memory, as in:
[ 87.926774] WARNING: CPU: 9 PID: 518 at kernel/workqueue.c:4234
__flush_work+0x340/0x390
...
[ 87.926912] Workqueue: async async_run_entry_fn
[ 87.926918] RIP: e030:__flush_work+0x340/0x390
[ 87.926923] Code: 26 9d 05 00 65 48 8b 15 26 3c ca 02 48 85 db 48 8b
04 24 48 89 54 24 58 0f 85 de fe ff ff e9 f6 fd ff ff 0f 0b e9 77 ff ff
ff <0f> 0b e9 70 ff ff ff 0f 0b e9 19 ff ff ff e8 7d 8b 0e 01 48 89 de
[ 87.926931] RSP: e02b:ffffc900412ebc00 EFLAGS: 00010246
[ 87.926936] RAX: 0000000000000000 RBX: ffff888103e55090 RCX: 0000000000000000
[ 87.926941] RDX: 000fffffffe00000 RSI: 0000000000000001 RDI: ffffc900412ebc60
[ 87.926945] RBP: ffff888103e55090 R08: ffffffffc1266ec8 R09: ffff8881109076e8
[ 87.926949] R10: 0000000080040003 R11: 0000000000000000 R12: ffff888103e54000
[ 87.926953] R13: ffffc900412ebc18 R14: 0000000000000001 R15: 0000000000000000
[ 87.926962] FS: 0000000000000000(0000) GS:ffff888233238000(0000) knlGS:0000000000000000
[ 87.926967] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 87.926971] CR2: 00007e7923b32708 CR3: 00000001088df000 CR4: 0000000000050660
[ 87.926977] Call Trace:
[ 87.926981] <TASK>
[ 87.926987] ? __call_rcu_common.constprop.0+0x11e/0x310
[ 87.926993] cancel_work_sync+0x5e/0x80
[ 87.926999] mei_cancel_work+0x19/0x40 [mei]
[ 87.927051] mei_me_probe+0x273/0x2b0 [mei_me]
[ 87.927060] local_pci_probe+0x45/0x90
[ 87.927066] pci_call_probe+0x5b/0x180
[ 87.927070] pci_device_probe+0x95/0x140
[ 87.927074] ? driver_sysfs_add+0x57/0xc0
[ 87.927079] really_probe+0xde/0x340
[ 87.927083] ? pm_runtime_barrier+0x54/0x90
[ 87.927087] __driver_probe_device+0x78/0x110
[ 87.927092] driver_probe_device+0x1f/0xa0
[ 87.927095] __driver_attach_async_helper+0x5e/0xe0
[ 87.927100] async_run_entry_fn+0x34/0x130
[ 87.927104] process_one_work+0x18d/0x340
[ 87.927108] worker_thread+0x256/0x3a0
[ 87.927111] ? __pfx_worker_thread+0x10/0x10
[ 87.927115] kthread+0xfc/0x240
[ 87.927120] ? __pfx_kthread+0x10/0x10
[ 87.927124] ? __pfx_kthread+0x10/0x10
[ 87.927127] ret_from_fork+0xf5/0x110
[ 87.927132] ? __pfx_kthread+0x10/0x10
[ 87.927136] ret_from_fork_asm+0x1a/0x30
[ 87.927141] </TASK>
Reported-by: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
Reported-by: Guenter Roeck <linux@...ck-us.net>
Fixes: 7704e6be4ed2 ("mei: hook mei_device on class device")
Signed-off-by: Alexander Usyskin <alexander.usyskin@...el.com>
---
drivers/misc/mei/pci-me.c | 13 ++++++-------
drivers/misc/mei/pci-txe.c | 13 ++++++-------
drivers/misc/mei/platform-vsc.c | 11 +++++------
3 files changed, 17 insertions(+), 20 deletions(-)
diff --git a/drivers/misc/mei/pci-me.c b/drivers/misc/mei/pci-me.c
index b017ff29dbd1..73cad914be9f 100644
--- a/drivers/misc/mei/pci-me.c
+++ b/drivers/misc/mei/pci-me.c
@@ -223,6 +223,10 @@ static int mei_me_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
hw->mem_addr = pcim_iomap_table(pdev)[0];
hw->read_fws = mei_me_read_fws;
+ err = mei_register(dev, &pdev->dev);
+ if (err)
+ goto end;
+
pci_enable_msi(pdev);
hw->irq = pdev->irq;
@@ -237,13 +241,9 @@ static int mei_me_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
if (err) {
dev_err(&pdev->dev, "request_threaded_irq failure. irq = %d\n",
pdev->irq);
- goto end;
+ goto deregister;
}
- err = mei_register(dev, &pdev->dev);
- if (err)
- goto release_irq;
-
if (mei_start(dev)) {
dev_err(&pdev->dev, "init hw failure.\n");
err = -ENODEV;
@@ -283,11 +283,10 @@ static int mei_me_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
return 0;
deregister:
- mei_deregister(dev);
-release_irq:
mei_cancel_work(dev);
mei_disable_interrupts(dev);
free_irq(pdev->irq, dev);
+ mei_deregister(dev);
end:
dev_err(&pdev->dev, "initialization failed.\n");
return err;
diff --git a/drivers/misc/mei/pci-txe.c b/drivers/misc/mei/pci-txe.c
index 06b55a891c6b..98d1bc2c7f4b 100644
--- a/drivers/misc/mei/pci-txe.c
+++ b/drivers/misc/mei/pci-txe.c
@@ -87,6 +87,10 @@ static int mei_txe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
hw = to_txe_hw(dev);
hw->mem_addr = pcim_iomap_table(pdev);
+ err = mei_register(dev, &pdev->dev);
+ if (err)
+ goto end;
+
pci_enable_msi(pdev);
/* clear spurious interrupts */
@@ -106,13 +110,9 @@ static int mei_txe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
if (err) {
dev_err(&pdev->dev, "mei: request_threaded_irq failure. irq = %d\n",
pdev->irq);
- goto end;
+ goto deregister;
}
- err = mei_register(dev, &pdev->dev);
- if (err)
- goto release_irq;
-
if (mei_start(dev)) {
dev_err(&pdev->dev, "init hw failure.\n");
err = -ENODEV;
@@ -145,11 +145,10 @@ static int mei_txe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
return 0;
deregister:
- mei_deregister(dev);
-release_irq:
mei_cancel_work(dev);
mei_disable_interrupts(dev);
free_irq(pdev->irq, dev);
+ mei_deregister(dev);
end:
dev_err(&pdev->dev, "initialization failed.\n");
return err;
diff --git a/drivers/misc/mei/platform-vsc.c b/drivers/misc/mei/platform-vsc.c
index 288e7b72e942..9787b9cee71c 100644
--- a/drivers/misc/mei/platform-vsc.c
+++ b/drivers/misc/mei/platform-vsc.c
@@ -362,28 +362,27 @@ static int mei_vsc_probe(struct platform_device *pdev)
ret = mei_register(mei_dev, dev);
if (ret)
- goto err_dereg;
+ goto err;
ret = mei_start(mei_dev);
if (ret) {
dev_err_probe(dev, ret, "init hw failed\n");
- goto err_cancel;
+ goto err;
}
pm_runtime_enable(mei_dev->parent);
return 0;
-err_dereg:
- mei_deregister(mei_dev);
-
-err_cancel:
+err:
mei_cancel_work(mei_dev);
vsc_tp_register_event_cb(tp, NULL, NULL);
mei_disable_interrupts(mei_dev);
+ mei_deregister(mei_dev);
+
return ret;
}
--
2.43.0
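
For readers following the ordering argument, here is a minimal user-space
sketch of reverse-order unwinding in a probe-style function. It is not driver
code: probe_sim(), note(), the step enum and the label names are all
hypothetical, and it uses one label per stage rather than mirroring the patch
exactly. The only point it illustrates is that whatever is set up first (here
"register") is torn down last.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the probe stages; none of these are kernel APIs. */
enum step { STEP_REGISTER = 1, STEP_IRQ, STEP_START };

/* Append "what " to the log, silently truncating if the buffer is full. */
static void note(char *log, size_t len, const char *what)
{
	size_t used = strlen(log);

	if (used < len)
		snprintf(log + used, len - used, "%s ", what);
}

/*
 * Simulated probe. fail_at selects the stage that fails (0 = none).
 * Every setup and teardown action is recorded in log, so the order
 * can be inspected afterwards. Returns 0 on success, -1 on failure.
 */
static int probe_sim(int fail_at, char *log, size_t len)
{
	int err = -1;

	assert(log != NULL && len > 0);
	log[0] = '\0';

	/* Stage 1: the object everything else depends on comes up first. */
	if (fail_at == STEP_REGISTER)
		goto end;
	note(log, len, "register");

	/* Stage 2: interrupt setup, which refers to the registered object. */
	if (fail_at == STEP_IRQ)
		goto deregister;
	note(log, len, "request_irq");

	/* Stage 3: start the hardware. */
	if (fail_at == STEP_START)
		goto release_irq;
	note(log, len, "start");

	return 0;

	/* Unwind in reverse: users of the object go first, the object last. */
release_irq:
	note(log, len, "cancel_work");
	note(log, len, "free_irq");
deregister:
	note(log, len, "deregister");
end:
	return err;
}
```

Calling probe_sim(STEP_START, buf, sizeof(buf)) with a char buf[128] leaves
the recorded order in buf, with "deregister" as the final entry.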