Message-ID: <CAJZ5v0gjt3ptZ8=qJcXagZfXrJbpcz7nDwZxRvg50PytdNScRQ@mail.gmail.com>
Date: Mon, 26 Aug 2024 11:58:12 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Nícolas F. R. A. Prado <nfraprado@...labora.com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>, 
	Lukasz Luba <lukasz.luba@....com>, Zhang Rui <rui.zhang@...el.com>, regressions@...ts.linux.dev, 
	kernelci@...ts.linux.dev, kernel@...labora.com
Subject: Re: [PATCH v3 00/14] thermal: Rework binding cooling devices to trip points

On Sat, Aug 24, 2024 at 8:45 PM Nícolas F. R. A. Prado
<nfraprado@...labora.com> wrote:
>
> On Mon, Aug 19, 2024 at 05:49:07PM +0200, Rafael J. Wysocki wrote:
> > Hi Everyone,
> >
> > This is one more update of
> >
> > https://lore.kernel.org/linux-pm/3134863.CbtlEUcBR6@rjwysocki.net/#r
> >
> > the cover letter of which was sent separately by mistake:
> >
> > https://lore.kernel.org/linux-pm/CAJZ5v0jo5vh2uD5t4GqBnN0qukMBG_ty33PB=NiEqigqxzBcsw@mail.gmail.com/
> >
> > and it has been updated once already:
> >
> > https://lore.kernel.org/linux-pm/114901234.nniJfEyVGO@rjwysocki.net/
> >
> > Relative to the v2 above it drops 3 patches, one because it was broken ([04/17]
> > in the v2), and two more that would need to be rebased significantly, either
> > because of dropping the other broken patch or because of the recent Bang-bang
> > governor fixes:
> >
> > https://lore.kernel.org/linux-pm/1903691.tdWV9SEqCh@rjwysocki.net/
> >
> > The remaining 14 patches, 2 of which have been slightly rebased and the rest
> > of which are mostly unchanged (except for some very minor subject and changelog
> > fixes), are not expected to be controversial and are targeting 6.12, on top of
> > the current linux-next material.
> >
> > The original motivation for this series quoted below has not changed:
> >
> >  The code for binding cooling devices to trip points (and unbinding them from
> >  trip points) is one of the murkiest pieces of the thermal subsystem.  It is
> >  convoluted, bloated with unnecessary code doing questionable things, and it
> >  works backwards.
> >
> >  The idea is to bind cooling devices to trip points in accordance with some
> >  information known to the thermal zone owner (thermal driver).  This information
> >  is not known to the thermal core when the thermal zone is registered, so the
> >  driver needs to be involved, but instead of just asking the driver whether
> >  or not the given cooling device should be bound to a given trip point, the
> >  thermal core expects the driver to carry out all of the binding process,
> >  including calling functions specifically provided by the core for this
> >  purpose, which is cumbersome and counter-intuitive.
> >
> >  Because the driver has no information regarding the representation of the trip
> >  points at the core level, it is forced to walk them (and it has to avoid some
> >  locking traps while doing this), or it needs to make questionable assumptions
> >  regarding the ordering of the trips in the core.  There are drivers doing both
> >  these things.
> >
> > The first 5 patches in the series are preliminary.
> >
> > Patch [06/14] introduces a new .should_bind() callback for thermal zones and
> > patches [07,09-12/14] modify drivers to use it instead of the .bind() and
> > .unbind() callbacks, which allows them to be simplified quite a bit.
> >
> > The other patches [08,13-14/14] get rid of code that becomes unused after the
> > previous changes and do some cleanups on top of that.
> >
> > The entire series along with 2 patches on top of it (that were present in the
> > v2 of this set of patches) is available in the thermal-core-testing git branch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=thermal-core-testing
> >
> > (note that this branch is going to be rebased shortly on top of 6.11-rc4
> > and the thermal control material in linux-next).
> >
> > Thanks!
>
> Hi,
>
> KernelCI has identified a boot regression originating from this series. I've
> verified that reverting the series fixes the issue.

Thanks for the report!

There was a bug in the original patch [12/14] that would cause
symptoms like the ones you are observing.  It was reported on
Friday and has since been fixed in the tree.  Please see:

https://lore.kernel.org/linux-pm/CAJZ5v0iw7uXE_cfU5VXOjFDg9GM8Hu0+hKxqfzU3v0OM5KK9oQ@mail.gmail.com/

You probably have not tested the fixed tree yet, so please let
KernelCI run on it again and let me know if the issue is still
there.


> Affected platforms:
> * mt8195-cherry-tomato-r2
> * mt8192-asurada-spherion-r0
> * mt8183-kukui-jacuzzi-juniper-sku16
> * mt8186-corsola-steelix-sku131072
> * sc7180-trogdor-kingoftown
> * sc7180-trogdor-lazor-limozeen
>
> Relevant log from mt8195-cherry-tomato-r2 (with additional debug configs
> enabled):
>
> [   11.326726] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1578
> [   11.335294] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 165, name: udevd
> [   11.342944] preempt_count: 1, expected: 0
> [   11.346943] RCU nest depth: 0, expected: 0
> [   11.351028] 4 locks held by udevd/165:
> [   11.354766]  #0: ffff4dc8825db0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x130/0x4a8
> [   11.363207]  #1: ffffc208f386c3c8 (thermal_list_lock){+.+.}-{3:3}, at: thermal_zone_device_register_with_trips+0x85c/0xcd8
> [   11.374248]  #2: ffff4dc7dc3586f0 (&tz->lock){+.+.}-{3:3}, at: thermal_zone_cdev_binding.part.0+0x98/0x280
> [   11.383896]  #3: ffffc208f39b7b78 (devtree_lock){....}-{2:2}, at: of_get_next_child+0x2c/0xc4
> [   11.392418] irq event stamp: 173740
> [   11.395895] hardirqs last  enabled at (173739): [<ffffc208ecde804c>] _raw_spin_unlock_irqrestore+0x84/0x90
> [   11.405537] hardirqs last disabled at (173740): [<ffffc208ecde6f7c>] _raw_spin_lock_irqsave+0xe0/0xf4
> [   11.414742] softirqs last  enabled at (172404): [<ffffc208e978bb20>] handle_softirqs+0x534/0x874
> [   11.423517] softirqs last disabled at (172393): [<ffffc208e961097c>] __do_softirq+0x14/0x20
> [   11.431857] CPU: 5 UID: 0 PID: 165 Comm: udevd Not tainted 6.11.0-rc4-next-20240822-00002-gfbbbf9faa56a #628
> [   11.441670] Hardware name: Acer Tomato (rev2) board (DT)
> [   11.446970] Call trace:
> [   11.449407]  dump_backtrace+0x98/0xf0
> [   11.453059]  show_stack+0x18/0x24
> [   11.456364]  dump_stack_lvl+0x90/0xd0
> [   11.460018]  dump_stack+0x1c/0x28
> [   11.463322]  __might_resched+0x358/0x570
> [   11.467234]  __might_sleep+0xa4/0x16c
> [   11.470885]  down_write+0x8c/0x21c
> [   11.474277]  kernfs_remove+0x64/0x98
> [   11.477844]  sysfs_remove_dir+0xa8/0xe8
> [   11.481669]  __kobject_del+0xb0/0x27c
> [   11.485321]  kobject_release+0xfc/0x134
> [   11.489146]  kobject_put+0xb0/0x130
> [   11.492624]  of_node_put+0x18/0x28
> [   11.496016]  of_get_next_child+0x64/0xc4
> [   11.499929]  thermal_of_should_bind+0x154/0x390
> [   11.504449]  thermal_zone_cdev_binding.part.0+0x174/0x280
> [   11.509836]  thermal_zone_device_register_with_trips+0x914/0xcd8
> [   11.515831]  thermal_of_zone_register+0x284/0x464
> [   11.520523]  devm_thermal_of_zone_register+0x80/0xf4
> [   11.525476]  lvts_domain_init+0x500/0x760 [lvts_thermal]
> [   11.530785]  lvts_probe+0x1b4/0x3ac [lvts_thermal]
> [   11.535565]  platform_probe+0xc4/0x214
> [   11.539303]  really_probe+0x188/0x5d0
> [   11.542954]  __driver_probe_device+0x160/0x2e8
> [   11.547386]  driver_probe_device+0x5c/0x298
> [   11.551558]  __driver_attach+0x13c/0x4a8
> [   11.555470]  bus_for_each_dev+0xf8/0x180
> [   11.559383]  driver_attach+0x3c/0x58
> [   11.562947]  bus_add_driver+0x1c4/0x458
> [   11.566772]  driver_register+0xf4/0x3c0
> [   11.570598]  __platform_driver_register+0x60/0x88
> [   11.575291]  lvts_driver_init+0x20/0x1000 [lvts_thermal]
> [   11.580593]  do_one_initcall+0xcc/0x284
> [   11.584418]  do_init_module+0x278/0x740
> [   11.588244]  load_module+0xed8/0x1434
> [   11.591897]  init_module_from_file+0xdc/0x1fc
> [   11.596243]  idempotent_init_module+0x2bc/0x604
> [   11.600762]  __arm64_sys_finit_module+0xac/0x100
> [   11.605368]  invoke_syscall+0x6c/0x258
> [   11.609107]  el0_svc_common.constprop.0+0xac/0x230
> [   11.613886]  do_el0_svc+0x40/0x58
> [   11.617190]  el0_svc+0x48/0xb8
> [   11.620234]  el0t_64_sync_handler+0x100/0x12c
> [   11.624580]  el0t_64_sync+0x190/0x194
> [   11.628233]
> [   11.629713] =============================
> [   11.633708] [ BUG: Invalid wait context ]
> [   11.637705] 6.11.0-rc4-next-20240822-00002-gfbbbf9faa56a #628 Tainted: G        W
> [   11.645953] -----------------------------
> [   11.649950] udevd/165 is trying to lock:
> [   11.653859] ffff4dc880881148 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_remove+0x64/0x98
> [   11.662200] other info that might help us debug this:
> [   11.667238] context-{4:4}
> [   11.669846] 4 locks held by udevd/165:
> [   11.673582]  #0: ffff4dc8825db0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x130/0x4a8
> [   11.682009]  #1: ffffc208f386c3c8 (thermal_list_lock){+.+.}-{3:3}, at: thermal_zone_device_register_with_trips+0x85c/0xcd8
> [   11.693041]  #2: ffff4dc7dc3586f0 (&tz->lock){+.+.}-{3:3}, at: thermal_zone_cdev_binding.part.0+0x98/0x280
> [   11.702684]  #3: ffffc208f39b7b78 (devtree_lock){....}-{2:2}, at: of_get_next_child+0x2c/0xc4
> [   11.711199] stack backtrace:
> [   11.714067] CPU: 5 UID: 0 PID: 165 Comm: udevd Tainted: G        W          6.11.0-rc4-next-20240822-00002-gfbbbf9faa56a #628
> [   11.725355] Tainted: [W]=WARN
> [   11.728310] Hardware name: Acer Tomato (rev2) board (DT)
> [   11.733608] Call trace:
> [   11.736041]  dump_backtrace+0x98/0xf0
> [   11.739692]  show_stack+0x18/0x24
> [   11.742994]  dump_stack_lvl+0x90/0xd0
> [   11.746645]  dump_stack+0x1c/0x28
> [   11.749948]  __lock_acquire+0x10f8/0x2710
> [   11.753948]  lock_acquire.part.0+0x218/0x518
> [   11.758206]  lock_acquire+0x90/0xb4
> [   11.761683]  down_write+0xb4/0x21c
> [   11.765074]  kernfs_remove+0x64/0x98
> [   11.768637]  sysfs_remove_dir+0xa8/0xe8
> [   11.772461]  __kobject_del+0xb0/0x27c
> [   11.776111]  kobject_release+0xfc/0x134
> [   11.779935]  kobject_put+0xb0/0x130
> [   11.783413]  of_node_put+0x18/0x28
> [   11.786803]  of_get_next_child+0x64/0xc4
> [   11.790714]  thermal_of_should_bind+0x154/0x390
> [   11.795231]  thermal_zone_cdev_binding.part.0+0x174/0x280
> [   11.800617]  thermal_zone_device_register_with_trips+0x914/0xcd8
> [   11.806609]  thermal_of_zone_register+0x284/0x464
> [   11.811301]  devm_thermal_of_zone_register+0x80/0xf4
> [   11.816253]  lvts_domain_init+0x500/0x760 [lvts_thermal]
> [   11.821553]  lvts_probe+0x1b4/0x3ac [lvts_thermal]
> [   11.826332]  platform_probe+0xc4/0x214
> [   11.830069]  really_probe+0x188/0x5d0
> [   11.833719]  __driver_probe_device+0x160/0x2e8
> [   11.838150]  driver_probe_device+0x5c/0x298
> [   11.842320]  __driver_attach+0x13c/0x4a8
> [   11.846230]  bus_for_each_dev+0xf8/0x180
> [   11.850141]  driver_attach+0x3c/0x58
> [   11.853704]  bus_add_driver+0x1c4/0x458
> [   11.857529]  driver_register+0xf4/0x3c0
> [   11.861352]  __platform_driver_register+0x60/0x88
> [   11.866043]  lvts_driver_init+0x20/0x1000 [lvts_thermal]
> [   11.871342]  do_one_initcall+0xcc/0x284
> [   11.875166]  do_init_module+0x278/0x740
> [   11.878990]  load_module+0xed8/0x1434
> [   11.882641]  init_module_from_file+0xdc/0x1fc
> [   11.886986]  idempotent_init_module+0x2bc/0x604
> [   11.891504]  __arm64_sys_finit_module+0xac/0x100
> [   11.896109]  invoke_syscall+0x6c/0x258
> [   11.899846]  el0_svc_common.constprop.0+0xac/0x230
> [   11.904624]  do_el0_svc+0x40/0x58
> [   11.907927]  el0_svc+0x48/0xb8
> [   11.910969]  el0t_64_sync_handler+0x100/0x12c
> [   11.915314]  el0t_64_sync+0x190/0x194
> [   36.261761] watchdog: Watchdog detected hard LOCKUP on cpu 0
> [   36.267414] Modules linked in: cbmem cros_ec_lid_angle cros_ec_sensors(+) cros_ec_sensors_core pcie_mediatek_gen3 sbs_battery cros_kbd_led_backlight industrialio_triggered_buffer kfifo_buf cros_ec_chardev cros_ec_rpmsg lvts_thermal(+) cros_ec_typec leds_cros_ec mtk_svs snd_sof_mt8195 mtk_adsp_common snd_sof_xtensa_dsp snd_sof_of mt6577_auxadc snd_soc_mt8195_afe snd_sof snd_sof_utils mtk_scp mtk_rpmsg mtk_scp_ipi pwm_bl mtk_wdt coreboot_table backlight mt8195_mt6359 ramoops reed_solomon
> [   36.310414] irq event stamp: 197347
> [   36.313890] hardirqs last  enabled at (197347): [<ffffc208ecdc994c>] exit_to_kernel_mode+0x38/0x118
> [   36.322923] hardirqs last disabled at (197346): [<ffffc208ecdcac44>] el1_interrupt+0x24/0x54
> [   36.331347] softirqs last  enabled at (197268): [<ffffc208e978bb20>] handle_softirqs+0x534/0x874
> [   36.340117] softirqs last disabled at (197263): [<ffffc208e961097c>] __do_softirq+0x14/0x20
>
> Full log at http://0x0.st/XyID.txt
>
> Let me know if you need any more information.
>
> #regzbot introduced: next-20240821..next-20240822
> #regzbot title: Hang during boot in sysfs_remove_dir() called by thermal_of_zone_register()
>
> Thanks,
> Nícolas
>
