linux-kernel - Re: [PATCH v1] PM: sleep: core: Synchronize runtime PM status of parents and children

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1c2433d4-7e0f-4395-b841-b8eac7c25651@nvidia.com>
Date: Fri, 7 Feb 2025 13:38:58 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: "Rafael J. Wysocki" <rjw@...ysocki.net>,
 Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
 Alan Stern <stern@...land.harvard.edu>, Bjorn Helgaas <helgaas@...nel.org>,
 Linux PCI <linux-pci@...r.kernel.org>, Ulf Hansson <ulf.hansson@...aro.org>,
 Johan Hovold <johan@...nel.org>,
 Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>,
 Kevin Xie <kevin.xie@...rfivetech.com>,
 "linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v1] PM: sleep: core: Synchronize runtime PM status of
 parents and children

Hi Rafael,

On 28/01/2025 19:24, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> 
> Commit 6e176bf8d461 ("PM: sleep: core: Do not skip callbacks in the
> resume phase") overlooked the case in which the parent of a device with
> DPM_FLAG_SMART_SUSPEND set did not use that flag and could be runtime-
> suspended before a transition into a system-wide sleep state.  In that
> case, if the child is resumed during the subsequent transition from
> that state into the working state, its runtime PM status will be set to
> RPM_ACTIVE, but the runtime PM status of the parent will not be updated
> accordingly, even though the parent will be resumed too, because of the
> dev_pm_skip_suspend() check in device_resume_noirq().
> 
> Address this problem by tracking the need to set the runtime PM status
> to RPM_ACTIVE during system-wide resume transitions for devices with
> DPM_FLAG_SMART_SUSPEND set and all of the devices depended on by them.
> 
> Fixes: 6e176bf8d461 ("PM: sleep: core: Do not skip callbacks in the resume phase")
> Closes: https://lore.kernel.org/linux-pm/Z30p2Etwf3F2AUvD@hovoldconsulting.com/
> Reported-by: Johan Hovold <johan@...nel.org>
> Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> ---
>   drivers/base/power/main.c |   29 ++++++++++++++++++++---------
>   include/linux/pm.h        |    1 +
>   2 files changed, 21 insertions(+), 9 deletions(-)
> 
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -656,13 +656,15 @@
>   	 * so change its status accordingly.
>   	 *
>   	 * Otherwise, the device is going to be resumed, so set its PM-runtime
> -	 * status to "active", but do that only if DPM_FLAG_SMART_SUSPEND is set
> -	 * to avoid confusing drivers that don't use it.
> +	 * status to "active" unless its power.set_active flag is clear, in
> +	 * which case it is not necessary to update its PM-runtime status.
>   	 */
> -	if (skip_resume)
> +	if (skip_resume) {
>   		pm_runtime_set_suspended(dev);
> -	else if (dev_pm_skip_suspend(dev))
> +	} else if (dev->power.set_active) {
>   		pm_runtime_set_active(dev);
> +		dev->power.set_active = false;
> +	}
>   
>   	if (dev->pm_domain) {
>   		info = "noirq power domain ";
> @@ -1189,18 +1191,24 @@
>   	return PMSG_ON;
>   }
>   
> -static void dpm_superior_set_must_resume(struct device *dev)
> +static void dpm_superior_set_must_resume(struct device *dev, bool set_active)
>   {
>   	struct device_link *link;
>   	int idx;
>   
> -	if (dev->parent)
> +	if (dev->parent) {
>   		dev->parent->power.must_resume = true;
> +		if (set_active)
> +			dev->parent->power.set_active = true;
> +	}
>   
>   	idx = device_links_read_lock();
>   
> -	list_for_each_entry_rcu_locked(link, &dev->links.suppliers, c_node)
> +	list_for_each_entry_rcu_locked(link, &dev->links.suppliers, c_node) {
>   		link->supplier->power.must_resume = true;
> +		if (set_active)
> +			link->supplier->power.set_active = true;
> +	}
>   
>   	device_links_read_unlock(idx);
>   }
> @@ -1278,8 +1286,11 @@
>   	      dev->power.may_skip_resume))
>   		dev->power.must_resume = true;
>   
> -	if (dev->power.must_resume)
> -		dpm_superior_set_must_resume(dev);
> +	if (dev->power.must_resume) {
> +		dev->power.set_active = dev->power.set_active ||
> +			dev_pm_test_driver_flags(dev, DPM_FLAG_SMART_SUSPEND);
> +		dpm_superior_set_must_resume(dev, dev->power.set_active);
> +	}
>   
>   Complete:
>   	complete_all(&dev->power.completion);
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -683,6 +683,7 @@
>   	bool			no_pm_callbacks:1;	/* Owned by the PM core */
>   	bool			async_in_progress:1;	/* Owned by the PM core */
>   	bool			must_resume:1;		/* Owned by the PM core */
> +	bool			set_active:1;		/* Owned by the PM core */
>   	bool			may_skip_resume:1;	/* Set by subsystems */
>   #else
>   	bool			should_wakeup:1;


I am seeing the following crash during suspend on a couple of our boards (with mainline/next) and bisect is pointing to this commit ...

[  216.311009] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  216.311229] Mem abort info:
[  216.311298]   ESR = 0x0000000096000004
[  216.311393]   EC = 0x25: DABT (current EL), IL = 32 bits
[  216.311515]   SET = 0, FnV = 0
[  216.311593]   EA = 0, S1PTW = 0
[  216.311668]   FSC = 0x04: level 0 translation fault
[  216.311770] Data abort info:
[  216.311855]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  216.311972]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  216.312081]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  216.312267] user pgtable: 4k pages, 48-bit VAs, pgdp=000000010b905000
[  216.312411] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[  216.312620] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  216.312747] Modules linked in: snd_soc_tegra210_admaif snd_soc_tegra186_asrc snd_soc_tegra_pcm snd_soc_tegra210_mixer snd_soc_tegra210_adx snd_soc_tegra210_mvc snd_soc_tegra210_ope snd_soc_tegra210_dmic snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra210_sfc tegra_drm drm_dp_aux_bus cec drm_display_helper drm_client_lib drm_kms_helper snd_soc_tegra210_ahub tegra210_adma drm snd_soc_tegra_audio_graph_card snd_soc_audio_graph_card snd_soc_rt5659 backlight snd_soc_simple_card_utils snd_soc_rl6231 ucsi_ccg typec_ucsi typec pwm_tegra ina3221 tegra_aconnect pwm_fan snd_hda_codec_hdmi snd_hda_tegra snd_hda_codec snd_hda_core phy_tegra194_p2u tegra_xudc at24 lm90 tegra_bpmp_thermal host1x pcie_tegra194 ip_tables x_tables ipv6
[  216.354364] CPU: 1 UID: 0 PID: 14450 Comm: rtcwake Tainted: G        W          6.14.0-rc1-next-20250206-g808eb958781e #1
[  216.365388] Tainted: [W]=WARN
[  216.368279] Hardware name: NVIDIA Jetson AGX Xavier Developer Kit (DT)
[  216.375099] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  216.381928] pc : simple_pm_bus_runtime_suspend+0x14/0x48
[  216.387698] lr : pm_generic_runtime_suspend+0x2c/0x44
[  216.392962] sp : ffff80009003baf0
[  216.396103] x29: ffff80009003baf0 x28: ffff000088043480 x27: 0000000000000000
[  216.403453] x26: ffff800082c11350 x25: ffff800082e0e000 x24: ffff80008093d634
[  216.410803] x23: 0000000000000000 x22: 0000000000000002 x21: ffff800082e0e000
[  216.418411] x20: ffff800080938af8 x19: ffff0000808ec410 x18: ffffffffffff4ac8
[  216.425328] x17: 2033303a35353a36 x16: 312031312d39302d x15: 000047edca5d49da
[  216.432937] x14: ffff000088043500 x13: 0000000000000001 x12: 0000000000000000
[  216.440110] x11: 000000321a7856a0 x10: 0000000000000aa0 x9 : ffff80009003b8e0
[  216.447219] x8 : ffff000088043f80 x7 : ffff0003fdf08a40 x6 : 0000000000000000
[  216.454291] x5 : 0000000000000000 x4 : ffff800081ffe8c0 x3 : ffff0000808ec410
[  216.461723] x2 : ffff8000805d133c x1 : 0000000000000000 x0 : 0000000000000000
[  216.468986] Call trace:
[  216.471179]  simple_pm_bus_runtime_suspend+0x14/0x48 (P)
[  216.476775]  pm_generic_runtime_suspend+0x2c/0x44
[  216.481499]  pm_runtime_force_suspend+0x54/0x14c
[  216.486049]  device_suspend_noirq+0x6c/0x278
[  216.490253]  dpm_suspend_noirq+0xc0/0x198
[  216.494278]  suspend_devices_and_enter+0x210/0x4c0
[  216.499348]  pm_suspend+0x164/0x1c8
[  216.503023]  state_store+0x8c/0xfc
[  216.506260]  kobj_attr_store+0x18/0x2c
[  216.509940]  sysfs_kf_write+0x44/0x54
[  216.513699]  kernfs_fop_write_iter+0x118/0x1a8
[  216.518163]  vfs_write+0x2b0/0x35c
[  216.521399]  ksys_write+0x68/0xfc
[  216.524810]  __arm64_sys_write+0x1c/0x28
[  216.528574]  invoke_syscall+0x48/0x110
[  216.532253]  el0_svc_common.constprop.0+0x40/0xe8
[  216.536628]  do_el0_svc+0x20/0x2c
[  216.540299]  el0_svc+0x30/0xd0
[  216.543016]  el0t_64_sync_handler+0x144/0x168
[  216.547736]  el0t_64_sync+0x198/0x19c
[  216.551327] Code: a9be7bfd 910003fd a90153f3 f9403c00 (f9400014)
[  216.557197] ---[ end trace 0000000000000000 ]---

I have not looked any further, but if you have any thoughts, let me know.

Thanks
Jon

-- 
nvpublic