[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMi1Hd2drhMGTsvnX1aqwpDTOkK_+n2OKMYY1Y9ONHyYiFLTSQ@mail.gmail.com>
Date: Mon, 12 Aug 2024 15:09:47 +0530
From: Amit Pundir <amit.pundir@...aro.org>
To: Stephen Boyd <swboyd@...omium.org>
Cc: Bjorn Andersson <andersson@...nel.org>, Michael Turquette <mturquette@...libre.com>,
Stephen Boyd <sboyd@...nel.org>, linux-kernel@...r.kernel.org, linux-clk@...r.kernel.org,
patches@...ts.linux.dev, linux-arm-msm@...r.kernel.org,
Laura Nao <laura.nao@...labora.com>, Dmitry Baryshkov <dmitry.baryshkov@...aro.org>,
Douglas Anderson <dianders@...omium.org>, Taniya Das <quic_tdas@...cinc.com>
Subject: Re: [PATCH] clk: qcom: Park shared RCGs upon registration
On Tue, 6 Aug 2024 at 03:07, Stephen Boyd <swboyd@...omium.org> wrote:
>
> Quoting Amit Pundir (2024-08-05 03:43:14)
> > On Sat, 3 Aug 2024 at 06:29, Stephen Boyd <swboyd@...omium.org> wrote:
> > >
> > > Also please send back the dmesg so we can see what clks are configured
> > > for at boot time. If they're using TCXO source at boot then they're not
> > > going to be broken. In which case those clks can keep using the old clk
> > > ops and we can focus on the ones that aren't sourcing from TCXO.
> >
> > Thank your for this debug patch. I thought I narrowed down the
> > breakage to the clks in drivers/clk/qcom/gcc-sm8550.c, until I ran
> > into the following kernel panic in ucsi_glink driver in later test
> > runs.
>
> Thanks for the info. These are the clks that aren't sourcing from XO
> at registration time:
>
> gcc_qupv3_wrap1_s7_clk_src with cfg 0x102601 -> parent is gpll0_out_even
> gcc_ufs_phy_axi_clk_src with cfg 0x103 -> parent is gpll0_out_main
> gcc_ufs_phy_ice_core_clk_src with cfg 0x503 -> parent is gpll4_out_main
> gcc_ufs_phy_unipro_core_clk_src with cfg 0x103 -> parent is gpll0_out_main
> gcc_usb30_prim_master_clk_src with cfg 0x105 -> parent is gpll0_out_main
>
> The original patch is going to inform the clk framework that the parent
> of these clks aren't XO but something like gpll0_out_even, whatever the
> hardware is configured for. That may cause these PLLs to be turned off
> earlier than before if, for example, gcc_ufs_phy_axi_clk_src is turned
> off by a consumer and gcc_usb30_prim_master_clk_src is left enabled at
> boot. That's why we force park clks at registration time, so that they
> can't have their parent clk get turned off by some other clk consumer
> enabling and then disabling a clk that's also parented to the same
> parent.
>
> This same problem exists for RCGs that aren't shared too, but it's
> particularly bad for shared RCGs because the parent PLLs aren't turned
> on automatically by the hardware when things like the GDSC do their
> housekeeping. At least when software is in control we can enable the
> parent PLL and unstick the RCG that was previously cut off.
Thank you Stephen for the detailed explanation. Really appreciate this
knowledge dump.
>
> Can you narrow down the list above to the clk that matters? I guess if
> USB isn't working then gcc_usb30_prim_master_clk_src is the one that
> should be changed and nothing else. Although, I noticed that in the
> first dmesg log you sent the serial console had garbage, and that's
> likely because the rate changed while the clk was registered. I don't
> know why the gcc_qupv3_wrap1_s7_clk_src is marked with the shared clk
> ops. That's confusing to me as I don't expect that to need to be parked
> for any reasons. Maybe qcom folks can comment there but I'd expect plain
> rcg2_ops to be used for those clks. Anyway, if you can narrow down to
> which clk needs to be left untouched it would be helpful.
gcc_qupv3_wrap1_s7_clk_src and gcc_usb30_prim_master_clk_src need to
be left untouched to fix the Audio codec and USB-C host mode breakages
respectively. It seem to have fixed the serial console garbage dump
issue as well.
UFS on SM8550-HDK has it's own issues when booting AOSP from it, so I
couldn't test/verify if the reported gcc_ufs_phy_*_clk_src clk changes
make any difference. But I'll bookmark this change for future
reference if I run into any relevant UFS probe deferrals once we fix
the existing/on-going issues.
>
> >
> > [ 7.882923][ T1] init: Loading module /lib/modules/ucsi_glink.ko
> > with args ''
> > [ 7.892929][ T92] Unable to handle kernel NULL pointer
> > dereference at virtual address 0000000000000010
> > [ 7.894935][ T1] init: Loaded kernel module /lib/modules/ucsi_glink.ko
> > [ 7.902670][ T92] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000886218000
> > [ 7.902674][ T92] Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
> > [ 7.993995][ T64] qcom_pmic_glink pmic-glink: Failed to create
> > device link (0x180) with a600000.usb
> > [ 8.078673][ T92] CPU: 7 UID: 0 PID: 92 Comm: kworker/7:2
> > Tainted: G S E 6.11.0-rc2-mainline-00001-g4153d980358d
> > #6
> > [ 8.078676][ T92] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
> > [ 8.078677][ T92] Hardware name: Qualcomm Technologies, Inc.
> > SM8550 HDK (DT)
> > [ 8.078679][ T92] Workqueue: events pmic_glink_ucsi_register [ucsi_glink]
> > [ 8.078682][ T92] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
> > -SSBS BTYPE=--)
> > [ 8.078684][ T92] pc : pmic_glink_send+0x10/0x2c [pmic_glink]
> > [ 8.078685][ T92] lr : pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink]
> > [ 8.078704][ T92] Call trace:
> > [ 8.078705][ T92] pmic_glink_send+0x10/0x2c [pmic_glink]
> > [ 8.078706][ T92] pmic_glink_ucsi_read+0x84/0x14c [ucsi_glink]
> > [ 8.078707][ T92] pmic_glink_ucsi_read_version+0x20/0x30 [ucsi_glink]
> > [ 8.078708][ T92] ucsi_register+0x28/0x70
> > [ 8.078717][ T92] pmic_glink_ucsi_register+0x18/0x28 [ucsi_glink]
> > [ 8.078718][ T92] process_one_work+0x184/0x2e8
> > [ 8.078723][ T92] worker_thread+0x2f0/0x404
> > [ 8.078725][ T92] kthread+0x114/0x118
> > [ 8.078728][ T92] ret_from_fork+0x10/0x20
> > [ 8.078732][ T92] ---[ end trace 0000000000000000 ]---
> > [ 8.078734][ T92] Kernel panic - not syncing: Oops: Fatal exception
> > [ 8.078735][ T92] SMP: stopping secondary CPUs
> > [ 8.279136][ T92] Kernel Offset: 0x14d9480000 from 0xffffffc080000000
> > [ 8.279141][ T92] PHYS_OFFSET: 0x80000000
> > [ 8.279143][ T92] CPU features: 0x18,004e0003,80113128,564676af
> > [ 8.279148][ T92] Memory Limit: none
>
> That looks like 'client' is NULL in pmic_glink_send(). The VA of 0x10 is
> the offset of 'pg' in struct pmic_glink_client. I don't know much about
> that driver but I'd guess that ucsi_glink has some race condition
> assigning the client pointer?
>
> Oh actually, I see the problem. devm_pmic_glink_register_client()
> returns a struct pmic_glink_client pointer that's assigned to
> 'ucsi->client'. And pmic_glink_ucsi_read() uses 'ucsi->client' to call
> pmic_glink_send(). That pointer is NULL because the workqueue that runs
> pmic_glink_ucsi_register() must run before
> devm_pmic_glink_register_client() returns and assigns the client pointer
> to 'ucsi->client'. This is simply a race.
>
> CPU0 CPU1
> ---- ----
> ucsi->client = NULL;
> devm_pmic_glink_register_client()
> client->pdr_notify(client->priv, pg->client_state)
> pmic_glink_ucsi_pdr_notify()
> schedule_work(&ucsi->register_work)
> <schedule away>
> pmic_glink_ucsi_register()
> ucsi_register()
> pmic_glink_ucsi_read_version()
> pmic_glink_ucsi_read()
> pmic_glink_ucsi_read()
> pmic_glink_send(ucsi->client)
> <client is NULL BAD>
> ucsi->client = client // Too late!
>
> >
> > I couldn't reproduce this kernel panic on vanilla v6.11-rc2 in 50+
> > test runs after that. So I'm assuming that this debug patch may have
> > triggered it.
> > Attaching the crashing and working dmesg logs with the debug patch applied.
> >
>
> Sounds like you just need to reboot a bunch! Or add an msleep() in
> devm_pmic_glink_register_client() after the notify call to open the race
> window and let the workqueue run.
Thank you for diagnosing this race in ucsi_glink. I needed to run an
overnight reboot test to reproduce this crash, and could reproduce it
on ~380th reboot. I'll check if it has already been reported or fixed
on linux-next.
Regards,
Amit Pundir
Powered by blists - more mailing lists