[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <486d6d25-77e6-5fe4-4110-7256c20ba742@quicinc.com>
Date: Wed, 6 Dec 2023 16:09:56 +0530
From: Mukesh Ojha <quic_mojha@...cinc.com>
To: <myungjoo.ham@...sung.com>, <kyungmin.park@...sung.com>,
<cw00.choi@...sung.com>
CC: <linux-pm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<huangzaiyang@...o.com>
Subject: Re: [PATCH v4] PM / devfreq: Synchronize device_monitor_[start/stop]
Gentle reminder..
-Mukesh
On 11/25/2023 2:41 AM, Mukesh Ojha wrote:
> There is a chance if a frequent switch of the governor
> done in a loop result in timer list corruption where
> timer cancel being done from two place one from
> cancel_delayed_work_sync() and followed by expire_timers()
> can be seen from the traces[1].
>
> while true
> do
> echo "simple_ondemand" > /sys/class/devfreq/1d84000.ufshc/governor
> echo "performance" > /sys/class/devfreq/1d84000.ufshc/governor
> done
>
> It looks to be issue with devfreq driver where
> device_monitor_[start/stop] need to synchronized so that
> delayed work should get corrupted while it is either
> being queued or running or being cancelled.
>
> Let's use polling flag and devfreq lock to synchronize the
> queueing the timer instance twice and work data being
> corrupted.
>
> [1]
> ...
> ..
> <idle>-0 [003] 9436.209662: timer_cancel timer=0xffffff80444f0428
> <idle>-0 [003] 9436.209664: timer_expire_entry timer=0xffffff80444f0428 now=0x10022da1c function=__typeid__ZTSFvP10timer_listE_global_addr baseclk=0x10022da1c
> <idle>-0 [003] 9436.209718: timer_expire_exit timer=0xffffff80444f0428
> kworker/u16:6-14217 [003] 9436.209863: timer_start timer=0xffffff80444f0428 function=__typeid__ZTSFvP10timer_listE_global_addr expires=0x10022da2b now=0x10022da1c flags=182452227
> vendor.xxxyyy.ha-1593 [004] 9436.209888: timer_cancel timer=0xffffff80444f0428
> vendor.xxxyyy.ha-1593 [004] 9436.216390: timer_init timer=0xffffff80444f0428
> vendor.xxxyyy.ha-1593 [004] 9436.216392: timer_start timer=0xffffff80444f0428 function=__typeid__ZTSFvP10timer_listE_global_addr expires=0x10022da2c now=0x10022da1d flags=186646532
> vendor.xxxyyy.ha-1593 [005] 9436.220992: timer_cancel timer=0xffffff80444f0428
> xxxyyyTraceManag-7795 [004] 9436.261641: timer_cancel timer=0xffffff80444f0428
>
> [2]
>
> 9436.261653][ C4] Unable to handle kernel paging request at virtual address dead00000000012a
> [ 9436.261664][ C4] Mem abort info:
> [ 9436.261666][ C4] ESR = 0x96000044
> [ 9436.261669][ C4] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 9436.261671][ C4] SET = 0, FnV = 0
> [ 9436.261673][ C4] EA = 0, S1PTW = 0
> [ 9436.261675][ C4] Data abort info:
> [ 9436.261677][ C4] ISV = 0, ISS = 0x00000044
> [ 9436.261680][ C4] CM = 0, WnR = 1
> [ 9436.261682][ C4] [dead00000000012a] address between user and kernel address ranges
> [ 9436.261685][ C4] Internal error: Oops: 96000044 [#1] PREEMPT SMP
> [ 9436.261701][ C4] Skip md ftrace buffer dump for: 0x3a982d0
> ...
>
> [ 9436.262138][ C4] CPU: 4 PID: 7795 Comm: TraceManag Tainted: G S W O 5.10.149-android12-9-o-g17f915d29d0c #1
> [ 9436.262141][ C4] Hardware name: Qualcomm Technologies, Inc. (DT)
> [ 9436.262144][ C4] pstate: 22400085 (nzCv daIf +PAN -UAO +TCO BTYPE=--)
> [ 9436.262161][ C4] pc : expire_timers+0x9c/0x438
> [ 9436.262164][ C4] lr : expire_timers+0x2a4/0x438
> [ 9436.262168][ C4] sp : ffffffc010023dd0
> [ 9436.262171][ C4] x29: ffffffc010023df0 x28: ffffffd0636fdc18
> [ 9436.262178][ C4] x27: ffffffd063569dd0 x26: ffffffd063536008
> [ 9436.262182][ C4] x25: 0000000000000001 x24: ffffff88f7c69280
> [ 9436.262185][ C4] x23: 00000000000000e0 x22: dead000000000122
> [ 9436.262188][ C4] x21: 000000010022da29 x20: ffffff8af72b4e80
> [ 9436.262191][ C4] x19: ffffffc010023e50 x18: ffffffc010025038
> [ 9436.262195][ C4] x17: 0000000000000240 x16: 0000000000000201
> [ 9436.262199][ C4] x15: ffffffffffffffff x14: ffffff889f3c3100
> [ 9436.262203][ C4] x13: ffffff889f3c3100 x12: 00000000049f56b8
> [ 9436.262207][ C4] x11: 00000000049f56b8 x10: 00000000ffffffff
> [ 9436.262212][ C4] x9 : ffffffc010023e50 x8 : dead000000000122
> [ 9436.262216][ C4] x7 : ffffffffffffffff x6 : ffffffc0100239d8
> [ 9436.262220][ C4] x5 : 0000000000000000 x4 : 0000000000000101
> [ 9436.262223][ C4] x3 : 0000000000000080 x2 : ffffff889edc155c
> [ 9436.262227][ C4] x1 : ffffff8001005200 x0 : ffffff80444f0428
> [ 9436.262232][ C4] Call trace:
> [ 9436.262236][ C4] expire_timers+0x9c/0x438
> [ 9436.262240][ C4] __run_timers+0x1f0/0x330
> [ 9436.262245][ C4] run_timer_softirq+0x28/0x58
> [ 9436.262255][ C4] efi_header_end+0x168/0x5ec
> [ 9436.262265][ C4] __irq_exit_rcu+0x108/0x124
> [ 9436.262274][ C4] __handle_domain_irq+0x118/0x1e4
> [ 9436.262282][ C4] gic_handle_irq.30369+0x6c/0x2bc
> [ 9436.262286][ C4] el0_irq_naked+0x60/0x6c
>
> Reported-by: Joyyoung Huang <huangzaiyang@...o.com>
> Signed-off-by: Mukesh Ojha <quic_mojha@...cinc.com>
> ---
> Huang,
>
> Would be looking for your tested-by..
>
> -Mukesh
>
>
> Changes in v4: https://lore.kernel.org/lkml/1700238027-20518-1-git-send-email-quic_mojha@quicinc.com/
> - Mistakenly put cancel work under devfreq lock which could result in deadlock
> reported by [Joyyoung Huang]
> https://lore.kernel.org/lkml/KL1PR02MB8141D1A307457AF69EBB6AFBA3B8A@KL1PR02MB8141.apcprd02.prod.outlook.com/
>
> Changes in v3: https://lore.kernel.org/lkml/1700235522-31105-1-git-send-email-quic_mojha@quicinc.com/
> - Remove the unexpected 'twice' from the subject.
>
> Changes in v2: https://lore.kernel.org/lkml/1699957648-31299-1-git-send-email-quic_mojha@quicinc.com/
> - Changed subject.
> - Added lock to avoid work data corruption due to
> parallel calls to devfreq_monitor_start while work
> is queued in flight.
> - Added lock to cover the same as above case while the
> work is being cancelled.
> - Added Reported-by for similar issue reported at
> https://lore.kernel.org/lkml/SEYPR02MB565398175FA093AC3E63EE7BA3B0A@SEYPR02MB5653.apcprd02.prod.outlook.com/
>
> drivers/devfreq/devfreq.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index b3a68d5833bd..cb1c24721a37 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -461,10 +461,14 @@ static void devfreq_monitor(struct work_struct *work)
> if (err)
> dev_err(&devfreq->dev, "dvfs failed with (%d) error\n", err);
>
> + if (devfreq->stop_polling)
> + goto out;
> +
> queue_delayed_work(devfreq_wq, &devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
> - mutex_unlock(&devfreq->lock);
>
> +out:
> + mutex_unlock(&devfreq->lock);
> trace_devfreq_monitor(devfreq);
> }
>
> @@ -483,6 +487,10 @@ void devfreq_monitor_start(struct devfreq *devfreq)
> if (IS_SUPPORTED_FLAG(devfreq->governor->flags, IRQ_DRIVEN))
> return;
>
> + mutex_lock(&devfreq->lock);
> + if (delayed_work_pending(&devfreq->work))
> + goto out;
> +
> switch (devfreq->profile->timer) {
> case DEVFREQ_TIMER_DEFERRABLE:
> INIT_DEFERRABLE_WORK(&devfreq->work, devfreq_monitor);
> @@ -491,12 +499,16 @@ void devfreq_monitor_start(struct devfreq *devfreq)
> INIT_DELAYED_WORK(&devfreq->work, devfreq_monitor);
> break;
> default:
> - return;
> + goto out;
> }
>
> if (devfreq->profile->polling_ms)
> queue_delayed_work(devfreq_wq, &devfreq->work,
> msecs_to_jiffies(devfreq->profile->polling_ms));
> +
> +out:
> + devfreq->stop_polling = false;
> + mutex_unlock(&devfreq->lock);
> }
> EXPORT_SYMBOL(devfreq_monitor_start);
>
> @@ -513,6 +525,14 @@ void devfreq_monitor_stop(struct devfreq *devfreq)
> if (IS_SUPPORTED_FLAG(devfreq->governor->flags, IRQ_DRIVEN))
> return;
>
> + mutex_lock(&devfreq->lock);
> + if (devfreq->stop_polling) {
> + mutex_unlock(&devfreq->lock);
> + return;
> + }
> +
> + devfreq->stop_polling = true;
> + mutex_unlock(&devfreq->lock);
> cancel_delayed_work_sync(&devfreq->work);
> }
> EXPORT_SYMBOL(devfreq_monitor_stop);
Powered by blists - more mailing lists