[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56681478-399a-4fae-828c-ccf1b579c296@linaro.org>
Date: Tue, 9 Jul 2024 16:04:05 +0200
From: Daniel Lezcano <daniel.lezcano@...aro.org>
To: Niklas Söderlund
<niklas.soderlund+renesas@...natech.se>, Thomas Gleixner
<tglx@...utronix.de>, linux-kernel@...r.kernel.org,
linux-renesas-soc@...r.kernel.org
Subject: Re: [PATCH] clocksource/drivers/sh_cmt: Address race condition for
clock events
On 02/07/2024 21:02, Niklas Söderlund wrote:
> There is a race condition in the CMT interrupt handler. In the interrupt
> handler the driver sets a driver private flag, FLAG_IRQCONTEXT. This
> flag is used to indicate any call to set_next_event() should not be
> directly propagated to the device, but instead cached. This is done as
> the interrupt handler itself reprograms the device when needed before it
> completes and this avoids this operation to take place twice.
>
> It is unclear why this design was chosen, my suspicion is to allow the
> struct clock_event_device.event_handler callback, which is called while
> the FLAG_IRQCONTEXT is set, can update the next event without having to
> write to the device twice.
>
> Unfortunately there is a race between when the FLAG_IRQCONTEXT flag is
> set and later cleared where the interrupt handler have already started to
> write the next event to the device. If set_next_event() is called in
> this window the value is only cached in the driver but not written. This
> leads to the board to misbehave, or worse lockup and produce a splat.
>
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: 0-...!: (0 ticks this GP) idle=f5e0/0/0x0 softirq=519/519 fqs=0 (false positive?)
> rcu: (detected by 1, t=6502 jiffies, g=-595, q=77 ncpus=2)
> Sending NMI from CPU 1 to CPUs 0:
> NMI backtrace for cpu 0
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-rc5-arm64-renesas-00019-g74a6f86eaf1c-dirty #20
> Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT)
> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : tick_check_broadcast_expired+0xc/0x40
> lr : cpu_idle_poll.isra.0+0x8c/0x168
> sp : ffff800081c63d70
> x29: ffff800081c63d70 x28: 00000000580000c8 x27: 00000000bfee5610
> x26: 0000000000000027 x25: 0000000000000000 x24: 0000000000000000
> x23: ffff00007fbb9100 x22: ffff8000818f1008 x21: ffff8000800ef07c
> x20: ffff800081c79ec0 x19: ffff800081c70c28 x18: 0000000000000000
> x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffc2c717d8
> x14: 0000000000000000 x13: ffff000009c18080 x12: ffff8000825f7fc0
> x11: 0000000000000000 x10: ffff8000818f3cd4 x9 : 0000000000000028
> x8 : ffff800081c79ec0 x7 : ffff800081c73000 x6 : 0000000000000000
> x5 : 0000000000000000 x4 : ffff7ffffe286000 x3 : 0000000000000000
> x2 : ffff7ffffe286000 x1 : ffff800082972900 x0 : ffff8000818f1008
> Call trace:
> tick_check_broadcast_expired+0xc/0x40
> do_idle+0x9c/0x280
> cpu_startup_entry+0x34/0x40
> kernel_init+0x0/0x11c
> do_one_initcall+0x0/0x260
> __primary_switched+0x80/0x88
> rcu: rcu_preempt kthread timer wakeup didn't happen for 6501 jiffies! g-595 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> rcu: Possible timer handling issue on cpu=0 timer-softirq=262
> rcu: rcu_preempt kthread starved for 6502 jiffies! g-595 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
> rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> rcu: RCU grace-period kthread stack dump:
> task:rcu_preempt state:I stack:0 pid:15 tgid:15 ppid:2 flags:0x00000008
> Call trace:
> __switch_to+0xbc/0x100
> __schedule+0x358/0xbe0
> schedule+0x48/0x148
> schedule_timeout+0xc4/0x138
> rcu_gp_fqs_loop+0x12c/0x764
> rcu_gp_kthread+0x208/0x298
> kthread+0x10c/0x110
> ret_from_fork+0x10/0x20
>
> The design have been part of the driver since it was first merged in
> early 2009. It becomes increasingly harder to trigger the issue the
> older kernel version one tries. It only takes a few boots on v6.10-rc5,
> while hundreds of boots are needed to trigger it on v5.10.
>
> Close the race condition by using the CMT channel lock for the two
> competing sections. The channel lock was added to the driver after its
> initial design.
>
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@...natech.se>
> ---
Applied, thanks
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
Powered by blists - more mailing lists