Message-ID: <CANDhNCo_CMNaVm5J+5ka24rb+gRjfeujyVakV6Hai_5j=V-axQ@mail.gmail.com>
Date: Wed, 13 Nov 2024 11:07:05 -0800
From: John Stultz <jstultz@...gle.com>
To: Arnd Bergmann <arnd@...db.de>
Cc: LKML <linux-kernel@...r.kernel.org>, Peter Zijlstra <peterz@...radead.org>,
Joel Fernandes <joelaf@...gle.com>, Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Valentin Schneider <vschneid@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>, Benjamin Segall <bsegall@...gle.com>,
Zimuzo Ezeozue <zezeozue@...gle.com>, Mel Gorman <mgorman@...e.de>, Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>,
"Paul E. McKenney" <paulmck@...nel.org>, Metin Kaya <Metin.Kaya@....com>,
Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>,
Thomas Gleixner <tglx@...utronix.de>, Daniel Lezcano <daniel.lezcano@...aro.org>, kernel-team@...roid.com,
Davidlohr Bueso <dave@...olabs.net>, regressions@...ts.linux.dev,
Thorsten Leemhuis <linux@...mhuis.info>, Anders Roxell <anders.roxell@...aro.org>
Subject: Re: [PATCH v13 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
On Wed, Nov 13, 2024 at 10:18 AM Arnd Bergmann <arnd@...db.de> wrote:
>
> On Sat, Oct 12, 2024, at 01:25, John Stultz wrote:
> > From: Peter Zijlstra <peterz@...radead.org>
> >
> > In preparation to nest mutex::wait_lock under rq::lock we need
> > to remove wakeups from under it.
> >
> > Do this by utilizing wake_qs to defer the wakeup until after the
> > lock is dropped.
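
For anyone following along, the shape of the change is roughly the
pattern sketched below. This is only an illustration, not the actual
mutex code; struct foo_lock and foo_pick_next_waiter() are made-up
names. The idea is that waiters get queued on a local wake_q while
wait_lock is held, and the actual wakeups only happen once the lock
has been dropped:

#include <linux/sched/wake_q.h>
#include <linux/spinlock.h>

struct foo_lock {
        raw_spinlock_t  wait_lock;
        /* ... waiter bookkeeping protected by wait_lock ... */
};

/* Hypothetical helper that returns the next waiter, or NULL. */
struct task_struct *foo_pick_next_waiter(struct foo_lock *l);

static void foo_unlock_slowpath(struct foo_lock *l)
{
        DEFINE_WAKE_Q(wake_q);
        struct task_struct *waiter;
        unsigned long flags;

        raw_spin_lock_irqsave(&l->wait_lock, flags);
        waiter = foo_pick_next_waiter(l);
        if (waiter)
                wake_q_add(&wake_q, waiter);    /* defer: only queue the task */
        raw_spin_unlock_irqrestore(&l->wait_lock, flags);

        /* wait_lock is no longer held, now do the actual wakeup(s). */
        wake_up_q(&wake_q);
}
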
>
> To follow up from IRC, this patch is the one that caused a
> boot time regression in linux-next in the regulator framework.
>
> Anders Roxell found this during testing on the Rock Pi 4 board
> (Rockchip RK3399 based).
>
> The boot log with the NULL pointer dereference is at
> https://lkft.validation.linaro.org/scheduler/job/7979980#L741
>
> The interesting bit is this:
>
> [ 0.957586] rk_gmac-dwmac fe300000.ethernet: Enable RX Mitigation via HW Watchdog Timer
> [ 0.969402] hub 6-0:1.0: USB hub found
> [ 0.969450] hub 6-0:1.0: 1 port detected
> [ 0.988163] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [ 0.988172] Mem abort info:
> [ 0.988174] ESR = 0x0000000096000004
> [ 0.988176] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 0.988180] SET = 0, FnV = 0
> [ 0.988183] EA = 0, S1PTW = 0
> [ 0.988185] FSC = 0x04: level 0 translation fault
> [ 0.988187] Data abort info:
> [ 0.988189] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [ 0.988191] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [ 0.988194] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 0.988197] [0000000000000000] user address but active_mm is swapper
> [ 0.988201] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
> [ 0.988205] Modules linked in:
> [ 0.988217] Hardware name: Radxa ROCK Pi 4B (DT)
> [ 0.988225] pc : wake_up_q (kernel/sched/core.c:1059)
> [ 0.988238] lr : wake_up_q (kernel/sched/core.c:1054)
> [ 0.988243] sp : ffff800083433a00
> [ 0.988245] x29: ffff800083433a00 x28: 0000000000000000 x27: ffff0000053b6080
> [ 0.988253] x26: ffff800083433b90 x25: ffff0000053b6000 x24: ffff800080098000
> [ 0.988259] x23: 00000000ffffffff x22: 0000000000000001 x21: 0000000000000000
> [ 0.988265] x20: fffffffffffff850 x19: 0000000000000000 x18: 0000000000000001
> [ 0.988272] x17: ffff800075678000 x16: ffff800082728000 x15: 019ee6ab98006e30
> [ 0.988278] x14: 000002ce459acd0c x13: 000b52b4cf08772c x12: 000000000000000f
> [ 0.988284] x11: 0000000000000000 x10: 0000000000000a50 x9 : ffff800083433870
> [ 0.988291] x8 : ffff0000050fceb0 x7 : ffff0000f76ab9c0 x6 : 00000000000b52b4
> [ 0.988297] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 0.988303] x2 : 0000000000002710 x1 : 0000000000000001 x0 : 0000000000002710
> [ 0.988310] Call trace:
> [ 0.988313] wake_up_q+0x50/0xf0 P)
> [ 0.988319] wake_up_q+0xa0/0xf0 L)
> [ 0.988325] __ww_rt_mutex_lock.isra.0 (arch/arm64/include/asm/preempt.h:62 (discriminator 2) kernel/locking/rtmutex.c:1794 kernel/locking/ww_rt_mutex.c:71)
> [ 0.988333] ww_mutex_lock (kernel/locking/ww_rt_mutex.c:82)
> [ 0.988338] regulator_lock_recursive (drivers/regulator/core.c:161 drivers/regulator/core.c:333)
> [ 0.988347] regulator_lock_recursive (drivers/regulator/core.c:348)
> [ 0.988354] regulator_lock_dependent (drivers/regulator/core.c:409)
> [ 0.988360] regulator_set_voltage (drivers/regulator/core.c:4173)
> [ 0.988366] _opp_config_regulator_single (include/linux/regulator/consumer.h:707 (discriminator 1) drivers/opp/core.c:933 drivers/opp/core.c:1019)
> [ 0.988375] _set_opp (drivers/opp/core.c:1253)
> [ 0.988379] dev_pm_opp_set_rate (drivers/opp/core.c:1357)
> [ 0.988384] set_target (drivers/cpufreq/cpufreq-dt.c:63)
> [ 0.988392] __cpufreq_driver_target (drivers/cpufreq/cpufreq.c:2292 drivers/cpufreq/cpufreq.c:2355)
> [ 0.988398] sugov_work (kernel/sched/cpufreq_schedutil.c:537)
> [ 0.988406] kthread_worker_fn (arch/arm64/include/asm/jump_label.h:32 include/linux/freezer.h:36 include/linux/freezer.h:54 kernel/kthread.c:861)
> [ 0.988414] kthread (kernel/kthread.c:389)
> [ 0.988420] ret_from_fork (arch/arm64/kernel/entry.S:863)
> [ 0.988430] Code: f100067f 54000320 d11ec274 aa1303f5 (f9400273)
>
>
> > @@ -1776,8 +1785,11 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
> > * irqsave/restore variants.
> > */
> > raw_spin_lock_irqsave(&lock->wait_lock, flags);
> > - ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
> > + ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
> > + preempt_disable();
> > raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> > + wake_up_q(&wake_q);
> > + preempt_enable();
> > rt_mutex_post_schedule();
> >
> > return ret;
>
> This is apparently where things went wrong, but it's possible that
> the actual root cause is in the regulator framework instead.
>
> The NULL pointer itself happens when chasing wake_q->first->next,
> so it would seem that one of these got reinitialized at
> the wrong time, perhaps with a task_struct getting freed
> or getting put on more than one wake_q_head list at the
> same time.
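
For reference, the list walk that faults here looks roughly like the
below. This is condensed from memory from the wake_up_q() loop in
kernel/sched/core.c, with the memory barriers and comments trimmed,
so treat it as an illustration rather than a verbatim copy:

void wake_up_q(struct wake_q_head *head)
{
        struct wake_q_node *node = head->first;

        while (node != WAKE_Q_TAIL) {
                struct task_struct *task;

                task = container_of(node, struct task_struct, wake_q);
                /*
                 * If the node got cleared or reused underneath us,
                 * this load chases a NULL pointer.
                 */
                node = node->next;
                task->wake_q.next = NULL;

                wake_up_process(task);
                put_task_struct(task);
        }
}

Since that singly linked list is threaded through the one wake_q_node
embedded in each task_struct, a task landing on two wake_q_head lists
at once, or having its node reinitialized while another CPU is still
walking the chain, leaves the walker loading ->next from NULL, which
matches the fault address above.
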
Thanks so much again to Anders for finding and reporting this!

So I've managed to reproduce this and I'm chasing down what's going wrong now.

Thanks so much for the report!
-john