[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJZ5v0jKPDrpcqk2tG8UNba3z2JL5XTUJeWcatErKC+i7om2JQ@mail.gmail.com>
Date: Mon, 25 Nov 2024 11:40:12 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>, Sasha Levin <sashal@...nel.org>,
Lukasz Luba <lukasz.luba@....com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>, Linux PM <linux-pm@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>, Daniel Lezcano <daniel.lezcano@...aro.org>
Subject: Re: [GIT PULL] Thermal control updates for v6.13-rc1
Sasha, thanks for the report!
Linus, thanks for the analysis!
On Sun, Nov 24, 2024 at 7:40 PM Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> On Sun, 24 Nov 2024 at 09:10, Sasha Levin <sashal@...nel.org> wrote:
> >
> > On Mon, Nov 18, 2024 at 11:23:46AM +0100, Rafael J. Wysocki wrote:
> > >Hi Linus,
> > >
> > >Please pull from the tag
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
> > > thermal-6.13-rc1
> >
> > Hi Rafael,
> >
> > After merging this PR into linus-next, KernelCI started reporting boot
> > failures on a few platforms:
> >
> > [ 6.921489] [00000000000000d0] user address but active_mm is swapper
>
> This is an odd arm64 way of saying "NULL pointer dereference in kernel".
>
> The NULL pointer is in the user space address range, so it
> superficially looks like it's trying to do a user access.
>
> People are more used to the x86 page fault error reporting (and
> honestly - they are a bit better worded in this case too).
>
> If I did the disassembly correctly, the code disassembles to
>
> cbz x7, 0x168
> mov x5, x7
> stp x6, x7, [x20, #24]
> mov w19, #0x0
> ldr x0, [x5, #208]! <--- faulting instruction
>
> which does match that address 00000000000000d0 does match "x5+208",
> since x5 is NULL.
>
> Matching it up manually with my build (different config, different
> compiler, so different register allocation), it's this:
>
> // drivers/thermal/gov_power_allocator.c:527: if (last_passive) {
> cbz x6, .L177 // last_passive,
> .L134:
> // drivers/thermal/gov_power_allocator.c:595:
> list_for_each_entry(instance, &td->thermal_instances, trip_node) {
> mov x5, x6 // _21, last_passive
> // drivers/thermal/gov_power_allocator.c:529:
> params->trip_max = last_passive;
> stp x0, x6, [x21, 24] // first_passive, last_passive,
> // drivers/thermal/gov_power_allocator.c:593: int ret = 0;
> mov w19, 0 // <retval>,
> // drivers/thermal/gov_power_allocator.c:595:
> list_for_each_entry(instance, &td->thermal_instances, trip_node) {
> ldr x0, [x5, 208]! // __mptr, MEM[(const struct
> thermal_trip_desc *)prephitmp_29].thermal_instances.next
>
> so it looks like it is that
>
> list_for_each_entry(instance, &td->thermal_instances, trip_node) {
>
> with the 'td' being NULL.
>
> The code seems to do that
>
> const struct thermal_trip_desc *td =
> trip_to_trip_desc(params->trip_max);
>
> So apparently params->trip_max is NULL.
>
> That's where I stopped looking. It's almost certainly due to commit
> 0dc23567c206 ("thermal: core: Move lists of thermal instances to trip
> descriptors") but I don't know the code.
>
> Over to Rafael and Lukasz,
You are right, the above commit introduced this issue and there is one
more hint on it in Sasha's call trace, which is
thermal_tripless_zone_device_register() used for registering the
thermal zone. Of course, this means that there are no trip points in
that zone, so params->trip_max is NULL in check_power_actors() and it
needs to be explicitly checked against NULL.
Previously, the check was not needed because the list of thermal
instances was located in the thermal zone object, so walking it would
just produce no results.
IOW, something like the attached patch is needed.
Thanks, Rafael
View attachment "thermal-gov_power_allocator-fix-null-deref.patch" of type "text/x-patch" (1738 bytes)
Powered by blists - more mailing lists