Message-ID: <6452909d-94f8-4df3-87fc-d8ee0bdba01a@amd.com>
Date: Wed, 7 May 2025 10:16:03 -0400
From: Jason Andryuk <jason.andryuk@....com>
To: Jürgen Groß <jgross@...e.com>, Stefano Stabellini
<sstabellini@...nel.org>, Oleksandr Tyshchenko
<oleksandr_tyshchenko@...m.com>, Boris Ostrovsky <boris.ostrovsky@...cle.com>
CC: Marek Marczykowski-Górecki
<marmarek@...isiblethingslab.com>, <stable@...r.kernel.org>,
<xen-devel@...ts.xenproject.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] xenbus: Use kref to track req lifetime
On 2025-05-07 05:27, Jürgen Groß wrote:
> On 06.05.25 23:09, Jason Andryuk wrote:
>> Marek reported seeing a NULL pointer fault in the xenbus_thread
>> callstack:
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> RIP: e030:__wake_up_common+0x4c/0x180
>> Call Trace:
>> <TASK>
>> __wake_up_common_lock+0x82/0xd0
>> process_msg+0x18e/0x2f0
>> xenbus_thread+0x165/0x1c0
>>
>> process_msg+0x18e is req->cb(req). req->cb is set to xs_wake_up(), a
>> thin wrapper around wake_up(), or xenbus_dev_queue_reply(). It seems
>> like it was xs_wake_up() in this case.
>>
>> It seems like req may have woken up the xs_wait_for_reply(), which
>> kfree()ed the req. When xenbus_thread resumes, it faults on the zero-ed
>> data.
>>
>> Linux Device Drivers 2nd edition states:
>> "Normally, a wake_up call can cause an immediate reschedule to happen,
>> meaning that other processes might run before wake_up returns."
>> ... which would match the behaviour observed.
>>
>> Change to keeping two krefs on each request. One for the caller, and
>> one for xenbus_thread. Each will kref_put() when finished, and the last
>> will free it.
>>
>> This use of kref matches the description in
>> Documentation/core-api/kref.rst
>>
>> Link: https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/
>> Reported-by: "Marek Marczykowski-Górecki"
>> <marmarek@...isiblethingslab.com>
>> Fixes: fd8aa9095a95 ("xen: optimize xenbus driver for multiple
>> concurrent xenstore accesses")
>> Cc: stable@...r.kernel.org
>> Signed-off-by: Jason Andryuk <jason.andryuk@....com>
>
> Reviewed-by: Juergen Gross <jgross@...e.com>
Thanks
>> ---
>> Kinda RFC-ish as I don't know if it fixes Marek's issue. This does seem
>> like the correct approach if we are seeing req free()ed out from under
>> xenbus_thread.
>
> I think your analysis is correct. When writing this code I didn't think
> of wake_up() needing to access req->wq _after_ having woken up the waiter.

Yes, this was tricky.

One other thing makes me think this is correct, assuming this is the
same underlying issue:
https://lore.kernel.org/xen-devel/Z_lJTyVipJJEpWg2@mail-itl/

There the failure is in the unlock:
pvqspinlock: lock 0xffff8881029af110 has corrupted value 0x0!
WARNING: CPU: 1 PID: 118 at kernel/locking/qspinlock_paravirt.h:504
__pv_queued_spin_unlock_slowpath+0xdc/0x120

That makes me think the req was fine entering wake_up(), and was only
found to be corrupt on the way out.
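
To make the two-reference scheme from the patch description concrete,
here is a minimal userspace sketch. It is not the patch itself: struct
ref, struct req, req_alloc(), req_get(), req_put() and demo_two_refs()
are hypothetical stand-ins, with a C11 atomic counter playing the role
that kref_init()/kref_get()/kref_put() would play in the kernel. It
only replays the ordering suspected above, where the waiter finishes
before the waking side is done with the request.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct kref. */
struct ref {
    atomic_int count;
};

/* Hypothetical stand-in for a xenbus request; only its lifetime matters. */
struct req {
    struct ref ref;
    int reply_ready;
};

static int frees; /* how many times a request has been released */

static void req_release(struct req *r)
{
    frees++;
    free(r);
}

/* Allocate a request holding one reference, owned by the caller. */
static struct req *req_alloc(void)
{
    struct req *r = calloc(1, sizeof(*r));

    if (!r)
        return NULL;
    atomic_init(&r->ref.count, 1);
    return r;
}

/* Take an extra reference, here on behalf of the processing thread. */
static void req_get(struct req *r)
{
    atomic_fetch_add(&r->ref.count, 1);
}

/* Drop one reference; returns 1 if this call freed the request. */
static int req_put(struct req *r)
{
    if (atomic_fetch_sub(&r->ref.count, 1) == 1) {
        req_release(r);
        return 1;
    }
    return 0;
}

/*
 * Replay the suspected ordering in a single thread: the waiter drops
 * its reference before the waking side has finished with the request.
 * Returns the number of times the request was freed, or -1 on
 * allocation failure.
 */
int demo_two_refs(void)
{
    struct req *r = req_alloc();
    int freed_by_waiter, freed_by_thread;

    if (!r)
        return -1;
    req_get(r);                 /* second reference for the thread */
    frees = 0;

    r->reply_ready = 1;         /* thread publishes the reply and "wakes" */
    freed_by_waiter = req_put(r);   /* waiter runs first and lets go */

    /* The thread can still touch the request, as wake_up() does on exit. */
    assert(r->reply_ready == 1);

    freed_by_thread = req_put(r);   /* last reference: freed here */
    assert(!freed_by_waiter && freed_by_thread);
    return frees;
}
```

With a single owner, the waiter's release would have freed the request
while the waking side was still inside wake_up(). With two references,
whichever side finishes last performs the free.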
Regards,
Jason