Message-ID: <b0a52bb4-9f5a-4b52-8209-c585228ca28f@nvidia.com>
Date: Tue, 4 Mar 2025 10:18:47 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Strforexc yn <strforexc@...il.com>, Lai Jiangshan
<jiangshanlai@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>,
Josh Triplett <josh@...htriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, rcu@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: KASAN: global-out-of-bounds Read in srcu_gp_start_if_needed
On 3/4/2025 10:11 AM, Steven Rostedt wrote:
> On Mon, 3 Mar 2025 22:57:32 -0500
> Joel Fernandes <joelagnelf@...dia.com> wrote:
>
>>>
>>> The lock taken is from the passed in rcu_pending pointer.
>>>
>>>> [ 92.322655][ T28] rcu_pending_enqueue+0x686/0xd30
>>>> [ 92.322676][ T28] ? __pfx_rcu_pending_enqueue+0x10/0x10
>>>> [ 92.322693][ T28] ? trace_lock_release+0x11a/0x180
>>>> [ 92.322708][ T28] ? bkey_cached_free+0xa3/0x170
>>>> [ 92.322725][ T28] ? lock_release+0x13/0x180
>>>> [ 92.322744][ T28] ? bkey_cached_free+0xa3/0x170
>>>> [ 92.322760][ T28] bkey_cached_free+0xfd/0x170
>>>
>>> Which has:
>>>
>>> static void bkey_cached_free(struct btree_key_cache *bc,
>>> struct bkey_cached *ck)
>>> {
>>> kfree(ck->k);
>>> ck->k = NULL;
>>> ck->u64s = 0;
>>>
>>> six_unlock_write(&ck->c.lock);
>>> six_unlock_intent(&ck->c.lock);
>>>
>>> bool pcpu_readers = ck->c.lock.readers != NULL;
>>> rcu_pending_enqueue(&bc->pending[pcpu_readers], &ck->rcu);
>>> this_cpu_inc(*bc->nr_pending);
>>> }
>>>
>>> So if that bc->pending[pcpu_readers] gets corrupted in any way, that could trigger this.
>>
>> True. Another thing that could cause this is corruption of the per-cpu global
>> data section, because the crash is happening in this trylock per the
>> above stack:
>>
>> srcu_gp_start_if_needed ->
>> spin_lock_irqsave_sdp_contention(sdp) ->
>> spin_trylock(sdp->lock)
>>
>> where sdp is ssp->sda and is allocated from per-cpu storage.
>>
>> So corruption of the per-cpu global data section can also trigger this, even
>> if the rcu_pending pointer is intact.
>
> If there was corruption of the per-cpu section, you would think it would
> have a bigger impact than just this location, since most of the kernel
> relies on the per-cpu section.
>
> But it could be corruption of the per-cpu variable that was used, caused by
> the code that uses it.
>
> That code is quite complex, and I usually try to rule out the code that is
> used in one location as being the issue before looking at something like
> per-cpu or RCU code which is used all over the place, and if that was
> buggy, it would likely blow up elsewhere outside of bcachefs.

Your strategy makes sense, since bugs are usually isolated. FWIW, though, we
are in a monolithic world, which stretches the definition of "isolated" ;-)

> But who knows, perhaps the bcachefs triggered a corner case?

Yeah, could be. Anyway, let's see if the complaint comes back. ;-)

- Joel