[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251021083405.3047161-1-lizhi.xu@windriver.com>
Date: Tue, 21 Oct 2025 16:34:05 +0800
From: Lizhi Xu <lizhi.xu@...driver.com>
To: <dan.carpenter@...aro.org>
CC: <lizhi.xu@...driver.com>, <davem@...emloft.net>, <edumazet@...gle.com>,
<horms@...nel.org>, <kuba@...nel.org>, <linux-hams@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>,
<pabeni@...hat.com>,
<syzbot+2860e75836a08b172755@...kaller.appspotmail.com>,
<syzkaller-bugs@...glegroups.com>
Subject: Re: [PATCH V2] netrom: Prevent race conditions between multiple add route
On Tue, 21 Oct 2025 09:36:47 +0300, Dan Carpenter wrote:
> On Tue, Oct 21, 2025 at 10:05:33AM +0800, Lizhi Xu wrote:
> > On Mon, 20 Oct 2025 20:59:24 +0300, Dan Carpenter wrote:
> > > On Mon, Oct 20, 2025 at 09:49:12PM +0800, Lizhi Xu wrote:
> > > > On Mon, 20 Oct 2025 21:34:56 +0800, Lizhi Xu wrote:
> > > > > > Task0 Task1 Task2
> > > > > > ===== ===== =====
> > > > > > [97] nr_add_node()
> > > > > > [113] nr_neigh_get_dev() [97] nr_add_node()
> > > > > > [214] nr_node_lock()
> > > > > > [245] nr_node->routes[2].neighbour->count--
> > > > > > [246] nr_neigh_put(nr_node->routes[2].neighbour);
> > > > > > [248] nr_remove_neigh(nr_node->routes[2].neighbour)
> > > > > > [283] nr_node_unlock()
> > > > > > [214] nr_node_lock()
> > > > > > [253] nr_node->routes[2].neighbour = nr_neigh
> > > > > > [254] nr_neigh_hold(nr_neigh); [97] nr_add_node()
> > > > > > [XXX] nr_neigh_put()
> > > > > > ^^^^^^^^^^^^^^^^^^^^
> > > > > >
> > > > > > These charts are supposed to be chronological so [XXX] is wrong because the
> > > > > > use after free happens on line [248]. Do we really need three threads to
> > > > > > make this race work?
> > > > > The UAF problem occurs in Task2. Task1 sets the refcount of nr_neigh to 1,
> > > > > then Task0 adds it to routes[2]. Task2 releases routes[2].neighbour after
> > > > > executing [XXX]nr_neigh_put().
> > > > Execution Order:
> > > > 1 -> Task0
> > > > [113] nr_neigh_get_dev() // After execution, the refcount value is 3
> > > >
> > > > 2 -> Task1
> > > > [246] nr_neigh_put(nr_node->routes[2].neighbour); // After execution, the refcount value is 2
> > > > [248] nr_remove_neigh(nr_node->routes[2].neighbour) // After execution, the refcount value is 1
> > > >
> > > > 3 -> Task0
> > > > [253] nr_node->routes[2].neighbour = nr_neigh // nr_neigh's refcount value is 1 and add it to routes[2]
> > > >
> > > > 4 -> Task2
> > > > [XXX] nr_neigh_put(nr_node->routes[2].neighbour) // After execution, neighhour is freed
> > > > if (nr_node->routes[2].neighbour->count == 0 && !nr_node->routes[2].neighbour->locked) // Uaf occurs this line when accessing neighbour->count
> > >
> > > Let's step back a bit and look at the bigger picture design. (Which is
> > > completely undocumented so we're just guessing).
> > >
> > > When we put nr_neigh into nr_node->routes[] we bump the nr_neigh_hold()
> > > reference count and nr_neigh->count++, then when we remove it from
> > > ->routes[] we drop the reference and do nr_neigh->count--.
> > >
> > > If it's the last reference (and we are not holding ->locked) then we
> > > remove it from the &nr_neigh_list and drop the reference count again and
> > > free it. So we drop the reference count twice. This is a complicated
> > > design with three variables: nr_neigh_hold(), nr_neigh->count and
> > > ->locked. Why can it not just be one counter nr_neigh_hold(). So
> > > instead of setting locked = true we would just take an extra reference?
> > > The nr_neigh->count++ would be replaced with nr_neigh_hold() as well.
> > locked controls whether the neighbor quality can be automatically updated;
>
> I'm not sure your patch fixes the bug because we could still race against
> nr_del_node().
This is fine in this issue, but for rigor, I'll add locks to all related
ioctl and route frame operations to maintain synchronization.
I will send V3 patch to improve it.
>
> I'm not saying get rid of locked completely, I'm saying get rid of code like
> this:
> if (nr_node->routes[2].neighbour->count == 0 && !nr_node->routes[2].neighbour->locked)
> nr_remove_neigh(nr_node->routes[2].neighbour);
>
> Right now, locked serves as a special kind of reference count, because we
> don't drop the reference if it's true.
I don't think this is correct.
>
> > count controls the number of different routes a neighbor is linked to;
>
> Sure, that is interesting information for the user, so keep it around to
> print in the proc file, but don't use it as a reference count.
>
> > refcount is simply used to manage the neighbor lifecycle.
>
> The bug is caused because our reference counting is bad.
>
> So right now what happens is we allocate nr_neigh and we put it on the
> &nr_neigh_list. Then we lock it or we add it to ->routes[] and each of
> those has a different reference count. Then when we drop those references
> we do:
>
> if (nr_node->routes[2].neighbour->count == 0 && !nr_node->routes[2].neighbour->locked)
> nr_remove_neigh(nr_node->routes[2].neighbour);
>
> This removes it from the list, and hopefully this is the last reference
> and it frees it.
>
> It would be much simpler to say, we only use nr_neigh_hold()/put() for
> reference counting. When we set locked we do:
>
> nr_neigh_hold(nr_neigh);
> nr_neigh->locked = true;
>
> Incrementing the refcount means it can't be freed.
No, setting locked = 1 is only done in nr_add_neigh(), and nr_neigh_hold()
is not executed, and the refcount value is 1.
>
> Then when we remove nr_neigh from ->routes[] we wouldn't "remove it from
> the list", instead we would just drop a reference. When we dropped the
> last reference, nr_neigh_put() would remove it from the list.
>
> My proposal would be a behavior change because right now what happens is:
>
> 1: allocate nr_neigh
> 2: add it to ->routes[]
> 3: remove it from ->routes[]
> (freed automatically because we drop two references)
No, No, I know where your analysis went wrong, it is here.
The problem is not when allocating neigh and adding it to routes[2],
but when nr_add_node is executed twice later, one is Task0 as I mentioned
above, and the other is Task1.
After Task1 moves the neighbor out of routes, Task0 uses nr_neigh_get_dev()
to get the neighbor that Task1 moved out, and then adds it to its routes.
This is wrong. Task0 should not use nr_neigh_get_dev() to obtain the neighbor
before other tasks move it out. This will interfere with the reference count
of the neighbor, which is the root cause of the problem.
BR,
Lizhi
Powered by blists - more mailing lists