[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <2754825.KlZ2vcFHjT@sven-desktop>
Date: Sat, 27 Sep 2025 19:21:51 +0200
From: Sven Eckelmann <sven@...fation.org>
To: Marek Lindner <marek.lindner@...lbox.org>,
Simon Wunderlich <sw@...onwunderlich.de>,
Antonio Quartulli <antonio@...delbit.com>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, b.a.t.m.a.n@...ts.open-mesh.org,
Network Development <netdev@...r.kernel.org>,
Linus Lüssing <linus.luessing@...3.blue>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Subject:
Re: unregister_netdevice: waiting for batadv_slave_0 to become free. Usage
count = 2
On Tuesday, 23 September 2025 16:45:48 CEST Tetsuo Handa wrote:
> On 2025/09/22 23:09, Tetsuo Handa wrote:
> > I suspect that batadv_hard_if_event_meshif() has something to do upon
> > NETDEV_UNREGISTER event because batadv_hard_if_event_meshif() receives
> > NETDEV_POST_INIT / NETDEV_REGISTER / NETDEV_UNREGISTER / NETDEV_PRE_UNINIT
> > events when this reproducer is executed, but I don't know what to do...
>
> With a change show bottom, the reproducer no longer triggers this problem.
> But is this change correct?
Thanks for the writeup. Unfortunately, I tried to read this multiple times but
you don't make it easy to figure out what the problem is.
Please don't take the following paragraphs the wrong way - I was just not able
to really figure out what you meant. I have therefore just added my thoughts
on each paragraph.
At the end of this mail, I tried to give a shorter summary about the interface
states and what I think is the actual problem.
> Commit 9e6b5648bbc4 ("batman-adv: Fix duplicated OGMs on NETDEV_UP")
> replaced batadv_iv_iface_activate() (which is called via iface.activate()
> from batadv_hardif_activate_interface()) with batadv_iv_iface_enabled()
> (which is called via iface.enabled() from batadv_hardif_enable_interface()).
> But that commit missed that batadv_hardif_activate_interface() is called from
> both batadv_hardif_enable_interface() and batadv_hard_if_event().
Why did it miss it? This is the plan in the mentioned commit.
What is the relevant effect which is creating the problem here?
> Since batadv_iv_ogm_schedule_buff() updates if_status to BATADV_IF_ACTIVE
> only when if_status was BATADV_IF_TO_BE_ACTIVATED, we need to call
> batadv_iv_ogm_schedule_buff() from batadv_iv_ogm_schedule() from
> batadv_iv_iface_enabled() via iface.enabled() with
> if_status == BATADV_IF_TO_BE_ACTIVATED if we want iface.enabled() from
> batadv_hardif_enable_interface() to update if_status to BATADV_IF_ACTIVE.
This basically says: Because of A <- B <- C, we need to have C to B to get A.
But not really what effect this would have.
> But when IFF_UP is not set upon creation, batadv_hardif_enable_interface()
> does not call batadv_hardif_activate_interface(), which means that
> if_status remains BATADV_IF_INACTIVE despite
> batadv_iv_ogm_schedule_buff() is called via iface.enabled().
It must stay inactive when it is down. But the periodic OGM scheduling is
still needed to have the OGM queued for active interfaces.
> And when IFF_UP is set after creation, batadv_hard_if_event() calls
> batadv_hardif_activate_interface(). But despite "Interface activated: %s\n"
> message being printed, if_status remains BATADV_IF_TO_BE_ACTIVATED because
> iface.activate == NULL due to above-mentioned commit.
This is not necessarily true. It will simply be switched to BATADV_IF_ACTIVATE
with the next OGM because batadv_iv_ogm_schedule is rescheduled all the time.
> Since we need to call iface.enabled() instead of iface.activate() so that
> batadv_iv_ogm_schedule_buff() will update if_status to BATADV_IF_ACTIVE,
> move iface.enabled() from batadv_hardif_enable_interface() to
> batadv_hardif_activate_interface().
Why do we need to switch to this state in the first place? I didn't find this
explanation here.
If I would add your change, we would suddenly have multiple scheduled OGMs
again, right? Because it basically an oddly written revert of
commit 9e6b5648bbc4 ("batman-adv: Fix duplicated OGMs on NETDEV_UP").
If this would actually the problem, why would the BATADV_CMD_GET_NEIGHBORS
request be required? I would have expected that following would show the same
problem when the switch of the state is the problem:
rmmod batman-adv || true
modprobe batman-adv
ip link add batadv_slave_0 type veth peer name veth0_to_batadv
ip link add batadv0 type batadv
batctl meshif batadv0 it 1000000
ip link set down master batadv0 dev batadv_slave_0
ip link set up address 00a:a:aa:aa:aa:aa dev batadv_slave_0
ip link del dev batadv_slave_0
Let me rephrase what happens with the state of an interface. Just so we are on
the same page:
* an interface detected by batman-adv is starting with the state
BATADV_IF_NOT_IN_USE
* an active interface has the state BATADV_IF_ACTIVE
* when an interface is added to a batman-adv meshif, it is first set to the
state BATADV_IF_TO_BE_ACTIVATED
- this is done by batadv_hardif_activate_interface()
- there are some dependencies to the UP state of the netdev - see below
* the transition from BATADV_IF_TO_BE_ACTIVATED to BATADV_IF_ACTIVE is
algorithm specific
- IV: scheduling of the OGM buffer
- V: "directly" after BATADV_IF_TO_BE_ACTIVATED is set
The transition from state BATADV_IF_NOT_IN_USE to BATADV_IF_TO_BE_ACTIVATED
only happens when the interface is actually UP when .ndo_add_slave is called.
Otherwise, batman-adv postpones the call to batadv_hardif_activate_interface()
until the NETDEV_UP event is received.
As mentioned earlier the B.A.T.M.A.N. IV algorithm implementation is (in your
reproducer) responsible for switching the interface state from
BATADV_IF_TO_BE_ACTIVATED to BATADV_IF_ACTIVE. And it will only do
this when the interface was actually in the state BATADV_IF_TO_BE_ACTIVATED.
There is a comment above the statement which explains the details.
And exactly this "delay" makes it more likely that operations are started
while an interface is in the transition state of BATADV_IF_TO_BE_ACTIVATED.
Simply because the periodic OGM scheduling isn't triggered immediately - it
continues with its normal periodicity.
The question would now be: what is the actual problem? For example, with
following change, the problem also seems to be gone:
diff --git i/net/batman-adv/originator.c w/net/batman-adv/originator.c
index c84420cb..f82dce20 100644
--- i/net/batman-adv/originator.c
+++ w/net/batman-adv/originator.c
@@ -763,7 +763,7 @@ int batadv_hardif_neigh_dump(struct sk_buff *msg, struct netlink_callback *cb)
bat_priv = netdev_priv(mesh_iface);
primary_if = batadv_primary_if_get_selected(bat_priv);
- if (!primary_if || primary_if->if_status != BATADV_IF_ACTIVE) {
+ if (!primary_if) {
ret = -ENOENT;
goto out_put_mesh_iface;
}
And now we are most likely on the right path... primary_if is holding a
reference to an hardif and therefore also a reference to the netdev. And the
error handling is only taking care of releasing the reference to the meshif
but not the primary_if.
I will later send a fix for this with you and
syzbot+881d65229ca4f9ae8c84@...kaller.appspotmail.com as Reported-by. Is this
okay for you?
As said, earlier - I am really thankful that you worked on this. So please
don't interpret this as a harsh criticism. I just had really big problems to
figure out what you wanted to say in these paragraphs. I usually prefer
something which is easier to consume.
Regards,
Sven
Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)
Powered by blists - more mailing lists