Message-ID: <2450943.NG923GbCHz@sven-desktop>
Date: Sun, 28 Sep 2025 09:50:02 +0200
From: Sven Eckelmann <sven@...fation.org>
To: Marek Lindner <marek.lindner@...lbox.org>,
 Simon Wunderlich <sw@...onwunderlich.de>,
 Antonio Quartulli <antonio@...delbit.com>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Simon Horman <horms@...nel.org>, b.a.t.m.a.n@...ts.open-mesh.org,
 Network Development <netdev@...r.kernel.org>,
 Linus Lüssing <linus.luessing@...3.blue>,
 Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Subject: Re: unregister_netdevice: waiting for batadv_slave_0 to become free. Usage count = 2

On Sunday, 28 September 2025 03:06:05 CEST Tetsuo Handa wrote:
> Thank you for responding quickly.
> 
> On 2025/09/28 2:21, Sven Eckelmann wrote:
> > The question would now be: what is the actual problem? 
> 
> Sorry, my explanation was not clear enough.

It was long and covered a lot of ground, but not what the actual problem was. 
One has to read through many inline call traces with subclauses and then, by 
reading between the lines, figure out what you actually wanted to say.

It is fine not to know the underlying problem. But all these absolute 
statements, accusations and overly detailed statements made me think that I 
was just too stupid to get it and that you must be right.

> What I thought was the problem is the difference between
> 
> 	netlink_device_change(&nlmsg, sock, "batadv_slave_0", true, "batadv0", 0, 0);
> 	//netlink_device_change(&nlmsg, sock, "batadv_slave_0", true, 0, &macaddr, ETH_ALEN);
> 
> and
> 
> 	netlink_device_change(&nlmsg, sock, "batadv_slave_0", false, "batadv0", 0, 0);
> 	netlink_device_change(&nlmsg, sock, "batadv_slave_0", true, 0, &macaddr, ETH_ALEN);
> 
> . The former makes hard_iface->if_status == BATADV_IF_ACTIVE while the latter makes
> hard_iface->if_status == BATADV_IF_TO_BE_ACTIVATED (because batadv_iv_ogm_schedule_buff()
> is not called).
> 
> This difference makes operations that depend on hard_iface->if_status == BATADV_IF_ACTIVE
> impossible to work properly. You can confirm this difference using the diff shown below.

This is (in my opinion) again one of those (odd) absolute statements. 
"impossible to work properly" sounds as if BATADV_IF_TO_BE_ACTIVATED were a 
state you cannot escape, and as if functions/operations depended on 
BATADV_IF_ACTIVE. Neither statement is really true.

BATADV_IF_TO_BE_ACTIVATED is a transient state, and routing-algorithm-specific 
code is responsible for automatically moving the interface into the 
BATADV_IF_ACTIVE state. This matters here because, the first time I read your 
second mail, I was under the impression that something in the reproducer 
showed the state getting stuck. I searched rather hard in the code but couldn't 
find a reason for this. Only much later did I decide to ignore all this and 
look at what the reproducer was actually doing. I also ignored commit 
9e6b5648bbc4 ("batman-adv: Fix duplicated OGMs on NETDEV_UP"), because I could 
not reproduce the problem at that commit.

And regarding the functions/operations that are "impossible to work properly": 
called functions must "work properly" independent of the state. Only the work 
they actually perform may differ depending on the state. But maybe this is a 
case of "glass half full" vs. "glass half empty".

The problem is therefore that some function broke this "promise". Your second 
mail (and the patch) then basically said that BATADV_IF_TO_BE_ACTIVATED must 
not exist and that we must go directly to BATADV_IF_ACTIVE. Even setting aside 
that this is, in my opinion, not the right conclusion, it never said why the 
state must not exist or what broke because of BATADV_IF_TO_BE_ACTIVATED.

The inline call traces with detailed statements in subclauses make the mail 
harder to digest. A few short high-level statements like
"I don't know exactly what the underlying problem is but skipping 
BATADV_IF_TO_BE_ACTIVATED in batadv_hardif_activate_interface() seems to work 
around the problem. I suspect that some function is not handling 
BATADV_IF_TO_BE_ACTIVATED correctly. Maybe some kind of race condition between 
switching to BATADV_IF_ACTIVE and executing some specific code. Here are my 
detailed notes:"

would have kept me from getting stuck so long interpreting paragraphs, and at 
the same time would have given me plenty of pointers in the right direction. 
But maybe I would have been stuck anyway; no idea.

Anyway, I hope we have found the problem now. Thanks for the help.

Regards,
	Sven