lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 04 Dec 2012 15:51:55 +0100
From:	Sven Eckelmann <sven@...fation.org>
To:	b.a.t.m.a.n@...ts.open-mesh.org, netdev@...r.kernel.org
Cc:	Sasha Levin <sasha.levin@...cle.com>
Subject: Re: net, batman: lockdep circular dependency warning

Hi,

thanks for your report. It seems nobody else wanted to give an answer... so I 
try to give a small overview.

On Monday 12 November 2012 15:37:47 Sasha Levin wrote:
> Hi all,
> 
> While fuzzing with trinity inside a KVM tools (lkvm) guest running latest
> -next kernel, I've stumbled on the following:
> 
> [ 1002.969392] ======================================================
> [ 1002.971639] [ INFO: possible circular locking dependency detected ]
> [ 1002.975805] 3.7.0-rc5-next-20121112-sasha-00018-g2f4ce0e #127 Tainted: G 
>       W [ 1002.983691]
> ------------------------------------------------------- [ 1002.983691]
> trinity-child18/8149 is trying to acquire lock:
> [ 1002.983691]  (s_active#313){++++.+}, at: [<ffffffff812f9941>]
> sysfs_addrm_finish+0x31/0x60 [ 1002.983691]
> [ 1002.983691] but task is already holding lock:
> [ 1002.983691]  (rtnl_mutex){+.+.+.}, at: [<ffffffff834fcc62>]
> rtnl_lock+0x12/0x20 [ 1002.983691]
> [ 1002.983691] which lock already depends on the new lock.

It is known that batman-adv has a problem with the attaching/detaching of 
interfaces over this sysfs. The cause of this problem is related to the fact 
that batman-adv not only creates its own net_devices, but also unregisters 
net_devices. This unregister will add a new element in the net_todo_list. This 
will cause a rtnl_lock when it calls netdev_wait_allrefs (there are some 
condition, but we just ignore them for now). So the whole exercise of using 
rtnl_trylock was useless.

This extra rtnl_lock can cause a deadlock as you found out because it is 
activated through a sysfs file and therefore the s_active mutex is locked (we 
have the dependency s_active -> rtnl_mutex, but other users have rtnl_mutex -> 
s_active).

So, what to do? There are different possibilities. We have to keep in mind 
that there is a patchset (not yet accepted by the batman-adv maintainers) 
which allows to use `ip link` or compatible tools to create/destroy batman-adv 
devices and attach/detach other devices.

1. Remove the sysfs interface to attach/detach net_devices (which
   destroys/creates batman-adv devices)

   This is not really backward compatible and therefore not really acceptable.
   Marek Lindner and Simon Wunderlich are also against forcing users to
   require special tools to add/configure batman-adv devices (even batctl, ip
   and so on).

2. Ignore the possible deadlock

   (sry, fill in your own comment...)

3. Add workarounds in the core net code

   Simon Wunderlich already tried it... I personally think it is not the right
   way because it more likely to introduce more bugs by hiding a batman-adv
   bug. And these bugs are a lot harder to find... trust me

   For example the usage of __rtnl_unlock will let this bug to appear in
   other places which use rtnl_trylock. This is caused by the fact that the
   todo item isn't processed by __rtnl_unlock (this is the whole idea by
   calling it) and therefore the todo work stays in net_todo_list. Another
   user of rtnl_trylock will now call rtnl_unlock and don't expect an entry in
   net_todo_list because he never unregistered a device. So he now has the
   problem of batman-adv (what an unsocial läderlappen).

   And moving everybody using rtnl_trylock to __rtnl_unlock has still the
   problem that batman-adv don't immediatelly work on its todo and so
   maybe causes other side effects because... the notifications weren't
   sent and therefore the refcount of the unregistered device didn't went
   to zero.

   (I'll leave other side effects as homework for the reader)

4. Don't automatically remove batman-adv devices

   The current approach is to automatically unregister batman-adv devices
   when they don't have attached slave-devices (hardif called by batman-adv).
   Removing this will slightly change the behaviour, but the interface can
   still be removed using `ip link del dev bat0` or a similar tool.

5. Add a workaround solution and promote the use of the standard interface

   So, the basic problem is the s_active mutex locked by the sysfs interface.
   An idea is to postpone the part which needs the rtnl_mutex to a later time.
   This has obviously the problem that we cannot return an error code to the
   caller when the device creation failed in the postponed part. This problem
   can reduced slightly be moving only the unregister part, but now I'll leave
   this out for simplicity of the description.

   A possible implementation would create a work_struct and add it to
   batadv_event_workqueue. This work item has to contain all information given
   by the user (which hardif, name of the batman-adv device).

   Simon Wunderlich already disliked this workaround, but Antonio Quartulli
   tried to encourage an RFC implementation. I've prefered a textual
   description rather than a patch missing explanations of the other
   alternatives.

Kind regards,
	Sven
Download attachment "signature.asc" of type "application/pgp-signature" (837 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ