[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b8136636-b729-a045-6266-6e93ba4b83f4@gmail.com>
Date: Mon, 30 Nov 2020 20:00:40 +0100
From: Eric Dumazet <eric.dumazet@...il.com>
To: Vladimir Oltean <olteanv@...il.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: Eric Dumazet <edumazet@...gle.com>,
Stephen Hemminger <stephen@...workplumber.org>,
netdev <netdev@...r.kernel.org>,
Paul Gortmaker <paul.gortmaker@...driver.com>,
Jiri Benc <jbenc@...hat.com>,
Or Gerlitz <ogerlitz@...lanox.com>,
Cong Wang <xiyou.wangcong@...il.com>,
Jamal Hadi Salim <jhs@...atatu.com>,
Andrew Lunn <andrew@...n.ch>,
Florian Fainelli <f.fainelli@...il.com>
Subject: Re: Correct usage of dev_base_lock in 2020
On 11/30/20 7:48 PM, Vladimir Oltean wrote:
> On Mon, Nov 30, 2020 at 10:14:05AM -0800, Jakub Kicinski wrote:
>> On Mon, 30 Nov 2020 11:41:10 +0100 Eric Dumazet wrote:
>>>> So dev_base_lock dates back to the Big Kernel Lock breakup back in Linux 2.4
>>>> (ie before my time). The time has come to get rid of it.
>>>>
>>>> The use is sysfs is because could be changed to RCU. There have been issues
>>>> in the past with sysfs causing lock inversions with the rtnl mutex, that
>>>> is why you will see some trylock code there.
>>>>
>>>> My guess is that dev_base_lock readers exist only because no one bothered to do
>>>> the RCU conversion.
>>>
>>> I think we did, a long time ago.
>>>
>>> We took care of all ' fast paths' already.
>>>
>>> Not sure what is needed, current situation does not bother me at all ;)
>>
>> Perhaps Vladimir has a plan to post separately about it (in that case
>> sorry for jumping ahead) but the initial problem was procfs which is
>> (hopefully mostly irrelevant by now, and) taking the RCU lock only
>> therefore forcing drivers to have re-entrant, non-sleeping
>> .ndo_get_stats64 implementations.
>
> Right, the end reason why I'm even looking at this is because I want to
> convert all callers of dev_get_stats to use sleepable context and not
> atomic. This makes it easier to gather statistics from devices that have
> a firmware, or off-chip devices behind a slow bus like SPI.
>
> Like Jakub pointed out, some places call dev_get_stats while iterating
> through the list of network interfaces - one would be procfs, but not
> only. These callers are pure readers, so they use RCU protection. But
> that gives us atomic context when calling dev_get_stats. The naive
> solution is to convert all those callers to hold the RTNL mutex, which
> is the writer-side protection for the network interface lists, and which
> is sleepable. In fact I do have a series of 8 patches where I get that
> done. But there are some weirder cases, such as the bonding driver,
> where I need to do this:
>
> -----------------------------[cut here]-----------------------------
> From 369a0e18a2446cda8ff52d72c02aa144ae6687ec Mon Sep 17 00:00:00 2001
> From: Vladimir Oltean <vladimir.oltean@....com>
> Date: Mon, 30 Nov 2020 02:39:46 +0200
> Subject: [PATCH] net: bonding: retrieve device statistics under RTNL, not RCU
>
> In the effort of making .ndo_get_stats64 be able to sleep, we need to
> ensure the callers of dev_get_stats do not use atomic context.
>
> The bonding driver uses an RCU read-side critical section to ensure the
> integrity of the list of network interfaces, because the driver iterates
> through all net devices in the netns to find the ones which are its
> configured slaves. We still need some protection against an interface
> registering or deregistering, and the writer-side lock, the RTNL mutex,
> is fine for that, because it offers sleepable context.
>
> We are taking the RTNL this way (checking for rtnl_is_locked first)
> because the RTNL is not guaranteed to be held by all callers of
> ndo_get_stats64, in fact there will be work in the future that will
> avoid as much RTNL-holding as possible.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@....com>
> ---
> drivers/net/bonding/bond_main.c | 18 +++++++-----------
> include/net/bonding.h | 1 -
> 2 files changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index e0880a3840d7..1d44534e95d2 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3738,21 +3738,17 @@ static void bond_get_stats(struct net_device *bond_dev,
> struct rtnl_link_stats64 *stats)
> {
> struct bonding *bond = netdev_priv(bond_dev);
> + bool rtnl_locked = rtnl_is_locked();
> struct rtnl_link_stats64 temp;
> struct list_head *iter;
> struct slave *slave;
> - int nest_level = 0;
>
> + if (!rtnl_locked)
> + rtnl_lock();
Gosh, do not do that.
Convert the bonding ->stats_lock to a mutex instead.
Adding more reliance to RTNL is not helping cases where
access to stats should not be blocked by other users of RTNL (which can be abused)
>
> - rcu_read_lock();
> -#ifdef CONFIG_LOCKDEP
> - nest_level = bond_get_lowest_level_rcu(bond_dev);
> -#endif
> -
> - spin_lock_nested(&bond->stats_lock, nest_level);
> memcpy(stats, &bond->bond_stats, sizeof(*stats));
>
> - bond_for_each_slave_rcu(bond, slave, iter) {
> + bond_for_each_slave(bond, slave, iter) {
> const struct rtnl_link_stats64 *new =
> dev_get_stats(slave->dev, &temp);
>
> @@ -3763,8 +3759,9 @@ static void bond_get_stats(struct net_device *bond_dev,
> }
>
> memcpy(&bond->bond_stats, stats, sizeof(*stats));
> - spin_unlock(&bond->stats_lock);
> - rcu_read_unlock();
> +
> + if (!rtnl_locked)
> + rtnl_unlock();
> }
>
> static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int cmd)
> @@ -5192,7 +5189,6 @@ static int bond_init(struct net_device *bond_dev)
> if (!bond->wq)
> return -ENOMEM;
>
> - spin_lock_init(&bond->stats_lock);
> netdev_lockdep_set_classes(bond_dev);
>
> list_add_tail(&bond->bond_list, &bn->dev_list);
> diff --git a/include/net/bonding.h b/include/net/bonding.h
> index d9d0ff3b0ad3..6fbde9713424 100644
> --- a/include/net/bonding.h
> +++ b/include/net/bonding.h
> @@ -224,7 +224,6 @@ struct bonding {
> * ALB mode (6) - to sync the use and modifications of its hash table
> */
> spinlock_t mode_lock;
> - spinlock_t stats_lock;
> u8 send_peer_notif;
> u8 igmp_retrans;
> #ifdef CONFIG_PROC_FS
> -----------------------------[cut here]-----------------------------
>
> Only that there's a problem with this. It's the lock ordering that
> Stephen talked about.
>
> cat /sys/class/net/bond0/statistics/*
> [ 38.715829]
> [ 38.717329] ======================================================
> [ 38.723528] WARNING: possible circular locking dependency detected
> [ 38.729730] 5.10.0-rc5-next-20201127-00016-gde02bf51f2fa #1481 Not tainted
> [ 38.736628] ------------------------------------------------------
> [ 38.742828] cat/602 is trying to acquire lock:
> [ 38.747285] ffffcf4908119080 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_lock+0x20/0x28
> [ 38.754555]
> [ 38.754555] but task is already holding lock:
> [ 38.760406] ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0
> [ 38.768631]
> [ 38.768631] which lock already depends on the new lock.
> [ 38.768631]
> [ 38.776835]
> [ 38.776835] the existing dependency chain (in reverse order) is:
> [ 38.784341]
> [ 38.784341] -> #1 (kn->active#123){++++}-{0:0}:
> [ 38.790387] lock_acquire+0x238/0x468
> [ 38.794583] __kernfs_remove+0x26c/0x3e0
> [ 38.799040] kernfs_remove_by_name_ns+0x5c/0xb8
> [ 38.804107] remove_files.isra.1+0x40/0x80
> [ 38.808739] sysfs_remove_group+0x50/0xb0
> [ 38.813282] sysfs_remove_groups+0x3c/0x58
> [ 38.817913] device_remove_attrs+0x64/0x98
> [ 38.822544] device_del+0x16c/0x3f8
> [ 38.826566] netdev_unregister_kobject+0x80/0x90
> [ 38.831722] rollback_registered_many+0x444/0x688
> [ 38.836964] unregister_netdevice_many.part.163+0x20/0xa8
> [ 38.842904] unregister_netdevice_many+0x2c/0x38
> [ 38.848059] rtnl_delete_link+0x5c/0x90
> [ 38.852428] rtnl_dellink+0x158/0x310
> [ 38.856622] rtnetlink_rcv_msg+0x298/0x4d8
> [ 38.861254] netlink_rcv_skb+0x64/0x130
> [ 38.865625] rtnetlink_rcv+0x28/0x38
> [ 38.869733] netlink_unicast+0x1dc/0x288
> [ 38.874190] netlink_sendmsg+0x2b0/0x3e8
> [ 38.878647] ____sys_sendmsg+0x27c/0x2c0
> [ 38.883105] ___sys_sendmsg+0x90/0xd0
> [ 38.887299] __sys_sendmsg+0x7c/0xd0
> [ 38.891407] __arm64_sys_sendmsg+0x2c/0x38
> [ 38.896040] el0_svc_common.constprop.3+0x80/0x1b0
> [ 38.901369] do_el0_svc+0x34/0xa0
> [ 38.905216] el0_sync_handler+0x138/0x198
> [ 38.909760] el0_sync+0x140/0x180
> [ 38.913604]
> [ 38.913604] -> #0 (rtnl_mutex){+.+.}-{4:4}:
> [ 38.919295] check_noncircular+0x154/0x168
> [ 38.923926] __lock_acquire+0x1118/0x15e0
> [ 38.928470] lock_acquire+0x238/0x468
> [ 38.932667] __mutex_lock+0xa4/0x970
> [ 38.936776] mutex_lock_nested+0x54/0x70
> [ 38.941233] rtnl_lock+0x20/0x28
> [ 38.944993] bond_get_stats+0x140/0x1a8
> [ 38.949363] dev_get_stats+0xdc/0xf0
> [ 38.953472] netstat_show.isra.25+0x50/0xa0
> [ 38.958190] collisions_show+0x2c/0x38
> [ 38.962472] dev_attr_show+0x3c/0x78
> [ 38.966580] sysfs_kf_seq_show+0xbc/0x138
> [ 38.971124] kernfs_seq_show+0x44/0x50
> [ 38.975407] seq_read_iter+0xe4/0x450
> [ 38.979601] seq_read+0xf8/0x148
> [ 38.983359] kernfs_fop_read+0x1e0/0x280
> [ 38.987817] vfs_read+0xac/0x1c8
> [ 38.991576] ksys_read+0x74/0xf8
> [ 38.995335] __arm64_sys_read+0x24/0x30
> [ 38.999706] el0_svc_common.constprop.3+0x80/0x1b0
> [ 39.005035] do_el0_svc+0x34/0xa0
> [ 39.008882] el0_sync_handler+0x138/0x198
> [ 39.013426] el0_sync+0x140/0x180
> [ 39.017270]
> [ 39.017270] other info that might help us debug this:
> [ 39.017270]
> [ 39.025300] Possible unsafe locking scenario:
> [ 39.025300]
> [ 39.031236] CPU0 CPU1
> [ 39.035779] ---- ----
> [ 39.040321] lock(kn->active#123);
> [ 39.043826] lock(rtnl_mutex);
> [ 39.049507] lock(kn->active#123);
> [ 39.055540] lock(rtnl_mutex);
> [ 39.058691]
> [ 39.058691] *** DEADLOCK ***
> [ 39.058691]
> [ 39.064630] 3 locks held by cat/602:
> [ 39.068212] #0: ffff533645926f70 (&p->lock){+.+.}-{4:4}, at: seq_read_iter+0x64/0x450
> [ 39.076173] #1: ffff533643bfa090 (&of->mutex){+.+.}-{4:4}, at: kernfs_seq_start+0x30/0xb0
> [ 39.084482] #2: ffff53364222c168 (kn->active#123){++++}-{0:0}, at: kernfs_seq_start+0x38/0xb0
>
> Ok, I admit I don't really understand the message, because struct
> kernfs_node::active is an atomic_t and not really a lock, but
> intuitively I think lockdep is unhappy that the RTNL mutex is not being
> used here as a top-level mutex.
> If somebody were to cat /sys/class/net/bond0/statistics/*, the kernfs
> node locking happens first, then the RTNL mutex is acquired by the
> bonding implementation of ndo_get_stats64.
> If somebody were to delete the bonding device via an "ip link del"
> rtnetlink message, the RTNL mutex would be held first, then the kernfs
> node lock corresponding to the interface's sysfs attributes second.
> I don't think there's really any potential deadlock, only an ordering
> issue between lockdep classes.
>
> How I think this could be solved is if we made the network interface
> lists use some other writer-side protection than the big RTNL mutex.
> So I went on exactly to see how knee-deep I would need to get into that.
>
> There are 2 separate classes of problems:
> - We already have two ways of protecting pure readers: via RCU and via
> the rwlock. It doesn't help if we also add a second way of locking out
> pure writers. We need to first clean up what we have. That's the
> reason why I started the discussion about the rwlock.
> - Some callers appear to not use any sort of protection at all. Does
> this code path look ok to you?
>
> nfnetlink_rcv_batch
> -> NFT_MSG_NEWCHAIN
> -> nf_tables_addchain
> -> nft_chain_parse_hook
> -> nft_chain_parse_netdev
> -> nf_tables_parse_netdev_hooks
> -> nft_netdev_hook_alloc
> -> __dev_get_by_name
> -> netdev_name_node_lookup: must be under RTNL mutex or dev_base_lock protection
>
> Unless I'm missing something, there are just tons, tons of callers of
> __dev_get_by_name which hold neither the RTNL mutex, nor the
> dev_base_lock rwlock, nor RCU read-side critical section.
>
> What I was thinking was that if we could add a separate mutex for the
> network interface lists (and this time make it per netns, and not global
> to the kernel), then we could take that lock a lot more locally. I.e. we
> could have __dev_get_by_name take that lock, and alleviate the need for
> callers to hold the RTNL mutex. That would solve some of these bugs too,
> and would also be more usable for the bonding case.
>
Powered by blists - more mailing lists