Message-ID: <20170608201238.GC2553@templeofstupid.com>
Date: Thu, 8 Jun 2017 13:12:38 -0700
From: Krister Johansen <kjlx@...pleofstupid.com>
To: davem@...emloft.net, eric.dumazet@...il.com
Cc: netdev@...r.kernel.org
Subject: [PATCH net] Fix an intermittent pr_emerg warning about lo becoming
free.
It looks like this:
Message from syslogd@...mingo at Apr 26 00:45:00 ...
kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4
They seem to coincide with net namespace teardown.
The message is emitted by netdev_wait_allrefs().
I forced a kdump in netdev_run_todo(), but found that the refcount on the
lo device was already 0 by the time we reached the panic.
I used bcc to check where netdev_run_todo() was blocking. The only
places where we're off-cpu there are the rcu_barrier() and msleep()
calls.
That behavior is expected. The msleep time coincides with the amount of
time we spend waiting for the refcount to reach zero; the rcu_barrier()
wait times are not excessive.
After looking through the list of callbacks that the netdevice notifiers
invoke in this path, it appears that the dst_dev_event is the most
interesting. The dst_ifdown path places a hold on the loopback_dev as
part of releasing the dev associated with the original dst cache entry.
Most of our notifier callbacks are straightforward, but this one a)
looks complex, and b) places a hold on the network interface in
question.
I constructed a new bcc script that watches various events in the
lifetime of a dst cache entry. Note that dst_ifdown will take a hold on
the loopback device until the invalidated dst entry gets freed.
[ __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183
__dst_free
rcu_nocb_kthread
kthread
ret_from_fork
---
[ dst_gc_task] completed in 4: start: 1282115783057516 end 1282115783062462
[ dst_gc_task] completed in 5: start: 1282116447054101 end 1282116447059392
[ dst_gc_task] completed in 5: start: 1282118111030391 end 1282118111036041
[ dst_gc_task] completed in 6: start: 1282121247074938 end 1282121247081141
[ dst_gc_task] completed in 5: start: 1282126815053252 end 1282126815058751
[ dst_gc_task] completed in 6: start: 1282135007055673 end 1282135007061877
[ dst_gc_task] completed in 6: start: 1282145759021745 end 1282145759027830
[ dst_release] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282147838674257
dst_release
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_send_fin
tcp_close
inet_release
sock_release
sock_close
__fput
____fput
task_work_run
do_exit
do_group_exit
SyS_exit_group
do_syscall_64
return_from_SYSCALL_64
[ dst_ifdown] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282148067377680
dst_ifdown
notifier_call_chain
raw_notifier_call_chain
call_netdevice_notifiers_info
netdev_run_todo
kretprobe_trampoline
default_device_exit_batch
ops_exit_list.isra.4
cleanup_net
process_one_work
worker_thread
kthread
ret_from_fork
---
[ dst_ifdown] on DST: ffff883ccabb7900 IF lo invoked at 1282148067411269
dst_ifdown
notifier_call_chain
raw_notifier_call_chain
call_netdevice_notifiers_info
netdev_run_todo
kretprobe_trampoline
default_device_exit_batch
ops_exit_list.isra.4
cleanup_net
process_one_work
worker_thread
kthread
ret_from_fork
---
<...>
[ dst_destroy] on DST: ffff883ccabb7900 IF lo invoked at 1282160607074022
dst_destroy
kretprobe_trampoline
worker_thread
kthread
ret_from_fork
---
[ dst_gc_task] completed in 30: start: 1282160607047161 end 1282160607077572
The way this works is that if there's still a reference on the dst entry
at the time we try to free it, it gets placed in the gc list by
__dst_free and the dst_destroy() call is invoked by the gc task once the
refcount is 0. If the gc task processes a tenth or less of its entries
on a single pass, it increases the amount of time it waits between gc
intervals.
Looking at the gc_task intervals, they started at 663ms when we invoked
__dst_free(). After that, they increased to 1663, 3136, 5567, 8191,
10751, and 14848. The release that set the refcnt to 0 on our dst entry
occurred after the gc_task was enqueued for a 14-second interval, so we
had to wait longer than the netdev_wait_allrefs() warning time for the
dst entry to get freed and the hold on 'lo' to be released.
A simple solution to this problem is to have dst_dev_event() reset the
gc timer, which causes us to process this list shortly after the
gc_mutex is released when dst_dev_event() completes.
Signed-off-by: Krister Johansen <kjlx@...pleofstupid.com>
---
net/core/dst.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/core/dst.c b/net/core/dst.c
index 6192f11..13ba4a0 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -469,6 +469,20 @@ static int dst_dev_event(struct notifier_block *this, unsigned long event,
spin_lock_bh(&dst_garbage.lock);
dst = dst_garbage.list;
dst_garbage.list = NULL;
+ /* The code in dst_ifdown places a hold on the loopback device.
+ * If the gc entry processing is set to expire after a lengthy
+ * interval, this hold can cause netdev_wait_allrefs() to hang
+ * out and wait for a long time -- until the loopback
+ * interface is released. If we're really unlucky, it'll emit
+ * pr_emerg messages to console too. Reset the interval here,
+ * so dst cleanups occur in a more timely fashion.
+ */
+ if (dst_garbage.timer_inc > DST_GC_INC) {
+ dst_garbage.timer_inc = DST_GC_INC;
+ dst_garbage.timer_expires = DST_GC_MIN;
+ mod_delayed_work(system_wq, &dst_gc_work,
+ dst_garbage.timer_expires);
+ }
spin_unlock_bh(&dst_garbage.lock);
if (last)
--
2.7.4