Date:   Fri,  1 Nov 2019 15:16:05 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     netdev@...r.kernel.org
Cc:     Cong Wang <xiyou.wangcong@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Eric Dumazet <edumazet@...gle.com>
Subject: [RFC Patch] tcp: make icsk_retransmit_timer pinned

While investigating the spinlock contention on resetting the TCP
retransmit timer:

  61.72%    61.71%  swapper          [kernel.kallsyms]                        [k] queued_spin_lock_slowpath
   ...
    - 58.83% tcp_v4_rcv
      - 58.80% tcp_v4_do_rcv
         - 58.80% tcp_rcv_established
            - 52.88% __tcp_push_pending_frames
               - 52.88% tcp_write_xmit
                  - 28.16% tcp_event_new_data_sent
                     - 28.15% sk_reset_timer
                        + mod_timer
                  - 24.68% tcp_schedule_loss_probe
                     - 24.68% sk_reset_timer
                        + 24.68% mod_timer

it turns out to be a serious timer migration issue. After collecting
timer_start trace events for tcp_write_timer, I found that more than 77%
of the time this timer got migrated to a different CPU:

	$ perl -ne 'if (/\[(\d+)\].* cpu=(\d+)/){print if $1 != $2 ;}' tcp_timer_trace.txt | wc -l
	1303826
	$ wc -l tcp_timer_trace.txt
	1681068 tcp_timer_trace.txt
	$ python
	Python 2.7.5 (default, Jul 11 2019, 17:13:53)
	[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> 1303826 / 1681068.0
	0.7755938486723916
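
(For reference, every sk_reset_timer() call in the profile above is just
a thin wrapper around mod_timer(), so each re-arm of the retransmit timer
goes through the timer core's target-CPU selection and takes the possibly
remote timer base lock. Roughly what net/core/sock.c does; exact details
may differ slightly across kernel versions:

void sk_reset_timer(struct sock *sk, struct timer_list *timer,
		    unsigned long expires)
{
	/* Take an extra socket reference only if the timer was not
	 * already pending; mod_timer() is where the timer base lock
	 * shows up in the profile above.
	 */
	if (!mod_timer(timer, expires))
		sock_hold(sk);
}
)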

And all of those migrations happened while an otherwise idle CPU was
serving a network RX softirq, so the idleness test in idle_cpu() yields
a false positive here. I don't know whether we should relax it
specifically for this scenario, something like:

-	if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
+	if ((!idle_cpu(cpu) || in_serving_softirq()) &&
+	    housekeeping_cpu(cpu, HK_FLAG_TIMER))
 		return cpu;

(There could be a better way than in_serving_softirq() to measure the
idleness, of course.)
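
For context, that check lives in get_nohz_timer_target() in
kernel/sched/core.c. A simplified sketch of its logic as of around this
kernel version (paraphrased from memory, details may differ), to show
why a softirq-serving idle CPU pushes the timer elsewhere:

int get_nohz_timer_target(void)
{
	int i, cpu = smp_processor_id();
	struct sched_domain *sd;

	/* The check in question: if the local CPU looks busy and is a
	 * housekeeping CPU, keep the timer here...
	 */
	if (!idle_cpu(cpu) && housekeeping_cpu(cpu, HK_FLAG_TIMER))
		return cpu;

	/* ...otherwise hunt for a non-idle housekeeping CPU. This is the
	 * path that migrates the timer when idle_cpu() reports the
	 * softirq-serving CPU as idle.
	 */
	rcu_read_lock();
	for_each_domain(cpu, sd) {
		for_each_cpu(i, sched_domain_span(sd)) {
			if (cpu == i)
				continue;
			if (!idle_cpu(i) &&
			    housekeeping_cpu(i, HK_FLAG_TIMER)) {
				cpu = i;
				goto unlock;
			}
		}
	}

	if (!housekeeping_cpu(cpu, HK_FLAG_TIMER))
		cpu = housekeeping_any_cpu(HK_FLAG_TIMER);
unlock:
	rcu_read_unlock();
	return cpu;
}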

Or we could simply make the TCP retransmit timer pinned, as the patch
below does. At least this approach has minimal impact.
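
With TIMER_PINNED set, the timer core skips the NOHZ target selection
entirely when the timer is re-armed. Roughly, paraphrasing
get_target_base() in kernel/time/timer.c (helper names and details may
vary by kernel version):

static inline struct timer_base *
get_target_base(struct timer_base *base, unsigned tflags)
{
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
	/* Only non-pinned timers consult get_nohz_timer_target(); a
	 * pinned timer always stays on the local CPU's base, so the
	 * retransmit timer is re-armed under the local base lock.
	 */
	if (static_branch_likely(&timers_migration_enabled) &&
	    !(tflags & TIMER_PINNED))
		return get_timer_cpu_base(tflags, get_nohz_timer_target());
#endif
	return get_timer_this_cpu_base(tflags);
}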

Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Eric Dumazet <edumazet@...gle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
---
 net/ipv4/inet_connection_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index eb30fc1770de..de5510ddb1c8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -507,7 +507,7 @@ void inet_csk_init_xmit_timers(struct sock *sk,
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, 0);
+	timer_setup(&icsk->icsk_retransmit_timer, retransmit_handler, TIMER_PINNED);
 	timer_setup(&icsk->icsk_delack_timer, delack_handler, 0);
 	timer_setup(&sk->sk_timer, keepalive_handler, 0);
 	icsk->icsk_pending = icsk->icsk_ack.pending = 0;
-- 
2.21.0
