[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK6E8=ddfOpvr9wqb+3bo+-pQm2XLXuLhGSQk6xk0Tb8-UvHeA@mail.gmail.com>
Date: Wed, 25 Oct 2017 22:37:26 -0700
From: Yuchung Cheng <ycheng@...gle.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Oleksandr Natalenko <oleksandr@...alenko.name>,
Roman Gushchin <guro@...com>, netdev <netdev@...r.kernel.org>,
Eric Dumazet <edumazet@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Lawrence Brakmo <brakmo@...com>
Subject: Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c
On Wed, Oct 25, 2017 at 7:07 PM, Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Thu, Sep 28, 2017 at 04:36:58PM -0700, Yuchung Cheng wrote:
> > On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko
> > <oleksandr@...alenko.name> wrote:
> > > Hi.
> > >
> > > Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it
> > > intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix
> > > warning in tcp_fastretrans_alert() for me.
> >
> > Hi Oleksandr: no retrans_collapse should not matter for that warning
> > in tcp_fstretrans_alert(). the warning as I explained earlier is
> > likely false. Neal and I are more concerned the panic in
> > tcp_sacktag_walk. This is just a blind shot but thx for retrying.
> >
> > We can submit a one-liner to remove the fast retrans warning but want
> > to nail the bigger issue first.
>
> we're still seeing the warnings followed by crashes and it's very concerning.
> We hoped that most recent Neal's patches from Sep 18 around this area may
> magically fix the issue, but no. The panics are still there.
> It's confirmed that net.ipv4.tcp_retrans_collapse=0 does not help
> whereas net.ipv4.tcp_recovery=0 works, but obviously undesirable.
> We're out of ideas on how to debug this.
Can you try Eric's latest SACK rb-tree patches?
https://patchwork.ozlabs.org/cover/822218/
Roman's SNMP data suggests MTU probing is enabled. Another blind shot
is to disable it.
>
> Warning + panic:
> [ 7678.309043] Call Trace:
> [ 7678.313925] <IRQ>
> [ 7678.317955] dump_stack+0x4d/0x70
> [ 7678.324573] __warn+0xd3/0xf0
> [ 7678.330497] warn_slowpath_null+0x1e/0x20
> [ 7678.338510] tcp_fastretrans_alert+0xacf/0xbd0
> [ 7678.347380] tcp_ack+0xbce/0x1390
> [ 7678.354006] tcp_rcv_established+0x1ce/0x740
> [ 7678.362536] tcp_v6_do_rcv+0x15b/0x400
> [ 7678.370025] tcp_v6_rcv+0x94c/0x9f0
> [ 7678.376989] ? tcp_rcv_established+0x42b/0x740
> [ 7678.385866] ip6_input_finish+0xea/0x430
> [ 7678.393702] ip6_input+0x32/0xa0
> [ 7678.400155] ? ip6_rcv_finish+0xa0/0xa0
> [ 7678.407834] ip6_rcv_finish+0x4b/0xa0
> [ 7678.415164] ipv6_rcv+0x2ec/0x4f0
> [ 7678.421778] ? ip6_make_skb+0x1c0/0x1c0
> [ 7678.429447] __netif_receive_skb_core+0x2d5/0x9a0
> [ 7678.438845] ? tcp6_gro_receive+0x11d/0x1c0
> [ 7678.447211] ? ipv6_gro_receive+0x1d0/0x380
> [ 7678.455566] __netif_receive_skb+0x16/0x70
> [ 7678.463748] netif_receive_skb_internal+0x32/0x3b0
> [ 7678.473318] ? dev_gro_receive+0x27c/0x470
> [ 7678.481502] napi_gro_receive+0x113/0x1d0
> [ 7678.489510] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [ 7678.499077] mlx5e_poll_rx_cq+0x79/0x830
> [ 7678.506911] mlx5e_napi_poll+0x8c/0x3d0
> [ 7678.514579] ? mlx5_cq_completion+0x54/0xb0
> [ 7678.522950] net_rx_action+0x22e/0x380
> [ 7678.530450] __do_softirq+0x106/0x2e8
> [ 7678.537768] irq_exit+0xb0/0xc0
> [ 7678.544042] do_IRQ+0x4f/0xd0
> [ 7678.549970] common_interrupt+0x86/0x86
> ...
> [ 7685.690697] BUG: unable to handle kernel
> [ 7685.698737] NULL pointer dereference
> [ 7685.705889] at 0000000000000035
> [ 7685.712357] IP: tcp_sacktag_walk+0xfd/0x460
> [ 7685.720726] PGD 1002d75067
> [ 7685.726320] PUD 102c506067
> [ 7685.731902] PMD 0
> ...
> [ 7686.070541] task: ffff880ff5959b00 task.stack: ffffc90024708000
> [ 7686.082363] RIP: 0010:tcp_sacktag_walk+0xfd/0x460
> [ 7686.091754] RSP: 0000:ffff88203eec38b0 EFLAGS: 00010217
> [ 7686.102188] RAX: ffff881953620a00 RBX: ffff881009273200 RCX: 00000000fbfa50ec
> [ 7686.116439] RDX: 00000000fbfa54a0 RSI: 0000000000000000 RDI: ffff881009273200
> [ 7686.130688] RBP: ffff88203eec3908 R08: 0000000000000048 R09: 00000000fbfa55fb
> [ 7686.144936] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881ec9ff98c0
> [ 7686.159184] R13: ffff88203eec3a10 R14: 0000000000000000 R15: ffff881ec9ff9a18
> [ 7686.173431] FS: 00007f2b235ff700(0000) GS:ffff88203eec0000(0000) knlGS:0000000000000000
> [ 7686.189589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7686.201062] CR2: 0000000000000035 CR3: 0000000fd8229000 CR4: 00000000003406e0
> [ 7686.215310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7686.229560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 7686.243807] Call Trace:
> [ 7686.248696] <IRQ>
> [ 7686.252727] tcp_sacktag_write_queue+0x5ff/0x9e0
> [ 7686.261943] tcp_ack+0x677/0x1390
> [ 7686.268569] ? select_task_rq_fair+0x9e3/0xf70
> [ 7686.277437] ? tcp_write_xmit+0x230/0x10a0
> [ 7686.285619] tcp_rcv_established+0x1ce/0x740
> [ 7686.294151] tcp_v6_do_rcv+0x15b/0x400
> [ 7686.301636] tcp_v6_rcv+0x94c/0x9f0
> [ 7686.308606] ? default_wake_function+0x12/0x20
> [ 7686.317483] ? __wake_up_common+0x53/0x90
> [ 7686.325492] ip6_input_finish+0xea/0x430
> [ 7686.333329] ip6_input+0x32/0xa0
> [ 7686.339791] ? ip6_rcv_finish+0xa0/0xa0
> [ 7686.347451] ip6_rcv_finish+0x4b/0xa0
> [ 7686.354767] ipv6_rcv+0x2ec/0x4f0
> [ 7686.361389] ? ip6_make_skb+0x1c0/0x1c0
> [ 7686.369050] __netif_receive_skb_core+0x2d5/0x9a0
> [ 7686.378449] ? tcp6_gro_receive+0x11d/0x1c0
> [ 7686.386797] ? ipv6_gro_receive+0x1d0/0x380
> [ 7686.395160] __netif_receive_skb+0x16/0x70
> [ 7686.403342] netif_receive_skb_internal+0x32/0x3b0
> [ 7686.412931] ? dev_gro_receive+0x27c/0x470
> [ 7686.421110] napi_gro_receive+0x113/0x1d0
> [ 7686.429136] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [ 7686.438722] mlx5e_poll_rx_cq+0x79/0x830
> [ 7686.446572] mlx5e_napi_poll+0x8c/0x3d0
> [ 7686.454233] ? mlx5_cq_completion+0x54/0xb0
> [ 7686.462591] net_rx_action+0x22e/0x380
> [ 7686.470099] __do_softirq+0x106/0x2e8
> [ 7686.477431] irq_exit+0xb0/0xc0
> [ 7686.483700] do_IRQ+0x4f/0xd0
> [ 7686.489629] common_interrupt+0x86/0x86
>
> Sometimes soft lockup is also seen:
>
> [2676627.345885] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> [2676627.717771] RIP: 0010:queued_spin_lock_slowpath+0x15c/0x180
> [2676627.729239] RSP: 0018:ffff88103f603e78 EFLAGS: 00000202
> [2676627.740019] ORIG_RAX: ffffffffffffff10
> [2676627.748027] RAX: 0000000000000101 RBX: ffff8811181d8840 RCX: 0000000000000101
> [2676627.762622] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8811181d88c8
> [2676627.777217] RBP: ffff88103f603e88 R08: 0000000000000001 R09: ffff88103f95a800
> [2676627.791811] R10: 0000000000037a13 R11: 000000001a37bb20 R12: 0000000000000100
> [2676627.806405] R13: ffff8811181d8c18 R14: ffffffff81857270 R15: ffff8811181d8840
> [2676627.821000] FS: 0000000000000000(0000) GS:ffff88103f600000(0000) knlGS:0000000000000000
> [2676627.837500] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2676627.849318] CR2: 00007f6e44c1f000 CR3: 0000000001e09000 CR4: 00000000003406f0
> [2676627.863916] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [2676627.878508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [2676627.893103] Call Trace:
> [2676627.898338] <IRQ>
> [2676627.902711] ? _raw_spin_lock+0x21/0x30
> [2676627.910715] tcp_write_timer+0x19/0x70
> [2676627.918548] call_timer_fn+0x35/0x150
> [2676627.926214] run_timer_softirq+0x1be/0x3e0
> [2676627.934742] ? timerqueue_add+0x59/0xb0
> [2676627.942745] ? ktime_get+0x3c/0xb0
> [2676627.949889] __do_softirq+0x106/0x2e8
> [2676627.957551] irq_exit+0xb0/0xc0
> [2676627.964174] smp_apic_timer_interrupt+0x68/0x90
> [2676627.973563] apic_timer_interrupt+0x86/0x90
> [2676628.084861] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [swapper/11:0]
> [2676628.084896] RIP: 0010:tcp_match_skb_to_sack+0x2d/0xb0
> [2676628.084897] RSP: 0018:ffff88103f8c38a8 EFLAGS: 00000297
> [2676628.084897] ORIG_RAX: ffffffffffffff10
> [2676628.084898] RAX: 0000000000000000 RBX: ffff88010f9dbb18 RCX: 000000002d5da350
> [2676628.084899] RDX: 000000002d5da141 RSI: ffff88010f9dbb18 RDI: ffff8811181d8840
> [2676628.084899] RBP: ffff88103f8c3908 R08: 00000000ffffffff R09: 00000000d2a25ebe
> [2676628.084900] R10: 000000000000e323 R11: ffff8811181d8998 R12: ffff8811181d8840
> [2676628.084900] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: ffff8811181d8998
> [2676628.084901] FS: 0000000000000000(0000) GS:ffff88103f8c0000(0000) knlGS:0000000000000000
> [2676628.084902] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2676628.084903] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: 00000000003406e0
> [2676628.084903] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [2676628.084904] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [2676628.084904] Call Trace:
> [2676628.084905] <IRQ>
> [2676628.084907] ? tcp_sacktag_walk+0x270/0x460
> [2676628.084909] tcp_sacktag_write_queue+0x5ff/0x9e0
> [2676628.084911] tcp_ack+0x677/0x1390
> [2676628.084913] ? tcp_write_xmit+0x230/0x10a0
> [2676628.084915] tcp_rcv_established+0x1ce/0x740
> [2676628.084919] tcp_v6_do_rcv+0x15b/0x400
> [2676628.084921] tcp_v6_rcv+0x94c/0x9f0
> [2676628.084924] ? __dev_queue_xmit+0x251/0x650
> [2676628.084928] ip6_input_finish+0xea/0x430
> [2676628.084930] ip6_input+0x32/0xa0
> [2676628.084932] ? ip6_rcv_finish+0xa0/0xa0
> [2676628.084933] ip6_rcv_finish+0x4b/0xa0
> [2676628.084935] ipv6_rcv+0x2ec/0x4f0
> [2676628.084937] ? ip6_make_skb+0x1c0/0x1c0
> [2676628.084938] __netif_receive_skb_core+0x2d5/0x9a0
> [2676628.084941] ? tcp6_gro_receive+0x11d/0x1c0
> [2676628.084943] ? ipv6_gro_receive+0x1d0/0x380
> [2676628.084944] __netif_receive_skb+0x16/0x70
> [2676628.084945] netif_receive_skb_internal+0x32/0x3b0
> [2676628.084947] ? dev_gro_receive+0x27c/0x470
> [2676628.084948] napi_gro_receive+0x113/0x1d0
> [2676628.084952] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [2676628.084954] mlx5e_poll_rx_cq+0x79/0x830
> [2676628.084956] mlx5e_napi_poll+0x8c/0x3d0
> [2676628.084959] ? mlx5_cq_completion+0x54/0xb0
> [2676628.084960] net_rx_action+0x22e/0x380
> [2676628.084962] __do_softirq+0x106/0x2e8
> [2676628.084965] irq_exit+0xb0/0xc0
> [2676628.084967] do_IRQ+0x4f/0xd0
> [2676656.083964] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [swapper/11:0]
> [2676656.083992] RIP: 0010:tcp_sacktag_walk+0x25f/0x460
> [2676656.083993] RSP: 0018:ffff88103f8c38b0 EFLAGS: 00000207
> [2676656.083993] ORIG_RAX: ffffffffffffff10
> [2676656.083994] RAX: 0000000000021500 RBX: ffff8810d84c3600 RCX: 000000002d5d9e41
> [2676656.083994] RDX: 00000000f15de229 RSI: ffff8810d84c0200 RDI: ffff8811181d8840
> [2676656.083995] RBP: ffff88103f8c3908 R08: 000000000000ff21 R09: 0000000000000000
> [2676656.083996] R10: 000000000000e323 R11: ffff8811181d8998 R12: ffff8811181d8840
> [2676656.083996] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: ffff8811181d8998
> [2676656.083997] FS: 0000000000000000(0000) GS:ffff88103f8c0000(0000) knlGS:0000000000000000
> [2676656.083998] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2676656.083998] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: 00000000003406e0
> [2676656.083999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [2676656.083999] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [2676656.084000] Call Trace:
> [2676656.084001] <IRQ>
> [2676656.084003] tcp_sacktag_write_queue+0x5ff/0x9e0
> [2676656.084005] tcp_ack+0x677/0x1390
> [2676656.084007] ? tcp_write_xmit+0x230/0x10a0
> [2676656.084009] tcp_rcv_established+0x1ce/0x740
> [2676656.084013] tcp_v6_do_rcv+0x15b/0x400
> [2676656.084015] tcp_v6_rcv+0x94c/0x9f0
> [2676656.084018] ? __dev_queue_xmit+0x251/0x650
> [2676656.084022] ip6_input_finish+0xea/0x430
> [2676656.084024] ip6_input+0x32/0xa0
> [2676656.084026] ? ip6_rcv_finish+0xa0/0xa0
> [2676656.084027] ip6_rcv_finish+0x4b/0xa0
> [2676656.084029] ipv6_rcv+0x2ec/0x4f0
> [2676656.084031] ? ip6_make_skb+0x1c0/0x1c0
> [2676656.084032] __netif_receive_skb_core+0x2d5/0x9a0
> [2676656.084035] ? tcp6_gro_receive+0x11d/0x1c0
> [2676656.084037] ? ipv6_gro_receive+0x1d0/0x380
> [2676656.084038] __netif_receive_skb+0x16/0x70
> [2676656.084039] netif_receive_skb_internal+0x32/0x3b0
> [2676656.084041] ? dev_gro_receive+0x27c/0x470
> [2676656.084042] napi_gro_receive+0x113/0x1d0
> [2676656.084046] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [2676656.084048] mlx5e_poll_rx_cq+0x79/0x830
> [2676656.084050] mlx5e_napi_poll+0x8c/0x3d0
>
> Any help greatly appreciated.
>
Powered by blists - more mailing lists