lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK6E8=ddfOpvr9wqb+3bo+-pQm2XLXuLhGSQk6xk0Tb8-UvHeA@mail.gmail.com>
Date:   Wed, 25 Oct 2017 22:37:26 -0700
From:   Yuchung Cheng <ycheng@...gle.com>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Oleksandr Natalenko <oleksandr@...alenko.name>,
        Roman Gushchin <guro@...com>, netdev <netdev@...r.kernel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        "David S. Miller" <davem@...emloft.net>,
        Lawrence Brakmo <brakmo@...com>
Subject: Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

On Wed, Oct 25, 2017 at 7:07 PM, Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Thu, Sep 28, 2017 at 04:36:58PM -0700, Yuchung Cheng wrote:
> > On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko
> > <oleksandr@...alenko.name> wrote:
> > > Hi.
> > >
> > > Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it
> > > intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix
> > > warning in tcp_fastretrans_alert() for me.
> >
> > Hi Oleksandr: no retrans_collapse should not matter for that warning
> > in tcp_fstretrans_alert(). the warning as I explained earlier is
> > likely false. Neal and I are more concerned the panic in
> > tcp_sacktag_walk. This is just a blind shot but thx for retrying.
> >
> > We can submit a one-liner to remove the fast retrans warning but want
> > to nail the bigger issue first.
>
> we're still seeing the warnings followed by crashes and it's very concerning.
> We hoped that most recent Neal's patches from Sep 18 around this area may
> magically fix the issue, but no. The panics are still there.
> It's confirmed that net.ipv4.tcp_retrans_collapse=0 does not help
> whereas net.ipv4.tcp_recovery=0 works, but obviously undesirable.
> We're out of ideas on how to debug this.
Can you try Eric's latest SACK rb-tree patches?
https://patchwork.ozlabs.org/cover/822218/

Roman's SNMP data suggests MTU probing is enabled. Another blind shot
is to disable it.
>
> Warning + panic:
> [ 7678.309043] Call Trace:
> [ 7678.313925]  <IRQ>
> [ 7678.317955]  dump_stack+0x4d/0x70
> [ 7678.324573]  __warn+0xd3/0xf0
> [ 7678.330497]  warn_slowpath_null+0x1e/0x20
> [ 7678.338510]  tcp_fastretrans_alert+0xacf/0xbd0
> [ 7678.347380]  tcp_ack+0xbce/0x1390
> [ 7678.354006]  tcp_rcv_established+0x1ce/0x740
> [ 7678.362536]  tcp_v6_do_rcv+0x15b/0x400
> [ 7678.370025]  tcp_v6_rcv+0x94c/0x9f0
> [ 7678.376989]  ? tcp_rcv_established+0x42b/0x740
> [ 7678.385866]  ip6_input_finish+0xea/0x430
> [ 7678.393702]  ip6_input+0x32/0xa0
> [ 7678.400155]  ? ip6_rcv_finish+0xa0/0xa0
> [ 7678.407834]  ip6_rcv_finish+0x4b/0xa0
> [ 7678.415164]  ipv6_rcv+0x2ec/0x4f0
> [ 7678.421778]  ? ip6_make_skb+0x1c0/0x1c0
> [ 7678.429447]  __netif_receive_skb_core+0x2d5/0x9a0
> [ 7678.438845]  ? tcp6_gro_receive+0x11d/0x1c0
> [ 7678.447211]  ? ipv6_gro_receive+0x1d0/0x380
> [ 7678.455566]  __netif_receive_skb+0x16/0x70
> [ 7678.463748]  netif_receive_skb_internal+0x32/0x3b0
> [ 7678.473318]  ? dev_gro_receive+0x27c/0x470
> [ 7678.481502]  napi_gro_receive+0x113/0x1d0
> [ 7678.489510]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [ 7678.499077]  mlx5e_poll_rx_cq+0x79/0x830
> [ 7678.506911]  mlx5e_napi_poll+0x8c/0x3d0
> [ 7678.514579]  ? mlx5_cq_completion+0x54/0xb0
> [ 7678.522950]  net_rx_action+0x22e/0x380
> [ 7678.530450]  __do_softirq+0x106/0x2e8
> [ 7678.537768]  irq_exit+0xb0/0xc0
> [ 7678.544042]  do_IRQ+0x4f/0xd0
> [ 7678.549970]  common_interrupt+0x86/0x86
> ...
> [ 7685.690697] BUG: unable to handle kernel
> [ 7685.698737] NULL pointer dereference
> [ 7685.705889]  at 0000000000000035
> [ 7685.712357] IP: tcp_sacktag_walk+0xfd/0x460
> [ 7685.720726] PGD 1002d75067
> [ 7685.726320] PUD 102c506067
> [ 7685.731902] PMD 0
> ...
> [ 7686.070541] task: ffff880ff5959b00 task.stack: ffffc90024708000
> [ 7686.082363] RIP: 0010:tcp_sacktag_walk+0xfd/0x460
> [ 7686.091754] RSP: 0000:ffff88203eec38b0 EFLAGS: 00010217
> [ 7686.102188] RAX: ffff881953620a00 RBX: ffff881009273200 RCX: 00000000fbfa50ec
> [ 7686.116439] RDX: 00000000fbfa54a0 RSI: 0000000000000000 RDI: ffff881009273200
> [ 7686.130688] RBP: ffff88203eec3908 R08: 0000000000000048 R09: 00000000fbfa55fb
> [ 7686.144936] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881ec9ff98c0
> [ 7686.159184] R13: ffff88203eec3a10 R14: 0000000000000000 R15: ffff881ec9ff9a18
> [ 7686.173431] FS:  00007f2b235ff700(0000) GS:ffff88203eec0000(0000) knlGS:0000000000000000
> [ 7686.189589] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7686.201062] CR2: 0000000000000035 CR3: 0000000fd8229000 CR4: 00000000003406e0
> [ 7686.215310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 7686.229560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 7686.243807] Call Trace:
> [ 7686.248696]  <IRQ>
> [ 7686.252727]  tcp_sacktag_write_queue+0x5ff/0x9e0
> [ 7686.261943]  tcp_ack+0x677/0x1390
> [ 7686.268569]  ? select_task_rq_fair+0x9e3/0xf70
> [ 7686.277437]  ? tcp_write_xmit+0x230/0x10a0
> [ 7686.285619]  tcp_rcv_established+0x1ce/0x740
> [ 7686.294151]  tcp_v6_do_rcv+0x15b/0x400
> [ 7686.301636]  tcp_v6_rcv+0x94c/0x9f0
> [ 7686.308606]  ? default_wake_function+0x12/0x20
> [ 7686.317483]  ? __wake_up_common+0x53/0x90
> [ 7686.325492]  ip6_input_finish+0xea/0x430
> [ 7686.333329]  ip6_input+0x32/0xa0
> [ 7686.339791]  ? ip6_rcv_finish+0xa0/0xa0
> [ 7686.347451]  ip6_rcv_finish+0x4b/0xa0
> [ 7686.354767]  ipv6_rcv+0x2ec/0x4f0
> [ 7686.361389]  ? ip6_make_skb+0x1c0/0x1c0
> [ 7686.369050]  __netif_receive_skb_core+0x2d5/0x9a0
> [ 7686.378449]  ? tcp6_gro_receive+0x11d/0x1c0
> [ 7686.386797]  ? ipv6_gro_receive+0x1d0/0x380
> [ 7686.395160]  __netif_receive_skb+0x16/0x70
> [ 7686.403342]  netif_receive_skb_internal+0x32/0x3b0
> [ 7686.412931]  ? dev_gro_receive+0x27c/0x470
> [ 7686.421110]  napi_gro_receive+0x113/0x1d0
> [ 7686.429136]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [ 7686.438722]  mlx5e_poll_rx_cq+0x79/0x830
> [ 7686.446572]  mlx5e_napi_poll+0x8c/0x3d0
> [ 7686.454233]  ? mlx5_cq_completion+0x54/0xb0
> [ 7686.462591]  net_rx_action+0x22e/0x380
> [ 7686.470099]  __do_softirq+0x106/0x2e8
> [ 7686.477431]  irq_exit+0xb0/0xc0
> [ 7686.483700]  do_IRQ+0x4f/0xd0
> [ 7686.489629]  common_interrupt+0x86/0x86
>
> Sometimes soft lockup is also seen:
>
> [2676627.345885] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> [2676627.717771] RIP: 0010:queued_spin_lock_slowpath+0x15c/0x180
> [2676627.729239] RSP: 0018:ffff88103f603e78 EFLAGS: 00000202
> [2676627.740019]  ORIG_RAX: ffffffffffffff10
> [2676627.748027] RAX: 0000000000000101 RBX: ffff8811181d8840 RCX: 0000000000000101
> [2676627.762622] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff8811181d88c8
> [2676627.777217] RBP: ffff88103f603e88 R08: 0000000000000001 R09: ffff88103f95a800
> [2676627.791811] R10: 0000000000037a13 R11: 000000001a37bb20 R12: 0000000000000100
> [2676627.806405] R13: ffff8811181d8c18 R14: ffffffff81857270 R15: ffff8811181d8840
> [2676627.821000] FS:  0000000000000000(0000) GS:ffff88103f600000(0000) knlGS:0000000000000000
> [2676627.837500] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2676627.849318] CR2: 00007f6e44c1f000 CR3: 0000000001e09000 CR4: 00000000003406f0
> [2676627.863916] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [2676627.878508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [2676627.893103] Call Trace:
> [2676627.898338]  <IRQ>
> [2676627.902711]  ? _raw_spin_lock+0x21/0x30
> [2676627.910715]  tcp_write_timer+0x19/0x70
> [2676627.918548]  call_timer_fn+0x35/0x150
> [2676627.926214]  run_timer_softirq+0x1be/0x3e0
> [2676627.934742]  ? timerqueue_add+0x59/0xb0
> [2676627.942745]  ? ktime_get+0x3c/0xb0
> [2676627.949889]  __do_softirq+0x106/0x2e8
> [2676627.957551]  irq_exit+0xb0/0xc0
> [2676627.964174]  smp_apic_timer_interrupt+0x68/0x90
> [2676627.973563]  apic_timer_interrupt+0x86/0x90
> [2676628.084861] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 21s! [swapper/11:0]
> [2676628.084896] RIP: 0010:tcp_match_skb_to_sack+0x2d/0xb0
> [2676628.084897] RSP: 0018:ffff88103f8c38a8 EFLAGS: 00000297
> [2676628.084897]  ORIG_RAX: ffffffffffffff10
> [2676628.084898] RAX: 0000000000000000 RBX: ffff88010f9dbb18 RCX: 000000002d5da350
> [2676628.084899] RDX: 000000002d5da141 RSI: ffff88010f9dbb18 RDI: ffff8811181d8840
> [2676628.084899] RBP: ffff88103f8c3908 R08: 00000000ffffffff R09: 00000000d2a25ebe
> [2676628.084900] R10: 000000000000e323 R11: ffff8811181d8998 R12: ffff8811181d8840
> [2676628.084900] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: ffff8811181d8998
> [2676628.084901] FS:  0000000000000000(0000) GS:ffff88103f8c0000(0000) knlGS:0000000000000000
> [2676628.084902] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2676628.084903] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: 00000000003406e0
> [2676628.084903] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [2676628.084904] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [2676628.084904] Call Trace:
> [2676628.084905]  <IRQ>
> [2676628.084907]  ? tcp_sacktag_walk+0x270/0x460
> [2676628.084909]  tcp_sacktag_write_queue+0x5ff/0x9e0
> [2676628.084911]  tcp_ack+0x677/0x1390
> [2676628.084913]  ? tcp_write_xmit+0x230/0x10a0
> [2676628.084915]  tcp_rcv_established+0x1ce/0x740
> [2676628.084919]  tcp_v6_do_rcv+0x15b/0x400
> [2676628.084921]  tcp_v6_rcv+0x94c/0x9f0
> [2676628.084924]  ? __dev_queue_xmit+0x251/0x650
> [2676628.084928]  ip6_input_finish+0xea/0x430
> [2676628.084930]  ip6_input+0x32/0xa0
> [2676628.084932]  ? ip6_rcv_finish+0xa0/0xa0
> [2676628.084933]  ip6_rcv_finish+0x4b/0xa0
> [2676628.084935]  ipv6_rcv+0x2ec/0x4f0
> [2676628.084937]  ? ip6_make_skb+0x1c0/0x1c0
> [2676628.084938]  __netif_receive_skb_core+0x2d5/0x9a0
> [2676628.084941]  ? tcp6_gro_receive+0x11d/0x1c0
> [2676628.084943]  ? ipv6_gro_receive+0x1d0/0x380
> [2676628.084944]  __netif_receive_skb+0x16/0x70
> [2676628.084945]  netif_receive_skb_internal+0x32/0x3b0
> [2676628.084947]  ? dev_gro_receive+0x27c/0x470
> [2676628.084948]  napi_gro_receive+0x113/0x1d0
> [2676628.084952]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [2676628.084954]  mlx5e_poll_rx_cq+0x79/0x830
> [2676628.084956]  mlx5e_napi_poll+0x8c/0x3d0
> [2676628.084959]  ? mlx5_cq_completion+0x54/0xb0
> [2676628.084960]  net_rx_action+0x22e/0x380
> [2676628.084962]  __do_softirq+0x106/0x2e8
> [2676628.084965]  irq_exit+0xb0/0xc0
> [2676628.084967]  do_IRQ+0x4f/0xd0
> [2676656.083964] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [swapper/11:0]
> [2676656.083992] RIP: 0010:tcp_sacktag_walk+0x25f/0x460
> [2676656.083993] RSP: 0018:ffff88103f8c38b0 EFLAGS: 00000207
> [2676656.083993]  ORIG_RAX: ffffffffffffff10
> [2676656.083994] RAX: 0000000000021500 RBX: ffff8810d84c3600 RCX: 000000002d5d9e41
> [2676656.083994] RDX: 00000000f15de229 RSI: ffff8810d84c0200 RDI: ffff8811181d8840
> [2676656.083995] RBP: ffff88103f8c3908 R08: 000000000000ff21 R09: 0000000000000000
> [2676656.083996] R10: 000000000000e323 R11: ffff8811181d8998 R12: ffff8811181d8840
> [2676656.083996] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: ffff8811181d8998
> [2676656.083997] FS:  0000000000000000(0000) GS:ffff88103f8c0000(0000) knlGS:0000000000000000
> [2676656.083998] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2676656.083998] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: 00000000003406e0
> [2676656.083999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [2676656.083999] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [2676656.084000] Call Trace:
> [2676656.084001]  <IRQ>
> [2676656.084003]  tcp_sacktag_write_queue+0x5ff/0x9e0
> [2676656.084005]  tcp_ack+0x677/0x1390
> [2676656.084007]  ? tcp_write_xmit+0x230/0x10a0
> [2676656.084009]  tcp_rcv_established+0x1ce/0x740
> [2676656.084013]  tcp_v6_do_rcv+0x15b/0x400
> [2676656.084015]  tcp_v6_rcv+0x94c/0x9f0
> [2676656.084018]  ? __dev_queue_xmit+0x251/0x650
> [2676656.084022]  ip6_input_finish+0xea/0x430
> [2676656.084024]  ip6_input+0x32/0xa0
> [2676656.084026]  ? ip6_rcv_finish+0xa0/0xa0
> [2676656.084027]  ip6_rcv_finish+0x4b/0xa0
> [2676656.084029]  ipv6_rcv+0x2ec/0x4f0
> [2676656.084031]  ? ip6_make_skb+0x1c0/0x1c0
> [2676656.084032]  __netif_receive_skb_core+0x2d5/0x9a0
> [2676656.084035]  ? tcp6_gro_receive+0x11d/0x1c0
> [2676656.084037]  ? ipv6_gro_receive+0x1d0/0x380
> [2676656.084038]  __netif_receive_skb+0x16/0x70
> [2676656.084039]  netif_receive_skb_internal+0x32/0x3b0
> [2676656.084041]  ? dev_gro_receive+0x27c/0x470
> [2676656.084042]  napi_gro_receive+0x113/0x1d0
> [2676656.084046]  mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0
> [2676656.084048]  mlx5e_poll_rx_cq+0x79/0x830
> [2676656.084050]  mlx5e_napi_poll+0x8c/0x3d0
>
> Any help greatly appreciated.
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ