linux-kernel - Re: Introduction Email and RCU CPU Stalls

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200428235207.GA22595@paulmck-ThinkPad-P72>
Date:   Tue, 28 Apr 2020 16:52:07 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Sal Carrasco <scarrasco@...ode.com>
Cc:     linux-kernel@...r.kernel.org
Subject: Re: Introduction Email and RCU CPU Stalls

On Tue, Apr 28, 2020 at 04:50:48PM -0700, Paul E. McKenney wrote:
> On Tue, Apr 28, 2020 at 06:21:30PM -0400, Sal Carrasco wrote:
> > Hi Paul,
> > 
> > As for the stalls, we've seen them either be self-detected or detected by
> > another CPU on the system. We've also seen stalls in various
> > situations where we remove ebtables out of the equation which I think
> > validates your theory that get_counters() is an innocent bystander. I've
> > added a different trace from the same server which completely locks up once
> > we hit this issue. As for using BPF or some other observability tool,
> > unfortunately the system locks up completely so we are unable to access it
> > to perform any type of analysis. This brings up my next question, in your
> > experience have you seen a situation like this before where seemingly 1 cpu
> > stalls and then the system completely locks up after that?
> > 
> > - Sal
> 
> This one looks like IOMMU lock contention.  There was a patch set by
> Shaohua Li intended to address this, but I don't see any sign that this
> patch was accepted.  It might be worthwhile trying Shaohua's patch
> series and seeing if it helps.  I would suggest talking to Shaohua,
> but https://lwn.net/Articles/775861/...

And this time with a URL for that patch series...

https://lists.linuxfoundation.org/pipermail/iommu/2016-January/015337.html

							Thanx, Paul

> > ========== New stall traces ==========
> > 
> > [Thu Apr 23 17:00:17 2020] br0: port 81(tap.20217676_0) entered blocking
> > state
> > [Thu Apr 23 17:00:17 2020] br0: port 81(tap.20217676_0) entered forwarding
> > state
> > [Thu Apr 23 17:01:42 2020] hrtimer: interrupt took 9631 ns
> > [Thu Apr 23 19:04:25 2020] ------------[ cut here ]------------
> > [Thu Apr 23 19:04:25 2020] NETDEV WATCHDOG: eth1 (mlx5_core): transmit
> > queue 41 timed out
> > [Thu Apr 23 19:04:25 2020] WARNING: CPU: 76 PID: 0 at
> > net/sched/sch_generic.c:443 dev_watchdog+0x22f/0x240
> > [Thu Apr 23 19:04:25 2020] Modules linked in: vhost_net vhost cls_u32
> > sch_ingress act_police nft_objref nft_counter nft_limit nf_tables_set
> > xt_multiport xt_set ip_set nf_conntrack_netlink xt_CT xt_conntrack tun
> > nft_ct nf_tables nf_conntrack_bridge nf_conntrack nf_defrag_ipv6
> > nf_defrag_ipv4 8021q mrp bonding amd64_edac_mod kvm_amd kvm irqbypass
> > crct10dif_pclmul crc32_pclmul mlx5_core crc32c_intel evdev fuse autofs4
> > [Thu Apr 23 19:04:25 2020] CPU: 76 PID: 0 Comm: swapper/76 Not tainted
> > 5.5.9-1 #1
> > [Thu Apr 23 19:04:25 2020] Hardware name: Supermicro Super
> > Server/H11DSi-NT, BIOS 1.1 04/13/2018
> > [Thu Apr 23 19:04:25 2020] RIP: 0010:dev_watchdog+0x22f/0x240
> > [Thu Apr 23 19:04:25 2020] Code: c0 75 e4 eb 98 4c 89 ef c6 05 f3 19 eb 00
> > 01 e8 17 7a fb ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 48 d3 61 8e 31 c0 e8 a1
> > ec 5b ff <0f> 0b e9 75 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> > [Thu Apr 23 19:04:25 2020] RSP: 0018:ffffb4b30d658ea0 EFLAGS: 00010286
> > [Thu Apr 23 19:04:25 2020] RAX: 000000000000003e RBX: 0000000000000029 RCX:
> > 0000000000000000
> > [Thu Apr 23 19:04:25 2020] RDX: ffff8f252fd27480 RSI: ffff8f252fd18a18 RDI:
> > ffff8f252fd18a18
> > [Thu Apr 23 19:04:25 2020] RBP: ffff8f3521800440 R08: 0000000000000000 R09:
> > 0000000000000b09
> > [Thu Apr 23 19:04:25 2020] R10: 000006ff5a468c40 R11: 0000000000000000 R12:
> > 00000000000001f8
> > [Thu Apr 23 19:04:25 2020] R13: ffff8f3521800000 R14: 000000010070d200 R15:
> > 0000000000000000
> > [Thu Apr 23 19:04:25 2020] FS:  0000000000000000(0000)
> > GS:ffff8f252fd00000(0000) knlGS:0000000000000000
> > [Thu Apr 23 19:04:25 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [Thu Apr 23 19:04:25 2020] CR2: 00007f652ae64e98 CR3: 00000073ef80a000 CR4:
> > 00000000003406e0
> > [Thu Apr 23 19:04:25 2020] Call Trace:
> > [Thu Apr 23 19:04:25 2020]  <IRQ>
> > [Thu Apr 23 19:04:25 2020]  ? dev_deactivate_queue.constprop.60+0x90/0x90
> > [Thu Apr 23 19:04:25 2020]  call_timer_fn+0x2d/0x140
> > [Thu Apr 23 19:04:25 2020]  run_timer_softirq+0x16a/0x1e0
> > [Thu Apr 23 19:04:25 2020]  ? timerqueue_add+0x5e/0x70
> > [Thu Apr 23 19:04:25 2020]  ? enqueue_hrtimer+0x3a/0x90
> > [Thu Apr 23 19:04:25 2020]  ? __hrtimer_run_queues+0x11c/0x260
> > [Thu Apr 23 19:04:25 2020]  __do_softirq+0x117/0x2d0
> > [Thu Apr 23 19:04:25 2020]  irq_exit+0x5a/0x60
> > [Thu Apr 23 19:04:25 2020]  smp_apic_timer_interrupt+0x74/0x140
> > [Thu Apr 23 19:04:25 2020]  apic_timer_interrupt+0xf/0x20
> > [Thu Apr 23 19:04:25 2020]  </IRQ>
> > [Thu Apr 23 19:04:25 2020] RIP: 0010:cpuidle_enter_state+0xc5/0x3d0
> > [Thu Apr 23 19:04:25 2020] Code: c7 0f 1f 44 00 00 31 ff e8 d8 f1 69 ff 80
> > 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 db 02 00 00 31 ff e8 bf 5d 6f ff fb
> > 45 85 f6 <0f> 88 8e 02 00 00 49 63 d6 4c 2b 7c 24 10 48 8d 04 52 48 8d 04 82
> > [Thu Apr 23 19:04:25 2020] RSP: 0018:ffffb4b30c76fe90 EFLAGS: 00000202
> > ORIG_RAX: ffffffffffffff13
> > [Thu Apr 23 19:04:25 2020] RAX: ffff8f252fd2a0c0 RBX: ffff8f152781b000 RCX:
> > 000000000000001f
> > [Thu Apr 23 19:04:25 2020] RDX: 0000000000000000 RSI: 00000fa78ce9df82 RDI:
> > 0000000000000000
> > [Thu Apr 23 19:04:25 2020] RBP: ffffffff8e8fdc00 R08: ffffffe1dd14beda R09:
> > 000001d1a94a2000
> > [Thu Apr 23 19:04:25 2020] R10: 0000000000000140 R11: 0000000000000032 R12:
> > 000000000000004c
> > [Thu Apr 23 19:04:25 2020] R13: 000000000000004c R14: 0000000000000002 R15:
> > 000006ff8cdebd0d
> > [Thu Apr 23 19:04:25 2020]  cpuidle_enter+0x2f/0x40
> > [Thu Apr 23 19:04:25 2020]  do_idle+0x10c/0x1d0
> > [Thu Apr 23 19:04:25 2020]  cpu_startup_entry+0x19/0x20
> > [Thu Apr 23 19:04:25 2020]  secondary_startup_64+0xa4/0xb0
> > [Thu Apr 23 19:04:25 2020] ---[ end trace 51079cc92cafa850 ]---
> > [Thu Apr 23 19:04:25 2020] mlx5_core 0000:13:00.1 eth1: TX timeout detected
> > [Thu Apr 23 19:04:25 2020] mlx5_core 0000:13:00.1 eth1: TX timeout on
> > queue: 41, SQ: 0xb24, CQ: 0xda2, SQ Cons: 0x4b6 SQ Prod: 0x4bd, usecs since
> > last trans: 19370000
> > [Thu Apr 23 19:04:25 2020] mlx5_core 0000:13:00.1 eth1: EQ 0x30: Cons =
> > 0x3b055, irqn = 0x136
> > [Thu Apr 23 19:04:44 2020] rcu: INFO: rcu_sched detected stalls on
> > CPUs/tasks:
> > [Thu Apr 23 19:04:44 2020] rcu: 107-....: (1 GPs behind)
> > idle=c66/1/0x4000000000000002 softirq=7423822/7423854 fqs=11724
> > [Thu Apr 23 19:04:44 2020] (detected by 44, t=60015 jiffies, g=796105,
> > q=3217349)
> > [Thu Apr 23 19:04:44 2020] Sending NMI from CPU 44 to CPUs 107:
> > [Thu Apr 23 19:04:44 2020] NMI backtrace for cpu 107
> > [Thu Apr 23 19:04:44 2020] CPU: 107 PID: 29507 Comm: qemu-system-x86
> > Tainted: G        W         5.5.9-1 #1
> > [Thu Apr 23 19:04:44 2020] Hardware name: Supermicro Super
> > Server/H11DSi-NT, BIOS 1.1 04/13/2018
> > [Thu Apr 23 19:04:44 2020] RIP: 0010:queued_spin_lock_slowpath+0x118/0x1c0
> > [Thu Apr 23 19:04:44 2020] Code: c1 e8 12 48 c1 ee 0c 83 e8 01 83 e6 30 48
> > 98 48 81 c6 80 ad 02 00 48 03 34 c5 80 b7 62 8e 48 89 16 8b 42 08 85 c0 75
> > 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 0d 0e eb 02 f3 90
> > [Thu Apr 23 19:04:44 2020] RSP: 0018:ffffb4b30dbace18 EFLAGS: 00000046
> > [Thu Apr 23 19:04:44 2020] RAX: 0000000000000000 RBX: 0000000000000086 RCX:
> > 0000000001b00000
> > [Thu Apr 23 19:04:44 2020] RDX: ffff8f652fcead80 RSI: ffff8f252fcaad80 RDI:
> > ffff8f652ca76808
> > [Thu Apr 23 19:04:44 2020] RBP: 0007fffffffdf2c0 R08: 0000000001b00000 R09:
> > ffffffff8d726d00
> > [Thu Apr 23 19:04:44 2020] R10: ffff8f415a03b2c0 R11: 0000000000000001 R12:
> > ffff8f652ca76808
> > [Thu Apr 23 19:04:44 2020] R13: 0000000000087bdb R14: 0000000000000048 R15:
> > 0000000000000286
> > [Thu Apr 23 19:04:44 2020] FS:  00007f4749ed6f00(0000)
> > GS:ffff8f652fcc0000(0000) knlGS:ffff9d49afa40000
> > [Thu Apr 23 19:04:44 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [Thu Apr 23 19:04:44 2020] CR2: 000055bac3c0c5d0 CR3: 0000007fb7de6000 CR4:
> > 00000000003406e0
> > [Thu Apr 23 19:04:44 2020] Call Trace:
> > [Thu Apr 23 19:04:44 2020]  <IRQ>
> > [Thu Apr 23 19:04:44 2020]  _raw_spin_lock_irqsave+0x27/0x30
> > [Thu Apr 23 19:04:44 2020]  find_iova+0x14/0x40
> > [Thu Apr 23 19:04:44 2020]  free_iova+0xe/0x30
> > [Thu Apr 23 19:04:44 2020]  fq_ring_free+0x99/0xd0
> > [Thu Apr 23 19:04:44 2020]  fq_flush_timeout+0x61/0x90
> > [Thu Apr 23 19:04:44 2020]  ? fq_ring_free+0xd0/0xd0
> > [Thu Apr 23 19:04:44 2020]  call_timer_fn+0x2d/0x140
> > [Thu Apr 23 19:04:44 2020]  run_timer_softirq+0x16a/0x1e0
> > [Thu Apr 23 19:04:44 2020]  ? timerqueue_add+0x5e/0x70
> > [Thu Apr 23 19:04:44 2020]  ? enqueue_hrtimer+0x3a/0x90
> > [Thu Apr 23 19:04:44 2020]  ? __hrtimer_run_queues+0x11c/0x260
> > [Thu Apr 23 19:04:44 2020]  __do_softirq+0x117/0x2d0
> > [Thu Apr 23 19:04:44 2020]  irq_exit+0x5a/0x60
> > [Thu Apr 23 19:04:44 2020]  smp_apic_timer_interrupt+0x74/0x140
> > [Thu Apr 23 19:04:44 2020]  apic_timer_interrupt+0xf/0x20
> > [Thu Apr 23 19:04:44 2020]  </IRQ>
> > [Thu Apr 23 19:04:44 2020] RIP: 0010:kvm_arch_vcpu_ioctl_run+0xbea/0x1c80
> > [kvm]
> > [Thu Apr 23 19:04:44 2020] Code: 24 00 00 48 8b 05 ce 01 07 00 4c 89 ef 48
> > 8b 80 d0 02 00 00 e8 a7 c0 ad cd 65 4c 89 2d 57 65 b0 3f fb 49 83 85 d8 00
> > 00 00 01 <fa> 65 48 8b 04 25 00 7d 01 00 65 48 c7 05 38 65 b0 3f 00 00 00 00
> > [Thu Apr 23 19:04:44 2020] RSP: 0018:ffffb4b36d747d20 EFLAGS: 00000206
> > ORIG_RAX: ffffffffffffff13
> > [Thu Apr 23 19:04:44 2020] RAX: ffffffffc0750850 RBX: 00000f926c809262 RCX:
> > ffff8f84b6c98000
> > [Thu Apr 23 19:04:44 2020] RDX: 00000f9200000000 RSI: ffffff69cb05ade0 RDI:
> > ffff8f84b6c98000
> > [Thu Apr 23 19:04:44 2020] RBP: ffffb4b36d747dc0 R08: 0000000000000010 R09:
> > 0000000000000010
> > [Thu Apr 23 19:04:44 2020] R10: ffffb4b36d747d20 R11: 0000000000000000 R12:
> > 0000000000000001
> > [Thu Apr 23 19:04:44 2020] R13: ffff8f84b6c98000 R14: 0000000000000001 R15:
> > ffff8f84b6c98038
> > [Thu Apr 23 19:04:44 2020]  ? svm_pku_supported+0x10/0x10 [kvm_amd]
> > [Thu Apr 23 19:04:44 2020]  ? kvm_vcpu_ioctl+0x405/0x590 [kvm]
> > [Thu Apr 23 19:04:44 2020]  kvm_vcpu_ioctl+0x405/0x590 [kvm]
> > [Thu Apr 23 19:04:44 2020]  ? __handle_mm_fault+0xae6/0x1340
> > [Thu Apr 23 19:04:44 2020]  ? __fpu__restore_sig+0x17f/0x5a0
> > [Thu Apr 23 19:04:44 2020]  do_vfs_ioctl+0x8e/0x620
> > [Thu Apr 23 19:04:44 2020]  ? __x64_sys_futex+0x7c/0x170
> > [Thu Apr 23 19:04:44 2020]  ksys_ioctl+0x5e/0x90
> > [Thu Apr 23 19:04:44 2020]  __x64_sys_ioctl+0x16/0x20
> > [Thu Apr 23 19:04:44 2020]  do_syscall_64+0x54/0x280
> > [Thu Apr 23 19:04:44 2020]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [Thu Apr 23 19:04:44 2020] RIP: 0033:0x7f64d84951c7
> > [Thu Apr 23 19:04:44 2020] Code: 44 00 00 48 8b 05 c1 3c 2c 00 64 c7 00 26
> > 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00
> > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 3c 2c 00 f7 d8 64 89 01 48
> > [Thu Apr 23 19:04:44 2020] RSP: 002b:00007f64cb467448 EFLAGS: 00000246
> > ORIG_RAX: 0000000000000010
> > [Thu Apr 23 19:04:44 2020] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> > 00007f64d84951c7
> > [Thu Apr 23 19:04:44 2020] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> > 000000000000001a
> > [Thu Apr 23 19:04:44 2020] RBP: 0000559004f19980 R08: 0000559004dbf3a0 R09:
> > 0000559004f19980
> > [Thu Apr 23 19:04:44 2020] R10: 0000000000000005 R11: 0000000000000246 R12:
> > 0000000000000000
> > [Thu Apr 23 19:04:44 2020] R13: 0000000000000001 R14: 00007f64db764000 R15:
> > 0000559004f19980
> > ______________________________________
> > *Salvador Carrasco*  |  Manager of Systems Operations  |  Linode
> > w: linode.com  |  e: scarrasco@...ode.com
> > p: 609-380-7100 ext.1310 <(609)%20380-7100>
> > 
> > 
> > On Fri, Apr 24, 2020 at 8:02 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> > 
> > > Hello, Sal,
> > >
> > > I will reply to this CCing LKML, but dropping your last paragraph.
> > >
> > > I really do appreciate the offer, especially given there might well come
> > > a time when that sort of arrangement is quite attractive to me, but I
> > > am currently fortunate enough to be able to prioritize other people who
> > > might be having similar issues over compensation, charitable or otherwise.
> > >
> > >                                                         Thanx, Paul
> > >
> > > On Fri, Apr 24, 2020 at 03:05:27PM -0400, Sal Carrasco wrote:
> > > > Hi Paul,
> > > >
> > > > I love your talks and all of the work that you’ve done around RCU CPU
> > > stall
> > > > warnings. I recently watched a youtube video where you talk about
> > > decoding
> > > > these inscrutable warnings (https://youtu.be/23_GOr8Sz-E) - they really
> > > are
> > > > inscrutable. It was really insightful and it is because of this video
> > > that
> > > > I am reaching out to you today.
> > > >
> > > > My name is Sal Carrasco and I work for Linode(linode.com) as a part of
> > > the
> > > > Systems Engineering team. We’re a cloud hosting company that provides a
> > > > variety of infrastructure products with our main one being our compute
> > > > product which are Linux vms.
> > > >
> > > > The reason that I am reaching out to you is because in your talk you
> > > > encourage people to reach out if they encounter these RCU stall warnings.
> > > > Therefore, I was wondering if you’d be willing to give us some of your
> > > time
> > > > to help decipher what we’re currently seeing across our fleet. We’ve
> > > > submitted a Kernel mailing list post with some relevant information (
> > > > https://lkml.org/lkml/2020/4/8/980) but no response yet. Also we’ve been
> > > > able to reliably trigger a NETDEV WATCHDOG followed by a WARNING on
> > > servers
> > > > using AMD Epyc processors. After the warnings some combination of [NMIs,
> > > > stuck kworkers, network device timeouts, rcu stalls] happens, and, in 99%
> > > > of cases, the system descends into madness and needs to be rebooted.
> > > >
> > > > We’d really appreciate any help you can provide as this is a huge thorn
> > > in
> > > > our sides. Additionally, I’ve provided some potentially relevant
> > > debugging
> > > > information below for context. I know you’re probably busy so we'd be
> > > happy
> > > > to pay a consulting fee or donate to a charity of your choosing for your
> > > > time. If you are interested in helping please let me know.
> > > >
> > > > Thanks,
> > > > Sal
> > > >
> > > > ---------- Debugging information --------
> > > >
> > > > We’ve encountered this on servers using AMD processors with any
> > > combination
> > > > of the following kernels: 4.14.x, 4.19.x, 5.4.x, 5.5.9, 5.6.1.
> > > >
> > > > Steps to reproduce:
> > > >
> > > > Host loaded with VMs running the following in a loop:
> > > > fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=16k
> > > > --numjobs=8 --rwmixread=90 --size=1G --runtime=600
> > > > VMs running  playing back a pcap of outbound udp traffic in a loop:
> > > > timeout 600000 tcpreplay -t -i eth0 --loop 100000 /converted.pcap
> > > > On the host, in a loop:
> > > > stress-ng --cpu 120 --io 120 --hdd 5 --timeout 600
> > > >
> > > > The host will stall within a few hours.
> > > >
> > > > Example kernel output once the machine begins to stall:
> > > >
> > > > [ 6755.833166] ------------[ cut here ]------------
> > > > [ 6755.836716] NETDEV WATCHDOG: eth1 (mlx5_core): transmit queue 19 timed
> > > > out
> > > > [ 6755.843728] WARNING: CPU: 15 PID: 30568 at net/sched/sch_generic.c:443
> > > > dev_watchdog+0x22f/0x240
> > > > [ 6755.851487] Modules linked in: cls_u32 sch_ingress act_police xt_u32
> > > > xt_physdev ebt_set xt_multiport xt_set nf_conntrack_netlink xt_CT
> > > > xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tun ebt_comment
> > > > ebt_limit ebt_arp ebt_ip6 ebt_ip ip_set_hash_net ip_set ip6table_raw
> > > > ip6table_mangle ip6_tables iptable_raw iptable_filter ip_tables bpfilter
> > > > ebtable_nat 8021q mrp bonding amd64_edac_mod kvm_amd kvm irqbypass
> > > > crct10dif_pclmul crc32_pclmul mlx5_core crc32c_intel evdev fuse autofs4
> > > > [ 6755.893442] CPU: 15 PID: 30568 Comm: qemu-system-x86 Not tainted
> > > 5.5.9-1
> > > > #1
> > > > [ 6755.899410] Hardware name: Supermicro Super Server/H11DSi-NT, BIOS 1.1
> > > > 04/13/2018
> > > > [ 6755.905729] RIP: 0010:dev_watchdog+0x22f/0x240
> > > > [ 6755.908994] Code: c0 75 e4 eb 98 4c 89 ef c6 05 f3 19 eb 00 01 e8 17
> > > 7a
> > > > fb ff 89 d9 48 89 c2 4c 89 ee 48 c7 c7 48 d3 61 92 31 c0 e8 a1 ec 5b ff
> > > > <0f> 0b e9 75 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
> > > > [ 6755.926764] RSP: 0018:ffff8df80cbacec0 EFLAGS: 00010282
> > > > [ 6755.930823] RAX: 000000000000003e RBX: 0000000000000013 RCX:
> > > > 0000000000000000
> > > > [ 6755.936789] RDX: ffff8b3dafbe7480 RSI: ffff8b3dafbd8a18 RDI:
> > > > ffff8b3dafbd8a18
> > > > [ 6755.942753] RBP: ffff8b3d99d80440 R08: 0000000000000000 R09:
> > > > 0000000000000a4a
> > > > [ 6755.948717] R10: ffff8df81a243c40 R11: 0000000000000000 R12:
> > > > 00000000000001f8
> > > > [ 6755.954680] R13: ffff8b3d99d80000 R14: 0000000100628200 R15:
> > > > 0000000000000000
> > > > [ 6755.960643] FS:  00007f2ef5387000(0000) GS:ffff8b3dafbc0000(0000)
> > > > knlGS:0000000000000000
> > > > [ 6755.967570] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 6755.972283] CR2: 00007fffbd746d28 CR3: 0000001f119cc000 CR4:
> > > > 00000000003406e0
> > > > [ 6755.978403] Call Trace:
> > > > [ 6755.979673]  <IRQ>
> > > > [ 6755.980510]  ? dev_deactivate_queue.constprop.60+0x90/0x90
> > > > [ 6755.984829]  call_timer_fn+0x2d/0x140
> > > > [ 6755.987443]  run_timer_softirq+0x16a/0x1e0
> > > > [ 6755.990438]  ? __hrtimer_run_queues+0x11c/0x260
> > > > [ 6755.993800]  ? ktime_get+0x38/0x90
> > > > [ 6755.996028]  __do_softirq+0x117/0x2d0
> > > > [ 6755.998514]  do_softirq_own_stack+0x2a/0x40
> > > > [ 6756.001524]  </IRQ>
> > > > [ 6756.002469]  do_softirq.part.20+0x2b/0x30
> > > > [ 6756.005439]  __local_bh_enable_ip+0x59/0x60
> > > > [ 6756.008490]  tun_get_user+0xe87/0xfd0 [tun]
> > > > [ 6756.011511]  tun_chr_write_iter+0x4d/0x70 [tun]
> > > > [ 6756.014874]  do_iter_readv_writev+0x177/0x1b0
> > > > [ 6756.018061]  do_iter_write+0x87/0x1b0
> > > > [ 6756.020557]  vfs_writev+0x83/0x100
> > > > [ 6756.022790]  ? do_vfs_ioctl+0x8e/0x620
> > > > [ 6756.025408]  ? do_writev+0x51/0xe0
> > > > [ 6756.027637]  ? ksys_ioctl+0x6f/0x90
> > > > [ 6756.029948]  do_writev+0x51/0xe0
> > > > [ 6756.032003]  do_syscall_64+0x54/0x280
> > > > [ 6756.034505]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [ 6756.038406] RIP: 0033:0x7f2ef20a72f0
> > > > [ 6756.040811] Code: 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 09 39 01 00
> > > 48
> > > > 63 54 24 1c 41 89 c0 48 8b 74 24 10 48 63 7c 24 08 b8 14 00 00 00 0f 05
> > > > <48> 3d 00 f0 ff ff 77 17 44 89 c7 48 89 44 24 08 e8 3b 39 01 00 48
> > > > [ 6756.058563] RSP: 002b:00007fffbd748e20 EFLAGS: 00000293 ORIG_RAX:
> > > > 0000000000000014
> > > > [ 6756.064967] RAX: ffffffffffffffda RBX: 000055a3ce26ecd0 RCX:
> > > > 00007f2ef20a72f0
> > > > [ 6756.070940] RDX: 0000000000000001 RSI: 000055a3ce259e38 RDI:
> > > > 0000000000000012
> > > > [ 6756.076909] RBP: 000055a3ce26ecd0 R08: 0000000000000000 R09:
> > > > 000055a3cd0715c0
> > > > [ 6756.082881] R10: 0000000000000001 R11: 0000000000000293 R12:
> > > > 0000000000000001
> > > > [ 6756.088848] R13: 000055a3ce259e38 R14: 000055a3ce259e38 R15:
> > > > 0000000000000001
> > > > [ 6756.094813] ---[ end trace a91e624ca8f4292c ]---
> > > > [ 6756.098429] mlx5_core 0000:13:00.1 eth1: TX timeout detected
> > > > [ 6756.102952] mlx5_core 0000:13:00.1 eth1: TX timeout on queue: 19, SQ:
> > > > 0xab6, CQ: 0xd4a, SQ Cons: 0x3fd7 SQ Prod: 0x3fe6, usecs since last
> > > trans:
> > > > 27541000
> > > > [ 6756.115632] mlx5_core 0000:13:00.1 eth1: EQ 0x1a: Cons = 0x9c0e, irqn
> > > =
> > > > 0x120
> > > > [ 6775.300229] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > [ 6775.304787] rcu: 85-....: (59999 ticks this GP)
> > > > idle=ca2/1/0x4000000000000002 softirq=7985603/7985603 fqs=14090
> > > > [ 6775.313959] (t=60013 jiffies g=687873 q=106209)
> > > > [ 6775.317527] NMI backtrace for cpu 85
> > > > [ 6775.319932] CPU: 85 PID: 23160 Comm: ebtables Tainted: G        W
> > > >   5.5.9-1 #1
> > > > [ 6775.326520] Hardware name: Supermicro Super Server/H11DSi-NT, BIOS 1.1
> > > > 04/13/2018
> > > > [ 6775.332839] Call Trace:
> > > > [ 6775.334112]  <IRQ>
> > > > [ 6775.334957]  dump_stack+0x50/0x6b
> > > > [ 6775.337099]  nmi_cpu_backtrace+0x90/0xa0
> > > > [ 6775.339848]  ? lapic_can_unplug_cpu+0xa0/0xa0
> > > > [ 6775.343038]  nmi_trigger_cpumask_backtrace+0x6f/0x100
> > > > [ 6775.346926]  rcu_dump_cpu_stacks+0x8f/0xbf
> > > > [ 6775.349849]  rcu_sched_clock_irq+0x517/0x790
> > > > [ 6775.352947]  ? tick_init_highres+0x20/0x20
> > > > [ 6775.355875]  update_process_times+0x24/0x50
> > > > [ 6775.358891]  tick_sched_timer+0x51/0x160
> > > > [ 6775.361654]  __hrtimer_run_queues+0xec/0x260
> > > > [ 6775.364764]  hrtimer_interrupt+0x122/0x270
> > > > [ 6775.367689]  ? rcu_irq_enter+0x35/0x110
> > > > [ 6775.370498]  smp_apic_timer_interrupt+0x6a/0x140
> > > > [ 6775.373951]  apic_timer_interrupt+0xf/0x20
> > > > [ 6775.376887]  </IRQ>
> > > > [ 6775.377810] RIP: 0010:queued_write_lock_slowpath+0x51/0x80
> > > > [ 6775.382133] Code: ff 00 00 00 f0 0f b1 13 85 c0 74 32 f0 81 03 00 01
> > > 00
> > > > 00 ba ff 00 00 00 b9 00 01 00 00 8b 03 3d 00 01 00 00 74 0b f3 90 8b 03
> > > > <3d> 00 01 00 00 75 f5 89 c8 f0 0f b1 13 3d 00 01 00 00 75 df c6 43
> > > > [ 6775.399890] RSP: 0018:ffff8df85fc9bc70 EFLAGS: 00000206 ORIG_RAX:
> > > > ffffffffffffff13
> > > > [ 6775.406369] RAX: 0000000000003b00 RBX: ffff8b3d8743b39c RCX:
> > > > 0000000000000100
> > > > [ 6775.412419] RDX: 00000000000000ff RSI: ffff8df824d4c000 RDI:
> > > > ffff8b3d8743b3a0
> > > > [ 6775.418504] RBP: ffff8b3d8743b39c R08: 000ffffffffff000 R09:
> > > > 0000000000000000
> > > > [ 6775.424528] R10: ffff8df824d4c000 R11: 000000000000000a R12:
> > > > ffff8df824d42000
> > > > [ 6775.430505] R13: 00000000000009f3 R14: ffff8df8f73d5080 R15:
> > > > 00000000013ea8d0
> > > > [ 6775.436513]  copy_counters_to_user.part.13+0x3d/0xa0
> > > > [ 6775.440364]  copy_everything_to_user+0x4ca/0x510
> > > > [ 6775.443813]  do_ebt_get_ctl+0x9a/0x1b0
> > > > [ 6775.446487]  ? mem_cgroup_commit_charge+0x4a/0xa0
> > > > [ 6775.450027]  nf_getsockopt+0x48/0x60
> > > > [ 6775.452496]  ip_getsockopt+0x6d/0xb0
> > > > [ 6775.454906]  __sys_getsockopt+0x8d/0x100
> > > > [ 6775.457662]  __x64_sys_getsockopt+0x20/0x30
> > > > [ 6775.460673]  do_syscall_64+0x54/0x280
> > > > [ 6775.463172]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [ 6775.467056] RIP: 0033:0x7f35e1d9f40a
> > > > [ 6775.469520] Code: 48 8b 0d 81 ba 2b 00 f7 d8 64 89 01 48 83 c8 ff c3
> > > 66
> > > > 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 37 00 00 00 0f 05
> > > > <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4e ba 2b 00 f7 d8 64 89 01 48
> > > > [ 6775.487403] RSP: 002b:00007fff5491c5e8 EFLAGS: 00000217 ORIG_RAX:
> > > > 0000000000000037
> > > > [ 6775.493810] RAX: ffffffffffffffda RBX: 00007fff5491c630 RCX:
> > > > 00007f35e1d9f40a
> > > > [ 6775.499781] RDX: 0000000000000081 RSI: 0000000000000000 RDI:
> > > > 0000000000000003
> > > > [ 6775.505750] RBP: 0000000000000000 R08: 00007fff5491c6ac R09:
> > > > 0000000000000000
> > > > [ 6775.511720] R10: 00007fff5491c630 R11: 0000000000000217 R12:
> > > > 00007f35e509a648
> > > > [ 6775.517695] R13: 00007fff5491c6ac R14: 0000000000000000 R15:
> > > > 0000000000000000
> > > >
> > > > ______________________________________
> > > > *Salvador Carrasco*  |  Manager of Systems Operations  |  Linode
> > > > w: linode.com  |  e: scarrasco@...ode.com
> > > > p: 609-380-7100 ext.1310 <(609)%20380-7100>
> > >