Date:   Thu, 17 May 2018 16:24:51 -0700
From:   Ben Greear <greearb@...delatech.com>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Regression bisected to: softirq: Let ksoftirqd do its job

One of my out-of-tree patches is a network impairment tool that acts a lot like
an Ethernet bridge with latency, jitter, etc.

We recently noticed igb adapter errors (tx queue timeouts) when testing with our emulator
at high speeds.  For whatever reason, the problem is only easily reproduced when we add jitter
to the emulator.  Adding jitter causes a bit more CPU usage and lock contention in our software,
and increases the number of skbs allocated at any given time.

I bisected the problem to the commit below:

Author: Eric Dumazet <edumazet@...gle.com>
Date:   Wed Aug 31 10:42:29 2016 -0700

     softirq: Let ksoftirqd do its job

     A while back, Paolo and Hannes sent an RFC patch adding threaded-able
     napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)
....

If I replace my emulator with a bridge, I do not see the problem.  But I also do not
(or only very rarely?) see it when the emulator is configured with zero latency and zero jitter,
which is how the bridge would act.

Any idea what sort of (bad?) behaviour could cause this tx queue timeout?

If you have any interest, I will be happy to email you my out-of-tree patches and
instructions to reproduce the problem.


The kernel splat looks like this, and repeats often:


May 17 16:03:09 localhost.localdomain kernel: audit: type=1131 audit(1526598189.492:159): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 17 16:03:39 localhost.localdomain kernel: ------------[ cut here ]------------
May 17 16:03:39 localhost.localdomain kernel: WARNING: CPU: 5 PID: 0 at /home/greearb/git/linux-bisect/net/sched/sch_generic.c:316 dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): transmit queue 0 timed out
May 17 16:03:39 localhost.localdomain kernel: Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 fuse macvlan wanlink(O) pktgen cfg80211 sunrpc coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass ipmi_ssif iTCO_wdt iTCO_vendor_support joydev i2c_i801 lpc_ich i2c_smbus ioatdma shpchp wmi ipmi_si ipmi_msghandler tpm_tis tpm_tis_core tpm acpi_power_meter acpi_pad sch_fq_codel ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca i2c_algo_bit i2c_core fjes ipv6 crc_ccitt [last unloaded: nf_conntrack]
May 17 16:03:39 localhost.localdomain kernel: CPU: 5 PID: 0 Comm: swapper/5 Tainted: G           O    4.8.0-rc7+ #132
May 17 16:03:39 localhost.localdomain kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
May 17 16:03:39 localhost.localdomain kernel:  0000000000000000 ffff88087fd43d78 ffffffff81417eb1 ffff88087fd43dc8
May 17 16:03:39 localhost.localdomain kernel:  0000000000000000 ffff88087fd43db8 ffffffff81103556 0000013c7fd43da8
May 17 16:03:39 localhost.localdomain kernel:  0000000000000000 ffff880854221940 0000000000000005 ffff880854bb8000
May 17 16:03:39 localhost.localdomain kernel: Call Trace:
May 17 16:03:39 localhost.localdomain kernel:  <IRQ>  [<ffffffff81417eb1>] dump_stack+0x63/0x82
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81103556>] __warn+0xc6/0xe0
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff811035ba>] warn_slowpath_fmt+0x4a/0x50
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff817b3844>] dev_watchdog+0x234/0x240
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff817b3610>] ? qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff8116ea50>] call_timer_fn+0x30/0x150
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff817b3610>] ? qdisc_rcu_free+0x40/0x40
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff8116f35a>] run_timer_softirq+0x1ea/0x450
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81176d97>] ? ktime_get+0x37/0xa0
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff8104fd21>] ? lapic_next_deadline+0x21/0x30
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff8117cffd>] ? clockevents_program_event+0x7d/0x120
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81108b7a>] __do_softirq+0xca/0x2d0
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81108ee3>] irq_exit+0xb3/0xc0
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff8105099d>] smp_apic_timer_interrupt+0x3d/0x50
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81867882>] apic_timer_interrupt+0x82/0x90
May 17 16:03:39 localhost.localdomain kernel:  <EOI>  [<ffffffff816f9c06>] ? cpuidle_enter_state+0x126/0x300
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff816f9e02>] cpuidle_enter+0x12/0x20
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81144ba5>] call_cpuidle+0x25/0x40
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff81144f6a>] cpu_startup_entry+0x2ba/0x380
May 17 16:03:39 localhost.localdomain kernel:  [<ffffffff8104e8d9>] start_secondary+0x149/0x170
May 17 16:03:39 localhost.localdomain kernel: ---[ end trace f62c6dd947785e8f ]---
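
Since the splat repeats often, it can help to tally the watchdog messages per
interface and queue when comparing kernels during a bisect.  A minimal sketch in
Python, assuming the message format shown in the log above; the function name
count_watchdog_timeouts is illustrative and not part of any existing tool:

```python
import re
from collections import Counter

# Matches the watchdog message as it appears in the splat above, e.g.
# "NETDEV WATCHDOG: eth5 (igb): transmit queue 0 timed out".
# Assumption: other kernel versions may word this message differently.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<dev>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_watchdog_timeouts(lines):
    """Return a Counter keyed by (device, driver, queue index)."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            key = (match.group("dev"), match.group("driver"),
                   int(match.group("queue")))
            counts[key] += 1
    return counts


# Two lines taken from the log above; only the second is a watchdog message.
sample = [
    "May 17 16:03:39 localhost.localdomain kernel: ------------[ cut here ]------------",
    "May 17 16:03:39 localhost.localdomain kernel: NETDEV WATCHDOG: eth5 (igb): transmit queue 0 timed out",
]

for (dev, driver, queue), n in sorted(count_watchdog_timeouts(sample).items()):
    print(f"{dev} ({driver}) queue {queue}: {n} timeout(s)")
```

Passing an open log file (or the lines of journalctl -k output) instead of the
sample list gives a per-queue count for a whole test run.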


Thanks,
Ben

-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
