Message-ID: <fa12d66e-de52-3e2e-154c-90c775bb4fe4@ametek.com>
Date: Mon, 14 Sep 2020 13:53:05 +0000
From: James Jurack <James.Jurack@...tek.com>
To: "claudiu.manoil@...escale.com" <claudiu.manoil@...escale.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: PROBLEM: Crash when timestamping outgoing PTP packets under heavy
network load (ppc, gianfar)
[1.] One line summary of the problem:
Crash when timestamping outgoing PTP packets under heavy network load
(ppc, gianfar)
[2.] Full description of the problem/report:
I have a custom embedded platform built around a QorIQ P2020 PPC processor
and Freescale eTSEC for gigabit Ethernet. When I have heavy network load
(full gigabit Ethernet usage), and send PTP Event packets with gianfar’s
hardware timestamping enabled, memory/stack corruption seems to occur,
leading to many different symptoms, including DMA API Debug warnings,
CPU stalls, and oopses/panics due to null pointer dereference or invalid
page access. The crash can also happen under lower load, but below
100MBps it can take hours for the issue to occur.
[3.] Keywords (i.e., modules, networking, kernel):
networking, ptp, ieee 1588, hardware timestamping, gianfar, ppc, p2020
[4.] Kernel information
[4.1.] Kernel version (from /proc/version):
# uname -a
Linux version 4.4.235 (jjurack@...val-eng-12) (gcc version 4.9.3
(crosstool-NG crosstool-ng-1.22.0) ) #13 SMP Thu Sep 10 08:28:37 EDT 2020
[4.2.] Kernel .config file:
linux.config is attached
[5.] Most recent kernel version which did not have the bug:
The latest version of this system to not have this crash was using
kernel 3.2 (Linux morlun 3.2.0 #1 SMP Fri Jun 10 12:43:28 PDT 2016 ppc
GNU/Linux). However, many other parts of our system have changed since
then and I have so far not been able to create modified versions of that
build for 1:1 comparison testing. I have gone back as far as 4.4.129 on
the 4.4 branch; that is where I first discovered the issue.
[6.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)
Console logs from a few different crashes are attached (console{1..5}.log)
[7.] A small shell script or example program which triggers the
problem (if possible)
I have attached minimal programs that I used to generate network load
and send timestamped PTP packets:
* netdump.c generates network traffic.
* dump.sh runs 16 instances of netdump.
* test.py connects to the netdump instances from another system to
consume the traffic.
* hwts.c sends timestamped PTP packets.
To reproduce I usually do these steps:
* run dump.sh
* start test.py on another system
* run `hwts 32` to send 32 PTP packets
* a crash usually happens within 10-20 seconds
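For readers without the attachments, the following is a minimal sketch of
the two pieces a sender like hwts needs: requesting hardware TX timestamps
on a socket and building a PTPv2 Sync message. It is not the attached
hwts.c. The helper names (ptp_request_tx_timestamps, ptp_build_sync) and
the choice of a bare Sync message are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

#define PTP_EVENT_PORT 319 /* UDP port used for PTP event messages */
#define PTP_SYNC_LEN   44  /* 34-byte PTPv2 header + 10-byte originTimestamp */

/* Hypothetical helper: ask the kernel to report hardware TX timestamps
 * for packets sent on this socket. Returns 0 on success, -1 on error. */
static int ptp_request_tx_timestamps(int fd)
{
    int flags = SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;

    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
}

/* Hypothetical helper: fill buf with a zeroed PTPv2 Sync message carrying
 * the given sequence number. Returns the message length, or 0 if buf is
 * too small. */
static size_t ptp_build_sync(uint8_t *buf, size_t len, uint16_t seq)
{
    if (len < PTP_SYNC_LEN)
        return 0;

    memset(buf, 0, PTP_SYNC_LEN);
    buf[0] = 0x00;                  /* transportSpecific = 0, messageType = Sync */
    buf[1] = 0x02;                  /* versionPTP = 2 */
    buf[2] = 0x00;                  /* messageLength, big endian */
    buf[3] = PTP_SYNC_LEN;
    buf[30] = (uint8_t)(seq >> 8);  /* sequenceId, big endian */
    buf[31] = (uint8_t)(seq & 0xff);
    return PTP_SYNC_LEN;
}
```

A caller would open a UDP socket, call ptp_request_tx_timestamps() on it,
and sendto() the buffer to UDP port PTP_EVENT_PORT. The TX timestamp is
then read back with recvmsg() and MSG_ERRQUEUE as an SCM_TIMESTAMPING
control message.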
[8.] Environment
[8.1.] Software (add the output of the ver_linux script here)
$ scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
Linux oh-val-eng-12 5.8.5-arch1-1 #1 SMP PREEMPT Thu, 27 Aug 2020
18:53:02 +0000 x86_64 GNU/Linux
GNU C 10.2.0
GNU Make 4.3
Binutils 2.35
Util-linux 2.36
Mount 2.36
Module-init-tools 27
E2fsprogs 1.45.6
Jfsutils 1.1.15
Reiserfsprogs 3.6.27
Xfsprogs 5.7.0
Pcmciautils 018
PPP 2.4.7
Linux C Library 2.32
Dynamic linker (ldd) 2.32
Linux C++ Library 6.0.28
Net-tools 2.10
Kbd 2.3.0
Console-tools 2.3.0
Sh-utils 8.32
Udev 246
Wireless-tools 30
Modules Loaded acpi_cpufreq agpgart at24 auth_rpcgss bluetooth
cdrom cec cfg80211 coretemp crc16 crc32c_generic crc32c_intel
crypto_user dcdbas dell_smbios dell_wmi dell_wmi_descriptor drm
drm_kms_helper e1000e ecc ecdh_generic ehci_hcd ehci_pci evdev ext4
fb_sys_fops fuse gpio_ich grace hid hid_generic hid_plantronics
i2c_algo_bit i2c_i801 i2c_smbus i7core_edac input_leds intel_cstate
intel_pmc_bxt intel_powerclamp intel_uncore ip_tables irqbypass
iTCO_vendor_support iTCO_wdt jbd2 joydev kvm kvm_intel ledtrig_audio
lockd loop lpc_ich mac_hid mbcache mc mousedev nfnetlink nfnetlink_log
nfnetlink_queue nfs_acl nfsd parport parport_pc pcspkr ppdev radeon
rc_core rfkill sg snd snd_hda_codec snd_hda_codec_generic
snd_hda_codec_realtek snd_hda_core snd_hda_intel snd_hwdep
snd_intel_dspcfg snd_pcm snd_rawmidi snd_seq_device snd_timer
snd_usb_audio snd_usbmidi_lib soundcore sparse_keymap sr_mod sunrpc
syscopyarea sysfillrect sysimgblt ttm uas usbhid usb_storage vboxdrv
vboxnetadp vboxnetflt wmi wmi_bmof x_tables
[8.2.] Processor information (from /proc/cpuinfo):
# cat /proc/cpuinfo
processor : 0
cpu : e500v2
clock : 1200.000000MHz
revision : 5.1 (pvr 8021 1051)
bogomips : 150.00
processor : 1
cpu : e500v2
clock : 1200.000000MHz
revision : 5.1 (pvr 8021 1051)
bogomips : 150.00
total bogomips : 300.00
timebase : 75000000
platform : P2020 RDB
model : EMX-2500
Memory : 512 MB
[8.3.] Module information (from /proc/modules):
# cat /proc/modules
[8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
# cat /proc/ioports
00000000-0000ffff : /pcie@...08000
00000000-00000fff : Legacy IO
00020000-0002ffff : /pcie@...09000
00020000-0002ffff : PCI Bus 0001:12
00040000-0004ffff : /pcie@...0a000
00040000-0004ffff : PCI Bus 0002:17
# cat /proc/iomem
00000000-1fffffff : System RAM
80000000-bfffffff : /pcie@...0a000
80000000-bfffffff : PCI Bus 0002:17
a0000000-bfffffff : /pcie@...09000
a0000000-bfffffff : PCI Bus 0001:12
e0000000-ffffffff : /pcie@...08000
e0000000-ffffffff : PCI Bus 0000:01
ff704500-ff704507 : serial
ff707000-ff707fff : /soc@...00000/spi@...0
ff722000-ff722fff : /soc@...00000/usb@...00
ff722000-ff722fff : /soc@...00000/usb@...00
ff722000-ff722fff : /soc@...00000/usb@...00
[8.5.] PCI information ('lspci -vvv' as root)
lspci.log is attached
[8.6.] SCSI information (from /proc/scsi/scsi)
# cat /proc/scsi/scsi
[8.7.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):
I’ve attached our system’s device tree source file (emx-2500.dts).
[X.] Other notes, patches, fixes, workarounds:
[X.1.] I have tried commenting out gianfar.c:965 (priv->hwts_tx_en = 1;).
With that line disabled, the crash no longer occurs.
[X.2.] Just calling the SIOCSHWTSTAMP ioctl to turn on hardware
timestamping while under network load causes a DMA API Debug warning,
but does not seem to destabilize the system otherwise. I’m not sure if
this is the same issue or separate. Examples of this are included in
each console log, as well as below:
fsl-gianfar ff724000.ethernet: DMA-API: device driver frees DMA memory
with wrong function [device address=0x000000001ed90000] [size=232 bytes]
[mapped as page] [unmapped as single]
------------[ cut here ]------------
WARNING: at lib/dma-debug.c:1116
Modules linked in:
CPU: 1 PID: 1589 Comm: hwts Tainted: G W 4.4.235 #13
task: d9e41900 ti: db960000 task.ti: db960000
NIP: c030f5c0 LR: c030f5c0 CTR: c0367c44
REGS: db961c20 TRAP: 0700 Tainted: G W (4.4.235)
MSR: 00021000 <CE,ME> CR: 28002822 XER: 20000000
GPR00: c030f5c0 db961cd0 d9e41900 000000b5 dffd12f0 dffd2e0c 1f85f000 db960000
GPR08: 00000007 c0772d4c 1f85f000 00000297 42002884 100192ac 00000000 00000000
GPR16: 00000000 d9da443c 00000002 d9da4420 00000000 df980900 00000020 d9da4000
GPR24: 00029000 c07a8314 c07fe728 c07b0000 c07d9ec0 db961d28 c0808e20 d9ddbd20
NIP [c030f5c0] check_unmap+0x948/0xa90
LR [c030f5c0] check_unmap+0x948/0xa90
Call Trace:
[db961cd0] [c030f5c0] check_unmap+0x948/0xa90 (unreliable)
[db961d20] [c030f7a4] debug_dma_unmap_page+0x9c/0xb0
[db961da0] [c03eeb70] free_skb_resources+0xf4/0x3e4
[db961df0] [c03f354c] reset_gfar+0x68/0x9c
[db961e00] [c03f378c] gfar_ioctl+0x20c/0x210
[db961e30] [c04a2d14] dev_ifsioc+0x308/0x31c
[db961e60] [c04a2f94] dev_ioctl+0x1c0/0x624
[db961ec0] [c014b4d0] do_vfs_ioctl+0x38c/0x6b4
[db961f20] [c014b844] SyS_ioctl+0x4c/0x80
[db961f40] [c0011004] ret_from_syscall+0x0/0x3c
--- interrupt: c01 at 0xff40194
LR = 0xffed0a8
Instruction dump:
554a103a 7c69402e 7cc9502e 811d001c 813d0020 815d0024 90610008 3c60c06b
90c1000c 3863b7a8 4cc63182 482c99b9 <0fe00000> 4bfffa60 3c80c06b 3884b0f8
---[ end trace 2398b56cb968a2e0 ]---
Mapped at:
[<c03f00dc>] gfar_start_xmit+0x888/0x9f0
[<c0489b74>] dev_hard_start_xmit+0x27c/0x47c
[<c04ae520>] sch_direct_xmit+0xe4/0x278
[<c04ae748>] __qdisc_run+0x94/0x1dc
[<c048a22c>] __dev_queue_xmit+0x384/0x70c
---
James Jurack
Systems Engineer
VTI Instruments / Ametek Programmable Power
View attachment "console1.log" of type "text/x-log" (9883 bytes)
View attachment "console2.log" of type "text/x-log" (5893 bytes)
View attachment "console3.log" of type "text/x-log" (12284 bytes)
View attachment "console4.log" of type "text/x-log" (6726 bytes)
View attachment "console5.log" of type "text/x-log" (10439 bytes)
Download attachment "dump.sh" of type "application/x-shellscript" (147 bytes)
Download attachment "emx-2500.dts" of type "audio/vnd.dts" (16685 bytes)
View attachment "hwts.c" of type "text/x-csrc" (4326 bytes)
View attachment "linux.config" of type "text/plain" (70323 bytes)
View attachment "lspci.log" of type "text/x-log" (156353 bytes)
View attachment "netdump.c" of type "text/x-csrc" (2202 bytes)
View attachment "test.py" of type "text/x-python" (771 bytes)