Message-ID: <8739349e-d61c-9127-b9ff-530382036e4a@itcare.pl>
Date:   Sun, 15 Oct 2017 17:03:30 +0200
From:   Paweł Staszewski <pstaszewski@...are.pl>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     "Anders K. Pedersen | Cohaesio" <akp@...aesio.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "intel-wired-lan@...ts.osuosl.org" <intel-wired-lan@...ts.osuosl.org>,
        "alexander.h.duyck@...el.com" <alexander.h.duyck@...el.com>
Subject: Re: Linux 4.12+ memory leak on router with i40e NICs

The previously attached graphs were for:

4.14.0-rc4-next-20171012

from git:

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

with the in-kernel drivers.

I just tested by replacing the cards in the server, from 8x10G based on 82599 to:
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
02:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
03:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)

And with the same configuration memory is leaking somewhere - there is no
process that can account for it:

  ps aux --sort -rss
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      5103 41.2 31.1 10357508 10242860 ?   Sl   Oct14 1010:33 /usr/local/sbin/bgpd -d -A 127.0.0.1 -u root -g root -I --ignore_warnings -F /usr/local/etc/Quagga.conf -t
root      5094  0.0  0.8 295372 270868 ?       Ss   Oct14   1:26 /usr/local/sbin/zebra -d -A 127.0.0.1 -u root -g root -I --ignore_warnings -F /usr/local/etc/Quagga.conf
root      4356  3.4  0.2  98780 75852 ?        S    Oct14  84:21 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln -I -smux
root      3448  0.0  0.0  32172  6204 ?        Ss   Oct14   0:00 /sbin/udevd --daemon
root      5385  0.0  0.0  61636  5044 ?        Ss   Oct14   0:00 sshd: paol [priv]
root      4116  0.0  0.0 346312  4804 ?        Ss   Oct14   0:33 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid
paol      5390  0.0  0.0  61636  3564 ?        S    Oct14   0:06 sshd: paol@.../1
root      5403  0.0  0.0 709344  3520 ?        Ssl  Oct14   0:26 /opt/collectd/sbin/collectd
root      5397  0.0  0.0  18280  3288 pts/1    S    Oct14   0:00 -su
root      4384  0.0  0.0  30472  3016 ?        Ss   Oct14   0:00 /usr/sbin/sshd
paol      5391  0.0  0.0  18180  2884 pts/1    Ss   Oct14   0:00 -bash
root      5394  0.0  0.0  43988  2376 pts/1    S    Oct14   0:00 su -
root     20815  0.0  0.0  17744  2312 pts/1    R+   16:58   0:00 ps aux --sort -rss
root      4438  0.0  0.0  28820  2256 ?        S    Oct14   0:00 teamd -d -f /etc/teamd.conf
root      4030  0.0  0.0   6976  2024 ?        Ss   Oct14   0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog
root      4408  0.0  0.0  16768  1884 ?        Ss   Oct14   0:00 /usr/sbin/cron
root      5357  0.0  0.0 120532  1724 tty6     Ss+  Oct14   0:00 /sbin/agetty 38400 tty6 linux
root      5352  0.0  0.0 120532  1712 tty1     Ss+  Oct14   0:00 /sbin/agetty 38400 tty1 linux
root      5356  0.0  0.0 120532  1692 tty5     Ss+  Oct14   0:00 /sbin/agetty 38400 tty5 linux
root      5353  0.0  0.0 120532  1648 tty2     Ss+  Oct14   0:00 /sbin/agetty 38400 tty2 linux
root      5355  0.0  0.0 120532  1628 tty4     Ss+  Oct14   0:00 /sbin/agetty 38400 tty4 linux
root      5354  0.0  0.0 120532  1620 tty3     Ss+  Oct14   0:00 /sbin/agetty 38400 tty3 linux
root         1  0.0  0.0   4184  1420 ?        Ss   Oct14   0:02 init [3]
root      4115  0.0  0.0  34336   608 ?        S    Oct14   0:00 supervising syslog-ng
root         2  0.0  0.0      0     0 ?        S    Oct14   0:00 [kthreadd]
root         4  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/0:0H]
root         6  0.0  0.0      0     0 ?        I<   Oct14   0:00 [mm_percpu_wq]
root         7  0.8  0.0      0     0 ?        S    Oct14  22:01 [ksoftirqd/0]
root         8  0.0  0.0      0     0 ?        I    Oct14   1:36 [rcu_sched]
root         9  0.0  0.0      0     0 ?        I    Oct14   0:00 [rcu_bh]
root        10  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/0]
root        11  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/0]
root        12  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/1]
root        13  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/1]
root        14  0.8  0.0      0     0 ?        S    Oct14  21:39 [ksoftirqd/1]
root        16  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/1:0H]
root        17  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/2]
root        18  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/2]
root        19  0.8  0.0      0     0 ?        S    Oct14  20:48 [ksoftirqd/2]
root        21  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/2:0H]
root        22  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/3]
root        23  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/3]
root        24  1.0  0.0      0     0 ?        S    Oct14  24:36 [ksoftirqd/3]
root        26  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/3:0H]
root        27  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/4]
root        28  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/4]
root        29  0.8  0.0      0     0 ?        S    Oct14  20:14 [ksoftirqd/4]
root        31  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/4:0H]
root        32  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/5]
root        33  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/5]
root        34  0.8  0.0      0     0 ?        S    Oct14  20:22 [ksoftirqd/5]
root        36  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/5:0H]
root        37  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/6]
root        38  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/6]
root        39  0.8  0.0      0     0 ?        S    Oct14  20:43 [ksoftirqd/6]
root        41  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/6:0H]
root        42  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/7]
root        43  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/7]
root        44  0.8  0.0      0     0 ?        S    Oct14  21:51 [ksoftirqd/7]
root        46  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/7:0H]
root        47  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/8]
root        48  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/8]
root        49  0.7  0.0      0     0 ?        S    Oct14  18:49 [ksoftirqd/8]
root        51  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/8:0H]
root        52  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/9]
root        53  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/9]
root        54  0.8  0.0      0     0 ?        S    Oct14  20:48 [ksoftirqd/9]
root        56  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/9:0H]
root        57  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/10]
root        58  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/10]
root        59  0.7  0.0      0     0 ?        S    Oct14  19:07 [ksoftirqd/10]
root        61  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/10:0H]
root        62  0.0  0.0      0     0 ?        S    Oct14   0:00 [cpuhp/11]
root        63  0.0  0.0      0     0 ?        S    Oct14   0:00 [migration/11]
root        64  0.8  0.0      0     0 ?        S    Oct14  19:54 [ksoftirqd/11]
root        66  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/11:0H]
root        67  0.0  0.0      0     0 ?        S    Oct14   0:00 [kdevtmpfs]
root        70  0.0  0.0      0     0 ?        S    Oct14   0:00 [kauditd]
root       410  0.0  0.0      0     0 ?        S    Oct14   0:00 [khungtaskd]
root       411  0.0  0.0      0     0 ?        S    Oct14   0:00 [oom_reaper]
root       412  0.0  0.0      0     0 ?        I<   Oct14   0:00 [writeback]
root       414  0.0  0.0      0     0 ?        S    Oct14   0:00 [kcompactd0]
root       415  0.0  0.0      0     0 ?        SN   Oct14   0:00 [ksmd]
root       416  0.0  0.0      0     0 ?        SN   Oct14   0:00 [khugepaged]
root       417  0.0  0.0      0     0 ?        I<   Oct14   0:00 [crypto]
root       419  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kblockd]
root      1314  0.0  0.0      0     0 ?        I<   Oct14   0:00 [ata_sff]
root      1329  0.0  0.0      0     0 ?        I<   Oct14   0:00 [md]
root      1425  0.0  0.0      0     0 ?        I<   Oct14   0:00 [rpciod]
root      1426  0.0  0.0      0     0 ?        I<   Oct14   0:00 [xprtiod]
root      1515  0.0  0.0      0     0 ?        S    Oct14   0:00 [kswapd0]
root      1614  0.0  0.0      0     0 ?        I<   Oct14   0:00 [nfsiod]
root      1684  0.0  0.0      0     0 ?        I<   Oct14   0:00 [acpi_thermal_pm]
root      1801  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_0]
root      1802  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_0]
root      1806  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_1]
root      1807  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_1]
root      1810  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_2]
root      1811  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_2]
root      1814  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_3]
root      1815  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_3]
root      1838  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_4]
root      1839  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_4]
root      1842  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_5]
root      1843  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_5]
root      1846  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_6]
root      1847  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_6]
root      1850  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_7]
root      1851  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_7]
root      1854  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_8]
root      1855  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_8]
root      1858  0.0  0.0      0     0 ?        S    Oct14   0:00 [scsi_eh_9]
root      1859  0.0  0.0      0     0 ?        I<   Oct14   0:00 [scsi_tmf_9]
root      1921  0.0  0.0      0     0 ?        I<   Oct14   0:00 [ixgbe]
root      1923  0.0  0.0      0     0 ?        I<   Oct14   0:00 [i40e]
root      3058  0.0  0.0      0     0 ?        I<   Oct14   0:00 [ipv6_addrconf]
root      3087  0.0  0.0      0     0 ?        S    Oct14   0:01 [md3_raid1]
root      3092  0.0  0.0      0     0 ?        S    Oct14   0:00 [md2_raid1]
root      3097  0.0  0.0      0     0 ?        S    Oct14   0:00 [md1_raid1]
root      3099  0.0  0.0      0     0 ?        I<   Oct14   0:00 [reiserfs/md2]
root      3100  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/0:1H]
root      3124  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/6:1H]
root      3155  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/1:1H]
root      3244  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/11:1H]
root      3351  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/5:1H]
root      3467  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/8:1H]
root      3502  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/9:1H]
root      3503  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/2:1H]
root      3521  0.0  0.0      0     0 ?        SN   Oct14   0:00 [kipmi0]
root      3757  0.0  0.0      0     0 ?        I<   Oct14   0:00 [reiserfs/md3]
root      5096  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/10:1H]
root      5280  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/7:1H]
root      5447  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/3:1H]
root      6573  0.0  0.0      0     0 ?        I    16:10   0:00 [kworker/3:0]
root      6584  0.0  0.0      0     0 ?        I    16:11   0:00 [kworker/6:0]
root      6660  0.0  0.0      0     0 ?        I<   Oct14   0:00 [kworker/4:1H]
root      6967  0.0  0.0      0     0 ?        I    16:15   0:00 [kworker/5:2]
root      6976  0.0  0.0      0     0 ?        I    16:17   0:00 [kworker/10:1]
root      7036  0.0  0.0      0     0 ?        I    06:19   0:02 [kworker/0:4]
root      7750  0.0  0.0      0     0 ?        I    16:22   0:00 [kworker/4:2]
root      9555  0.0  0.0      0     0 ?        I    16:25   0:00 [kworker/2:2]
root      9557  0.0  0.0      0     0 ?        I    16:26   0:00 [kworker/6:2]
root     10146  0.0  0.0      0     0 ?        I    16:27   0:00 [kworker/8:2]
root     10148  0.0  0.0      0     0 ?        I    16:27   0:00 [kworker/1:2]
root     13804  0.0  0.0      0     0 ?        I    13:42   0:00 [kworker/0:1]
root     16109  0.0  0.0      0     0 ?        I    16:33   0:00 [kworker/9:2]
root     16156  0.0  0.0      0     0 ?        I    16:39   0:00 [kworker/4:0]
root     16422  0.0  0.0      0     0 ?        I    16:39   0:00 [kworker/5:1]
root     16423  0.0  0.0      0     0 ?        I    16:39   0:00 [kworker/9:0]
root     17118  0.0  0.0      0     0 ?        I    16:40   0:00 [kworker/11:2]
root     17250  0.0  0.0      0     0 ?        I    16:42   0:00 [kworker/3:1]
root     17620  0.0  0.0      0     0 ?        I    16:43   0:00 [kworker/0:0]
root     17629  0.0  0.0      0     0 ?        I    16:45   0:00 [kworker/2:1]
root     17639  0.0  0.0      0     0 ?        I    16:47   0:00 [kworker/u24:0]
root     17640  0.0  0.0      0     0 ?        I    16:47   0:00 [kworker/10:0]
root     17642  0.0  0.0      0     0 ?        I    16:48   0:00 [kworker/0:5]
root     19577  0.0  0.0      0     0 ?        I    16:49   0:00 [kworker/8:1]
root     19578  0.0  0.0      0     0 ?        I    16:49   0:00 [kworker/8:3]
root     19819  0.0  0.0      0     0 ?        I    16:49   0:00 [kworker/1:1]
root     19820  0.0  0.0      0     0 ?        I    16:49   0:00 [kworker/1:3]
root     19972  0.0  0.0      0     0 ?        I    16:52   0:00 [kworker/7:1]
root     19973  0.0  0.0      0     0 ?        I    16:52   0:00 [kworker/7:3]
root     19974  0.0  0.0      0     0 ?        I    16:52   0:00 [kworker/11:1]
root     19976  0.0  0.0      0     0 ?        I    16:52   0:00 [kworker/u24:1]
root     20106  0.0  0.0      0     0 ?        I    16:53   0:00 [kworker/4:1]
root     20107  0.0  0.0      0     0 ?        I    16:53   0:00 [kworker/4:3]
root     20108  0.0  0.0      0     0 ?        I    16:54   0:00 [kworker/3:2]
root     20109  0.0  0.0      0     0 ?        I    16:54   0:00 [kworker/3:3]
root     20110  0.0  0.0      0     0 ?        I    16:54   0:00 [kworker/0:6]
root     20217  0.0  0.0      0     0 ?        I    16:55   0:00 [kworker/1:0]
root     20219  0.0  0.0      0     0 ?        I    16:56   0:00 [kworker/9:1]
root     20222  0.0  0.0      0     0 ?        I    16:56   0:00 [kworker/9:3]
root     20354  0.0  0.0      0     0 ?        I    16:57   0:00 [kworker/5:0]
root     20355  0.0  0.0      0     0 ?        I    16:57   0:00 [kworker/5:3]
root     20814  0.0  0.0      0     0 ?        I    16:57   0:00 [kworker/u24:2]
root     26845  0.0  0.0      0     0 ?        I    15:40   0:00 [kworker/7:2]
root     26979  0.0  0.0      0     0 ?        I    15:43   0:00 [kworker/0:3]
root     27375  0.0  0.0      0     0 ?        I    15:48   0:00 [kworker/0:2]
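One way to double-check that no userspace process accounts for the used
memory is to sum RSS across all processes and compare it with what free
reports (a rough sketch - shared pages make the sum an overestimate):

  ps -eo rss= | awk '{ sum += $1 } END { printf "userspace RSS: %d MB\n", sum/1024 }'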

But free -m shows:

free -m
              total        used        free      shared  buff/cache   available
Mem:          32113       18345       13598           0         169       13419
Swap:          3911           0        3911


Available memory keeps shrinking - about 0.5 MB less per hour.
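The rate is easy to track by sampling MemAvailable from /proc/meminfo
periodically, e.g.:

  while true; do
      echo "$(date +%s) $(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)"
      sleep 600    # one sample every 10 minutes
  done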

It looks like this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972

is not yet included in:

git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
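Whether a given checkout already contains that fix is easy to verify with git
(a sketch; the commit ID is the one from the link above):

  cd linux-next
  git merge-base --is-ancestor 2b9478ffc550f17c6cd8c69057234e91150f5972 HEAD \
      && echo "fix is included" || echo "fix is missing"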


OK, I will upgrade tomorrow and check again with that fix.



On 2017-10-15 at 02:58, Alexander Duyck wrote:
> Hi Pawel,
>
> To clarify, is that Dave Miller's tree or Linus's that you are talking
> about? If it is Dave's tree, how long ago did you pull it? I think the
> fix was only pushed by Jeff Kirsher a few days ago.
>
> The issue should be fixed in the following commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/drivers/net/ethernet/intel/i40e/i40e_txrx.c?id=2b9478ffc550f17c6cd8c69057234e91150f5972
>
> Thanks.
>
> - Alex
>
> On Sat, Oct 14, 2017 at 3:03 PM, Paweł Staszewski <pstaszewski@...are.pl> wrote:
>> Forgot to add - these graphs were taken with kernel 4.14-rc4-next
>>
>>
>> On 2017-10-15 at 00:00, Paweł Staszewski wrote:
>>
>> Same problem here
>>
>> The only difference is likewise changing the 82599-based Intel cards to
>> X710, and the memleak appears
>>
>> Memory with the ixgbe driver over time - same config, same kernel:
>>
>>
>>
>> Changed the NICs to X710 with the i40e driver (this is the only change).
>>
>> And memory over time:
>>
>>
>>
>> There is no process that is eating the memory - it looks like there is some
>> problem with the i40e driver - but that is not a surprise :) this driver is
>> really buggy in many ways - most tickets I opened on the e1000e SourceForge
>> tracker got no reply for a year or more - or if somebody replied after a
>> year, they closed the ticket after 1 day citing no activity :)
>>
>>
>>
>> On 2017-10-05 at 07:19, Anders K. Pedersen | Cohaesio wrote:
>>
>> On Wed, 2017-10-04 at 08:32 -0700, Alexander Duyck wrote:
>>
>> On Wed, Oct 4, 2017 at 5:56 AM, Anders K. Pedersen | Cohaesio <akp@...aesio.com> wrote:
>>
>> Hello,
>>
>> After updating one of our Linux based routers to kernel 4.13, it began
>> leaking memory quite fast (about 1 GB every half hour). To narrow it
>> down we tried various kernel versions and found that 4.11.12 is okay,
>> while 4.12 also leaks, so we did a bisection between 4.11 and 4.12.
>>
>> The first bisection ended at
>> "[6964e53f55837b0c49ed60d36656d2e0ee4fc27b] i40e: fix handling of HW
>> ATR eviction", which fixes some flag handling that was broken by
>> 47994c119a36 "i40e: remove hw_disabled_flags in favor of using separate
>> flag bits", so I did a second bisection, where I added 6964e53f5583
>> "i40e: fix handling of HW ATR eviction" to the steps that had
>> 47994c119a36 "i40e: remove hw_disabled_flags in favor of using separate
>> flag bits" in them.
>>
>> The second bisection ended at
>> "[0e626ff7ccbfc43c6cc4aeea611c40b899682382] i40e: Fix support for flow
>> director programming status", where I don't see any obvious problems,
>> so I'm hoping for some assistance.
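(For reference, this style of bisection - with the fix cherry-picked on top at
each step - looks roughly like the following; the hashes are the ones quoted
above, and the build/boot/test cycle is omitted:)

  git bisect start v4.12 v4.11
  git cherry-pick 6964e53f5583   # "i40e: fix handling of HW ATR eviction", where needed
  # build, boot, and watch memory; drop the cherry-pick before marking:
  git reset --hard HEAD^
  git bisect good                # or "git bisect bad", and repeat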
>>
>> The router is a PowerEdge R730 server (Haswell based) with three Intel
>> NICs (all using the i40e driver):
>>
>> X710 quad port 10 GbE SFP+: eth0 eth1 eth2 eth3
>> X710 quad port 10 GbE SFP+: eth4 eth5 eth6 eth7
>> XL710 dual port 40 GbE QSFP+: eth8 eth9
>>
>> The NICs are aggregated with LACP using the team driver:
>>
>> team0: eth9 (40 GbE selected primary), and eth3, eth7 (10 GbE non-selected backups)
>> team1: eth0, eth1, eth4, eth5 (all 10 GbE selected)
>>
>> team0 is used for internal networks and has one untagged and four
>> tagged VLAN interfaces, while team1 has an external uplink connection
>> without any VLANs.
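(For reference, a minimal teamd configuration for such an LACP aggregate might
look like the following - an illustrative sketch, not the actual
/etc/teamd.conf from this router:)

  {
      "device": "team1",
      "runner": { "name": "lacp", "active": true, "tx_hash": ["eth", "ipv4", "ipv6"] },
      "link_watch": { "name": "ethtool" },
      "ports": { "eth0": {}, "eth1": {}, "eth4": {}, "eth5": {} }
  }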
>>
>> The router runs an eBGP session on team1 to one of our uplinks, and
>> iBGP via team0 to our other border routers. It also runs OSPF on the
>> internal VLANs on team0. One thing I've noticed is that when OSPF is
>> not announcing a default gateway to the internal networks, so there is
>> almost no traffic coming in on team0 and out on team1, but still plenty
>> of traffic coming in on team1 and out via team0, there's no memory leak
>> (or at least it is so small that we haven't detected it). But as soon
>> as we configure OSPF to announce a default gateway to the internal
>> VLANs, so we get traffic from team0 to team1, the leaking begins.
>> Stopping the OSPF default gateway announcement again also stops the
>> leaking, but does not release already leaked memory.
>>
>> So this leads me to suspect that the leaking is related to RX on team0
>> (where XL710 eth9 is normally the only active interface) or TX on team1
>> (X710 eth0, eth1, eth4, eth5). The first bad commit is related to RX
>> cleaning, which suggests RX on team0. Since we're only seeing the leak
>> for our outbound traffic, I suspect either a difference between the
>> X710 vs. XL710 NICs, or that the inbound traffic is for relatively few
>> destination addresses (only our own systems) while the outbound traffic
>> is for many different addresses on the internet. But I'm just guessing
>> here.
>>
>> I've tried kmemleak, but it only found a few kB of suspected memory
>> leaks (several of which disappeared again after a while).
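(For reference, a minimal kmemleak pass, assuming the kernel was built with
CONFIG_DEBUG_KMEMLEAK=y:)

  mount -t debugfs nodev /sys/kernel/debug 2>/dev/null
  echo scan > /sys/kernel/debug/kmemleak   # trigger an immediate scan
  cat /sys/kernel/debug/kmemleak           # list suspected leaks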
>>
>> Below I've included more details - git bisect logs, ethtool -i, dmesg,
>> the kernel .config, and various memory-related /proc files. Any help or
>> suggestions would be much appreciated, and please let me know if more
>> information is needed or there's something I should try.
>>
>> Regards,
>> Anders K. Pedersen
>>
>> Hi Anders,
>>
>> I think I see the problem and should have a patch submitted shortly to
>> address it. From what I can tell, it looks like the issue is that we
>> weren't properly recycling the pages associated with descriptors that
>> contained an Rx programming status. For now the workaround would be to
>> try disabling ATR via the "ethtool --set-priv-flags" command. I should
>> have a patch out in the next hour or so that you can try testing to
>> verify if it addresses the issue.
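(For reference, private flags can be listed and toggled per interface with
ethtool; the exact ATR flag name depends on the driver version -
"flow-director-atr" below is an assumption:)

  ethtool --show-priv-flags eth0
  ethtool --set-priv-flags eth0 flow-director-atr off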
>>
>> Thanks.
>>
>> - Alex
>>
>> Thanks Alex,
>>
>> I will test the patch in our next service window on Tuesday morning.
>>
>> Regards,
>> Anders
>>
>>
>>
