netdev - Sticky packet drops on mlx5 RX queue

Message-ID: <BYAPR15MB2278A68E25BE057956B7D46BC0840@BYAPR15MB2278.namprd15.prod.outlook.com>
Date:   Thu, 10 Jan 2019 22:41:24 +0000
From:   Pieter Noordhuis <pietern@...com>
To:     "saeedm@...lanox.com" <saeedm@...lanox.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC:     Jes Sorensen <jsorensen@...com>, Alexei Starovoitov <ast@...com>,
        "John Reumann" <bold@...com>, Mark Marchukov <march@...com>,
        Yonghong Song <yhs@...com>,
        "jonathan.lemon@...il.com" <jonathan.lemon@...il.com>
Subject: Sticky packet drops on mlx5 RX queue

I'm looking into an issue with mlx5 on 4.11.3. It is triggered by high memory pressure but continues long after the memory pressure is gone. The driver starts to continuously use pfmemalloc pages, some of which appear to be coming from an RX queue's page cache.

Attached is a log file showing a second-by-second diff of ethtool counters for a single RX queue that was showing this behavior. This log doesn't capture the start of these drops, because the ethtool monitoring is only started after the first drops are detected. Every increase of the “cache_waive” counter means mlx5 refused to add a page to its page cache because it was a pfmemalloc page. It also means the corresponding packet gets dropped in sk_filter_trim_cap.
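A minimal sketch of how such a per-interval diff could be computed, assuming the counters are available as `name: value` lines (the layout printed by `ethtool -S`). The function names and the `prefix` filter are illustrative, and the exact per-queue counter names (for example an `rx25_` prefix) are an assumption rather than something taken from the attached log.

```python
import re

# One "name: value" pair per line, with optional leading whitespace.
COUNTER_LINE = re.compile(r"^\s*(\S+):\s*(-?\d+)\s*$")


def parse_counters(text, prefix=""):
    """Parse 'name: value' lines into a dict, keeping names that start with prefix."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            counters[match.group(1)] = int(match.group(2))
    return counters


def diff_counters(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }
```

Taking one snapshot per second and passing consecutive pairs to `diff_counters` would yield the kind of second-by-second view described above.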

Initially, the log shows the “cache_busy” counter increasing, meaning that the first page in the page cache has >1 references, so it can't be reused. Then, after roughly a minute, it switches to increasing the “cache_reuse” and “cache_waive” counters. This means that the pages are coming from the RX queue's page cache and are not put back because they are pfmemalloc pages. This is highly suspicious, as they shouldn't end up in the page cache in the first place. Then, after reusing 255 pages from the page cache, the “cache_empty” counter starts to increase, in lock step with the “cache_waive” counter. This means that the pages are allocated with dev_alloc_pages and not placed in the page cache, because they are pfmemalloc pages. This is also suspicious, because with the memory pressure gone, dev_alloc_pages shouldn't be returning pfmemalloc pages. By the time it stops incrementing “cache_waive”, a total of 3804 pages had been waived (and as many packets dropped), over a duration of 1895 seconds.
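The counter semantics described above can be summarized in a toy model. This is a sketch of the behavior as described in this message, not the mlx5 driver code: the `Page` and `RxPageCache` classes, the default capacity of 255 (chosen to match the 255 reuses seen in the log), and the handling of a full cache are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Page:
    pfmemalloc: bool = False  # page came from the emergency reserves
    refcount: int = 1         # >1 means someone else still holds the page


@dataclass
class RxPageCache:
    capacity: int = 255  # assumption, matches the 255 reuses in the log
    pages: deque = field(default_factory=deque)
    stats: dict = field(default_factory=lambda: dict.fromkeys(
        ("cache_reuse", "cache_empty", "cache_busy", "cache_waive"), 0))

    def get(self):
        """Return a cached page, or None if a fresh page must be allocated."""
        if not self.pages:
            self.stats["cache_empty"] += 1
            return None
        if self.pages[0].refcount > 1:
            self.stats["cache_busy"] += 1
            return None
        self.stats["cache_reuse"] += 1
        return self.pages.popleft()

    def put(self, page):
        """Try to cache a page; False means the page is released instead."""
        if len(self.pages) >= self.capacity:
            return False  # full cache, not counted in this model
        if page.pfmemalloc:
            self.stats["cache_waive"] += 1
            return False
        self.pages.append(page)
        return True
```

In this model, `put` is the only way into the cache and it rejects pfmemalloc pages, so a `get` that bumps “cache_reuse” followed by a `put` that bumps “cache_waive” for the same page can only happen if the page's pfmemalloc status changed, or was not checked, at some point after it was cached.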

What I would expect is for “cache_reuse” and “cache_waive” never to be incremented in lock step, as pfmemalloc pages must never be added to the RX queue page cache to begin with. Similarly, I would expect “cache_empty” and “cache_waive” never to be incremented in lock step if there is no memory pressure.
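A minimal sketch of how these two expectations could be checked against per-interval counter deltas, assuming each interval is available as a dict mapping counter name to its increase. The function name, the `strict` flag, and the input layout are hypothetical and do not reflect the format of the attached log.

```python
def lock_step_intervals(deltas, first, second, strict=False):
    """Return indices of intervals in which both counters increased.

    deltas: list of dicts mapping counter name -> increase in that interval.
    strict: if True, additionally require the two increases to be equal.
    """
    hits = []
    for index, delta in enumerate(deltas):
        a = delta.get(first, 0)
        b = delta.get(second, 0)
        if a > 0 and b > 0 and (not strict or a == b):
            hits.append(index)
    return hits
```

Any interval returned for the pair (“cache_reuse”, “cache_waive”), or for (“cache_empty”, “cache_waive”) once memory pressure is gone, would be a violation of the expectations stated above.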

Static analysis of mlx5 on 4.11.3 has so far not led to any insights as to why this is happening. Any help in this investigation is much appreciated. If there is any additional information I can provide, please let me know.

Pieter
View attachment "rx25_ethtool.txt" of type "text/plain" (455900 bytes)