Date:   Thu, 20 Aug 2020 11:13:27 -0700
From:   Josh Hunt <johunt@...mai.com>
To:     Jike Song <albcamus@...il.com>
Cc:     Paolo Abeni <pabeni@...hat.com>,
        Jonas Bonn <jonas.bonn@...rounds.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Michael Zhivich <mzhivich@...mai.com>,
        David Miller <davem@...emloft.net>,
        John Fastabend <john.fastabend@...il.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        kehuan.feng@...il.com
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Jike

On 8/20/20 12:43 AM, Jike Song wrote:
> Hi Josh,
> 
> 
> We met possibly the same problem when testing nvidia/mellanox's
> GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> can also have a try?

We did something similar: we have switched over to using the fq
scheduler everywhere for now. We believe the bug is in the NOLOCK code,
which only pfifo_fast uses at the moment, but we have been unable to
come up with a satisfactory fix.

> 
> Besides, our testing is pretty complex, do you have a quick test to
> reproduce it?
> 

Unfortunately we don't have a simple test case either. Our current
reproducer is complex as well. It seems like we should be able to come
up with something simpler: maybe 2 threads trying to send on the same
tx queue running pfifo_fast every few hundred milliseconds, with little
or no other tx traffic on that queue.

IIRC, we believe the scenario is this: one thread is in the process of
dequeuing a packet while another is enqueuing. The enqueuer sees that a
dequeue is in progress and so does not xmit the packet, assuming the
dequeue operation will take care of it. However, because the dequeue is
already in the process of completing, it doesn't, and the newly enqueued
packet stays in the qdisc until another packet is enqueued, pushing both
out.
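
The suspected interleaving can be walked through with a toy model. This is
only a sketch of the scenario as described above, not the kernel's code:
the ToyQdisc class, its "running" flag, and the helper names are all
hypothetical stand-ins, and the steps are executed sequentially in one
fixed order rather than by real threads.

```python
from collections import deque


class ToyQdisc:
    """Hypothetical model: a FIFO plus a flag meaning 'a dequeue is in progress'."""

    def __init__(self):
        self.queue = deque()
        self.running = False
        self.sent = []

    def try_begin_run(self):
        # Only one thread may drain the queue at a time.
        if self.running:
            return False
        self.running = True
        return True

    def end_run(self):
        self.running = False

    def drain(self):
        # Transmit everything currently queued.
        while self.queue:
            self.sent.append(self.queue.popleft())


def racy_interleaving():
    q = ToyQdisc()

    # Thread A enqueues pkt1, becomes the runner, and drains the queue.
    q.queue.append("pkt1")
    assert q.try_begin_run()
    q.drain()
    # A has seen an empty queue but has not yet cleared the flag.

    # Thread B enqueues pkt2 and sees that a dequeue is in progress,
    # so it backs off, assuming A will transmit pkt2.
    q.queue.append("pkt2")
    b_ran = q.try_begin_run()

    # A finishes without looking at the queue again.
    q.end_run()
    stuck = list(q.queue)

    # Only a later enqueue pushes the stuck packet out.
    q.queue.append("pkt3")
    if q.try_begin_run():
        q.drain()
        q.end_run()

    return b_ran, stuck, q.sent
```

In this model, pkt2 sits in the queue with nobody responsible for sending
it until pkt3 arrives, which matches the symptom of a packet being pushed
out only by the next enqueue.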

Given that we have a workaround (using fq, or any other qdisc not named
pfifo_fast), this has been bumped down in priority for us. I would like
to work on a reproducer at some point, but it likely won't be for a few
weeks :(
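
As a possible starting point, the traffic pattern suggested above (2
threads sending every few hundred milliseconds, with little else on the
queue) might be generated along these lines. This is an untested sketch
and not a confirmed reproducer: the function names, the use of UDP, and
the shared socket are all assumptions, and whether both threads actually
land on the same tx queue depends on the driver and queue-selection
configuration.

```python
import socket
import threading
import time


def sender(sock, dest, interval_s, count, payload, sent_counts, idx):
    # Send one small datagram, then stay idle until the next interval.
    for _ in range(count):
        sock.sendto(payload, dest)
        sent_counts[idx] += 1
        time.sleep(interval_s)


def run_senders(dest, interval_s=0.3, count=10, n_threads=2):
    # Assumption: sharing one UDP socket makes it more likely that both
    # threads use the same tx queue. This is not guaranteed.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent_counts = [0] * n_threads
    threads = [
        threading.Thread(
            target=sender,
            args=(sock, dest, interval_s, count, b"ping", sent_counts, i),
        )
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    sock.close()
    return sent_counts
```

The destination passed to run_senders would need to be reachable through
an interface whose tx queues run pfifo_fast, and a receiver on the far
side would have to timestamp arrivals to spot a delayed packet.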

Josh
