Date:   Mon, 6 Feb 2017 18:46:19 +0000
From:   Will Deacon <will.deacon@....com>
To:     Brian Starkey <brian.starkey@....com>
Cc:     Eric Dumazet <edumazet@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alexander Potapenko <glider@...gle.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
        linux@...linux.org.uk
Subject: Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let
 ksoftirqd do its job"

Hi all,

I've also stumbled over this issue with the ARM fastmodel and, somewhat
embarrassingly, blamed the model developers for the regression. I'm using
NFS and copying a ~14MB file from it to a virtio-blk device, which takes
over 20 minutes with 4cd13c21b207 but <1 min with it reverted.

I also think I've figured out what's going on. See below.

On Fri, Nov 25, 2016 at 01:14:03PM +0000, Brian Starkey wrote:
> On Wed, Nov 23, 2016 at 12:03:28PM -0800, Eric Dumazet wrote:
> >On Wed, Nov 23, 2016 at 10:21 AM, Brian Starkey <brian.starkey@....com> wrote:
> >
> >>This patch didn't help.
> >>
> >>I did get some new traces though - I've attached the diff for the
> >>trace_printks I added.
> >>
> >>Before 4cd13c21b207:
> >>https://drive.google.com/open?id=0B8siaK6ZjvEwcEtOeFQzTmY0Nnc
> >>After 4cd13c21b207:
> >>https://drive.google.com/open?id=0B8siaK6ZjvEwZnQ4MVg1d3d1Tm8
> >>
> >>It looks like the difference is that after 4cd13c21b207 the RX softirq
> >>isn't running, and RX interrupts don't call softirq_raise anymore -
> >>presumably because there's one pending, but I didn't have time to
> >>track that down to a code-path.
> >>
> >>Cheers,
> >>-Brian
> >>
> >
> >Hi Brian
> >
> >Looks like netif_rx() drops the incoming packets then ?
> >
> >Maybe netif_running() is not happy :(
> >
> >Could you trace netif_rx() return value (NET_RX_SUCCESS or NET_RX_DROP)
> 
> Some packets are dropped, but not very many:
> 
>   $ grep NET_RX_SUCCESS trace_netif_rx.txt | wc -l
>   14399
>   $ grep NET_RX_DROP trace_netif_rx.txt | wc -l
>   22
> 
> Without the ksoftirqd change there were zero NET_RX_DROPs.

The SMC91x has an on-chip 8KB FIFO (i.e. there's no DMA going on here).
When the FIFO is full (every 4 TCP packets in my case), we get an
interrupt and run down the smc_rcv path. There, we allocate an skb for
the data (netdev_alloc_skb) and copy the data out of the FIFO
(SMC_PULL_DATA) into the buffer, which we hand over to the network core via
netif_rx.
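
Roughly, the receive path looks like this (paraphrasing smc_rcv() from
drivers/net/ethernet/smsc/smc91x.c, with the FIFO bookkeeping and error
handling elided, so not the exact code):

static inline void smc_rcv(struct net_device *dev)
{
	struct smc_local *lp = netdev_priv(dev);
	unsigned int packet_len;
	struct sk_buff *skb;
	unsigned char *data;

	/* ... read the packet number/status/length out of the FIFO ... */

	/* atomic allocation -- we're in hardirq context here */
	skb = netdev_alloc_skb(dev, packet_len);
	if (unlikely(!skb)) {
		dev->stats.rx_dropped++;
		return;
	}

	data = skb_put(skb, packet_len);
	SMC_PULL_DATA(lp, data, packet_len);	/* PIO copy out of the FIFO */

	skb->protocol = eth_type_trans(skb, dev);
	netif_rx(skb);				/* hand off to the network core */
}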

The problem is that netif_rx defers to ksoftirqd to process the packet
and, more crucially, to *free* the skb after it's been consumed. Since the
skb was allocated in IRQ context, we end up exhausting our GFP_ATOMIC
memory: the tiny FIFO depth means ksoftirqd is interrupted so frequently
that buffers are allocated at a much higher rate than they are freed. This
may be exaggerated by the relative speed of the model's emulated CPU with
respect to the network interface, but I'd expect it to be reproducible on
real hardware too (rmk, cc'd, was going to give that a go).
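
To spell that out: in IRQ context netdev_alloc_skb() is a GFP_ATOMIC
allocation, and netif_rx() doesn't consume the skb at all -- it just
queues it and raises NET_RX_SOFTIRQ, so nothing gets freed until the
softirq actually runs (condensed from include/linux/skbuff.h and
net/core/dev.c, not the exact code):

static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev,
					       unsigned int length)
{
	return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
}

/*
 * netif_rx()
 *   -> enqueue_to_backlog()          skb parked on the per-cpu backlog
 *     -> ____napi_schedule()
 *       -> __raise_softirq_irqoff(NET_RX_SOFTIRQ)
 *
 * ...and only later, in softirq (now ksoftirqd) context:
 *
 * net_rx_action() -> __netif_receive_skb() -> ... -> consume_skb()
 */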

Prior to 4cd13c21b207, we'd always run softirqs synchronously on the
hardirq exit path and therefore have a chance to free some skbs before
actually EOI'ing the hardirq and allowing the FIFO-full interrupt to
interrupt us again.
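
As far as I can tell, that's exactly the behaviour 4cd13c21b207 changes:
invoke_softirq() (and do_softirq()) now bail out whenever the ksoftirqd
thread is runnable, leaving the pending NET_RX_SOFTIRQ for the thread
instead of running it on hardirq exit. Condensed from kernel/softirq.c
after the commit:

static bool ksoftirqd_running(void)
{
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	return tsk && (tsk->state == TASK_RUNNING);
}

static inline void invoke_softirq(void)
{
	if (ksoftirqd_running())
		return;		/* defer everything to ksoftirqd */

	/* ... otherwise run the pending softirqs here, on hardirq exit ... */
}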

Converting the smc91x driver over to NAPI would probably solve this problem,
but given the "vintage" of this code, I'd be more tempted by a simpler
point fix if only I could think of one.
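
For reference, a NAPI conversion would look something like the sketch
below (completely untested; smc_rx_pending(), smc_{disable,enable}_rx_irq()
and the napi/dev plumbing in struct smc_local are hypothetical, and the
receive path would need netif_receive_skb() instead of netif_rx()):

/* Untested sketch -- the smc_* helpers, lp->napi and lp->dev are made up. */
static int smc_poll(struct napi_struct *napi, int budget)
{
	struct smc_local *lp = container_of(napi, struct smc_local, napi);
	int work_done = 0;

	while (work_done < budget && smc_rx_pending(lp)) {
		smc_rcv(lp->dev);	/* as today, but netif_receive_skb() */
		work_done++;
	}

	if (work_done < budget) {
		napi_complete(napi);
		smc_enable_rx_irq(lp);
	}

	return work_done;
}

/* ...and in the interrupt handler, instead of calling smc_rcv() directly: */
	smc_disable_rx_irq(lp);
	napi_schedule(&lp->napi);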

Any ideas?

Will
