Message-ID: <1462977516.4444.73.camel@redhat.com>
Date: Wed, 11 May 2016 16:38:36 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: Hannes Frederic Sowa <hannes@...essinduktion.org>,
Eric Dumazet <edumazet@...gle.com>,
netdev <netdev@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>,
Jiri Pirko <jiri@...lanox.com>,
Daniel Borkmann <daniel@...earbox.net>,
Alexei Starovoitov <ast@...mgrid.com>,
Alexander Duyck <aduyck@...antis.com>,
Tom Herbert <tom@...bertland.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, Rik van Riel <riel@...hat.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 0/2] net: threadable napi poll loop
On Wed, 2016-05-11 at 06:08 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> > Hi Eric,
> > On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > >
> > > > We wanted to present this not only as a bugfix but also as a
> > > > performance enhancement in the case of virtio (as you can see in the
> > > > cover letter). Given that a long time ago there was a tendency to
> > > > remove softirqs completely, we thought it might be very interesting
> > > > that a threaded napi in general seems to be absolutely viable nowadays
> > > > and might offer new features.
> > >
> > > Well, you did not fix the bug, you worked around it by adding yet another
> > > layer, with another sysctl that admins or programs have to manage.
> > >
> > > If you have a special need for virtio, do not hide it behind a 'bug fix'
> > > but add it as a feature request.
> > >
> > > This ksoftirqd issue is real and a fix looks very reasonable.
> > >
> > > Please try this patch, as I had very good success with it.
> >
> > Thank you for your time and your effort.
> >
> > I tested your patch on bare metal in the "single core" scenario, disabling
> > the unneeded cores with:
> > CPUS=`nproc`
> > for I in `seq 1 $((CPUS - 1))`; do echo 0 > /sys/devices/system/node/node0/cpu$I/online; done
> >
> > And I got a:
> >
> > [ 86.925249] Broke affinity for irq <num>
> >
>
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?
The above message is emitted with pr_notice() by the x86 version of
fixup_irqs(). It's not fatal; the host is alive and well after that. The
un-patched kernel does not emit it when CPUs are disabled.
I'll try to look into this later.
> Looks like another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
>
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.
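For reference, the kind of caller Eric is describing looks roughly like
this (illustrative sketch only, not code from the patches; the helper name
is made up):

    #include <linux/interrupt.h>
    #include <linux/bottom_half.h>

    /* Raise a softirq inside a BH-disabled section and rely on the
     * outermost local_bh_enable() to run it immediately on this cpu. */
    static void rely_on_bh_enable(void)
    {
            local_bh_disable();
            raise_softirq(NET_RX_SOFTIRQ);  /* sets the pending bit */
            local_bh_enable();              /* expected to process it now,
                                             * unless deferred to ksoftirqd */
    }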
>
> > for each irq number generated by a network device.
> >
> > In this scenario, your patch solves the ksoftirqd issue, performing
> > comparably to the napi threaded patches (with a negative delta in the
> > noise range) and introducing a minor regression with a single flow, in
> > the noise range (3%).
> >
> > As said in a previous mail, we actually experimented with something similar,
> > but it felt quite hackish.
>
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.
>
> >
> > AFAICS this patch adds three more tests in the fast path and affects all
> > other softirq use cases. I'm not sure how to check for regressions there.
>
> It is obvious to me that the ksoftirqd mechanism is not working as intended.
>
> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
>
> I can not tell before trying.
>
> Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem:
> we normally schedule ksoftirqd under stress, but most of the time the
> softirq items are processed by other tasks, as you found out.
>
>
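(For completeness, a quick way to dump the relevant scheduler counters,
assuming you pass the ksoftirqd pid yourself; the field names below are the
ones found in /proc/<pid>/sched:)

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            char path[64], line[256];
            FILE *f;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <ksoftirqd pid>\n", argv[0]);
                    return 1;
            }
            snprintf(path, sizeof(path), "/proc/%s/sched", argv[1]);
            f = fopen(path, "r");
            if (!f) {
                    perror(path);
                    return 1;
            }
            /* Low sum_exec_runtime despite many switches means ksoftirqd is
             * woken, but the softirq work actually runs elsewhere. */
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, "sum_exec_runtime") ||
                        strstr(line, "nr_switches"))
                            fputs(line, stdout);
            fclose(f);
            return 0;
    }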
> >
> > The napi thread patches are actually a new feature that also fixes the
> > ksoftirqd issue: hunting the ksoftirqd issue was the initial
> > trigger for this work. I'm sorry for not being clear enough in the cover
> > letter.
> >
> > The napi thread patches offer additional benefits, i.e. a relevant
> > additional gain in the described test scenario, and they do not impact
> > other subsystems/kernel entities.
> >
> > I still think they are worthwhile, and I bet you will disagree, but could
> > you please articulate more which parts concern you most and/or seem too
> > bloated ?
>
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how can jiffies ever change since you block BH ?
Uh, we likely have the same issue in the net_rx_action() function, which
also executes with BH disabled and checks for jiffies changes even on
single-core hosts ?!?
Aren't jiffies updated by the timer interrupt, and thus even with
BH disabled ?!?
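To make the point concrete: the bound in question looks roughly like the
sketch below (simplified; poll_one() is a made-up stand-in for polling one
device, 300 is the netdev_budget default). jiffies does keep advancing under
local_bh_disable(), because it is updated from the timer hard interrupt, not
from a softirq.

    #include <linux/jiffies.h>
    #include <linux/bottom_half.h>

    static int poll_one(void)
    {
            return 64;      /* stub: pretend a full weight was processed */
    }

    static void poll_with_time_limit(void)
    {
            unsigned long time_limit = jiffies + 2; /* ~2 ticks of polling */
            int budget = 300;

            local_bh_disable();
            do {
                    budget -= poll_one();
            } while (budget > 0 && !time_after_eq(jiffies, time_limit));
            local_bh_enable();
    }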
> I was planning to remove cond_resched_softirq(), which we no longer use
> after my recent changes to the TCP stack,
> and you call it again (while it is obviously buggy, since it does not
> check if a BH is pending, only if a thread needs the cpu).
I missed that, thank you for pointing out.
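(For reference, the check being criticized amounts to something like this
paraphrased sketch, not the exact scheduler code:)

    #include <linux/sched.h>
    #include <linux/bottom_half.h>

    /* Only asks "does another task need the cpu?"; it never consults
     * local_softirq_pending(), so pending softirqs alone never make it
     * yield. */
    static int cond_resched_softirq_sketch(void)
    {
            if (need_resched()) {
                    local_bh_enable();      /* softirqs may run here */
                    schedule();
                    local_bh_disable();
                    return 1;
            }
            return 0;
    }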
> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.
>
> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
>
> Instead of the nice existing netif_rx() doing rounds of 64 packets per
> device, you'll now rely on process scheduler behavior that has no such
> granularity.
>
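(The 64-packets-per-device round mentioned above is the NAPI weight; a rough
driver-side sketch, with all helper names made up:)

    #include <linux/netdevice.h>

    static int fake_backlog = 256;                  /* pretend rx backlog */
    static int rx_ring_has_packets(void) { return fake_backlog > 0; }
    static int process_one_packet(void)  { fake_backlog--; return 1; }

    /* A napi poll() callback may consume at most min(weight, budget)
     * packets per round; the core then moves on to the next device. */
    static int my_poll(struct napi_struct *napi, int budget)
    {
            int done = 0;

            while (done < budget && rx_ring_has_packets())
                    done += process_one_packet();
            if (done < budget)
                    napi_complete(napi);    /* ring drained, stop polling */
            return done;
    }

    /* Registered at probe time with the default weight of 64, e.g.:
     * netif_napi_add(netdev, &priv->napi, my_poll, 64); */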
> Adding more threads is the natural answer of userland programmers, but
> in the kernel it is not the right answer. We already have mechanisms;
> just use them and fix them if they are broken.
>
> Sorry, I really do not think your patches are the way to go.
> But this thread is definitely interesting.
Oh, this is a far better comment than I would have expected ;-)
Cheers,
Paolo