Message-Id: <20180109133623.10711-2-dima@arista.com>
Date: Tue, 9 Jan 2018 13:36:22 +0000
From: Dmitry Safonov <dima@...sta.com>
To: linux-kernel@...r.kernel.org
Cc: 0x7f454c46@...il.com, Dmitry Safonov <dima@...sta.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Miller <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Frederic Weisbecker <fweisbec@...il.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Ingo Molnar <mingo@...nel.org>,
"Levin, Alexander (Sasha Levin)" <alexander.levin@...izon.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Paolo Abeni <pabeni@...hat.com>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
Radu Rendec <rrendec@...sta.com>,
Rik van Riel <riel@...hat.com>,
Stanislaw Gruszka <sgruszka@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Wanpeng Li <wanpeng.li@...mail.com>
Subject: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
Warning: Not merge-ready
I. Current workflow of ksoftirqd.
Softirqs are processed in the context of ksoftirqd only if they
are being raised very frequently. How it works:
do_softirq() and invoke_softirq() defer pending softirqs iff
ksoftirqd is on the runqueue. Ksoftirqd is scheduled mostly at
the end of softirq processing, when 2ms were not enough to
process all pending softirqs.
Here is a pseudo-picture of the workflow (for simplicity, on UP):
------------- ------------------ ------------------
| ksoftirqd | | User's process | | Softirqs |
------------- ------------------ ------------------
Not scheduled Running
|
o------------------------o
|
__do_softirq()
|
2ms & softirq pending?
Schedule ksoftirqd
|
Scheduled o------------------------o
|
o--------------------o
|
Running Scheduled
|
o--------------------o
|
Not scheduled Running
Timegraph for the workflow:
dash (-) means ksoftirqd is not scheduled;
equals (=) means ksoftirqd is scheduled (a softirq may still be pending)
Pending softirqs
| | | | | | | | |
v v v v | | | | v
Processing o-----o | | | | o--o
softirqs | | | | | | | |
| | | | | | | |
| | | | | | | |
Userspace o-o o=========o | | | | o----o o---------o
<-2ms-> | | | | | |
| v v v v |
Ksoftirqd o----------o
II. Corner-conditions.
During testing of commit [1] on a non-mainstream driver,
I found that, due to platform specifics, the IRQ was being
raised too late (after the softirq had been processed).
As a result, softirqs steal time from the userspace process,
leaving it starved of CPU time, while ksoftirqd is never/rarely
scheduled:
Pending softirqs
| | | | | |
v v v v v v
Processing o-----o o-----o o-----o o-----o o-----o o ...
softirqs | | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
Userspace o-o o-o o-o o-o o-o o-o (starving)
Ksoftirqd (rarely scheduled)
Afterwards I suspected that the same may happen on mainstream
kernels if the PPS rate is chosen so that an IRQ is raised just
after the previous softirq has been processed. I managed to
reproduce the conjecture, see (IV).
III. RFC proposal.
Firstly, I tried to account all time spent in softirq processing
to the ksoftirqd thread that serves the local CPU, and to compare
the vruntime of ksoftirqd and the current task to decide whether
a softirq should be delayed. You can imagine what disgraceful
hacks that involved. The current RFC has nothing of that kind and
relies on fair scheduling of ksoftirqd and other tasks.
To do that, we check the pending softirqs and serve them in the
current context only if non-net softirqs are pending.
The following patch adds a mask to __do_softirq() so that
net softirqs are processed only in ksoftirqd context when
multiple softirqs are pending.
IV. Test results.
Unfortunately, I wasn't able to test it on hardware with a
mainstream kernel, so I only have results from Qemu VMs running
Fedora 26. The first VM stresses the second with UDP packets
generated by pktgen. The receiver VM runs the udp_sink[2]
program, which prints the number of PPS served.
The VMs use virtio network adapters, run with RT priority and are
pinned to different CPUs on the host.
The host's CPU is an Intel Core i7-7600U @ 2.80GHz.
The RFC definitely needs testing on real HW (I don't expect
anyone to put much faith in VM perf testing) - any help with
testing it would be appreciated.
Source | Destination
(PPS)   | (PPS)
--------|------------------------------------
        | master (4.15-rc4) | RFC            |
--------|------------------|----------------|
5000 | 5000.7 | 4999.7 |
--------|------------------|----------------|
7000 | 6997.42 | 6995.88 |
--------|------------------|----------------|
8000 | 7999.55 | 7999.86 |
--------|------------------|----------------|
9000 | 8951.37 | 8986.30 |
--------|------------------|----------------|
10000 | 9864.96 | 9972.05 |
--------|------------------|----------------|
11000 | 10711.92 | 10976.26 |
--------|------------------|----------------|
12000 | 11494.79 | 11962.40 |
--------|------------------|----------------|
13000 | 12161.76 | 12946.91 |
--------|------------------|----------------|
14000 | 11152.07 | 13942.96 |
--------|------------------|----------------|
15000 | 8650.22 | 14878.26 |
--------|------------------|----------------|
16000 | 7662.55 | 15880.60 |
--------|------------------|----------------|
17000 | 6485.49 | 16814.07 |
--------|------------------|----------------|
18000 | 5489.48 | 17679.69 |
--------|------------------|----------------|
19000 | 4679.59 | 18543.60 |
--------|------------------|----------------|
20000 | 4738.24 | 19233.56 |
--------|------------------|----------------|
21000 | 4015.00 | 20247.50 |
--------|------------------|----------------|
22000 | 4376.99 | 20654.62 |
--------|------------------|----------------|
23000 | 9429.80 | 20925.07 |
--------|------------------|----------------|
24000 | 8872.33 | 21336.31 |
--------|------------------|----------------|
25000 | 19824.67 | 21486.84 |
--------|------------------|----------------|
30000 | 20779.49 | 21487.15 |
--------|------------------|----------------|
40000 | 24559.83 | 21452.74 |
--------|------------------|----------------|
50000 | 18469.20 | 21191.34 |
--------|------------------|----------------|
100000 | 19773.00 | 22592.28 |
--------|------------------|----------------|
Note that I tested in VMs, and I found that if I generate more
hw IRQs on the host, the results for master are not as
dramatically bad, though still much worse than with the RFC.
For that reason I have qualms about whether my test results are
representative.
V. References:
[1] 4cd13c21b207 ("softirq: Let ksoftirqd do its job")
[2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
Signed-off-by: Dmitry Safonov <dima@...sta.com>
---
kernel/softirq.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae2..ee48f194dcec 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,6 +88,28 @@ static bool ksoftirqd_running(void)
return tsk && (tsk->state == TASK_RUNNING);
}
+static bool defer_softirq(void)
+{
+ __u32 pending = local_softirq_pending();
+
+ if (!pending)
+ return true;
+
+ if (ksoftirqd_running())
+ return true;
+
+ /*
+ * Defer net rx/tx softirqs to ksoftirqd processing, as they may
+ * starve userspace of CPU time.
+ */
+ if (pending & ((1 << NET_RX_SOFTIRQ) | (1 << NET_TX_SOFTIRQ))) {
+ wakeup_softirqd();
+ return true;
+ }
+
+ return false;
+}
+
/*
* preempt_count and SOFTIRQ_OFFSET usage:
* - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
@@ -315,7 +337,6 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
asmlinkage __visible void do_softirq(void)
{
- __u32 pending;
unsigned long flags;
if (in_interrupt())
@@ -323,9 +344,7 @@ asmlinkage __visible void do_softirq(void)
local_irq_save(flags);
- pending = local_softirq_pending();
-
- if (pending && !ksoftirqd_running())
+ if (!defer_softirq())
do_softirq_own_stack();
local_irq_restore(flags);
@@ -352,7 +371,7 @@ void irq_enter(void)
static inline void invoke_softirq(void)
{
- if (ksoftirqd_running())
+ if (defer_softirq())
return;
if (!force_irqthreads) {
--
2.13.6