Message-ID: <30dc6fa9-ea5e-50d6-56f9-fbc9627d8c29@grimberg.me>
Date: Fri, 20 Sep 2019 10:14:51 -0700
From: Sagi Grimberg <sagi@...mberg.me>
To: Long Li <longli@...rosoft.com>, Ming Lei <ming.lei@...hat.com>
Cc: Jens Axboe <axboe@...com>, Hannes Reinecke <hare@...e.com>,
John Garry <john.garry@...wei.com>,
Bart Van Assche <bvanassche@....org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
LKML <linux-kernel@...r.kernel.org>,
"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
Keith Busch <keith.busch@...el.com>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Christoph Hellwig <hch@....de>
Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>> Hey Ming,
>>
>>>>> Ok, so the real problem is per-cpu bounded tasks.
>>>>>
>>>>> I share Thomas' opinion about a NAPI-like approach.
>>>>
>>>> We already have that, it's irq_poll, but it seems that for this
>>>> use-case we get lower performance for some reason. I'm not entirely
>>>> sure why that is, maybe it's because we need to mask interrupts
>>>> because we don't have an "arm" register in nvme like network devices
>>>> have?
>>>
>>> Long observed that IOPS drops a lot too when switching to threaded irq.
>>> If ksoftirqd is woken up to handle the softirq, the performance shouldn't
>>> be better than with threaded irq.
>>
>> It's true that it shouldn't be any faster, but what irqpoll already has and we
>> don't need to reinvent is a proper budgeting mechanism that needs to kick in
>> when multiple devices map irq vectors to the same cpu core.
>>
>> irqpoll already maintains a percpu list and dispatches each ->poll with a
>> budget that the backend enforces, and irqpoll multiplexes between them.
>> Having this mechanism in irq (hard or threaded) context sounds a bit
>> unnecessary.
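>>
>> To make the budgeting concrete, here is a minimal, untested sketch of the
>> irq_poll pattern I mean (my_queue, my_reap_completions() and the weight of
>> 64 are made-up placeholders, not the actual nvme code):
>>
>> 	#include <linux/interrupt.h>
>> 	#include <linux/irq_poll.h>
>>
>> 	struct my_queue {
>> 		struct irq_poll iop;
>> 		int irq;
>> 	};
>>
>> 	/* driver-specific completion reaping, elided here */
>> 	static int my_reap_completions(struct my_queue *q, int budget);
>>
>> 	static irqreturn_t my_irq_handler(int irq, void *data)
>> 	{
>> 		struct my_queue *q = data;
>>
>> 		/* No "arm" register, so mask the vector while the
>> 		 * softirq-context poll loop runs. */
>> 		disable_irq_nosync(q->irq);
>> 		irq_poll_sched(&q->iop);
>> 		return IRQ_HANDLED;
>> 	}
>>
>> 	static int my_poll(struct irq_poll *iop, int budget)
>> 	{
>> 		struct my_queue *q = container_of(iop, struct my_queue, iop);
>> 		int done = my_reap_completions(q, budget);
>>
>> 		if (done < budget) {
>> 			/* Drained within budget: stop polling, unmask. */
>> 			irq_poll_complete(iop);
>> 			enable_irq(q->irq);
>> 		}
>> 		/* done == budget: irq_poll keeps us on the percpu list
>> 		 * and round-robins us with other ->poll instances that
>> 		 * share this core. */
>> 		return done;
>> 	}
>>
>> 	/* at init time: irq_poll_init(&q->iop, 64, my_poll); */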
>>
>> It seems like we're attempting to stay in irq context for as long as we can
>> instead of scheduling to softirq/thread context if we have more than a
>> minimal amount of work to do. Without at least understanding why
>> softirq/thread context degrades us so much, this code seems like the wrong
>> approach to me. Interrupt context will always be faster, but that is not a
>> sufficient reason to spend as much time as possible there, is it?
>>
>> We should also keep in mind that the networking stack has been doing this
>> for years; I would try to understand why this cannot work for nvme before
>> dismissing it.
>>
>>> In particular, Long found that context
>>> switches increase a lot after applying your irq poll patch.
>>>
>>> http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html
>>
>> Oh, I didn't see that one, wonder why... thanks!
>>
>> 5% improvement, I guess we can buy that for other users as is :)
>>
>> If we suffer from lots of context switches while the CPU is flooded with
>> interrupts, then I would argue that we're re-raising softirq too much.
>> In this use-case, my assumption is that the cpu cannot keep up with the
>> interrupts, not that it doesn't reap enough (we also reap the first batch
>> in interrupt context...)
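>>
>> For reference, the "first batch in interrupt context" idea looks roughly
>> like the untested sketch below, reusing the made-up helpers from the
>> sketch above (FIRST_BATCH is a made-up constant):
>>
>> 	#define FIRST_BATCH 8
>>
>> 	static irqreturn_t my_irq_handler(int irq, void *data)
>> 	{
>> 		struct my_queue *q = data;
>>
>> 		/* Reap a small first batch inline to keep latency low. */
>> 		if (my_reap_completions(q, FIRST_BATCH) < FIRST_BATCH)
>> 			return IRQ_HANDLED;
>>
>> 		/* Queue still busy: mask the vector and defer the rest
>> 		 * to irqpoll, so softirq is only raised under sustained
>> 		 * load. */
>> 		disable_irq_nosync(q->irq);
>> 		irq_poll_sched(&q->iop);
>> 		return IRQ_HANDLED;
>> 	}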
>>
>> Perhaps making irqpoll continue until it must resched would improve things
>> further? That is a latency vs. efficiency tradeoff, though; it looks like
>> MAX_SOFTIRQ_TIME is set to 2ms:
>>
>> "
>> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>> * certain cases, such as stop_machine(), jiffies may cease to
>> * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>> * well to make sure we eventually return from this method.
>> *
>> * These limits have been established via experimentation.
>> * The two things to balance is latency against fairness -
>> * we want to handle softirqs as soon as possible, but they
>> * should not be able to lock up the box.
>> "
>>
>> Long, does this patch make any difference?
>
> Sagi,
>
> Sorry it took a while to bring my system back online.
>
> With the patch, the IOPS drop is about the same as with the 1st patch. I think the excessive context switches are causing the drop in IOPS.
>
> The following were captured by "perf sched record" for 30 seconds during the tests.
>
> "perf sched latency"
> With patch:
> fio:(82) | 937632.706 ms | 1782255 | avg: 0.209 ms | max: 63.123 ms | max at: 768.274023 s
>
> without patch:
> fio:(82) |2348323.432 ms | 18848 | avg: 0.295 ms | max: 28.446 ms | max at: 6447.310255 s
Does "without patch" mean the proposed hard-irq patch?
If we are context switching too much, it means the soft-irq operation is
not efficient, not necessarily that the completion path is running in
soft-irq.
Is your kernel compiled with full preemption or voluntary preemption?
> Looking closer at each CPU, we can see ksoftirqd competing for CPU with fio (and effectively throttling other fio processes)
> (captured in /sys/kernel/debug/tracing, echo sched:* >set_event)
>
> On CPU1 with patch: (note that prev_state for fio is "R"; it was preempted)
> <...>-4077 [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
> <...>-17 [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
> <...>-4077 [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
> <...>-17 [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
>
> On CPU1 without patch: (prev_state for fio is "S"; it scheduled out voluntarily)
> <idle>-0 [001] d... 6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
> fio-14342 [001] d... 6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
> <idle>-0 [001] d... 6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
fio-14342 [001] d... 6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120