Message-ID: <30dc6fa9-ea5e-50d6-56f9-fbc9627d8c29@grimberg.me>
Date: Fri, 20 Sep 2019 10:14:51 -0700
From: Sagi Grimberg <sagi@...mberg.me>
To: Long Li <longli@...rosoft.com>, Ming Lei <ming.lei@...hat.com>
Cc: Jens Axboe <axboe@...com>, Hannes Reinecke <hare@...e.com>,
John Garry <john.garry@...wei.com>,
Bart Van Assche <bvanassche@....org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
LKML <linux-kernel@...r.kernel.org>,
"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>,
Keith Busch <keith.busch@...el.com>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Christoph Hellwig <hch@....de>
Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
>> Hey Ming,
>>
>>>>> Ok, so the real problem is per-cpu bounded tasks.
>>>>>
>>>>> I share Thomas' opinion about a NAPI-like approach.
>>>>
>>>> We already have that, it's irq_poll, but it seems that for this
>>>> use-case we get lower performance for some reason. I'm not entirely
>>>> sure why that is, maybe it's because we need to mask interrupts
>>>> because we don't have an "arm" register in nvme like network devices
>>>> have?
>>>
>>> Long observed that IOPS drops a lot too when switching to threaded irq.
>>> If ksoftirqd is woken up to handle the softirq, the performance shouldn't
>>> be better than with threaded irq.
>>
>> It's true that it shouldn't be any faster, but what irqpoll already has and we
>> don't need to reinvent is a proper budgeting mechanism that needs to kick in
>> when multiple devices map irq vectors to the same cpu core.
>>
>> irqpoll already maintains a percpu list and dispatches each ->poll with a
>> budget that the backend enforces, and irqpoll multiplexes between them.
>> Having this mechanism in irq (hard or threaded) context sounds a bit
>> unnecessary.
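>>
>> To make the budgeting concrete, here is a minimal, untested sketch of the
>> irq_poll pattern I mean (my_queue, my_reap_completions() and the weight of
>> 64 are made-up placeholders, not the actual nvme code):
>>
>> 	#include <linux/interrupt.h>
>> 	#include <linux/irq_poll.h>
>>
>> 	struct my_queue {
>> 		struct irq_poll iop;
>> 		int irq;
>> 	};
>>
>> 	/* driver-specific completion reaping, elided here */
>> 	static int my_reap_completions(struct my_queue *q, int budget);
>>
>> 	static irqreturn_t my_irq_handler(int irq, void *data)
>> 	{
>> 		struct my_queue *q = data;
>>
>> 		/* No "arm" register, so mask the vector while the
>> 		 * softirq-context poll loop runs. */
>> 		disable_irq_nosync(q->irq);
>> 		irq_poll_sched(&q->iop);
>> 		return IRQ_HANDLED;
>> 	}
>>
>> 	static int my_poll(struct irq_poll *iop, int budget)
>> 	{
>> 		struct my_queue *q = container_of(iop, struct my_queue, iop);
>> 		int done = my_reap_completions(q, budget);
>>
>> 		if (done < budget) {
>> 			/* Drained within budget: stop polling, unmask. */
>> 			irq_poll_complete(iop);
>> 			enable_irq(q->irq);
>> 		}
>> 		/* done == budget: irq_poll keeps us on the percpu list
>> 		 * and round-robins us with other ->poll instances that
>> 		 * share this core. */
>> 		return done;
>> 	}
>>
>> 	/* at init time: irq_poll_init(&q->iop, 64, my_poll); */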
>>
>> It seems like we're attempting to stay in irq context for as long as we can
>> instead of scheduling to softirq/thread context if we have more than a
>> minimal amount of work to do. Without at least understanding why
>> softirq/thread context degrades us so much, this code seems like the wrong
>> approach to me. Interrupt context will always be faster, but that is not a
>> sufficient reason to spend as much time as possible there, is it?
>>
>> We should also keep in mind that the networking stack has been doing this
>> for years; I would try to understand why this cannot work for nvme before
>> dismissing it.
>>
>>> In particular, Long found that context
>>> switches increase a lot after applying your irq poll patch.
>>>
>>> http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html
>>
>> Oh, I didn't see that one, wonder why... thanks!
>>
>> 5% improvement, I guess we can buy that for other users as is :)
>>
>> If we suffer from lots of context switches while the CPU is flooded with
>> interrupts, then I would argue that we're re-raising softirq too much.
>> In this use-case, my assumption is that the cpu cannot keep up with the
>> interrupts, not that it doesn't reap enough (we also reap the first batch
>> in interrupt context...)
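>>
>> For reference, the "first batch in interrupt context" idea looks roughly
>> like the untested sketch below, reusing the made-up helpers from the
>> sketch above (FIRST_BATCH is a made-up constant):
>>
>> 	#define FIRST_BATCH 8
>>
>> 	static irqreturn_t my_irq_handler(int irq, void *data)
>> 	{
>> 		struct my_queue *q = data;
>>
>> 		/* Reap a small first batch inline to keep latency low. */
>> 		if (my_reap_completions(q, FIRST_BATCH) < FIRST_BATCH)
>> 			return IRQ_HANDLED;
>>
>> 		/* Queue still busy: mask the vector and defer the rest
>> 		 * to irqpoll, so softirq is only raised under sustained
>> 		 * load. */
>> 		disable_irq_nosync(q->irq);
>> 		irq_poll_sched(&q->iop);
>> 		return IRQ_HANDLED;
>> 	}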
>>
>> Perhaps making irqpoll continue until it must resched would improve things
>> further? That is a latency vs. efficiency tradeoff, though; it looks like
>> MAX_SOFTIRQ_TIME is set to 2ms:
>>
>> "
>> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>> * certain cases, such as stop_machine(), jiffies may cease to
>> * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>> * well to make sure we eventually return from this method.
>> *
>> * These limits have been established via experimentation.
>> * The two things to balance is latency against fairness -
>> * we want to handle softirqs as soon as possible, but they
>> * should not be able to lock up the box.
>> "
>>
>> Long, does this patch make any difference?
>
> Sagi,
>
> Sorry it took a while to bring my system back online.
>
> With the patch, the IOPS drop is about the same as with the 1st patch. I think the excessive context switches are causing the drop in IOPS.
>
> The following were captured by "perf sched record" for 30 seconds during the tests.
>
> "perf sched latency"
> With patch:
> fio:(82) | 937632.706 ms | 1782255 | avg: 0.209 ms | max: 63.123 ms | max at: 768.274023 s
>
> without patch:
> fio:(82) |2348323.432 ms | 18848 | avg: 0.295 ms | max: 28.446 ms | max at: 6447.310255 s
Does "without patch" mean the proposed hard-irq patch?
If we are context switching too much, it means the soft-irq operation is
not efficient, not necessarily that the completion path is running in
soft-irq.
Is your kernel compiled with full preemption or voluntary preemption?
> Looking closer at each CPU, we can see ksoftirqd competing for CPU with fio (and effectively throttling other fio processes)
> (captured in /sys/kernel/debug/tracing, echo sched:* >set_event)
>
> On CPU1 with patch: (note that prev_state for fio is "R"; it was preempted)
> <...>-4077 [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
> <...>-17 [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
> <...>-4077 [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
> <...>-17 [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
>
> On CPU1 without patch: (prev_state for fio is "S"; it scheduled out voluntarily)
> <idle>-0 [001] d... 6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
> fio-14342 [001] d... 6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
> <idle>-0 [001] d... 6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
fio-14342 [001] d... 6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120