Message-ID: <CY4PR21MB07410E84C8C7C1D64BD7BF41CEAA0@CY4PR21MB0741.namprd21.prod.outlook.com>
Date: Wed, 21 Aug 2019 16:27:00 +0000
From: Long Li <longli@...rosoft.com>
To: Ming Lei <ming.lei@...hat.com>
CC: John Garry <john.garry@...wei.com>,
Ming Lei <tom.leiming@...il.com>,
"longli@...uxonhyperv.com" <longli@...uxonhyperv.com>,
Jens Axboe <axboe@...com>, Sagi Grimberg <sagi@...mberg.me>,
chenxiang <chenxiang66@...ilicon.com>,
Peter Zijlstra <peterz@...radead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-nvme <linux-nvme@...ts.infradead.org>,
Keith Busch <keith.busch@...el.com>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Christoph Hellwig <hch@....de>
Subject: RE: [PATCH 0/3] fix interrupt swamp in NVMe
>>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>
>>>On Wed, Aug 21, 2019 at 07:47:44AM +0000, Long Li wrote:
>>>> >>>Subject: Re: [PATCH 0/3] fix interrupt swamp in NVMe
>>>> >>>
>>>> >>>On 20/08/2019 09:25, Ming Lei wrote:
>>>> >>>> On Tue, Aug 20, 2019 at 2:14 PM <longli@...uxonhyperv.com> wrote:
>>>> >>>>>
>>>> >>>>> From: Long Li <longli@...rosoft.com>
>>>> >>>>>
>>>> >>>>> This patch set tries to fix interrupt swamp in NVMe devices.
>>>> >>>>>
>>>> >>>>> On large systems with many CPUs, a number of CPUs may share one
>>>> >>>>> NVMe hardware queue. A situation can arise where several CPUs
>>>> >>>>> are issuing I/Os, and all the I/Os are returned on the CPU that
>>>> >>>>> the hardware queue is bound to. This may result in that CPU being
>>>> >>>>> swamped by interrupts and staying in interrupt mode for an
>>>> >>>>> extended time while other CPUs continue to issue I/O. This can
>>>> >>>>> trigger watchdog and RCU timeouts, and make the system
>>>> >>>>> unresponsive.
>>>> >>>>>
>>>> >>>>> This patch set addresses this by enforcing scheduling and
>>>> >>>>> throttling I/O when the CPU is starved in this situation.
>>>> >>>>>
>>>> >>>>> Long Li (3):
>>>> >>>>>   sched: define a function to report the number of context
>>>> >>>>>     switches on a CPU
>>>> >>>>>   sched: export idle_cpu()
>>>> >>>>>   nvme: complete request in work queue on CPU with flooded
>>>> >>>>>     interrupts
>>>> >>>>>
>>>> >>>>>  drivers/nvme/host/core.c | 57 +++++++++++++++++++++++++++++++++++++++-
>>>> >>>>>  drivers/nvme/host/nvme.h |  1 +
>>>> >>>>>  include/linux/sched.h    |  2 ++
>>>> >>>>>  kernel/sched/core.c      |  7 +++++
>>>> >>>>>  4 files changed, 66 insertions(+), 1 deletion(-)
>>>> >>>>
>>>> >>>> Another simpler solution may be to complete the request in a
>>>> >>>> threaded interrupt handler for this case. Meantime allow the
>>>> >>>> scheduler to run the interrupt thread handler on CPUs specified
>>>> >>>> by the irq affinity mask, which was discussed at the following
>>>> >>>> link:
>>>> >>>>
>>>> >>>> https://lore.kernel.org/lkml/e0e9478e-62a5-ca24-3b12-58f7d056383e@huawei.com/
>>>> >>>>
>>>> >>>> Could you try the above solution and see if the lockup can be
>>>> >>>> avoided? John Garry should have a workable patch.
>>>> >>>
>>>> >>>Yeah, so we experimented with changing the interrupt handling in
>>>> >>>the SCSI driver I maintain to use a threaded IRQ handler plus the
>>>> >>>patch below, and saw a significant throughput boost:
>>>> >>>
>>>> >>>--->8
>>>> >>>
>>>> >>>Subject: [PATCH] genirq: Add support to allow thread to use hard
>>>> >>>irq affinity
>>>> >>>
>>>> >>>Currently the cpu allowed mask for the threaded part of a threaded
>>>> >>>irq handler will be set to the effective affinity of the hard irq.
>>>> >>>
>>>> >>>Typically the effective affinity of the hard irq will be for a
>>>> >>>single cpu. As such, the threaded handler would always run on the
>>>> >>>same cpu as the hard irq.
>>>> >>>
>>>> >>>We have seen scenarios in high data-rate throughput testing where
>>>> >>>the cpu handling the interrupt can be totally saturated handling
>>>> >>>both the hard interrupt and threaded handler parts, limiting
>>>> >>>throughput.
>>>> >>>
>>>> >>>Add an IRQF_IRQ_AFFINITY flag to allow the driver requesting the
>>>> >>>threaded interrupt to decide the policy of which cpu the threaded
>>>> >>>handler may run on.
>>>> >>>
>>>> >>>Signed-off-by: John Garry <john.garry@...wei.com>
>>>>
>>>> Thanks for pointing me to this patch. It fixed the interrupt swamp and
>>>> made the system stable.
>>>>
>>>> However I'm seeing reduced performance when using threaded
>>>> interrupts.
>>>>
>>>> Here are the test results on a system with 80 CPUs and 10 NVMe disks
>>>> (32 hardware queues for each disk). The benchmark tool is FIO; I/O
>>>> pattern: 4k random reads on all NVMe disks, with queue depth = 64,
>>>> number of jobs = 80, direct=1.
>>>>
>>>> With threaded interrupts: 1320k IOPS
>>>> With just interrupts: 3720k IOPS
>>>> With just interrupts and my patch: 3700k IOPS
>>>
>>>This gap looks too big wrt. threaded interrupts vs. interrupts.
>>>
>>>>
>>>> At the peak IOPS, the overall CPU usage is at around 98-99%. I think
>>>> the cost of doing the wakeup and context switch for the NVMe threaded
>>>> IRQ handler takes some CPU away.
>>>>
>>>
>>>In theory, it shouldn't be so, because most of the time the thread
>>>should be running on the CPUs of this hctx, and the wakeup cost
>>>shouldn't be so big. Maybe there is a performance problem somewhere
>>>wrt. threaded interrupts.
>>>
>>>Could you share your test script and environment with us? I will see
>>>if I can reproduce it in my environment.
Ming, do you have access to L80s_v2 in Azure? This test needs to run on that VM size.
Here is the command to benchmark it:
fio --bs=4k --ioengine=libaio --iodepth=128 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=120 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1
Thanks
Long
>>>
>>>> In this test, I made the following change to make use of
>>>> IRQF_IRQ_AFFINITY for NVMe:
>>>>
>>>> diff --git a/drivers/pci/irq.c b/drivers/pci/irq.c
>>>> index a1de501a2729..3fb30d16464e 100644
>>>> --- a/drivers/pci/irq.c
>>>> +++ b/drivers/pci/irq.c
>>>> @@ -86,7 +86,7 @@ int pci_request_irq(struct pci_dev *dev, unsigned int nr, irq_handler_t handler,
>>>> 	va_list ap;
>>>> 	int ret;
>>>> 	char *devname;
>>>> -	unsigned long irqflags = IRQF_SHARED;
>>>> +	unsigned long irqflags = IRQF_SHARED | IRQF_IRQ_AFFINITY;
>>>>
>>>> 	if (!handler)
>>>> 		irqflags |= IRQF_ONESHOT;
>>>>
>>>>
>>>
>>>I don't see why IRQF_IRQ_AFFINITY is needed.
>>>
>>>John, could you explain a bit why you need the changes on
>>>IRQF_IRQ_AFFINITY?
>>>
>>>The following patch should be enough:
>>>
>>>diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
>>>index e8f7f179bf77..1e7cffc1c20c 100644
>>>--- a/kernel/irq/manage.c
>>>+++ b/kernel/irq/manage.c
>>>@@ -968,7 +968,11 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action)
>>> 	if (cpumask_available(desc->irq_common_data.affinity)) {
>>> 		const struct cpumask *m;
>>>
>>>-		m = irq_data_get_effective_affinity_mask(&desc->irq_data);
>>>+		if (irqd_affinity_is_managed(&desc->irq_data))
>>>+			m = desc->irq_common_data.affinity;
>>>+		else
>>>+			m = irq_data_get_effective_affinity_mask(
>>>+					&desc->irq_data);
>>> 		cpumask_copy(mask, m);
>>> 	} else {
>>> 		valid = false;
>>>
>>>Thanks,
>>>Ming