lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191219161115.GA18672@ming.t460p>
Date:   Fri, 20 Dec 2019 00:11:15 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Peter Xu <peterx@...hat.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Juri Lelli <juri.lelli@...hat.com>,
        Ming Lei <minlei@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Kernel-managed IRQ affinity (cont)

On Thu, Dec 19, 2019 at 09:32:14AM -0500, Peter Xu wrote:
> On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> > Hi Peter,
> 
> Hi, Ming,
> 
> > 
> > On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > > Hi, Thomas,
> > > 
> > > (Sorry I must have lost the discussion during an email migration, so
> > >  I'll start with a new one)
> > > 
> > > This is a continued discussion of previous one on kernel managed IRQ
> > > affinity [1].  I think at that time the conclusion is that we don't
> > > have a usage scenario to change current policy [2].  However recently
> > > I noticed that it is probably a very fundamental requirement for some
> > > real-time scenarios, even when there's no multi-queue involved.
> > > 
> > > In my test case, it was a very common realtime guest with 10 vcpus,
> > > 0-1 are housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
> > > virtio-blk device as boot disk.  With a distribution very close to
> > > latest upstream, we can observe high spikes, probably due to the IRQs.
> > > 
> > > To guarantee realtime responsiveness, we need to make sure the IRQs
> > > will be managable, say, when I run a real-time workload on vcpu9, we
> > > should be able to move all the IRQs from vcpu9 to the other vcpus
> > > (most probably vcpu0 and vcpu1).  However with the kernel managed IRQs
> > > we can't echo to /proc/irq/N/smp_affinity.  Here, vcpu9 gets IRQ 38
> > > from the virtio-blk device:
> > > 
> > >   # cat /proc/interrupts | grep -w 38
> > >   38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> > >   # cat /proc/irq/38/smp_affinity
> > >   3ff
> > >   # cat /proc/irq/38/effective_affinity
> > >   200
> > > 
> > > Meanwhile, I don't think there's anything special for VMs, so this
> > > issue should exist even for hosts as long as the IRQ is managed in the
> > > same way here as the virtio-blk device.
> > > 
> > > As Ming has mentioned in previous discussions [3], I think it would be
> > > at least good if the kernel IRQ system can respect "irqaffinity=" when
> > > assigning IRQs to the cores.  Currently it's not.  What would you
> > > suggest in this case?  Do you think this is a valid user scenario?
> > > 
> > > Thanks,
> > > 
> > > [1] https://lkml.org/lkml/2019/3/18/15
> > > [2] https://lkml.org/lkml/2019/3/25/562
> > > [3] https://lkml.org/lkml/2019/3/25/308
> > 
> > The following patch supposes to implementation the requirement for you,
> > can you test it by passing 'isolcpus=managed_irq,X-Y'?
> 
> I really appreciate your patch!  I'll keep this version, while before
> I start to test it...
> 
> > 
> > With this kind of change, you can't run any IO from any isolated
> > CPU core, otherwise, unpredictable error may be triggered, either oops or
> > IO hang.
> 
> ... I'm not sure whether this can be acceptable for a production
> environment.
> 
> In our case, the IRQ should come from virtio-blk which is the root
> disk, so I assume even the RT core could use it at least when loading
> the executable into RAM.  So...
> 
> > 
> > Another conservative approach is to only select effective CPU from
> > non-isolated cpus, however, the assigned CPUs may not be balanced among
> > interrupt vectors. But it is safer, since the system still works even if
> > someone submits IO from any isolated cpu core.
> 
> ... this one seems to be more appealing at least to me.

OK, please try the following patch:


diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 6c8512d3be88..0fbcbacd1b29 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -13,6 +13,7 @@ enum hk_flags {
 	HK_FLAG_TICK		= (1 << 4),
 	HK_FLAG_DOMAIN		= (1 << 5),
 	HK_FLAG_WQ		= (1 << 6),
+	HK_FLAG_MANAGED_IRQ	= (1 << 7),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 1753486b440c..0a75a09cc4e8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -20,6 +20,7 @@
 #include <linux/sched/task.h>
 #include <uapi/linux/sched/types.h>
 #include <linux/task_work.h>
+#include <linux/sched/isolation.h>
 
 #include "internals.h"
 
@@ -212,12 +213,33 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 {
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
+	const struct cpumask *housekeeping_mask =
+		housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
 	int ret;
+	cpumask_var_t tmp_mask;
 
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	ret = chip->irq_set_affinity(data, mask, force);
+	if (!zalloc_cpumask_var(&tmp_mask, GFP_KERNEL))
+		return -EINVAL;
+
+	/*
+	 * Userspace can't change managed irq's affinity, make sure
+	 * that isolated CPU won't be selected as the effective CPU
+	 * if this irq's affinity includes both isolated CPU and
+	 * housekeeping CPU.
+	 *
+	 * This way guarantees that isolated CPU won't be interrupted
+	 * by IO submitted from housekeeping CPU.
+	 */
+	if (irqd_affinity_is_managed(data) &&
+			cpumask_intersects(mask, housekeeping_mask))
+		cpumask_and(tmp_mask, mask, housekeeping_mask);
+	else
+		cpumask_copy(tmp_mask, mask);
+
+	ret = chip->irq_set_affinity(data, tmp_mask, force);
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
@@ -229,6 +251,8 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		ret = 0;
 	}
 
+	free_cpumask_var(tmp_mask);
+
 	return ret;
 }
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 9fcb2a695a41..008d6ac2342b 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -163,6 +163,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "managed_irq,", 12)) {
+			str += 12;
+			flags |= HK_FLAG_MANAGED_IRQ;
+			continue;
+		}
+
 		pr_warn("isolcpus: Error, unknown flag\n");
 		return 0;
 	}

-- 
Ming

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ