linux-kernel - Re: Kernel-managed IRQ affinity (cont)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191219143214.GA50561@xz-x1>
Date:   Thu, 19 Dec 2019 09:32:14 -0500
From:   Peter Xu <peterx@...hat.com>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Juri Lelli <juri.lelli@...hat.com>,
        Ming Lei <minlei@...hat.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Kernel-managed IRQ affinity (cont)

On Thu, Dec 19, 2019 at 04:28:19PM +0800, Ming Lei wrote:
> Hi Peter,

Hi, Ming,

> 
> On Mon, Dec 16, 2019 at 02:57:12PM -0500, Peter Xu wrote:
> > Hi, Thomas,
> > 
> > (Sorry I must have lost the discussion during an email migration, so
> >  I'll start with a new one)
> > 
> > This is a continued discussion of previous one on kernel managed IRQ
> > affinity [1].  I think at that time the conclusion is that we don't
> > have a usage scenario to change current policy [2].  However recently
> > I noticed that it is probably a very fundamental requirement for some
> > real-time scenarios, even when there's no multi-queue involved.
> > 
> > In my test case, it was a very common realtime guest with 10 vcpus,
> > 0-1 are housekeeping vcpus, 2-9 are realtime vcpus.  The guest has one
> > virtio-blk device as boot disk.  With a distribution very close to
> > latest upstream, we can observe high spikes, probably due to the IRQs.
> > 
> > To guarantee realtime responsiveness, we need to make sure the IRQs
> > will be managable, say, when I run a real-time workload on vcpu9, we
> > should be able to move all the IRQs from vcpu9 to the other vcpus
> > (most probably vcpu0 and vcpu1).  However with the kernel managed IRQs
> > we can't echo to /proc/irq/N/smp_affinity.  Here, vcpu9 gets IRQ 38
> > from the virtio-blk device:
> > 
> >   # cat /proc/interrupts | grep -w 38
> >   38: 0 0 0 0 0 0 0 0 0 15206 PCI-MSI 2621441-edge virtio2-req.0
> >   # cat /proc/irq/38/smp_affinity
> >   3ff
> >   # cat /proc/irq/38/effective_affinity
> >   200
> > 
> > Meanwhile, I don't think there's anything special for VMs, so this
> > issue should exist even for hosts as long as the IRQ is managed in the
> > same way here as the virtio-blk device.
> > 
> > As Ming has mentioned in previous discussions [3], I think it would be
> > at least good if the kernel IRQ system can respect "irqaffinity=" when
> > assigning IRQs to the cores.  Currently it's not.  What would you
> > suggest in this case?  Do you think this is a valid user scenario?
> > 
> > Thanks,
> > 
> > [1] https://lkml.org/lkml/2019/3/18/15
> > [2] https://lkml.org/lkml/2019/3/25/562
> > [3] https://lkml.org/lkml/2019/3/25/308
> 
> The following patch supposes to implementation the requirement for you,
> can you test it by passing 'isolcpus=managed_irq,X-Y'?

I really appreciate your patch!  I'll keep this version, while before
I start to test it...

> 
> With this kind of change, you can't run any IO from any isolated
> CPU core, otherwise, unpredictable error may be triggered, either oops or
> IO hang.

... I'm not sure whether this can be acceptable for a production
environment.

In our case, the IRQ should come from virtio-blk which is the root
disk, so I assume even the RT core could use it at least when loading
the executable into RAM.  So...

> 
> Another conservative approach is to only select effective CPU from
> non-isolated cpus, however, the assigned CPUs may not be balanced among
> interrupt vectors. But it is safer, since the system still works even if
> someone submits IO from any isolated cpu core.

... this one seems to be more appealing at least to me.

Thanks,

-- 
Peter Xu