lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <SN6PR02MB4157F39049BEFD43AE7C3731D4602@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Mon, 16 Sep 2024 18:15:04 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Michael Kelley <mhklinux@...look.com>, "kys@...rosoft.com"
	<kys@...rosoft.com>, "haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
	"wei.liu@...nel.org" <wei.liu@...nel.org>, "decui@...rosoft.com"
	<decui@...rosoft.com>, "tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...hat.com" <mingo@...hat.com>, "bp@...en8.de" <bp@...en8.de>,
	"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>, "x86@...nel.org"
	<x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>, "lpieralisi@...nel.org"
	<lpieralisi@...nel.org>, "kw@...ux.com" <kw@...ux.com>, "robh@...nel.org"
	<robh@...nel.org>, "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
	"James.Bottomley@...senPartnership.com"
	<James.Bottomley@...senPartnership.com>, "martin.petersen@...cle.com"
	<martin.petersen@...cle.com>, "arnd@...db.de" <arnd@...db.de>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
CC: "maz@...nel.org" <maz@...nel.org>, "den@...inux.co.jp"
	<den@...inux.co.jp>, "jgowans@...zon.com" <jgowans@...zon.com>,
	"dawei.li@...ngroup.cn" <dawei.li@...ngroup.cn>
Subject: RE: [RFC 00/12] Hyper-V guests use Linux IRQs for channel interrupts

From: mhkelley58@...il.com <mhkelley58@...il.com> Sent: Monday, June 3, 2024 10:09 PM
> 
> This patch set makes significant changes in how Linux guests running on Hyper-V
> handle interrupts from VMBus devices. It's "RFC" because of the level of change,
> and because I'm interested in macro-level feedback on the whole idea. The usual
> detailed-level code review feedback is also welcome.

I want to get back to this RFC patch set and incorporate the changes
suggested by Thomas Gleixner. But before I do that, is there any high-level
feedback on whether the whole idea should be pursued at all? I think there's
value in having the VMBus devices appear in a standard way in /proc/irq, so
that we can eventually get away from the custom stuff in /sys/bus/vmbus.
It also makes VMBus devices show up in a way similar to virtio devices. But
on the flip side, it's churn to code that's already working well. And having
VMBus devices appear in /proc/irq means they are visible to user space utilities
like irqbalance, giving them the ability to affect performance (particularly for
networking). As such, this patch set has implications beyond just the Linux
kernel, and probably means disabling irqbalance by default in Azure cloud
images.

I'm especially interested in input from Microsoft folks, since the changes
are specific to Hyper-V, with implications for the Azure cloud.

Michael

> 
> Background
> ----------
> Linux guests running on Hyper-V are presented with a range of synthetic
> devices, including storage, networking, mouse, keyboard, etc. Guests are also
> presented with management-related pseudo-devices, including time
> synchronization, heartbeat, and host-initiated shutdown. These devices use the
> VMBus connection between the host and the guest, and each device has one or
> more VMBus channels for passing messages between the host and guest.
> Separately, a control path exists for creating and tearing down VMBus channels
> when needed. Hyper-V VMBus and its synthetic devices play a role similar to
> virtio and virtio devices.
> 
> The host interrupts the guest when it sends a new message to the guest in an
> otherwise empty channel, and when it sends a new control message to the guest.
> All interrupts are multiplexed onto a single per-CPU architectural interrupt as
> defined by the processor architecture. On arm64, this architectural interrupt
> is a PPI, and is represented in Linux as a Linux per-CPU IRQ. The x86/x64
> architecture does not provide per-CPU interrupts, so Linux reserves a fixed x86
> interrupt vector (HYPERVISOR_CALLBACK_VECTOR) on all CPUs, with a hard-coded
> interrupt handler. No Linux IRQ is assigned on x86/x64.
> 
> The interrupt handler for the architectural interrupt is the VMBus interrupt
> handler, and it runs whenever the host generates an interrupt for a channel or
> a control message. The VMBus interrupt handler consults data structures
> provided by the per-CPU Hyper-V synthetic interrupt controller (synic) to
> determine which channel or channels have a pending interrupt, or if there is a
> pending control message. When doing this demultiplexing, the VMBus interrupt
> handler directly invokes the per-device handlers, and processes any control
> messages.
> 
> In this scheme, the individual VMBus devices and their channels are not
> modelled as Linux IRQs. /proc/interrupts shows counts for the top-level
> architectural interrupt, but not for the individual VMBus devices. The VMBus
> driver has implemented separate reporting of per-device data under /sys.
> Similarly, /proc/irq/<nn> does not show affinity information for individual
> VMBus devices, and the affinity cannot be changed via /proc/irq. Again, a
> separate interface under /sys is provided for changing the vCPU that a VMBus
> channel should interrupt.
> 
> What's New
> ----------
> Linux kernel community members have commented that it would be better for the
> Hyper-V synic and the handling of VMBus channel interrupts to be modelled as
> Linux IRQs. Then they would participate in existing IRQ handling mechanisms,
> and not need something separate under /sys. This patch set provides such an
> implementation.
> 
> With this patch set, the output of /proc/interrupts looks like this in a Linux
> guest on a Hyper-V x86/x64 Generation 2 VM with 8 vCPUs (the columns for
> CPUs 2 thru 5 are omitted due to text width considerations):
> 
>            CPU0       CPU1        CPU6       CPU7
>   8:          0          0           0          0   IO-APIC   8-edge      rtc0
>   9:          2          0           0          0   IO-APIC   9-fasteoi   acpi
>  24:      16841          0           0          0     VMBus   7  heartbeat
>  25:          1          0           0          0     VMBus   9  shutdown
>  26:       6752          0           0          0     VMBus  10  timesync
>  27:          1          0           0          0     VMBus  11  vss
>  28:      22095          0           0          0     VMBus  14  pri@...rvsc1
>  29:          0         85           0          0     VMBus  15  pri@...rvsc0
>  30:          3          0           0          0     VMBus   1  balloon
>  31:        106          0           0          0     VMBus   3  mouse
>  32:         73          0           0          0     VMBus   4  keyboard
>  33:          6          0           0          0     VMBus   5  framebuffer
>  34:          0          0           0          0     VMBus  13  netvsc
>  35:          1          0           0          0     VMBus   8  kvp
>  36:          0          0           0          0     VMBus  16  netvsc
>  37:          0          0           0          0     VMBus  17  netvsc
>  38:          0          0           0          0     VMBus  18  netvsc
>  39:          0          0        2375          0     VMBus  19  netvsc
>  40:          0          0           0       1178     VMBus  20  netvsc
>  41:       1003          0           0          0     VMBus  21  netvsc
>  42:          0       4337           0          0     VMBus  22  netvsc
>  43:          0          0           0          0     VMBus  23  sub@...rvsc1
>  44:          0          0           0          0     VMBus  24  sub@...rvsc0
> NMI:          0          0           0          0   Non-maskable interrupts
> LOC:          0          0           0          0   Local timer interrupts
> SPU:          0          0           0          0   Spurious interrupts
> PMI:          0          0           0          0   Performance monitoring interrupts
> IWI:          1          0           0          2   IRQ work interrupts
> RTR:          0          0           0          0   APIC ICR read retries
> RES:        411        233         349        318   Rescheduling interrupts
> CAL:     135236      29211       99526      36424   Function call interrupts
> TLB:          0          0           0          0   TLB shootdowns
> TRM:          0          0           0          0   Thermal event interrupts
> THR:          0          0           0          0   Threshold APIC interrupts
> DFR:          0          0           0          0   Deferred Error APIC interrupts
> MCE:          0          0           0          0   Machine check exceptions
> MCP:        109        109         109        109   Machine check polls
> HYP:         71          0           0          0   Hypervisor callback interrupts
> HRE:          0          0           0          0   Hyper-V reenlightenment interrupts
> HVS:     183391     109695      181539     339239   Hyper-V stimer0 interrupts
> ERR:          0
> MIS:          0
> PIN:          0          0           0          0   Posted-interrupt notification event
> NPI:          0          0           0          0   Nested posted-interrupt event
> PIW:          0          0           0          0   Posted-interrupt wakeup event
> 
> IRQs 24 thru 44 are the VMBus channel IRQs that didn't previously exist. Some
> devices, such as storvsc and netvsc, have multiple channels, each of which has
> a separate IRQ. The HYP line now shows counts only for control messages instead
> of for all VMBus interrupts. The HVS line continues to show Hyper-V synthetic
> timer (stimer0) interrupts. (In older versions of Hyper-V where stimer0
> interrupts are delivered as control messages, those interrupts are accounted
> for on the HYP line. Since this is legacy behavior, it seemed unnecessary to
> create a separate Linux IRQ to model these messages.)
> 
> The basic approach is to update the VMBus channel create/open/close/delete
> infrastructure, and the VMBus interrupt handler, to create and process the
> additional IRQs. The goal is that individual VMBus drivers require no code
> changes to work with the new IRQs. However, a driver may optionally annotate
> the IRQ using knowledge that only the driver has. For example, in the above
> /proc/interrupts output, the storvsc driver for synthetic disks has been
> updated to distinguish the primary channel from the subchannels, and to
> distinguish controller 0 from controller 1. Since storvsc models a SCSI
> controller, the numeric designations are set to match the "host0" and "host1"
> SCSI controllers that could be seen with "lsscsi -H -v". Similarly annotating
> the "netvsc" entries is a "to do" item.
> 
> In furtherance of the "no changes to existing VMBus drivers" goal, all VMBus
> IRQs use the same interrupt handler. Each channel has a callback function
> associated with it, and the interrupt handler invokes this callback in the same
> way as before VMBus IRQs were introduced.
> 
> VMBus code creates a standalone linear IRQ domain for the VMBus IRQs. The hwirq
> #'s are the VMBus channel relid's, which are integers in the range 1 to 2047.
> Each relid uniquely identifies a VMBus channel, and a channel interrupt from
> the host provides the relid. A mapping from IRQ to hwirq (relid) is created
> when a new channel is created, and the mapping is deleted when the channel is
> deleted, so adding and removing VMBus devices in a running VM is fully
> functional.
> 
> Getting the IRQ counting correct requires a new IRQ flow handler in
> /kernel/irq/chip.c. See Patch 6. The existing flow handlers don't handle this
> case, and creating a custom flow handler outside of chip.c isn't feasible
> because the needed components aren't available to code outside of /kernel/irq.
> 
> When a guest CPU is taken offline, Linux automatically changes the affinity of
> IRQs assigned to that CPU so that the IRQ is handled on another CPU that is
> still online. Unfortunately, VMBus channel IRQs are not able to take advantage
> of that mechanism. At the point the mechanism runs, it isn't possible to know
> when Hyper-V started interrupting the new CPU; hence pending interrupts may be
> stranded on the old offline CPU. Existing VMBus code in Linux prevents a CPU
> from going offline when a VMBus channel interrupt is affined to it, and that
> code must be retained. Of course, VMBus channel IRQs can manually be affined to
> a different CPU, which could then allow a CPU to be taken offline. I have an
> idea for a somewhat convoluted way to allow the automatic Linux mechanism to
> work, but that will be another patch set.
> 
> Having VMBus channel interrupts appear in /proc/irq means that user-space
> irqbalance can change the affinity of the interrupts, where previously it could
> not. For example, this gives irqbalance the ability to change (and perhaps
> degrade) the default affinity configuration that was designed to maximize
> network throughput. This is yet another reason to disable irqbalance in VM
> images.
> 
> The custom VMBus entries under /sys are retained for backwards compatibility,
> but they are integrated. For example, changing the affinity via /proc/irq is
> reflected in /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu, and vice versa.
> 
> Patches
> -------
> The patches have been tested on x86/x64 Generation 1 and Generation 2 VMs, and
> on arm64 VMs.
> 
> * Patches 1 and 2 fixes some error handling that is needed by subsequent
> patches. They could be applied independent of this patch set.
> 
> * Patches 3,4, and 5 add support for IRQ names, and add annotations in the
> storvsc and the Hyper-V vPCI driver so that descriptions in /proc/interrupts
> convey useful information.
> 
> * Patch 6 adds a new IRQ flow handler needed to properly account for IRQs as
> VMBus IRQs or as the top-level architectural interrupt.
> 
> * Patch 7 creates the new IRQ domain for VMBus channel IRQs.
> 
> * Patch 8 allocates the VMBus channel IRQs, and creates the mapping between
> VMBus relid (used as the hwirq) and the Linux IRQ number. That mapping is then
> used for lookups.
> 
> * Patch 9 does request_irq() for the new VMBus channel IRQs, and converts the
> top-level VMBus architectural interrupt handler to use the new VMBus channel
> IRQs for handling channel interrupts. With this patch, /proc/interrupts shows
> counts for the VMBus channel IRQs.
> 
> * Patch 10 implements the irq_set_affinity() function for VMBus channel IRQs,
> and integrates it with the existing Hyper-V-specific sysfs mechanism for
> changing the CPU that a VMBus channel will interrupt.
> 
> * Patch 11 updates hv_synic_cleanup() during CPU offlining to handle the lack
> of waiting for the MODIFYCHANNEL message in vmbus_irq_set_affinity() in Patch
> 10.
> 
> * Patch 12 fixes a race in VMBus channel interrupt assignments that existed
> prior to this patch set when taking a CPU offline.
> 
> Open Topics
> -----------
> * The performance impact of the additional IRQ processing in the interrupt path
> appears to be minimal, based on looking at the code. Measurements are still to
> be taken to confirm this.
> 
> * Does the netvsc driver have sufficient information to annotate what appears
> in /proc/interrupts, particularly when multiple synthetic NICs are present? At
> first glance, it appears that the identifying info isn't generated until after
> vmbus_open() is called, and by then it's too late to do the IRQ annotation.
> Need some input from a netvsc expert.
> 
> * The VMBus irq domain and irq chip data structures are placed in the
> vmbus_connection structure. Starting with VMBus protocol version 5.0, the VMBus
> host side supports multiple connections from a single guest instance (see
> commit ae20b254306a6). While there's no upstream kernel code that creates
> multiple VMBus connections, it's unclear whether the IRQ domain should be
> global across all VMBus connection, or per connection. Are relid's global (to
> the VM) or per connection?
> 
> * I have not tried or tested any hv_sock channels. Is there a handy test
> program that would be convenient for opening multiple hv_sock connections to
> the guest, have them persist long enough to see what /proc/interrupts shows,
> and try changing the IRQ affinity?
> 
> * I have not tried or tested Linux guest hibernation paths.
> 
> The patches build and run against 6.10-rc2 and next-20240603.
> 
> Michael Kelley (12):
>   Drivers: hv: vmbus: Drop unsupported VMBus devices earlier
>   Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs
>     group
>   Drivers: hv: vmbus: Add an IRQ name to VMBus channels
>   PCI: hv: Annotate the VMBus channel IRQ name
>   scsi: storvsc: Annotate the VMBus channel IRQ name
>   genirq: Add per-cpu flow handler with conditional IRQ stats
>   Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus
>     connection
>   Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid
>     mapping
>   Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts
>   Drivers: hv: vmbus: Implement vmbus_irq_set_affinity
>   Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining
>     CPUs
>   Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going
>     offline
> 
>  arch/x86/kernel/cpu/mshyperv.c      |   9 +-
>  drivers/hv/channel.c                | 125 ++++------
>  drivers/hv/channel_mgmt.c           | 139 ++++++++----
>  drivers/hv/connection.c             |  48 ++--
>  drivers/hv/hv.c                     |  90 +++++++-
>  drivers/hv/hv_common.c              |   2 +-
>  drivers/hv/hyperv_vmbus.h           |  22 +-
>  drivers/hv/vmbus_drv.c              | 338 +++++++++++++++++++---------
>  drivers/pci/controller/pci-hyperv.c |   5 +
>  drivers/scsi/storvsc_drv.c          |   9 +
>  include/asm-generic/mshyperv.h      |   9 +-
>  include/linux/hyperv.h              |   9 +
>  include/linux/irq.h                 |   1 +
>  kernel/irq/chip.c                   |  29 +++
>  14 files changed, 567 insertions(+), 268 deletions(-)
> 
> --
> 2.25.1
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ