Message-ID: <a4f169fa-663d-4a94-878b-d783f67d48c9@intel.com>
Date: Fri, 29 Mar 2024 15:32:00 +0800
From: Zeng Guang <guang.zeng@...el.com>
To: Jacob Pan <jacob.jun.pan@...ux.intel.com>,
LKML <linux-kernel@...r.kernel.org>, X86 Kernel <x86@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
Thomas Gleixner <tglx@...utronix.de>, Lu Baolu <baolu.lu@...ux.intel.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"Hansen, Dave" <dave.hansen@...el.com>, Joerg Roedel <joro@...tes.org>,
"H. Peter Anvin" <hpa@...or.com>, Borislav Petkov <bp@...en8.de>,
Ingo Molnar <mingo@...hat.com>
Cc: "Luse, Paul E" <paul.e.luse@...el.com>,
"Williams, Dan J" <dan.j.williams@...el.com>, Jens Axboe <axboe@...nel.dk>,
"Raj, Ashok" <ashok.raj@...el.com>, "Tian, Kevin" <kevin.tian@...el.com>,
"maz@...nel.org" <maz@...nel.org>, "seanjc@...gle.com" <seanjc@...gle.com>,
Robin Murphy <robin.murphy@....com>
Subject: Re: [PATCH 09/15] x86/irq: Install posted MSI notification handler
On 1/27/2024 7:42 AM, Jacob Pan wrote:
> @@ -353,6 +360,111 @@ void intel_posted_msi_init(void)
> pid->nv = POSTED_MSI_NOTIFICATION_VECTOR;
> pid->ndst = this_cpu_read(x86_cpu_to_apicid);
> }
> +
> +/*
> + * De-multiplexing posted interrupts is on the performance path, the code
> + * below is written to optimize the cache performance based on the following
> + * considerations:
> + * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
> + * accessed by both CPU and IOMMU.
> + * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
> + * for checking and clearing posted interrupt request (PIR), a 256 bit field
> + * within the PID.
> + * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
> + * line when posting interrupts and setting control bits.
> + * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
> + * 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
> + * cache line. The cache line states after each operation are as follows:
> + * CPU            IOMMU                   PID Cache line state
> + * ---------------------------------------------------------------
> + *...read64                               exclusive
> + *...lock xchg64                          modified
> + *...               post/atomic swap      invalid
> + *...-------------------------------------------------------------
> + *
> + * To reduce L1 data cache miss, it is important to avoid contention with
> + * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
> + * to dispatch interrupt handlers.
> + *
> + * In addition, the code is trying to keep the cache line state consistent
> + * as much as possible. e.g. when making a copy and clearing the PIR
> + * (assuming non-zero PIR bits are present in the entire PIR), it does:
> + * read, read, read, read, xchg, xchg, xchg, xchg
> + * instead of:
> + * read, xchg, read, xchg, read, xchg, read, xchg
> + */
> +static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
> +{
> + int i, vec = FIRST_EXTERNAL_VECTOR;
> + unsigned long pir_copy[4];
> + bool handled = false;
> +
> + for (i = 0; i < 4; i++)
> + pir_copy[i] = pir[i];
> +
> + for (i = 0; i < 4; i++) {
> + if (!pir_copy[i])
> + continue;
> +
> + pir_copy[i] = arch_xchg(pir, 0);
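There is a problem here: arch_xchg(pir, 0) always exchanges pir[0], so
pir_copy[i] is always written with the old value of pir[0], regardless of i.
This leads to spurious posted MSIs being handled later. Presumably the
intent was arch_xchg(&pir[i], 0).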
> + handled = true;
> + }
> +
> + if (handled) {
> + for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
> + call_irq_handler(vec, regs);
> + }
> +
> + return handled;
> +}
> +
> +/*
> + * Performance data shows that 3 is good enough to harvest 90+% of the benefit
> + * on high IRQ rate workload.
> + */
> +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> +
> +/*
> + * For MSIs that are delivered as posted interrupts, the CPU notifications
> + * can be coalesced if the MSIs arrive in high frequency bursts.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> +{
> + struct pt_regs *old_regs = set_irq_regs(regs);
> + struct pi_desc *pid;
> + int i = 0;
> +
> + pid = this_cpu_ptr(&posted_interrupt_desc);
> +
> + inc_irq_stat(posted_msi_notification_count);
> + irq_enter();
> +
> + /*
> + * Max coalescing count includes the extra round of handle_pending_pir
> + * after clearing the outstanding notification bit. Hence, at most
> + * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
> + */
> + while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
> + if (!handle_pending_pir(pid->pir64, regs))
> + break;
> + }
> +
> + /*
> + * Clear outstanding notification bit to allow new IRQ notifications,
> + * do this last to maximize the window of interrupt coalescing.
> + */
> + pi_clear_on(pid);
> +
> + /*
> + * There could be a race of PI notification and the clearing of ON bit,
> + * process PIR bits one last time such that handling the new interrupts
> + * are not delayed until the next IRQ.
> + */
> + handle_pending_pir(pid->pir64, regs);
> +
> + apic_eoi();
> + irq_exit();
> + set_irq_regs(old_regs);
> }
> #endif /* X86_POSTED_MSI */
>
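To make the arch_xchg(pir, 0) issue noted above concrete, here is a minimal
user-space sketch, not kernel code. It assumes C11 atomic_exchange() as a
stand-in for arch_xchg(); the names xchg64(), harvest_as_posted() and
harvest_per_word() are hypothetical and exist only for this illustration.
The per-word variant is my reading of the intended behavior, not a tested
kernel change.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PIR_WORDS 4	/* 256-bit PIR viewed as four 64-bit words */

/* Stand-in for the kernel's arch_xchg() (assumption for illustration). */
static uint64_t xchg64(_Atomic uint64_t *p, uint64_t val)
{
	return atomic_exchange(p, val);
}

/* Mirrors the loop as posted: every exchange targets pir[0]. */
static bool harvest_as_posted(_Atomic uint64_t *pir, uint64_t *pir_copy)
{
	bool handled = false;
	int i;

	for (i = 0; i < PIR_WORDS; i++)
		pir_copy[i] = atomic_load(&pir[i]);

	for (i = 0; i < PIR_WORDS; i++) {
		if (!pir_copy[i])
			continue;

		/* Always swaps word 0, whatever i is. */
		pir_copy[i] = xchg64(pir, 0);
		handled = true;
	}

	return handled;
}

/* Presumed intent: swap out the word that was just seen non-zero. */
static bool harvest_per_word(_Atomic uint64_t *pir, uint64_t *pir_copy)
{
	bool handled = false;
	int i;

	for (i = 0; i < PIR_WORDS; i++)
		pir_copy[i] = atomic_load(&pir[i]);

	for (i = 0; i < PIR_WORDS; i++) {
		if (!pir_copy[i])
			continue;

		pir_copy[i] = xchg64(&pir[i], 0);
		handled = true;
	}

	return handled;
}
```

With a single bit pending in pir[2], harvest_as_posted() reports handled
while leaving pir_copy[2] empty and pir[2] still set, so the pending bit is
neither dispatched nor cleared. If the IOMMU posts into pir[0] between the
reads and the exchanges, those word-0 bits would land in pir_copy[i] for
i > 0 and be decoded as unrelated vectors, which is the spurious case.
harvest_per_word() copies and clears each non-zero word as expected.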