linux-kernel - Re: [RESEND PATCH] irqchip/gic-v4.1: Use the ITS of the NUMA node where current cpu is located

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <84b65faf-f405-2a30-2b75-3f025a28a96e@huawei.com>
Date: Mon, 1 Jul 2024 11:24:45 +0800
From: Tangnianyao <tangnianyao@...wei.com>
To: Marc Zyngier <maz@...nel.org>
CC: <tglx@...utronix.de>, <linux-arm-kernel@...ts.infradead.org>,
	<linux-kernel@...r.kernel.org>, <guoyang2@...wei.com>,
	<wangwudi@...ilicon.com>
Subject: Re: [RESEND PATCH] irqchip/gic-v4.1: Use the ITS of the NUMA node
 where current cpu is located



On 6/26/2024 16:41, Marc Zyngier wrote:
> On Wed, 26 Jun 2024 03:22:52 +0100,
> Tangnianyao <tangnianyao@...wei.com> wrote:
>>
>>
>> On 6/25/2024 15:53, Marc Zyngier wrote:
>>> On Tue, 25 Jun 2024 02:40:19 +0100,
>>> Nianyao Tang <tangnianyao@...wei.com> wrote:
>>>> When GICv4.1 enabled, guest sending IPI use the last ITS reported.
>>>> On multi-NUMA environment with more than one ITS, it makes IPI performance
>>>> various from VM to VM, depending on which NUMA the VM is deployed on.
>>>> We can use closer ITS instead of the last ITS reported.
>>> Closer to *what*? the SGI sender? or the receiver? Something else?
>> VSGI sender.
>> VSGI sender use original find_4_1_its to inject vsgi, it always find the last reported
>> 4_1 ITS, regardless of which NUMA the VSGI sender cpu is located on.
> So your concern is about cross-node MMIO accesses?  You should capture
> this in your commit message.
>
>>>> Modify find_4_1_its to find the ITS of the NUMA node where current
>>>> cpu is located and save it with per cpu variable.
>>> But find_4_1_its() isn't only used for SGIs. Is it valid to do this
>>> trick for all use cases?
>> To consider this case, I've implemented original find_4_1_its function, finding a
>> 4_1 ITS in system and return, even NUMA is not match. Would it  be enough to be
>> compatitable with other code ?
> That's not what I'm asking.
>
> The same helper is also used when invalidating a doorbell, configuring
> VSGIs, tearing down of a VPE. Do you see similar issues with these
> functionalities not a local ITS? Or is your issue specific to VSGI
> generation?

My issue specific to vsgi generation. Running ipi performance benchmark is quite sensitive
to this. Haven't seen other performance drop in any benchmark because of this.

>
>> A new find_4_1_its can firstly select 4_1 ITS on the same NUMA as the current
>> cpu(VSGI sender), and if fail to find, then return 4_1 ITS on other NUMA.
>>
>>>> (There's format issues with the previous patch, resend it)
>>> In the future, please move this sort of comment to a note after the
>>> --- delimiter.
>> ok, get it.
>>
>>>> Signed-off-by: Nianyao Tang <tangnianyao@...wei.com>
>>>> ---
>>>>  drivers/irqchip/irq-gic-v3-its.c | 27 ++++++++++++++++++---------
>>>>  1 file changed, 18 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>>>> index 3c755d5dad6e..d35b42f3b2af 100644
>>>> --- a/drivers/irqchip/irq-gic-v3-its.c
>>>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>>>> @@ -193,6 +193,8 @@ static DEFINE_RAW_SPINLOCK(vmovp_lock);
>>>>  
>>>>  static DEFINE_IDA(its_vpeid_ida);
>>>>  
>>>> +static DEFINE_PER_CPU(struct its_node *, its_on_cpu);
>>> I don't really get the "its_on_cpu" name. "local_its" would at least
>>> indicate a notion being "close".
>> I want to mean ITS on the current cpu NUMA node.
>> Yes, "local_its" is better.
>>
>>>> +
>>>>  #define gic_data_rdist()		(raw_cpu_ptr(gic_rdists->rdist))
>>>>  #define gic_data_rdist_cpu(cpu)		(per_cpu_ptr(gic_rdists->rdist, cpu))
>>>>  #define gic_data_rdist_rd_base()	(gic_data_rdist()->rd_base)
>>>> @@ -4058,19 +4060,25 @@ static struct irq_chip its_vpe_irq_chip = {
>>>>  
>>>>  static struct its_node *find_4_1_its(void)
>>>>  {
>>>> -	static struct its_node *its = NULL;
>>>> +	struct its_node *its = NULL;
>>>> +	struct its_node *its_non_cpu_node = NULL;
>>>> +	int cpu = smp_processor_id();
>>>>  
>>>> -	if (!its) {
>>>> -		list_for_each_entry(its, &its_nodes, entry) {
>>>> -			if (is_v4_1(its))
>>>> -				return its;
>>>> -		}
>>>> +	if (per_cpu(its_on_cpu, cpu))
>>>> +		return per_cpu(its_on_cpu, cpu);
>>>>  
>>>> -		/* Oops? */
>>>> -		its = NULL;
>>>> -	}
>>>> +	list_for_each_entry(its, &its_nodes, entry) {
>>>> +		if (is_v4_1(its) && its->numa_node == cpu_to_node(cpu)) {
>>>> +			per_cpu(its_on_cpu, cpu) = its;
>>>> +			return its;
>>>> +		} else if (is_v4_1(its))
>>>> +			its_non_cpu_node = its;
>>>> +	}
>>> Why do you consider the NUMA node instead of the ITS' own affinity?
>>> SVPET gives you some notion of distance with the RDs, and that'd
>>> probably be useful.
>> I assumed BIOS should report NUMA node following real topology, use NUMA node
>> for simplicity.
> NUMA is about CPU memory access, and doesn't quite describe how the
> ITS fits there. I know people have abused this when the GIC didn't
> have this notion, but we're over that now. The ITS has the most
> precise information (ITS->RD->CPU), and we also need to make this work
> for systems that do not use ACPI.
>
> This is pretty easy to do as we inherit the VPE table from the ITS on
> secondary CPUs, and that's the right spot to populate your table.

Yes, it's better to init this at host kernel start.
MPIDR is more precise for physical topology.

>
> I'd expect something like this to do the trick:
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 40ebf1726393..bfcbc5a46c1c 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -119,6 +119,8 @@ struct its_node {
>  	int			vlpi_redist_offset;
>  };
>  
> +static DEFINE_PER_CPU(struct its_node *, local_4_1_its);
> +
>  #define is_v4(its)		(!!((its)->typer & GITS_TYPER_VLPIS))
>  #define is_v4_1(its)		(!!((its)->typer & GITS_TYPER_VMAPP))
>  #define device_ids(its)		(FIELD_GET(GITS_TYPER_DEVBITS, (its)->typer) + 1)
> @@ -2729,6 +2731,8 @@ static u64 inherit_vpe_l1_table_from_its(void)
>  		}
>  		val |= FIELD_PREP(GICR_VPROPBASER_4_1_SIZE, GITS_BASER_NR_PAGES(baser) - 1);
>  
> +		*this_cpu_ptr(&local_4_1_its) = its;
> +
>  		return val;
>  	}
>  
> @@ -2766,6 +2770,8 @@ static u64 inherit_vpe_l1_table_from_rd(cpumask_t **mask)
>  		gic_data_rdist()->vpe_l1_base = gic_data_rdist_cpu(cpu)->vpe_l1_base;
>  		*mask = gic_data_rdist_cpu(cpu)->vpe_table_mask;
>  
> +		*this_cpu_ptr(&local_4_1_its) = *per_cpu_ptr(&local_4_1_its, cpu);
> +
>  		return val;
>  	}
>  
> @@ -4078,8 +4084,9 @@ static struct irq_chip its_vpe_irq_chip = {
>  
>  static struct its_node *find_4_1_its(void)
>  {
> -	static struct its_node *its = NULL;
> +	struct its_node *its;
>  
> +	its = *this_cpu_ptr(&local_4_1_its);
>  	if (!its) {
>  		list_for_each_entry(its, &its_nodes, entry) {
>  			if (is_v4_1(its))
>
>
> *if* there is no matching affinity between ITS and RDs, you can then
> fallback to NUMA nodes. But using the architected mechanism should be
> the first port of call.

I have tested this and it works.

>
>>>>  
>>>> -	return its;
>>>> +	if (!per_cpu(its_on_cpu, cpu) && its_non_cpu_node)
>>>> +		per_cpu(its_on_cpu, cpu) = its_non_cpu_node;
>>>> +
>>>> +	return its_non_cpu_node;
>>>>  }
>>> Urgh. Mixing init and runtime is awful. Why isn't this initialised
>>> when a CPU comes up? We already have all the infrastructure.
>> The original find_4_1_its use "static struct its_node *its" to save 4_1 ITS, and
>> it's init inside this function. So, to follow this, I tried to not modify this usage.
> Sure, but this is already bad, and you're making it worse. Please take
> this as an opportunity to make things better.
>
>>> But the biggest question is "what sort of performance improvement does
>>> this bring"? You give no numbers, no way to evaluate anything.
>>>
>>> I've asked for that times and times again: if your changes are
>>> claiming a performance improvement, please back it up. It's not that
>>> hard. 
>> On a 2-socket environment, reported as 2-NUMA, each socket with one ITS
>> and 32 cpu, GICv4.1 enabled.
>> For performance, I deploy a 4U8G guest, 4 vcpu on same socket.
>> When I deploy guest on socket0, kvm-unit-tests ipi_hw result is 850ns. It
>> test the delay from one vcpu sending ipi to another vcpu receiving ipi in guest.
>> When I deploy guest on socket1, the result is 750ns.
>> The reason is VSGI sender always use lasted reported ITS to inject VSGI.
>> The access from cpu to other-socket ITS will cost 100ns more compared to cpu
>> to local ITS.
> Right, this is exactly the sort of information I want to capture in a
> commit message. Because a 12% reduction in IPI latency is *huge*, and
> justifies the change. Please add it when you respin the patch.

ok, I will add this performance data to commit message and send respin patch.

>
> Thanks,
>
> 	M.
>