Message-ID: <49eee1cb79b75b02b8ed19a7f6d39e1ee8fae171.camel@surriel.com>
Date: Thu, 19 Jun 2025 21:10:47 -0400
From: Rik van Riel <riel@...riel.com>
To: Nadav Amit <nadav.amit@...il.com>, linux-kernel@...r.kernel.org
Cc: kernel-team@...a.com, dave.hansen@...ux.intel.com, luto@...nel.org, 
	peterz@...radead.org, bp@...en8.de, x86@...nel.org, seanjc@...gle.com, 
	tglx@...utronix.de, mingo@...nel.org, Yu-cheng Yu <yu-cheng.yu@...el.com>
Subject: Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request

On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
> 
> > +/*
> > + * Reduce contention for the RAR payloads by having a small number of
> > + * CPUs share a RAR payload entry, instead of a free for all with
> > + * all CPUs.
> > + */
> > +struct rar_lock {
> > +	union {
> > +		raw_spinlock_t lock;
> > +		char __padding[SMP_CACHE_BYTES];
> > +	};
> > +};
> 
> I think you can lose the __padding and instead have
> ____cacheline_aligned (and then you won't need the union).
> 
I tried that initially, but the compiler was unhappy
with __cacheline_aligned inside the definition of a
struct.
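
If the complaint came from the two-underscore __cacheline_aligned
(which also carries a section attribute that cannot be attached to a
type), the attribute-only ____cacheline_aligned variant might still be
an option. A minimal sketch, purely as an illustration of the
suggestion above:

	/* Each element gets its own cache line via the type's alignment. */
	struct rar_lock {
		raw_spinlock_t lock;
	} ____cacheline_aligned;

	static struct rar_lock rar_locks[RAR_MAX_PAYLOADS];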

> > +
> > +static struct rar_lock rar_locks[RAR_MAX_PAYLOADS] __cacheline_aligned;
> > +
> > +/*
> > + * The action vector tells each CPU which payload table entries
> > + * have work for that CPU.
> > + */
> > +static DEFINE_PER_CPU_ALIGNED(u8[RAR_MAX_PAYLOADS], rar_action);
> > +
> > +/*
> > + * TODO: group CPUs together based on locality in the system instead
> > + * of CPU number, to further reduce the cost of contention.
> > + */
> > +static int cpu_rar_payload_number(void)
> > +{
> > +	int cpu = raw_smp_processor_id();
> 
> Why raw_* ?

I'll change that to regular smp_processor_id()
for the next version.
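
Roughly this (sketch of the v5 change only); smp_processor_id() also
gets us the debug check that callers have preemption disabled:

	static int cpu_rar_payload_number(void)
	{
		/* smp_processor_id() warns if preemption is not disabled. */
		return smp_processor_id() % rar_max_payloads;
	}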

> 
> > +	return cpu % rar_max_payloads;
> > +}
> > +
> > +static int get_payload_slot(void)
> > +{
> > +	int payload_nr = cpu_rar_payload_number();
> > +	raw_spin_lock(&rar_locks[payload_nr].lock);
> > +	return payload_nr;
> > +}
> 
> I think it would be better to open-code it to improve readability. If
> you choose not to, I think you should use sparse annotations (e.g.,
> __acquires()).

Good point about the annotations. This can indeed
be open coded, since any future improvement here,
for example having cpu_rar_payload_number() take
topology into account to reduce the cost of contention,
will live in that helper function anyway.
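
As a rough illustration of the open-coded form (the surrounding
function here is hypothetical; the lock array and helper names are the
ones from the patch):

	static void rar_flush_one_example(void)
	{
		int payload_nr = cpu_rar_payload_number();

		raw_spin_lock(&rar_locks[payload_nr].lock);
		/* ... fill in the payload entry and kick off the RAR flush ... */
		raw_spin_unlock(&rar_locks[payload_nr].lock);
	}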

> 
> > +
> > +static void free_payload_slot(unsigned long payload_nr)
> > +{
> > +	raw_spin_unlock(&rar_locks[payload_nr].lock);
> > +}
> > +
> > +static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
> > +			long pages)
> > +{
> > +	p->must_be_zero_1	= 0;
> > +	p->must_be_zero_2	= 0;
> > +	p->must_be_zero_3	= 0;
> > +	p->page_size		= RAR_INVLPG_PAGE_SIZE_4K;
> 
> I think you can propagate the stride to this point instead of using a
> fixed 4KB stride.

Agreed. That's another future optimization, for
once all of this code works correctly.

Currently I am in a situation where the receiving
CPU clears the action vector from RAR_PENDING to
RAR_SUCCESS, but the TLB does not appear to always
be correctly flushed.

> 
> > +	p->type			= RAR_TYPE_INVPCID;
> > +	p->pcid			= pcid;
> > +	p->linear_address	= start;
> > +
> > +	if (pcid) {
> > +		/* RAR invalidation of the mapping of a specific process. */
> > +		if (pages < RAR_INVLPG_MAX_PAGES) {
> > +			p->num_pages = pages;
> > +			p->subtype = RAR_INVPCID_ADDR;
> > +		} else {
> > +			p->subtype = RAR_INVPCID_PCID;
> 
> I wonder whether it would be safer to set p->num_pages to something
> (then we can do it unconditionally).

We have a limited number of bits available for
p->num_pages. I'm not sure we want to try
writing a larger number than what fits in those
bits.

> 
> > +		}
> > +	} else {
> > +		/*
> > +		 * Unfortunately RAR_INVPCID_ADDR excludes global translations.
> > +		 * Always do a full flush for kernel invalidations.
> > +		 */
> > +		p->subtype = RAR_INVPCID_ALL;
> > +	}
> > +
> > +	/* Ensure all writes are visible before the action entry is set. */
> > +	smp_wmb();
> 
> Maybe you can drop the smp_wmb() here and instead change the
> WRITE_ONCE() in set_action_entry() to smp_store_release()? It should
> have the same effect, and I think it would be cleaner and convey your
> intent better.
> 
We need protection against two different things here.

1) Receiving CPUs must see all the writes done by
   the originating CPU before we send the RAR IPI.

2) Receiving CPUs must see all the writes done by
   set_payload() before the write done by
   set_action_entry(), in case another CPU sends
   the RAR IPI before we do.

   That other RAR IPI could even be sent between
   when we write the payload, and when we write
   the action entry. The receiving CPU could take
   long enough processing other RAR payloads that
   it can see our action entry after we write it.

Does removing the smp_wmb() still leave everything
safe in that scenario?
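
To make the two variants being compared explicit (sketch only; the
wrapper functions here are hypothetical, the other names are from the
patch):

	/* Current form: explicit barrier between payload and action writes. */
	static void publish_action_wmb(u8 *rar_actions, unsigned long payload_nr)
	{
		/* ... set_payload() has written the payload entry ... */
		smp_wmb();	/* payload writes visible before the action entry */
		WRITE_ONCE(rar_actions[payload_nr], RAR_PENDING);
	}

	/* Nadav's suggestion: fold the barrier into the publishing store. */
	static void publish_action_release(u8 *rar_actions, unsigned long payload_nr)
	{
		/* ... set_payload() has written the payload entry ... */
		smp_store_release(&rar_actions[payload_nr], RAR_PENDING);
	}

A release store orders all of the preceding payload writes before the
action-entry store itself; whether that alone is enough for both cases
above is exactly the question.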


> > +}
> > +
> > +static void set_action_entry(unsigned long payload_nr, int target_cpu)
> > +{
> > +	u8 *bitmap = per_cpu(rar_action, target_cpu);
> 
> bitmap? It doesn't look like one...

I'll rename this one to rar_actions, like I did
in wait_for_action_done().

> 
> > +static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
> > +{
> > +	u8 status;
> > +	u8 *rar_actions = per_cpu(rar_action, target_cpu);
> > +
> > +	status = READ_ONCE(rar_actions[payload_nr]);
> > +
> > +	while (status == RAR_PENDING) {
> > +		cpu_relax();
> > +		status = READ_ONCE(rar_actions[payload_nr]);
> > +	}
> > +
> > +	WARN_ON_ONCE(rar_actions[payload_nr] != RAR_SUCCESS);
> 
> WARN_ON_ONCE(status != RAR_SUCCESS)

I'll add that cleanup for v5, too.
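
For v5 the function would then look roughly like this (sketch of the
agreed cleanup, not the final code):

	static void wait_for_action_done(unsigned long payload_nr, int target_cpu)
	{
		u8 *rar_actions = per_cpu(rar_action, target_cpu);
		u8 status = READ_ONCE(rar_actions[payload_nr]);

		while (status == RAR_PENDING) {
			cpu_relax();
			status = READ_ONCE(rar_actions[payload_nr]);
		}

		/* Reuse the value already read instead of re-reading the entry. */
		WARN_ON_ONCE(status != RAR_SUCCESS);
	}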

> 
> > +}
> > +
> > +void rar_cpu_init(void)
> > +{
> > +	u8 *bitmap;
> > +	u64 r;
> > +
> > +	/* Check if this CPU was already initialized. */
> > +	rdmsrl(MSR_IA32_RAR_PAYLOAD_BASE, r);
> > +	if (r == (u64)virt_to_phys(rar_payload))
> > +		return;
> 
> Seems a bit of a risky test. If anything, I would check that the MSR
> that is supposed to be set *last* (MSR_IA32_RAR_CTRL) has the expected
> value. But it would still be best to initialize the MSRs unconditionally
> or to avoid repeated initialization using a different scheme.
> 
Whatever different scheme we use must be able to deal
with CPU hotplug and suspend/resume. There are legitimate
cases where rar_cpu_init() is called, and the in-MSR
state does not match the in-memory state.

You are right that we could always unconditionally
write the MSRs.

However, I am not entirely convinced that overwriting
the per-CPU rar_action array with all zeroes (RAR_SUCCESS)
is always safe to do without some sort of guard.

I suppose it might be, since if we are in rar_cpu_init(),
current->mm should be init_mm, this CPU's bit in
cpu_online_mask is not set, and we don't have to worry
all that much about TLB flushes at that point?
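
For comparison, a sketch of the check keyed off the last-written MSR,
as suggested above. RAR_CTRL_ENABLE is a placeholder for whatever
enable bit the real MSR layout defines; the rest of the names are from
the patch:

	void rar_cpu_init(void)
	{
		u64 ctrl;

		/* Only skip re-init if the MSR written *last* says we are enabled. */
		rdmsrl(MSR_IA32_RAR_CTRL, ctrl);
		if (ctrl & RAR_CTRL_ENABLE)
			return;

		/* ... program payload base, action vector, and finally RAR_CTRL ... */
	}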

> > 
> > +	/* If this CPU supports less than RAR_MAX_PAYLOADS, lower our limit. */
> > +	if (max_payloads < rar_max_payloads)
> > +		rar_max_payloads = max_payloads;
> > +	pr_info("RAR: support %d payloads\n", max_payloads);
> > +
> > +	for (r = 0; r < rar_max_payloads; r++)
> > +		rar_locks[r].lock = __RAW_SPIN_LOCK_UNLOCKED(rar_lock);
> 
> Not a fan of the reuse of r for different purposes.

Fair enough, I'll use a separate variable there.

> 
> > +/*
> > + * Inspired by smp_call_function_many(), but RAR requires a global payload
> > + * table rather than per-CPU payloads in the CSD table, because the action
> > + * handler is microcode rather than software.
> > + */
> > +void smp_call_rar_many(const struct cpumask *mask, u16 pcid,
> > +		       unsigned long start, unsigned long end)
> > +{
> > +	unsigned long pages = (end - start + PAGE_SIZE) / PAGE_SIZE;
> 
> I think "end" is not inclusive. See for instance flush_tlb_page()
> where 
> "end" is set to "a + PAGE_SIZE". So this would flush one extra page.
> 
It gets better. Once we add in a "stride" argument, we
may end up with a range that covers only the first
4kB of one of the 2MB entries the calling code wanted
to invalidate. I fell into that trap already with the
INVLPGB code :)

I'll look into simplifying this for the next version,
probably with only 4k PAGE_SIZE at first. We can think
about stride later.
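
For the simple 4k-only version, with an exclusive "end" as in
flush_tlb_page() (end = addr + PAGE_SIZE for a single page), the count
would just be:

	/* "end" is exclusive, so no extra page is added. */
	unsigned long pages = (end - start) >> PAGE_SHIFT;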

> > 
> > +	 * This code cannot use the should_flush_tlb() logic here because
> > +	 * RAR flushes do not update the tlb_gen, resulting in unnecessary
> > +	 * flushes at context switch time.
> > +	 */
> > +	dest_mask = this_cpu_ptr(&rar_cpu_mask);
> > +	cpumask_and(dest_mask, mask, cpu_online_mask);
> > +
> > +	/* Some callers race with other CPUs changing the passed mask */
> > +	if (unlikely(!cpumask_weight(dest_mask)))
> 
> cpumask_and() returns "false if *@...p is empty, else returns true",
> so you can use this value instead of calling cpumask_weight().

Will do, thank you.
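
That would turn into something like the following (sketch; the exact
unwind path in the surrounding code may differ):

	dest_mask = this_cpu_ptr(&rar_cpu_mask);

	/* cpumask_and() returns false when the resulting mask is empty. */
	if (unlikely(!cpumask_and(dest_mask, mask, cpu_online_mask)))
		return;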



-- 
All Rights Reversed.
