Message-ID: <f230df34-2959-47f9-9131-ac233d85027e@gmail.com>
Date: Sat, 21 Jun 2025 00:24:37 +0300
From: Nadav Amit <nadav.amit@...il.com>
To: Rik van Riel <riel@...riel.com>, linux-kernel@...r.kernel.org
Cc: kernel-team@...a.com, dave.hansen@...ux.intel.com, luto@...nel.org,
peterz@...radead.org, bp@...en8.de, x86@...nel.org, seanjc@...gle.com,
tglx@...utronix.de, mingo@...nel.org, Yu-cheng Yu <yu-cheng.yu@...el.com>
Subject: Re: [RFC PATCH v4 5/8] x86/mm: Introduce Remote Action Request
On 20/06/2025 4:10, Rik van Riel wrote:
> On Fri, 2025-06-20 at 02:01 +0300, Nadav Amit wrote:
>>> +
>>> +static void set_payload(struct rar_payload *p, u16 pcid, unsigned long start,
>>> +			 long pages)
>>> +{
>>> + p->must_be_zero_1 = 0;
>>> + p->must_be_zero_2 = 0;
>>> + p->must_be_zero_3 = 0;
>>> + p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
>>
>> I think you can propagate the stride to this point instead of using a
>> fixed 4KB stride.
>
> Agreed. That's another future optimization, for
> once this code all works right.
It might not be just an optimization if it leads to a performance
regression relative to the current software-based flush, which does take
the stride into account. IOW: flushing a 2MB huge page would end up doing
a full TLB flush with RAR, in contrast to selectively flushing the
specific entry with the software-based shootdown. Correct me if I'm wrong.
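Something along these lines is roughly what I had in mind (completely
untested sketch; RAR_INVLPG_PAGE_SIZE_2M is an assumed encoding name, and
I am assuming the callers can pass down the stride_shift that the existing
flush code already carries):

	static void set_payload(struct rar_payload *p, u16 pcid,
				unsigned long start, long pages,
				unsigned int stride_shift)
	{
		...
		/* Pick the RAR page-size encoding from the flush stride. */
		if (stride_shift == PMD_SHIFT)
			p->page_size = RAR_INVLPG_PAGE_SIZE_2M; /* assumed */
		else
			p->page_size = RAR_INVLPG_PAGE_SIZE_4K;
		...
	}

But as you said, that can wait until the basic flow works correctly.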
>
> Currently I am in a situation where the receiving
> CPU clears the action vector from RAR_PENDING to
> RAR_SUCCESS, but the TLB does not appear to always
> be correctly flushed.
Interesting. I really do not know anything about it, but you may want to
look at the performance counters to see whether the flush is at least
reported to take place.
>
>>
>>> + p->type = RAR_TYPE_INVPCID;
>>> + p->pcid = pcid;
>>> + p->linear_address = start;
>>> +
>>> + if (pcid) {
>>> + /* RAR invalidation of the mapping of a specific process. */
>>> + if (pages < RAR_INVLPG_MAX_PAGES) {
>>> + p->num_pages = pages;
>>> + p->subtype = RAR_INVPCID_ADDR;
>>> + } else {
>>> + p->subtype = RAR_INVPCID_PCID;
>>
>> I wonder whether it would be safer to set something to p->num_pages.
>> (then we can do it unconditionally)
>
> We have a limited number of bits available for
> p->num_pages. I'm not sure we want to try
> writing a larger number than what fits in those
> bits.
I was just looking for a way to simplify the code and at the same time
avoid an "undefined" state. History has shown that all kinds of unintended
consequences happen when some state is left uninitialized (even if the
spec does not require it to be initialized).
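Concretely, I meant something like the following (untested; clamping with
min_t() is just one way to keep the value inside the num_pages field
width):

	/* Always give num_pages a defined value, even when it is ignored. */
	p->num_pages = min_t(long, pages, RAR_INVLPG_MAX_PAGES - 1);

	if (pcid) {
		p->subtype = pages < RAR_INVLPG_MAX_PAGES ? RAR_INVPCID_ADDR
							  : RAR_INVPCID_PCID;
	} else {
		/* RAR_INVPCID_ADDR excludes global translations. */
		p->subtype = RAR_INVPCID_ALL;
	}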
>
>>
>>> + }
>>> + } else {
>>> + /*
>>> + * Unfortunately RAR_INVPCID_ADDR excludes global translations.
>>> + * Always do a full flush for kernel invalidations.
>>> + */
>>> + p->subtype = RAR_INVPCID_ALL;
>>> + }
>>> +
>>> + /* Ensure all writes are visible before the action entry is set. */
>>> + smp_wmb();
>>
>> Maybe you can drop the smp_wmb() here and instead change the
>> WRITE_ONCE() in set_action_entry() to smp_store_release() ? It should
>> have the same effect and I think would be cleaner and convey your
>> intent
>> better.
>>
> We need protection against two different things here.
>
> 1) Receiving CPUs must see all the writes done by
> the originating CPU before we send the RAR IPI.
>
> 2) Receiving CPUs must see all the writes done by
> set_payload() before the write done by
> set_action_entry(), in case another CPU sends
> the RAR IPI before we do.
>
> That other RAR IPI could even be sent between
> when we write the payload, and when we write
> the action entry. The receiving CPU could take
> long enough processing other RAR payloads that
> it can see our action entry after we write it.
>
> Does removing the smp_wmb() still leave everything
> safe in that scenario?
Admittedly, I do not understand your concern. IIUC, until
set_action_entry() is called and RAR_PENDING is set, IPIs are irrelevant
and the RAR entry will not be processed. Once it is set, you want all of
the prior writes to be visible. Semantically, that is exactly what
smp_store_release() means.

Technically, smp_wmb() + WRITE_ONCE() is equivalent to smp_store_release()
on x86: smp_wmb() is actually just a compiler barrier, since x86 follows
the TSO memory model, and smp_store_release() is a compiler barrier +
WRITE_ONCE(). So I think it is all safe and at the same time clearer.
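To make it concrete, what I am suggesting is roughly the following
(untested; the signature of set_action_entry() and the action-vector
indexing are my guesses, not the actual code):

	static void set_action_entry(u8 *action, unsigned int idx)
	{
		/*
		 * Publish the entry: all of the writes done by
		 * set_payload() are ordered before the store that marks
		 * the entry pending, so the separate smp_wmb() in
		 * set_payload() can go away.
		 */
		smp_store_release(&action[idx], RAR_PENDING);
	}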