Message-ID: <f24422b7-0985-4583-9d0c-7e8f303197b5@citrix.com>
Date: Wed, 7 Jan 2026 12:01:58 +0000
From: Andrew Cooper <andrew.cooper3@...rix.com>
To: Alexander Potapenko <glider@...gle.com>
Cc: Andrew Cooper <andrew.cooper3@...rix.com>,
LKML <linux-kernel@...r.kernel.org>, Marco Elver <elver@...gle.com>,
Dmitry Vyukov <dvyukov@...gle.com>, Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, Andrew Morton <akpm@...ux-foundation.org>,
Jann Horn <jannh@...gle.com>, kasan-dev@...glegroups.com
Subject: Re: [PATCH] x86/kfence: Avoid writing L1TF-vulnerable PTEs
On 07/01/2026 11:31 am, Alexander Potapenko wrote:
> On Tue, Jan 6, 2026 at 7:04 PM Andrew Cooper <andrew.cooper3@...rix.com> wrote:
>> For native, the choice of PTE is fine. There's real memory backing the
>> non-present PTE. However, for XenPV, Xen complains:
>>
>> (XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing
>>
>> To explain, some background on XenPV pagetables:
>>
>> Xen PV guests control their own pagetables; they choose the new PTE
>> value, and use hypercalls to make changes so Xen can audit them for safety.
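>>
>> As an illustrative sketch (simplified; the real kernel routes this
>> through the pvops set_pte path, with batching), a single PV PTE write
>> is expressed as a request that Xen validates before applying:
>>
>>   /* Sketch only, using the standard Xen interfaces. */
>>   #include <xen/interface/xen.h>   /* struct mmu_update, DOMID_SELF */
>>   #include <asm/xen/hypercall.h>   /* HYPERVISOR_mmu_update() */
>>   #include <asm/xen/page.h>        /* virt_to_machine(), pte_val_ma() */
>>
>>   static int pv_write_pte(pte_t *ptep, pte_t newpte)
>>   {
>>           struct mmu_update u = {
>>                   /* Machine address of the PTE slot; low bits encode the op. */
>>                   .ptr = virt_to_machine(ptep).maddr | MMU_NORMAL_PT_UPDATE,
>>                   /* Machine view of the new PTE value. */
>>                   .val = pte_val_ma(newpte),
>>           };
>>
>>           /* Xen audits the new value before it is installed. */
>>           return HYPERVISOR_mmu_update(&u, 1, NULL, DOMID_SELF);
>>   }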
>>
>> In addition to a regular reference count, Xen also maintains a type
>> reference count. e.g. SegDesc (referenced by vGDT/vLDT),
>> Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a
>> lower pagetable level). This prevents e.g. a page for which the guest
>> has a writable mapping from being inserted into the pagetables.
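>>
>> (A toy model of that invariant, purely for illustration and not Xen's
>> actual code:)
>>
>>   #include <stdbool.h>
>>
>>   /* Hypothetical sketch: one exclusive type per page, with a count. */
>>   enum pg_type { PGT_NONE, PGT_WRITABLE, PGT_L1, PGT_SEGDESC };
>>
>>   struct page_meta {
>>           enum pg_type type;
>>           unsigned long type_count;
>>   };
>>
>>   static bool get_page_type(struct page_meta *pg, enum pg_type t)
>>   {
>>           /* A typed page cannot take a second, conflicting type, so a
>>            * writably-mapped page can never be pinned as a pagetable. */
>>           if (pg->type_count && pg->type != t)
>>                   return false;
>>           pg->type = t;
>>           pg->type_count++;
>>           return true;
>>   }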
>>
>> For non-present mappings, all other bits become software accessible, and
>> typically contain metadata rather than a real frame address. There is nothing
>> that a reference count could sensibly be tied to. As such, even if Xen
>> could recognise the address as currently safe, nothing would prevent that
>> frame from changing owner to another VM in the future.
>>
>> When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds
>> by activating shadow paging. This is normally only used for the live
>> phase of migration, and comes with a non-trivial overhead.
>>
>> KFENCE only cares about getting #PF to catch wild accesses; it doesn't care
>> about the value for non-present mappings. Use a fully inverted PTE to
>> avoid hitting the slow path when running under Xen.
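>>
>> (A minimal userspace demonstration of the difference; the masks assume
>> 4-level paging and are for illustration only:)
>>
>>   #include <stdint.h>
>>   #include <stdio.h>
>>
>>   #define PAGE_PRESENT 0x1ULL
>>   #define ADDR_MASK    0x0000fffffffff000ULL /* PTE address bits */
>>
>>   int main(void)
>>   {
>>           uint64_t pte = 0x18200000ULL | 0x67ULL; /* present 4k mapping */
>>
>>           /* Clearing only Present leaves the address bits naming a real,
>>            * cacheable frame - exactly what L1TF speculates on. */
>>           uint64_t vulnerable = pte & ~PAGE_PRESENT;
>>
>>           /* Inverting the whole value sets the high address bits, so the
>>            * "address" points far outside populated physical memory. */
>>           uint64_t inverted = ~pte;
>>
>>           printf("old: %#018llx\n", (unsigned long long)(vulnerable & ADDR_MASK));
>>           printf("new: %#018llx\n", (unsigned long long)(inverted & ADDR_MASK));
>>           return 0;
>>   }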
>>
>> While adjusting the logic, take the opportunity to skip all actions if
>> the PTE is already in the right state, halve the number of PVOps
>> callouts, and skip TLB maintenance on a !P -> P transition, which
>> benefits non-Xen cases too.
>>
>> Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
>> Tested-by: Marco Elver <elver@...gle.com>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@...rix.com>
> Reviewed-by: Alexander Potapenko <glider@...gle.com>
Thanks.
>
>> /*
>> * We need to avoid IPIs, as we may get KFENCE allocations or faults
>> * with interrupts disabled. Therefore, the below is best-effort, and
>> @@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
>> * lazy fault handling takes care of faults after the page is PRESENT.
>> */
> Nit: should this comment be moved above the set_pte() call, or merged with
> the following comment block?
Hmm, probably merged, as they're both about the TLB maintenance. But the
end result is a far messier diff:
@@ -42,23 +42,40 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
 {
 	unsigned int level;
 	pte_t *pte = lookup_address(addr, &level);
+	pteval_t val;
 
 	if (WARN_ON(!pte || level != PG_LEVEL_4K))
 		return false;
 
+	val = pte_val(*pte);
+
 	/*
-	 * We need to avoid IPIs, as we may get KFENCE allocations or faults
-	 * with interrupts disabled. Therefore, the below is best-effort, and
-	 * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
-	 * lazy fault handling takes care of faults after the page is PRESENT.
+	 * protect requires making the page not-present. If the PTE is
+	 * already in the right state, there's nothing to do.
+	 */
+	if (protect != !!(val & _PAGE_PRESENT))
+		return true;
+
+	/*
+	 * Otherwise, invert the entire PTE. This avoids writing out an
+	 * L1TF-vulnerable PTE (not present, without the high address bits
+	 * set).
 	 */
+	set_pte(pte, __pte(~val));
 
-	if (protect)
-		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
-	else
-		set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+	/*
+	 * If the page was protected (non-present) and we're making it
+	 * present, there is no need to flush the TLB at all.
+	 */
+	if (!protect)
+		return true;
 
 	/*
+	 * We need to avoid IPIs, as we may get KFENCE allocations or faults
+	 * with interrupts disabled. Therefore, the below is best-effort, and
+	 * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
+	 * lazy fault handling takes care of faults after the page is PRESENT.
+	 *
 	 * Flush this CPU's TLB, assuming whoever did the allocation/free is
 	 * likely to continue running on this CPU.
 	 */
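For reference, folding the hunks together, the function should end up
looking roughly like this (the preempt/flush tail after the final context
lines is, I believe, unchanged from the current tree):

static inline bool kfence_protect_page(unsigned long addr, bool protect)
{
	unsigned int level;
	pte_t *pte = lookup_address(addr, &level);
	pteval_t val;

	if (WARN_ON(!pte || level != PG_LEVEL_4K))
		return false;

	val = pte_val(*pte);

	/*
	 * protect requires making the page not-present. If the PTE is
	 * already in the right state, there's nothing to do.
	 */
	if (protect != !!(val & _PAGE_PRESENT))
		return true;

	/*
	 * Otherwise, invert the entire PTE. This avoids writing out an
	 * L1TF-vulnerable PTE (not present, without the high address bits
	 * set).
	 */
	set_pte(pte, __pte(~val));

	/*
	 * If the page was protected (non-present) and we're making it
	 * present, there is no need to flush the TLB at all.
	 */
	if (!protect)
		return true;

	/*
	 * We need to avoid IPIs, as we may get KFENCE allocations or faults
	 * with interrupts disabled. Therefore, the below is best-effort, and
	 * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
	 * lazy fault handling takes care of faults after the page is PRESENT.
	 *
	 * Flush this CPU's TLB, assuming whoever did the allocation/free is
	 * likely to continue running on this CPU.
	 */
	preempt_disable();
	flush_tlb_one_kernel(addr);
	preempt_enable();
	return true;
}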
I need to resubmit anyway, because I've spotted one silly error in the
commit message.
I could submit two patches, with the second one described as "to make the
previous patch legible".
Thoughts?
~Andrew