Message-ID: <400cf9ab-de3f-4e8a-ab0a-4ac68c534bb8@intel.com>
Date: Wed, 27 Aug 2025 16:31:47 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Baolu Lu <baolu.lu@...ux.intel.com>, "Tian, Kevin"
 <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
 Robin Murphy <robin.murphy@....com>, Jann Horn <jannh@...gle.com>,
 Vasant Hegde <vasant.hegde@....com>, Alistair Popple <apopple@...dia.com>,
 Peter Zijlstra <peterz@...radead.org>, Uladzislau Rezki <urezki@...il.com>,
 Jean-Philippe Brucker <jean-philippe@...aro.org>,
 Andy Lutomirski <luto@...nel.org>, "Lai, Yi1" <yi1.lai@...el.com>,
 "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
 "security@...nel.org" <security@...nel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "stable@...r.kernel.org" <stable@...r.kernel.org>, vishal.moola@...il.com,
 Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB
 flush

On 8/27/25 03:58, Baolu Lu wrote:
> Following the insights above, I wrote the code as follows. Does it look
> good?

I'd really prefer an actual diff that compiles. Because:

> #ifdef CONFIG_ASYNC_PGTABLE_FREE
> /* a 'struct ptdesc' that needs its destructor run */
> #define ASYNC_PGTABLE_FREE_DTOR    BIT(NR_PAGEFLAGS)

This doesn't work, does it? Don't you need to _allocate_ a new bit?
Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;)

Also, I'm pretty sure you can't just go setting random bits in this
because it aliases with page flags.
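
Just to illustrate how fragile that is (not a suggestion to keep this in
__page_flags): even the overflow case alone would want a build-time
guard, something like:

/* Illustration only: BIT(NR_PAGEFLAGS) silently overflows the flags word
 * once NR_PAGEFLAGS reaches BITS_PER_LONG, and any bit below that aliases
 * a real page flag. A guard like this would at least catch the overflow: */
#define ASYNC_PGTABLE_FREE_DTOR	BIT(NR_PAGEFLAGS)
static_assert(NR_PAGEFLAGS < BITS_PER_LONG,
	      "no spare bit above the page flags for ASYNC_PGTABLE_FREE_DTOR");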

...
> static void kernel_pgtable_work_func(struct work_struct *work)
> {
>     struct ptdesc *ptdesc, *next;
>     LIST_HEAD(page_list);
> 
>     spin_lock(&kernel_pgtable_work.lock);
>     list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
>     spin_unlock(&kernel_pgtable_work.lock);
> 
>     iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
> 
>     list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) {
>         list_del(&ptdesc->pt_list);
>         if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR)
>             pagetable_dtor_free(ptdesc);
>         else
>             __free_page(ptdesc_page(ptdesc));
>     }
> }
What I had in mind was that kernel_pgtable_work_func() only does:

	__free_pages(page, compound_order(page));

All of the things that queue gunk to the list do the legwork of making
the work_func() simple. This also guides them toward providing a single,
compound page, because _they_ have to do the work if they don't want
something simple.
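
Roughly this shape (untested sketch, reusing the kernel_pgtable_work
lock/list/work triple from the code above, and assuming the queuers hand
over plain struct pages chained on page->lru):

static void kernel_pgtable_work_func(struct work_struct *work)
{
	struct page *page, *next;
	LIST_HEAD(page_list);

	spin_lock(&kernel_pgtable_work.lock);
	list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
	spin_unlock(&kernel_pgtable_work.lock);

	/* flush IOTLB-cached kernel mappings before the pages get reused */
	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);

	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);
		__free_pages(page, compound_order(page));
	}
}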

> void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc)
> {
>     spin_lock(&kernel_pgtable_work.lock);
>     ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR;
>     list_add(&ptdesc->pt_list, &kernel_pgtable_work.list);
>     spin_unlock(&kernel_pgtable_work.lock);
> 
>     schedule_work(&kernel_pgtable_work.work);
> }
> 
> void kernel_pgtable_async_free_page_list(struct list_head *list)
> {
>     spin_lock(&kernel_pgtable_work.lock);
>     list_splice_tail(list, &kernel_pgtable_work.list);
>     spin_unlock(&kernel_pgtable_work.lock);
> 
>     schedule_work(&kernel_pgtable_work.work);
> }
> 
> void kernel_pgtable_async_free_page(struct page *page)
> {
>     spin_lock(&kernel_pgtable_work.lock);
>     list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list);
>     spin_unlock(&kernel_pgtable_work.lock);
> 
>     schedule_work(&kernel_pgtable_work.work);
> }

I wouldn't have three of these, I'd have _one_. If you want to free a
bunch of pages, then just call it a bunch of times. This is not
performance critical.
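
Basically the last of the three above, collapsed into the one entry
point (sketch only, queuing plain pages on page->lru rather than
ptdesc->pt_list):

void kernel_pgtable_async_free_page(struct page *page)
{
	spin_lock(&kernel_pgtable_work.lock);
	list_add(&page->lru, &kernel_pgtable_work.list);
	spin_unlock(&kernel_pgtable_work.lock);

	schedule_work(&kernel_pgtable_work.work);
}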

Oh, and the ptdesc flag shouldn't mean "I need to be asynchronously
freed". Ideally, all kernel PTE pages would get it set at allocation so
you can do this:

static inline void pagetable_free(struct ptdesc *pt)
{
	struct page *page = ptdesc_page(pt);

	if (pt->some_field & PTDESC_KERNEL)
		kernel_pgtable_async_free_page(page);
	else
		__free_pages(page, compound_order(page));
}

The folks that, today, call pagetable_dtor_free() don't have to do
_anything_ at free time. They just set the PTDESC_KERNEL bit at
allocation time.

See how that code reads? "If you have a kernel page table page, you must
asynchronously free it." That's actually meaningful.

I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So
I'd just do this:

#define PTDESC_TYPE_KERNEL BIT(0)

Then, something needs to set:

	ptdesc->__page_type |= PTDESC_TYPE_KERNEL;

That could happen as late as here:

static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
	struct ptdesc *ptdesc = virt_to_ptdesc(pte);

	/* here: */
	ptdesc->__page_type |= PTDESC_TYPE_KERNEL;

	pagetable_dtor_free(ptdesc);
}

Or as early as ptdesc allocation/constructor time. Then you check for
PTDESC_TYPE_KERNEL in pagetable_free().
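
For the "as early as allocation" variant, it could look roughly like
this (the wrapper name is made up purely for illustration; only
pagetable_alloc() and __page_type come from the existing code):

static inline struct ptdesc *pagetable_alloc_kernel(gfp_t gfp, unsigned int order)
{
	struct ptdesc *ptdesc = pagetable_alloc(gfp, order);

	if (ptdesc)
		/* tag kernel page table pages so pagetable_free() can
		 * route them to the asynchronous free path */
		ptdesc->__page_type |= PTDESC_TYPE_KERNEL;

	return ptdesc;
}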
