[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <400cf9ab-de3f-4e8a-ab0a-4ac68c534bb8@intel.com>
Date: Wed, 27 Aug 2025 16:31:47 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Baolu Lu <baolu.lu@...ux.intel.com>, "Tian, Kevin"
<kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>
Cc: Joerg Roedel <joro@...tes.org>, Will Deacon <will@...nel.org>,
Robin Murphy <robin.murphy@....com>, Jann Horn <jannh@...gle.com>,
Vasant Hegde <vasant.hegde@....com>, Alistair Popple <apopple@...dia.com>,
Peter Zijlstra <peterz@...radead.org>, Uladzislau Rezki <urezki@...il.com>,
Jean-Philippe Brucker <jean-philippe@...aro.org>,
Andy Lutomirski <luto@...nel.org>, "Lai, Yi1" <yi1.lai@...el.com>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
"security@...nel.org" <security@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"stable@...r.kernel.org" <stable@...r.kernel.org>, vishal.moola@...il.com,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB
flush
On 8/27/25 03:58, Baolu Lu wrote:
> Following the insights above, I wrote the code as follows. Does it look
> good?
I'd really prefer an actual diff that compiles. Because:
> #ifdef CONFIG_ASYNC_PGTABLE_FREE
> /* a 'struct ptdesc' that needs its destructor run */
> #define ASYNC_PGTABLE_FREE_DTOR BIT(NR_PAGEFLAGS)
This doesn't work, does it? Don't you need to _allocate_ a new bit?
Wouldn't this die a hilariously horrible death if NR_PAGEFLAGS==64? ;)
Also, I'm pretty sure you can't just go setting random bits in this
because it aliases with page flags.
...
> static void kernel_pgtable_work_func(struct work_struct *work)
> {
> struct ptdesc *ptdesc, *next;
> LIST_HEAD(page_list);
>
> spin_lock(&kernel_pgtable_work.lock);
> list_splice_tail_init(&kernel_pgtable_work.list, &page_list);
> spin_unlock(&kernel_pgtable_work.lock);
>
> iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
>
> list_for_each_entry_safe(ptdesc, next, &page_list, pt_list) {
> list_del(&ptdesc->pt_list);
> if (ptdesc->__page_flags & ASYNC_PGTABLE_FREE_DTOR)
> pagetable_dtor_free(ptdesc);
> else
> __free_page(ptdesc_page(ptdesc));
> }
> }
What I had in mind was that kernel_pgtable_work_func() only does:
__free_pages(page, compound_order(page));
All of the things that queue gunk to the list do the legwork of making
the work_func() simple. This also guides them toward proving a single,
compound page because _they_ have to do the work if they don't want
something simple.
> void kernel_pgtable_async_free_dtor(struct ptdesc *ptdesc)
> {
> spin_lock(&kernel_pgtable_work.lock);
> ptdesc->__page_flags |= ASYNC_PGTABLE_FREE_DTOR;
> list_add(&ptdesc->pt_list, &kernel_pgtable_work.list);
> spin_unlock(&kernel_pgtable_work.lock);
>
> schedule_work(&kernel_pgtable_work.work);
> }
>
> void kernel_pgtable_async_free_page_list(struct list_head *list)
> {
> spin_lock(&kernel_pgtable_work.lock);
> list_splice_tail(list, &kernel_pgtable_work.list);
> spin_unlock(&kernel_pgtable_work.lock);
>
> schedule_work(&kernel_pgtable_work.work);
> }
>
> void kernel_pgtable_async_free_page(struct page *page)
> {
> spin_lock(&kernel_pgtable_work.lock);
> list_add(&page_ptdesc(page)->pt_list, &kernel_pgtable_work.list);
> spin_unlock(&kernel_pgtable_work.lock);
>
> schedule_work(&kernel_pgtable_work.work);
> }
I wouldn't have three of these, I'd have _one_. If you want to free a
bunch of pages, then just call it a bunch of times. This is not
performance critical.
Oh, and the ptdesc flag shouldn't be for "I need to be asynchronously
freed". All kernel PTE pages should ideally set it at allocation so you
can do this:
static inline void pagetable_free(struct ptdesc *pt)
{
struct page *page = ptdesc_page(pt);
if (ptdesc->some_field | PTDESC_KERNEL)
kernel_pgtable_async_free_page(page);
else
__free_pages(page, compound_order(page));
}
The folks that, today, call pagetable_dtor_free() don't have to do
_anything_ at free time. They just set the PTDESC_KERNEL bit at
allocation time.
See how that code reads? "If you have a kernel page table page, you must
asynchronously free it." That's actually meaningful.
I'm pretty sure the lower 24 bits of ptdesc->__page_type are free. So
I'd just do this:
#define PTDESC_TYPE_KERNEL BIT(0)
Then, something needs to set:
ptdesc->__page_type |= PTDESC_TYPE_KERNEL;
That could happen as late as here:
static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
struct ptdesc *ptdesc = virt_to_ptdesc(pte);
// here
pagetable_dtor_free(ptdesc);
}
Or as early as ptdesc allocation/constructor time. Then you check for
PTDESC_TYPE_KERNEL in pagetable_free().
Powered by blists - more mailing lists