linux-kernel - Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <56EFB2DB.3090602@hpe.com>
Date:	Mon, 21 Mar 2016 09:37:47 +0100
From:	Juerg Haefliger <juerg.haefliger@....com>
To:	Laura Abbott <labbott@...hat.com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Cc:	vpk@...brown.edu, Kees Cook <keescook@...omium.org>
Subject: Re: [RFC PATCH] Add support for eXclusive Page Frame Ownership (XPFO)

Hi Laura,

Sorry for the late reply. I was on FTO and then traveling for the past couple of
days.


On 03/01/2016 02:31 AM, Laura Abbott wrote:
> On 02/26/2016 06:21 AM, Juerg Haefliger wrote:
>> This patch adds support for XPFO which protects against 'ret2dir' kernel
>> attacks. The basic idea is to enforce exclusive ownership of page frames
>> by either the kernel or userland, unless explicitly requested by the
>> kernel. Whenever a page destined for userland is allocated, it is
>> unmapped from physmap. When such a page is reclaimed from userland, it is
>> mapped back to physmap.
>>
>> Mapping/unmapping from physmap is accomplished by modifying the PTE
>> permission bits to allow/disallow access to the page.
>>
>> Additional fields are added to the page struct for XPFO housekeeping.
>> Specifically a flags field to distinguish user vs. kernel pages, a
>> reference counter to track physmap map/unmap operations and a lock to
>> protect the XPFO fields.
>>
>> Known issues/limitations:
>>    - Only supported on x86-64.
>>    - Only supports 4k pages.
>>    - Adds additional data to the page struct.
>>    - There are most likely some additional and legitimate uses cases where
>>      the kernel needs to access userspace. Those need to be identified and
>>      made XPFO-aware.
>>    - There's a performance impact if XPFO is turned on. Per the paper
>>      referenced below it's in the 1-3% ballpark. More performance testing
>>      wouldn't hurt. What tests to run though?
>>
>> Reference paper by the original patch authors:
>>    http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
>>
> 
> General note: Make sure to cc the x86 maintainers on the next version of
> the patch. I'd also recommend ccing the kernel hardening list (see the wiki
> page http://kernsec.org/wiki/index.php/Kernel_Self_Protection_Project for
> details)

Good idea. Thanks for the suggestion.


> If you can find a way to break this up into x86 specific vs. generic patches
> that would be better. Perhaps move the Kconfig for XPFO to the generic
> Kconfig layer and make it depend on ARCH_HAS_XPFO? x86 can then select
> ARCH_HAS_XPFO as the last option.

Good idea.


> There also isn't much that's actually x86 specific here except for
> some of the page table manipulation functions and even those can probably
> be abstracted away. It would be good to get more of this out of x86 to
> let other arches take advantage of it. The arm64 implementation would
> look pretty similar if you save the old kernel mapping and restore
> it on free.

OK. I need to familiarize myself with ARM to figure out which pieces can move
out of the arch subdir.


>  
>> Suggested-by: Vasileios P. Kemerlis <vpk@...brown.edu>
>> Signed-off-by: Juerg Haefliger <juerg.haefliger@....com>
>> ---
>>   arch/x86/Kconfig         |   2 +-
>>   arch/x86/Kconfig.debug   |  17 +++++
>>   arch/x86/mm/Makefile     |   2 +
>>   arch/x86/mm/init.c       |   3 +-
>>   arch/x86/mm/xpfo.c       | 176 +++++++++++++++++++++++++++++++++++++++++++++++
>>   block/blk-map.c          |   7 +-
>>   include/linux/highmem.h  |  23 +++++--
>>   include/linux/mm_types.h |   4 ++
>>   include/linux/xpfo.h     |  88 ++++++++++++++++++++++++
>>   lib/swiotlb.c            |   3 +-
>>   mm/page_alloc.c          |   7 +-
>>   11 files changed, 323 insertions(+), 9 deletions(-)
>>   create mode 100644 arch/x86/mm/xpfo.c
>>   create mode 100644 include/linux/xpfo.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index c46662f..9d32b4a 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1343,7 +1343,7 @@ config ARCH_DMA_ADDR_T_64BIT
>>
>>   config X86_DIRECT_GBPAGES
>>       def_bool y
>> -    depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK
>> +    depends on X86_64 && !DEBUG_PAGEALLOC && !KMEMCHECK && !XPFO
>>       ---help---
>>         Certain kernel features effectively disable kernel
>>         linear 1 GB mappings (even if the CPU otherwise
>> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
>> index 9b18ed9..1331da5 100644
>> --- a/arch/x86/Kconfig.debug
>> +++ b/arch/x86/Kconfig.debug
>> @@ -5,6 +5,23 @@ config TRACE_IRQFLAGS_SUPPORT
>>
>>   source "lib/Kconfig.debug"
>>
>> +config XPFO
>> +    bool "Enable eXclusive Page Frame Ownership (XPFO)"
>> +    default n
>> +    depends on DEBUG_KERNEL
>> +    depends on X86_64
>> +    select DEBUG_TLBFLUSH
>> +    ---help---
>> +      This option offers protection against 'ret2dir' (kernel) attacks.
>> +      When enabled, every time a page frame is allocated to user space, it
>> +      is unmapped from the direct mapped RAM region in kernel space
>> +      (physmap). Similarly, whenever page frames are freed/reclaimed, they
>> +      are mapped back to physmap. Special care is taken to minimize the
>> +      impact on performance by reducing TLB shootdowns and unnecessary page
>> +      zero fills.
>> +
>> +      If in doubt, say "N".
>> +
>>   config X86_VERBOSE_BOOTUP
>>       bool "Enable verbose x86 bootup info messages"
>>       default y
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index f9d38a4..8bf52b6 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -34,3 +34,5 @@ obj-$(CONFIG_ACPI_NUMA)        += srat.o
>>   obj-$(CONFIG_NUMA_EMU)        += numa_emulation.o
>>
>>   obj-$(CONFIG_X86_INTEL_MPX)    += mpx.o
>> +
>> +obj-$(CONFIG_XPFO)        += xpfo.o
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 493f541..27fc8a6 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -150,7 +150,8 @@ static int page_size_mask;
>>
>>   static void __init probe_page_size_mask(void)
>>   {
>> -#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
>> +#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK) && \
>> +    !defined(CONFIG_XPFO)
>>       /*
>>        * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
>>        * This will simplify cpa(), which otherwise needs to support splitting
>> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
>> new file mode 100644
>> index 0000000..6bc24d3
>> --- /dev/null
>> +++ b/arch/x86/mm/xpfo.c
>> @@ -0,0 +1,176 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@...brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@....com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/module.h>
>> +
>> +#include <asm/pgtable.h>
>> +#include <asm/tlbflush.h>
>> +
>> +#define TEST_XPFO_FLAG(flag, page) \
>> +    test_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define SET_XPFO_FLAG(flag, page)            \
>> +    __set_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define CLEAR_XPFO_FLAG(flag, page)            \
>> +    __clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +#define TEST_AND_CLEAR_XPFO_FLAG(flag, page)            \
>> +    __test_and_clear_bit(PG_XPFO_##flag, &(page)->xpfo.flags)
>> +
>> +/*
>> + * Update a single kernel page table entry
>> + */
>> +static inline void set_kpte(struct page *page, unsigned long kaddr,
>> +                pgprot_t prot) {
>> +    unsigned int level;
>> +    pte_t *kpte = lookup_address(kaddr, &level);
>> +
>> +    /* We only support 4k pages for now */
>> +    BUG_ON(!kpte || level != PG_LEVEL_4K);
>> +
>> +    set_pte_atomic(kpte, pfn_pte(page_to_pfn(page), canon_pgprot(prot)));
>> +}
>> +
>> +inline void xpfo_clear_zap(struct page *page, int order)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < (1 << order); i++)
>> +        CLEAR_XPFO_FLAG(zap, page + i);
>> +}
>> +
>> +inline int xpfo_test_and_clear_zap(struct page *page)
>> +{
>> +    return TEST_AND_CLEAR_XPFO_FLAG(zap, page);
>> +}
>> +
>> +inline int xpfo_test_kernel(struct page *page)
>> +{
>> +    return TEST_XPFO_FLAG(kernel, page);
>> +}
>> +
>> +inline int xpfo_test_user(struct page *page)
>> +{
>> +    return TEST_XPFO_FLAG(user, page);
>> +}
>> +
>> +void xpfo_alloc_page(struct page *page, int order, gfp_t gfp)
>> +{
>> +    int i, tlb_shoot = 0;
>> +    unsigned long kaddr;
>> +
>> +    for (i = 0; i < (1 << order); i++)  {
>> +        WARN_ON(TEST_XPFO_FLAG(user_fp, page + i) ||
>> +            TEST_XPFO_FLAG(user, page + i));
>> +
>> +        if (gfp & GFP_HIGHUSER) {
> 
> This check doesn't seem right. If the GFP flags have _any_ in common with
> GFP_HIGHUSER it will be marked as a user page so GFP_KERNEL will be marked
> as well.

Duh. You're right. I broke this when I cleaned up the original patch. It should be:
(gfp & GFP_HIGHUSER) == GFP_HIGHUSER



>> +            /* Initialize the xpfo lock and map counter */
>> +            spin_lock_init(&(page + i)->xpfo.lock);
> 
> This is initializing the spin_lock every time. That's not really necessary.

Correct. The initialization should probably be done when the page struct is
first allocated. But I haven't been able to find that piece of code quickly.
Will look again.


>> +            atomic_set(&(page + i)->xpfo.mapcount, 0);
>> +
>> +            /* Mark it as a user page */
>> +            SET_XPFO_FLAG(user_fp, page + i);
>> +
>> +            /*
>> +             * Shoot the TLB if the page was previously allocated
>> +             * to kernel space
>> +             */
>> +            if (TEST_AND_CLEAR_XPFO_FLAG(kernel, page + i))
>> +                tlb_shoot = 1;
>> +        } else {
>> +            /* Mark it as a kernel page */
>> +            SET_XPFO_FLAG(kernel, page + i);
>> +        }
>> +    }
>> +
>> +    if (tlb_shoot) {
>> +        kaddr = (unsigned long)page_address(page);
>> +        flush_tlb_kernel_range(kaddr, kaddr + (1 << order) *
>> +                       PAGE_SIZE);
>> +    }
>> +}
>> +
>> +void xpfo_free_page(struct page *page, int order)
>> +{
>> +    int i;
>> +    unsigned long kaddr;
>> +
>> +    for (i = 0; i < (1 << order); i++) {
>> +
>> +        /* The page frame was previously allocated to user space */
>> +        if (TEST_AND_CLEAR_XPFO_FLAG(user, page + i)) {
>> +            kaddr = (unsigned long)page_address(page + i);
>> +
>> +            /* Clear the page and mark it accordingly */
>> +            clear_page((void *)kaddr);
> 
> Clearing the page isn't related to XPFO. There's other work ongoing to
> do clearing of the page on free.

It's not strictly related to XPFO but adds another layer of security. Do you
happen to have a pointer to the ongoing work that you mentioned?


>> +            SET_XPFO_FLAG(zap, page + i);
>> +
>> +            /* Map it back to kernel space */
>> +            set_kpte(page + i,  kaddr, __pgprot(__PAGE_KERNEL));
>> +
>> +            /* No TLB update */
>> +        }
>> +
>> +        /* Clear the xpfo fast-path flag */
>> +        CLEAR_XPFO_FLAG(user_fp, page + i);
>> +    }
>> +}
>> +
>> +void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +    unsigned long flags;
>> +
>> +    /* The page is allocated to kernel space, so nothing to do */
>> +    if (TEST_XPFO_FLAG(kernel, page))
>> +        return;
>> +
>> +    spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +    /*
>> +     * The page was previously allocated to user space, so map it back
>> +     * into the kernel. No TLB update required.
>> +     */
>> +    if ((atomic_inc_return(&page->xpfo.mapcount) == 1) &&
>> +        TEST_XPFO_FLAG(user, page))
>> +        set_kpte(page, (unsigned long)kaddr, __pgprot(__PAGE_KERNEL));
>> +
>> +    spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kmap);
>> +
>> +void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +    unsigned long flags;
>> +
>> +    /* The page is allocated to kernel space, so nothing to do */
>> +    if (TEST_XPFO_FLAG(kernel, page))
>> +        return;
>> +
>> +    spin_lock_irqsave(&page->xpfo.lock, flags);
>> +
>> +    /*
>> +     * The page frame is to be allocated back to user space. So unmap it
>> +     * from the kernel, update the TLB and mark it as a user page.
>> +     */
>> +    if ((atomic_dec_return(&page->xpfo.mapcount) == 0) &&
>> +        (TEST_XPFO_FLAG(user_fp, page) || TEST_XPFO_FLAG(user, page))) {
>> +        set_kpte(page, (unsigned long)kaddr, __pgprot(0));
>> +        __flush_tlb_one((unsigned long)kaddr);
>> +        SET_XPFO_FLAG(user, page);
>> +    }
>> +
>> +    spin_unlock_irqrestore(&page->xpfo.lock, flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_kunmap);
> 
> I'm confused by the checks in kmap/kunmap here. It looks like once the
> page is allocated there is no changing of flags between user and
> kernel mode so the checks for if the page is user seem redundant.

Hmm... I think you're partially right. In xpfo_kmap we need to distinguish
between user and user_fp, so the check for 'user' is necessary. However, in
kunmap we can drop the check for 'user' || 'user_fp'.


>> diff --git a/block/blk-map.c b/block/blk-map.c
>> index f565e11..b7b8302 100644
>> --- a/block/blk-map.c
>> +++ b/block/blk-map.c
>> @@ -107,7 +107,12 @@ int blk_rq_map_user_iov(struct request_queue *q, struct
>> request *rq,
>>           prv.iov_len = iov.iov_len;
>>       }
>>
>> -    if (unaligned || (q->dma_pad_mask & iter->count) || map_data)
>> +    /*
>> +     * juergh: Temporary hack to force the use of a bounce buffer if XPFO
>> +     * is enabled. Results in an XPFO page fault otherwise.
>> +     */
>> +    if (unaligned || (q->dma_pad_mask & iter->count) || map_data ||
>> +        IS_ENABLED(CONFIG_XPFO))
>>           bio = bio_copy_user_iov(q, map_data, iter, gfp_mask);
>>       else
>>           bio = bio_map_user_iov(q, iter, gfp_mask);
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index bb3f329..0ca9130 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -55,24 +55,37 @@ static inline struct page *kmap_to_page(void *addr)
>>   #ifndef ARCH_HAS_KMAP
>>   static inline void *kmap(struct page *page)
>>   {
>> +    void *kaddr;
>> +
>>       might_sleep();
>> -    return page_address(page);
>> +
>> +    kaddr = page_address(page);
>> +    xpfo_kmap(kaddr, page);
>> +    return kaddr;
>>   }
>>
>>   static inline void kunmap(struct page *page)
>>   {
>> +    xpfo_kunmap(page_address(page), page);
>>   }
>>
>>   static inline void *kmap_atomic(struct page *page)
>>   {
>> +    void *kaddr;
>> +
>>       preempt_disable();
>>       pagefault_disable();
>> -    return page_address(page);
>> +
>> +    kaddr = page_address(page);
>> +    xpfo_kmap(kaddr, page);
>> +    return kaddr;
>>   }
>>   #define kmap_atomic_prot(page, prot)    kmap_atomic(page)
>>
>>   static inline void __kunmap_atomic(void *addr)
>>   {
>> +    xpfo_kunmap(addr, virt_to_page(addr));
>> +
>>       pagefault_enable();
>>       preempt_enable();
>>   }
>> @@ -133,7 +146,8 @@ do
>> {                                                            \
>>   static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>>   {
>>       void *addr = kmap_atomic(page);
>> -    clear_user_page(addr, vaddr, page);
>> +    if (!xpfo_test_and_clear_zap(page))
>> +        clear_user_page(addr, vaddr, page);
>>       kunmap_atomic(addr);
>>   }
>>   #endif
>> @@ -186,7 +200,8 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct
>> *vma,
>>   static inline void clear_highpage(struct page *page)
>>   {
>>       void *kaddr = kmap_atomic(page);
>> -    clear_page(kaddr);
>> +    if (!xpfo_test_and_clear_zap(page))
>> +        clear_page(kaddr);
>>       kunmap_atomic(kaddr);
>>   }
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 624b78b..71c95aa 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -12,6 +12,7 @@
>>   #include <linux/cpumask.h>
>>   #include <linux/uprobes.h>
>>   #include <linux/page-flags-layout.h>
>> +#include <linux/xpfo.h>
>>   #include <asm/page.h>
>>   #include <asm/mmu.h>
>>
>> @@ -215,6 +216,9 @@ struct page {
>>   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>>       int _last_cpupid;
>>   #endif
>> +#ifdef CONFIG_XPFO
>> +    struct xpfo_info xpfo;
>> +#endif
>>   }
>>   /*
>>    * The struct page can be forced to be double word aligned so that atomic ops
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> new file mode 100644
>> index 0000000..c4f0871
>> --- /dev/null
>> +++ b/include/linux/xpfo.h
>> @@ -0,0 +1,88 @@
>> +/*
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + * Copyright (C) 2016 Hewlett Packard Enterprise Development, L.P.
>> + *
>> + * Authors:
>> + *   Vasileios P. Kemerlis <vpk@...brown.edu>
>> + *   Juerg Haefliger <juerg.haefliger@....com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#ifndef _LINUX_XPFO_H
>> +#define _LINUX_XPFO_H
>> +
>> +#ifdef CONFIG_XPFO
>> +
>> +/*
>> + * XPFO page flags:
>> + *
>> + * PG_XPFO_user_fp denotes that the page is allocated to user space. This flag
>> + * is used in the fast path, where the page is marked accordingly but *not*
>> + * unmapped from the kernel. In most cases, the kernel will need access to the
>> + * page immediately after its acquisition so an unnecessary mapping operation
>> + * is avoided.
>> + *
>> + * PG_XPFO_user denotes that the page is destined for user space. This flag is
>> + * used in the slow path, where the page needs to be mapped/unmapped when the
>> + * kernel wants to access it. If a page is deallocated and this flag is set,
>> + * the page is cleared and mapped back into the kernel.
>> + *
>> + * PG_XPFO_kernel denotes a page that is destined to kernel space. This is used
>> + * for identifying pages that are first assigned to kernel space and then freed
>> + * and mapped to user space. In such cases, an expensive TLB shootdown is
>> + * necessary. Pages allocated to user space, freed, and subsequently allocated
>> + * to user space again, require only local TLB invalidation.
>> + *
>> + * PG_XPFO_zap indicates that the page has been zapped. This flag is used to
>> + * avoid zapping pages multiple times. Whenever a page is freed and was
>> + * previously mapped to user space, it needs to be zapped before mapped back
>> + * in to the kernel.
>> + */
> 
> 'zap' doesn't really indicate what is actually happening with the page. Can you
> be a bit more descriptive about what this actually does?

It means that the page has been cleared at the time it was released back to the
free pool. To prevent multiple expensive cleaning operations. But this might go
away because of the ongoing work of sanitizing pages that you mentioned.


>> +
>> +enum xpfo_pageflags {
>> +    PG_XPFO_user_fp,
>> +    PG_XPFO_user,
>> +    PG_XPFO_kernel,
>> +    PG_XPFO_zap,
>> +};
>> +
>> +struct xpfo_info {
>> +    unsigned long flags;    /* Flags for tracking the page's XPFO state */
>> +    atomic_t mapcount;    /* Counter for balancing page map/unmap
>> +                 * requests. Only the first map request maps
>> +                 * the page back to kernel space. Likewise,
>> +                 * only the last unmap request unmaps the page.
>> +                 */
>> +    spinlock_t lock;    /* Lock to serialize concurrent map/unmap
>> +                 * requests.
>> +                 */
>> +};
> 
> Can you change this to use the page_ext implementation? See what
> mm/page_owner.c does. This might lessen the impact of the extra
> page metadata. This metadata still feels like a copy of what
> mm/highmem.c is trying to do though.

I'll look into that, thanks for the pointer.


>> +
>> +extern void xpfo_clear_zap(struct page *page, int order);
>> +extern int xpfo_test_and_clear_zap(struct page *page);
>> +extern int xpfo_test_kernel(struct page *page);
>> +extern int xpfo_test_user(struct page *page);
>> +
>> +extern void xpfo_kmap(void *kaddr, struct page *page);
>> +extern void xpfo_kunmap(void *kaddr, struct page *page);
>> +extern void xpfo_alloc_page(struct page *page, int order, gfp_t gfp);
>> +extern void xpfo_free_page(struct page *page, int order);
>> +
>> +#else /* ifdef CONFIG_XPFO */
>> +
>> +static inline void xpfo_clear_zap(struct page *page, int order) { }
>> +static inline int xpfo_test_and_clear_zap(struct page *page) { return 0; }
>> +static inline int xpfo_test_kernel(struct page *page) { return 0; }
>> +static inline int xpfo_test_user(struct page *page) { return 0; }
>> +
>> +static inline void xpfo_kmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
>> +static inline void xpfo_alloc_page(struct page *page, int order, gfp_t gfp) { }
>> +static inline void xpfo_free_page(struct page *page, int order) { }
>> +
>> +#endif /* ifdef CONFIG_XPFO */
>> +
>> +#endif /* ifndef _LINUX_XPFO_H */
>> diff --git a/lib/swiotlb.c b/lib/swiotlb.c
>> index 76f29ec..cf57ee9 100644
>> --- a/lib/swiotlb.c
>> +++ b/lib/swiotlb.c
>> @@ -390,8 +390,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr,
>> phys_addr_t tlb_addr,
>>   {
>>       unsigned long pfn = PFN_DOWN(orig_addr);
>>       unsigned char *vaddr = phys_to_virt(tlb_addr);
>> +    struct page *page = pfn_to_page(pfn);
>>
>> -    if (PageHighMem(pfn_to_page(pfn))) {
>> +    if (PageHighMem(page) || xpfo_test_user(page)) {
>>           /* The buffer does not have a mapping.  Map it in and copy */
>>           unsigned int offset = orig_addr & ~PAGE_MASK;
>>           char *buffer;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 838ca8bb..47b42a3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1003,6 +1003,7 @@ static bool free_pages_prepare(struct page *page,
>> unsigned int order)
>>       }
>>       arch_free_page(page, order);
>>       kernel_map_pages(page, 1 << order, 0);
>> +    xpfo_free_page(page, order);
>>
>>       return true;
>>   }
>> @@ -1398,10 +1399,13 @@ static int prep_new_page(struct page *page, unsigned
>> int order, gfp_t gfp_flags,
>>       arch_alloc_page(page, order);
>>       kernel_map_pages(page, 1 << order, 1);
>>       kasan_alloc_pages(page, order);
>> +    xpfo_alloc_page(page, order, gfp_flags);
>>
>>       if (gfp_flags & __GFP_ZERO)
>>           for (i = 0; i < (1 << order); i++)
>>               clear_highpage(page + i);
>> +    else
>> +        xpfo_clear_zap(page, order);
>>
>>       if (order && (gfp_flags & __GFP_COMP))
>>           prep_compound_page(page, order);
>> @@ -2072,10 +2076,11 @@ void free_hot_cold_page(struct page *page, bool cold)
>>       }
>>
>>       pcp = &this_cpu_ptr(zone->pageset)->pcp;
>> -    if (!cold)
>> +    if (!cold && !xpfo_test_kernel(page))
>>           list_add(&page->lru, &pcp->lists[migratetype]);
>>       else
>>           list_add_tail(&page->lru, &pcp->lists[migratetype]);
>> +
> 
> What's the advantage of this?

Allocating a page to userspace that was previously allocated to kernel space
requires an expensive TLB shootdown. The above will put previously
kernel-allocated pages in the cold page cache to postpone their allocation as
long as possible to minimize TLB shootdowns.


>>       pcp->count++;
>>       if (pcp->count >= pcp->high) {
>>           unsigned long batch = READ_ONCE(pcp->batch);
>>

Thanks for the review and comments! It's highly appreciated.

...Juerg


> Thanks,
> Laura