linux-hardening - Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1625125043.re5l9zg4kg.astroid@bobo.none>
Date:   Thu, 01 Jul 2021 17:51:08 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     "Christopher M. Riedl" <cmr@...ux.ibm.com>,
        linuxppc-dev@...ts.ozlabs.org
Cc:     keescook@...omium.org, linux-hardening@...r.kernel.org,
        tglx@...utronix.de, x86@...nel.org
Subject: Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary
 mm for patching

Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
>> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
>> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
>> > address to be patched is temporarily mapped as writeable. Currently, a
>> > per-cpu vmalloc patch area is used for this purpose. While the patch
>> > area is per-cpu, the temporary page mapping is inserted into the kernel
>> > page tables for the duration of patching. The mapping is exposed to CPUs
>> > other than the patching CPU - this is undesirable from a hardening
>> > perspective. Use a temporary mm instead which keeps the mapping local to
>> > the CPU doing the patching.
>> > 
>> > Use the `poking_init` init hook to prepare a temporary mm and patching
>> > address. Initialize the temporary mm by copying the init mm. Choose a
>> > randomized patching address inside the temporary mm userspace address
>> > space. The patching address is randomized between PAGE_SIZE and
>> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
>> > the Book3s64 Hash MMU operates - by default the space above
>> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
>> > all platforms/MMUs is randomized inside this range.  The number of
>> > possible random addresses is dependent on PAGE_SIZE and limited by
>> > DEFAULT_MAP_WINDOW.
>> > 
>> > Bits of entropy with 64K page size on BOOK3S_64:
>> > 
>> >         bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
>> > 
>> >         PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
>> >         bits of entropy = log2(128TB / 64K) bits of entropy = 31
>> > 
>> > Randomization occurs only once during initialization at boot.
>> > 
>> > Introduce two new functions, map_patch() and unmap_patch(), to
>> > respectively create and remove the temporary mapping with write
>> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
>> > the page for patching with PAGE_SHARED since the kernel cannot access
>> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
>> > 
>> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
>> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
>> > taking an SLB and Hash fault during patching.
>>
>> What prevents the SLBE or HPTE from being removed before the last
>> access?
> 
> This code runs with local IRQs disabled - we also don't access anything
> else in userspace so I'm not sure what else could cause the entries to
> be removed TBH.
> 
>>
>>
>> > +#ifdef CONFIG_PPC_BOOK3S_64
>> > +
>> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
>> >  {
>> > -	struct vm_struct *area;
>> > +	int err;
>> >  
>> > -	area = get_vm_area(PAGE_SIZE, VM_ALLOC);
>> > -	if (!area) {
>> > -		WARN_ONCE(1, "Failed to create text area for cpu %d\n",
>> > -			cpu);
>> > -		return -1;
>> > -	}
>> > -	this_cpu_write(text_poke_area, area);
>> > +	if (radix_enabled())
>> > +		return 0;
>> >  
>> > -	return 0;
>> > -}
>> > +	err = slb_allocate_user(patching_mm, patching_addr);
>> > +	if (err)
>> > +		pr_warn("map patch: failed to allocate slb entry\n");
>> >  
>> > -static int text_area_cpu_down(unsigned int cpu)
>> > -{
>> > -	free_vm_area(this_cpu_read(text_poke_area));
>> > -	return 0;
>> > +	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
>> > +			   HPTE_USE_KERNEL_KEY);
>> > +	if (err)
>> > +		pr_warn("map patch: failed to insert hashed page\n");
>> > +
>> > +	/* See comment in switch_slb() in mm/book3s64/slb.c */
>> > +	isync();
>>
>> I'm not sure if this is enough. Could we context switch here? You've
>> got the PTL so no with a normal kernel but maybe yes with an RT kernel
>> How about taking an machine check that clears the SLB? Could the HPTE
>> get removed by something else here?
> 
> All of this happens after a local_irq_save() which should at least
> prevent context switches IIUC.

Ah yeah I didn't look that far back. A machine check can take out SLB
entries.

> I am not sure what else could cause the
> HPTE to get removed here.

Other CPUs?

>> You want to prevent faults because you might be patching a fault
>> handler?
> 
> In a more general sense: I don't think we want to take page faults every
> time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> fault handler codepath also checks `current->mm` in some places which
> won't match the temporary mm. Also `current->mm` can be NULL which
> caused problems in my earlier revisions of this series.

Hmm, that's a bit of a hack then. Maybe doing an actual mm switch and 
setting current->mm properly would explode too much. Maybe that's okayish.
But I can't see how the HPT code is up to the job of this in general 
(even if that current->mm issue was fixed).

To do it without holes you would either have to get the SLB MCE handler 
to restore that particular SLB if it flushed it, or restart the patch
code from a fixup location if it took an MCE after installing the SLB.
And bolt a hash table entry.

Thanks,
Nick