linux-kernel - Re: [PATCH v8 9/9] sparc64: Add support for ADI (Application Data Integrity)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5edaf7dc-6bc7-c365-0b54-b78975c08894@oracle.com>
Date:   Thu, 12 Oct 2017 08:44:12 -0600
From:   Khalid Aziz <khalid.aziz@...cle.com>
To:     Anthony Yznaga <anthony.yznaga@...cle.com>
Cc:     David Miller <davem@...emloft.net>, dave.hansen@...ux.intel.com,
        corbet@....net, Bob Picco <bob.picco@...cle.com>,
        steven.sistare@...cle.com, pasha.tatashin@...cle.com,
        mike.kravetz@...cle.com, rob.gardner@...cle.com, mingo@...nel.org,
        nitin.m.gupta@...cle.com, kirill.shutemov@...ux.intel.com,
        tom.hromatka@...cle.com, eric.saint.etienne@...cle.com,
        allen.pais@...cle.com, cmetcalf@...lanox.com,
        akpm@...ux-foundation.org, geert@...ux-m68k.org, pmladek@...e.com,
        tklauser@...tanz.ch, atish.patra@...cle.com,
        shannon.nelson@...cle.com, vijay.ac.kumar@...cle.com,
        peterz@...radead.org, mhocko@...e.com, jack@...e.cz,
        lstoakes@...il.com, punit.agrawal@....com, hughd@...gle.com,
        thomas.tai@...cle.com, paul.gortmaker@...driver.com,
        ross.zwisler@...ux.intel.com, dave.jiang@...el.com,
        willy@...radead.org, ying.huang@...el.com, zhongjiang@...wei.com,
        minchan@...nel.org, imbrenda@...ux.vnet.ibm.com,
        aneesh.kumar@...ux.vnet.ibm.com, aarcange@...hat.com,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        sparclinux@...r.kernel.org, linux-mm@...ck.org,
        Khalid Aziz <khalid@...ehiking.org>
Subject: Re: [PATCH v8 9/9] sparc64: Add support for ADI (Application Data
 Integrity)

Hi Anthony,

Please quote only the relevant parts of the patch with comments. That 
makes it much easier to find the comments.

On 10/06/2017 04:12 PM, Anthony Yznaga wrote:
> 
>> On Sep 25, 2017, at 9:49 AM, Khalid Aziz <khalid.aziz@...cle.com> wrote:
>>
>> This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
>> MCD (Memory Corruption Detection) on selected memory ranges, enable
>> TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
>> version tags on page swap out/in or migration. ADI is not enabled by
> 
> I still don't believe migration is properly supported.  Your
> implementation is relying on a fault happening on a page while its
> migration is in progress so that do_swap_page() will be called, but
> I don't see how do_swap_page() will be called if a fault does not
> happen until after the migration has completed.

User pages are on LRU list and for the mapped pages on LRU list, 
migrate_pages() ultimately calls try_to_unmap_one and makes a migration 
swap entry for the page being migrated. This forces a page fault upon 
access on the destination node and the page is swapped back in from swap 
cache. The fault is forced by the migration swap entry, rather than 
fault being an accidental event. If page fault happens on the 
destination node while migration is in progress, do_swap_page() waits 
until migration is done. Please take a look at the code in 
__unmap_and_move().


>> +#define finish_arch_post_lock_switch	finish_arch_post_lock_switch
>> +static inline void finish_arch_post_lock_switch(void)
>> +{
>> +	/* Restore the state of MCDPER register for the new process
>> +	 * just switched to.
>> +	 */
>> +	if (adi_capable()) {
>> +		register unsigned long tmp_mcdper;
>> +
>> +		tmp_mcdper = test_thread_flag(TIF_MCDPER);
>> +		__asm__ __volatile__(
>> +			"mov %0, %%g1\n\t"
>> +			".word 0x9d800001\n\t"	/* wr %g0, %g1, %mcdper" */
>> +			".word 0xaf902001\n\t"	/* wrpr %g0, 1, %pmcdper */
>> +			:
>> +			: "ir" (tmp_mcdper)
>> +			: "g1");
>> +		if (current && current->mm && current->mm->context.adi) {
>> +			struct pt_regs *regs;
>> +
>> +			regs = task_pt_regs(current);
>> +			regs->tstate |= TSTATE_MCDE;
> 
> 
> This works, but it costs additional cycles on every context switch to
> keep setting TSTATE_MCDE.  PSTATE.mcde=1 only affects loads and stores
> to memory mapped with TTE.mcd=1 so there is no impact if it is set and
> no memory is mapped with TTE.mcd=1.  That is why I suggested just
> setting TSTATE_MCDE once when a process thread is initialized.

This change was suggested by David Miller. I believe there is merit to 
that suggestion.

>> +	/* Tag storage has not been allocated for this vma and space
>> +	 * is available in tag storage descriptor. Since this page is
>> +	 * being swapped out, there is high probability subsequent pages
>> +	 * in the VMA will be swapped out as well. Allocate pages to
>> +	 * store tags for as many pages in this vma as possible but not
>> +	 * more than TAG_STORAGE_PAGES. Each byte in tag space holds
>> +	 * two ADI tags since each ADI tag is 4 bits. Each ADI tag
>> +	 * covers adi_blksize() worth of addresses. Check if the hole is
>> +	 * big enough to accommodate full address range for using
>> +	 * TAG_STORAGE_PAGES number of tag pages.
>> +	 */
>> +	size = TAG_STORAGE_PAGES * PAGE_SIZE;
>> +	end_addr = addr + (size*2*adi_blksize()) - 1;
>> +	/* Check for overflow. If overflow occurs, allocate only one page */
>> +	if (end_addr < addr) {
>> +		size = PAGE_SIZE;
>> +		end_addr = addr + (size*2*adi_blksize()) - 1;
> 
> end_addr could still overflow even with size = PAGE_SIZE.
> Maybe you could just set end_addr to (unsigned long)-1 and recalculate
> the size based on that.

I agree at theoretical level. The number of VA bits is already limited 
by the max implemented VA bit in hardware plus with ADI in use, top 4 
bits are not available as well either, so there is lot of unused room at 
the upper end of VA and end_addr is not going to roll over. 
Nevertheless, I can fix this as well for completeness sake.

> 
> 
>> +	}
>> +	if (hole_end < end_addr) {
>> +		/* Available hole is too small on the upper end of
>> +		 * address. Can we expand the range towards the lower
>> +		 * address and maximize use of this slot?
>> +		 */
>> +		unsigned long tmp_addr;
>> +
>> +		end_addr = hole_end - 1;
>> +		tmp_addr = end_addr - (size*2*adi_blksize()) + 1;
>> +		/* Check for underflow. If underflow occurs, allocate
>> +		 * only one page for storing ADI tags
>> +		 */
>> +		if (tmp_addr > addr) {
> 
> Should compare tmp_addr to end_addr rather than addr.

No, this is correct. If tmp_addr wraps around to the upper end, 
theoretically it can be smaller than end_addr but still be bigger than 
addr since addr < end_addr. The way it is written, this is a safer test.

> 
> 
>> +			size = PAGE_SIZE;
>> +			tmp_addr = addr + (size*2*adi_blksize()) - 1;
> 
> copy/paste error?  tmp_addr should be recalculated from end_addr and a
> new size.  The new size needs to be adjusted based on end_addr to as
> little as PAGE_SIZE.

Good catch.

>> diff --git a/arch/sparc/kernel/etrap_64.S b/arch/sparc/kernel/etrap_64.S
>> index 1276ca2567ba..7be33bf45cff 100644
>> --- a/arch/sparc/kernel/etrap_64.S
>> +++ b/arch/sparc/kernel/etrap_64.S
>> @@ -132,7 +132,33 @@ etrap_save:	save	%g2, -STACK_BIAS, %sp
>> 		stx	%g6, [%sp + PTREGS_OFF + PT_V9_G6]
>> 		stx	%g7, [%sp + PTREGS_OFF + PT_V9_G7]
>> 		or	%l7, %l0, %l7
>> -		sethi	%hi(TSTATE_TSO | TSTATE_PEF), %l0
>> +661:		sethi	%hi(TSTATE_TSO | TSTATE_PEF), %l0
>> +		/*
>> +		 * If userspace is using ADI, it could potentially pass
>> +		 * a pointer with version tag embedded in it. To maintain
>> +		 * the ADI security, we must enable PSTATE.mcde. Userspace
>> +		 * would have already set TTE.mcd in an earlier call to
>> +		 * kernel and set the version tag for the address being
>> +		 * dereferenced. Setting PSTATE.mcde would ensure any
>> +		 * access to userspace data through a system call honors
>> +		 * ADI and does not allow a rogue app to bypass ADI by
>> +		 * using system calls. Setting PSTATE.mcde only affects
>> +		 * accesses to virtual addresses that have TTE.mcd set.
>> +		 * Set PMCDPER to ensure any exceptions caused by ADI
>> +		 * version tag mismatch are exposed before system call
>> +		 * returns to userspace. Setting PMCDPER affects only
>> +		 * writes to virtual addresses that have TTE.mcd set and
>> +		 * have a version tag set as well.
>> +		 */
>> +		.section .sun_m7_1insn_patch, "ax"
>> +		.word	661b
>> +		sethi	%hi(TSTATE_TSO | TSTATE_PEF | TSTATE_MCDE), %l0
> 
> rtrap is still missing patches to turn on TSTATE_MCDE when needed.
> 
>> +		.previous
>> +661:		nop
>> +		.section .sun_m7_1insn_patch, "ax"
>> +		.word	661b
>> +		.word 0xaf902001	/* wrpr %g0, 1, %pmcdper */
> 
> 
> I still disagree with setting %pmcdper=1 on every trap into the kernel,
> and now %pmcdper is also being set to 1 on every context switch.  It
> should be sufficient to set it once for each CPU but also setting it
> on every context switch is at least less impactful than setting it on
> every etrap.
> 

We discussed this before and I believe not setting %pmcdper on trap into 
kernel can expose kernel to the possibility of running system calls with 
deferred MCD exceptions which in turn causes unreliable behavior from 
userspace point of view when MCD exception happens (userspace might get 
SIGSEGV, or system call terminates with error depending upon when 
exception is delievered). I believe it is important for system calls to 
behave consistently.


>> diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
>> index 150ee7d4b059..98a877715832 100644
>> --- a/arch/sparc/kernel/setup_64.c
>> +++ b/arch/sparc/kernel/setup_64.c
>> @@ -293,6 +293,8 @@ static void __init sun4v_patch(void)
>> 	case SUN4V_CHIP_SPARC_M7:
>> 	case SUN4V_CHIP_SPARC_M8:
>> 	case SUN4V_CHIP_SPARC_SN:
>> +		sun4v_patch_1insn_range(&__sun_m7_1insn_patch,
>> +					&__sun_m7_1insn_patch_end);
>> 		sun_m7_patch_2insn_range(&__sun_m7_2insn_patch,
>> 					 &__sun_m7_2insn_patch_end);
> 
> Why did you keep sun_m7_patch_2insn_range() and not replace it with
> sun4v_m7_patch_2insn_range()?
> 

sun_m7_patch_2insn_range() is already in the kernel and not part of this 
patch. It can be replaced but that should be a separate patch in my opinion.

Thanks,
Khalid