linux-kernel - Re: Review of KPTI patchset

Message-ID: <1311401854.45816.1514666587545.JavaMail.zimbra@efficios.com>
Date:   Sat, 30 Dec 2017 20:43:07 +0000 (UTC)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        Andy Lutomirski <luto@...capital.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Borislav Petkov <bp@...e.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hugh Dickins <hughd@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: Review of KPTI patchset

----- On Dec 30, 2017, at 2:58 PM, Thomas Gleixner tglx@...utronix.de wrote:

> On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:
> 
>> Hi Thomas,
>> 
>> Here is some feedback on the KPTI patchset. Sorry for not replying to the
>> patch, I was not CC'd on the original email, and don't have it in my inbox.
> 
> I can bounce you 196 versions if you want.

Oh no, don't worry about this. I'm happy reviewing the resulting patchset
as it is. :)

> 
>> I notice that fill_ldt() sets the desc->type with "|= 1", whereas all
>> other operations on the desc type are done with a type enum based on
>> clearly defined bits. Is the hardcoded "1" on purpose ?
> 
> I don't understand your question. That code does not have any enum involved
> at all:

I think I got mixed up with other "desc" fields within other structures
of desc_defs.h.

> 
>        desc->type              = (info->read_exec_only ^ 1) << 1;
>        desc->type             |= info->contents << 2;
>        /* Set the ACCESS bit so it can be mapped RO */
>        desc->type             |= 1;
> 
> So the |= 1 is completely consistent with the rest of that code.

It indeed seems consistent with the rest of that code, which could use
more comments and documentation. For instance, x86 desc_defs.h
could benefit from extra comments describing the meaning of each bit
near the "type" field.

I guess a counter-argument is that anyone reading through that code
should look up the "segment descriptor" layout in an x86 manual. Not
ideal though.
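
To show the kind of documentation I have in mind, here is a minimal
user-space sketch of the same computation with the bits named. The
bit meanings are those of the x86 code/data segment descriptor "type"
field; the EXAMPLE_* macro names and the helper are hypothetical and
do not exist in desc_defs.h.

```c
/* Bits of the 4-bit "type" field of a code/data segment descriptor.
 * Names are hypothetical, chosen for this sketch only. */
#define EXAMPLE_SEG_TYPE_ACCESSED  0x1 /* bit 0: accessed */
#define EXAMPLE_SEG_TYPE_RW        0x2 /* bit 1: data writable / code readable */
#define EXAMPLE_SEG_TYPE_ED_CONF   0x4 /* bit 2: data expand-down / code conforming */
#define EXAMPLE_SEG_TYPE_CODE      0x8 /* bit 3: code segment when set */

/*
 * Mirrors the three assignments quoted above. "contents" is the 2-bit
 * field from struct user_desc (0 = data, 1 = stack, 2 = code), so it
 * lands in bits 2 and 3 of the type.
 */
static unsigned int example_ldt_desc_type(unsigned int read_exec_only,
					  unsigned int contents)
{
	unsigned int type;

	type  = ((read_exec_only & 1) ^ 1) << 1; /* sets EXAMPLE_SEG_TYPE_RW */
	type |= (contents & 3) << 2;             /* bits 2-3 */
	type |= EXAMPLE_SEG_TYPE_ACCESSED;       /* the hardcoded "1" */

	return type;
}
```

With this, a writable data segment gives type 0x3 and a readable code
segment gives 0xb, which makes it easier to see what the "|= 1" adds.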

> 
>> arch/x86/include/asm/processor.h:
>> 
>> "+ * With page table isolation enabled, we map the LDT in ... [stay tuned]"
>> 
>> I look forward to publication of the next chapter containing the rest of
>> this sentence. When is it due ? ;)
> 
> Don't know. Lost my crystal ball.

Me too :) It would be helpful to complete this comment though.

[...]

>> @@ -156,6 +271,12 @@ int ldt_dup_context(struct mm_struct *old_mm, struct
>> mm_struct *mm)
>>  	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
>>  	finalize_ldt_struct(new_ldt);
>>  
>> +	retval = map_ldt_struct(mm, new_ldt, 0);
>> +	if (retval) {
>> +		free_ldt_pgtables(mm);
>> +		free_ldt_struct(new_ldt);
>> +		goto out_unlock;
>> +	}
>>  	mm->context.ldt = new_ldt;
>>  
>>  out_unlock:
>> 
>> ^ I don't get why it does "free_ldt_pgtables(mm)" on the mm argument, but
>> it's not done in other error paths. Perhaps it's OK, but ownership seems
>> non-obvious.
> 
> The pagetable for LDT is allocated and populated in the user space visible
> part of a process PGDIR, which obviously is connected to the mm struct....
> 
> Which other error paths are you talking about?

Let's look at the entire function:

> /*
>  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
>  * the new task is not running, so nothing can be installed.
>  */
> int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
> {
>       struct ldt_struct *new_ldt;
>       int retval = 0;
>
>       if (!old_mm)
>               return 0;

If old_mm is NULL, free_ldt_pgtables(mm) is not called.

>
>       mutex_lock(&old_mm->context.lock);
>       if (!old_mm->context.ldt)

If old_mm->context.ldt is NULL, free_ldt_pgtables(mm) is not called.

>               goto out_unlock;
>
>       new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
>       if (!new_ldt) {
>               retval = -ENOMEM;

On allocation error, free_ldt_pgtables(mm) is not called.

>               goto out_unlock;
>       }
>
>       memcpy(new_ldt->entries, old_mm->context.ldt->entries,
>              new_ldt->nr_entries * LDT_ENTRY_SIZE);
>       finalize_ldt_struct(new_ldt);
>
>       retval = map_ldt_struct(mm, new_ldt, 0);
>       if (retval) {
>               free_ldt_pgtables(mm);

Here, if map_ldt_struct() fails, free_ldt_pgtables(mm) is called.

>               free_ldt_struct(new_ldt);

free_ldt_struct() is called as well, but map_ldt_struct() failed... ?

This lack of symmetry makes me uncomfortable, and it may hint at something
fishy.

>               goto out_unlock;
>       }
>       mm->context.ldt = new_ldt;
>
> out_unlock:
>       mutex_unlock(&old_mm->context.lock);
>       return retval;
> }
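
For comparison, this is the kind of symmetric unwinding I would find
easier to audit. It is a user-space sketch with hypothetical names
(example_*), not a proposed patch, and it does not model whatever
partial state map_ldt_struct() may leave behind on failure. Each
label undoes exactly one earlier acquisition, in reverse order.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins: "copy" plays the role of new_ldt, "mapping"
 * plays the role of whatever the mapping step allocates. */
struct example_ctx {
	int *copy;
	int *mapping;
};

static int example_live_allocs; /* outstanding allocations, for checking leaks */

static void *example_alloc(size_t size, int fail)
{
	void *p = fail ? NULL : malloc(size);

	if (p)
		example_live_allocs++;
	return p;
}

static void example_free(void *p)
{
	if (p) {
		free(p);
		example_live_allocs--;
	}
}

/* fail_copy / fail_map inject a failure at the corresponding step. */
static int example_dup(struct example_ctx *ctx, int fail_copy, int fail_map)
{
	int *copy, *mapping;
	int retval = 0;

	copy = example_alloc(sizeof(*copy), fail_copy);
	if (!copy) {
		retval = -ENOMEM;
		goto out;
	}

	mapping = example_alloc(sizeof(*mapping), fail_map);
	if (!mapping) {
		retval = -ENOMEM;
		goto out_free_copy;
	}

	/* Ownership moves to ctx only once every step has succeeded. */
	ctx->copy = copy;
	ctx->mapping = mapping;
	return 0;

out_free_copy:
	example_free(copy);
out:
	return retval;
}
```

In this shape, the owner of each object at each failure point can be
read straight off the labels.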

[...]

> 
>> +	/*
>> +	 * Force the population of PMDs for not yet allocated per cpu
>> +	 * memory like debug store buffers.
>> +	 */
>> +	npages = sizeof(struct debug_store_buffers) / PAGE_SIZE;
>> +	for (; npages; npages--, cea += PAGE_SIZE)
>> +		cea_set_pte(cea, 0, PAGE_NONE);
>> 
>> ^ the code above (in percpu_setup_debug_store()) depends on having
>> struct debug_store_buffers's size being a multiple of PAGE_SIZE. A
>> comment should be added near the structure declaration to document
>> this requirement.
> 
> Hmm. There was a build_bug_on() somewhere which ensured that. That must
> have been lost in one of the gazillion iterations.

A BUILD_BUG_ON() would indeed work as documentation.
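
As a rough illustration, the check could look like the following
user-space sketch. It uses C11 _Static_assert as a stand-in for the
kernel's BUILD_BUG_ON(), and the structure layout, the page size and
all example_* / EXAMPLE_* names are assumptions made for the sketch,
not the real struct debug_store_buffers.

```c
#include <stddef.h>

#define EXAMPLE_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct debug_store_buffers. */
struct example_ds_buffers {
	char bts_buffer[16 * EXAMPLE_PAGE_SIZE];
	char pebs_buffer[16 * EXAMPLE_PAGE_SIZE];
};

/*
 * The loop computes npages = sizeof(...) / PAGE_SIZE, which silently
 * drops a partial trailing page. Fail the build instead if the size is
 * ever not a multiple of the page size.
 */
_Static_assert(sizeof(struct example_ds_buffers) % EXAMPLE_PAGE_SIZE == 0,
	       "example_ds_buffers must be a multiple of the page size");

/* Number of pages the population loop would walk. */
static size_t example_ds_npages(void)
{
	return sizeof(struct example_ds_buffers) / EXAMPLE_PAGE_SIZE;
}
```

Placed next to the structure declaration, the assertion documents the
requirement and enforces it at the same time.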

[...]

> 
>> +/*
>> + * We get here when we do something requiring a TLB invalidation
>> + * but could not go invalidate all of the contexts.  We do the
>> + * necessary invalidation by clearing out the 'ctx_id' which
>> + * forces a TLB flush when the context is loaded.
>> + */
>> +void clear_asid_other(void)
>> +{
>> +	u16 asid;
>> +
>> +	/*
>> +	 * This is only expected to be set if we have disabled
>> +	 * kernel _PAGE_GLOBAL pages.
>> +	 */
>> +	if (!static_cpu_has(X86_FEATURE_PTI)) {
>> +		WARN_ON_ONCE(1);
>> +		return;
>> +	}
>> +
>> +	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
>> +		/* Do not need to flush the current asid */
>> +		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>> +			continue;
>> +		/*
>> +		 * Make sure the next time we go to switch to
>> +		 * this asid, we do a flush:
>> +		 */
>> +		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
>> +	}
>> +	this_cpu_write(cpu_tlbstate.invalidate_other, false);
>> +}
>> 
>> Can this be called with preemption enabled ? If so, what happens
>> if migrated ?
> 
> No, it can't and if it is then it's a bug and the smp_processor_id() debug
> code will yell at you.

I thought the whole point of this_cpu_*() was that it could be called
with preemption enabled, given that it figures out the per-cpu data offset
using a segment selector prefix. How would smp_processor_id() debug code be
involved here ?
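
To make the migration scenario concrete, here is a small user-space
model of the loop. It is not kernel code: the example_* names, the two
CPUs and the six ASIDs are all assumptions for the sketch. Each
accessor looks up "this CPU" at the instant of the access, which is
how I understand a single this_cpu_*() operation to behave. The
"migration" is injected by hand between two loop iterations.

```c
#define EXAMPLE_NR_CPUS  2
#define EXAMPLE_NR_ASIDS 6 /* assumption, stands in for TLB_NR_DYN_ASIDS */

struct example_tlb_state {
	unsigned short loaded_mm_asid;
	unsigned long ctx_id[EXAMPLE_NR_ASIDS];
};

static struct example_tlb_state example_state[EXAMPLE_NR_CPUS];
static int example_current_cpu; /* CPU the modeled task currently runs on */

/* Each accessor resolves the CPU only for the duration of one access. */
static unsigned short example_this_cpu_read_asid(void)
{
	return example_state[example_current_cpu].loaded_mm_asid;
}

static void example_this_cpu_write_ctx_id(unsigned short asid, unsigned long val)
{
	example_state[example_current_cpu].ctx_id[asid] = val;
}

/* Give every context a non-zero ctx_id; CPU 0 runs ASID 1, CPU 1 runs ASID 4. */
static void example_reset(void)
{
	int cpu, asid;

	for (cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++)
		for (asid = 0; asid < EXAMPLE_NR_ASIDS; asid++)
			example_state[cpu].ctx_id[asid] = 100 + asid;
	example_state[0].loaded_mm_asid = 1;
	example_state[1].loaded_mm_asid = 4;
	example_current_cpu = 0;
}

/*
 * Model of the clear_asid_other() loop. If migrate_at is a valid ASID,
 * the task "migrates" to migrate_to just before that iteration; pass a
 * negative migrate_at to disable migration.
 */
static void example_clear_asid_other(int migrate_at, int migrate_to)
{
	unsigned short asid;

	for (asid = 0; asid < EXAMPLE_NR_ASIDS; asid++) {
		if (asid == migrate_at)
			example_current_cpu = migrate_to;
		/* Do not need to flush the current asid */
		if (asid == example_this_cpu_read_asid())
			continue;
		example_this_cpu_write_ctx_id(asid, 0);
	}
}
```

Calling example_reset() and then example_clear_asid_other(3, 1) leaves
ASIDs 3 and 5 untouched on CPU 0 and clears them on CPU 1 instead. So
in this model every individual access is fine, yet the loop as a whole
does not do what its comment says if the task moves in the middle.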

Thanks,

Mathieu


> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com