Message-ID: <bd7c53c9-cec6-2db2-6ee6-5cc03ca6dd39@intel.com>
Date: Thu, 25 Jan 2018 13:49:25 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Andy Lutomirski <luto@...nel.org>,
Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
X86 ML <x86@...nel.org>, Borislav Petkov <bp@...en8.de>
Cc: Neil Berrington <neil.berrington@...acore.com>,
LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH v2 1/2] x86/mm/64: Fix vmapped stack syncing on
very-large-memory 4-level systems

On 01/25/2018 01:12 PM, Andy Lutomirski wrote:
> Neil Berrington reported a double-fault on a VM with 768GB of RAM that
> uses large amounts of vmalloc space with PTI enabled.
>
> The cause is that load_new_mm_cr3() was never fixed to take the
> 5-level pgd folding code into account, so, on a 4-level kernel, the
> pgd synchronization logic compiles away to exactly nothing.

You don't mention it, but we can normally handle vmalloc() faults in the
kernel that are due to unsynchronized page tables. The thing that kills
us here is that we have an unmapped stack and we try to use that stack
when entering the page fault handler, which double faults. The double
fault handler gets a new stack and saves us enough to get an oops out.
Right?
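
(For context, the lazy handling I'm referring to is the vmalloc_fault()
path in the page fault handler: notice that the top-level entry is
missing in the current page tables, copy it over from init_mm, and
retry the access.  Very roughly -- this is from memory, not the literal
arch/x86/mm/fault.c code, the helper name is made up, and it glosses
over the p4d folding wrinkle:

static int lazy_vmalloc_sync(unsigned long address)
{
        /* Top-level entry in the page tables we faulted on: */
        pgd_t *pgd = (pgd_t *)__va(read_cr3_pa()) + pgd_index(address);
        /* Reference entry in the kernel master page tables: */
        pgd_t *pgd_ref = pgd_offset_k(address);

        if (pgd_none(*pgd_ref))
                return -1;      /* genuinely bad access, oops */

        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_ref);         /* sync and retry */

        return 0;
}

None of that can run if the CPU can't even push a #PF frame onto the
current stack, which is exactly the unmapped-vmalloc-stack case here.)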

> +static void sync_current_stack_to_mm(struct mm_struct *mm)
> +{
> +        unsigned long sp = current_stack_pointer;
> +        pgd_t *pgd = pgd_offset(mm, sp);
> +
> +        if (CONFIG_PGTABLE_LEVELS > 4) {
> +                if (unlikely(pgd_none(*pgd))) {
> +                        pgd_t *pgd_ref = pgd_offset_k(sp);
> +
> +                        set_pgd(pgd, *pgd_ref);
> +                }
> +        } else {
> +                /*
> +                 * "pgd" is faked. The top level entries are "p4d"s, so sync
> +                 * the p4d. This compiles to approximately the same code as
> +                 * the 5-level case.
> +                 */
> +                p4d_t *p4d = p4d_offset(pgd, sp);
> +
> +                if (unlikely(p4d_none(*p4d))) {
> +                        pgd_t *pgd_ref = pgd_offset_k(sp);
> +                        p4d_t *p4d_ref = p4d_offset(pgd_ref, sp);
> +
> +                        set_p4d(p4d, *p4d_ref);
> +                }
> +        }
> +}

We keep having to add these. It seems like a real deficiency in the
mechanism that we're using for pgd folding. Can't we get a warning or
something when we try to do a set_pgd() that's (silently) not doing
anything? This exact same pattern bit me more than once with the
KPTI/KAISER patches.
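
Something along these lines is what I have in mind -- purely a sketch,
none of these helpers exist today, and the exact message wording
doesn't matter:

/*
 * Hypothetical: a sync-only helper that page-table sync code calls
 * instead of open-coding the pgd_none()/set_pgd() pair.  On a kernel
 * where the p4d level is folded into the pgd, pgd_none() is
 * constant-false, so the open-coded version silently compiles away;
 * this one at least yells about it.
 */
static inline void sync_pgd_entry(pgd_t *pgd, pgd_t *pgd_ref)
{
        if (CONFIG_PGTABLE_LEVELS <= 4) {
                WARN_ONCE(1, "pgd-level sync on a folded pgd; sync the p4d instead\n");
                return;
        }

        if (unlikely(pgd_none(*pgd)))
                set_pgd(pgd, *pgd_ref);
}

That wouldn't catch the mistake at build time, but it would at least
turn a silent no-op into something that shows up in dmesg the first
time the broken path runs.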