linux-kernel - Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test, and config option

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55BA828E.8070304@oracle.com>
Date:	Thu, 30 Jul 2015 16:01:18 -0400
From:	Boris Ostrovsky <boris.ostrovsky@...cle.com>
To:	Andrew Cooper <andrew.cooper3@...rix.com>,
	Andy Lutomirski <luto@...capital.net>
CC:	David Vrabel <dvrabel@...tab.net>,
	"security@...nel.org" <security@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>, X86 ML <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	xen-devel <xen-devel@...ts.xen.org>,
	Borislav Petkov <bp@...en8.de>,
	David Vrabel <david.vrabel@...rix.com>,
	Jan Beulich <jbeulich@...e.com>,
	Sasha Levin <sasha.levin@...cle.com>
Subject: Re: [Xen-devel] [PATCH v4 0/3] x86: modify_ldt improvement, test,
 and config option

On 07/30/2015 02:54 PM, Andrew Cooper wrote:
> On 30/07/15 19:30, Andy Lutomirski wrote:
>> On Wed, Jul 29, 2015 at 5:29 PM, Andrew Cooper
>> <andrew.cooper3@...rix.com> wrote:
>>> On 30/07/2015 00:13, Andy Lutomirski wrote:
>>>> On Wed, Jul 29, 2015 at 4:02 PM, Andrew Cooper
>>>> <andrew.cooper3@...rix.com> wrote:
>>>>> On 29/07/2015 23:49, Boris Ostrovsky wrote:
>>>>>> On 07/29/2015 06:46 PM, David Vrabel wrote:
>>>>>>> On 29/07/2015 23:11, Andrew Cooper wrote:
>>>>>>>> On 29/07/2015 23:05, Andy Lutomirski wrote:
>>>>>>>>> On Wed, Jul 29, 2015 at 2:37 PM, Andrew Cooper
>>>>>>>>> <andrew.cooper3@...rix.com> wrote:
>>>>>>>>>> On 29/07/2015 22:26, Andy Lutomirski wrote:
>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:23 PM, Boris Ostrovsky
>>>>>>>>>>> <boris.ostrovsky@...cle.com> wrote:
>>>>>>>>>>>> On 07/29/2015 03:03 PM, Andrew Cooper wrote:
>>>>>>>>>>>>> On 29/07/15 15:43, Boris Ostrovsky wrote:
>>>>>>>>>>>>>> FYI, I have got a repro now and am investigating.
>>>>>>>>>>>>> Good and bad news.  This bug has nothing to do with LDTs
>>>>>>>>>>>>> themselves.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have worked out what is going on, but this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>> index 5abeaac..7e1a82e 100644
>>>>>>>>>>>>> --- a/arch/x86/xen/enlighten.c
>>>>>>>>>>>>> +++ b/arch/x86/xen/enlighten.c
>>>>>>>>>>>>> @@ -493,6 +493,7 @@ static void set_aliased_prot(void *v,
>>>>>>>>>>>>> pgprot_t prot)
>>>>>>>>>>>>>              pte = pfn_pte(pfn, prot);
>>>>>>>>>>>>>     +       (void)*(volatile int*)v;
>>>>>>>>>>>>>            if (HYPERVISOR_update_va_mapping((unsigned long)v,
>>>>>>>>>>>>> pte, 0)) {
>>>>>>>>>>>>>                    pr_err("set_aliased_prot va update failed w/
>>>>>>>>>>>>> lazy mode
>>>>>>>>>>>>> %u\n", paravirt_get_lazy_mode());
>>>>>>>>>>>>>                    BUG();
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is perhaps not the fix we are looking for, and every use of
>>>>>>>>>>>>> HYPERVISOR_update_va_mapping() is susceptible to the same problem.
>>>>>>>>>>>> I think in most cases we know that page is mapped so hopefully
>>>>>>>>>>>> this is the
>>>>>>>>>>>> only site that we need to be careful about.
>>>>>>>>>>> Is there any chance we can get some kind of quick-and-dirty fix that
>>>>>>>>>>> can go to x86/urgent in the next few days even if a clean fix isn't
>>>>>>>>>>> available yet?
>>>>>>>>>> Quick and dirty?
>>>>>>>>>>
>>>>>>>>>> Reading from v is the most obvious and quick way, for areas where
>>>>>>>>>> we are
>>>>>>>>>> certain v exists, is kernel memory and is expected to have a backing
>>>>>>>>>> page.  I don't know offhand how many of current
>>>>>>>>>> HYPERVISOR_update_va_mapping() callsites this applies to.
>>>>>>>>> __get_user((char *)v, tmp), perhaps, unless there's something better
>>>>>>>>> in the wings.  Keep in mind that we need this for -stable, and it's
>>>>>>>>> likely to get backported quite quickly due to CVE-2015-5157.
>>>>>>>> Hmm - something like that tucked inside HYPERVISOR_update_va_mapping()
>>>>>>>> would probably work, and certainly be minimal hassle for -stable.
>>>>>>>>
>>>>>>>> Altering the hypercall used is certainly not something to backport, nor
>>>>>>>> are we sure it is a viable fix at this time.
>>>>>>> Changing this one use of update_va_mapping to use mmu_update_normal_pt
>>>>>>> is the correct fix to unblock this LDT series.  I see no reason why this
>>>>>>> cannot be backported.
>>>>>> To properly fix it should include batching and that is not something
>>>>>> that I think we should target for stable.
>>>>> Batching is absolutely not necessary to alter update_va_mapping to
>>>>> mmu_update_normal_pt.  After all, update_va_mapping isn't batched.
>>>>>
>>>>> However this isn't the first issue issue we have had lazy mmu faulting,
>>>>> and I doubt it is the last.  There are not many callsites of
>>>>> update_va_mapping - I will audit them tomorrow and see if any similar
>>>>> issues are lurking elsewhere.
>>>> One thing I should add: nothing flushes old aliases in xen_alloc_ldt,
>>>> yet I haven't been able to get xen_alloc_ldt to fail or subsequent LDT
>>>> access to fault.  Is this something we should be worried about?
>>> Yes.  update_va_mapping() will function perfectly well taking one RW
>>> mapping to RO even if there is a second RW mapping.  In such a case, the
>>> next LDT access will fault.
>> Which is a problem because that alias might still exist, and also
>> because Linux really doesn't expect that fault.
>>
>>> On closer inspection, Xen is rather unhelpful with the fault.  Xen's
>>> lazy #PF will be bounced back to the guest with cr2 adjusted to appear
>>> in the range passed to set_ldt().  The error code however will be
>>> unmodified (and limited only by not-user and not-reserved), so will
>>> appear as a non-present read or write supervisor access to an address
>>> which the kernel has a valid read mapping of.
>> More yuck.
>>
>> I think I'm just going to stick an unconditional vm_flush_aliases in alloc_ldt.
>>
>>> Therefore, set_ldt() needs to be confident that there are no writeable
>>> mappings to the frames used to make up the LDT.  It could proactively
>>> fault them in by accessing one descriptor in each page inside the limit,
>>> but by the time a fault is received it is probably too late to work out
>>> where the other mapping is which prevented the typechange (or indeed,
>>> whether Xen objected to one of the descriptors instead).
>> This seems like overkill.
>>
>> I'm still a bit confused, though: the failure is in xen_free_ldt.  How
>> do we make it all the way to xen_free_ldt without the vmapped page
>> existing in the guest's page tables?  After all, we had to survive
>> xen_alloc_ldt first, and ISTM that should fail in exactly the same
>> way.
> (Summarising part of a discussion which has just occurred on IRC)
>
> I presume that xen_free_ldt() is called while in the context of an mm
> which doesn't have the particular area of the vmalloc() space faulted in.

This is exactly what's happening --- the bug is only triggered during 
exit and xen_free_ldt() is called from someone else's context, e.g.:

[   53.986677] Call Trace:
[   53.986677]  [<c105312d>] xen_free_ldt+0x2d/0x40
[   53.986677]  [<c1062310>] free_ldt_struct.part.1+0x10/0x40
[   53.986677]  [<c1062735>] destroy_context+0x25/0x40
[   53.986677]  [<c10a764e>] __mmdrop+0x1e/0xc0
[   53.986677]  [<c10c9858>] finish_task_switch+0xd8/0x1a0
[   53.986677]  [<c1863736>] __schedule+0x316/0x950
[   53.986677]  [<c1863d96>] schedule+0x26/0x70
[   53.986677]  [<c10ac613>] do_wait+0x1b3/0x200
[   53.986677]  [<c10ac9d7>] SyS_waitpid+0x67/0xd0
[   53.986677]  [<c10aa820>] ? task_stopped_code+0x50/0x50
[   53.986677]  [<c186717a>] syscall_call+0x7/0x7

But that would imply that this other context has mm->context.ldt of 
ldt_gdt_32. How is that possible?

-boris

>
> This is (I presume) why reading 'v' (which occasionally causes a
> pagefault to occur) fixes the issue.
>
> ~Andrew

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/