Message-ID: <57067A4F.9090101@virtuozzo.com>
Date: Thu, 7 Apr 2016 18:18:39 +0300
From: Dmitry Safonov <dsafonov@...tuozzo.com>
To: Andy Lutomirski <luto@...capital.net>
CC: Thomas Gleixner <tglx@...utronix.de>,
Dmitry Safonov <0x7f454c46@...il.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Ingo Molnar <mingo@...hat.com>,
Shuah Khan <shuahkh@....samsung.com>,
Borislav Petkov <bp@...en8.de>, X86 ML <x86@...nel.org>,
<khorenko@...tuozzo.com>,
Andrew Morton <akpm@...ux-foundation.org>,
<xemul@...tuozzo.com>, <linux-kselftest@...r.kernel.org>,
Cyrill Gorcunov <gorcunov@...nvz.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to
change compatible mode
On 04/07/2016 05:39 PM, Andy Lutomirski wrote:
> On Apr 7, 2016 5:12 AM, "Dmitry Safonov" <dsafonov@...tuozzo.com> wrote:
>> On 04/06/2016 09:04 PM, Andy Lutomirski wrote:
>>> [cc Dave Hansen for MPX]
>>>
>>> On Apr 6, 2016 9:30 AM, "Dmitry Safonov" <dsafonov@...tuozzo.com> wrote:
>>>> Now each process that runs natively on x86_64 may execute 32-bit code
>>>> by properly setting its CS selector: either from the LDT or by reusing
>>>> Linux's USER32_CS. The reverse is also valid: running 64-bit code in a
>>>> compatible task is possible by choosing USER_CS.
>>>> So we may switch between 32-bit and 64-bit code execution in any process.
>>>> Linux will choose the right syscall numbers in the entry code for those
>>>> processes. But it will still consider them native/compat according to
>>>> the personality that the ELF loader set on launch. This affects, e.g.,
>>>> the ptrace syscall on those tasks: PTRACE_GETREGSET will return a
>>>> 64/32-bit regset according to the process's mode (that's how strace
>>>> detects a task's personality since version 4.8).
>>>>
>>>> This patch adds arch_prctl calls for x86 that make it possible to tell
>>>> the Linux kernel in which mode the application is currently running.
>>>> Mainly, this is needed for CRIU: restoring both compatible and native
>>>> applications from a 64-bit restorer. For that reason I wrapped all
>>>> the code in CONFIG_CHECKPOINT_RESTORE.
>>>> This patch also solves a problem with running 64-bit code in a 32-bit
>>>> ELF (and the reverse): you only have the 32-bit ELF vdso for fast
>>>> syscalls. When switching between native <-> compat mode by arch_prctl,
>>>> it will remap the vdso binary blob needed for the target mode.
>>> General comments first:
>> Thanks for your comments.
>>> You forgot about x32.
>> Will add x32 support for v2.
>>
>>> I think that you should separate vdso remapping from "personality".
>>> vdso remapping should be available even on native 32-bit builds, which
>>> means that either you can't use arch_prctl for it or you'll have to
>>> wire up arch_prctl as a 32-bit syscall.
>> I can't say I got your point. By vdso remapping, do you mean
>> mremap for the vdso/vvar pages? I think it should work now.
> For 32-bit, the vdso *must* exist in memory at the address that the
> kernel thinks it's at. Even if you had a pure 32-bit restore stub,
> you would still need vdso remap, because there's a chance the vdso
> could land at an unusable address, say one page off from where you
> want it. You couldn't map a wrapper because there wouldn't be any
> space for it without moving the real vdso out of the way.
>
> Remember, you *cannot* mremap() the 32-bit vdso because you will
> crash. It works by luck for 64-bit, but it's plausible that we'd want
> to change that some day. (I have awful patches that speed a bunch of
> things up at the cost of a vdso trampoline for 64-bit code and a bunch
> of other hacks. Those patches will never go in for real, but
> something else might want the ability to use 64-bit vdso trampolines.)
Thanks for the elaboration, now I see. Signals and fast syscalls
expect mm->context.vdso to be correct.
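
For illustration, a minimal userspace sketch of how a process can see where the kernel placed the vdso at exec time. It reads the AT_SYSINFO_EHDR auxv entry with getauxval(); it does not read mm->context.vdso itself, and the auxv value is not updated if the mapping is moved later, which is part of why a plain mremap() is not enough. The helper names vdso_base() and looks_like_elf() are made up for this example.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/* Address the kernel advertised for the vdso at exec time,
 * or NULL if no AT_SYSINFO_EHDR entry was passed in auxv. */
const unsigned char *vdso_base(void)
{
    return (const unsigned char *)getauxval(AT_SYSINFO_EHDR);
}

/* The vdso is an ELF image, so a valid mapping starts with the
 * 4-byte ELF magic 0x7f 'E' 'L' 'F'. */
int looks_like_elf(const unsigned char *p)
{
    return p != NULL && memcmp(p, "\177ELF", 4) == 0;
}
```

A restorer could compare vdso_base() against the address recorded in the checkpoint image to decide whether the vdso landed where the restored task expects it.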
>
>> I did the vdso remapping because the blob for a native x86_64 task
>> differs from the one for a compatible task. So it's just changing
>> blobs; the address value is there for convenience. I may omit it and
>> just remap a different vdso blob at the same place as the previous vdso.
>> I'm not sure why we need the possibility to map a 64-bit vdso blob
>> on native 32-bit builds?
> That would fail, but I think the API should exist. But a native
> 32-bit program should be able to remap the 32-bit vdso.
>
> IOW, I think you should be able to do, roughly:
>
> map_new_vdso(VDSO_32BIT, addr);
>
> on any kernel.
>
> Am I making sense?
Yes. I will rework it into an API along those lines.
>
>>> For "personality", someone needs to enumerate all of the various things
>>> that try to track bitness and see how many of them even make sense.
>>> On brief inspection:
>>>
>>> - TIF_IA32: affects signal format and does something to ptrace. I
>>> suspect that whatever it does to ptrace is nonsensical, and I don't
>>> know whether we're stuck with it.
>>>
>>> - TIF_ADDR32 affects TASK_SIZE and mmap behavior (and the latter
>>> isn't even done in a sensible way).
>>>
>>> - is_64bit_mm affects MPX and uprobes.
>>>
>>> On even more brief inspection:
>>>
>>> - uprobes using is_64bit_mm is buggy.
>>>
>>> - I doubt that having TASK_SIZE vary serves any purpose. Does anyone
>>> know why TASK_SIZE is different for different tasks? It would save
>>> code size and speed things up if TASK_SIZE were always TASK_SIZE_MAX.
>>> - Using TIF_IA32 for signal processing is IMO suboptimal. Instead,
>>> we should record which syscall installed the signal handler and use
>>> the corresponding frame format.
>> Oh, I like it, will do.
>>
>>> - Using TIF_IA32 of the *target* for ptrace is nonsense. Having
>>> strace figure out syscall type using that is actively buggy, and I ran
>>> into that bug a few days ago and cursed at it. strace should inspect
>>> TS_COMPAT (I don't know how, but that's what should happen). We may
>>> be stuck with this for ABI reasons.
>> ptrace may check seg_32bit for code selector, what do you think?
> Not sure. I have never fully wrapped my head around ptrace.
Hm, after some thinking, I guess it's better to check TS_COMPAT:
it's set on compatible syscall entry, so there is no need to
check seg_32bit anyway.
Huge thanks, I will work on v2 according to your comments.
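
For reference, a rough sketch of the regset-size heuristic discussed above: a tracer calls PTRACE_GETREGSET with NT_PRSTATUS and infers the tracee's personality from the length the kernel writes back. The sizes assume the x86 layouts of struct user_regs_struct (17 four-byte registers on i386, 27 eight-byte registers on x86_64). The names classify_regset() and probe_tracee() are invented for this example and are not taken from strace. As noted in the thread, this reflects the task's personality rather than the type of the syscall in flight, so it is shown only to make the discussion concrete, not as a recommended check.

```c
#include <elf.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>

enum regset_abi { REGSET_UNKNOWN, REGSET_I386, REGSET_X86_64 };

#define I386_REGSET_SIZE   (17 * 4)   /* i386 user_regs_struct   */
#define X86_64_REGSET_SIZE (27 * 8)   /* x86_64 user_regs_struct */

/* Map the length returned by PTRACE_GETREGSET to an ABI guess. */
enum regset_abi classify_regset(size_t len)
{
    if (len == I386_REGSET_SIZE)
        return REGSET_I386;
    if (len == X86_64_REGSET_SIZE)
        return REGSET_X86_64;
    return REGSET_UNKNOWN;
}

/* Query a stopped tracee; pid must already be attached and stopped. */
enum regset_abi probe_tracee(pid_t pid)
{
    unsigned long long buf[27];       /* large enough for either layout */
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    if (ptrace(PTRACE_GETREGSET, pid, (void *)(long)NT_PRSTATUS, &iov) == -1)
        return REGSET_UNKNOWN;

    /* The kernel shrinks iov_len to the size it actually wrote. */
    return classify_regset(iov.iov_len);
}
```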