linux-kernel - Re: [PATCH] x86/vdso: Add prctl to set per-process VDSO load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Fri, 19 Sep 2014 22:30:25 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Filipe Brandenburger <filbranden@...gle.com>
Cc:	Greg Thelen <gthelen@...gle.com>,
	Thomas Gleixner <tglx@...utronix.de>, X86 ML <x86@...nel.org>,
	Michael Davidson <md@...gle.com>,
	Ingo Molnar <mingo@...hat.com>,
	Richard Larocque <rlarocque@...gle.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH] x86/vdso: Add prctl to set per-process VDSO load

On Sep 19, 2014 8:46 PM, "Filipe Brandenburger" <filbranden@...gle.com> wrote:
>
> On Fri, Sep 19, 2014 at 8:27 PM, Andy Lutomirski <luto@...capital.net> wrote:
> >> I thought of doing that from the prctl but AFAICT remap_pfn_range
> >> requires that it's unmapped before the call (remap_pte_range has
> >> BUG_ON(!pte_none(*pte));) and doing a zap_page_range followed by
> >> remap_pfn_range might incur a race condition if another thread of the
> >> same process is accessing the vvar page at that time... Am I wrong
> >> about that race?
> >
> > No, you're right.  Ugh.
> >
> > It might pay to add an explicit .fault callback to vm_special_mapping,
> > but your approach will work, too.  The main downside is more memory
> > overhead per mm.
>
> One way to limit that overhead was to keep only two static
> vm_special_mapping structs (one for the real vvar page and one for the
> zero page) and then change the vma->vm_private_data to point to one or
> the other when the prctl is called. I just thought of that recently
> and haven't tried it yet so not sure if it will cause other
> problems...

Sneaky :)

This might have interesting effects if someone races with you.
mmap_sem could be enough.

On the other hand, holding mmap_sem for write could be enough to let
you zap the pte and remap it.

>
> >>> Changing out the text is a whole can of worms involving self-modifying
> >>> code, although it may be completely safe if done through the page
> >>> tables.  But I don't think you can't use remap_pfn_range for that.
> >>
> >> No, I'm not planning to change the text, just replacing the vvar page
> >> swapping the one where the vsyscall_gtod_data lives with a zero page
> >> (and back depending on the parameter of the prctl.)
> >
> > Hmm.  This will break the _COARSE timing functions as well as
> > __vdso_time.  That'll need fixing, either by swapping out the code
> > (yuck!) or by adding a branch to all of those code paths.
> >
> > Maybe there's a non-branchy way, though.  Let me think.
>
> So this is what I was thinking of extending the vvar region to include
> one extra page (right now it's two pages, the one with the
> vsyscall_gtod_data and the one with the hrtimer I/O mapping, so this
> would be a third page.) This third page would be initially mapped to
> the zero page, so it wouldn't cost an extra page of memory per process
> or even globally.
>
> The prctl would swap that page with another page allocated as the
> prctl is called. That one page that's allocated with the prctl would
> be inherited over forks and execs, so if you have a bunch of processes
> that need the same offset then they'd all share that one single page.
>
> The vdso clock routines would get the base clock information from
> vsyscall_gtod_data, hpet, tsc, etc. and would simply add the offsets
> from the task_vvar page (not too branch-y, just a few extra adds.) In
> the regular case, that would be a zero page so it would be simply
> adding zero. If we need to apply offsets, then we replace that page
> with one that has a struct with the right offsets for each clock type.
>
> When running in kernel mode, the clock code (and timer code, etc.)
> would find that page through the mm_struct and find the clock offsets
> that way.

This seems like a lot of complexity and non-negligible overhead (extra
adds, cacheline misses, and even TLB misses) in the common case just
to have fast timing under very unusual circumstances.

I'd much rather do something to the vvar page that causes all the
timing calls to turn into normal syscalls and then handle the offset
in the timing syscalls.  Are you really doing something that's
performance-sensitive enough to justify all of the complexity of
making this work without syscalls?

To do that, the only real tricky part is handling the coarse timing
calls.  The could be done with extra branches, text patches, or
trickery involving setting up an alternate vvar page that always fails
the seqcount retry check.

>
> I was working on this approach before you suggested swapping the
> actual vvar page (and not sure if you were actually looking at
> hot-swapping the vdso itself.) I can go back to working on it and send
> you a few commits since I still have some of that code around.
>
> Cheers,
> Filipe
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/