linux-kernel - Re: vdso && cr (Was: arch_check_bp_in

Message-Id: <20121123002021.9A78159206F@miso.sublimeip.com>
Date:	Fri, 23 Nov 2012 11:20:21 +1100 (EST)
From:	u3557@...o.sublimeip.com (Amnon Shiloh)
To:	xemul@...allels.com (Pavel Emelyanov)
Cc:	oleg@...hat.com (Oleg Nesterov),
	gorcunov@...nvz.org (Cyrill Gorcunov),
	rostedt@...dmis.org (Steven Rostedt),
	fweisbec@...il.com (Frederic Weisbecker),
	mingo@...hat.com (Ingo Molnar),
	a.p.zijlstra@...llo.nl (Peter Zijlstra),
	linux-kernel@...r.kernel.org
Subject: Re: vdso && cr (Was: arch_check_bp_in_kernelspace: fix the range

Hi Pavel,

> >>
> >> Now however, that "vsyscall" was effectively replaced by vdso, it
> >> creates a new problem for me and probably for anyone else who uses
> >> some form of checkpoint/restore:
> > 
> > Oh, sorry, I can't help here. I can only add Cyrill and Pavel, they
> > seem to enjoy trying to solve the c/r problems.
> 
> Thank you :)

Thank YOU for joining!

> >> Suppose a process is checkpointed because the system needs to reboot
> >> for a kernel-upgrade, then restored on the new and different kernel.
> >> The new VDSO page may no longer match the new kernel - it could for
> >> example fetch data from addresses in the vsyscall page that now
> >> contain different things; or in case the hardware also was changed,
> >> it may use machine-instructions that are now illegal.
> 
> If we could make VDSO entry points not move across the kernels (iow, make
> them looks as yet another syscall table) this would help, I suppose.

It will indeed solve PART of the problem, but there is one more issue:

One obviously cannot c/r a process while it runs in the VDSO page
without c/r'ing that page itself, but this can probably be handled
by single-stepping the process until it is out of that page (assuming
there are no sleeps, pauses or extremely long loops on that page).
But suppose a caught signal interrupts the VDSO code and the process
needs to be checkpointed while inside the signal handler: eventually
the handler will return ("sigreturn") to the VDSO page... a different
page... and will probably land on the wrong machine-instruction (or even
in the middle of one), with all registers scrambled anyway.
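
Single-stepping out of the VDSO page requires knowing whether the
instruction pointer currently lies inside it. A minimal sketch of that
check, assuming a Linux /proc filesystem where the mapping is labelled
"[vdso]"; the names addr_range, find_vdso_range and ip_in_range are
hypothetical helpers, not part of any existing c/r package:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: a half-open address range [start, end). */
struct addr_range {
    uintptr_t start;
    uintptr_t end;
};

/*
 * Look up the "[vdso]" mapping of process `pid` in /proc/<pid>/maps.
 * Returns 0 and fills *out on success, -1 if the file cannot be read
 * or no such mapping is listed.
 */
static int find_vdso_range(pid_t pid, struct addr_range *out)
{
    char path[64], line[512];
    int found = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/maps", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;

        if (!strstr(line, "[vdso]"))
            continue;
        /* Each maps line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            out->start = (uintptr_t)start;
            out->end = (uintptr_t)end;
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/* Non-zero if `ip` lies inside the range. */
static int ip_in_range(const struct addr_range *r, uintptr_t ip)
{
    return ip >= r->start && ip < r->end;
}
```

A tracer would typically read the instruction pointer with
PTRACE_GETREGS (for example the rip field of struct user_regs_struct
on x86-64) and keep issuing PTRACE_SINGLESTEP while ip_in_range()
is true.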

The solution can be to hold all caught signals while in the VDSO page.
This is not something the application (or library) can reasonably do
due to the prohibitive cost of "sigprocmask()" before and after, which
would defeat the whole purpose of the VDSO page, but it could be achieved
by some new 'prctl' option (or perhaps even be the default).

In my specific case, because the checkpointed process is ptraced,
and assuming VDSO entry points are fixed, the ptracer can postpone
all caught signals that occur within the VDSO page, but for others
who write/maintain a c/r package, that's probably not an option.

> 
> > Sure. You shouldn't try to save/restore this page(s) directly. But
> > I do not really understand why do you need. IOW, I don't really
> > understand the problem, it depends on what c/r actually does.
> 
> Think about it like this -- you stop a process, then change the kern^w VDSO
> page. Everything should work as it used to be :)

There are two reasons one may need to save/restore this page:
1) Entry points are not fixed (yet).
2) The process may need to return to it from a signal handler.

> 
> >> As I don't mind to forego the "fast" sys_time(), my obvious solution
> >> is to disable the vdso for traced processes that may be checkpointed.
> 
> This is very poor solution from my POV. Nobody wants to have their applications
> work fast only until it's checkpointed.

I know, but it's a price I must and am willing to pay until a solution
is found that prevents catching signals within the VDSO page.

I made a small experiment and just zeroed out the whole VDSO page
straight after "execve" (brute force, easier than having to study
the internal format of the VDSO page).  The program worked, using
the glibc version of "gettimeofday()" instead (which used "vsyscall",
but probably not for much longer).
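
For anyone who does want to look at the format: the VDSO is an ELF
shared object, and the kernel passes its load address to each process
in the AT_SYSINFO_EHDR auxiliary-vector entry. A small sketch that
locates it in the current process and checks the ELF magic (the helper
names vdso_base and looks_like_elf are hypothetical):

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Base address of this process's VDSO, or NULL if none was provided. */
static const unsigned char *vdso_base(void)
{
    return (const unsigned char *)getauxval(AT_SYSINFO_EHDR);
}

/* Non-zero if `p` points at the 4-byte ELF magic "\177ELF". */
static int looks_like_elf(const unsigned char *p)
{
    return p != NULL && memcmp(p, ELFMAG, SELFMAG) == 0;
}
```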

So consider my immediate personal problem solved - what I'll do next
is to compile a special temporary kernel with all vdso functions
(__vdso_gettimeofday, __vdso_time, __vdso_clock_gettime, __vdso_getcpu)
reduced to plain system calls, so they become kernel/hardware-independent,
then I'll save and set aside the resulting VDSO page and always replace
the original VDSO page with "my-vdso" after "execve".
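
As a user-space illustration of what "reduced to system calls" means
(this is not the kernel-side change itself), each function simply
enters the kernel instead of reading kernel-exported data. The slow_*
names are hypothetical, and the sketch assumes 64-bit Linux, where the
raw syscall arguments match the libc struct layouts; slow_time() goes
through clock_gettime because not every architecture has a separate
time system call:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Always enters the kernel; never touches VDSO data. */
static int slow_gettimeofday(struct timeval *tv, struct timezone *tz)
{
    return (int)syscall(SYS_gettimeofday, tv, tz);
}

static int slow_clock_gettime(clockid_t clk, struct timespec *ts)
{
    return (int)syscall(SYS_clock_gettime, clk, ts);
}

/* Seconds since the epoch, derived from the clock_gettime syscall. */
static time_t slow_time(time_t *tloc)
{
    struct timespec ts;

    if (slow_clock_gettime(CLOCK_REALTIME, &ts) != 0)
        return (time_t)-1;
    if (tloc)
        *tloc = ts.tv_sec;
    return ts.tv_sec;
}

/* The third getcpu argument is unused by current kernels. */
static int slow_getcpu(unsigned *cpu, unsigned *node)
{
    return (int)syscall(SYS_getcpu, cpu, node, NULL);
}
```

Because these depend only on the system-call ABI, the same code keeps
working after a restore on a different kernel or different hardware,
at the price of a kernel entry per call.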

However, this doesn't solve the problem for other c/r packages that do
not ptrace their processes all the time, and are therefore unable to
replace the VDSO page immediately after each "execve".
For them you will need to either:

1) fix the VDSO entry points + introduce a kernel feature to prevent
   catching signals within the VDSO page (probably a new prctl,
   or make it the default) ; or
2) introduce a kernel feature (probably a new prctl, so long as
   it is not reset across fork/clone/exec) that lets programs
   request a "slow-but-sure", kernel/hardware-independent
   version of the VDSO page.

> 
> Thanks,
> Pavel
> 


Thank you and Best Regards,
Amnon.