Open Source and information security mailing list archives
 
Date:   Sat, 17 Dec 2016 01:46:06 +1100
From:   Balbir Singh <bsingharora@...il.com>
To:     Anju T Sudhakar <anju@...ux.vnet.ibm.com>,
        linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org
Cc:     ananth@...ibm.com, mahesh@...ux.vnet.ibm.com, paulus@...ba.org,
        mhiramat@...nel.org, naveen.n.rao@...ux.vnet.ibm.com,
        srikar@...ux.vnet.ibm.com
Subject: Re: [PATCH V2 0/4] OPTPROBES for powerpc



On 15/12/16 03:18, Anju T Sudhakar wrote:
> This is the V2 patch set of the kprobes jump optimization
> (a.k.a. OPTPROBES) for powerpc. Since kprobes is an indispensable
> tool for kernel developers, improving its performance is
> important.
> 
> Currently kprobes inserts a trap instruction to probe a running kernel.
> Jump optimization allows kprobes to replace the trap with a branch,
> reducing the probe overhead drastically.
> 
> In this series, conditional branch instructions are not considered for
> optimization as they have to be assessed carefully in SMP systems.
> 
> The kprobe placed on kretprobe_trampoline at boot time is also
> optimized in this series; patch 4/4 implements this.
> 
> The first two patches can be merged independently of the rest of the
> series. The helper functions they add are used by patch 3/4.
> 
> Performance:
> ============
> An optimized kprobe on powerpc is 1.05 to 4.7 times faster than an
> unoptimized kprobe.
>  
> Example:
>  
> Placed a probe at offset 0x50 in _do_fork().
> *Time Diff here is the difference between the time just before hitting
> the probe and the time just after the probed instruction. mftb() is used
> in kernel/fork.c for this purpose.
>  
> # echo 0 > /proc/sys/debug/kprobes-optimization
> Kprobes globally unoptimized
>  [  233.607120] Time Diff = 0x1f0
>  [  233.608273] Time Diff = 0x1ee
>  [  233.609228] Time Diff = 0x203
>  [  233.610400] Time Diff = 0x1ec
>  [  233.611335] Time Diff = 0x200
>  [  233.612552] Time Diff = 0x1f0
>  [  233.613386] Time Diff = 0x1ee
>  [  233.614547] Time Diff = 0x212
>  [  233.615570] Time Diff = 0x206
>  [  233.616819] Time Diff = 0x1f3
>  [  233.617773] Time Diff = 0x1ec
>  [  233.618944] Time Diff = 0x1fb
>  [  233.619879] Time Diff = 0x1f0
>  [  233.621066] Time Diff = 0x1f9
>  [  233.621999] Time Diff = 0x283
>  [  233.623281] Time Diff = 0x24d
>  [  233.624172] Time Diff = 0x1ea
>  [  233.625381] Time Diff = 0x1f0
>  [  233.626358] Time Diff = 0x200
>  [  233.627572] Time Diff = 0x1ed
>  
> # echo 1 > /proc/sys/debug/kprobes-optimization
> Kprobes globally optimized
>  [   70.797075] Time Diff = 0x103
>  [   70.799102] Time Diff = 0x181
>  [   70.801861] Time Diff = 0x15e
>  [   70.803466] Time Diff = 0xf0
>  [   70.804348] Time Diff = 0xd0
>  [   70.805653] Time Diff = 0xad
>  [   70.806477] Time Diff = 0xe0
>  [   70.807725] Time Diff = 0xbe
>  [   70.808541] Time Diff = 0xc3
>  [   70.810191] Time Diff = 0xc7
>  [   70.811007] Time Diff = 0xc0
>  [   70.812629] Time Diff = 0xc0
>  [   70.813640] Time Diff = 0xda
>  [   70.814915] Time Diff = 0xbb
>  [   70.815726] Time Diff = 0xc4
>  [   70.816955] Time Diff = 0xc0
>  [   70.817778] Time Diff = 0xcd
>  [   70.818999] Time Diff = 0xcd
>  [   70.820099] Time Diff = 0xcb
>  [   70.821333] Time Diff = 0xf0
> 
> Implementation:
> ===============
>  
> The trap instruction is replaced by a branch to a detour buffer. To address
> the range limitation of the branch instruction on the Power architecture,
> the detour buffer slot is allocated from a reserved area. This ensures that
> the branch target is within the ±32 MB range. The current kprobes insn
> caches allocate memory for insn slots with module_alloc(), which will
> always be beyond the ±32 MB range.
>  

This paragraph is a little confusing. We need the detour buffer to be within
+-32 MB, but then you say module_alloc() always gives us memory beyond
+-32 MB.

> The detour buffer contains a call to optimized_callback(), which in turn
> calls the pre_handler(). Once the pre-handler has run, the original
> instruction is emulated from the detour buffer itself. The detour buffer
> also contains a branch back to the normal execution flow after the
> probed instruction has been emulated.

Does the branch itself use any registers that need to be saved? I presume
we are going to rely on the +-32 MB range; what guarantees do we have that
such a mechanism will succeed?

Balbir Singh.
