Message-ID: <20120330115715.GB13022@somewhere.redhat.com>
Date: Fri, 30 Mar 2012 13:57:17 +0200
From: Frederic Weisbecker <fweisbec@...il.com>
To: "H. Peter Anvin" <hpa@...or.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@...gle.com>,
Ingo Molnar <mingo@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
David Sharp <dhsharp@...gle.com>,
Justin Teravest <teravest@...gle.com>,
Laurent Chavey <chavey@...gle.com>,
Michael Davidson <md@...gle.com>, x86@...nel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace
handler
On Thu, Mar 29, 2012 at 01:06:10PM -0700, H. Peter Anvin wrote:
> On 03/29/2012 12:43 PM, Vaibhav Nagarnaik wrote:
> >
> > However, we agree that the syscall tracing as implemented currently is
> > a bit unwieldy. We would want to be a part of the re-designing effort
> > if there is a momentum in the community towards that goal. We would be
> > happy to contribute towards this effort.
> >
>
> I had a long discussion with Frederic over IRC earlier today. We came
> up with the following strawman:
>
> 1. A system call thunk (which could be enabled/disabled by patching the
> syscall table.) This provides an entry and exit hook, and also sets a
> per-thread flag to capture userspace traffic.
>
> 2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
> capture traffic to userspace. This captures the *full* set of system
> call arguments, including things addressed via pointers. Furthermore,
> it captures the exact versions fed to or returned from the kernel, and
> deals with data-dependent collection like ioctl().
>
> This has to be done with extreme care to avoid introducing overhead in
> the no-tracing case, however, as these functions are extraordinarily
> performance sensitive. This probably will require careful patching in
> the first enable/last disable case.
>
> 3. There will need to be userspace tools written to decode the resulting
> trace buffer. This is pretty much needed anyway, but once you throw in
> complex data structures it becomes even more so. A trace will basically
> consist of:
>
> SYSCALL_ENTRY <syscall number> <arg1..6>
> COPY_FROM_USER <address> <data>
> ...
> COPY_TO_USER <address> <data>
> ...
> SYSCALL_EXIT <return value>
>
> Outputting this in human-readable format requires some reasonably
> sophisticated logic, but the *HUGE* advantage is that not only is all
> the information there, it is *correct by construction*.
>
> -hpa
Note that we already have the relevant tracepoints in place in the
"raw_syscalls" events subsystem. They are generic: there are only two
tracepoints, sys_enter and sys_exit, and they blindly dump the syscall
number, arguments and return value:
$ cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format
name: sys_enter
ID: 53
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_padding; offset:8; size:4; signed:1;
field:long id; offset:16; size:8; signed:1;
field:unsigned long args[6]; offset:24; size:48; signed:0;
print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
$ cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_exit/format
name: sys_exit
ID: 52
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_padding; offset:8; size:4; signed:1;
field:long id; offset:16; size:8; signed:1;
field:long ret; offset:24; size:8; signed:1;
print fmt: "NR %ld = %ld", REC->id, REC->ret
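To illustrate how a userspace tool could consume these two records, here is a minimal sketch that unpacks them using the offsets and sizes from the format files above. It assumes a little-endian 64-bit machine and that the caller already holds the bytes of exactly one event record; extracting records from the ring buffer pages is not covered here. The names SYS_ENTER, SYS_EXIT, decode_sys_enter and so on are made up for this example and are not part of any existing tool.

```python
import struct
from collections import namedtuple

# Layouts taken from the format files above (assumption: little-endian,
# 64-bit long). "4x" skips the hole between common_padding (ends at
# offset 12) and id (starts at offset 16).
SYS_ENTER = struct.Struct("<HBBii4xq6Q")   # 72 bytes: header, id, args[6]
SYS_EXIT = struct.Struct("<HBBii4xqq")     # 32 bytes: header, id, ret

SysEnter = namedtuple("SysEnter", "pid nr args")
SysExit = namedtuple("SysExit", "pid nr ret")


def decode_sys_enter(record):
    """Unpack one raw sys_enter record (hypothetical helper)."""
    _type, _flags, _preempt, pid, _pad, nr, *args = SYS_ENTER.unpack_from(record)
    return SysEnter(pid, nr, tuple(args))


def decode_sys_exit(record):
    """Unpack one raw sys_exit record (hypothetical helper)."""
    _type, _flags, _preempt, pid, _pad, nr, ret = SYS_EXIT.unpack_from(record)
    return SysExit(pid, nr, ret)


def format_sys_enter(ev):
    # Mirrors the "NR %ld (%lx, ...)" print fmt of sys_enter.
    return "NR %d (%s)" % (ev.nr, ", ".join("%x" % a for a in ev.args))


def format_sys_exit(ev):
    # Mirrors the "NR %ld = %ld" print fmt of sys_exit.
    return "NR %d = %d" % (ev.nr, ev.ret)
```

The event IDs (53 and 52 above) are specific to the running kernel, so a real tool would read them from the format files rather than hardcode them.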
What we have yet to do is the syscall table patching and the copy_*_user()
tracepoints. Other than that, the bulk of the remaining work is in userspace.
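
As a rough idea of what the userspace side would have to do with the trace layout from the strawman, the sketch below folds a flat event stream back into one record per syscall. The text encoding is purely an assumption for illustration (one event per line, hex arguments, addresses and data, decimal syscall number and return value), since no encoding has been defined yet. It also assumes the stream comes from a single thread. group_syscalls is a hypothetical name.

```python
def group_syscalls(lines):
    """Yield one dict per syscall from a single-thread event stream.

    Assumed (hypothetical) line formats:
      SYSCALL_ENTRY <nr> <arg1..6 in hex>
      COPY_FROM_USER <address in hex> <data in hex>
      COPY_TO_USER <address in hex> <data in hex>
      SYSCALL_EXIT <return value>
    """
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, rest = fields[0], fields[1:]
        if kind == "SYSCALL_ENTRY":
            current = {
                "nr": int(rest[0]),
                "args": [int(a, 16) for a in rest[1:]],
                "copies": [],
                "ret": None,
            }
        elif current is None:
            continue  # event outside of any syscall, ignore it
        elif kind in ("COPY_FROM_USER", "COPY_TO_USER"):
            # Keep direction, user address and the exact bytes copied.
            current["copies"].append(
                (kind, int(rest[0], 16), bytes.fromhex(rest[1]))
            )
        elif kind == "SYSCALL_EXIT":
            current["ret"] = int(rest[0])
            yield current
            current = None


trace = [
    "SYSCALL_ENTRY 1 1 7ffd0000 5 0 0 0",
    "COPY_FROM_USER 7ffd0000 68656c6c6f",
    "SYSCALL_EXIT 5",
]
calls = list(group_syscalls(trace))
```

Pretty-printing each syscall from such a record (knowing which argument is a pointer and which copy belongs to it) is where the more sophisticated logic would live.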