linux-kernel - Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090316234657.GC6150@nowhere>
Date:	Tue, 17 Mar 2009 00:46:58 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	"Frank Ch. Eigler" <fche@...hat.com>
Cc:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>,
	Masami Hiramatsu <mhiramat@...hat.com>,
	Roland McGrath <roland@...hat.com>,
	Ingo Molnar <mingo@...e.hu>,
	LKML <linux-kernel@...r.kernel.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Jiaying Zhang <jiayingz@...gle.com>,
	Martin Bligh <mbligh@...gle.com>, utrace-devel@...hat.com
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
> Hi -
> 
> 
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> 
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a process.
> > > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > > 
> > > utrace-+-ptrace_engine---owner_process
> > >        |
> > >        +-systemtap_module
> > >        |
> > >        +-ftrace_plugin
> 
> Right.  In this way, utrace is simply a multiplexing intermediary.
> 
> 
> > > Here, Frank had posted an example of utrace->ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > > 
> > > And here is the latest his patch(which seems to support syscall tracing...)
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > > 
> > 
> > Reminder : we are looking at system-wide tracing here. Here are some
> > comments about the current utrace implementation.
> > 
> > Looking at include/linux/utrace.h from the tree
> > 
> > 17  * A tracing engine starts by calling utrace_attach_task() or
> > 18  * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> > 19  * (&struct utrace_engine_ops), and some associated data.  This produces a
> > 20  * &struct utrace_engine, which is the handle used for all other
> > 21  * operations.  An attached engine has its ops vector, its data, and an
> > 22  * event mask controlled by utrace_set_events().
> > 
> > So if the system has, say 3000 threads, then we have 3000 struct
> > utrace_engine created ? I wonder what effet this could have one
> > cachelines if this is used to trace hot paths like system call
> > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> 
> It has not been a problem, since utrace_engines are designed to be
> lightweight.  Starting or stopping a systemtap script of the form
> 
>     probe process.syscall {}
> 
> appears to have no noticable impact on a tbench suite.
> 
> 
> > 24  * For each event bit that is set, that engine will get the
> > 25  * appropriate ops->report_*() callback when the event occurs.  The
> > 26  * &struct utrace_engine_ops need not provide callbacks for an event
> > 27  * unless the engine sets one of the associated event bits.
> > 
> > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> 
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.
> 
> For example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
> 
> 
> > 682 /**
> > 683  * utrace_set_events_pid - choose which event reports a tracing engine gets
> > 684  * @pid:                thread to affect
> > 685  * @engine:             attached engine to affect
> > 686  * @eventmask:          new event mask
> > 687  *
> > 688  * This is the same as utrace_set_events(), but takes a &struct pid
> > 689  * pointer rather than a &struct task_struct pointer.  The caller must
> > 690  * hold a ref on @pid, but does not need to worry about the task
> > 691  * staying valid.  If it's been reaped so that @pid points nowhere,
> > 692  * then this call returns -%ESRCH.
> > 
> > 
> > Comments like "but does not need to worry about the task staying valid"
> > does not make me feel safe and comfortable at all, could you explain
> > how you can assume that derefencing an "invalid" pointer will return
> > NULL ?
> 
> (We're doing a final round of "internal" (pre-LKML) reviews of the
> utrace implementation right now on utrace-devel@...hat.com, where such
> comments get fastest attention from the experts.)
> 
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions.  This is not about "feeling" safe, it's about
> what the mechanism is deliberately designed to permit.
> 
> 
> > About the utrace_attach_task() :
> > 
> > 244         if (unlikely(target->flags & PF_KTHREAD))
> > 245                 /*
> > 246                  * Silly kernel, utrace is for users!
> > 247                  */
> > 248                 return ERR_PTR(-EPERM);
> > 
> > So we cannot trace kernel threads ?
> 
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.
> 


Some of them use some syscalls, but it doesn't involve a user/kernel switch.
So it's not tracable by hooking syscall_entry/exit or using tracehooks.
It would require specific hooks on sys_* functions for that.

So this check is right (writing on each thread info seems somewhat costly so
it's better if it is avoided like here).

Frederic.

 
> > 118 /*
> > 119  * Called without locks, when we might be the first utrace engine to attach.
> > 120  * If this is a newborn thread and we are not the creator, we have to wait
> > 121  * for it.  The creator gets the first chance to attach.  The PF_STARTING
> > 122  * flag is cleared after its report_clone hook has had a chance to run.
> > 123  */
> > 124 static inline int utrace_attach_delay(struct task_struct *target)
> > 125 {
> > 126         if ((target->flags & PF_STARTING) && target->real_parent != current)
> > 127                 do {
> > 128                         schedule_timeout_interruptible(1);
> > 129                         if (signal_pending(current))
> > 130                                 return -ERESTARTNOINTR;
> > 131                 } while (target->flags & PF_STARTING);
> > 132
> > 133         return 0;
> > 134 }
> > 
> > Why do we absolutely have to poll until the thread has started ?
> 
> (I don't know off the top of my head - Roland?)
> 
> 
> > utrace_add_engine()
> >   set_notify_resume(target);
> > 
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> > 
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in 
> > entry_64.S
> > 
> > int_signal:
> > and
> > retint_signal:
> > 
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> > some subtle way I did not figure out ? BTW re-reading the TIF flags from
> > the thread_info at syscall exit on the fast path is out of question
> > because it considerably degrades the kernel performances. entry_*.S is
> > a very, very critical path.
> 
> (I don't know off the top of my head - Roland?)
> 
> 
> - FChE

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/