[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220329215421.GA2997@kbox>
Date: Tue, 29 Mar 2022 14:54:21 -0700
From: Beau Belgrave <beaub@...ux.microsoft.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Beau Belgrave <beaub@...rosoft.com>,
linux-arch <linux-arch@...r.kernel.org>,
linux-kernel <linux-kernel@...r.kernel.org>,
linux-trace-devel <linux-trace-devel@...r.kernel.org>,
rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Alexei Starovoitov <ast@...nel.org>
Subject: Re: Comments on new user events ABI
On Tue, Mar 29, 2022 at 12:17:14PM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 28, 2022, at 8:29 PM, Beau Belgrave beaub@...ux.microsoft.com wrote:
>
> >> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers
> >> mathieu.desnoyers@...icios.com wrote:
[..]
> >> I also don't get what user-space is expected to do differently if FTRACE vs
> >> PERF is enabled, considering that it gates a writev() to a file descriptor
> >> associated with /sys/kernel/debug/tracing/user_events_data.
> >>
> >
> > The intention wasn't for user-space to check the byte other than non-zero.
> > User processes could do that, but it's more so administration tools can see
> > where the events are registered if they cannot be closed and the state of the
> > machine.
>
> Administration tools could simply use a seqfile interface and parse text. I
> don't expect those to be performance-sensitive at all. So we should not design
> the overhead-sensitive mmap'd ABI (enable flags) for administration tools.
>
> > Maybe you have a slicker way to do this, but it seems to check a bit in the
> > page would involve at least a byte read followed by a mask or shift?
>
> I suspect the minimum cost in terms of instruction would be to have the mask
> prepared in advance (when the event is registered), and therefore the instructions
> required on the user-space fast-path would be a load followed by a mask (&).
>
> > That seems more expensive than checking a byte?
>
> It is one single extra ALU instruction on the fast-path, indeed. But we have to
> be careful about optimizing too much for a non-representative microbenchmark
> and missing the big picture, which includes L1/L2 cache lines and TLB load overhead
> when tracing a real-life workloads.
>
> On modern x86 CPUs, an ALU instruction takes around 0.3ns, which is a few
> orders of magnitude faster than a cache miss. Using bit-wise rather than
> byte-wise indexing increases functional density of those enable flags by a
> factor 8, which lessens the amount of cpu cache required by that same factor.
>
> So as soon as the workload running on the system fills the CPU L1 or L2 caches,
> when the number of user events registered scales up, and when the checks for
> events enabled are done frequently enough, evicting fewer cache lines becomes
> more important than the extra instruction required to apply the mask on the
> fast-path.
>
Ok, makes sense. This will give us more events as well, something I hear
some complaints about.
> >
> >> I would have rather thought that tracers implemented in user-space could
> >> register
> >> themselves, and then there could be one
> >> /sys/kernel/debug/tracing/user_events_status
> >> per tracer. Considering that all kernel tracers use the same ABI to write an
> >> event,
> >> and then dispatch this event internally within the kernel to each registered
> >> tracer, I would expect to have a single memory mapping for all those (e.g. a
> >> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file).
> >>
> >> Then eventually if we have other user-space tracers such as lttng-ust with its
> >> their own user-space code performing tracing in a shared memory ring buffer, it
> >> would make sense to allow it to register its own
> >> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own
> >> indexes.
> >>
> >
> > I don't follow that. The intention is to get user processes to participate with
> > trace_events and the built-in tooling. When would a user-space tracer be used
> > instead of perf/ftrace?
> >
> > It seems like a feature request?
>
> It can very well be out of scope for the user events, and I'm fine with that.
> I was merely considering how the user events could be leveraged by tracers
> implemented purely in user-space. But if the stated goal of this feature is
> really to call into kernel tracers through a writev(), then I suspect that
> supporting purely user-space tracers is indeed out of scope.
>
That was the goal with this ABI, are there maybe ways we can change the
ABI to accomodate this later without shutting that out?
> >
> >> If this facility is ever used by lttng-ust to enable user-space tracing, I would
> >> not
> >> want to take the overhead of calling writev for the sake of kernel tracers if
> >> those are disabled.
> >>
> >
> > If they were disabled the byte wouldn't be set, right? So no writev overhead.
> >
> > Seems I'm missing something.
>
> I was wondering whether we could leverage the user events bit-enable ABI for
> tracers which are not implemented within the Linux kernel (e.g. LTTng-UST).
> But I suspect it might be better for me to re-implement this in user-space
> over shared memory.
>
Perhaps.
> >
> >> So perhaps in the short-term there is no need to implement the user-space tracer
> >> registration ABI, but I would have expected a simple bitmap for
> >> /sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the
> >> bytewise index, because as far as the kernel tracers are concerned, providing
> >> the bit to tell userspace instrumentation exactly which tracers are internally
> >> enabled within the kernel does not appear to be of any use other than increasing
> >> the footprint on the actively used cpu cache lines.
> >>
> >>
> >> * user_events_data page faults
> >>
> >> If my understanding is correct, when the user-space program's memory containing
> >> the payload passed to writev() to a user_events_data file descriptor is kicked
> >> out from the page cache between fault_in_iov_iter_readable and its use by the
> >> tracers due to high memory pressure, the writev() will fail with -EFAULT and
> >> the data will be discarded unless user-space somehow handles this error (which
> >> is not handled in the samples/user_events/sample.c example program). It is good
> >> that the memory is faulted in immediately before calling the tracers, but
> >> considering that it is not mlock'd, should we make more effort to ensure the
> >> tracers are able to handle page faults ?
> >>
> >> Integration of the work done by Michael Jeanson and myself on faultable
> >> tracepoint
> >> would allow the tracepoint probes to take page faults. Then, further
> >> modifications
> >> in the kernel tracers would be needed to handle those page faults.
> >>
> >
> > Is this something that can be done later or does it require ABI changes?
> >
> > I would love to never miss data due to page faults.
>
> This is internal to the user events, tracepoint, and kernel tracers
> implementation. I don't expect this to modify the user events ABI.
>
> The only thing that might require some thinking ABI-wise is how the ring
> buffers are exposed to user-space consumers, because we would want to
> allow taking page faults between space reservation and commit.
>
> The LTTng ring buffer has been supporting this out of the box for years
> now, but this may be trickier for other kernel tracers, for which allowing
> preemption between reserve and commit has never been a requirement until
> now.
>
I'll let Steven comment on that one.
> >
> >>
> >> * user_reg name_args and write_index vs purely user-space tracers
> >>
> >> That part of the user event registration (event layout and ID allocation)
> >> appears
> >> to be intrinsically tied to the kernel tracers and the expected event layout.
> >> This
> >> seems fine as long as the only users we consider are the kernel tracers, but it
> >> appears to be less relevant for purely user-space tracers. Actually, tying the
> >> mmap'd event enable mechanism with the event ID and description makes me wonder
> >> whether it might be better to have LTTng-UST implement its own shared-memory
> >> based
> >> "fast-event-enabling" mechanism rather than use this user-event ABI. The other
> >> advantage of doing all of this in user-space would be to allow many instances
> >> of this bitmap to exist on a given system, e.g. one per container in a
> >> multi-container
> >> system, rather than requiring this to be a global kernel-wide singleton, and to
> >> use
> >> it from a non-privileged user.
> >>
> >
> > We have some conversation going about using namespaces/cgroups to isolation
> > containers with bitmaps/status pages. The main thing I personally want to be
> > able to do is from the root namespace see all the events in the descendents
> > easily via perf, eBPF or ftrace.
> >
> > Link:
> > https://lore.kernel.org/linux-trace-devel/20220316232009.7952988633787ef1003f13b0@kernel.org/
> >
>
> I see that a notion close to "tracing namespaces" comes up in this thread. This is something
> I brought forward at the last Linux Plumbers Conference aiming to facilitate system call tracing
> for a given container (or from a process hierarchy). I suspect the tracing namespace could also
> be tied to a set of kernel buffers, and to a user events "domain". I think this concept could
> neatly solve many of our container-related isolation issues here. As you say, it should probably
> be a hierarchical namespace.
>
Agreed.
> >>
> >> Some comments about the implementation:
> [...]
> >
> >>
> >> return user_events_write_core(file, &i);
> >> }
> >>
> >> include/uapi/linux/user_events.h:
> >>
> >> struct user_reg {
> >>
> >> /* Input: Size of the user_reg structure being used */
> >> __u32 size;
> >>
> >> /* Input: Pointer to string with event name, description and flags */
> >> __u64 name_args;
> >>
> >> /* Output: Byte index of the event within the status page */
> >> __u32 status_index;
> >>
> >> /* Output: Index of the event to use when writing data */
> >> __u32 write_index;
> >> };
> >>
> >> As this structure is expected to grow, and the user-space sample program uses
> >> "sizeof()"
> >> to figure out its size (which includes padding), I would be more comfortable if
> >> this was
> >> a packed structure rather than non-packed, because as fields are added, it's
> >> tricky to
> >> figure out from the kernel perspective whether the size received are fields that
> >> user-space
> >> is aware of, or if this is just padding.
> >>
> >
> > I think that would be a good idea, Steven?
>
> [leaving this for Steven to answer]
>
[..]
> >> kernel/trace/trace_events_user.c:
> >>
> >> static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
> >> {
> >> u32 size;
> >> long ret;
> >>
> >> ret = get_user(size, &ureg->size);
> >>
> >> if (ret)
> >> return ret;
> >>
> >>
> >> if (size > PAGE_SIZE)
> >> return -E2BIG;
> >>
> >> ^ here I would be tempted to validate that the structure size at least provides
> >> room
> >> for the "v0" ABI, e.g.:
> >>
> >> if (size < offsetofend(struct user_reg, write_index))
> >> return -EINVAL;
> >>
> >> return copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);
> >>
> >> ^ I find it odd that the kernel copy of struct user_reg may contain a
> >> size field which contents differs from the size fetched by get_user().
> >> This can happen if a buggy or plainly hostile user-space attempts to
> >> confuse the kernel about the size of this structure. Fortunately, the
> >> size field does not seem to be used afterwards, but I would think it
> >> safer to copy back the "size" fetched by get_user into the reg->size
> >> after copy_struct_from_user in case future changes in the code end up
> >> relying on a consistent size field.
> >> }
> >>
> >
> > Any size that the user doesn't fill in will zero out, if they enter a large
> > size then we error out. I tried to get the data out from the user once and use
> > it consistently from the kernel side. I didn't want users changing their values
> > at key places trying to get the kernel to do bad things.
> >
> > This is why the size isn't used beyond the first copy (and why the data in the
> > write paths is only validated after paging in and copy the data to a buffer
> > that the user process no longer can modify mid-write).
>
> As long as the kreg->size value is not used afterwards, it's technically fine, but
> having it there and inconsistent with the size value loaded by get_user() is error-prone.
> I would recommend to either overwrite this kreg->size value with the size loaded by
> get_user(), or document that this field should never be used because it is inconsistent
> with the size used for copy_struct_from_user().
>
Alright, I'll set it afterwards.
> [...]
>
> >> kernel/trace/trace_events_user.c:
> >>
> >> static int user_events_release(struct inode *node, struct file *file)
> >> {
> >> [...]
> >> /*
> >> * Ensure refs cannot change under any situation by taking the
> >> * register mutex during the final freeing of the references.
> >> */
> >> mutex_lock(®_mutex);
> >> [...]
> >> mutex_unlock(®_mutex);
> >>
> >> kfree(refs);
> >>
> >> ^ AFAIU, the user_events_write() does not rely on reg_mutex to ensure mutual
> >> exclusion.
> >> Doing so would be prohibitive performance-wise. But I suspect that freeing
> >> "refs" here
> >> without waiting for a RCU grace period can be an issue if user_events_write_core
> >> is using
> >> refs concurrently with file descriptor close.
> >>
> >
> > The refs is per-file, so if user_events_write() is running release cannot be
> > called for that file instance, since it's being used. Did I miss something?
>
> I'm possibly the one missing something here, but what prevents 2 threads
> from doing user_event_write() and close() concurrently on the same file descriptor ?
>
While nothing prevents it, my understanding is that the actual release()
on the file_operations won't be invoked until the file struct has been
ref'd down to zero. (fs/file_table.c:fput, fs/read_write.c:do_pwritev).
While the write call is being run, the file should have a non-zero ref
count. I believe looking at do_pwritev the fdget/fdput pair are largely
responsible for this, do you agree?
> >
> >> kernel/trace/trace_events_user.c:
> >>
> >> static bool user_field_match(struct ftrace_event_field *field, int argc,
> >> const char **argv, int *iout)
> >> [...]
> >>
> >> for (; i < argc; ++i) {
> >> [...]
> >> pos += snprintf(arg_name + pos, len - pos, argv[i]);
> >>
> >> ^ what happens if strlen(argv[i]) > (len - pos) ? Based on lib/vsprintf.c:
> >>
> >> * The return value is the number of characters which would be
> >> * generated for the given input, excluding the trailing null,
> >> * as per ISO C99. If the return is greater than or equal to
> >> * @size, the resulting string is truncated.
> >>
> >> So the "pos" returned by the first call to sprintf would be greater than
> >> MAX_FIELD_ARG_NAME.
> >> Then the second call to snprintf passes a @size argument of "len - pos" using
> >> the pos value
> >> which is larger than len... which is a negative integer passed as argument to a
> >> size_t (unsigned).
> >> So it expects a very long string. And the @buf argument is out-of-bound
> >> (field_name + pos).
> >> Is this pattern for using snprintf() used elsewhere ? From a quick grep, I find
> >> this pattern in
> >> a few places where AFAIU the input is not user-controlled (as it seems to be the
> >> case here), but
> >> still it might be worth looking into:
> >>
> >> kernel/cgroup/cgroup.c:show_delegatable_files()
> >> kernel/time/clocksource.c:available_clocksource_show()
> >>
> >> Also, passing a copy of a userspace string (argv[i]) as format string argument
> >> to
> >> snprintf can be misused to leak kernel data to user-space.
> >>
> >> The same function also appear to have similar issues with its use of the
> >> field->name userspace input
> >> string.
> >>
> >
> > The strings are copied first before coming down these paths, so they cannot
> > be forced to changed mid-reads, etc.
>
> That's correct. But their combined length can be larger than MAX_FIELD_ARG_NAME.
>
> > I think simply checking if any arg is beyond the MAX_FIELD_ARG_NAME would
> > prevent these types of issues, right?
>
> No. The issue is when the combined length is larger than the max. If I look at
> other tracing code, I notice that it often uses a pattern where snprintf is first
> called on the input with a @size=0 to calculate the string length, then the
> output string is dynamically allocated with the correct length, and the entire
> code performing the snprintf formatting is called again with a nonzero size,
> thus effectively populating the output string. There are other ways to prevent
> these issues, but I would be tempted to use the same pattern used elsewhere in
> the tracing code. See trace_events_user.c:user_event_set_print_fmt() as an example.
>
> And don't forget that passing a copy of a user input as snprintf format string is
> bad, because it allows user-space to provide event/fields names with e.g. "%s" or
> "%p" and control the behavior of snprintf in unwanted ways, typically leading to
> a kernel memory information leak to user-space.
>
Yes, agreed. I will address this.
> Thanks,
>
> Mathieu
>
> >
> >> Unfortunately this is all the time I have for review right now, but it is at
> >> least a good starting
> >> point for discussion.
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> http://www.efficios.com
> >
> > Thanks,
> > -Beau
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
Thanks,
-Beau
Powered by blists - more mailing lists