Message-ID: <CANLsYkz9YYEe=BGwsWtmdw03dnO9W=XUm+WOUzzqXZ+SWTr8Vw@mail.gmail.com>
Date:	Fri, 29 Apr 2016 12:12:53 -0600
From:	Mathieu Poirier <mathieu.poirier@...aro.org>
To:	Alexander Shishkin <alexander.shishkin@...ux.intel.com>
Cc:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
	Borislav Petkov <bp@...en8.de>, Ingo Molnar <mingo@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	vince@...ter.net, Stephane Eranian <eranian@...gle.com>,
	Arnaldo Carvalho de Melo <acme@...radead.org>
Subject: Re: [PATCH v2 5/7] perf: Introduce address range filtering

On 27 April 2016 at 09:44, Alexander Shishkin
<alexander.shishkin@...ux.intel.com> wrote:
> Many instruction trace pmus out there support address range-based
> filtering, which would, for example, generate trace data only for a
> given range of instruction addresses; this is useful for tracing
> individual functions, modules or libraries. Other pmus may also
> utilize this functionality to filter in or filter out code at
> certain address ranges.
>
> This patch introduces the interface for userspace to specify these
> filters and for the pmu drivers to apply these filters to hardware
> configuration.
>
> The user interface is an ascii string that is passed via an ioctl
> and specifies address ranges within certain object files or within
> the kernel. There is no special treatment for kernel modules yet,
> but it might be a worthy pursuit.
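
[ For illustration: from userspace, attaching such a filter to an
already-opened event fd might look like the sketch below. This is a
hypothetical example; it assumes the existing PERF_EVENT_IOC_SET_FILTER
ioctl is the entry point (which matches perf_event_set_filter() further
down), and the path and offsets are made up.

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/perf_event.h>

  /* Trace only 0x400 bytes starting at offset 0x1000 of /bin/ls. */
  static int set_addr_filter(int perf_fd)
  {
          const char *fstr = "filter 0x1000/0x400@/bin/ls";

          if (ioctl(perf_fd, PERF_EVENT_IOC_SET_FILTER, fstr) < 0) {
                  perror("PERF_EVENT_IOC_SET_FILTER");
                  return -1;
          }
          return 0;
  }
]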
>
> The pmu driver interface adds two extra callbacks to the pmu driver
> structure: one validates the filter configuration proposed by the
> user against what the hardware is actually capable of doing, and
> the other translates the hardware-independent filter configuration
> into something that can be programmed into the hardware.
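
[ A minimal driver-side sketch of these two callbacks, assuming a pmu
with only two hardware address comparators; my_pmu, my_validate and
my_sync are illustrative names only:

  static int my_validate(struct list_head *filters)
  {
          struct perf_addr_filter *filter;
          int count = 0;

          /* this imaginary hardware has only two address comparators */
          list_for_each_entry(filter, filters, entry)
                  if (++count > 2)
                          return -EOPNOTSUPP;

          return 0;
  }

  static void my_sync(struct perf_event *event)
  {
          /* translate the event's hw-agnostic filter list (plus the
           * vma load addresses in event->addr_filters_offs[]) into
           * hardware configuration kept in event->hw.addr_filters */
  }

  static struct pmu my_pmu = {
          .nr_addr_filters        = 2,
          .addr_filters_validate  = my_validate,
          .addr_filters_sync      = my_sync,
  };
]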
>
> Signed-off-by: Alexander Shishkin <alexander.shishkin@...ux.intel.com>
> ---
>  include/linux/perf_event.h |  98 +++++++
>  kernel/events/core.c       | 623 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 705 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 85749ae8cb..32b2b4866a 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -151,6 +151,15 @@ struct hw_perf_event {
>          */
>         struct task_struct              *target;
>
> +       /*
> +        * PMU would store hardware filter configuration
> +        * here.
> +        */
> +       void                            *addr_filters;
> +
> +       /* Last sync'ed generation of filters */
> +       unsigned long                   addr_filters_gen;
> +
>  /*
>   * hw_perf_event::state flags; used to track the PERF_EF_* state.
>   */
> @@ -240,6 +249,9 @@ struct pmu {
>         int                             task_ctx_nr;
>         int                             hrtimer_interval_ms;
>
> +       /* number of address filters this pmu can do */
> +       unsigned int                    nr_addr_filters;
> +
>         /*
>          * Fully disable/enable this PMU, can be used to protect from the PMI
>          * as well as for lazy/batch writing of the MSRs.
> @@ -393,12 +405,71 @@ struct pmu {
>         void (*free_aux)                (void *aux); /* optional */
>
>         /*
> +        * Validate address range filters: make sure hw supports the
> +        * requested configuration and number of filters; return 0 if the
> +        * supplied filters are valid, -errno otherwise.
> +        *
> +        * Runs in the context of the ioctl()ing process and is not serialized
> +        * with the rest of the pmu callbacks.
> +        */
> +       int (*addr_filters_validate)    (struct list_head *filters);
> +                                       /* optional */
> +
> +       /*
> +        * Synchronize address range filter configuration:
> +        * translate hw-agnostic filters into hardware configuration in
> +        * event::hw::addr_filters.
> +        *
> +        * Runs as a part of filter sync sequence that is done in ->start()
> +        * callback by calling perf_event_addr_filters_sync().
> +        *
> +        * May (and should) traverse event::addr_filters::list, for which its
> +        * caller provides necessary serialization.
> +        */
> +       void (*addr_filters_sync)       (struct perf_event *event);
> +                                       /* optional */
> +
> +       /*
>          * Filter events for PMU-specific reasons.
>          */
>         int (*filter_match)             (struct perf_event *event); /* optional */
>  };
>
>  /**
> + * struct perf_addr_filter - address range filter definition
> + * @entry:     event's filter list linkage
> + * @inode:     object file's inode for file-based filters
> + * @offset:    filter range offset
> + * @size:      filter range size
> + * @range:     1: range, 0: address
> + * @filter:    1: filter/start, 0: stop
> + *
> + * This is a hardware-agnostic filter configuration as specified by the user.
> + */
> +struct perf_addr_filter {
> +       struct list_head        entry;
> +       struct inode            *inode;
> +       unsigned long           offset;
> +       unsigned long           size;
> +       unsigned int            range   : 1,
> +                               filter  : 1;
> +};
> +
> +/**
> + * struct perf_addr_filters_head - container for address range filters
> + * @list:      list of filters for this event
> + * @lock:      spinlock that serializes accesses to the @list and event's
> + *             (and its children's) filter generations.
> + *
> + * A child event will use parent's @list (and therefore @lock), so they are
> + * bundled together; see perf_event_addr_filters().
> + */
> +struct perf_addr_filters_head {
> +       struct list_head        list;
> +       raw_spinlock_t          lock;
> +};
> +
> +/**
>   * enum perf_event_active_state - the states of a event
>   */
>  enum perf_event_active_state {
> @@ -566,6 +637,12 @@ struct perf_event {
>
>         atomic_t                        event_limit;
>
> +       /* address range filters */
> +       struct perf_addr_filters_head   addr_filters;
> +       /* vma address array for file-based filters */
> +       unsigned long                   *addr_filters_offs;
> +       unsigned long                   addr_filters_gen;
> +
>         void (*destroy)(struct perf_event *);
>         struct rcu_head                 rcu_head;
>
> @@ -679,6 +756,22 @@ struct perf_output_handle {
>         int                             page;
>  };
>
> +void perf_event_addr_filters_sync(struct perf_event *event);
> +
> +/*
> + * An inherited event uses parent's filters
> + */
> +static inline struct perf_addr_filters_head *
> +perf_event_addr_filters(struct perf_event *event)
> +{
> +       struct perf_addr_filters_head *ifh = &event->addr_filters;
> +
> +       if (event->parent)
> +               ifh = &event->parent->addr_filters;
> +
> +       return ifh;
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>
>  /*
> @@ -1066,6 +1159,11 @@ static inline bool is_write_backward(struct perf_event *event)
>         return !!event->attr.write_backward;
>  }
>
> +static inline bool has_addr_filter(struct perf_event *event)
> +{
> +       return event->pmu->nr_addr_filters;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>                              struct perf_event *event, unsigned int size);
>  extern int perf_output_begin_forward(struct perf_output_handle *handle,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6d335f3878..606398b62a 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -44,6 +44,8 @@
>  #include <linux/compat.h>
>  #include <linux/bpf.h>
>  #include <linux/filter.h>
> +#include <linux/namei.h>
> +#include <linux/parser.h>
>
>  #include "internal.h"
>
> @@ -2364,11 +2366,17 @@ void perf_event_enable(struct perf_event *event)
>  }
>  EXPORT_SYMBOL_GPL(perf_event_enable);
>
> +struct stop_event_data {
> +       struct perf_event       *event;
> +       unsigned int            restart;
> +};
> +
>  static int __perf_event_stop(void *info)
>  {
> -       struct perf_event *event = info;
> +       struct stop_event_data *sd = info;
> +       struct perf_event *event = sd->event;
>
> -       /* for AUX events, our job is done if the event is already inactive */
> +       /* if it's already INACTIVE, do nothing */
>         if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
>                 return 0;
>
> @@ -2384,9 +2392,86 @@ static int __perf_event_stop(void *info)
>
>         event->pmu->stop(event, PERF_EF_UPDATE);
>
> +       /*
> +        * May race with the actual stop (through perf_pmu_output_stop()),
> +        * but it is only used for events with AUX ring buffer, and such
> +        * events will refuse to restart because of rb::aux_mmap_count==0,
> +        * see comments in perf_aux_output_begin().
> +        *
> +        * Since this is happening on an event-local cpu, no trace is lost
> +        * while restarting.
> +        */
> +       if (sd->restart)
> +               event->pmu->start(event, PERF_EF_START);
> +
>         return 0;
>  }
>
> +static int perf_event_restart(struct perf_event *event)
> +{
> +       struct stop_event_data sd = {
> +               .event          = event,
> +               .restart        = 1,
> +       };
> +       int ret = 0;
> +
> +       do {
> +               if (READ_ONCE(event->state) != PERF_EVENT_STATE_ACTIVE)
> +                       return 0;
> +
> +               /* matches smp_wmb() in event_sched_in() */
> +               smp_rmb();
> +
> +               /*
> +                * We only want to restart ACTIVE events, so if the event goes
> +                * inactive here (event->oncpu==-1), there's nothing more to do;
> +                * fall through with ret==-ENXIO.
> +                */
> +               ret = cpu_function_call(READ_ONCE(event->oncpu),
> +                                       __perf_event_stop, &sd);
> +       } while (ret == -EAGAIN);
> +
> +       return ret;
> +}
> +
> +/*
> + * To contain the amount of raciness and trickiness in the address filter
> + * configuration management, it is a two-part process:
> + *
> + * (p1) when userspace mappings change as a result of (1) or (2) or (3) below,
> + *      we update the addresses of corresponding vmas in
> + *     event::addr_filters_offs array and bump the event::addr_filters_gen;
> + * (p2) when an event is scheduled in (pmu::add), it calls
> + *      perf_event_addr_filters_sync() which calls pmu::addr_filters_sync()
> + *      if the generation has changed since the previous call.
> + *
> + * If (p1) happens while the event is active, we restart it to force (p2).
> + *
> + * (1) perf_addr_filters_apply(): adjusting filters' offsets based on
> + *     pre-existing mappings, called once when new filters arrive via SET_FILTER
> + *     ioctl;
> + * (2) perf_addr_filters_adjust(): adjusting filters' offsets based on newly
> + *     registered mapping, called for every new mmap(), with mm::mmap_sem down
> + *     for reading;
> + * (3) perf_event_addr_filters_exec(): clearing filters' offsets in the process
> + *     of exec.
> + */
> +void perf_event_addr_filters_sync(struct perf_event *event)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       raw_spin_lock(&ifh->lock);
> +       if (event->addr_filters_gen != event->hw.addr_filters_gen) {
> +               event->pmu->addr_filters_sync(event);
> +               event->hw.addr_filters_gen = event->addr_filters_gen;
> +       }
> +       raw_spin_unlock(&ifh->lock);
> +}
> +EXPORT_SYMBOL_GPL(perf_event_addr_filters_sync);
> +
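
[ For context, the sync above is meant to be driven from the pmu's
->start() path, per the addr_filters_sync documentation earlier in the
patch; a hypothetical driver sketch:

  static void my_pmu_start(struct perf_event *event, int flags)
  {
          /* pick up any filter changes before enabling the hardware */
          perf_event_addr_filters_sync(event);

          /* ... program and enable the trace unit ... */
  }
]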
>  static int _perf_event_refresh(struct perf_event *event, int refresh)
>  {
>         /*
> @@ -3236,16 +3321,6 @@ out:
>                 put_ctx(clone_ctx);
>  }
>
> -void perf_event_exec(void)
> -{
> -       int ctxn;
> -
> -       rcu_read_lock();
> -       for_each_task_context_nr(ctxn)
> -               perf_event_enable_on_exec(ctxn);
> -       rcu_read_unlock();
> -}
> -
>  struct perf_read_data {
>         struct perf_event *event;
>         bool group;
> @@ -3757,6 +3832,9 @@ static bool exclusive_event_installable(struct perf_event *event,
>         return true;
>  }
>
> +static void perf_addr_filters_splice(struct perf_event *event,
> +                                      struct list_head *head);
> +
>  static void _free_event(struct perf_event *event)
>  {
>         irq_work_sync(&event->pending);
> @@ -3784,6 +3862,8 @@ static void _free_event(struct perf_event *event)
>         }
>
>         perf_event_free_bpf_prog(event);
> +       perf_addr_filters_splice(event, NULL);
> +       kfree(event->addr_filters_offs);
>
>         if (event->destroy)
>                 event->destroy(event);
> @@ -5855,6 +5935,57 @@ next:
>         rcu_read_unlock();
>  }
>
> +/*
> + * Clear all file-based filters at exec, they'll have to be
> + * re-instated when/if these objects are mmapped again.
> + */
> +static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +       struct perf_addr_filter *filter;
> +       unsigned int restart = 0, count = 0;
> +       unsigned long flags;
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       raw_spin_lock_irqsave(&ifh->lock, flags);
> +       list_for_each_entry(filter, &ifh->list, entry) {
> +               if (filter->inode) {
> +                       event->addr_filters_offs[count] = 0;
> +                       restart++;
> +               }
> +
> +               count++;
> +       }
> +
> +       if (restart)
> +               event->addr_filters_gen++;
> +       raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +       if (restart)
> +               perf_event_restart(event);
> +}
> +
> +void perf_event_exec(void)
> +{
> +       struct perf_event_context *ctx;
> +       int ctxn;
> +
> +       rcu_read_lock();
> +       for_each_task_context_nr(ctxn) {
> +               ctx = current->perf_event_ctxp[ctxn];
> +               if (!ctx)
> +                       continue;
> +
> +               perf_event_enable_on_exec(ctxn);
> +
> +               perf_event_aux_ctx(ctx, perf_event_addr_filters_exec, NULL,
> +                                  true);
> +       }
> +       rcu_read_unlock();
> +}
> +
>  struct remote_output {
>         struct ring_buffer      *rb;
>         int                     err;
> @@ -5865,6 +5996,9 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
>         struct perf_event *parent = event->parent;
>         struct remote_output *ro = data;
>         struct ring_buffer *rb = ro->rb;
> +       struct stop_event_data sd = {
> +               .event  = event,
> +       };
>
>         if (!has_aux(event))
>                 return;
> @@ -5877,7 +6011,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
>          * ring-buffer, but it will be the child that's actually using it:
>          */
>         if (rcu_dereference(parent->rb) == rb)
> -               ro->err = __perf_event_stop(event);
> +               ro->err = __perf_event_stop(&sd);
>  }
>
>  static int __perf_pmu_output_stop(void *info)
> @@ -6338,6 +6472,87 @@ got_name:
>         kfree(buf);
>  }
>
> +/*
> + * Whether this @filter depends on a dynamic object which is not loaded
> + * yet or its load addresses are not known.
> + */
> +static bool perf_addr_filter_needs_mmap(struct perf_addr_filter *filter)
> +{
> +       return filter->filter && filter->inode;
> +}
> +
> +/*
> + * Check whether inode and address range match filter criteria.
> + */
> +static bool perf_addr_filter_match(struct perf_addr_filter *filter,
> +                                    struct file *file, unsigned long offset,
> +                                    unsigned long size)
> +{
> +       if (filter->inode != file->f_inode)
> +               return false;
> +
> +       if (filter->offset > offset + size)
> +               return false;
> +
> +       if (filter->offset + filter->size < offset)
> +               return false;
> +
> +       return true;
> +}
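
[ Worked example with made-up numbers: a filter at file offset 0x1000
with size 0x400 matches a vma mapping file offsets [0x0, 0x2000), since
0x1000 <= 0x0 + 0x2000 and 0x1000 + 0x400 >= 0x0; a vma mapping
[0x2000, 0x3000) does not match, because 0x1000 + 0x400 < 0x2000. ]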
> +
> +static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +       struct vm_area_struct *vma = data;
> +       unsigned long off = vma->vm_pgoff << PAGE_SHIFT, flags;
> +       struct file *file = vma->vm_file;
> +       struct perf_addr_filter *filter;
> +       unsigned int restart = 0, count = 0;
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       if (!file)
> +               return;
> +
> +       raw_spin_lock_irqsave(&ifh->lock, flags);
> +       list_for_each_entry(filter, &ifh->list, entry) {
> +               if (perf_addr_filter_match(filter, file, off,
> +                                            vma->vm_end - vma->vm_start)) {
> +                       event->addr_filters_offs[count] = vma->vm_start;
> +                       restart++;
> +               }
> +
> +               count++;
> +       }
> +
> +       if (restart)
> +               event->addr_filters_gen++;
> +       raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +       if (restart)
> +               perf_event_restart(event);
> +}
> +
> +/*
> + * Adjust all of the task's events' filters to the new vma
> + */
> +static void perf_addr_filters_adjust(struct vm_area_struct *vma)
> +{
> +       struct perf_event_context *ctx;
> +       int ctxn;
> +
> +       rcu_read_lock();
> +       for_each_task_context_nr(ctxn) {
> +               ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> +               if (!ctx)
> +                       continue;
> +
> +               perf_event_aux_ctx(ctx, __perf_addr_filters_adjust, vma, true);
> +       }
> +       rcu_read_unlock();
> +}
> +
>  void perf_event_mmap(struct vm_area_struct *vma)
>  {
>         struct perf_mmap_event mmap_event;
> @@ -6369,6 +6584,7 @@ void perf_event_mmap(struct vm_area_struct *vma)
>                 /* .flags (attr_mmap2 only) */
>         };
>
> +       perf_addr_filters_adjust(vma);
>         perf_event_mmap_event(&mmap_event);
>  }
>
> @@ -7328,13 +7544,370 @@ void perf_bp_event(struct perf_event *bp, void *data)
>  }
>  #endif
>
> +/*
> + * Allocate a new address filter
> + */
> +static struct perf_addr_filter *
> +perf_addr_filter_new(struct perf_event *event, struct list_head *filters)
> +{
> +       int node = cpu_to_node(event->cpu == -1 ? 0 : event->cpu);
> +       struct perf_addr_filter *filter;
> +
> +       filter = kzalloc_node(sizeof(*filter), GFP_KERNEL, node);
> +       if (!filter)
> +               return NULL;
> +
> +       INIT_LIST_HEAD(&filter->entry);
> +       list_add_tail(&filter->entry, filters);
> +
> +       return filter;
> +}
> +
> +static void free_filters_list(struct list_head *filters)
> +{
> +       struct perf_addr_filter *filter, *iter;
> +
> +       list_for_each_entry_safe(filter, iter, filters, entry) {
> +               if (filter->inode)
> +                       iput(filter->inode);
> +               list_del(&filter->entry);
> +               kfree(filter);
> +       }
> +}
> +
> +/*
> + * Free existing address filters and optionally install new ones
> + */
> +static void perf_addr_filters_splice(struct perf_event *event,
> +                                    struct list_head *head)
> +{
> +       unsigned long flags;
> +       LIST_HEAD(list);
> +
> +       if (!has_addr_filter(event))
> +               return;
> +
> +       /* don't bother with children, they don't have their own filters */
> +       if (event->parent)
> +               return;
> +
> +       raw_spin_lock_irqsave(&event->addr_filters.lock, flags);
> +
> +       list_splice_init(&event->addr_filters.list, &list);
> +       if (head)
> +               list_splice(head, &event->addr_filters.list);
> +
> +       raw_spin_unlock_irqrestore(&event->addr_filters.lock, flags);
> +
> +       free_filters_list(&list);
> +}
> +
> +/*
> + * Scan through mm's vmas and see if one of them matches the
> + * @filter; if so, adjust filter's address range.
> + * Called with mm::mmap_sem down for reading.
> + */
> +static unsigned long perf_addr_filter_apply(struct perf_addr_filter *filter,
> +                                           struct mm_struct *mm)
> +{
> +       struct vm_area_struct *vma;
> +
> +       for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +               struct file *file = vma->vm_file;
> +               unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
> +               unsigned long vma_size = vma->vm_end - vma->vm_start;
> +
> +               if (!file)
> +                       continue;
> +
> +               if (!perf_addr_filter_match(filter, file, off, vma_size))
> +                       continue;
> +
> +               return vma->vm_start;
> +       }
> +
> +       return 0;
> +}
> +
> +/*
> + * Update event's address range filters based on the
> + * task's existing mappings, if any.
> + */
> +static void perf_event_addr_filters_apply(struct perf_event *event)
> +{
> +       struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
> +       struct task_struct *task = READ_ONCE(event->ctx->task);
> +       struct perf_addr_filter *filter;
> +       struct mm_struct *mm = NULL;
> +       unsigned int count = 0;
> +       unsigned long flags;
> +
> +       /*
> +        * We may observe TASK_TOMBSTONE, which means that the event tear-down
> +        * will stop on the parent's child_mutex that our caller is also holding
> +        */
> +       if (task == TASK_TOMBSTONE)
> +               return;
> +
> +       mm = get_task_mm(event->ctx->task);
> +       if (!mm)
> +               goto restart;
> +
> +       down_read(&mm->mmap_sem);
> +
> +       raw_spin_lock_irqsave(&ifh->lock, flags);
> +       list_for_each_entry(filter, &ifh->list, entry) {
> +               event->addr_filters_offs[count] = 0;
> +
> +               if (perf_addr_filter_needs_mmap(filter))
> +                       event->addr_filters_offs[count] =
> +                               perf_addr_filter_apply(filter, mm);
> +
> +               count++;
> +       }
> +
> +       event->addr_filters_gen++;
> +       raw_spin_unlock_irqrestore(&ifh->lock, flags);
> +
> +       up_read(&mm->mmap_sem);
> +
> +       mmput(mm);
> +
> +restart:
> +       perf_event_restart(event);
> +}
> +
> +/*
> + * Address range filtering: limiting the data to certain
> + * instruction address ranges. Filters are ioctl()ed to us from
> + * userspace as ascii strings.
> + *
> + * Filter string format:
> + *
> + * ACTION RANGE_SPEC
> + * where ACTION is one of the
> + *  * "filter": limit the trace to this region
> + *  * "start": start tracing from this address
> + *  * "stop": stop tracing at this address/region;
> + * RANGE_SPEC is
> + *  * for kernel addresses: <start address>[/<size>]
> + *  * for object files:     <start address>[/<size>]@</path/to/object/file>
> + *
> + * if <size> is not specified, the range is treated as a single address.
> + */
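
[ A few made-up example strings in this grammar (filters can be chained,
separated by spaces, commas or newlines, per the strsep() call below):

  filter 0x1000/0x400@/usr/lib/libfoo.so    (trace 0x400 bytes at offset
                                             0x1000 of libfoo.so)
  start 0xffffffff81000000                  (start tracing at a kernel
                                             address)
  stop 0xffffffff81000000/0x1000            (stop tracing anywhere in
                                             this kernel range)
]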
> +enum {
> +       IF_ACT_FILTER,
> +       IF_ACT_START,
> +       IF_ACT_STOP,
> +       IF_SRC_FILE,
> +       IF_SRC_KERNEL,
> +       IF_SRC_FILEADDR,
> +       IF_SRC_KERNELADDR,
> +};
> +
> +enum {
> +       IF_STATE_ACTION = 0,
> +       IF_STATE_SOURCE,
> +       IF_STATE_END,
> +};
> +
> +static const match_table_t if_tokens = {
> +       { IF_ACT_FILTER,        "filter" },
> +       { IF_ACT_START,         "start" },
> +       { IF_ACT_STOP,          "stop" },
> +       { IF_SRC_FILE,          "%u/%u@%s" },
> +       { IF_SRC_KERNEL,        "%u/%u" },
> +       { IF_SRC_FILEADDR,      "%u@%s" },
> +       { IF_SRC_KERNELADDR,    "%u" },
> +};
> +
> +/*
> + * Address filter string parser
> + */
> +static int
> +perf_event_parse_addr_filter(struct perf_event *event, char *fstr,
> +                            struct list_head *filters)
> +{
> +       struct perf_addr_filter *filter = NULL;
> +       char *start, *orig, *filename = NULL;
> +       struct path path;
> +       substring_t args[MAX_OPT_ARGS];
> +       int state = IF_STATE_ACTION, token;
> +       unsigned int kernel = 0;
> +       int ret = -EINVAL;
> +
> +       orig = fstr = kstrdup(fstr, GFP_KERNEL);
> +       if (!fstr)
> +               return -ENOMEM;
> +
> +       while ((start = strsep(&fstr, " ,\n")) != NULL) {
> +               ret = -EINVAL;
> +
> +               if (!*start)
> +                       continue;
> +
> +               /* filter definition begins */
> +               if (state == IF_STATE_ACTION) {
> +                       filter = perf_addr_filter_new(event, filters);
> +                       if (!filter)
> +                               goto fail;
> +               }
> +
> +               token = match_token(start, if_tokens, args);
> +               switch (token) {
> +               case IF_ACT_FILTER:
> +               case IF_ACT_START:
> +                       filter->filter = 1;
> +
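> +                       /* fall through */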
> +               case IF_ACT_STOP:
> +                       if (state != IF_STATE_ACTION)
> +                               goto fail;
> +
> +                       state = IF_STATE_SOURCE;
> +                       break;
> +
> +               case IF_SRC_KERNELADDR:
> +               case IF_SRC_KERNEL:
> +                       kernel = 1;
> +
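> +                       /* fall through */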
> +               case IF_SRC_FILEADDR:
> +               case IF_SRC_FILE:
> +                       if (state != IF_STATE_SOURCE)
> +                               goto fail;
> +
> +                       if (token == IF_SRC_FILE || token == IF_SRC_KERNEL)
> +                               filter->range = 1;
> +
> +                       *args[0].to = 0;
> +                       ret = kstrtoul(args[0].from, 0, &filter->offset);
> +                       if (ret)
> +                               goto fail;
> +
> +                       if (filter->range) {
> +                               *args[1].to = 0;
> +                               ret = kstrtoul(args[1].from, 0, &filter->size);
> +                               if (ret)
> +                                       goto fail;
> +                       }
> +
> +                       if (token == IF_SRC_FILE) {
> +                               filename = match_strdup(&args[2]);
> +                               if (!filename) {
> +                                       ret = -ENOMEM;
> +                                       goto fail;
> +                               }
> +                       }
> +
> +                       state = IF_STATE_END;
> +                       break;
> +
> +               default:
> +                       goto fail;
> +               }
> +
> +               /*
> +                * Filter definition is fully parsed, validate and install it.
> +                * Make sure that it doesn't contradict itself or the event's
> +                * attribute.
> +                */
> +               if (state == IF_STATE_END) {
> +                       if (kernel && event->attr.exclude_kernel)
> +                               goto fail;
> +
> +                       if (!kernel) {
> +                               if (!filename)
> +                                       goto fail;
> +
> +                               /* look up the path and grab its inode */
> +                               ret = kern_path(filename, LOOKUP_FOLLOW, &path);
> +                               if (ret)
> +                                       goto fail_free_name;
> +
> +                               filter->inode = igrab(d_inode(path.dentry));
> +                               path_put(&path);
> +                               kfree(filename);
> +                               filename = NULL;
> +
> +                               ret = -EINVAL;
> +                               if (!filter->inode ||
> +                                   !S_ISREG(filter->inode->i_mode))
> +                                       /* free_filters_list() will iput() */
> +                                       goto fail;
> +                       }
> +
> +                       /* ready to consume more filters */
> +                       state = IF_STATE_ACTION;
> +                       filter = NULL;
> +               }
> +       }
> +
> +       if (state != IF_STATE_ACTION)
> +               goto fail;
> +
> +       kfree(orig);
> +
> +       return 0;
> +
> +fail_free_name:
> +       kfree(filename);
> +fail:
> +       free_filters_list(filters);
> +       kfree(orig);
> +
> +       return ret;
> +}
> +
> +static int
> +perf_event_set_addr_filter(struct perf_event *event, char *filter_str)
> +{
> +       LIST_HEAD(filters);
> +       int ret;
> +
> +       /*
> +        * Since this is called in perf_ioctl() path, we're already holding
> +        * ctx::mutex.
> +        */
> +       lockdep_assert_held(&event->ctx->mutex);
> +
> +       if (WARN_ON_ONCE(event->parent))
> +               return -EINVAL;
> +
> +       /*
> +        * For now, we only support filtering in per-task events; doing so
> +        * for cpu-wide events requires additional context switching trickery,
> +        * since the same object code will be mapped at different virtual
> +        * addresses in different processes.
> +        */
> +       if (!event->ctx->task)
> +               return -EOPNOTSUPP;
> +
> +       ret = perf_event_parse_addr_filter(event, filter_str, &filters);
> +       if (ret)
> +               return ret;
> +
> +       ret = event->pmu->addr_filters_validate(&filters);
> +       if (ret) {
> +               free_filters_list(&filters);
> +               return ret;
> +       }
> +
> +       /* remove existing filters, if any */
> +       perf_addr_filters_splice(event, &filters);
> +
> +       /* install new filters */
> +       perf_event_for_each_child(event, perf_event_addr_filters_apply);
> +
> +       return ret;
> +}
> +
>  static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>  {
>         char *filter_str;
>         int ret = -EINVAL;
>
> -       if (event->attr.type != PERF_TYPE_TRACEPOINT ||
> -           !IS_ENABLED(CONFIG_EVENT_TRACING))
> +       if ((event->attr.type != PERF_TYPE_TRACEPOINT ||
> +           !IS_ENABLED(CONFIG_EVENT_TRACING)) &&
> +           !has_addr_filter(event))
>                 return -EINVAL;
>
>         filter_str = strndup_user(arg, PAGE_SIZE);
> @@ -7345,6 +7918,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
>             event->attr.type == PERF_TYPE_TRACEPOINT)
>                 ret = ftrace_profile_set_filter(event, event->attr.config,
>                                                 filter_str);
> +       else if (has_addr_filter(event))
> +               ret = perf_event_set_addr_filter(event, filter_str);
>
>         kfree(filter_str);
>         return ret;
> @@ -8139,6 +8714,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         INIT_LIST_HEAD(&event->sibling_list);
>         INIT_LIST_HEAD(&event->rb_entry);
>         INIT_LIST_HEAD(&event->active_entry);
> +       INIT_LIST_HEAD(&event->addr_filters.list);
>         INIT_HLIST_NODE(&event->hlist_entry);
>
>
> @@ -8146,6 +8722,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         init_irq_work(&event->pending, perf_pending_event);
>
>         mutex_init(&event->mmap_mutex);
> +       raw_spin_lock_init(&event->addr_filters.lock);
>
>         atomic_long_set(&event->refcount, 1);
>         event->cpu              = cpu;
> @@ -8230,11 +8807,22 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         if (err)
>                 goto err_pmu;
>
> +       if (has_addr_filter(event)) {
> +               event->addr_filters_offs = kcalloc(pmu->nr_addr_filters,
> +                                                  sizeof(unsigned long),
> +                                                  GFP_KERNEL);
> +               if (!event->addr_filters_offs)
> +                       goto err_per_task;
> +
> +               /* force hw sync on the address filters */
> +               event->addr_filters_gen = 1;
> +       }
> +
>         if (!event->parent) {
>                 if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
>                         err = get_callchain_buffers();
>                         if (err)
> -                               goto err_per_task;
> +                               goto err_addr_filters;
>                 }
>         }
>
> @@ -8243,6 +8831,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>
>         return event;
>
> +err_addr_filters:
> +       kfree(event->addr_filters_offs);
> +
>  err_per_task:
>         exclusive_event_destroy(event);

I see two things in this work:

1) A framework to deal with filters described in user space.
2) An implementation for address filtering that will work for both
Intel and ARM.

This will work well for address filtering (for both PT and CS), but
what happens when we want to introduce new filters?  That is
inevitable, and some filters will be architecture agnostic while
others will be architecture specific.

To me the above is well done and I can work with it, but the framework
for dealing with general filters and the address filtering
functionality need to be decoupled from the beginning.  80% of the
above can be reused with a simple name change (dropping "addr").  The
rest, like the parser and how to make "if_tokens" expandable, will be
harder to deal with.
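
[ To make the decoupling concrete, the generic container might look
something like this; names are purely illustrative:

  struct perf_filter {
          struct list_head        entry;
          unsigned int            type;   /* e.g. PERF_FILTER_ADDR, ... */
          void                    *spec;  /* type-specific definition */
  };
]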

Thanks,
Mathieu

>
> --
> 2.8.0.rc3
>
