linux-kernel - Re: [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJpZYjVPfWznVm2Zcvk77Np-vfKKNiWH2ipB58QCJ4ZQ6h_afw@mail.gmail.com>
Date: Tue, 11 Feb 2025 15:26:04 -0800
From: Chun-Tse Shao <ctshao@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: linux-kernel@...r.kernel.org, peterz@...radead.org, mingo@...hat.com, 
	acme@...nel.org, mark.rutland@....com, alexander.shishkin@...ux.intel.com, 
	jolsa@...nel.org, irogers@...gle.com, adrian.hunter@...el.com, 
	kan.liang@...ux.intel.com, linux-perf-users@...r.kernel.org, 
	bpf@...r.kernel.org
Subject: Re: [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program

Hi Namhyung, thanks for your review first!


On Fri, Jan 31, 2025 at 3:48 PM Namhyung Kim <namhyung@...nel.org> wrote:
>
> On Wed, Jan 29, 2025 at 09:21:36PM -0800, Chun-Tse Shao wrote:
> > Tracks lock contention by tracing owner callstacks in
> > `contention_begin()` and `contention_end()`, storing data in the
> > owner_stat BPF map. `contention_begin()` records the owner and their
> > callstack. `contention_end()` updates contention statistics (count,
> > time), decrements the waiter count, and removes the record when no
> > waiters remain. Statistics are also updated if the owner's callstack
> > changes. Note that owner and its callstack retrieval may fail.
> >
> > To elaborate the process in detail:
> >   /*
> >    * In `contention_begin(), the current task is the lock waiter`. We
> >    * create/update `owner_data` for the given `lock` address.
> >   contention_begin() {
> >     Try to get owner task. Skip entire process if fails.
> >     Try to get owner stack based on task. Use empty stack if fails.
> >     Store owner stack into `owner_stacks` and create `stack_id`. If fail
> >       to store, use negative `stack_id`, which will be ignored while
> >       reporting in usermode.
> >     Retrieve `owner_tracing_data` in `owner_data` with given `lock`
> >       address.
> >
> >     /*
> >      * The first case means contention just happens, or mismatched owner
> >      * infomation so we just drop the previous record.
> >      */
> >     if (`owner_tracing_data` does not exist ||
> >         the recorded owner `pid` does not match with the detected owner) {
> >       Create `owner_tracing_data` with info from detected owner, and
> >         store it in `owner_data` with key `lock` address.
> >     }
> >     /*
> >      * The second case means contention is on going. One more waiter is
> >      * joining the lock contention. Both `owner_data` and `owner_stat`
> >      * should be updated.
> >      */
> >     else {
> >       `owner_tracing_data.count`++
> >
> >       Create `contention_key` with owner `stack_id` and lookup
> >         `contention_data` in `owner_stat`.
> >       if (`contention_data` does not exist) {
> >         Create new entry for `contention_key`:`contention_data` in
> >           `owner_stat`.
> >       }
> >       else {
> >         Update the `count` and `total_time` in existing
> >         `contention_data`.
> >       }
> >
> >       Update `timestamp` and `stack_id` in `owner_tracing_data`.
> >     }
> >   }
> >
> >   /*
> >    * In `contention_end()`, the current task will be the new owner of
> >    * the `lock`, if `ctx[1]` is not 0.
> >    */
> >   contention_end() {
> >     Lookup `owner_tracing_data` in `owner_data` with key `lock`.
> >
> >     Create `contention_key` with `owner_tracing_data.stack_id` and
> >       lookup `contention_data` in `owner_stat`.
> >     if (`contention_data` does not exist) {
> >       Create new entry for `contention_key`:`contention_data` in
> >         `owner_stat`.
> >     }
> >     else {
> >       Update the `count` and `total_time` in existing `contention_data`.
> >     }
> >
> >     /*
> >      * There is no more waiters, contention is over, delete the record.
> >      */
> >     if (`owner_tracing_data.count` <= 1) {
> >       delete this record in `owner_data`.
> >     }
> >     /*
> >      * Contention is still on going.
> >      */
> >     else {
> >       `owner_tracing_data.count`--
> >
> >       if (`!ctx[0]`) {
> >         The current task exits without acquiring the lock. However we
> >           check for the recorded owner if the stack is changed, and
> >           update `onwer_data` and `owner_stat` accordingly.
> >       }
> >       else {
> >         The current task is the new owner, retrieve its stack, store it
> >           in `owner_stack` and update `owner_tracing_data`.
> >       }
> >     }
> >   }
>
> I think this is too much detail. :)
>
> I'd say something like this:
>
> This implements per-callstack aggregation of lock owners in addition to
> per-thread.  The owner callstack is captured using bpf_get_task_stack()
> at contention_begin and it also adds a custom stackid function for the
> owner stacks to be compared easily.
>
> The owner info is kept in a hash map using lock addr as a key to handle
> multiple waiters for the same lock.  At contention_end, it updates the
> owner lock stat based on the info that was saved at contention_begin.
> If there are more waiters, it'd update the owner pid to itself as
> contention_end means it gets the lock now.  But it also needs to check
> the return value of the lock function in case task was killed by a signal
> or something.
>

Thanks, I will just reuse this description. :D

> >
> > Signed-off-by: Chun-Tse Shao <ctshao@...gle.com>
> > ---
> >  .../perf/util/bpf_skel/lock_contention.bpf.c  | 248 +++++++++++++++++-
> >  1 file changed, 247 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > index 23fe9cc980ae..b12df873379f 100644
> > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > @@ -197,6 +197,9 @@ int data_fail;
> >  int task_map_full;
> >  int data_map_full;
> >
> > +struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
> > +void bpf_task_release(struct task_struct *p) __ksym __weak;
> > +
> >  static inline __u64 get_current_cgroup_id(void)
> >  {
> >       struct task_struct *task;
> > @@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
> >       return pelem;
> >  }
> >
> > +static inline s32 get_owner_stack_id(u64 *stacktrace)
> > +{
> > +     s32 *id, new_id;
> > +     static s64 id_gen = 1;
> > +
> > +     id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > +     if (id)
> > +             return *id;
> > +
> > +     new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
> > +
> > +     bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
> > +
> > +     id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > +     if (id)
> > +             return *id;
> > +
> > +     return -1;
> > +}
> > +
> >  SEC("tp_btf/contention_begin")
> >  int contention_begin(u64 *ctx)
> >  {
> > @@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
> >       pelem->flags = (__u32)ctx[1];
> >
> >       if (needs_callstack) {
> > +             u32 i = 0;
> > +             u32 id = 0;
> > +             int owner_pid;
> > +             u64 *buf;
> > +             struct task_struct *task;
> > +             struct owner_tracing_data *otdata;
> > +
> > +             if (!lock_owner)
> > +                     goto skip_owner_begin;
> > +
> > +             task = get_lock_owner(pelem->lock, pelem->flags);
> > +             if (!task)
> > +                     goto skip_owner_begin;
> > +
> > +             owner_pid = BPF_CORE_READ(task, pid);
> > +
> > +             buf = bpf_map_lookup_elem(&stack_buf, &i);
> > +             if (!buf)
> > +                     goto skip_owner_begin;
> > +             for (i = 0; i < max_stack; i++)
> > +                     buf[i] = 0x0;
> > +
> > +             if (bpf_task_from_pid) {
>
> I think you can do this instead:
>
>                 if (bpf_task_from_pid == NULL)
>                         goto skip_owner_begin;
>
> nit: it can be just 'skip_owner'. :)
>
>
> > +                     task = bpf_task_from_pid(owner_pid);
> > +                     if (task) {
> > +                             bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> > +                             bpf_task_release(task);
> > +                     }
> > +             }
> > +
> > +             otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > +             id = get_owner_stack_id(buf);
> > +
> > +             /*
> > +              * Contention just happens, or corner case `lock` is owned by process not
> > +              * `owner_pid`. For the corner case we treat it as unexpected internal error and
> > +              * just ignore the precvious tracing record.
> > +              */
> > +             if (!otdata || otdata->pid != owner_pid) {
> > +                     struct owner_tracing_data first = {
> > +                             .pid = owner_pid,
> > +                             .timestamp = pelem->timestamp,
> > +                             .count = 1,
> > +                             .stack_id = id,
> > +                     };
> > +                     bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> > +             }
> > +             /* Contention is ongoing and new waiter joins */
> > +             else {
> > +                     __sync_fetch_and_add(&otdata->count, 1);
> > +
> > +                     /*
> > +                      * The owner is the same, but stacktrace might be changed. In this case we
> > +                      * store/update `owner_stat` based on current owner stack id.
> > +                      */
> > +                     if (id != otdata->stack_id) {
> > +                             u64 duration = otdata->timestamp - pelem->timestamp;
>
> Isn't it the opposite?
>
>         u64 duration = pelem->timestamp - otdata->timestamp;
>
>
> > +                             struct contention_key ckey = {
> > +                                     .stack_id = id,
> > +                                     .pid = 0,
> > +                                     .lock_addr_or_cgroup = 0,
> > +                             };
> > +                             struct contention_data *cdata =
> > +                                     bpf_map_lookup_elem(&owner_stat, &ckey);
> > +
> > +                             if (!cdata) {
> > +                                     struct contention_data first = {
> > +                                             .total_time = duration,
> > +                                             .max_time = duration,
> > +                                             .min_time = duration,
> > +                                             .count = 1,
> > +                                             .flags = pelem->flags,
> > +                                     };
> > +                                     bpf_map_update_elem(&owner_stat, &ckey, &first,
> > +                                                         BPF_NOEXIST);
> > +                             } else {
> > +                                     __sync_fetch_and_add(&cdata->total_time, duration);
> > +                                     __sync_fetch_and_add(&cdata->count, 1);
> > +
> > +                                     /* FIXME: need atomic operations */
> > +                                     if (cdata->max_time < duration)
> > +                                             cdata->max_time = duration;
> > +                                     if (cdata->min_time > duration)
> > +                                             cdata->min_time = duration;
> > +                             }
>
> And as I said before, can we move this block out as a function?
>
> > +
> > +                             otdata->timestamp = pelem->timestamp;
> > +                             otdata->stack_id = id;
> > +                     }
> > +             }
> > +skip_owner_begin:
> >               pelem->stack_id = bpf_get_stackid(ctx, &stacks,
> >                                                 BPF_F_FAST_STACK_CMP | stack_skip);
> >               if (pelem->stack_id < 0)
> > @@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
> >       struct tstamp_data *pelem;
> >       struct contention_key key = {};
> >       struct contention_data *data;
> > +     __u64 timestamp;
> >       __u64 duration;
> >       bool need_delete = false;
> >
> > @@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
> >               need_delete = true;
> >       }
> >
> > -     duration = bpf_ktime_get_ns() - pelem->timestamp;
> > +     timestamp = bpf_ktime_get_ns();
> > +     duration = timestamp - pelem->timestamp;
> >       if ((__s64)duration < 0) {
> >               __sync_fetch_and_add(&time_fail, 1);
> >               goto out;
> >       }
> >
> > +     if (needs_callstack && lock_owner) {
> > +             u64 owner_time;
> > +             struct contention_key ckey = {};
> > +             struct contention_data *cdata;
> > +             struct owner_tracing_data *otdata;
> > +
> > +             otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > +             if (!otdata)
> > +                     goto skip_owner_end;
> > +
> > +             /* Update `owner_stat` */
> > +             owner_time = timestamp - otdata->timestamp;
> > +             ckey.stack_id = otdata->stack_id;
> > +             cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> > +
> > +             if (!cdata) {
> > +                     struct contention_data first = {
> > +                             .total_time = owner_time,
> > +                             .max_time = owner_time,
> > +                             .min_time = owner_time,
> > +                             .count = 1,
> > +                             .flags = pelem->flags,
> > +                     };
> > +                     bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> > +             } else {
> > +                     __sync_fetch_and_add(&cdata->total_time, owner_time);
> > +                     __sync_fetch_and_add(&cdata->count, 1);
> > +
> > +                     /* FIXME: need atomic operations */
> > +                     if (cdata->max_time < owner_time)
> > +                             cdata->max_time = owner_time;
> > +                     if (cdata->min_time > owner_time)
> > +                             cdata->min_time = owner_time;
> > +             }
> > +
> > +             /* No contention is occurring, delete `lock` entry in `owner_data` */
> > +             if (otdata->count <= 1)
> > +                     bpf_map_delete_elem(&owner_data, &pelem->lock);
> > +             /*
> > +              * Contention is still ongoing, with a new owner (current task). `owner_data`
> > +              * should be updated accordingly.
> > +              */
> > +             else {
> > +                     u32 i = 0;
> > +                     u64 *buf;
> > +
> > +                     __sync_fetch_and_add(&otdata->count, -1);
> > +
> > +                     buf = bpf_map_lookup_elem(&stack_buf, &i);
> > +                     if (!buf)
> > +                             goto skip_owner_end;
> > +                     for (i = 0; i < (u32)max_stack; i++)
> > +                             buf[i] = 0x0;
> > +
> > +                     /*
> > +                      * ctx[1] has the return code of the lock function.
>
> Then I think it's clearer to have a local variable named 'ret' or so.
>
>
> > +                      * If ctx[1] is not 0, the current task terminates lock waiting without
> > +                      * acquiring it. Owner is not changed, but we still need to update the owner
> > +                      * stack.
> > +                      */
> > +                     if (!ctx[1]) {
>
> This doesn't match to the comment.  It should be:
>
>                         if (ret < 0) {
>
>
> > +                             s32 id = 0;
> > +                             struct task_struct *task = NULL;
> > +
> > +                             if (bpf_task_from_pid)
>
> Same as the above.  No need to go down if you cannot get the task and
> stack.
>
>                                 if (bpf_task_from_pid == NULL)
>                                         goto skip_owner;
>
>
> > +                                     task = bpf_task_from_pid(otdata->pid);
> > +
> > +                             if (task) {
> > +                                     bpf_get_task_stack(task, buf,
> > +                                                        max_stack * sizeof(unsigned long), 0);
> > +                                     bpf_task_release(task);
> > +                             }
> > +
> > +                             id = get_owner_stack_id(buf);
> > +
> > +                             /*
> > +                              * If owner stack is changed, update `owner_data` and `owner_stat`
> > +                              * accordingly.
> > +                              */
> > +                             if (id != otdata->stack_id) {
> > +                                     u64 duration = otdata->timestamp - pelem->timestamp;
> > +                                     struct contention_key ckey = {
> > +                                             .stack_id = id,
> > +                                             .pid = 0,
> > +                                             .lock_addr_or_cgroup = 0,
> > +                                     };
> > +                                     struct contention_data *cdata =
> > +                                             bpf_map_lookup_elem(&owner_stat, &ckey);
> > +
> > +                                     if (!cdata) {
> > +                                             struct contention_data first = {
> > +                                                     .total_time = duration,
> > +                                                     .max_time = duration,
> > +                                                     .min_time = duration,
> > +                                                     .count = 1,
> > +                                                     .flags = pelem->flags,
> > +                                             };
> > +                                             bpf_map_update_elem(&owner_stat, &ckey, &first,
> > +                                                                 BPF_NOEXIST);
> > +                                     } else {
> > +                                             __sync_fetch_and_add(&cdata->total_time, duration);
> > +                                             __sync_fetch_and_add(&cdata->count, 1);
> > +
> > +                                             /* FIXME: need atomic operations */
> > +                                             if (cdata->max_time < duration)
> > +                                                     cdata->max_time = duration;
> > +                                             if (cdata->min_time > duration)
> > +                                                     cdata->min_time = duration;
> > +                                     }
> > +
> > +                                     otdata->timestamp = pelem->timestamp;
> > +                                     otdata->stack_id = id;
> > +                             }
> > +                     }
> > +                     /*
> > +                      * If ctx[1] is 0, then update tracinng data with the current task, which is
> > +                      * the new owner.
> > +                      */
> > +                     else {
> > +                             otdata->pid = pid;
> > +                             otdata->timestamp = timestamp;
> > +
> > +                             bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> > +                                                max_stack * sizeof(unsigned long), 0);
>
> This would be meaningless since it's still in the contention path.
> Current callstack will be the same as the waiter callstack.  You'd
> better just invalidate callstack here and let the next waiter update
> it.

I wonder why this is meaningless. In this situation, the lock owner is
transferred to the current task, and there is at least one more
waiter, the contention is still ongoing. `otdata` is for tracing the
lock owner, so it should be correctly updated with the new owner,
which is the current task.
>
> Thanks,
> Namhyung
>
>
> > +                             otdata->stack_id = get_owner_stack_id(buf);
> > +                     }
> > +             }
> > +     }
> > +skip_owner_end:
> > +
> >       switch (aggr_mode) {
> >       case LOCK_AGGR_CALLER:
> >               key.stack_id = pelem->stack_id;
> > --
> > 2.48.1.362.g079036d154-goog
> >