Message-ID: <Z6v9Up1SRto4N4b7@google.com>
Date: Tue, 11 Feb 2025 17:45:54 -0800
From: Namhyung Kim <namhyung@...nel.org>
To: Chun-Tse Shao <ctshao@...gle.com>
Cc: linux-kernel@...r.kernel.org, peterz@...radead.org, mingo@...hat.com,
acme@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
irogers@...gle.com, adrian.hunter@...el.com,
kan.liang@...ux.intel.com, linux-perf-users@...r.kernel.org,
bpf@...r.kernel.org
Subject: Re: [PATCH v4 2/5] perf lock: Retrieve owner callstack in bpf program
On Tue, Feb 11, 2025 at 03:26:04PM -0800, Chun-Tse Shao wrote:
> Hi Namhyung, thanks for your review first!
>
>
> On Fri, Jan 31, 2025 at 3:48 PM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > On Wed, Jan 29, 2025 at 09:21:36PM -0800, Chun-Tse Shao wrote:
> > > Tracks lock contention by tracing owner callstacks in
> > > `contention_begin()` and `contention_end()`, storing data in the
> > > owner_stat BPF map. `contention_begin()` records the owner and their
> > > callstack. `contention_end()` updates contention statistics (count,
> > > time), decrements the waiter count, and removes the record when no
> > > waiters remain. Statistics are also updated if the owner's callstack
> > > changes. Note that retrieving the owner and its callstack may fail.
> > >
> > > To elaborate the process in detail:
> > > /*
> > > * In `contention_begin()`, the current task is the lock waiter. We
> > > * create/update `owner_data` for the given `lock` address.
> > > */
> > > contention_begin() {
> > > Try to get the owner task. Skip the entire process if this fails.
> > > Try to get the owner stack based on the task. Use an empty stack if
> > > this fails.
> > > Store the owner stack into `owner_stacks` and create a `stack_id`. If
> > > storing fails, use a negative `stack_id`, which will be ignored when
> > > reporting in usermode.
> > > Retrieve `owner_tracing_data` in `owner_data` with given `lock`
> > > address.
> > >
> > > /*
> > > * The first case means contention has just begun, or the owner
> > > * information is mismatched, so we just drop the previous record.
> > > */
> > > if (`owner_tracing_data` does not exist ||
> > > the recorded owner `pid` does not match with the detected owner) {
> > > Create `owner_tracing_data` with info from detected owner, and
> > > store it in `owner_data` with key `lock` address.
> > > }
> > > /*
> > > * The second case means contention is ongoing. One more waiter is
> > > * joining the lock contention. Both `owner_data` and `owner_stat`
> > > * should be updated.
> > > */
> > > else {
> > > `owner_tracing_data.count`++
> > >
> > > Create `contention_key` with owner `stack_id` and lookup
> > > `contention_data` in `owner_stat`.
> > > if (`contention_data` does not exist) {
> > > Create new entry for `contention_key`:`contention_data` in
> > > `owner_stat`.
> > > }
> > > else {
> > > Update the `count` and `total_time` in existing
> > > `contention_data`.
> > > }
> > >
> > > Update `timestamp` and `stack_id` in `owner_tracing_data`.
> > > }
> > > }
> > >
> > > /*
> > > * In `contention_end()`, the current task will be the new owner of
> > > * the `lock`, if `ctx[1]` is not 0.
> > > */
> > > contention_end() {
> > > Lookup `owner_tracing_data` in `owner_data` with key `lock`.
> > >
> > > Create `contention_key` with `owner_tracing_data.stack_id` and
> > > lookup `contention_data` in `owner_stat`.
> > > if (`contention_data` does not exist) {
> > > Create new entry for `contention_key`:`contention_data` in
> > > `owner_stat`.
> > > }
> > > else {
> > > Update the `count` and `total_time` in existing `contention_data`.
> > > }
> > >
> > > /*
> > > * There are no more waiters; the contention is over, so delete the record.
> > > */
> > > if (`owner_tracing_data.count` <= 1) {
> > > delete this record in `owner_data`.
> > > }
> > > /*
> > > * Contention is still ongoing.
> > > */
> > > else {
> > > `owner_tracing_data.count`--
> > >
> > > if (`!ctx[1]`) {
> > > The current task exits without acquiring the lock. However, we
> > > check whether the recorded owner's stack has changed, and
> > > update `owner_data` and `owner_stat` accordingly.
> > > }
> > > else {
> > > The current task is the new owner, retrieve its stack, store it
> > > in `owner_stack` and update `owner_tracing_data`.
> > > }
> > > }
> > > }
> >
> > I think this is too much detail. :)
> >
> > I'd say something like this:
> >
> > This implements per-callstack aggregation of lock owners in addition to
> > per-thread. The owner callstack is captured using bpf_get_task_stack()
> > at contention_begin and it also adds a custom stackid function for the
> > owner stacks to be compared easily.
> >
> > The owner info is kept in a hash map using lock addr as a key to handle
> > multiple waiters for the same lock. At contention_end, it updates the
> > owner lock stat based on the info that was saved at contention_begin.
> > If there are more waiters, it'd update the owner pid to itself, as
> > contention_end means it has acquired the lock now. But it also needs to
> > check the return value of the lock function in case the task was killed
> > by a signal or something.
> >
>
> Thanks, I will just reuse this description. :D
>
> > >
> > > Signed-off-by: Chun-Tse Shao <ctshao@...gle.com>
> > > ---
> > > .../perf/util/bpf_skel/lock_contention.bpf.c | 248 +++++++++++++++++-
> > > 1 file changed, 247 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > index 23fe9cc980ae..b12df873379f 100644
> > > --- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > +++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
> > > @@ -197,6 +197,9 @@ int data_fail;
> > > int task_map_full;
> > > int data_map_full;
> > >
> > > +struct task_struct *bpf_task_from_pid(s32 pid) __ksym __weak;
> > > +void bpf_task_release(struct task_struct *p) __ksym __weak;
> > > +
> > > static inline __u64 get_current_cgroup_id(void)
> > > {
> > > struct task_struct *task;
> > > @@ -420,6 +423,26 @@ static inline struct tstamp_data *get_tstamp_elem(__u32 flags)
> > > return pelem;
> > > }
> > >
> > > +static inline s32 get_owner_stack_id(u64 *stacktrace)
> > > +{
> > > + s32 *id, new_id;
> > > + static s64 id_gen = 1;
> > > +
> > > + id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > > + if (id)
> > > + return *id;
> > > +
> > > + new_id = (s32)__sync_fetch_and_add(&id_gen, 1);
> > > +
> > > + bpf_map_update_elem(&owner_stacks, stacktrace, &new_id, BPF_NOEXIST);
> > > +
> > > + id = bpf_map_lookup_elem(&owner_stacks, stacktrace);
> > > + if (id)
> > > + return *id;
> > > +
> > > + return -1;
> > > +}
> > > +
> > > SEC("tp_btf/contention_begin")
> > > int contention_begin(u64 *ctx)
> > > {
> > > @@ -437,6 +460,97 @@ int contention_begin(u64 *ctx)
> > > pelem->flags = (__u32)ctx[1];
> > >
> > > if (needs_callstack) {
> > > + u32 i = 0;
> > > + u32 id = 0;
> > > + int owner_pid;
> > > + u64 *buf;
> > > + struct task_struct *task;
> > > + struct owner_tracing_data *otdata;
> > > +
> > > + if (!lock_owner)
> > > + goto skip_owner_begin;
> > > +
> > > + task = get_lock_owner(pelem->lock, pelem->flags);
> > > + if (!task)
> > > + goto skip_owner_begin;
> > > +
> > > + owner_pid = BPF_CORE_READ(task, pid);
> > > +
> > > + buf = bpf_map_lookup_elem(&stack_buf, &i);
> > > + if (!buf)
> > > + goto skip_owner_begin;
> > > + for (i = 0; i < max_stack; i++)
> > > + buf[i] = 0x0;
> > > +
> > > + if (bpf_task_from_pid) {
> >
> > I think you can do this instead:
> >
> > if (bpf_task_from_pid == NULL)
> > goto skip_owner_begin;
> >
> > nit: it can be just 'skip_owner'. :)
> >
> >
> > > + task = bpf_task_from_pid(owner_pid);
> > > + if (task) {
> > > + bpf_get_task_stack(task, buf, max_stack * sizeof(unsigned long), 0);
> > > + bpf_task_release(task);
> > > + }
> > > + }
> > > +
> > > + otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > > + id = get_owner_stack_id(buf);
> > > +
> > > + /*
> > > + * Contention has just begun, or, as a corner case, `lock` is owned by a
> > > + * process other than `owner_pid`. For the corner case we treat it as an
> > > + * unexpected internal error and just ignore the previous tracing record.
> > > + */
> > > + if (!otdata || otdata->pid != owner_pid) {
> > > + struct owner_tracing_data first = {
> > > + .pid = owner_pid,
> > > + .timestamp = pelem->timestamp,
> > > + .count = 1,
> > > + .stack_id = id,
> > > + };
> > > + bpf_map_update_elem(&owner_data, &pelem->lock, &first, BPF_ANY);
> > > + }
> > > + /* Contention is ongoing and new waiter joins */
> > > + else {
> > > + __sync_fetch_and_add(&otdata->count, 1);
> > > +
> > > + /*
> > > + * The owner is the same, but stacktrace might be changed. In this case we
> > > + * store/update `owner_stat` based on current owner stack id.
> > > + */
> > > + if (id != otdata->stack_id) {
> > > + u64 duration = otdata->timestamp - pelem->timestamp;
> >
> > Isn't it the opposite?
> >
> > u64 duration = pelem->timestamp - otdata->timestamp;
> >
> >
> > > + struct contention_key ckey = {
> > > + .stack_id = id,
> > > + .pid = 0,
> > > + .lock_addr_or_cgroup = 0,
> > > + };
> > > + struct contention_data *cdata =
> > > + bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > + if (!cdata) {
> > > + struct contention_data first = {
> > > + .total_time = duration,
> > > + .max_time = duration,
> > > + .min_time = duration,
> > > + .count = 1,
> > > + .flags = pelem->flags,
> > > + };
> > > + bpf_map_update_elem(&owner_stat, &ckey, &first,
> > > + BPF_NOEXIST);
> > > + } else {
> > > + __sync_fetch_and_add(&cdata->total_time, duration);
> > > + __sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > + /* FIXME: need atomic operations */
> > > + if (cdata->max_time < duration)
> > > + cdata->max_time = duration;
> > > + if (cdata->min_time > duration)
> > > + cdata->min_time = duration;
> > > + }
> >
> > And as I said before, can we move this block out as a function?
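
To illustrate the suggestion with a plain-C sketch (outside BPF verifier constraints; the helper name `update_contention_stat` and the standalone struct definition are mine, with field names taken from the patch), the stat-update arithmetic duplicated in the three call sites could be factored out like this:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/* Only the fields the duplicated block touches, per the patch. */
struct contention_data {
	u64 total_time;
	u64 min_time;
	u64 max_time;
	u32 count;
	u32 flags;
};

/*
 * Hypothetical helper: in the BPF program the call sites would still do
 * the bpf_map_lookup_elem()/bpf_map_update_elem() handling themselves,
 * but the update arithmetic repeated in all three places reduces to this.
 */
static void update_contention_stat(struct contention_data *cdata, u64 duration)
{
	__sync_fetch_and_add(&cdata->total_time, duration);
	__sync_fetch_and_add(&cdata->count, 1);

	/* FIXME: need atomic operations (carried over from the patch) */
	if (cdata->max_time < duration)
		cdata->max_time = duration;
	if (cdata->min_time > duration)
		cdata->min_time = duration;
}
```

The `!cdata` path that inserts a fresh entry with BPF_NOEXIST would stay at the call sites (or become a second small helper), since it needs the map and key.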
> >
> > > +
> > > + otdata->timestamp = pelem->timestamp;
> > > + otdata->stack_id = id;
> > > + }
> > > + }
> > > +skip_owner_begin:
> > > pelem->stack_id = bpf_get_stackid(ctx, &stacks,
> > > BPF_F_FAST_STACK_CMP | stack_skip);
> > > if (pelem->stack_id < 0)
> > > @@ -473,6 +587,7 @@ int contention_end(u64 *ctx)
> > > struct tstamp_data *pelem;
> > > struct contention_key key = {};
> > > struct contention_data *data;
> > > + __u64 timestamp;
> > > __u64 duration;
> > > bool need_delete = false;
> > >
> > > @@ -500,12 +615,143 @@ int contention_end(u64 *ctx)
> > > need_delete = true;
> > > }
> > >
> > > - duration = bpf_ktime_get_ns() - pelem->timestamp;
> > > + timestamp = bpf_ktime_get_ns();
> > > + duration = timestamp - pelem->timestamp;
> > > if ((__s64)duration < 0) {
> > > __sync_fetch_and_add(&time_fail, 1);
> > > goto out;
> > > }
> > >
> > > + if (needs_callstack && lock_owner) {
> > > + u64 owner_time;
> > > + struct contention_key ckey = {};
> > > + struct contention_data *cdata;
> > > + struct owner_tracing_data *otdata;
> > > +
> > > + otdata = bpf_map_lookup_elem(&owner_data, &pelem->lock);
> > > + if (!otdata)
> > > + goto skip_owner_end;
> > > +
> > > + /* Update `owner_stat` */
> > > + owner_time = timestamp - otdata->timestamp;
> > > + ckey.stack_id = otdata->stack_id;
> > > + cdata = bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > + if (!cdata) {
> > > + struct contention_data first = {
> > > + .total_time = owner_time,
> > > + .max_time = owner_time,
> > > + .min_time = owner_time,
> > > + .count = 1,
> > > + .flags = pelem->flags,
> > > + };
> > > + bpf_map_update_elem(&owner_stat, &ckey, &first, BPF_NOEXIST);
> > > + } else {
> > > + __sync_fetch_and_add(&cdata->total_time, owner_time);
> > > + __sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > + /* FIXME: need atomic operations */
> > > + if (cdata->max_time < owner_time)
> > > + cdata->max_time = owner_time;
> > > + if (cdata->min_time > owner_time)
> > > + cdata->min_time = owner_time;
> > > + }
> > > +
> > > + /* No contention is occurring, delete `lock` entry in `owner_data` */
> > > + if (otdata->count <= 1)
> > > + bpf_map_delete_elem(&owner_data, &pelem->lock);
> > > + /*
> > > + * Contention is still ongoing, with a new owner (current task). `owner_data`
> > > + * should be updated accordingly.
> > > + */
> > > + else {
> > > + u32 i = 0;
> > > + u64 *buf;
> > > +
> > > + __sync_fetch_and_add(&otdata->count, -1);
> > > +
> > > + buf = bpf_map_lookup_elem(&stack_buf, &i);
> > > + if (!buf)
> > > + goto skip_owner_end;
> > > + for (i = 0; i < (u32)max_stack; i++)
> > > + buf[i] = 0x0;
> > > +
> > > + /*
> > > + * ctx[1] has the return code of the lock function.
> >
> > Then I think it's clearer to have a local variable named 'ret' or so.
> >
> >
> > > + * If ctx[1] is not 0, the current task terminates lock waiting without
> > > + * acquiring it. Owner is not changed, but we still need to update the owner
> > > + * stack.
> > > + */
> > > + if (!ctx[1]) {
> >
> > This doesn't match to the comment. It should be:
> >
> > if (ret < 0) {
> >
> >
> > > + s32 id = 0;
> > > + struct task_struct *task = NULL;
> > > +
> > > + if (bpf_task_from_pid)
> >
> > Same as the above. No need to go down if you cannot get the task and
> > stack.
> >
> > if (bpf_task_from_pid == NULL)
> > goto skip_owner;
> >
> >
> > > + task = bpf_task_from_pid(otdata->pid);
> > > +
> > > + if (task) {
> > > + bpf_get_task_stack(task, buf,
> > > + max_stack * sizeof(unsigned long), 0);
> > > + bpf_task_release(task);
> > > + }
> > > +
> > > + id = get_owner_stack_id(buf);
> > > +
> > > + /*
> > > + * If owner stack is changed, update `owner_data` and `owner_stat`
> > > + * accordingly.
> > > + */
> > > + if (id != otdata->stack_id) {
> > > + u64 duration = otdata->timestamp - pelem->timestamp;
> > > + struct contention_key ckey = {
> > > + .stack_id = id,
> > > + .pid = 0,
> > > + .lock_addr_or_cgroup = 0,
> > > + };
> > > + struct contention_data *cdata =
> > > + bpf_map_lookup_elem(&owner_stat, &ckey);
> > > +
> > > + if (!cdata) {
> > > + struct contention_data first = {
> > > + .total_time = duration,
> > > + .max_time = duration,
> > > + .min_time = duration,
> > > + .count = 1,
> > > + .flags = pelem->flags,
> > > + };
> > > + bpf_map_update_elem(&owner_stat, &ckey, &first,
> > > + BPF_NOEXIST);
> > > + } else {
> > > + __sync_fetch_and_add(&cdata->total_time, duration);
> > > + __sync_fetch_and_add(&cdata->count, 1);
> > > +
> > > + /* FIXME: need atomic operations */
> > > + if (cdata->max_time < duration)
> > > + cdata->max_time = duration;
> > > + if (cdata->min_time > duration)
> > > + cdata->min_time = duration;
> > > + }
> > > +
> > > + otdata->timestamp = pelem->timestamp;
> > > + otdata->stack_id = id;
> > > + }
> > > + }
> > > + /*
> > > + * If ctx[1] is 0, then update tracing data with the current task, which is
> > > + * the new owner.
> > > + */
> > > + else {
> > > + otdata->pid = pid;
> > > + otdata->timestamp = timestamp;
> > > +
> > > + bpf_get_task_stack(bpf_get_current_task_btf(), buf,
> > > + max_stack * sizeof(unsigned long), 0);
> >
> > This would be meaningless since it's still in the contention path.
> > Current callstack will be the same as the waiter callstack. You'd
> > better just invalidate callstack here and let the next waiter update
> > it.
>
> I wonder why this is meaningless. In this situation, lock ownership is
> transferred to the current task, and since there is at least one more
> waiter, the contention is still ongoing. `otdata` is for tracing the
> lock owner, so it should be correctly updated with the new owner,
> which is the current task.
Yep, but I meant it has the same callstack as the waiters and provides
no additional information about the owner.
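
Concretely, something like this (plain-C sketch; the struct layout and helper name `take_ownership` are my own, field names are from the patch, and the negative-id sentinel matches what get_owner_stack_id() returns on failure, which the usermode side already ignores):

```c
#include <assert.h>
#include <stdint.h>

/* Field names as in the patch; the exact layout here is illustrative. */
struct owner_tracing_data {
	uint32_t pid;
	int32_t stack_id;
	uint64_t timestamp;
	uint64_t count;
};

/*
 * Hypothetical helper: on a successful acquisition with waiters still
 * remaining, record the new owner but mark the stack id invalid instead
 * of capturing the current (waiter-identical) callstack; the next
 * contention_begin() on this lock refreshes it from the real owner.
 */
static void take_ownership(struct owner_tracing_data *otdata,
			   uint32_t pid, uint64_t now)
{
	otdata->pid = pid;
	otdata->timestamp = now;
	otdata->stack_id = -1; /* negative ids are skipped when reporting */
}
```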
Thanks,
Namhyung
> >
> > > + otdata->stack_id = get_owner_stack_id(buf);
> > > + }
> > > + }
> > > + }
> > > +skip_owner_end:
> > > +
> > > switch (aggr_mode) {
> > > case LOCK_AGGR_CALLER:
> > > key.stack_id = pelem->stack_id;
> > > --
> > > 2.48.1.362.g079036d154-goog
> > >