Message-ID: <CADUfDZq_hsHDxi0uyxUY-PLJQm9CNYnnxjqWUXsx0ssuh6ee-A@mail.gmail.com>
Date: Tue, 29 Oct 2024 09:32:13 -0700
From: Caleb Sander <csander@...estorage.com>
To: Parav Pandit <parav@...dia.com>
Cc: Saeed Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>, Tariq Toukan <tariqt@...dia.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mlx5: only schedule EQ comp tasklet if necessary
On Mon, Oct 28, 2024 at 9:08 PM Parav Pandit <parav@...dia.com> wrote:
>
> Hi
>
> > From: Caleb Sander Mateos <csander@...estorage.com>
> > Sent: Sunday, October 27, 2024 9:37 AM
> >
> > Currently, the mlx5_eq_comp_int() interrupt handler schedules a tasklet to call
> > mlx5_cq_tasklet_cb() if it processes any completions. For CQs whose
> > completions don't need to be processed in tasklet context, this overhead is
> > unnecessary. Atomic operations are needed to schedule, lock, and clear the
> > tasklet. And when mlx5_cq_tasklet_cb() runs, it acquires a spin lock to access
> > the list of CQs enqueued for processing.
> >
> > Schedule the tasklet in mlx5_add_cq_to_tasklet() instead to avoid this
> > overhead. mlx5_add_cq_to_tasklet() is responsible for enqueuing the CQs to
> > be processed in tasklet context, so it can schedule the tasklet. CQs that need
> > tasklet processing have their interrupt comp handler set to
> > mlx5_add_cq_to_tasklet(), so they will schedule the tasklet. CQs that don't
> > need tasklet processing won't schedule the tasklet. To avoid scheduling the
> > tasklet multiple times during the same interrupt, only schedule the tasklet in
> > mlx5_add_cq_to_tasklet() if the tasklet work queue was empty before the
> > new CQ was pushed to it.
> >
> > Note that the mlx4 driver works the same way: it schedules the tasklet in
> > mlx4_add_cq_to_tasklet() and only if the work queue was empty before.
> >
> > Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> > ---
> > drivers/net/ethernet/mellanox/mlx5/core/cq.c | 5 +++++
> > drivers/net/ethernet/mellanox/mlx5/core/eq.c | 5 +----
> > 2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > index 4caa1b6f40ba..25f3b26db729 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
> > @@ -69,22 +69,27 @@ void mlx5_cq_tasklet_cb(struct tasklet_struct *t)
> > static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq,
> > struct mlx5_eqe *eqe)
> > {
> > unsigned long flags;
> > struct mlx5_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> > + bool schedule_tasklet = false;
> >
> > spin_lock_irqsave(&tasklet_ctx->lock, flags);
> > /* When migrating CQs between EQs will be implemented, please note
> > * that you need to sync this point. It is possible that
> > * while migrating a CQ, completions on the old EQs could
> > * still arrive.
> > */
> > if (list_empty_careful(&cq->tasklet_ctx.list)) {
> > mlx5_cq_hold(cq);
> > + schedule_tasklet = list_empty(&tasklet_ctx->list);
> > list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> > }
> > spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
> > +
> > + if (schedule_tasklet)
> > + tasklet_schedule(&tasklet_ctx->task);
> > }
> >
> > /* Callers must verify outbox status in case of err */
> > int mlx5_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
> >                    u32 *in, int inlen, u32 *out, int outlen)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > index 68cb86b37e56..66fc17d9c949 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > @@ -112,17 +112,17 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
> > struct mlx5_eq_comp *eq_comp =
> > container_of(nb, struct mlx5_eq_comp, irq_nb);
> > struct mlx5_eq *eq = &eq_comp->core;
> > struct mlx5_eqe *eqe;
> > int num_eqes = 0;
> > - u32 cqn = -1;
> >
> > eqe = next_eqe_sw(eq);
> > if (!eqe)
> > goto out;
> >
> > do {
> > + u32 cqn;
> > struct mlx5_core_cq *cq;
> >
> A small nit: cqn should be declared after cq to follow the netdev coding guidelines [1].
Sure, will fix. Thanks for the reference.
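Concretely, that just means swapping the two declarations so the longer
one comes first (reverse Christmas tree), as in the diff further below:

	do {
		struct mlx5_core_cq *cq;
		u32 cqn;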
>
> > /* Make sure we read EQ entry contents after we've
> > * checked the ownership bit.
> > */
> > @@ -145,13 +145,10 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
> > } while ((++num_eqes < MLX5_EQ_POLLING_BUDGET) && (eqe = next_eqe_sw(eq)));
> >
> > out:
> > eq_update_ci(eq, 1);
> >
> > - if (cqn != -1)
> > - tasklet_schedule(&eq_comp->tasklet_ctx.task);
> > -
> The current code processes many EQEs and performs the check for tasklet_schedule only once, in the cqn check.
> With this change, the additional check will be done on every EQE.
> This will make the interrupt handler marginally slower.
> Returning a bool from comp() won't be good either, and we cannot inline things here due to the function pointer.
>
> The cost of scheduling a null tasklet is higher than this if (schedule_tasklet) check.
> In another series internally, I am working to reduce the cost of comp() itself, unrelated to this change.
> So it is OK to have the additional check introduced here.
Right, there's definitely a tradeoff here.
From what I could tell, there is only one CQ type that processes
completions in tasklet context (user InfiniBand CQs, running
mlx5_ib_cq_comp()). All others handle their completions in interrupt
context. Ideally the CQ types that don't need it would not pay the
cost of the tasklet schedule and execution. There are several atomic
operations involved in the tasklet path which are fairly expensive. In
our TCP-heavy workload, we see 4% of the CPU time spent on the
tasklet_trylock() in tasklet_action_common.constprop.0, with a smaller
amount spent on the atomic operations in tasklet_schedule(),
tasklet_clear_sched(), and acquiring the spinlock in
mlx5_cq_tasklet_cb().
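For context on where those atomics come from, paraphrasing
include/linux/interrupt.h (exact details vary by kernel version):

	static inline void tasklet_schedule(struct tasklet_struct *t)
	{
		/* atomic RMW on every schedule attempt */
		if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
			__tasklet_schedule(t);
	}

	static inline bool tasklet_trylock(struct tasklet_struct *t)
	{
		/* another atomic RMW when the softirq runs the tasklet */
		return !test_and_set_bit(TASKLET_STATE_RUN, &t->state);
	}

On top of those, mlx5_cq_tasklet_cb() takes the tasklet_ctx spinlock to
pull the enqueued CQs off the list, which is where the rest of the
overhead shows up.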
I agree the additional branch per EQE should be cheaper than
scheduling the unused tasklet, but the cost would be paid by
InfiniBand workloads while non-InfiniBand workloads see the benefit.
How about instead scheduling the tasklet in mlx5_eq_comp_int() if any
of the CQs have a tasklet completion handler? That should get the best
of both worlds: skipping the tasklet schedule for CQs that don't need
it while ensuring the tasklet is only scheduled once per interrupt.
Something like this:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 68cb86b37e56..f0ba3725b8e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -112,9 +112,9 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
struct mlx5_eq_comp *eq_comp =
container_of(nb, struct mlx5_eq_comp, irq_nb);
struct mlx5_eq *eq = &eq_comp->core;
+ bool schedule_tasklet = false;
struct mlx5_eqe *eqe;
int num_eqes = 0;
- u32 cqn = -1;
eqe = next_eqe_sw(eq);
if (!eqe)
@@ -122,6 +122,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
do {
struct mlx5_core_cq *cq;
+ u32 cqn;
/* Make sure we read EQ entry contents after we've
* checked the ownership bit.
@@ -134,6 +135,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
if (likely(cq)) {
++cq->arm_sn;
cq->comp(cq, eqe);
+ schedule_tasklet |= !!cq->tasklet_ctx.comp;
mlx5_cq_put(cq);
} else {
dev_dbg_ratelimited(eq->dev->device,
@@ -147,7 +149,7 @@ static int mlx5_eq_comp_int(struct notifier_block *nb,
out:
eq_update_ci(eq, 1);
- if (cqn != -1)
+ if (schedule_tasklet)
tasklet_schedule(&eq_comp->tasklet_ctx.task);
return 0;
Thanks,
Caleb