[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zn81srqbHfKBC7zZ@slm.duckdns.org>
Date: Fri, 28 Jun 2024 12:14:10 -1000
From: Tejun Heo <tj@...nel.org>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
LKML <linux-kernel@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
David Vernet <void@...ifault.com>
Subject: Re: [PATCH sched_ext/for-6.11 1/2] sched_ext: Implement DSQ iterator
Hello,
On Thu, Jun 27, 2024 at 06:11:48PM -0700, Alexei Starovoitov wrote:
> > +struct bpf_iter_scx_dsq_kern {
> > + struct scx_dsq_node cursor;
> > + struct scx_dispatch_q *dsq;
> > + u32 dsq_seq;
> > + u32 flags;
> > +} __attribute__((aligned(8)));
> > +
> > +struct bpf_iter_scx_dsq {
> > + u64 __opaque[12];
> > +} __attribute__((aligned(8)));
>
> I think this is a bit too much to put on the prog stack.
> Folks are working on increasing this limit and moving
> the stack into "divided stack", so it won't be an issue eventually,
> but let's find a way to reduce it.
Yeah, it is kinda big. Do you have some idea on where the boundary between
okay and too big would fall on?
> It seems to me scx_dsq_node has a bunch of fields,
> but if I'm reading the code correctly this patch is
> only using cursor.list part of it ?
Great point. Cursors used to have to go on the rbtrees too but that's no
longer the case, so I should be able to drop the rbnode which should help
reducing the size substantially. I'll see what I can do.
> Another alternative is to use bpf_mem_alloc() like we do
> in bpf_iter_css_task and others?
This might be okay but given that this can be used pretty frequently (e.g.
every scheduling event) and it *seems* possible to reduce its size
substantially, I'd like to keep it on stack if possible.
> > +__bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
> > + u64 flags)
> > +{
> > + struct bpf_iter_scx_dsq_kern *kit = (void *)it;
> > +
> > + BUILD_BUG_ON(sizeof(struct bpf_iter_scx_dsq_kern) >
> > + sizeof(struct bpf_iter_scx_dsq));
> > + BUILD_BUG_ON(__alignof__(struct bpf_iter_scx_dsq_kern) !=
> > + __alignof__(struct bpf_iter_scx_dsq));
> > +
> > + if (flags & ~__SCX_DSQ_ITER_ALL_FLAGS)
> > + return -EINVAL;
> > +
> > + kit->dsq = find_non_local_dsq(dsq_id);
> > + if (!kit->dsq)
> > + return -ENOENT;
> > +
> > + INIT_LIST_HEAD(&kit->cursor.list);
> > + RB_CLEAR_NODE(&kit->cursor.priq);
> > + kit->cursor.flags = SCX_TASK_DSQ_CURSOR;
>
> Are these two assignments really necessary?
> Something inside nldsq_next_task() is using that?
>
> > + kit->dsq_seq = READ_ONCE(kit->dsq->seq);
> > + kit->flags = flags;
I'm a bit confused whether you're referring to the statements above or
below, but AFAICS, they're all used except for kit->cursor.priq.
- SCX_TASK_DSQ_CURSOR assignment is what tells nldsq_next_task() that the
node is a cursor, not a real task, and thus should be skipped for internal
iterations.
- kit->dsq_seq is used by bpf_iter_scx_dsq_next() to ignore tasks that are
queued after the iteration has started. This, among other things,
guarantees that p->scx.dsq_vtime increases monotonically throughout
iteration.
- kit->flags carries SCX_DSQ_ITER_REV which tells bpf_iter_scx_dsq_next()
the direction of the iteration.
Thanks.
--
tejun
Powered by blists - more mailing lists