Message-ID: <CA+khW7hSFU2YL+jNw2F2qsuYEW0E6r8kJkg1BoBukAqR_sk+6Q@mail.gmail.com>
Date: Wed, 3 Aug 2022 17:18:25 -0700
From: Hao Luo <haoluo@...gle.com>
To: Yonghong Song <yhs@...com>
Cc: linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
cgroups@...r.kernel.org, netdev@...r.kernel.org,
Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>, Tejun Heo <tj@...nel.org>,
Zefan Li <lizefan.x@...edance.com>,
KP Singh <kpsingh@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Benjamin Tissoires <benjamin.tissoires@...hat.com>,
John Fastabend <john.fastabend@...il.com>,
Michal Koutny <mkoutny@...e.com>,
Roman Gushchin <roman.gushchin@...ux.dev>,
David Rientjes <rientjes@...gle.com>,
Stanislav Fomichev <sdf@...gle.com>,
Shakeel Butt <shakeelb@...gle.com>,
Yosry Ahmed <yosryahmed@...gle.com>
Subject: Re: [PATCH bpf-next v6 4/8] bpf: Introduce cgroup iter
On Wed, Aug 3, 2022 at 12:44 AM Yonghong Song <yhs@...com> wrote:
>
>
>
> On 8/1/22 10:54 AM, Hao Luo wrote:
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes:
> >
> > - walking a cgroup's descendants in pre-order.
> > - walking a cgroup's descendants in post-order.
> > - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup on the iter_link
> > created by the attach. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking the cgroup hierarchy holds cgroup_mutex, the
> > iter program is called with cgroup_mutex held.
> >
> > Currently only one session is supported, which means that, depending on
> > the volume of data the bpf program intends to send to user space, the
> > number of cgroups that can be walked is limited. For example, given the
> > current buffer size of 8 * PAGE_SIZE, if the program sends 64B of data
> > for each cgroup and PAGE_SIZE is 4KB, the total number of cgroups that
> > can be walked is 512. This is a limitation of cgroup_iter. If the output
> > data is larger than the buffer size, the second read() will signal
> > EOPNOTSUPP. To work around this, the user may have to update their
>
> 'the second read() will signal EOPNOTSUPP' is not true. For bpf_iter,
> we have a user buffer from the read() syscall and a kernel buffer. The
> buffer size above (8 * PAGE_SIZE) refers to the kernel buffer size.
>
> If the read() syscall buffer size is less than the kernel buffer size,
> the second read() will not signal EOPNOTSUPP. So to make it precise,
> we can say:
>   If the output data is larger than the kernel buffer size, after
>   all data in the kernel buffer is consumed by user space, the
>   subsequent read() syscall will signal EOPNOTSUPP.
>
Thanks Yonghong. Will update.
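
For reference, a user-space reader that follows the corrected wording could
look roughly like the sketch below. drain_iter_fd() is a hypothetical helper,
not part of this patch, and fd is assumed to be an already-open file
descriptor for the iterator. It reads in small chunks (so the user buffer is
smaller than the kernel buffer) and reports -EOPNOTSUPP to the caller as an
incomplete walk rather than as end of data.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): drain an iterator fd
 * into out[0..cap).
 *
 * Returns the number of bytes read if read() reached EOF, or a
 * negative errno. -EOPNOTSUPP from read() is passed through and
 * should be treated as "the walk was incomplete".
 */
static ssize_t drain_iter_fd(int fd, char *out, size_t cap)
{
	size_t total = 0;

	while (total < cap) {
		/* Deliberately small reads: at most 256 bytes per call. */
		size_t want = cap - total < 256 ? cap - total : 256;
		ssize_t n = read(fd, out + total, want);

		if (n == 0)
			return (ssize_t)total;	/* EOF: all data consumed */
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;		/* includes -EOPNOTSUPP */
		}
		total += (size_t)n;
	}

	/* Caller's buffer filled up before EOF was observed. */
	return -ENOBUFS;
}
```

A caller that gets -EOPNOTSUPP back would know the output is truncated and
could, for example, make the program emit less data per cgroup and retry.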
> > program to reduce the volume of data sent to output, for example by
> > skipping some uninteresting cgroups. In the future, we may extend
> > bpf_iter flags to allow customizing the buffer size.
> >
> > Acked-by: Yonghong Song <yhs@...com>
> > Acked-by: Tejun Heo <tj@...nel.org>
> > Signed-off-by: Hao Luo <haoluo@...gle.com>
> > ---
[...]
> > + *
> > + * Currently only one session is supported, which means that, depending on
> > + * the volume of data the bpf program intends to send to user space, the
> > + * number of cgroups that can be walked is limited. For example, given the
> > + * current buffer size of 8 * PAGE_SIZE, if the program sends 64B of data
> > + * for each cgroup and PAGE_SIZE is 4KB, the total number of cgroups that
> > + * can be walked is 512. This is a limitation of cgroup_iter. If the output
> > + * data is larger than the buffer size, the second read() will signal
> > + * EOPNOTSUPP. To work around this, the user may have to update their
> > + * program to
>
> Same here as above, for a better description.
>
SG. Will update.
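
The 512 figure in the comment is just the kernel buffer size divided by the
per-cgroup output. A small sketch of that arithmetic follows;
max_cgroups_walked() is a hypothetical name used only for illustration, and it
assumes every cgroup emits the same number of bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the number of cgroups that can be
 * walked in one session, assuming a kernel buffer of 8 * PAGE_SIZE and a
 * fixed number of output bytes per cgroup.
 */
static size_t max_cgroups_walked(size_t page_size, size_t bytes_per_cgroup)
{
	const size_t kernel_buf_pages = 8;	/* current buffer: 8 * PAGE_SIZE */

	if (bytes_per_cgroup == 0)
		return SIZE_MAX;	/* no output, so the buffer is not the limit */

	return kernel_buf_pages * page_size / bytes_per_cgroup;
}
```

With page_size = 4096 and bytes_per_cgroup = 64 this gives 8 * 4096 / 64 = 512,
matching the example in the comment.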
> > + * reduce the volume of data sent to output. For example, skip some
> > + * uninteresting cgroups.
> > + */
> > +
> > +struct bpf_iter__cgroup {
> > +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +	__bpf_md_ptr(struct cgroup *, cgroup);
> > +};
> > +
> > +struct cgroup_iter_priv {
> > +	struct cgroup_subsys_state *start_css;
> > +	bool visited_all;
> > +	bool terminate;
> > +	int order;
> > +};
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	struct cgroup_iter_priv *p = seq->private;
> > +
> > +	mutex_lock(&cgroup_mutex);
> > +
> > +	/* cgroup_iter doesn't support read across multiple sessions. */
> > +	if (*pos > 0) {
> > +		if (p->visited_all)
> > +			return NULL;
>
> This looks good. thanks!
>
> > +
> > +		/* Haven't visited all, but because cgroup_mutex has been
> > +		 * dropped, return -EOPNOTSUPP to indicate incomplete
> > +		 * iteration.
> > +		 */
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +	}
> > +
> > +	++*pos;
> > +	p->terminate = false;
> > +	p->visited_all = false;
> > +	if (p->order == BPF_ITER_CGROUP_PRE)
> > +		return css_next_descendant_pre(NULL, p->start_css);
> > +	else if (p->order == BPF_ITER_CGROUP_POST)
> > +		return css_next_descendant_post(NULL, p->start_css);
> > +	else /* BPF_ITER_CGROUP_PARENT_UP */
> > +		return p->start_css;
> > +}
> > +
> [...]
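
To illustrate the three walk orders and the "return 1 to stop" convention
described in the commit message, here is a toy user-space sketch. Everything
in it (struct node, walk_pre(), walk_post(), walk_up(), record()) is a
hypothetical stand-in on a hand-built tree; it does not use the kernel's
struct cgroup or the css_next_descendant_*() helpers and only mirrors the
visiting order.

```c
#include <stddef.h>

/* Toy stand-in for a cgroup; not the kernel's struct cgroup. */
struct node {
	int id;
	struct node *parent;
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
};

/* Mirrors the iter program convention: return 1 to stop the walk. */
typedef int (*visit_fn)(struct node *n, void *ctx);

/* Descendants in pre-order: a node is visited before its children. */
static int walk_pre(struct node *n, visit_fn fn, void *ctx)
{
	struct node *c;

	if (fn(n, ctx))
		return 1;
	for (c = n->child; c; c = c->sibling)
		if (walk_pre(c, fn, ctx))
			return 1;
	return 0;
}

/* Descendants in post-order: a node is visited after its children. */
static int walk_post(struct node *n, visit_fn fn, void *ctx)
{
	struct node *c;

	for (c = n->child; c; c = c->sibling)
		if (walk_post(c, fn, ctx))
			return 1;
	return fn(n, ctx) ? 1 : 0;
}

/* Ancestors: start at the given node and end at the root. */
static int walk_up(struct node *n, visit_fn fn, void *ctx)
{
	for (; n; n = n->parent)
		if (fn(n, ctx))
			return 1;
	return 0;
}

/* Example visitor: records visited ids and stops once it sees stop_at. */
struct trace {
	int ids[16];
	int len;
	int stop_at;
};

static int record(struct node *n, void *ctx)
{
	struct trace *t = ctx;

	if (t->len < 16)
		t->ids[t->len++] = n->id;
	return n->id == t->stop_at;
}
```

On a tree where node 1 has children 2 and 3, and node 2 has child 4, the
pre-order walk from 1 visits 1, 2, 4, 3; the post-order walk visits 4, 2, 3, 1;
and the ancestor walk from 4 visits 4, 2, 1.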