netdev - Re: [RFC][Proposal] BPF Control MAP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2412411ee9115352428e71ff214bcc8bd44e8781.camel@mellanox.com>
Date:   Wed, 20 Mar 2019 04:47:11 +0000
From:   Saeed Mahameed <saeedm@...lanox.com>
To:     "alexei.starovoitov@...il.com" <alexei.starovoitov@...il.com>
CC:     "songliubraving@...com" <songliubraving@...com>,
        "toke@...hat.com" <toke@...hat.com>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        "andrew.gospodarek@...adcom.com" <andrew.gospodarek@...adcom.com>,
        "jesper.brouer@...il.com" <jesper.brouer@...il.com>,
        "bblanco@...il.com" <bblanco@...il.com>,
        "jakub.kicinski@...ronome.com" <jakub.kicinski@...ronome.com>,
        "iovisor-dev@...ts.iovisor.org" <iovisor-dev@...ts.iovisor.org>,
        "bjorn.topel@...il.com" <bjorn.topel@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "kafai@...com" <kafai@...com>
Subject: Re: [RFC][Proposal] BPF Control MAP

On Tue, 2019-03-19 at 20:04 -0700, Alexei Starovoitov wrote:
> On Tue, Mar 19, 2019 at 06:54:08PM +0000, Saeed Mahameed wrote:
> > This is exactly the purpose of map_{create/delete} to define what
> > the
> > key,vlaue format will be ( it is not going to be ifindex for all
> > control maps), to make it clear, the key doesn't have to be an
> > ifindex
> > at all, it depends on the map_attr.ctrl_type which the user request
> > on
> > map creation, so different layouts are already supported by this
> > proposal
> > 
> > examples of different map_attr.ctrl_type:
> > 
> > fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
> > XDP_ATTR)
> > // Key layout == ifindex, vlaue format if_xdp_attributes
> > 
> > fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
> > BPF_PROG_STATS)
> > // Key layout == prog_fd, value format struct bpf_prog_stats
> > 
> > fd = create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type =
> > BPF_SOCK_ATTR)
> > // Key layout == socket_fd, value layout struct bpf_sock_attr
> > 
> > extending this can be done in any linux BPF subsystem, by
> > introducing 
> > new map_attr.ctrl_types and new key value layouts of that control
> > type.
> > 
> > on map creation we will attach the btf format of the (key, value)
> > pair
> > to the map created for that user.
> 
> In your examples above does netdev with corresponding ifindex
> exist before map is created?
> Does prog_fd exist ? and socket?
> In all cases yes. they do. Hence creation of 'map' (even pseudo map)
> is an unnecessary step.
> 
> User space providing BTF for input and output makes little sense to
> me.
> What kernel suppose to do with it?
> 
> In case of XDP stats (that could be different between drivers)
> the driver would provide a BTF to describe the stats it's collecting.
> So it's kernel supplied BTF instead of user.
> 

ok, i think i wasn't clear enough, let me retry.

So map creation has nothing to do with ifindex or the object the user
is trying to access.

on map creation the user will define what map sub type (control type)
the user want to create/access, and the kernel job will be to setup
this user map attributes, connect that map to the subsystem that is
going to handle this map operations (lookup and update). by subsystem i
mean XDP subsystem for example (net/core layer and not device driver)
There will be no direct connection to the device driver, direct map
operations that go to device drivers is something we should avoid, the
whole idea here is to provide standard sustainable uapi and not device
specific key values that will go out of control.

BTF layout for key value pair is going to be defined and assigned by
kernel middle layers and NOT the user space nor device drivers, each
subsystem that want to support new control MAP sub types should pre-
define its own BTF layou.

on map creation user can query the BTF key value layouts and any other
map attributes.

For example to enable user to access XDP attributes, we will add a
kernel BPF layer under net/core/xdp_ctrl.c which will handle all 
map operations coming in from maps that are created via:

create_map(BPF_MAP_TYPE_CONTROL, map_attr.ctrl_type = XDP_ATTR)

other subsystems/layers (eg. sockets/kprobe/cgroups/tracepoints) can
define there own map_attr.ctrl_type and implement thier own mid-layer
which will handle control maps operation of that sub type.

So you don't create a map to get connected to a specific netdev, you
create a map to get a connection to a specific kernel BPF subsystem and
query modify that subsystem objects (in XDP case netdevs).

User can create multiple maps with different ap_attr.ctrl_type to
access different system attributes.

> > > I think we need something else here. Using BTF to describe
> > > output stats is nice, but using BTF to describe input query is
> > > problematic,
> > > since user cannot know before hand what kernel can and cannot
> > > accept.
> > > imo input should be stable uapi with fixed constants whereas
> > > stats-like output can be BTF based and vary from driver to driver
> > > and from one nic version to another.
> > > 
> > 
> > Well, we can decide to use static stable uapi, and still use this
> > special map to leverage the map API as described in this doc.
> > 
> > but also we can allow dynamic value layouts, but any extension
> > should
> > be done to the end of any value structuer and on lookup we will
> > only
> > copy the size that the user already recognizes .. ? or we can
> > assume/force the user to use the map_attributes to figure out
> > format
> > layouts and sizes, but still we will guarantee backward
> > compatibility
> > in kernel level by keeping old format the same, and extension is
> > only
> > allowed to the end of the value structures.
> 
> for input queries - yes. See how 'union bpf_attr' works.
> It can accommodate any number of new commands with its own arguments
> to query XDP stats.
> For output the driver can supply BTF back to user space along with
> a blob of data.
> 
> If I got your intent correctly you want only BPF_MAP_TYPE_CONTROL to
> be processed by generic bpf layer and everything else to be done in
> the driver? All commands, values, input/output to be driver specific?
> 

hmm, no i want the kernel core code to process everything, different
subsystems (not device drivers) can provide there own
BPF_MAP_TYPE_CONTROL sub types and provide the map operation
implementation. As a backend such mid layer subsystem can use whatever
interface to access objects (netdevs ndos in our case and could be via 
union bpf_attr), this is implementation specific that is transparent to
the user.

bottom line we don't want drivers/users to create BTF layouts it is the
kernel mid layer job to connect between the two via the control MAPs,
and this mid layer will audit drivers and users.

I know we can achieve same thing with union bpf_attr and use plain
ethtool or netlink, but this means that we need to integrate
ethtool/netlink APIs into libbpf but the whole beauty if this proposal
is that a BPF developer will have to deal only with MAP operations even
if he wants to change system attributes.

One extra benefit of this is forward compatibility of a simple user
space program that can access new kernel attributes with out the need
to modify the user space program

consider the following command line, with bpftool that compiled way
before xsk-ring-szie attribute was introduced.

$ bpftool xdp set eth0 xsk-ring-size 128

what this can eventually do:
 
fd = create_map(BPF_CONTROL, map_attr.ctrl_type = XDP_ATTR)

//Query BTF of the control (XDP) map
query_map_attributes(fd, &map_attr);

// Find if control map values btf layout include the new member
offset = btf_get_offset_of_memeber_name(map_attr.btf, "xsk-ring-size",
BTF_TYPE_INT);
if (offset < 0)
     return offset; // the new member is not supported in kernel

*(value + offset) = (int)new_val;
map_update(fd, <eth0 ifindex>, value);