linux-kernel - Re: RFC rdma cgroup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5639F2E3.8090101@mellanox.com>
Date:	Wed, 4 Nov 2015 13:58:27 +0200
From:	Haggai Eran <haggaie@...lanox.com>
To:	Parav Pandit <pandit.parav@...il.com>
CC:	Tejun Heo <tj@...nel.org>, Doug Ledford <dledford@...hat.com>,
	"Hefty, Sean" <sean.hefty@...el.com>,
	"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>,
	"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
	Liran Liss <liranl@...lanox.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"lizefan@...wei.com" <lizefan@...wei.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Jonathan Corbet <corbet@....net>,
	"james.l.morris@...cle.com" <james.l.morris@...cle.com>,
	"serge@...lyn.com" <serge@...lyn.com>,
	Or Gerlitz <ogerlitz@...lanox.com>,
	Matan Barak <matanb@...lanox.com>,
	"raindel@...lanox.com" <raindel@...lanox.com>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"linux-security-module@...r.kernel.org" 
	<linux-security-module@...r.kernel.org>,
	Jason Gunthorpe <jgunthorpe@...idianresearch.com>
Subject: Re: RFC rdma cgroup

On 03/11/2015 21:11, Parav Pandit wrote:
> So it looks like below,
> #cat rdma.resources.verbs.list
> Output:
> mlx4_0 uctx ah pd cq mr mw srq qp flow
> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
What happens if you set a limit of rss_wq to mlx4_0 in this example?
Would it fail? I think it would be simpler for administrators if they
can configure every resource supported by uverbs. If a resource is not
supported by a specific device, you can never go over the limit anyway.

> #cat rdma.resources.hw.list
> hfi1 hw_qp hw_mr sw_pd
> (This particular one is hypothical example, I haven't actually coded
> this, unlike uverbs which is real).
Sounds fine to me. We will need to be careful to make sure that driver
maintainers don't break backward compatibility with this interface.

>> I guess there aren't a lot of options when the resources can belong to
>> multiple cgroups. So after migrating, new resources will belong to the
>> new cgroup or the old one?
> Resource always belongs to the cgroup in which its created, regardless
> of process migration.
> Again, its owned at the css level instead of cgroup. Therefore
> original cgroup can also be deleted but internal reference to data
> structure and that is freed and last rdma resource is freed.
Okay.

>>> For applications that doesn't use RDMA-CM, query_device and query_port
>>> will filter out the GID entries based on the network namespace in
>>> which caller process is running.
>> This could work well for RoCE, as each entry in the GID table is
>> associated with a net device and a network namespace. However, in
>> InfiniBand, the GID table isn't directly related to the network
>> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
>> namespace by the set of IPoIB netdevs in the network namespace, but
>> InfiniBand is designed to also work without IPoIB, so I don't think it's
>> a good idea.
> Got it. Yeah, this code can be under if(device_type RoCE).
IIRC there's a core capability for the new GID table code that contains
namespace, so you can use that.

>> I think it would be better to allow each cgroup to limit the pkeys and
>> gids its processes can use.
> 
> o.k. So the use case is P_Key? So I believe requirement would similar
> to device cgroup.
> Where set of GID table entries are configured as white list entries.
> and when they are queried or used during create_ah or modify_qp, its
> compared against the white list (or in other words as ACL).
> If they are found in ACL, they are reported in query_device or in
> create_ah, modify_qp. If not they those calls are failed with
> appropriate status?
> Does this look ok? 
Yes, that sounds good to me.

> Can we address requirement as additional feature just after first path?
> Tejun had some other idea on this kind of requirement, and I need to
> discuss with him.
Of course. I think there's use for the RDMA cgroup even without a pkey
or GID ACL, just to make sure one application doesn't hog hardware
resources.

>>> One of the idea I was considering is: to create virtual RDMA device
>>> mapped to physical device.
>>> And configure GID count limit via configfs for each such device.
>> You could probably achieve what you want by creating a virtual RDMA
>> device and use the device cgroup to limit access to it, but it sounds to
>> me like an overkill.
> 
> Actually not much. Basically this virtual RDMA device points to the
> struct device of the physical device itself.
> So only overhead is linking this structure to native device structure
> and  passing most of the calls to native ib_device with thin filter
> layer in control path.
> post_send/recv/poll_cq will directly go native device and same performance.
Still, I think we already have code that wraps ib_device calls for
userspace, which is the ib_uverbs module. There's no need for an extra
layer.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/