Message-ID: <Z-0BoF4vkC2IS1W4@redhat.com>
Date: Wed, 2 Apr 2025 10:21:36 +0100
From: Daniel P. Berrangé <berrange@...hat.com>
To: Stefano Garzarella <sgarzare@...hat.com>
Cc: Bobby Eshleman <bobbyeshleman@...il.com>,
Jakub Kicinski <kuba@...nel.org>,
"K. Y. Srinivasan" <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>,
Wei Liu <wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>,
Stefan Hajnoczi <stefanha@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Jason Wang <jasowang@...hat.com>,
Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
Eugenio Pérez <eperezma@...hat.com>,
Bryan Tan <bryan-bt.tan@...adcom.com>,
Vishnu Dasa <vishnu.dasa@...adcom.com>,
Broadcom internal kernel review list <bcm-kernel-feedback-list@...adcom.com>,
"David S. Miller" <davem@...emloft.net>,
virtualization@...ts.linux.dev, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-hyperv@...r.kernel.org,
kvm@...r.kernel.org
Subject: Re: [PATCH v2 0/3] vsock: add namespace support to vhost-vsock
On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@...il.com> wrote:
> >
> > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > Since it offers the best of both worlds, and still tends conservative in
> > protecting existing applications... but I agree, the non-strict mode
> > vsock would be unique WRT the usual concept of namespaces.
>
> Maybe we could do the opposite, enable strict mode by default (I think
> it was similar to what I had tried to do with the kernel module in v1, I
> was young I know xD)
> And provide a way to disable it for those use cases where the user wants
> backward compatibility, while paying the cost of less isolation.
I think backwards compatibility has to be the default behaviour, otherwise
the change carries too high a risk of breaking existing deployments that
are already using netns and relying on VSOCK being global. Breakage has to
be opt-in.
> I was thinking two options (not sure if the second one can be done):
>
> 1. provide a global sysfs/sysctl that disables strict mode, but this
> then applies to all namespaces
>
> 2. provide something that allows disabling strict mode by namespace.
> Maybe when it is created there are options, or something that can be
> set later.
>
> 2 would be ideal, but that might be too much, so 1 might be enough. In
> any case, 2 could also be a next step.
>
> WDYT?
It occurred to me that the problem we face with the CID space usage is
somewhat similar to the UID/GID space usage for user namespaces.
In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
At the risk of being overkill, is it worth trying a similar kind of
approach for the vsock CID space ?
A simple variant would be a /proc/net/vsock_cid_outside specifying a set
of CIDs which are exclusively referencing /dev/vhost-vsock associations
created outside the namespace. Anything not listed would be exclusively
referencing associations created inside the namespace.
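
To make the simple variant concrete, here is a minimal userspace sketch of
the lookup it implies. The file format (one CID or "lo-hi" range per line)
and the helper names are assumptions made for illustration only; no such
file exists in the kernel today and the thread does not define its syntax.

```python
from typing import Set


def parse_cid_outside(text: str) -> Set[int]:
    """Parse a hypothetical vsock_cid_outside body.

    Assumed format: one CID, or one inclusive 'lo-hi' range, per line.
    """
    cids: Set[int] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        lo, _, hi = line.partition("-")
        first = int(lo)
        last = int(hi) if hi else first
        if last < first:
            raise ValueError(f"bad range: {line!r}")
        cids.update(range(first, last + 1))
    return cids


def owner(outside_cids: Set[int], cid: int) -> str:
    """Listed CIDs refer to outside associations, all others to inside ones."""
    return "outside" if cid in outside_cids else "inside"


outside = parse_cid_outside("15\n100-102\n")
print(owner(outside, 15), owner(outside, 101), owner(outside, 16))
```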
A more complex variant would be to allow a full remapping of CIDs as is
done with userns, via a /proc/net/vsock_cid_map, with the same three
parameters, so that the CID=15 association outside the namespace could be
remapped to CID=9015 inside the namespace, allowing the inside namespace
to define its own association for CID=15 without clashing.
IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
associations created outside the namespace, while unmapped CIDs would be
exclusively referencing /dev/vhost-vsock associations inside the
namespace.
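
The remapping described above can be sketched in userspace as follows. This
assumes the file would reuse the uid_map line layout (first ID inside the
namespace, first ID outside the namespace, length); that layout and every
name in the code are illustrative assumptions, not an existing kernel
interface.

```python
from typing import List, NamedTuple, Tuple


class CidExtent(NamedTuple):
    inside_start: int   # first CID as seen inside the namespace
    outside_start: int  # first CID as seen outside the namespace
    count: int          # number of consecutive CIDs covered


def parse_cid_map(text: str) -> List[CidExtent]:
    """Parse a hypothetical vsock_cid_map body, one extent per line."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected three fields: {line!r}")
        inside, outside, count = (int(f) for f in fields)
        extents.append(CidExtent(inside, outside, count))
    return extents


def resolve(extents: List[CidExtent], inside_cid: int) -> Tuple[str, int]:
    """Return which side owns the association and the CID on that side."""
    for e in extents:
        if e.inside_start <= inside_cid < e.inside_start + e.count:
            return ("outside", e.outside_start + (inside_cid - e.inside_start))
    # Unmapped CIDs refer to associations created inside the namespace.
    return ("inside", inside_cid)


# Outside CID=15 is visible as CID=9015 inside the namespace.
cid_map = parse_cid_map("9015 15 1\n")
print(resolve(cid_map, 9015))  # ('outside', 15)
print(resolve(cid_map, 15))    # ('inside', 15)
```

With this mapping, CID=15 inside the namespace stays free for an association
created inside it, which is the clash avoidance described above.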
A likely benefit of relying on a kernel defined mapping/partition of
the CID space is that apps like QEMU don't need changing, as there's
no need to invent a new /dev/vhost-vsock-netns device node.
Both approaches give the desirable security protection whereby the
inside namespace can be prevented from accessing certain CIDs that
were associated outside the namespace.
Some rule would need to be defined for updating the /proc/net/vsock_cid_map
file as it is the security control mechanism. If it is write-once, then
once the container mgmt app has initialized it, nothing later could change
it.
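
The write-once rule can be modelled with a small sketch. It mirrors the way
uid_map can only be written once per user namespace; the class name and the
choice of PermissionError are assumptions for illustration, not a proposed
kernel API.

```python
class WriteOnceCidMap:
    """Toy model of a mapping file that accepts exactly one write."""

    def __init__(self) -> None:
        self._body = None

    def write(self, body: str) -> None:
        if self._body is not None:
            # Assumed behaviour: later writers are refused outright.
            raise PermissionError("CID map already set for this namespace")
        self._body = body

    def read(self) -> str:
        return self._body or ""


cid_map = WriteOnceCidMap()
cid_map.write("9015 15 1\n")      # container manager sets the mapping
try:
    cid_map.write("9015 16 1\n")  # a later writer is refused
except PermissionError as err:
    print("rejected:", err)
```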
A key question is whether we need the "first come, first served" behaviour
for CIDs, where a CID can be arbitrarily used by the outside or inside
namespace according to whichever tries to associate the CID first.
IMHO those semantics lead to unpredictable behaviour for apps because
what happens depends on ordering of app launches inside & outside the
namespace, but they do sort of allow for VSOCK namespace behaviour to
be 'zero conf' out of the box.
A mapping that strictly partitions CIDs to either outside or inside
namespace usage, but never both, gives well defined behaviour, at the
cost of needing to set up an initial mapping/partition.
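
The difference between the two semantics can be shown with a deliberately
simplified model: the only question asked is which side's association a
given CID resolves to from inside the namespace. The event representation
and function names are assumptions for illustration.

```python
from typing import Iterable, Optional, Set, Tuple

Event = Tuple[str, int]  # (side, cid), where side is "inside" or "outside"


def owner_first_come(events: Iterable[Event], cid: int) -> Optional[str]:
    """First come, first served: the earliest association attempt wins."""
    for side, event_cid in events:
        if event_cid == cid:
            return side
    return None


def owner_partitioned(outside_cids: Set[int], cid: int) -> str:
    """Strict partition: ownership is fixed by the mapping, not by timing."""
    return "outside" if cid in outside_cids else "inside"


# Same two apps, launched in a different order.
order_a = [("outside", 15), ("inside", 15)]
order_b = [("inside", 15), ("outside", 15)]

print(owner_first_come(order_a, 15), owner_first_come(order_b, 15))
print(owner_partitioned({15}, 15))
```

In this model the first-come result flips with launch order, while the
partitioned result depends only on the configured set.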
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|