Message-ID: <54DA6FED.5020907@gmail.com>
Date: Tue, 10 Feb 2015 13:54:05 -0700
From: David Ahern <dsahern@...il.com>
To: Thomas Graf <tgraf@...g.ch>
CC: netdev@...r.kernel.org, ebiederm@...ssion.com
Subject: Re: [RFC PATCH 00/29] net: VRF support
On 2/9/15 5:53 PM, Thomas Graf wrote:
> On 02/04/15 at 06:34pm, David Ahern wrote:
>> Namespaces provide excellent separation of the networking stack from the
>> netdevices and up. The intent of VRFs is to provide an additional,
>> logical separation at the L3 layer within a namespace.
>
> What you ask for seems to be L3 micro-segmentation inside netns. I
I would not label it 'micro', but yes: an L3 separation within an L1 separation.
> would argue that we already support this through multiple routing
> tables. I would prefer improving the existing architecture to cover
> your use cases: Increase the number of supported tables, extend
> routing rules as needed, ...
I've seen that response for VRFs as well. I have not personally tried
it, but from what I have read it does not work well. I think Roopa
responded that Cumulus has spent time on that path and has hit some
roadblocks.
>
>> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
>> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
>> to this file (if preferred this can be made a prctl to change the VRF id).
>> This allows services to be launched in a VRF context using ip, similar to
>> what is done for network namespaces.
>> e.g., ip vrf exec 99 /usr/sbin/sshd
>
> I think such as classification should occur through cgroups instead
> of touching PIDs directly.
That is an interesting idea -- using cgroups for task labeling. It does
introduce creation/deletion events for VRFs, which I was trying to
avoid, and a cgroup carries some amount of overhead. I'll take a look
at that option when I get some time.
As far as the current proposal goes, I am treating VRF as part of a
network context. Today 'ip netns' is used to run a command in a
specific network namespace; the proposal with the VRF layering is to
add a VRF context within a namespace. In keeping with how 'ip netns'
works, the above syntax lets a user supply both a network namespace and
a VRF when running a command (a sketch of such a helper follows below).
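To make that concrete, here is a minimal userspace sketch of what an
'ip vrf exec <id> <cmd>' helper could do under this proposal. The
/proc/<pid>/vrf file and its write semantics are from the cover letter,
not mainline; this is illustrative only, not the actual iproute2 change:

/*
 * Minimal sketch of an 'ip vrf exec <id> <cmd>' helper under this
 * proposal: write the VRF id to /proc/self/vrf (per the cover letter;
 * not a mainline file) and exec the command, which inherits the id.
 */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	FILE *f;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <vrf-id> <cmd> [args...]\n",
			argv[0]);
		return 1;
	}

	f = fopen("/proc/self/vrf", "w");
	if (!f) {
		perror("fopen /proc/self/vrf");
		return 1;
	}
	if (fprintf(f, "%s", argv[1]) < 0 || fclose(f) != 0) {
		perror("write vrf id");
		return 1;
	}

	/* children inherit the VRF id, as with 'ip netns exec' and netns */
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}

Compiled as, say, 'vrfexec', running 'vrfexec 99 /usr/sbin/sshd' would
mirror the sshd example quoted above.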
>
>> Network devices belong to a single VRF context which defaults to VRF 1.
>> They can be assigned to another VRF using IFLA_VRF attribute in link
>> messages. Similarly the VRF assignment is returned in the IFLA_VRF
>> attribute. The ip command has been modified to display the VRF id of a
>> device. L2 applications like lldp are not VRF aware and still work
>> through all network devices within the namespace.
>
> I believe that binding net_devices to VRFs is misleading and the
> concept by itself is non-scalable. You do not want to create 10k
> net_devices for your overlay of choice just to tie them to a
> particular VRF. You want to store the VRF identifier as metadata and
> have a stateless classifier include it in the VRF decision. See the
> recent VXLAN-GBP work.
I'll take a look when I get time.
I have not seen scalability issues creating 1,000+ net_devices.
Certainly the ~40KB of memory per net_device is noticeable, but I
believe that can be improved (e.g., a number of entries can be moved
under proper CONFIG_ checks). I do need to repeat the tests on newer
kernels.
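Going back to the IFLA_VRF point: under this proposal, moving a device
into a VRF would be a single rtnetlink request. A hedged sketch -- the
IFLA_VRF attribute number and its u32 payload are assumptions from the
cover letter, not a mainline API, and the ACK read is omitted:

/*
 * Hedged sketch: move a device into a VRF with RTM_SETLINK carrying
 * the IFLA_VRF attribute from this RFC.
 */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

#ifndef IFLA_VRF
#define IFLA_VRF 99	/* placeholder; real value comes from the patches */
#endif

int set_dev_vrf(int ifindex, unsigned int vrf_id)
{
	struct {
		struct nlmsghdr nlh;
		struct ifinfomsg ifi;
		char buf[64];
	} req;
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
	struct rtattr *rta;
	int fd, ret;

	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	req.nlh.nlmsg_type = RTM_SETLINK;
	req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req.ifi.ifi_family = AF_UNSPEC;
	req.ifi.ifi_index = ifindex;

	/* append IFLA_VRF <vrf_id>, same pattern iproute2 uses for attrs */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nlh.nlmsg_len));
	rta->rta_type = IFLA_VRF;
	rta->rta_len = RTA_LENGTH(sizeof(vrf_id));
	memcpy(RTA_DATA(rta), &vrf_id, sizeof(vrf_id));
	req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) +
			    RTA_ALIGN(rta->rta_len);

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (fd < 0)
		return -1;
	ret = sendto(fd, &req, req.nlh.nlmsg_len, 0,
		     (struct sockaddr *)&sa, sizeof(sa));
	close(fd);
	return ret < 0 ? -1 : 0;
}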
>
> You could either map whatever selects the VRF to the mark or support it
> natively in the routing rules classifier.
>
> An obvious alternative is OVS. What you describe can be implemented in
> a scalable manner using OVS and mark. I understand that OVS is not for
> everybody but it gets a fundamental principle right: Scalability
> demands for programmability.
>
> I don’t think we should be adding a new single purpose metadata field
> to arbitrary structures for every new use case that comes up. We
> should work on programmability which increases flexibility and allows
> decoupling application interest from networking details.
>
>> On RX skbs get their VRF context from the netdevice the packet is received
>> on. For TX the VRF context for an skb is taken from the socket. The
>> intention is for L3/raw sockets to be able to set the VRF context for a
>> packet TX using cmsg (not coded in this patch set).
>
> Specifying L3 context in cmsg seems very broken to me. We do not want
> to bind applications any closer to underlying networking infrastructure.
> In fact, we should do the opposite and decouple this completely.
That suggestion is in line with what is done today for other L3
parameters -- TOS, TTL, and a few others (sketch below).
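For comparison, this is how a per-packet attribute is attached via
ancillary data today for things like IP_TTL/IP_TOS; a VRF cmsg would
follow the same pattern. IP_VRF below is purely hypothetical -- only
the TOS/TTL analogy exists in mainline:

/*
 * Sketch of per-packet ancillary data, following the existing
 * IP_TTL/IP_TOS cmsg pattern; IP_VRF is an invented placeholder.
 */
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IP_VRF
#define IP_VRF 100	/* hypothetical cmsg type; value TBD by the patches */
#endif

/* caller fills msg->msg_name and msg->msg_iov; we attach the VRF cmsg */
ssize_t sendmsg_with_vrf(int fd, struct msghdr *msg, int vrf_id)
{
	char cbuf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr *cmsg;

	memset(cbuf, 0, sizeof(cbuf));
	msg->msg_control = cbuf;
	msg->msg_controllen = sizeof(cbuf);

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = IPPROTO_IP;
	cmsg->cmsg_type = IP_VRF;	/* analogous to an IP_TTL cmsg */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &vrf_id, sizeof(int));

	return sendmsg(fd, msg, 0);
}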
>
>> The 'any' context applies to listen sockets only; connected sockets are in
>> a VRF context. Child sockets accepted by the daemon acquire the VRF context
>> of the network device the connection originated on.
>
> Linux considers an address local regardless of the interface the packet
> was received on. So you would accept the packet on any interface and
> then bind it to the VRF of that interface even though the route for it
> might be on a different interface.
>
> This really belongs in routing rules from my perspective, which take
> mark and the cgroup context into account.
Expanding the current network namespace checks to a networking context
is a very simple and clean way of implementing VRFs, versus cobbling
together a 'VRF-like' capability using marks, multiple tables, etc.
(i.e., the existing capabilities). Further, the VRF tagging of
net_devices seems to fit readily into the hardware offload and
switchdev capabilities (e.g., add an ndo operation for setting the VRF
tag on a device which passes it to the driver; a rough sketch follows
below).
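A rough kernel-side sketch of that ndo idea; the member name and
signature are mine, purely illustrative, not from the patch set:

/*
 * Hypothetical ndo hook for pushing a device's VRF tag down to the
 * driver so switchdev-capable hardware can program its L3 tables per
 * VRF. Neither the name nor the signature comes from the patch set.
 */
#include <linux/types.h>
#include <linux/netdevice.h>

/* imagined addition to struct net_device_ops */
struct net_device_ops_vrf {
	int (*ndo_set_vrf)(struct net_device *dev, u32 vrf_id);
};

/* example driver implementation for a switch ASIC */
static int sw_drv_set_vrf(struct net_device *dev, u32 vrf_id)
{
	/* program the hardware: map this port's L3 lookups to vrf_id */
	netdev_info(dev, "assigned to VRF %u\n", vrf_id);
	return 0;
}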
Big picture-wise, where are OCP and switchdev headed? Top-of-rack
switches seem to be the first target, but after that? Will the kernel
ever support MPLS? Will the kernel attain the richer feature set of
high-end routers? If so, how does VRF support fit into the design? As I
understand it, a scalable VRF solution is a fundamental building block.
Will a cobbled-together solution of cgroups, marks, rules, and multiple
tables really work, versus the simplicity of an expanded network
context?
David