netdev - Re: [RFC PATCH 00/29] net: VRF support

Message-ID: <54DA6FED.5020907@gmail.com>
Date:	Tue, 10 Feb 2015 13:54:05 -0700
From:	David Ahern <dsahern@...il.com>
To:	Thomas Graf <tgraf@...g.ch>
CC:	netdev@...r.kernel.org, ebiederm@...ssion.com
Subject: Re: [RFC PATCH 00/29] net: VRF support

On 2/9/15 5:53 PM, Thomas Graf wrote:
> On 02/04/15 at 06:34pm, David Ahern wrote:
>> Namespaces provide excellent separation of the networking stack from the
>> netdevices and up. The intent of VRFs is to provide an additional,
>> logical separation at the L3 layer within a namespace.
>
> What you ask for seems to be L3 micro segmentation inside netns. I

I would not label it 'micro', but yes: an L3 separation within an L1 separation.

> would argue that we already support this through multiple routing
> tables. I would prefer improving the existing architecture to cover
> your use cases: Increase the number of supported tables, extend
> routing rules as needed, ...

I've seen that response for VRFs as well. I have not personally tried 
it, but from what I have read it does not work well. I think Roopa 
responded that Cumulus has spent time on that path and has hit some 
roadblocks.

>
>> The VRF id of tasks defaults to 1 and is inherited parent to child. It can
>> be read via the file '/proc/<pid>/vrf' and can be changed anytime by writing
>> to this file (if preferred this can be made a prctl to change the VRF id).
>> This allows services to be launched in a VRF context using ip, similar to
>> what is done for network namespaces.
>>      e.g., ip vrf exec 99 /usr/sbin/sshd
>
> I think such a classification should occur through cgroups instead
> of touching PIDs directly.

That is an interesting idea -- using cgroups for task labeling. It 
presents a creation / deletion event for VRFs which I was trying to 
avoid, and there will be some amount of overhead with a cgroup. I'll 
take a look at that option when I get some time.

As far as the current proposal goes, I am treating VRF as part of a
network context. Today 'ip netns' is used to run a command in a specific
network namespace. The proposal with the VRF layering is to add a VRF
context within a namespace, so in keeping with how 'ip netns' works, the
above syntax allows a user to supply both a network namespace and a VRF
for running a command.
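
To make the 'ip vrf exec' idea concrete, here is a minimal sketch of what
such a launcher could do in user space. It assumes the '/proc/<pid>/vrf'
file described in the RFC, which exists only on a kernel carrying this
patch set. The helper names (vrf_path, read_vrf, set_vrf, vrf_exec) are
hypothetical and are not taken from the iproute2 changes.

```python
import os


def vrf_path(pid="self"):
    """Path of the per-task VRF file proposed in the RFC (assumed interface)."""
    return f"/proc/{pid}/vrf"


def read_vrf(path):
    """Return the VRF id stored in the given file as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def set_vrf(vrf_id, path):
    """Write a new VRF id to the given file."""
    with open(path, "w") as f:
        f.write(str(vrf_id))


def vrf_exec(vrf_id, argv, path=None):
    """Switch the calling task to vrf_id, then replace it with argv.

    The VRF id is inherited parent to child, so the exec'd program and
    anything it forks would start in that VRF. Does not return on success.
    """
    set_vrf(vrf_id, path or vrf_path())
    os.execvp(argv[0], argv)


# Rough equivalent of: ip vrf exec 99 /usr/sbin/sshd
# vrf_exec(99, ["/usr/sbin/sshd"])
```

The path is passed in explicitly so the helpers can be exercised against
an ordinary file on a kernel without the patches.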

>
>> Network devices belong to a single VRF context which defaults to VRF 1.
>> They can be assigned to another VRF using IFLA_VRF attribute in link
>> messages. Similarly the VRF assignment is returned in the IFLA_VRF
>> attribute. The ip command has been modified to display the VRF id of a
>> device. L2 applications like lldp are not VRF aware and still work
>> through all network devices within the namespace.
>
> I believe that binding net_devices to VRFs is misleading and the
> concept by itself is non-scalable. You do not want to create 10k
> net_devices for your overlay of choice just to tie them to a
> particular VRF. You want to store the VRF identifier as metadata and
> have a stateless classifier include it in the VRF decision. See the
> recent VXLAN-GBP work.

I'll take a look when I get time.

I have not seen scalability issues creating 1,000+ net_devices.
Certainly the 40k or so of memory per net_device is noticeable, but I
believe that can be improved (e.g., a number of entries can be moved
under proper CONFIG_ checks). I do need to repeat the tests on newer
kernels.

>
> You could either map whatever selects the VRF to the mark or support it
> natively in the routing rules classifier.
>
> An obvious alternative is OVS. What you describe can be implemented in
> a scalable manner using OVS and mark. I understand that OVS is not for
> everybody but it gets a fundamental principle right: Scalability
> demands programmability.
>
> I don’t think we should be adding a new single purpose metadata field
> to arbitrary structures for every new use case that comes up. We
> should work on programmability which increases flexibility and allows
> decoupling application interest from networking details.
>
>> On RX skbs get their VRF context from the netdevice the packet is received
>> on. For TX the VRF context for an skb is taken from the socket. The
>> intention is for L3/raw sockets to be able to set the VRF context for a
>> packet TX using cmsg (not coded in this patch set).
>
> Specifying L3 context in cmsg seems very broken to me. We do not want
> to bind applications any closer to underlying networking infrastructure.
> In fact, we should do the opposite and decouple this completely.

That suggestion is in line with what is done today for other L3
parameters -- TOS, TTL, and a few others.
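
For reference, this is roughly how an application sets TTL and TOS per
packet via cmsg today. It is a sketch for Linux IPv4 sockets only. The
helper names (build_l3_ancdata, send_with_l3_params) are made up for
illustration, and no VRF cmsg is shown because none is coded in this
patch set.

```python
import socket
import struct


def build_l3_ancdata(ttl=None, tos=None):
    """Build the ancillary data (cmsg) list for a per-packet TTL and/or TOS.

    Each entry is (cmsg_level, cmsg_type, cmsg_data); the values are
    packed as native ints.
    """
    ancdata = []
    if ttl is not None:
        ancdata.append((socket.IPPROTO_IP, socket.IP_TTL, struct.pack("i", ttl)))
    if tos is not None:
        ancdata.append((socket.IPPROTO_IP, socket.IP_TOS, struct.pack("i", tos)))
    return ancdata


def send_with_l3_params(sock, payload, dest, ttl=None, tos=None):
    """Send one datagram, overriding TTL/TOS for this packet only."""
    return sock.sendmsg([payload], build_l3_ancdata(ttl, tos), 0, dest)


# Example (192.0.2.1 is a documentation address):
# s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_with_l3_params(s, b"ping", ("192.0.2.1", 9), ttl=5, tos=0x10)
```

The socket's own IP_TTL and IP_TOS settings are left untouched; the
values apply only to the packet sent with that call.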

>
>> The 'any' context applies to listen sockets only; connected sockets are in
>> a VRF context. Child sockets accepted by the daemon acquire the VRF context
>> of the network device the connection originated on.
>
> Linux considers an address local regardless of the interface the packet
> was received on.  So you would accept the packet on any interface and
> then bind it to the VRF of that interface even though the route for it
> might be on a different interface.
>
> This really belongs into routing rules from my perspective which takes
> mark and the cgroup context into account.

Expanding the current network namespace checks to a networking context
is a very simple and clean way of implementing VRFs versus cobbling
together a 'VRF like' capability using marks, multiple tables, etc.
(i.e., the existing capabilities). Further, the VRF tagging of
net_devices seems to readily fit into the hardware offload and switchdev
capabilities (e.g., add an ndo operation for setting the VRF tag on a
device which passes it to the driver).

Big picture: where are OCP and switchdev headed? Top-of-rack switches
seem to be the first target, but after that? Will the kernel ever
support MPLS? Will the kernel attain the richer feature set of high-end
routers? If so, how does VRF support fit into the design? As I
understand it, a scalable VRF solution is a fundamental building block.
Will a cobbled-together solution of cgroups, marks, rules, and multiple
tables really work versus the simplicity of an expanded network context?

David