Message-ID: <54D800F5.60603@gmail.com>
Date: Sun, 08 Feb 2015 17:36:05 -0700
From: David Ahern <dsahern@...il.com>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
CC: Stephen Hemminger <stephen@...workplumber.org>,
netdev@...r.kernel.org,
Nicolas Dichtel <nicolas.dichtel@...nd.com>,
roopa <roopa@...ulusnetworks.com>, hannes@...essinduktion.org,
Dinesh Dutt <ddutt@...ulusnetworks.com>,
Vipin Kumar <vipin@...ulusnetworks.com>,
Shmulik Ladkani <shmulik.ladkani@...il.com>
Subject: Re: [RFC PATCH 00/29] net: VRF support
On 2/6/15 1:50 PM, Eric W. Biederman wrote:
>
> David looking at your patches and reading through your code I think I
> understand what you are proposing, let me see if I can sum it up.
>
> Semantics:
> - The same as a network namespace.
> - Addition of a VRF-ANY context
> - No per VRF settings
> - No creation or removal operation.
Yes. Plus what I see as an important feature: the ability to layer VRFs
and network namespaces.
Network namespaces can be used to create smaller, logical switches
within a single physical switch (i.e., a network namespace is on par
with what Cisco calls a VDC, virtual device context, on its Nexus 7000
switches -- logical separation of the device at the front panel port
level).
Layering VRFs (L3 separation) within a network namespace (L1 separation)
provides some nice end-user features.
>
> Implementation:
> - Add a VRF-id to every affected data structure.
>
> This implies that you see the current implementation of network
> namespaces as space inefficient, and that you think you can remove
> this inefficiency by simply removing the settings and the associated
> proc files.
Not exactly. I see the current namespace implementation as an excellent
L1 separation construct, but not an L3 construct.
>
> Given that you have chosen to keep the same semantics as network
> namespaces for everything except for settings, this raises the questions:
> - Are the settings and their associated proc files what actually cause
> the size cost you see in network namespaces?
> - Can we, instead of reimplementing network namespaces, optimize
> the current implementation?
The namespace memory consumption is a side problem to the bigger
problem of how the isolation of namespaces affects processes (the need
to have a presence in each namespace).
What I was targeting is a trade-off. To make an L3 separation efficient
from a large scale* perspective one needs to give up something -- here
it is per-VRF procfs settings. Replicating the procfs tree for each
namespace does have a high cost.
* scale here meaning VRFs from 1 to N, where N for current products
goes up to 4000, though I know of one case that has mentioned 16k VRFs.
>
> We need measurements to answer either of those questions and I think
> before proceeding we need to answer those questions.
Agreed.
> Beyond that I want to point out that in general a data structure that
> has a tag on every member is going to have a larger memory foot print
> per entry, contain more entries, and by virtue of both of those be less
> memory efficient and less time efficient to use. So it is not clear
> that an implementation that tags everything with a vrf-id will actually
> be lighter weight.
The memory hit for a network namespace is >100k (yes, it is CONFIG
option dependent, and that 100k is based on a 3.10 kernel, which is
higher than what was measured for a 3.4 kernel).
This proposal adds a 4-byte tag to netdevices, sockets and tasks (skb's
are out per a prior email). So yes, there will be a point where the
number of netdevices (logical + physical), plus tasks and sockets, makes
the memory hit of a VRF tag on par with the namespace overhead. But the
VRF tagging alleviates the need to replicate processes/multiple
sockets/threads, so in the big picture I can't see how the overall hit
to memory is higher with a VRF id tag.
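To put the per-object cost in perspective, the tag is a single __u32
alongside the existing struct net reference -- something like the
following (field names here are illustrative, not necessarily what the
patches use):

struct sock {
        ...
        __u32           sk_vrf;         /* VRF id for this socket */
        ...
};

struct task_struct {
        ...
        __u32           vrf;            /* the task's VRF context */
        ...
};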
>
> Also there is a concern that placing tags in every data structure may be
> significantly more error prone to implement (as it is one more thing to
> keep track of), and that can impact the maintainability and the
> correctness of the code for everyone.
I don't agree with this. You have already done the groundwork here by
plumbing through the namespace checks. Adding a vrf id has not proven to
be a huge problem. The patch changes are highly repetitive precisely
because they leverage the namespace changes you have done.
> The standard that was applied to the network namespace was that
> it did not have any measurable performance impact when enabled. The
> measurements taken at the time did not show a slowdown when a 1Gig
> interface was placed in a network namespace, compared to running an
> unpatched kernel.
Sure. I will build a kernel at the commit id my patches are based on
and one with my changes and do a comparison. Virtual machines on KVM
emphasize the performance effects, so I will compare a few netperf runs
with and without my changes. On a newer 3.x kernel I typically see
network throughput rates in the 15 to 16 Gbps range (though it is H/W
dependent), so this far exceeds the 1G rate. Does that sound reasonable?
>
> I suspect your extra layer of indirection to get to struct net in
> addition to touching struct skb will give you a noticeable performance
> impact.
I don't understand the 'extra layer of indirection' comment. I don't
see the indirection; I see an extra comparison, i.e., going from net_eq
to net_eq + (vrf_eq || vrf_eq_any). A sketch of the combined check
follows the struct comparison below.
From a struct perspective it has gone from:

struct net_device {
        ...
        struct net *nd_net;
        ...
};

to

struct net_ctx {
        struct net *net;
        __u32 vrf;
};

struct net_device {
        ...
        struct net_ctx net_ctx;
        ...
};
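A minimal sketch of what that combined check could look like (VRF_ANY
is an illustrative name for the wildcard id; the helpers correspond to
the vrf_eq/net_ctx_eq comparisons mentioned above):

/* Sketch only: a reserved id acts as the VRF-ANY wildcard. */
#define VRF_ANY         ((__u32)~0)

static inline bool vrf_eq(__u32 a, __u32 b)
{
        return a == b || a == VRF_ANY || b == VRF_ANY;
}

static inline bool net_ctx_eq(const struct net_ctx *a,
                              const struct net_ctx *b)
{
        return net_eq(a->net, b->net) && vrf_eq(a->vrf, b->vrf);
}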
>
>
> I have another concern. I don't think it is wise to have a data
> structure modified two different ways to deal with network namespaces
> and vrfs. For maintainability and our own sanity we should pick the
> version that we judge to be the most efficient implementation and go
> with it.
You lost me. What data structure is modified in two different ways?
VRFs are a sub-context of a namespace.
>
>
>
> The architecture I imagine for using network namespaces as vrfs on
> devices that perform layer 2 bridging and layer 3 routing is as follows.
>
> port1 port2 port3 port4 port5 port6 port7 port8 port9 port10
> | | | | | | | | | |
> +-----+-----+-----+-----+-----+-----+-----+-----+-----+
> / Link Aggregation \
> + +
> | Bridging |
> +----------------------------+----------------------------+
> |
> cpu port
> |
> +---------------------+---------------------+
> / +---------------/ \---------------+ \
> / / +---------/ \---------+ \ \
> / / / +---/ \---+ \ \ \
> / / / / | | \ \ \ \
> | | | | | | | | | |
> vlan1 vlan2 vlan3 vlan4 vlan5 vlan6 vlan7 vlan8 vlan9 vlan10
> | | | | | | | | | |
> +-+-----+-----+-----+-----+-+ +-+-----+-----+-----+-----+-+
> | network namespace 1 | | network namespace2 |
> +---------------------------+ +---------------------------+
>
> Traffic to and from the rest of the world comes through the
> external ports.
>
> The traffic is then processed at layer two including link
> aggregation, bridging and classifying which vlan the traffic
> belongs in.
>
> If the traffic needs to be routed it then comes up to the cpu port.
> The cpu port looks at the tags on the traffic and places it into
> the appropriate vlan device.
>
> From the various vlans the traffic is then routed according
> to the routing table of whichever network namespace the vlan
> device is in.
>
> There are stateless offloads to this in modern hardware but this is a
> reasonable model how all of this works semantically.
>
> As such the vlan devices can be moved between network namespaces without
> affecting any layer two monitoring or work that happens on the lower
> level devices. The practical restriction is that L2 and L3 need to be
> handled on different network devices.
>
> This split of network devices ensures that L2 code that works today
> should not need any changes or in any way be concerned about network
> namespaces or which namespace the parent devices are in.
9+ months ago I had considered something similar. I'll try to amend your
picture to show my concept:
port1 port2 port3 port4 port5 port6 port7 port8 port9 port10
| | | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
/ Link Aggregation \
+ +
| Bridging |
+-----------------------------+---------------------------+
|
+-----------------------------+----------------------------+
|default namespace | |
| (init_net) NIC driver |
| | |
| +----+----+-----+-----+-------+----+----+----+----+ |
| eth1 eth2 eth3 eth4 eth5 eth6 eth7 eth8 eth9 eth10 |
| | | | | | | | | | | |
+-----------------------------+----------------------------+
| | | | | | | | | |
+--+----+----+-----+-----+---+ +--+----+----+----+----+----+
| | | | | | | | | | | | | |
| seth1 | seth3 | seth5 | | seth6 | seth8 | seth10|
| seth2 seth4 | | seth7 seth9 |
| | | |
| network namespace 1 | | network namespace 2 |
+----------------------------+ +---------------------------+
Essentially, netdevices for the front panel ports exist in the default
namespace (init_net). L2 processes, monitoring processes (collectd, snmp
agents for the device, etc.) and such would run there. From there,
"shadow devices" (the 's' on the eth pairs) are created for the
namespaces, where the path between the real and shadow device is similar
to how veth pairs work. In the end this approach seemed to be a rather
complex solution playing a lot of games, so I abandoned it in favor of
the approach in this patch set -- adding a VRF id to a network context.
The patch diff might be large but almost all of it is converting the
existing struct net passing and net_eq checks to the broader struct
net_ctx and net_ctx_eq comparisons. Really the change rides on top of
what you have done for namespaces.
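For example, a typical call-site conversion in the diff looks roughly
like this (dev_net_ctx() is an illustrative accessor name for the
sketch):

        /* before: namespace-only check */
        if (!net_eq(dev_net(dev), net))
                continue;

        /* after: namespace + VRF check */
        if (!net_ctx_eq(dev_net_ctx(dev), &ctx))
                continue;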
As for your proposal with VLAN-based tagging, I do not understand the
packet path from the driver for the front panel ports to the
namespace-based netdevices. Is the VLAN sorting, and hence the VRF
sorting, done in H/W? So there are netdevices in init_net that the
driver uses and then VLAN devices in the namespaces -- would those
correspond to what I called a shadow device? And if the packets are also
VLAN tagged, we have nested tagging -- one tag for the port and one for
the VRF?
Also, doesn't the VLAN design limit the number of VRFs to 4096? My
current patch set might limit it to 4096, but fix the genid piece (IPv6
seems to have removed the genid comparisons between 3.17 and 3.19 --
need to look into that) and it becomes a 32-bit tag, which is a huge
range for VRFs.
David