Date:	Tue, 13 Dec 2011 18:10:07 -0500
From:	Julien Vehent <julien@...uxwall.info>
To:	"John A. Sullivan III" <jsullivan@...nsourcedevel.com>
Cc:	Eric Dumazet <eric.dumazet@...il.com>, <netdev@...r.kernel.org>
Subject: Re: [WARNING - VERY LONG] A sysadmin's understanding of HFSC and IFB

This is an amazing email and I will definitely sit down and read it 
carefully.
I've been (silently) following your discussions about HFSC because I've been 
wanting to add a section on HFSC in the TC doc I wrote here:
http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:networking:traffic_control

The doc I've read so far left me staring at the wall with more questions 
than answers. I'm hoping your text will finally bring a clear view of the 
problem :)

Cheers,
Julien

On 2011-12-13 18:01, John A. Sullivan III wrote:
> On Fri, 2011-12-09 at 08:59 +0100, Eric Dumazet wrote:
>> On Friday, 09 December 2011 at 02:17 -0500, John A. Sullivan III
>> wrote:
>> > Hello, all.  I've been trying to summarize my last week's worth of
>> > research into HFSC and IFB.  Being neither a developer nor a
>> > mathematician, I found most of the existing documentation daunting and
>> > am truly grateful for the help I received on this list to understand 
>> the
>> > technologies.
>> >
>> > I do not currently have a blog to post this research and so was 
>> thinking
>> > of posting it to this list so that it could be archived and searchable
>> > for other poor sysadmin types like me to save them the days of
>> > struggling to get their practical heads around the concepts.
>> >
>> > However, since this is probably a 15 page or so document, I realized
>> > this could be quite rude to those on dial up or metered connections.
>> > Would it be appropriate to post my summary here? Thanks - John
>> >
>>
>> Please post here, that's definitely a good idea.
>>
>> I am pretty sure your contribution could serve to add a page to
>>
>> http://www.linuxfoundation.org/collaborate/workgroups/networking/group
>>
>> I believe we lack documentation. A lot.
> <snip>
> OK - it's almost ready.  The only problem I know of is that, when I do
> the math based upon my explanation, I sometimes violate an rt curve when
> I should not.  I posed that in an email to the list entitled "An error
> in my HFSC sysadmin documentation".  If anyone can figure out where my
> error is, I will gladly change the documentation.  Here's what I have.
> The graphics except the ASCII art are obviously missing.  All
> corrections and suggestions very gratefully accepted - John
>
>
> IFB
> The Intermediate Functional Block (IFB) device is a pseudo-interface to
> which traffic can be redirected by using a tc filter with the mirred
> (mirror/redirect I assume) action.  The primary documentation appears to
> be at
> http://www.linuxfoundation.org/collaborate/workgroups/networking/ifb but
> I found it less than helpful.  Perhaps the most helpful information was
> found in a discussion on the mirred tc filter action at
> http://www.sephidev.net/external/iproute2/doc/actions/mirred-usage and
> particularly this section:
> Remember that IFB is a very specialized case of packet redirecting
> device. Instead of redirecting it puts packets at the exact spot
> on the stack it found them from.
> In other words, we redirect or mirror traffic from an interface into the
> ifb interface and, when ifb is finished with it, it puts it right back
> onto the interface from which it was redirected.  Why would one want to
> do that?
> Diagnostics are one important reason, e.g., to capture dropped packets.
> Packets may be dropped on the interface either because of an iptables
> rule or a traffic limiting rule.  The packets can be mirrored onto an
> ifb interface which will see them before they are dropped allowing
> packet analysis on them.  Another would be in a Linux based switch to
> port mirror traffic from one port to another.
> Another reason is for applying identical traffic shaping rules to
> multiple interfaces.  Let's say one is using complicated traffic shaping
> rules with several qdiscs (queue disciplines), classes, and filters and
> the same set applies to many interfaces.  Rather than defining and
> maintaining this complex set for multiple interfaces, one could set a
> filter for each of those interfaces to redirect into an ifb interface
> and apply the complex traffic shaping there.
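> As a rough sketch (interface names, handles, and rates here are mine,
> purely for illustration), redirecting the egress of two interfaces
> through a single ifb might look like this:
>
> modprobe ifb numifbs=1
> ip link set dev ifb0 up
> # trivial classful roots on the real interfaces, just to hook filters
> tc qdisc add dev eth0 root handle 1: prio
> tc qdisc add dev eth1 root handle 1: prio
> # push everything leaving both interfaces through ifb0
> tc filter add dev eth0 parent 1: protocol ip u32 match u32 0 0 \
>   action mirred egress redirect dev ifb0
> tc filter add dev eth1 parent 1: protocol ip u32 match u32 0 0 \
>   action mirred egress redirect dev ifb0
> # the single, shared shaping configuration lives on ifb0
> tc qdisc add dev ifb0 root handle 2: htb default 20
> tc class add dev ifb0 parent 2: classid 2:20 htb rate 5mbit
>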
> A final reason is both complex and critically important, especially for
> SimplicITy and Firepipes (or anyone using VDI across the Internet):
> ingress traffic shaping.  Normal Linux ingress traffic shaping can only
> police.  Policing cannot really shape packets.  It can edit them, drop
> them, pass them, or do nothing.  One of the important goals of either
> policing or shaping ingress traffic is to give us a fighting chance to
> shape inbound traffic from the Internet when we have no control over the
> upstream router.
> For example, we may prioritize NX/RDP/Citrix and VoIP traffic outbound
> to the Internet so that it is prioritized over web browsing and sending
> huge email attachments or FTP transfers.  But, if someone starts pulling
> down a huge file and consumes all the inbound bandwidth, our desktop
> sessions can slow to a crawl and our VoIP calls become unintelligible.
> We might face a similar problem if we host our own web servers; our web
> servers could slow to a crawl in the public perception if their ability
> to respond to queries is being blocked by monster downloads into our
> network.  How can we keep the bulk traffic internal users are
> downloading from mangling our time critical traffic?
> The basic idea is to use the flow control mechanisms of the traffic
> flow itself.  This is built into any TCP transmission and may or may not be in the
> upper layers of a UDP stream.  If we start dropping inbound packets, the
> other side should get the point and start throttling its flow.
> The recommended way to do this is to place an egress filter on the
> internal interface of the firewall, i.e., we only let the firewall place
> traffic on the internal network at a particular rate.  Egress traffic
> shaping can be very powerful so we can use something like HTB
> (Hierarchical Token Bucket) to reserve bandwidth for time sensitive
> traffic but allow that bandwidth to be used by bulk traffic when it is
> not needed for time sensitive traffic.  We can also distinguish between well
> behaved traffic (like most TCP traffic) and most poorly behaved traffic
> which won't throttle (like most UDP traffic) and not allow UDP traffic
> to use extra bandwidth if available.
> This is fine if the firewall has only a single internal interface.  The
> bulk traffic can only be output at the traffic shaping speed and, when
> the inbound buffer overflows, packets will drop and the well behaved
> application on the other end should throttle.  But what if we have a DMZ
> or, as we often do with Firepipes, multiple internal interfaces?
> Throttling the egress of the internal interface would also throttle
> traffic flowing to it from the DMZ or other internal interfaces; that
> would be very bad.
> In these cases, it is better to create an ingress filter, i.e., filter
> the traffic coming in off the Internet.  As mentioned, an ingress filter
> can only police so the only way we can throttle download traffic is to
> drop the packets.  We could create several bands to reserve bandwidth
> for different types of traffic and drop packets which exceed those
> bandwidth limits but we have no ability to do complex traffic shaping
> such as allowing bulk traffic to use available bandwidth from higher
> priority traffic if it is available.
> This is where IFB comes into play.  We can redirect the inbound traffic
> into an IFB interface and then apply an egress filter with full traffic
> shaping to the IFB.  This will shape the traffic before it is placed
> back onto the inbound interface.
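> Here is a minimal sketch of that recipe (interface names and rates are
> invented; eth0 is assumed to be the Internet-facing interface):
>
> modprobe ifb numifbs=1
> ip link set dev ifb0 up
> # attach the ingress qdisc to the inbound interface
> tc qdisc add dev eth0 handle ffff: ingress
> # redirect everything arriving on eth0 into ifb0
> tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
>   action mirred egress redirect dev ifb0
> # full egress shaping on ifb0 (HTB shown; HFSC attaches the same way)
> tc qdisc add dev ifb0 root handle 1: htb default 30
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10mbit
> tc class add dev ifb0 parent 1:1 classid 1:10 htb rate 4mbit ceil 10mbit
> tc class add dev ifb0 parent 1:1 classid 1:30 htb rate 6mbit ceil 10mbit
>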
> This would allow us to use something like HTB on the inbound traffic so
> that we could reserve bandwidth for time sensitive traffic, limit
> bandwidth for bulk traffic, yet allow the bulk traffic to use the time
> sensitive bandwidth when it is available.  But what happens if the bulk
> traffic is using the bandwidth and a time sensitive packet arrives? It
> is now queued behind the bulk packets and can suffer intolerable delay.
> This is where HFSC steps in.
> HFSC
> HFSC (Hierarchical Fair Service Curve) is an incredibly powerful but,
> again, poorly documented qdisc.  Its outstanding feature over qdiscs
> like HTB is that it separates latency guarantees from bandwidth
> guarantees.  In other queue disciplines, the only way to guarantee lower
> latency is to allocate more bandwidth.  This is counter productive with
> technologies such as VoIP which require low latency but send very small
> packets and typically require far less bandwidth per flow than bulk
> applications.  HFSC has three goals:
> 1. Decouple latency requirements from bandwidth requirements
> 2. Provide guaranteed bandwidth when specified
> 3. Fairly share available extra bandwidth
> It recognizes it cannot always achieve all three goals and will preserve
> guaranteed bandwidth over fair sharing when there is a conflict.
> The paper describing HFSC is at
> http://www.cs.cmu.edu/~istoica/hfsc_extended.ps.gz (with a note that
> says, "You might need to treat it first with: sed "s|\[FontBBox
> \]|/FontBBox load |" as it was generated with a little bit old (ancient)
> idraw version." It is highly technical and mathematical but has some
> excellent graphics to illustrate the principles.
> A related site is http://www.cs.cmu.edu/~hzhang/HFSC/main.html
> The latest man pages are a little less mathematical but still pretty
> dense with the references to functions to describe the service curves.
> I've downloaded the latest man pages
> 
> (http://git.kernel.org/?p=linux/kernel/git/shemminger/iproute2.git;a=blob_plain;f=man/man7/tc-hfsc.7;hb=HEAD
> and
> 
> http://git.kernel.org/?p=linux/kernel/git/shemminger/iproute2.git;a=blob_plain;f=man/man8/tc-hfsc.8;hb=HEAD)
> to Tech/Equipment/Endian; they can be read with:
> nroff -mandoc -rLL=<width>n <page> | less
> A commonly referenced resource is http://linux-ip.net/articles/hfsc.en/
> I also found http://www.sonycsl.co.jp/~kjc/software/TIPS.txt helpful.
> However, many articles attempting to explain HFSC didn't seem to really
> get it - especially the part about decoupling latency and bandwidth
> requirements.  It is all quite confusing and not well understood.  In
> fact, as I'm thinking about how to record what I have learned, I'm
> really struggling because there is no simple way to describe HFSC!
> Let's break down the terms.  It is hierarchical because bandwidth can be
> allocated and controlled on a hierarchical basis.  For example, the top
> level of the hierarchy might be our overall available bandwidth.  The
> next level of the hierarchy might be gold level customers who get 70% of
> the available bandwidth and regular customers who get 30%.  The next
> level might be individual customers or groups of customers who divide
> the 70% or 30% allocated to their parent.  The final hierarchy might be
> groups of traffic like VoIP, interactive, web, or bulk.  These end nodes
> are called leaf nodes.  There is an important distinction between leaf
> nodes and container nodes.  Bandwidth guarantees are only made at the
> leaf node level.  Containers are only used for link sharing, i.e., how
> excess bandwidth is shared when it is available.  Remember that we said
> HFSC enforces guarantees even if it means bandwidth is not shared
> evenly.  Bandwidth guarantees are always honored (if physically
> possible) regardless of the bandwidth allocated to the higher layers of
> the hierarchy.  Thus one must be careful to not overallocate bandwidth
> at the leaves.
> Moreover, the fair sharing of bandwidth only happens between peers.  In
> other words, using our example, only two peers contend for the overall
> bandwidth - Gold and Regular.  If the second level hierarchy of Gold has
> clients A, B, and C and the second level of Regular has D and E, D and E
> contend for Regular's share of the bandwidth and A, B, and C contend for
> Gold's share of the bandwidth but, for example, D never contends with B
> because they are in different parts of the tree.  Only peers contend for
> available bandwidth.
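> To make "only peers contend" concrete, here is how that tree might be
> built with tc (all class IDs and rates are invented for illustration):
>
> tc qdisc add dev eth0 root handle 1: hfsc default 22
> tc class add dev eth0 parent 1: classid 1:1 hfsc ls rate 100mbit ul rate 100mbit
> # containers - link sharing only, no guarantees are made here
> tc class add dev eth0 parent 1:1 classid 1:10 hfsc ls rate 70mbit   # Gold
> tc class add dev eth0 parent 1:1 classid 1:20 hfsc ls rate 30mbit   # Regular
> # leaves - A, B, and C contend only for Gold's share; D and E for Regular's
> tc class add dev eth0 parent 1:10 classid 1:11 hfsc ls rate 30mbit  # A
> tc class add dev eth0 parent 1:10 classid 1:12 hfsc ls rate 20mbit  # B
> tc class add dev eth0 parent 1:10 classid 1:13 hfsc ls rate 20mbit  # C
> tc class add dev eth0 parent 1:20 classid 1:21 hfsc ls rate 15mbit  # D
> tc class add dev eth0 parent 1:20 classid 1:22 hfsc ls rate 15mbit  # E
>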
> So that explains "hierarchical."  What do we mean by fair? This refers
> to how available extra bandwidth is allocated between those groups which
> need it.  Let's say that we have 100Mbps of bandwidth divided into four
> groups of traffic - Gold (40%), Silver (30%), Bronze (20%), and Regular
> (10%).  Let's say that no Gold or Bronze users are active so 60% of our
> bandwidth is unallocated but both Silver and Regular could use as much
> bandwidth as we can give them.  How do we fairly share that 60% between
> Silver and Regular? HFSC uses a concept called virtual time to which we
> will return later.  In effect, it will divide the 60% in proportion to
> the traffic allocated to each group asking for more.  Specifically, in
> this case, three shares of bandwidth will be allocated to Silver for
> every one allocated to Regular (30:10).
> This is another area where HFSC is brilliant over other queue
> disciplines.  For example, what if we have a stream of traffic that
> doesn't require much bandwidth but, when it needs more, we need to
> prioritize it over bulk traffic? If our service guarantees are that VoIP
> gets a guaranteed 10% of the bandwidth, web traffic gets 30% and bulk
> gets 60%, when web is inactive and both VoIP and bulk need more, the
> excess would be allocated to bulk in a 6:1 ratio.  That's not good.  But
> remember we said that guaranteeing bandwidth and sharing bandwidth are
> two separate and sometimes conflicting functions in HFSC.  Thus, we can
> say that VoIP is guaranteed 10% but, when it comes to sharing available
> bandwidth, VoIP should get 70%, web gets 20%, and bulk gets 10%.
> Brilliant! This is done by means of service curves, the last part of
> Hierarchical Fair Service Curve traffic shaping.
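> As a quick sketch of that decoupling in tc terms, scaling the
> percentages to a hypothetical 10 Mbps link (class IDs invented):
>
> tc class add dev eth0 parent 1:1 classid 1:10 hfsc rt rate 1mbit ls rate 7mbit  # VoIP
> tc class add dev eth0 parent 1:1 classid 1:11 hfsc rt rate 3mbit ls rate 2mbit  # web
> tc class add dev eth0 parent 1:1 classid 1:12 hfsc rt rate 6mbit ls rate 1mbit  # bulk
>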
> A service curve is a mathematical way of describing how much service a
> process is given over time.  The use of the term curve can throw us
> non-mathematicians for a "curve" until we remember that, in strict
> mathematical terms, a straight line is a curve - a curve which
> represents a linear function.  To translate all that: a service curve is
> the allocated bandwidth.  The y axis is the amount of data transferred
> (service given) and the x axis is time.  The greater the bandwidth, the
> steeper the slope of the curve (line).  Another brilliant thing about
> HFSC is that a service curve can have two lines, i.e., for a certain
> period of time one bandwidth can be allocated and after that time
> period, a different bandwidth can be allocated.  I'll refer to this as a
> "di-linear" curve.  This is how HFSC decouples bandwidth from latency
> but we'll come back to that.
> A service curve can be defined using tc with a single parameter if we
> only need one line or with three parameters if we need two lines.  There
> are two ways the parameters can be given to tc.  The mathematically
> oriented way is "m1 <value> d <value> m2 <value>" where m1 is the slope
> of the first curve (line for the non-mathematicians - translate: the
> initial bandwidth), m2 is the slope of the second curve (translate: the
> second bandwidth setting) and d is the intersection of the two curves
> (translate: how long we use the first bandwidth before switching to the
> second bandwidth).
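> For example (values invented), an rt curve guaranteeing 2 Mbps for the
> first 20ms of backlog and 1 Mbps thereafter might be written:
>
> tc class add dev eth0 parent 1:1 classid 1:10 hfsc \
>   rt m1 2048kbit d 20ms m2 1024kbit
>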
> The second, more system administrator oriented syntax is "umax <value>
> dmax <value> rate <value>".  This syntax makes it a little easier to see
> how HFSC separates guarantees for latency from guarantees for bandwidth.
> The rate parameter is the second curve.  In practical terms, this is the
> guaranteed bandwidth.  The umax parameter is the size of some unit of
> data.  It will often be the size of a data packet, e.g., 1514 bytes
> (does not need to include CRC, preamble, or inter-frame gap) but it
> could be some other unit.  For example, some have suggested that one
> could use this so that the text portion of web pages are delivered with
> low latency and the rest of the non-text, typically larger data
> transmissions, will follow at a slower pace.  If the typical text page
> size is 10KB, we might set umax to 10KB. Of course, this assumes that
> text is always sent first. The dmax parameter is the maximum delay that
> umax can tolerate.  So we might say that the first 1514 byte packet must
> be delayed no more than 20ms, or the first 10KB of web page transmission
> should be delayed no more than 100ms.  This is how bandwidth and latency
> guarantees are separated.
> Let's illustrate this a little further.  Let's say we have an FTP
> transfer which is running as fast as its guaranteed bandwidth.  We also
> have a VoIP traffic stream also running within its guaranteed bandwidth.
> The classic problem is that my VoIP packet which must be delivered on
> time is sitting behind my much larger FTP packet which is allowed to be
> sent because it is within the guaranteed bandwidth for FTP.  How do I
> jump the VoIP packet ahead of the FTP packet?
> HFSC will see that both packets are eligible to be sent because both are
> within the bandwidth limits set for their kind of traffic.  It then sees
> that the FTP traffic only has a bandwidth guarantee, i.e., it can meet
> its obligations to the FTP traffic as long as the packet is delivered
> within the time frame necessary to meet the bandwidth requirements
> regardless of how much the packet is delayed (as long as the delay is
> not so long as to violate the bandwidth requirements).  It then looks at
> the equally eligible VoIP packet and sees that, not only does it have an
> obligation for bandwidth but it also must not delay this packet for more
> than, say, 20ms.  In effect, it looks at the deadline for delivering the
> FTP packet and compares it to the deadline for delivering the VoIP
> packet and if the deadline for the VoIP packet is sooner than the
> deadline for the FTP packet, it will service the VoIP packet first.
> But how, exactly, does it work? It is done by the interplay of service
> curves and timers.  In fact, we will come to see that eligible and
> deadline are actually technical terms to describe some of the timers but
> first, let's step back and explain a little more about service curves.
> Service curves
> There are actually three service curves used by HFSC.  They are:
> 1. The real-time service curve (rt)
> 2. The link-sharing service curve (ls)
> 3. The upper limit service curve (ul)
> The abbreviations are used to specify which service curve we are
> describing in the tc syntax, e.g.,
> tc class add dev <device> parent parentID classid <ID> hfsc [ [ rt SC ]
> [ ls SC ] | [ sc SC ] ] [ ul SC ]
> SC is the description of the curve as described above using m1 d m2 or
> umax dmax rate, e.g.,
> tc class add dev <device> parent parentID classid <ID> hfsc rt umax
> 1514b dmax 20ms rate 200kbit
> If we are not using two curves, we only need the last (rate) parameter,
> e.g.,
> tc class add dev <device> parent parentID classid <ID> hfsc ls rate
> 500kbit
> The sc parameter is used when rt and ls are the same, i.e., one can
> specify them both in one definition rather than needing to write two.
> It is not usual to specify all three curves in the same class.  The rt
> parameter is only used on leaves.  The ul parameter is usually not used
> on leaves.
> Real-time service curve
> Real-time service curves can only be applied to leaf nodes (remember
> this is Hierarchical Fair Service Curve).  Using our previous example,
> an rt curve would not be applied to "Gold Clients" but to "Web traffic".
> The rt curves can have either one curve (line) or two.  In fact, it is
> usually only rt curves which have two curves; ls and ul curves generally
> do not.  If the first curve is steeper than the second curve
> (translated: if the initial bandwidth guarantee is greater than the
> sustained bandwidth guarantee), the curve is called concave in HFSC
> terminology.  If the opposite is true, it is called convex.  This is
> illustrated by the below graphic copied from
> http://www.cs.cmu.edu/~hzhang/HFSC/main.html
> One must be careful when specifying rt curves because they will be
> honored above all other curves (remember we said that, when it is
> impossible to satisfy the requirements for both guaranteed bandwidth and
> fair sharing, guaranteed bandwidth wins).  This will be more
> understandable when we discuss the upper limit (ul) curve.  To jump
> ahead a little bit, the ul is typically though not exclusively used to
> describe the maximum bandwidth available, e.g., a T1 circuit or a cable
> connection.  We said that rt is honored above all other curves.  If the
> ul curve says the maximum bandwidth available is 10 Mbps and the sum of
> all our rt curves is 20 Mbps, the system will try to guarantee 20 Mbps
> which, of course, means it will fail to meet its guarantees under
> maximum load.
> Jumping ahead once more, rt curves override ls curves, too.  The ls
> curves are used to share extra bandwidth fairly.  So, if the ls curve
> dictates that class A should have 2 Mbps of our 10 Mbps link and class B
> should have 8 Mbps but the rt curve says that class A is guaranteed 3
> Mbps, class A will get 3 Mbps and class B only 7 Mbps.  The rt curve
> will always be honored whenever it is physically possible.  The bottom
> line is that the rt service curve always "wins."
> An important point to remember is that the rt curve determines the
> bandwidth and latency guarantees, is the only curve which determines
> bandwidth and latency guarantees, and has nothing to do with how
> bandwidth is fairly shared (other than to override that sharing if
> necessary!).  To put it another way, the function of the rt service
> curve is independent of the ul and ls service curves.  They handle two
> completely separate characteristics of HFSC.
> Upper limit service curve
> As mentioned above, the ul curve is often used to describe the maximum
> bandwidth available.  More specifically, it sets an upper boundary to
> the bandwidth which can be allocated by the ls curve.  As we will see,
> the ls curve is used to determine how the excess available bandwidth is
> fairly shared.  The ul service curve sets limits on that sharing.
> Let's get some vocabulary defined.  Link sharing is the term used by
> HFSC to describe how the bandwidth not actively used by the rt service
> curve is fairly allocated among the rest of the network sessions which
> are requesting bandwidth.  Let's connect the dots again as a way of
> review.  Remember, the rt curve defines the guaranteed amounts of
> bandwidth.  Let's go back to our 10 Mbps circuit and say we have VoIP
> with an rt service curve rate of 1 Mbps, interactive traffic with an rt
> service curve rate of 5 Mbps, web traffic with an rt service curve rate
> of 2 Mbps, and bulk traffic with an rt service curve rate of 1 Mbps.
> Note that the sum of the rt curves does not need to equal the total
> bandwidth.  Now let's say it's the middle of the night so the phones are
> not in use and no one is generating interactive traffic however web
> traffic to our internal web servers is going at full speed as are our
> overnight off site backups and FTP transfers.  Web is guaranteed 2 Mbps,
> bulk is guaranteed 1 Mbps and link sharing will figure out how to divide
> up the remaining 7 Mbps.  However, the ul curve could override the way
> link sharing divides that 7 Mbps.
> Let's illustrate how and learn more about ul at the same time.  The ul
> service curve is normally specified at the top of the hierarchy and is
> set to the maximum bandwidth available.  Think way back to when we said
> that link sharing only happened between peers.  Classes lower in the
> hierarchy can only contend for the bandwidth allocated to their parent
> and not for bandwidth allocated to classes at the same level but under a
> different parent.  Once again, I will copy a graphic from
> http://www.cs.cmu.edu/~hzhang/HFSC/main.html to illustrate this:
> Since children can only contend for the bandwidth allocated to their
> parents, the ul limit on bandwidth flows down through the entire tree
> under it.  If the parent with the ul limitation cannot receive the
> bandwidth, its children cannot contend for it.  Thus, by placing the ul
> at the top of the hierarchy and setting it to the maximum bandwidth
> available, it flows down through the entire tree and we do not need to
> define it in every descendant class.  That is, unless we want to
> override it further down the tree.
> Let's go back to our illustration.  If we had set ul rate 10240kbit (10
> Mbps), the web traffic and the bulk traffic will be sharing the 7 Mbps
> available bandwidth in whatever ratio was specified in their ls service
> curves.  If no ls curves were specified, they will divide the bandwidth
> according to their rt curves - in this case a 2:1 split.  For now, let's
> assume that the ls curve is the same as the rt curve, that is 2 Mbps for
> web and 1 Mbps for bulk.  Remember that the ul curve sets a limit on the
> ls curve.  What if, for some reason, we wanted to make sure that bulk
> traffic never exceeds 3 Mbps even if there is more available?  We would
> create a second ul service curve on the bulk traffic class with ul rate
> 3072kbit.
> Link sharing will try to allocate an extra 2.33 Mbps to bulk traffic
> (one third of the 7 Mbps).  When this link sharing bandwidth is added to
> the guaranteed bandwidth of 1 Mbps, we come out to 3.33 Mbps.  Since
> this exceeds
> the ul service curve set specifically for bulk traffic, link sharing
> will not be able to service bulk traffic at 3.33 Mbps.  It will only be
> able to provide service at 3 Mbps.  The rest will go to the web service.
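> In tc terms, that might look like the following sketch (class IDs are
> invented; only the rates come from the example):
>
> # top of the hierarchy: cap all link sharing at the 10 Mbps circuit
> tc class add dev eth0 parent 1: classid 1:1 hfsc ls rate 10mbit ul rate 10mbit
> # bulk leaf: guaranteed 1 Mbps, but link sharing may never lift it past 3 Mbps
> tc class add dev eth0 parent 1:1 classid 1:13 hfsc \
>   rt rate 1mbit ls rate 1mbit ul rate 3mbit
>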
> So, to summarize, the ul service curve is not used to guarantee
> bandwidth or latency.  That is the job of the rt service curve.
> Allocated bandwidth can exceed the ul service curve if the sum of the rt
> service curves is greater than the ul service curve and the circuit can
> support the higher rt bandwidth allocation.  This is another way of
> saying the rt guaranteed bandwidth always takes precedence.  The ul
> service curve is used to determine the link sharing bandwidth, i.e., the
> bandwidth allocated to classes of traffic from the unused portion of the
> available bandwidth.  It is specifically used to limit how much
> bandwidth can be allocated through link sharing, is typically set to the
> maximum bandwidth of the circuit, and is normally set at the top of the
> hierarchy because its effect flows down to all classes underneath
> it.
> Link-sharing service curve
> The ls service curve is not used to guarantee bandwidth or latency.  Its
> sole purpose is to determine the proportions in which unused bandwidth
> is allocated to classes which are requesting more than their guaranteed
> bandwidth, that is link sharing.  Although it is good practice to use
> service curves with numbers that are realistic for the available
> bandwidth, technically, only the ratio is important.
> Let's illustrate by returning to our previous example.  The ul is set to
> a rate of 10 Mbps.  We would expect the web ls service curve to use a
> rate of 2 Mbps and the bulk ls service curve to use a rate of 1 Mbps.
> However, we could say the web ls sc is 200 Mbps and the bulk ls sc is
> 100 Mbps and it would not break and not be illegal - just confusing.
> The ratios are the same (2:1) and it is only the ratios which are
> important.
> The discussion about ul service curves has "stolen the thunder" from the
> ls service curve discussion, i.e., there is not much left to say.  The
> ls service curves determine how link sharing works, i.e., how the
> bandwidth which is not being used to satisfy the guarantees of the rt
> service curves is fairly allocated among all the other network traffic
> flows that would like more bandwidth than they are guaranteed.
> The Linux and BSD implementations of HFSC go beyond the original
> specification by allowing us to specify the ls service curve separately
> from the rt service curve.  This gives us some powerful flexibility as
> we are about to illustrate.
> Let's extend our previous example.  Remember we have a 10 Mbps circuit
> and thus have set the ul service curve rate at the top of the hierarchy
> to 10 Mbps.  We have four classes of traffic, VoIP, interactive, web,
> and bulk.  The rt rate for VoIP is 1 Mbps, for interactive 5 Mbps, for
> web 2 Mbps, and for bulk 1 Mbps.  That accurately reflects our desire
> for guaranteed bandwidth but those proportions may not accurately
> reflect how we want any available excess bandwidth distributed.
> For our purposes, link sharing should have the following
> characteristics.  Bulk should be able to use all the bandwidth if it is
> available and uncontended.  However, if the VoIP traffic needs extra
> bandwidth and there is available bandwidth, it should be granted that
> extra bandwidth with priority since we do not want to starve our voice
> traffic.  Furthermore, we want visitors to our web site to have a
> positive experience and not one with substantial delays so, if there is
> contention between bulk and web traffic, we want web traffic to take
> more than a 2:1 ratio of available bandwidth.  Thus, we might set up our
> tc classes as follows:
> VoIP:
> tc class add dev eth0 parent 1:1 classid 1:10 hfsc rt rate 1024kbit ls
> rate 7168kbit
> We first thought this should be rt umax 222b dmax 5ms rate 1024kbit
> because the maximum sized VoIP packet is typically 222 bytes (ulaw/alaw
> at 64Kbps = 8KB/s * 0.020s (20 ms per packet) = 160 Bytes + 8 (UDP
> header) + 40 (IP header) + 14 (Ethernet header) = 222 Bytes) and we want
> it delayed in the router for no more than 5ms.  However, to guarantee 1
> Mbps bandwidth, a 222 byte packet needs to be sent within roughly 1.7ms
> (222 * 8 = 1776 bits / (1024 * 1024)bits/s = 0.0017s).  Thus, our rate
> is sufficient and we do not need to jump the delivery of VoIP packets by
> specifying a larger initial bandwidth.
> Interactive
> tc class add dev eth0 parent 1:1 classid 1:11 hfsc sc rate 5120kbit
> We use sc since we want rt and ls to be the same.
> Web:
> tc class add dev eth0 parent 1:1 classid 1:12 hfsc rt rate 2048kbit ls
> rate 3072kbit
> Bulk:
> tc class add dev eth0 parent 1:1 classid 1:13 hfsc sc rate 1024kbit
> Note that the sum of our ls service curves exceeds the ul service curve.
> This is not a problem.  Remember it is only the ls service curve ratios
> which are meaningful.  So what have we done? Even though our guarantees
> can be met with a proportion of 1:5:2:1, we have specified that, if
> there is extra bandwidth available and, if there is contention for it,
> distribute it in the ratio of 7:5:3:1.  In other words, if interactive
> and web are idle and VoIP and bulk can use more bandwidth than their
> guarantees (backlogged in HFSC terminology), the bandwidth will be
> allocated between VoIP and bulk at a 7:1 ratio and not a 1:1 ratio.
> Likewise, if VoIP and interactive are idle and web and bulk are
> backlogged, the extra bandwidth will be allocated at a 3:1 ratio rather
> than a 2:1 ratio.  Can we see the value of having separate rt and ls
> service curves?
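> For completeness, here is the whole thing as one runnable sketch.  The
> root qdisc, the top-level class, and the filter are my assumptions (the
> filter match in particular is only an example); the four leaf classes
> are the ones defined above:
>
> tc qdisc add dev eth0 root handle 1: hfsc default 13
> tc class add dev eth0 parent 1: classid 1:1 hfsc ls rate 10mbit ul rate 10mbit
> # VoIP: small guarantee, large share of any excess
> tc class add dev eth0 parent 1:1 classid 1:10 hfsc rt rate 1024kbit ls rate 7168kbit
> # interactive: rt and ls identical, so sc is used
> tc class add dev eth0 parent 1:1 classid 1:11 hfsc sc rate 5120kbit
> # web: modest guarantee, favored 3:1 over bulk when sharing
> tc class add dev eth0 parent 1:1 classid 1:12 hfsc rt rate 2048kbit ls rate 3072kbit
> # bulk: unclassified traffic lands here via "default 13"
> tc class add dev eth0 parent 1:1 classid 1:13 hfsc sc rate 1024kbit
> # example classifier: SIP signalling into the VoIP class
> tc filter add dev eth0 parent 1: protocol ip u32 \
>   match ip dport 5060 0xffff flowid 1:10
>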
> Timers
> However, service curves are not the whole story.  In fact, it is not
> really the service curves which determine which packet is sent when.
> The service curves are used to set the timers and it is the timers which
> determine which packet is sent when.
> While service curves are quite understandable to system administrator
> types once we translate the vocabulary (curves = bandwidth, umax and
> dmax = latency (technically bandwidth, I suppose, but latency in
> practice)), timers are the mysterious art of programmers and
> mathematicians - an impenetrable black box.  Let's see if we can pry
> open that black box and make it at least a little understandable to us
> sysadmin types.
> HFSC uses four separate timers to work its wonders:
> 1. Fit time
> 2. Virtual time
> 3. Eligible time
> 4. Deadline time
> Virtual time
> Let's start with virtual time.  Virtual time is used by link sharing to
> determine which queue should be serviced to keep things fair, that is to
> keep the bandwidth in proportion to the ls service curve ratios.
> Virtual time appears to be a very complicated topic addressing
> mind-bending issues like how to preserve fairness among several
> backlogged queues when a new queue becomes backlogged as well, and how to
> preserve this fairness without starving one of the queues while the
> others catch up. Thus my explanation will be a dramatic
> oversimplification but hopefully enough to give fellow sysadmins a
> toehold to ascend this seemingly sheer rock face.
> Virtual time is a measure of how much time has been spent servicing the
> queue and how much it will take to service the next packet.  Here is a
> very good ascii art representation copied from
> http://www.sonycsl.co.jp/~kjc/software/TIPS.txt :
>                     bytes
>                          |                   /
>                          |                  /service curve
>                          |                 /
>            next   -->+   +----------------+
>            packet    |   |               /|
>            length    |   |              / |
>                      |   |             /  |
>            total --> +   +------------+   |
>            bytes         |           /|   |
>            already       |          / |   |
>            sent          |         /  |   |
>                                   /   |   |
>                                       |   |
>                                       |   |
>                               --------+---+--------------> time
>                                           vt for next packet
>                                       vt for previous packet
>
> Not surprisingly, this makes a great deal of sense when we digest it.
> Remember, the service curve we are using is the ls service curve, i.e.,
> the one for link sharing.  The more the bandwidth allocated to the ls
> service curve, the steeper the slope of the line.  The steeper the slope
> of the line, the shorter the time to transmit the packet = the more
> bandwidth, the shorter the time to transmit the packet.  Thus, the
> change in virtual time for the same sized packet is less for a faster
> queue than for a slower queue.  We'll see why that is important in a
> second.
> Remember that link sharing is used only among peers.  Only peers can
> contend for available extra bandwidth.  Fairness is achieved in HFSC by
> trying to keep the virtual time of all peers equal.  When a queue
> becomes backlogged, i.e., needing more than its guaranteed bandwidth, it
> starts accumulating virtual time.  The more it has been serviced while
> backlogged, the more virtual time it accumulates.
> When HFSC looks at the various backlogged queues, how does it determine
> which one to service? Look again at the above diagram.  Each queue will
> have its current virtual time and the virtual time that it will have
> after it sends its next packet.  HFSC will choose the next packet which
> will have the lowest virtual time.  Let's make this more understandable
> with some real illustrations.  Let's say we have two backlogged queues
> with identical ls service curves and each has a full sized Ethernet
> packet pending.  Queue A has accumulated virtual time of 1000ms and
> Queue B has accumulated virtual time of 999ms.  Let's say that sending
> the full sized Ethernet packet will take 2ms.  The virtual time after
> Queue A sends its packet would be 1002 whereas Queue B would be 1001ms
> so HFSC sends Queue B's packet.  Now Queue B has VT (virtual time) of
> 1001ms and the next packet would bring it to 1003ms.  Queue A has VT of
> 1000ms and sending the next packet would bring it to 1002ms.  HFSC
> chooses the packet which will result in the lowest VT thus it sends the
> packet in Queue A.  See how it tries to keep the virtual times equal and
> thus produces fairness.
> Now let's alter the scenario.  Let's keep the virtual time the same,
> 1000ms for Queue A and 999ms for Queue B but Queue A is handling small
> packets which only take 0.5ms to send.  Queue A has one of these packets
> ready to send while Queue B has one of its full sized packets to send.
> VT for Queue A after sending its packet would be 1000.5ms and for Queue
> B would be 1001ms so, even though Queue A has already received more
> service, it is serviced again because its final virtual time would be
> smaller.
> More importantly, let's go back to the idea that both Queue A and Queue
> B use full sized packets but this time the ls service curves are not
> equal.  Let's say the ls service curve rate for Queue A is 5120kbits and
> for Queue B is 1024kbits.  Thus, a packet that would take Queue B 2ms to
> transmit would take Queue A 0.4ms to transmit.  Both queues are
> continually backlogged so they always have packets to send.  HFSC takes
> a look and calculates the next VT for Queue A as 1000.4 and Queue B as
> 1001 so it sends Queue A.  It calculates again and A is 1000.8 and B
> 1001 so it sends A again.  The next calculation of VT puts A at 1001.2
> and B at 1001 so it sends B.  The next calculation puts A at 1001.2 and
> B at 1003 so A is sent.  Then 1001.6 versus 1003, then 1002 vs. 1003,
> then 1002.4 vs. 1003, then 1002.8 vs. 1003, then 1003.2 vs. 1003.  Can
> you see how the virtual time calculation is allocating service to Queue
> A at a 5:1 ratio with Queue B - just the fairness we requested via the
> ls service curves?
> So, in summary, virtual time is used to produce fairness when distributing
> available excess bandwidth to backlogged queues in the ratio of the ls
> service curves.  It does this by trying to keep virtual times for all
> peers in the hierarchy equal and it does this by always choosing the
> packet to send which will result in the smallest virtual time.
> Fit time
> Fit time is pretty simple to understand.  Recall that we said the ls
> service curve was bounded by the ul service curve, i.e., even if a queue
> should get say 10 Mbps according to link sharing, if the ul (Upper
> Limit) service curve says we can use 8 Mbps maximum, we are only going
> to get 8 Mbps.  Fit time takes into account the ul service curve.
> In effect, fit time looks at what the clock time would be to send the
> packet based upon the ul service curve rate, i.e., what the time would
> be to send the pending packet at the maximum transmission rate.  If that
> time is later than the current clock time, the packet will not be sent
> no matter what virtual time says.  If the current clock time is later
> than fit time, i.e., we would not have exceeded our bandwidth
> limitation, then the decision is made based upon virtual time.
> As a reminder, this is all for traffic over and above the guaranteed
> rate, i.e., the rt service curve rate.  Packets are always sent to
> preserve the rt service curve rate as long as it is physically possible
> regardless of fit time or virtual time.  This is the same as saying that
> the rt service curve always wins or, to phrase it yet another way, to
> meet the HFSC design criterion that, if there is a conflict between link
> sharing and guaranteed bandwidth, choose guaranteed bandwidth.
> Eligible time
> The rt service curve uses two timers: eligible time and deadline time.
> Recall, we referred to eligible and deadline time toward the
> beginning of this discussion when we observed how HFSC provides decoupled
> guarantees for both bandwidth and latency.  Like fit time, eligible time
> is relative to clock time.  In other words, HFSC calculates what the
> clock time would be to send the queued packet at the guaranteed rate,
> i.e., the rt service curve rate.  If that time is later than the current
> clock time, that packet's time has not come yet and it cannot be sent
> based upon its real time guarantee.  I do believe it can be sent based
> upon virtual time assuming that fit time allows it.  After all, that is
> what link sharing is about - being able to exceed the guaranteed
> bandwidth if there is extra bandwidth available.  On the other hand, if
> current clock time is later than eligible time, then this packet had
> better get moving because it is potentially running behind schedule.
> So, eligible time is just as the name implies; it is the time the
> packet becomes eligible for sending based upon the real-time service
> curve, i.e., the guaranteed bandwidth/latency.
> An important point for understanding the difference between eligible
> time and deadline time is that eligible time is measured from the
> beginning or head of the packet, i.e., when is HFSC ready to begin
> sending this packet?
> But what happens when more than one queue has packets whose eligible
> time has come? That's where deadline time comes into play.
> Deadline time
> Deadline time is closely related to eligible time and is likewise
> measured against clock time.  However, deadline time is measured against
> the end or tail of the packet, i.e., by when must we have finished
> sending this packet at the specified bandwidth in order to meet our
> packet delivery guarantees for bandwidth and latency.  Once again,
> http://www.sonycsl.co.jp/~kjc/software/TIPS.txt has an excellent ASCII
> graphic for illustrating the relationship between eligible time measured
> at the beginning of the packet and deadline time measured from the end:
>                     bytes
>                          |                   /
>                          |                  /service curve
>                          |                 /
>            next   -->+   +----------------+
>            packet    |   |               /|
>            length    |   |              / |
>                      |   |             /  |
>       cumulative --> +   +------------+   |
>            bytes         |           /|   |
>            already       |          / |   |
>            sent          |         /  |   |
>                                   /   |   |
>                                       |   |
>                                       |   |
>                               --------+---+--------------> time
>                                 eligible  deadline
>                                 time
>
> This time, the slope is the rt service curve.  The way in which HFSC
> chooses from among several different queues, all with packets whose
> eligible time is not later than the current clock time, is almost
> identical to
> the way it chooses among backlogged queues with virtual time, viz., it
> chooses the packet with the lowest deadline time.  Remember that the
> steeper the curve, the greater the bandwidth and the shorter the
> distance between eligible time and deadline time for the same sized
> packet.
> Let's walk through a real example.  We have Queue A with VoIP traffic -
> small, 222 byte packets and an rt service curve bandwidth such that it
> takes 0.2ms to send its packets; that equates to roughly 8.88 Mbps ((222
> * 8)bits/0.0002s).  Let's also assume that we have so much VoIP traffic
> that the queue is filled so we always have a VoIP packet asking to be
> dequeued.  Queue B is sending FTP with large packets and an rt service
> curve rate such that each packet takes 2ms to send; this equates to
> 6.056 Mbps ((1514 * 8)bits/0.002s).  Queue B is also completely full.
> Let's assume that the maximum bandwidth available is the sum of the
> guaranteed bandwidths, viz., 14.936 Mbps.  This will allow us to
> calculate the progress of clock time, i.e., how long it actually takes
> to send each packet.  Also remember that each packet has a 4 byte CRC,
> an 8 byte preamble, and a 12 byte interframe gap, at least in
> traditional Ethernet.  Thus to transmit a packet in Queue A, we really
> need to transmit 246 bytes and, to transmit one in Queue B, we need to
> transmit 1538 bytes.  Thus, the elapsed time to send a Queue A packet is
> (246 * 8)bits / 14,936,000(b/s) = 0.132ms and the time to transmit a
> Queue B packet is (1538 * 8)bits / 14,936,000(b/s) = 0.824ms.  Sorry for
> all the math but this is what is inside the black box (and a very
> simplified version!).
> Let's assume that clock time (CT - the actual time) is 1000ms (not
> realistic but it makes the explanation easier!).  The next packet queued
> in Queue A has an ET (eligible time) of 1000ms and the next packet in
> Queue B has an ET of 999ms, i.e., both are eligible to be sent.  A less
> sophisticated traffic shaping algorithm would send the FTP packet first.
> However, HFSC calculates the deadline time (DT) for the packet in Queue
> A at 1000.2 (1000 + 0.2) and the deadline time for the packet in Queue B
> at 1001ms (999 + 2) so it sends A instead since it has the smaller DT.
> 0.132ms has elapsed in real time so CT is now 1000.132.  The
> eligible/deadline times (ET/DT) for A and B respectively are
> 1000.2/1000.4 and 999/1001.  Notice that A is no longer eligible to send
> because its ET > CT so B is serviced.  0.824ms has elapsed to send B's
> packet so CT is now 1000.956.  ET/DT for A is still 1000.2/1000.4 but B
> has changed to 1001/1003.  B just misses being eligible to send but A is
> eligible so A is sent.  Elapsed time is 0.132, CT is now 1001.088, ET/DT
> for A is 1000.4/1000.6.  Both A and B are eligible at the same time
> again as both their ETs <= CT.  A's DT is less than B's DT so A is
> serviced.
> In fact, A will send 11 packets.  Let's see the result after A sends 11
> packets.  Elapsed time is 11 * 0.132 = 1.452ms so clock time is 1002.54.
> A's ET/DT have incremented by 11 * 0.2 so they are 1002.6/1002.8.  B's
> ET/DT have remained at 1001/1003.  A is no longer eligible so the fact
> that its DT is less than B's DT is irrelevant.  B is serviced.
> Pulling it all together
> To this point, almost all of our discussion has involved rt service
> curves based solely upon rate and we have been using some fairly large
> circuits to illustrate.  What happens when our bandwidth really is
> constrained such as a T1 circuit or the upload bandwidth of an
> asymmetric DSL or cable connection and we need to balance time sensitive
> traffic with bulk traffic? This is where di-linear curves save the day.
> The best illustration I have found is from the SIGCOMM '97 paper on HFSC:
>
> The results might not be obvious at first.  The packets arriving, e.g.,
> coming from the internal network and heading toward the constricted T1
> circuit (well, in this specific example, it is a 10 Mbps circuit) are
> shown in the second level of graphics from the top.  The resulting
> output of those packets is shown in the very bottom set of graphics.
> The illustrations on the left show a mono-linear curve, i.e., just based
> upon bandwidth.  The video and FTP packets are flooding in, are being
> lined up and scheduled for dequeueing based upon the bandwidth only.
> Thus, the initial video packet sits in queue behind the FTP packets as
> there is no need to rush to meet its bandwidth requirements.  Here is
> how the authors describe what is happening:
> "To illustrate the advantage of decoupling delay and bandwidth
> allocation with non-linear service curves, consider the example in
> Figure 2, where a video and a FTP session share a 10 Mbps link . . . .
> Let the video source sends 30 8KB frames per second, which corresponds
> to a required bandwidth of 2 Mbps. The remaining 8 Mbps is reserved by a
> continuously backlogged FTP session. For simplicity, let all packets be
> of size 8 KB. Thus, it takes roughly 6.5 ms to transmit a packet."
> "As can be seen, the deadlines of the video packets occur every 33 ms,
> while the deadlines of the FTP packets occur every 8.2 ms. This results
> in a delay of approximately 26 ms for a video packet."
> Let's work through the math to make that more understandable.  HFSC is
> committed to deliver 2 Mbps to video and each packet is 8KB long.  Thus,
> HFSC's commitment to deliver that packet is within (8000 * 8)bits /
> 2,000,000(b/s) = 32ms.  I'm not quite sure why I come up with 32 and
> they say 33 (perhaps their 33ms is simply the 1/30s frame interval,
> i.e., 33.3ms) but we'll use 33.  In other words, to meet the deadline
> based solely upon the rate, the bandwidth part of the rt service curve,
> the packet needs to be finished dequeueing at 33ms.  Since it only takes
> 6.5ms to send the packet, HFSC can sit on the packet it received for 33
> - 6.5 = 26.5ms if it needs to in order to meet the guarantees for other
> traffic.  This adds unnecessary latency to the video stream.
> In the second scenario, we introduce an initial, elevated bandwidth
> guarantee for the first 10ms.  The bandwidth for the first 10ms is now
> 6.6 Mbps instead of 2 Mbps.  We do the math again and HFSC's commitment
> to video to maintain 6.6 Mbps is to finish dequeueing the packet within
> (8000 * 8)bits / 6,600,000(b/s) = 10ms.  Since it takes 6.5 ms to send
> the packet, HFSC can sit on the packet for no more than 10 - 6.5 = 3.5
> ms.  Quite a difference!
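> As a sketch of how that second curve might be expressed in tc (class
> IDs invented; the two forms should describe roughly the same curve):
>
> tc class add dev eth0 parent 1:1 classid 1:20 hfsc \
>   rt m1 6600kbit d 10ms m2 2000kbit
> # or, approximately, in the umax/dmax/rate form:
> tc class add dev eth0 parent 1:1 classid 1:21 hfsc \
>   rt umax 8kb dmax 10ms rate 2000kbit
>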
> I assume that, for simplicity's sake, the graphic leaves out an important
> point.  The rt service curve either fully or partially resets when the
> queue has been drained.  Fully or partially depends on how much time has
> elapsed since the last service rendered and the new service requested
> when the queue becomes active again.  Thanks to Michal Soltys on the
> netdev mailing list for clarifying this for me.
> Without this reset, the first part of the service curve (the m1 portion
> if you recall the earlier discussion about how service curves can be
> defined) would not have much practical value because once traffic
> activates the queue, only the first packet or first few packets would be
> guaranteed the latency of the first, accelerated part of the curve.
> Everything else would use the second part of the curve (the m2 portion -
> we'll use the terms m1 and m2 for the rest of the discussion).
> So let's re-examine the above example in more detail.  The video packet
> arrives, uses the m1 curve, is jumped ahead of the FTP packets because
> of the low latency guarantee of the m1 curve, is dequeued, and now the
> queue is empty for 33.33ms (remember the video is playing at 30 frames
> per second, well, 32.69ms when you account for transmission time on a
> 100 Mbps circuit) which allows the curve to reset.  The next video
> packet comes in and it is treated according to m1 and not m2 because we
> have reset the curve.  Thus, each video packet is jumped in front of any
> queued FTP packets.
> We can even multiplex several video streams.  As long as the queue is
> allowed to go idle long enough, each of those streams will be treated
> with very low latency and jumped in front of the FTP packets, i.e.,
> until we have so many video streams that the queue starts to backlog.
> Then they will use the m2 curve.
> This is not a bad thing; it is a good thing and allows us a new
> perspective on the m1 and m2 curves.  Hopefully, we have allocated
> enough bandwidth in our m2 curve to properly service our video or VoIP
> or whatever we are pushing through this queue.  Thus, even if we are not
> receiving the accelerated treatment, we are still experiencing
> sufficient service.  If our queue is badly backlogged and overflowing,
> then we have a different problem and need more raw bandwidth!
> In this way, we can think of the m2 curve, the second level of service
> in a concave service curve (remember a concave service curve is where we
> start out with a higher bandwidth and then change to a lower bandwidth),
> as a circuit breaker preventing overload.  In other words, we are saying
> to this prioritized traffic that we will deliver it with a lower than
> normal latency (and thus higher short term bandwidth) while we can but,
> if it becomes too much (as determined by the system administrator who
> defined the HFSC qdisc), we will drop it down to a more sustainable rate
> that will not exceed the physical capabilities of the circuit.  This
> assumes we have designed our traffic shaping so that the sum of all the
> m2 portions of all the rt service curves does not exceed the capabilities
> of the circuit.
> Another way of saying this with familiar terminology is that we can
> burst at the m1 rate as long as the queue doesn't backlog but, when it
> does, we drop to our sustainable rate.  Thus, concave rt service curves
> are very well suited to periodic traffic, i.e., traffic which sends
> packets on a recurring interval with space in between like VoIP or
> video.  It may be less effective on burstable traffic such as the
> example of using it to accelerate the delivery of text on a web site
> unless the traffic is low enough that the queue has a chance to drain
> regularly.
>
