lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200814104228.eidqu7fd7mfyur5n@skbuf>
Date:   Fri, 14 Aug 2020 13:42:28 +0300
From:   Vladimir Oltean <olteanv@...il.com>
To:     Jiri Pirko <jiri@...nulli.us>, netdev@...r.kernel.org,
        UNGLinuxDriver@...rochip.com
Subject: devlink-sb on ocelot switches

Hi,

Sorry in advance for the long message, I am trying to summarize with as
few words as possible, however the topic is not completely trivial so
explanation is needed.

I have a switch supported by drivers/net/mscc/ocelot*.c code, and I
would like to expose resource consumption watermarks through devlink-sb.

This switch family tracks consumption of memory and of frame references.
My switch has 128 KB of packet buffer and 1092 available frame
references. Packet buffers are accounted for in cells of 60 bytes each,
and frame references in units of 1.

There are 1024 watermarks inside the switch. Their role is fixed
according to their index. There are 16 types of watermarks (the
cartesian product between BUF_xxxx_I etc, and xxx_Q_RSRV_x etc).

The switch has queues on both ingress and egress.

/* The queue system tracks four resource consumptions:
 * Resource 0: Memory tracked per source port
 * Resource 1: Frame references tracked per source port
 * Resource 2: Memory tracked per destination port
 * Resource 3: Frame references tracked per destination port
 */
#define OCELOT_RESOURCE_SZ		256
#define OCELOT_NUM_RESOURCES		4

#define BUF_xxxx_I			(0 * OCELOT_RESOURCE_SZ)
#define REF_xxxx_I			(1 * OCELOT_RESOURCE_SZ)
#define BUF_xxxx_E			(2 * OCELOT_RESOURCE_SZ)
#define REF_xxxx_E			(3 * OCELOT_RESOURCE_SZ)

/* For each resource type there are 4 types of watermarks:
 * Q_RSRV: reservation per QoS class per port
 * PRIO_SHR: sharing watermark per QoS class across all ports
 * P_RSRV: reservation per port
 * COL_SHR: sharing watermark per color (drop precedence) across all ports
 */
#define xxx_Q_RSRV_x			0
#define xxx_PRIO_SHR_x			216
#define xxx_P_RSRV_x			224
#define xxx_COL_SHR_x			254

/*
 * Reservation Watermarks
 * ----------------------
 *
 * For setting up the reserved areas, egress watermarks exist per port and per
 * QoS class for both ingress and egress.
 */

/*
 *  Amount of packet buffer
 *  |  per QoS class
 *  |  |  reserved
 *  |  |  |   per egress port
 *  |  |  |   |
 *  V  V  v   v
 * BUF_Q_RSRV_E
 */
#define BUF_Q_RSRV_E(port, prio) \
	(BUF_xxxx_E + xxx_Q_RSRV_x + 8 * (port) + (prio))

/*
 *  Amount of packet buffer
 *  |  for all port's traffic classes
 *  |  |  reserved
 *  |  |  |   per egress port
 *  |  |  |   |
 *  V  V  v   v
 * BUF_P_RSRV_E
 */
#define BUF_P_RSRV_E(port) \
	(BUF_xxxx_E + xxx_P_RSRV_x + (port))

/*
 *  Amount of packet buffer
 *  |  per QoS class
 *  |  |  reserved
 *  |  |  |   per ingress port
 *  |  |  |   |
 *  V  V  v   v
 * BUF_Q_RSRV_I
 */
#define BUF_Q_RSRV_I(port, prio) \
	(BUF_xxxx_I + xxx_Q_RSRV_x + 8 * (port) + (prio))

/*
 *  Amount of packet buffer
 *  |  for all port's traffic classes
 *  |  |  reserved
 *  |  |  |   per ingress port
 *  |  |  |   |
 *  V  V  v   v
 * BUF_P_RSRV_I
 */
#define BUF_P_RSRV_I(port) \
	(BUF_xxxx_I + xxx_P_RSRV_x + (port))

/*
 *  Amount of frame references
 *  |  per QoS class
 *  |  |  reserved
 *  |  |  |   per egress port
 *  |  |  |   |
 *  V  V  v   v
 * REF_Q_RSRV_E
 */
#define REF_Q_RSRV_E(port, prio) \
	(REF_xxxx_E + xxx_Q_RSRV_x + 8 * (port) + (prio))

/*
 *  Amount of frame references
 *  |  for all port's traffic classes
 *  |  |  reserved
 *  |  |  |   per egress port
 *  |  |  |   |
 *  V  V  v   v
 * REF_P_RSRV_E
 */
#define REF_P_RSRV_E(port) \
	(REF_xxxx_E + xxx_P_RSRV_x + (port))

/*
 *  Amount of frame references
 *  |  per QoS class
 *  |  |  reserved
 *  |  |  |   per ingress port
 *  |  |  |   |
 *  V  V  v   v
 * REF_Q_RSRV_I
 */
#define REF_Q_RSRV_I(port, prio) \
	(REF_xxxx_I + xxx_Q_RSRV_x + 8 * (port) + (prio))

/*
 *  Amount of frame references
 *  |  for all port's traffic classes
 *  |  |  reserved
 *  |  |  |   per ingress port
 *  |  |  |   |
 *  V  V  v   v
 * REF_P_RSRV_I
 */
#define REF_P_RSRV_I(port) \
	(REF_xxxx_I + xxx_P_RSRV_x + (port))

/*
 * Sharing Watermarks
 * ------------------
 *
 * The shared memory area is shared between all ports.
 */

/*
 * Amount of buffer
 *  |   per QoS class
 *  |   |    from the shared memory area
 *  |   |    |  for egress traffic
 *  |   |    |  |
 *  V   V    v  v
 * BUF_PRIO_SHR_E
 */
#define BUF_PRIO_SHR_E(prio) \
	(BUF_xxxx_E + xxx_PRIO_SHR_x + (prio))

/*
 * Amount of buffer
 *  |   per color (drop precedence level)
 *  |   |   from the shared memory area
 *  |   |   |  for egress traffic
 *  |   |   |  |
 *  V   V   v  v
 * BUF_COL_SHR_E
 */
#define BUF_COL_SHR_E(dp) \
	(BUF_xxxx_E + xxx_COL_SHR_x + (1 - (dp)))

/*
 * Amount of buffer
 *  |   per QoS class
 *  |   |    from the shared memory area
 *  |   |    |  for ingress traffic
 *  |   |    |  |
 *  V   V    v  v
 * BUF_PRIO_SHR_I
 */
#define BUF_PRIO_SHR_I(prio) \
	(BUF_xxxx_I + xxx_PRIO_SHR_x + (prio))

/*
 * Amount of buffer
 *  |   per color (drop precedence level)
 *  |   |   from the shared memory area
 *  |   |   |  for ingress traffic
 *  |   |   |  |
 *  V   V   v  v
 * BUF_COL_SHR_I
 */
#define BUF_COL_SHR_I(dp) \
	(BUF_xxxx_I + xxx_COL_SHR_x + (1 - (dp)))

/*
 * Amount of frame references
 *  |   per QoS class
 *  |   |    from the shared area
 *  |   |    |  for egress traffic
 *  |   |    |  |
 *  V   V    v  v
 * REF_PRIO_SHR_E
 */
#define REF_PRIO_SHR_E(prio) \
	(REF_xxxx_E + xxx_PRIO_SHR_x + (prio))

/*
 * Amount of frame references
 *  |   per color (drop precedence level)
 *  |   |   from the shared area
 *  |   |   |  for egress traffic
 *  |   |   |  |
 *  V   V   v  v
 * REF_COL_SHR_E
 */
#define REF_COL_SHR_E(dp) \
	(REF_xxxx_E + xxx_COL_SHR_x + (1 - (dp)))

/*
 * Amount of frame references
 *  |   per QoS class
 *  |   |    from the shared area
 *  |   |    |  for ingress traffic
 *  |   |    |  |
 *  V   V    v  v
 * REF_PRIO_SHR_I
 */
#define REF_PRIO_SHR_I(prio) \
	(REF_xxxx_I + xxx_PRIO_SHR_x + (prio))

/*
 * Amount of frame references
 *  |   per color (drop precedence level)
 *  |   |   from the shared area
 *  |   |   |  for ingress traffic
 *  |   |   |  |
 *  V   V   v  v
 * REF_COL_SHR_I
 */
#define REF_COL_SHR_I(dp) \
	(REF_xxxx_I + xxx_COL_SHR_x + (1 - (dp)))

Now comes the tricky part. Here is how, to the best of my understanding,
the switch admission control based on watermarks works.

First it checks for buffer reservations for the associated QoS class
(tc) of the ingress port.

Then it checks for buffer reservations for the entire port, regardless
of tc.

If the reservation thresholds for the ingress port are exceeded, it
tries to consume buffers from the reservations of the egress port
(decided through forwarding).

If this doesn't work out either, the watermarks for shared (not reserved
to any port) buffers are checked. All sharing watermarks must be below
their configured thresholds.

For a frame to pass the controlling watermark checks, both buffers need
to be available, and frame references need to be available. The check
for frame references is identical to the one for buffers, in principle,
and shown to the right of the diagram below.

              Start
                v
                v
           Memory check               +>>>>>>>>>> Frame reference check
                v                     ^                    v
                v                     ^                    v
           Ingress memory             ^            Ingress references
           is available?              ^              are available?
                v                     ^                    v
                v        not exceeded ^                    v    not exceeded
           BUF_Q_RSRV_I >>>>>>>>>>>>>>+               REF_Q_RSRV_I >>> accept
                v                     ^                    v
       exceeded v                     ^           exceeded v
                v        not exceeded ^                    v    not exceeded
           BUF_P_RSRV_I >>>>>>>>>>>>>>+               REF_P_RSRV_I >>> accept
                v                     ^                    v
       exceeded v                     ^           exceeded v
                v                     ^                    v
           Egress memory              ^             Egress references
           is available?              ^              are available?
                v                     ^                    v
       exceeded v                     ^           exceeded v
                v        not exceeded ^                    v    not exceeded
           BUF_Q_RSRV_E >>>>>>>>>>>>>>+               REF_Q_RSRV_E >>> accept
                v                     ^                    v
       exceeded v                     ^           exceeded v
                v        not exceeded ^                    v    not exceeded
           BUF_P_RSRV_E >>>>>>>>>>>>>>+               REF_P_RSRV_E >>> accept
                v                     ^                    v
       exceeded v                     ^           exceeded v
                v                     ^                    v
           Shared memory              ^             Shared references
           is available?              ^              are available?
                v                     ^                    v
   exceeded     v                     ^                    v      exceeded
drop <<<< BUF_PRIO_SHR_E              ^              REF_PRIO_SHR_E >>>> drop
                v                     ^                    v
                v not exceeded        ^       not exceeded v
   exceeded     v                     ^                    v      exceeded
drop <<<< BUF_COL_SHR_E               ^              REF_COL_SHR_E >>>>> drop
                v                     ^                    v
                v not exceeded        ^       not exceeded v
   exceeded     v                     ^                    v      exceeded
drop <<< BUF_PRIO_SHR_I               ^             REF_PRIO_SHR_I >>>>> drop
                v                     ^                    v
                v not exceeded        ^       not exceeded v
   exceeded     v                     ^                    v      exceeded
drop <<< BUF_COL_SHR_I                ^             REF_COL_SHR_I >>>>>> drop
                v                     ^                    v
                v not exceeded        ^       not exceeded v
                v                     ^                    v
                +>>>>>>>>>>>>>>>>>>>>>+                 accept

Now, I was trying to understand whether these watermarks can be exposed
through devlink-sb.

Step 1:

   devlink sb show - display available shared buffers and their attributes
       DEV - specifies the devlink device to show shared buffers.  If
       this argument is omitted all shared buffers of all devices are
       listed.

       SB_INDEX - specifies the shared buffer.  If this argument is
       omitted shared buffer with index 0 is selected.  Behaviour of
       this argument it the same for every command.

What should this list for ocelot?
I was thinking it could list 2 shared buffers:
- SB_INDEX 0: this is the packet buffer. Its size is 128 KB. Its cell
  size is 60 bytes.
- SB_INDEX 1: this is the group of frame references. Its size is 1092.
  Its cell size is 1.

Step 2:

   devlink sb pool show - display available pools and their attributes
       DEV - specifies the devlink device to show pools.  If this
       argument is omitted all pools of all devices are listed.

       Display available pools listing their type, size, thtype and
       cell_size. cell_size is the allocation granularity of memory
       within the shared buffer. Drivers may round up, round down or
       reject size passed to the set command if it is not multiple of
       cell_size.

Hmmmmm....
I have 2 conflicting thoughts here.

First would be that both SB_INDEX 0 and SB_INDEX 1 would have a single
pool, POOL_INDEX 0. The size of this pool would be equal to the size of
the SB_INDEX it belongs to.

The other thought is that maybe it would be overall simpler if I could
just add a new DEVLINK_ATTR_SB_POOL_NAME attribute, which is a string,
and expose all the BUF_Q_RSRV_E_PORT0_TC6 stuff as its own pool. Then,
configuring the size of this pool would in fact configure its threshold.
See more below on why I think this is simpler.

Step 3:

   devlink sb pool set - set attributes of pool
       DEV - specifies the devlink device to set pool.

       size POOL_SIZE
              size of the pool in Bytes.

       thtype { static | dynamic }
              pool threshold type.

              static - Threshold values for the pool will be passed in Bytes.

              dynamic - Threshold values ("to_alpha") for the pool will
                        be used to compute alpha parameter according to
                        formula:
                              alpha = 2 ^ (to_alpha - 10)

                        The range of the passed value is between 0 to
                        20. The computed alpha is used to determine the
                        maximum usage of the flow:
                              max_usage = alpha / (1 + alpha) * Free_Buffer

Ok, so if I go with first thought, to only implement POOL_INDEX 0, then
nothing is configurable. The size is fixed, and the thtype is static.

Step 4:

   devlink sb port pool show - display port-pool combinations and
                               threshold for each
       DEV/PORT_INDEX - specifies the devlink port.

       pool POOL_INDEX
              pool index.

   devlink sb port pool set - set port-pool threshold
       DEV/PORT_INDEX - specifies the devlink port.

       pool POOL_INDEX
              pool index.

       th THRESHOLD
              threshold value. Type of the value is either Bytes or
              "to_alpha", depends on thtype set for the pool.

Ok, the 'port pool' is what? Is it BUF_P_RSRV_E (egress reservation) or
BUF_P_RSRV_I (ingress reservation)? Unlike traffic classes, the port
pool does not have a "type { ingress | egress }".

Also, I cannot assign a pool to a port dynamically. The design of this
switch is as such that the assignment is fixed.

Step 5:

   devlink sb tc bind set - set port-TC to pool binding with specified
                            threshold
       DEV/PORT_INDEX - specifies the devlink port.

       tc TC_INDEX
              index of either ingress or egress TC, usually in range 0
              to 8 (depends on device).

       type { ingress | egress }
              TC type.

       pool POOL_INDEX
              index of pool to bind this to.

       th THRESHOLD
              threshold value. Type of the value is either Bytes or
              "to_alpha", depends on thtype set for the pool.

The 'tc' pool should be BUF_Q_RSRV_E, if type==egress, or BUF_Q_RSRV_I,
if type==ingress. This is, I think, the only aspect that maps well over
the hardware. The POOL_INDEX could only be zero, same as the POOL_INDEX
for the port.

Step 6:

What isn't shown:

1. How could I model the sharing watermarks?
- Buffers that frames with tc={0,1,2...7} can consume, irrespective of
  source or destination port, when the reservation watermarks are
  exceeded. Per ingress and per egress direction.
- Buffers that frames with dp={0,1} can consume when reservations are
  exceeded. Per ingress and per egress direction.
- Ports can be configured to draw from the sharing watermarks or only
  from their own reservations.

2. Is it ok if I model the frame references as another shared buffer?
   There are some places that refer to devlink-sb as bytes, and these
   wouldn't be that.

3. What I've shown here is _not_ tail dropping. It is _congestion_
   dropping. These watermarks are used for:
   - Flow control, if the port is in pause mode.
   - Congestion dropping, if the port is in drop mode.
   But each port also has a setting called INGRESS_DROP_MODE and another
   one called EGRESS_DROP_MODE. When these are set to zero, the
   controlling watermarks do not cause packet drops. The watermarks are
   simply allowed to exceed, and the packets are kept in the ingress
   queues (not transferred to the egress ones).
   The packets _will_ be dropped eventually, when the tail drop
   watermarks are reached. A packet will be tail dropped when:
   - The ingress port memory consumption exceeds the
     SYS:PORT:ATOP_CFG.ATOP watermark
   AND
   - The total consumed memory in the shared queue system exceeds the
     SYS:PORT:ATOP_TOT_CFG.ATOP_TOT watermark.
   Aka: when there's no memory left, drop the traffic of the offending
   ingress ports.
   How can I configure the tail dropping watermarks (global and per
   port)? Still through devlink-sb or through a different mechanism?

Step 7:

   devlink sb occupancy show - display shared buffer occupancy values
                                for device or port
       This command is used to browse shared buffer occupancy values.
       Values are showed for every port-pool combination as well as for
       all port-TC combinations (with pool this port-TC is bound to).
       Format of value is:
                       current_value/max_value
       Note that before showing values, one has to issue occupancy
       snapshot command first.

       DEV - specifies the devlink device to show occupancy values for.

       DEV/PORT_INDEX - specifies the devlink port to show occupancy values for.

   devlink sb occupancy snapshot - take occupancy snapshot of shared
                                   buffer for device
       This command is used to take a snapshot of shared buffer
       occupancy values. After that, the values can be showed using
       occupancy show command.

       DEV - specifies the devlink device to take occupancy snapshot on.

   devlink sb occupancy clearmax - clear occupancy watermarks of shared
                                   buffer for device
       This command is used to reset maximal occupancy values reached
       for whole device. Note that before browsing reset values, one has
       to issue occupancy snapshot command.

       DEV - specifies the devlink device to clear occupancy watermarks
       on.

There are 2 aspects when it comes to the ocelot switches:
1. There isn't any way to atomically snapshot all the 1024 watermarks.
2. Each watermark has 2 status values: INUSE (current) and MAXUSE
   (maximum since last time it was read). But the MAXUSE counter resets
   on its own when reading it...

Also, there is another major issue, I think. The occupancy is per pool,
am I right? And I have a single pool...

Thanks,
-Vladimir

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ