lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200519190534.78bb8389@turbo.teknoraver.net>
Date:   Tue, 19 May 2020 19:05:34 +0200
From:   Matteo Croce <mcroce@...hat.com>
To:     Russell King - ARM Linux admin <linux@...linux.org.uk>
Cc:     Antoine Tenart <antoine.tenart@...tlin.com>,
        netdev <netdev@...r.kernel.org>,
        "gregory.clement@...tlin.com" <gregory.clement@...tlin.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Maxime Chevallier <maxime.chevallier@...tlin.com>,
        Nadav Haklai <nadavh@...vell.com>,
        Thomas Petazzoni <thomas.petazzoni@...tlin.com>,
        "miquel.raynal@...tlin.com" <miquel.raynal@...tlin.com>,
        Stefan Chulski <stefanc@...vell.com>,
        Marcin Wojtas <mw@...ihalf.com>,
        "David S . Miller" <davem@...emloft.net>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [EXT] Re: [PATCH net-next 3/5] net: mvpp2: cls: Use RSS
 contexts to handle RSS tables

On Tue, 19 May 2020 12:05:20 +0200
Matteo Croce <mcroce@...hat.com> wrote:

> On Tue, May 19, 2020 at 11:54 AM Russell King - ARM Linux admin
> <linux@...linux.org.uk> wrote:
> >
> > On Sat, May 09, 2020 at 09:20:50PM +0100, Russell King - ARM Linux
> > admin wrote:
> > > On Sat, May 09, 2020 at 08:52:46PM +0100, Russell King - ARM
> > > Linux admin wrote:
> > > > It is highly likely that 895586d5dc32 is responsible for this
> > > > breakage. I've been investigating this afternoon, and what I've
> > > > found, comparing a kernel without 895586d5dc32 and with
> > > > 895586d5dc32 applied is:
> > > >
> > > > - The table programmed into the hardware via
> > > > mvpp22_rss_fill_table() appears to be identical with or without
> > > > the commit.
> > > >
> > > > - When rxhash is enabled on eth2, mvpp2_rss_port_c2_enable()
> > > > reports that c2.attr[0] and c2.attr[2] are written back
> > > > containing:
> > > >
> > > >    - with 895586d5dc32, failing:    00200000 40000000
> > > >    - without 895586d5dc32, working: 04000000 40000000
> > > >
> > > > - When disabling rxhash, c2.attr[0] and c2.attr[2] are written
> > > > back as:
> > > >
> > > >    04000000 00000000
> > > >
> > > > The second value represents the MVPP22_CLS_C2_ATTR2_RSS_EN bit,
> > > > the first value is the queue number, which comprises two
> > > > fields.  The high 5 bits are 24:29 and the low three are 21:23
> > > > inclusive.  This comes from:
> > > >
> > > >        c2.attr[0] = MVPP22_CLS_C2_ATTR0_QHIGH(qh) |
> > > >                      MVPP22_CLS_C2_ATTR0_QLOW(ql);
> > > > #define     MVPP22_CLS_C2_ATTR0_QHIGH(qh)       (((qh) & 0x1f)
> > > > << 24) #define     MVPP22_CLS_C2_ATTR0_QLOW(ql)        (((ql) &
> > > > 0x7) << 21)
> > > >
> > > > So, the working case gives eth2 a queue id of 4.0, or 32 as per
> > > > port->first_rxq, and the non-working case a queue id of 0.1, or
> > > > 1.
> > > >
> > > > The allocation of queue IDs seems to be in mvpp2_port_probe():
> > > >
> > > >         if (priv->hw_version == MVPP21)
> > > >                 port->first_rxq = port->id * port->nrxqs;
> > > >         else
> > > >                 port->first_rxq = port->id *
> > > > priv->max_port_rxqs;
> > > >
> > > > Where:
> > > >
> > > >         if (priv->hw_version == MVPP21)
> > > >                 priv->max_port_rxqs = 8;
> > > >         else
> > > >                 priv->max_port_rxqs = 32;
> > > >
> > > > Making the port 0 (eth0 / eth1) have port->first_rxq = 0, and
> > > > port 1 (eth2) be 32.  It seems the idea is that the first 32
> > > > queues belong to port 0, the second 32 queues belong to port 1,
> > > > etc.
> > > >
> > > > mvpp2_rss_port_c2_enable() gets the queue number from it's
> > > > parameter, 'ctx', which comes from mvpp22_rss_ctx(port, 0).
> > > > This returns port->rss_ctx[0].
> > > >
> > > > mvpp22_rss_context_create() is responsible for allocating that,
> > > > which it does by looking for an unallocated priv->rss_tables[]
> > > > pointer.  This table is shared amongst all ports on the CP
> > > > silicon.
> > > >
> > > > When we write the tables in mvpp22_rss_fill_table(), the RSS
> > > > table entry is defined by:
> > > >
> > > >             u32 sel = MVPP22_RSS_INDEX_TABLE(rss_ctx) |
> > > >                           MVPP22_RSS_INDEX_TABLE_ENTRY(i);
> > > >
> > > > where rss_ctx is the context ID (queue number) and i is the
> > > > index in the table.
> > > >
> > > > #define     MVPP22_RSS_INDEX_TABLE_ENTRY(idx)   (idx)
> > > > #define     MVPP22_RSS_INDEX_TABLE(idx)         ((idx) << 8)
> > > > #define     MVPP22_RSS_INDEX_QUEUE(idx)         ((idx) << 16)
> > > >
> > > > If we look at what is written:
> > > >
> > > > - The first table to be written has "sel" values of
> > > > 00000000..0000001f, containing values 0..3. This appears to be
> > > > for eth1.  This is table 0, RX queue number 0.
> > > > - The second table has "sel" values of 00000100..0000011f, and
> > > > appears to be for eth2.  These contain values 0x20..0x23.  This
> > > > is table 1, RX queue number 0.
> > > > - The third table has "sel" values of 00000200..0000021f, and
> > > > appears to be for eth3.  These contain values 0x40..0x43.  This
> > > > is table 2, RX queue number 0.
> > > >
> > > > Okay, so how do queue numbers translate to the RSS table?
> > > > There is another table - the RXQ2RSS table, indexed by the
> > > > MVPP22_RSS_INDEX_QUEUE field of MVPP22_RSS_INDEX and accessed
> > > > through the MVPP22_RXQ2RSS_TABLE register.  Before
> > > > 895586d5dc32, it was:
> > > >
> > > >        mvpp2_write(priv, MVPP22_RSS_INDEX,
> > > >                    MVPP22_RSS_INDEX_QUEUE(port->first_rxq));
> > > >        mvpp2_write(priv, MVPP22_RXQ2RSS_TABLE,
> > > >                    MVPP22_RSS_TABLE_POINTER(port->id));
> > > >
> > > > and after:
> > > >
> > > >        mvpp2_write(priv, MVPP22_RSS_INDEX,
> > > > MVPP22_RSS_INDEX_QUEUE(ctx)); mvpp2_write(priv,
> > > > MVPP22_RXQ2RSS_TABLE, MVPP22_RSS_TABLE_POINTER(ctx));
> > > >
> > > > So, before the commit, for eth2, that would've contained '32'
> > > > for the index and '1' for the table pointer - mapping queue 32
> > > > to table 1. Remember that this is queue-high.queue-low of 4.0.
> > > >
> > > > After the commit, we appear to map queue 1 to table 1.  That
> > > > again looks fine on the face of it.
> > > >
> > > > Section 9.3.1 of the A8040 manual seems indicate the reason
> > > > that the queue number is separated.  queue-low seems to always
> > > > come from the classifier, whereas queue-high can be from the
> > > > ingress physical port number or the classifier depending on the
> > > > MVPP2_CLS_SWFWD_PCTRL_REG.
> > > >
> > > > We set the port bit in MVPP2_CLS_SWFWD_PCTRL_REG, meaning that
> > > > queue-high comes from the MVPP2_CLS_SWFWD_P2HQ_REG()
> > > > register... and this seems to be where our bug comes from.
> > > >
> > > > mvpp2_cls_oversize_rxq_set() sets this up as:
> > > >
> > > >         mvpp2_write(port->priv,
> > > > MVPP2_CLS_SWFWD_P2HQ_REG(port->id), (port->first_rxq >>
> > > > MVPP2_CLS_OVERSIZE_RXQ_LOW_BITS));
> > > >
> > > >         val = mvpp2_read(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG);
> > > >         val |= MVPP2_CLS_SWFWD_PCTRL_MASK(port->id);
> > > >         mvpp2_write(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG, val);
> > > >
> > > > so, the queue-high for eth2 is _always_ 4, meaning that only
> > > > queues 32 through 39 inclusive are available to eth2.  Yet,
> > > > we're trying to tell the classifier to set queue-high, which
> > > > will be ignored, to zero.
> > > >
> > > > So we end up directing traffic from eth2 not to queue 1, but to
> > > > queue 33, and then we tell it to look up queue 33 in the RSS
> > > > table.  However, RSS table has not been programmed for queue
> > > > 33, and so it ends up (presumably) dropping the packets.
> > > >
> > > > It seems that mvpp22_rss_context_create() doesn't take account
> > > > of the fact that the upper 5 bits of the queue ID can't
> > > > actually be changed due to the settings in
> > > > mvpp2_cls_oversize_rxq_set(), _or_ it seems that
> > > > mvpp2_cls_oversize_rxq_set() has been missed in this commit.
> > > > Either way, these two functions mutually disagree with what
> > > > queue number should be used.
> > > >
> > > > So, 895586d5dc32 is indeed the cause of this problem.
> > >
> > > Looking deeper into what mvpp2_cls_oversize_rxq_set() and the MTU
> > > validation is doing, it seems that MVPP2_CLS_SWFWD_P2HQ_REG() is
> > > used for at least a couple of things.
> > >
> > > So, with the classifier having had RSS enabled and directing eth2
> > > traffic to queue 1, we can not ignore the fact that we may have
> > > packets appearing on queue 32 for this port.
> > >
> > > One of the things that queue 32 will be used for is if an
> > > over-sized packet attempts to egress through eth2 - it seems that
> > > the A8040 has the ability to forward frames between its ports.
> > > However, afaik we don't support that feature, and the kernel
> > > restricts the packet size, so we should never violate the MTU
> > > validator and end up with such a packet.  In any case, _if_ we
> > > were to attempt to transmit an oversized packet, we have no
> > > support in the kernel to deal with that appearing in the port's
> > > receive queue.
> > >
> > > Maybe it would be safe to clear the MVPP2_CLS_SWFWD_PCTRL_MASK()
> > > bit?
> > >
> > > My testing seems to confirm my findings above - clearing this bit
> > > means that if I enable rxhash on eth2, the interface can then pass
> > > traffic, as we are now directing traffic to RX queue 1 rather than
> > > queue 33.  Traffic still seems to work with rxhash off as well.
> > >
> > > So, I think it's clear where the problem lies, but not what the
> > > correct solution is; someone with more experience of packet
> > > classifiers (this one?) needs to look at this - this is my first
> > > venture into these things, and certainly the first time I've
> > > traced through how this is trying to work (or not)...
> >
> > This is what I was using here to work around the problem, and what I
> > mentioned above.
> >
> > diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c
> > b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c index
> > fd221d88811e..0dd3b65822dd 100644 ---
> > a/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c +++
> > b/drivers/net/ethernet/marvell/mvpp2/mvpp2_cls.c @@ -1058,7 +1058,7
> > @@ void mvpp2_cls_oversize_rxq_set(struct mvpp2_port *port)
> > (port->first_rxq >> MVPP2_CLS_OVERSIZE_RXQ_LOW_BITS));
> >
> >         val = mvpp2_read(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG);
> > -       val |= MVPP2_CLS_SWFWD_PCTRL_MASK(port->id);
> > +       val &= ~MVPP2_CLS_SWFWD_PCTRL_MASK(port->id);
> >         mvpp2_write(port->priv, MVPP2_CLS_SWFWD_PCTRL_REG, val);
> >  }
> >
> >
> 
> Hi,
> 
> I will try this change and let you know if it works.
> 
> Thanks
> 

Hi,

The patch seems to work. I'm generating traffic with random MAC and IP
addresses, to have many flows:

# tcpdump -tenni eth2
9a:a9:b1:3a:b1:6b > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.4.0 > 192.168.0.4.0: UDP, length 12
9e:92:fd:f8:7f:0a > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.4.0 > 192.168.0.4.0: UDP, length 12
66:b7:11:8a:c2:1f > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.1.0 > 192.168.0.1.0: UDP, length 12
7a:ba:58:bd:9a:62 > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.1.0 > 192.168.0.1.0: UDP, length 12
7e:78:a9:97:70:3a > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.2.0 > 192.168.0.2.0: UDP, length 12
b2:81:91:34:ce:42 > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.2.0 > 192.168.0.2.0: UDP, length 12
2a:05:52:d0:d9:3f > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.3.0 > 192.168.0.3.0: UDP, length 12
ee:ee:47:35:fa:81 > 00:51:82:11:22:02, ethertype IPv4 (0x0800), length 60: 10.0.0.3.0 > 192.168.0.3.0: UDP, length 12

This is the default rate, with rxhash off:

# utraf eth2
tx: 0 bps 0 pps rx: 397.4 Mbps 827.9 Kpps
tx: 0 bps 0 pps rx: 396.3 Mbps 825.7 Kpps
tx: 0 bps 0 pps rx: 396.6 Mbps 826.3 Kpps
tx: 0 bps 0 pps rx: 396.5 Mbps 826.1 Kpps

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      9 root      20   0       0      0      0 R  99.7   0.0   7:02.58 ksoftirqd/0
     15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 ksoftirqd/1
     20 root      20   0       0      0      0 S   0.0   0.0   2:01.48 ksoftirqd/2
     25 root      20   0       0      0      0 S   0.0   0.0   0:32.86 ksoftirqd/3

and this with rx hashing enabled:

# ethtool -K eth2 rxhash on
# utraf eth2
tx: 0 bps 0 pps rx: 456.4 Mbps 950.8 Kpps
tx: 0 bps 0 pps rx: 458.4 Mbps 955.0 Kpps
tx: 0 bps 0 pps rx: 457.6 Mbps 953.3 Kpps
tx: 0 bps 0 pps rx: 462.2 Mbps 962.9 Kpps

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     20 root      20   0       0      0      0 R   0.7   0.0   2:02.34 ksoftirqd/2
     25 root      20   0       0      0      0 S   0.3   0.0   0:33.25 ksoftirqd/3
      9 root      20   0       0      0      0 S   0.0   0.0   7:52.57 ksoftirqd/0
     15 root      20   0       0      0      0 S   0.0   0.0   0:00.00 ksoftirqd/1


The throughput doesn't increase so much, maybe we hit an HW limit of
the gigabit port. The interesting thing is how the global CPU usage
drops from 25% to 1%.
I can't explain this, it could be due to the reduced contention?

Regards,
-- 
Matteo Croce
per aspera ad upstream

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ