netdev - Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0Ufoy2WM3=aMNOdq2PFYL8AH9QSs=QrP_Xx59uouTnKLJg@mail.gmail.com>
Date:   Sun, 26 Mar 2023 14:23:07 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     davem@...emloft.net, netdev@...r.kernel.org, edumazet@...gle.com,
        pabeni@...hat.com, willemb@...gle.com
Subject: Re: [PATCH net-next 1/3] net: provide macros for commonly copied
 lockless queue stop/wake code

On Fri, Mar 24, 2023 at 2:28 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Fri, 24 Mar 2023 08:45:23 -0700 Alexander Duyck wrote:
> > On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > > We may want to change the values here. The most likely case is "left
> > > > enabled" with that being the case we probably want to make that the 0
> > > > case. I would then probably make 1 the re-enabled case and -1 the
> > > > stopped case.
> > >
> > > I chose the return values this way because the important information is
> > > whether the queue was in fact stopped (in case the macro is used at the
> > > start of .xmit as a safety check). If stopped is zero caller can check
> > > !ret vs !!ret.
> > >
> > > Seems pretty normal for the kernel function called stop() to return 0
> > > if it did stop.
> >
> > Except this isn't "stop", this is "maybe stop".
>
> So the return value from try_stop and maybe_stop would be different?
> try_stop needs to return 0 if it stopped - the same semantics as
> trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> strong precedent.

The problem is this isn't a lock. Ideally with this we aren't taking
the action. So if anything this functions in my mind more like the
inverse where if this does stop we have to abort more like trylock
failing.

This is why I mentioned that maybe this should be renamed. I view this
more as a check to verify we are good to proceed. In addition there is
the problem that there are 3 possible outcomes with maybe_stop versus
the two from try_stop.

> > Maybe we should just
> > do away with the stop/wake messaging and go with something such as a
> > RTS/CTS type setup. Basically this function is acting as a RTS to
> > verify that we have room on the ring to place the frame. If we don't
> > we are stopped. The "wake" function is on what is essentially the
> > receiving end on the other side of the hardware after it has DMAed the
> > frames and is providing the CTS signal back.
>
> I'm definitely open to different naming but wouldn't calling RTS _after_
> send be even more confusing than maybe_stop?

We don't call maybe_stop after the send. For that we would be calling
try_stop. The difference being in the case of maybe_stop we might have
to return busy.

> > > > With that the decision tree becomes more straightforward as we would do
> > > > something like:
> > > >       if (result) {
> > > >               if (result < 0)
> > > >                       Increment stopped stat
> > > >                       return
> > > >               else
> > > >                       Increment restarted stat
> > > >       }
> > >
> > > Do you see a driver where it matters? ixgbe and co. use
> > > netif_tx_queue_try_stop() and again they just test stopped vs not stopped.
> >
> > The thing is in order to make this work for the ixgbe patch you didn't
> > use the maybe_stop instead you went with the try_stop. If you replaced
> > the ixgbe_maybe_stop_tx with your maybe stop would have to do
> > something such as the code above to make it work. That is what I am
> > getting at. From what I can tell the only real difference between
> > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > move the restart_queue stat increment out.
>
> I can convert ixgbe further, true, but I needed the try_stop, anyway,
> because bnxt does:
>
> if (/* need to stop */) {
>         if (xmit_more())
>                 flush_db_write();
>         netif_tx_queue_try_stop();
> }
>
> which seems reasonable.

I wasn't saying we didn't need try_stop. However the logic here
doesn't care about the return value. In the ixgbe case we track the
queue restarts so we would want a 0 on success and a non-zero if we
have to increment the stat. I would be okay with the 0 (success) / -1
(queue restarted) in this case.

> > The general thought is I would prefer to keep it so that 0 is the
> > default most likely case in both where the queue is enabled and is
> > still enabled. By moving the "take action" items into the 1/-1 values
> > then it becomes much easier to sort them out with 1 being a stat
> > increment and -1 being an indication to stop transmitting and prep for
> > watchdog hang if we don't clear this in the next watchdog period.
>
> Maybe worth taking a step back - the restart stat which ixgbe
> maintains made perfect sense when you pioneered this approach but
> I think we had a decade of use, and have kprobes now, so we don't
> really need to maintain a statistic for a condition with no impact
> to the user? New driver should not care 1 vs -1..

Actually the restart_queue stat is VERY useful for debugging. It tells
us we are seeing backlogs develop in the Tx queue. We track it any
time we wake up the queue, not just in the maybe_stop case.

WIthout that we are then having to break out kprobes and the like
which we could only add after-the-fact which makes things much harder
to debug when issues occur. For example, a common case to use it is to
monitor it when we see a system with slow Tx connections. With that
stat we can tell if we are building a backlog in the qdisc or if it is
something else such as a limited amount of socket memory is limiting
the transmits.

> > Also in general it makes it easier to understand if these all work
> > with the same logic.
> >
> > > > In addition for readability we may want consider adding an enum simliar
> > > > to the netdev_tx enum as then we have the return types locked and
> > > > usable should we want to specifically pick out one.
> > >
> > > Hm, I thought people generally dislike the netdev_tx enum.
> > > Maybe it's just me.
> >
> > The thought I had with the enum is to more easily connect the outcomes
> > with the sources. It would also help to prevent any confusion on what
> > is what. Having the two stop/wake functions return different values is
> > a potential source for errors since 0/1 means different things in the
> > different functions. Basically since we have 3 possible outcomes using
> > the enum would make it very clear what the mapping is between the two.
>
> IMO only two outcomes matter in practice (as mentioned above).
> I really like the ability to treat the return value as a bool, if only
> we had negative zero we would have a perfect compromise :(

I think we are just thinking about two different things. I am focusing
on the "maybe" calls that have 3 outcomes whereas I think you are
mostly focused on the "try" calls. My thought is to treat it something
like the msix allocation calls where a negative indicates a failure
forcing us to stop since the ring is full, 0 is a success, and a value
indicates that there are resources but they are/were limited.