netdev - Re: Use of genradix in sctp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200821204636.GO3399@localhost.localdomain>
Date:   Fri, 21 Aug 2020 17:46:36 -0300
From:   'Marcelo Ricardo Leitner' <marcelo.leitner@...il.com>
To:     David Laight <David.Laight@...lab.com>
Cc:     "'linux-sctp@...r.kernel.org'" <linux-sctp@...r.kernel.org>,
        "'netdev@...r.kernel.org'" <netdev@...r.kernel.org>
Subject: Re: Use of genradix in sctp

On Wed, Aug 19, 2020 at 08:18:50AM +0000, David Laight wrote:
> From:'Marcelo Ricardo Leitner
> > Sent: 18 August 2020 22:38
> > 
> > On Tue, Aug 18, 2020 at 03:38:09PM +0000, David Laight wrote:
> > > A few years ago (for 5.1) the 'arrays' that sctp uses for
> > > info about data streams was changed to use the 'genradix'
> > > functions.
> > >
> > > I'm not sure of the reason for the change, but I don't
> > > thing anyone looked at the performance implications.
> > 
> > I don't have something like a CI for it, but I do run some performance
> > benchmarks every now and then and it didn't trigger anything
> > noticeable in my tests.
> 
> We have some customers who we think are sending 10000+ short
> SCTP data chunks a second.
> They are probably sending SMS messages, so that is 5000+ text
> messages a second!
> It is hard to stop those being sent with more than one
> data chunk in each ethernet frame!

Thanks for this info.

> 
> > Yet, can it be improved? Certainly. Patches welcomed. :-)
> 
> I'll apply some of my copious free time to it...
> Actually some simple changes would help:
> 
> 1) Change SCTP_SO()->x to so=SCTP_SO(); so->x in places
>    where there are multiple references to the same stream.

Agree. This is an easy change, mostly just mechanical.

> 
> 2) Optimise the genradix lookup for the case where there
>    is a single page - it can be completely inlined.

And/or our struct sizes.

__idx_to_offset() will basically do:
        if (!is_power_of_2(obj_size)) {
                return (idx / objs_per_page) * PAGE_SIZE +
                        (idx % objs_per_page) * obj_size;
        } else {
                return idx * obj_size;

if we can get it to his the else block, it saves some CPU cycles already (at
the expense of memory).

> 
> 3) Defer the allocation until the stream is used.
>    for outbound streams this could remove the extra buffer.

This can be tricky. What should happen if it gets a packet on a stream
that it couldn't allocate, and then another on a stream that was
already allocated? Just a drop, it will retransmit and recover, and
then again.. While, OTOH, if the application requested such amount of
streams, it is likely going to use it. If not, that's an application
bug.

...
> > > For example look at the object code for sctp_stream_clear()
> > > (__genradix_ptr() is in lib/generic-radix-tree.c).
> > 
> > sctp_stream_clear() is rarely called.
> > 
> > Caller graph:
> > sctp_stream_clear
> >   sctp_assoc_update
> >     SCTP_CMD_UPDATE_ASSOC
> >       sctp_sf_do_dupcook_b
> >       sctp_sf_do_dupcook_a
> > 
> > So, well, I'm not worried about it.
> 
> I wasn't considering the loop.
> It was just a place where the object code can be looked at.
> 
> But there are quite a few places where the same stream
> is looked for lots of times in succession.
> Even saving the pointer is probably noticeable.

Yes.

> 
> > Specs say 64k streams, so we should support that and preferrably
> > without major regressions. Traversing 64k elements each time to find
> > an entry is very not performant.
> > 
> > For a more standard use case, with something like you were saying, 17
> > streams, genradix here doesn't use too much memory. I'm afraid a
> > couple of integer calculations to get an offset is minimal overhead if
> > compared to the rest of the code.
> 
> It is probably nearer 40 including a function call - which
> is likely to cause register spills.
> 
> > For example, the stream scheduler
> > operations, accessed via struct sctp_sched_ops (due to retpoline), is
> > probably more interesting fixing than genradix effects here.
> 
> Hmmm... the most scheduling M3UA/M2PA (etc) want is (probably)
> to send stream 0 first.
> But even the use of stream 0 (for non-data messages) is a
> misunderstanding (of my understanding) of what SCTP streams are.
> IIRC there is only one flow control window.

Only one window, yes, but now we have a more fine grained control on
how it is used:

> I thought that tx data just got added to a single tx queue,
> and the multiple streams just allowed some messages to passed
> on to the receiving application while waiting for retransmissions
> (head of line blocking).

That was before stream schedulers.
https://developers.redhat.com/blog/2018/01/03/sctp-stream-schedulers/
https://tools.ietf.org/html/rfc8260

> 
> OTOH M2PA seems to wand stream 0 to have the semantics of
> ISO transport 'expedited data' - which can be sent even when
> the main flow is blocked because it has its own credit (of 1).

That is now doable with the stream schedulers.

But note that even if you don't enable it, you're paying a price
already due to function pointers on struct sctp_sched_ops, which are
used even if you use the standard scheduler (FCFS, First Come First
Served).  In there, it should start using the indirect call wrappers,
such as INDIRECT_CALL_3. After all, they are always compiled in,
anyway.