Date:   Wed, 23 Oct 2019 08:49:30 -0400
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Vlad Buslov <vladbu@...lanox.com>, netdev@...r.kernel.org
Cc:     xiyou.wangcong@...il.com, jiri@...nulli.us, davem@...emloft.net,
        mleitner@...hat.com, dcaratti@...hat.com,
        Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH net-next 00/13] Control action percpu counters allocation
 by netlink flag


Hi Vlad,

On 2019-10-22 10:17 a.m., Vlad Buslov wrote:
> Currently, a significant fraction of CPU time during TC filter
> allocation is spent in the percpu allocator. Moreover, the percpu
> allocator is protected by a single global mutex, which negates any
> potential to improve its performance by means of recent developments
> in the TC filter update API that removed the rtnl lock for some
> Qdiscs and classifiers. In order to significantly improve the filter
> update rate and reduce memory usage, we would like to allow users to
> skip percpu counter allocation for a specific action if they don't
> expect a high traffic rate hitting the action, which is a reasonable
> expectation for a hardware-offloaded setup. In that case any
> potential gains to software fast-path performance from
> percpu-allocated counters, compared to regular integer counters
> protected by a spinlock, are not important, but the amount of
> additional CPU and memory they consume is significant.
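
To make the quoted tradeoff concrete, here is a minimal userspace
analogy (not the kernel code; NCPUS and all names are invented):
per-CPU slots make the per-packet update lock-free at the cost of one
cache line per possible CPU plus an extra trip through the allocator,
while the spinlock-protected counter is one small allocation that
every update contends on.

/*
 * Illustrative userspace sketch of the two counter schemes.
 * NCPUS stands in for num_possible_cpus().
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS 8

/* One small allocation, but every update serializes on the lock. */
struct locked_counter {
        pthread_spinlock_t lock;
        uint64_t packets;
};

/* One cache line per CPU, mirroring how percpu memory keeps each
 * CPU's slot private: updates are lock-free, reads sum all slots. */
struct percpu_slot {
        uint64_t packets;
} __attribute__((aligned(64)));

struct percpu_counter {
        struct percpu_slot slot[NCPUS];
};

static void locked_inc(struct locked_counter *c)
{
        pthread_spin_lock(&c->lock);
        c->packets++;
        pthread_spin_unlock(&c->lock);
}

static void percpu_inc(struct percpu_counter *c, int cpu)
{
        c->slot[cpu].packets++;        /* no shared state touched */
}

static uint64_t percpu_read(const struct percpu_counter *c)
{
        uint64_t sum = 0;

        for (int i = 0; i < NCPUS; i++)
                sum += c->slot[i].packets;
        return sum;
}

int main(void)
{
        struct locked_counter lc = { .packets = 0 };
        struct percpu_counter pc = { 0 };

        pthread_spin_init(&lc.lock, PTHREAD_PROCESS_PRIVATE);
        locked_inc(&lc);
        percpu_inc(&pc, 0);
        percpu_inc(&pc, 3);
        printf("locked=%llu percpu=%llu\n",
               (unsigned long long)lc.packets,
               (unsigned long long)percpu_read(&pc));
        pthread_spin_destroy(&lc.lock);
        return 0;
}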

Great to see this become low-hanging fruit after your improvements.
Note: I had a discussion a few years back with Eric D. (on Cc) when I
was trying to improve action dumping; what you are seeing was very
visible when doing a large batch creation of actions. At the time I
was thinking of amortizing the cost of that mutex in a batch action
create, i.e. you ask the percpu allocator to allocate a batch of the
stats blocks instead of one at a time; see the sketch below.
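
Roughly this shape, as a userspace sketch (the stats_alloc_* names
are hypothetical and calloc merely stands in for the allocator's
internals; the point is one mutex round trip per batch instead of
one per action):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct stats_block {        /* stand-in for per-action percpu stats */
        unsigned long bytes;
        unsigned long packets;
};

/* Models the percpu allocator's single global mutex. */
static pthread_mutex_t alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Today: one lock round trip per action created. */
static struct stats_block *stats_alloc_one(void)
{
        struct stats_block *b;

        pthread_mutex_lock(&alloc_mutex);
        b = calloc(1, sizeof(*b));
        pthread_mutex_unlock(&alloc_mutex);
        return b;
}

/* Batched: one lock round trip no matter how many actions. */
static int stats_alloc_batch(struct stats_block **out, int n)
{
        int i;

        pthread_mutex_lock(&alloc_mutex);
        for (i = 0; i < n; i++) {
                out[i] = calloc(1, sizeof(*out[i]));
                if (!out[i])
                        break;
        }
        pthread_mutex_unlock(&alloc_mutex);
        return i;
}

int main(void)
{
        struct stats_block *one = stats_alloc_one();
        struct stats_block *many[1000];
        int got = stats_alloc_batch(many, 1000);

        printf("singular=%p batch=%d\n", (void *)one, got);
        free(one);
        while (got--)
                free(many[got]);
        return 0;
}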

I understand your use case is different since it is for h/w offload.
If you have time, could you test batch creation of many actions and
see the before/after improvement?

Note: even for h/w offload it makes sense to first create the actions
and then bind them to filters (in my world that's what we end up
doing). If we can improve the first phase, it is a win for both the
s/w and h/w use cases.

Question:
Given TCA_ACT_FLAGS_FAST_INIT is common to all actions, would it make
sense to use a TLV in the namespace of TCA_ACT_MAX (an outer TLV)?
You would have to pass a param to ->init(); roughly the sketch below.
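
For illustration only (all names hypothetical, not the actual patch):
the flag is parsed once at the outer level and threaded into each
action's ->init() as a parameter, instead of each action parsing its
own copy.

#include <stdbool.h>
#include <stdio.h>

struct init_params {
        bool fast_init;     /* would come from the outer TCA_ACT_* TLV */
};

/* stand-in for an action ops ->init() callback with the extra param */
typedef int (*act_init_fn)(const struct init_params *p);

static int example_act_init(const struct init_params *p)
{
        if (p->fast_init)
                printf("skip percpu stats allocation\n");
        else
                printf("allocate percpu stats\n");
        return 0;
}

int main(void)
{
        /* pretend the outer TLV was parsed once, up front */
        struct init_params p = { .fast_init = true };
        act_init_fn init = example_act_init;

        return init(&p);
}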

cheers,
jamal
