Message-ID: <tRA-1eVt0Av_cRCmND6povnCqYiBpaOoilgpCM2qNbo3GIe6szAEIN1mI20gRjgf215ODBQJBfolBlBzyJ4en67AQVHhLt6QmtWlQUjLqfc=@1g4.org>
Date: Sun, 01 Feb 2026 09:57:25 +0000
From: Paul Moses <p@....org>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: netdev@...r.kernel.org, xiyou.wangcong@...il.com, jiri@...nulli.us, davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, horms@...nel.org, linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH net] net: sched: act_api: size RTM_GETACTION reply by fill size

The hardware manufacturers impose their own limits based on design constraints; it's not based on the spec. iproute2's value seems arbitrary: 1024 bytes comes out to about 32 entries, based on the message length of 3112 at 100 entries (not counting overhead). Is page size ever less than 4k? May as well see what can safely fit into NLMSG_GOODSIZE at its lowest possible value.
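
For reference, the back-of-envelope arithmetic behind those numbers (a quick sketch; the linear-scaling assumption ignores fixed per-message overhead, as noted above):
---
/* Rough consistency check of the sizing above, assuming the message
 * length scales linearly with the number of sched-entries.
 */
#include <stdio.h>

int main(void)
{
	const int len_at_100 = 3112;	/* observed message length at 100 entries */
	const int len_at_93  = 2904;	/* observed msg_len at the 93-entry failure */

	printf("~%d bytes per sched-entry\n", len_at_100 / 100);
	printf("1024-byte bound -> ~%d entries\n", 1024 * 100 / len_at_100);
	printf("93-entry failure -> ~%d bytes per entry (consistent)\n",
	       len_at_93 / 93);
	return 0;
}
---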

With 4k page size, the failure point appears to be 93 entries:
  large dump                     DEBUG: large dump msg_len=2904 cap=12288 entries=93 cycle_time=9304278

So bounding it at 64 entries or so (for now, at least) would be a safe choice: it maintains a margin without imposing an arbitrarily low value.
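
For concreteness, a clamp like that could live in the kernel's entry-list parsing. A minimal sketch, assuming a 64-entry cap and made-up helper/constant names (this is an illustration, not a tested change to act_gate.c):
---
/* Hypothetical helper: refuse a gate action whose sched-entry list
 * exceeds an agreed cap, so a single action always fits in an
 * NLMSG_GOODSIZE-backed reply skb.  Needs <net/netlink.h> for
 * nla_for_each_nested() and NL_SET_ERR_MSG().
 */
#define GATE_MAX_SCHED_ENTRIES 64	/* assumed cap, per the margin above */

static int gate_check_entry_count(struct nlattr *list,
				  struct netlink_ext_ack *extack)
{
	struct nlattr *entry;
	int rem, count = 0;

	nla_for_each_nested(entry, list, rem)
		count++;

	if (count > GATE_MAX_SCHED_ENTRIES) {
		NL_SET_ERR_MSG(extack, "too many gate sched-entries");
		return -EINVAL;
	}
	return 0;
}
---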

Yes, I've wanted to talk to Po for a while now. :)

Thanks,
Paul

On Saturday, January 31st, 2026 at 11:34 AM, Jamal Hadi Salim <jhs@...atatu.com> wrote:

> 
> 
> On Sat, Jan 31, 2026 at 12:18 PM Paul Moses p@....org wrote:
> 
> > 1. Your script creates 100 separate gate actions, not one gate action with a large schedule.
> > 2. Each “tc actions add … gate …” call creates a new action, so you end up with 100 small actions.
> > 3. The issue I am reporting needs one single gate action that contains many sched-entry objects.
> > 4. Because of that, your test only exercises the dump path with many small actions.
> > 5. The failure I see is in the GETACTION notify path, not in the generic dump batching logic.
> > 6. In that path, tcf_get_notify() allocates a fixed-size skb using NLMSG_GOODSIZE.
> > 7. The kernel then tries to serialize one action into that skb.
> > 8. If a single action contains a large gate schedule, tca_get_fill() runs out of tailroom and fails, and the kernel returns -EINVAL (see the sketch after this list).
> > 9. A single sched-entry does not exceed NLMSG_GOODSIZE.
> > 10. The problem is one action with many sched-entries, because the entire entry list is serialized into the payload of that one action.
> > 11. The “total acts 12 / 12 / 76” output only shows how many small actions were packed into each dump batch.
> > 12. It does not reflect the size of an individual action dump, and in your test each action is small.
> > 13. To reproduce with tc, you need one tc invocation that adds many sched-entry attributes to the same gate action, and then run “tc actions get action gate index <idx>” on that action.
> > 14. tc has its own limit at 1024, apparently: "addattr_l ERROR: message exceeded bound of 1024"
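> >
> > For illustration, here is a generic sketch of the "size the reply by the fill" idea from item 8. It is not the patch itself; the helper name, the fill callback convention and the size ceiling are placeholders:
> > ---
> > /* Start at NLMSG_GOODSIZE and retry with a larger skb whenever the
> >  * fill overflows, so one oversized action no longer turns into a
> >  * cryptic -EINVAL.  fill_fn is a placeholder that returns a negative
> >  * value when it runs out of room.
> >  */
> > #include <linux/err.h>
> > #include <linux/netlink.h>
> > #include <linux/skbuff.h>
> >
> > #define MAX_REPLY_SIZE	(4 * 1024 * 1024)	/* arbitrary ceiling */
> >
> > static struct sk_buff *alloc_reply_by_fill(int (*fill_fn)(struct sk_buff *, void *),
> > 					   void *ctx)
> > {
> > 	size_t size = NLMSG_GOODSIZE;
> > 	struct sk_buff *skb;
> >
> > 	for (;;) {
> > 		skb = alloc_skb(size, GFP_KERNEL);
> > 		if (!skb)
> > 			return ERR_PTR(-ENOBUFS);
> > 		if (fill_fn(skb, ctx) >= 0)
> > 			return skb;		/* message fit in the tailroom */
> > 		kfree_skb(skb);			/* overflow: grow and retry */
> > 		if (size >= MAX_REPLY_SIZE)
> > 			return ERR_PTR(-EMSGSIZE);
> > 		size *= 2;
> > 	}
> > }
> > ---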
> 
> 
> Yes, that's the same error I was getting (with the script below).
> ---
> ENTRY="sched-entry open 200000000 -1 8000000 sched-entry close 100000000 -1 -1 "
> SCHEDULE=$(printf "$ENTRY%.0s" {1..100})
> #SCHEDULE=$(printf "$ENTRY%.0s" {1..10})
> 
> for i in {1..2}; do
> echo "Iteration: $i"
> tc actions add action gate clockid CLOCK_TAI $SCHEDULE
> done
> ----
> 
> I know of no other action that exceeds this limit with all its params
> batched, and of course tc in userspace truncates it to about 32.
> Addition does succeed at 32 of those things per action.
> I have no idea if the above is legal, but it is allowed by the system.
> 
> > I'm not opposed to gate being clamped instead of adding support for large schedule sizes, but I wanted to thoroughly document why it's not possible so the next person isn't chasing a cryptic -EINVAL like I did.
> 
> 
> We can't have it be infinite for sure - we will need to put an upper
> bound in parse_gate_list().
> Are you knowledgeable about this spec? I was Ccing Po Liu but his
> email is bouncing (so I removed him).
> 
> So back to your first post: I agree we have an issue here. Your
> solution will solve the event notifications, but then we will need an
> upper bound check. We will also need to check that same upper bound in
> the user space iproute2 code so we don't allow arbitrary values. The
> current number of 16 seems to work just fine - if we agree that is a
> "good" number (or if the specs dictate it is) then you can simply
> provide that fix.
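>
> Purely as a sketch of that userspace side (not actual tc/m_gate.c code;
> the cap value is whatever we agree on), the guard could be as simple as:
> ---
> /* Refuse to build a request once the agreed per-action sched-entry
>  * cap is reached, instead of relying on the generic 1024-byte
>  * addattr_l bound to fail later with a confusing error.
>  */
> #include <stdio.h>
>
> #define GATE_MAX_SCHED_ENTRIES 16	/* assumed cap; must match the kernel */
>
> static int gate_check_entry_budget(int parsed_entries)
> {
> 	if (parsed_entries >= GATE_MAX_SCHED_ENTRIES) {
> 		fprintf(stderr, "gate: too many sched-entry items (max %d)\n",
> 			GATE_MAX_SCHED_ENTRIES);
> 		return -1;
> 	}
> 	return 0;
> }
> ---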
> 
> cheers,
> jamal
> 
> > Thanks
> > Paul
> > 
> > On Saturday, January 31st, 2026 at 11:14 AM, Jamal Hadi Salim jhs@...atatu.com wrote:
> > 
> > > On Sat, Jan 31, 2026 at 11:51 AM Jamal Hadi Salim jhs@...atatu.com wrote:
> > > 
> > > > 
> > > > On Fri, Jan 30, 2026 at 3:48 PM Paul Moses p@....org wrote:
> > > > 
> > > > > What version of act_gate.c are you currently testing?
> > > > 
> > > > I am running plain Ubuntu on this machine using their shipped kernel 6.8.0.
> > > > But I did look at the latest kernel tree and the dumping code has not changed.
> > > > +Cc Po Liu, who I believe added that code.
> > > > 
> > > > > Did you actually run the tests? "large dump" creates ONE action at base_index, with num_entries=100, then immediately does GETACTION. So "tc actions ls action gate | grep index | wc -l" won't exercise this, because it only counts actions; it doesn't amplify the per-action dump size (the entry list does). The tool uses libmnl (mnl_socket_sendto / mnl_socket_recvfrom) with MNL_SOCKET_BUFFER_SIZE; there is no custom netlink handling. The failure is returned by the kernel before userspace parses anything. The dumps are transactional at the netlink level, but an individual action dump still has to fit in the skb backing that message.
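> > > > >
> > > > > (Roughly, the receive side is just the standard libmnl pattern, something like the sketch below; the names are illustrative. mnl_cb_run() maps the kernel's NLMSG_ERROR to a negative return with errno set, so a kernel-side -EINVAL surfaces here before any attribute parsing.)
> > > > > ---
> > > > > #include <libmnl/libmnl.h>
> > > > > #include <stdio.h>
> > > > > #include <sys/types.h>
> > > > >
> > > > > /* Read one reply for a request already sent on @nl and surface
> > > > >  * any kernel error via errno. */
> > > > > static int recv_one_reply(struct mnl_socket *nl, unsigned int seq)
> > > > > {
> > > > > 	char buf[MNL_SOCKET_BUFFER_SIZE];
> > > > > 	unsigned int portid = mnl_socket_get_portid(nl);
> > > > > 	ssize_t n = mnl_socket_recvfrom(nl, buf, sizeof(buf));
> > > > >
> > > > > 	if (n < 0) {
> > > > > 		perror("mnl_socket_recvfrom");
> > > > > 		return -1;
> > > > > 	}
> > > > > 	if (mnl_cb_run(buf, n, seq, portid, NULL, NULL) < 0) {
> > > > > 		perror("RTM_GETACTION");	/* e.g. EINVAL when the fill fails */
> > > > > 		return -1;
> > > > > 	}
> > > > > 	return 0;
> > > > > }
> > > > > ---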
> > > > 
> > > > Sorry - I am not running your code (didn't want to compile anything on
> > > > this machine), just plain tc, and I have to admit I don't know much
> > > > about the mechanics or spec for gate, so my example is based on
> > > > something Po Liu posted. Here's a script to add 100 entries:
> > > > ---
> > > > for i in {1..100}; do
> > > > echo "$i"
> > > > tc actions add action gate clockid CLOCK_TAI sched-entry open \
> > > > 200000000 -1 8000000 sched-entry close 100000000 -1 -1
> > > > done
> > > > ---
> > > > 
> > > > Then dumping:
> > > > 
> > > > $ sudo tc actions ls action gate | grep index
> > > > index 1 ref 1 bind 0
> > > > index 2 ref 1 bind 0
> > > > index 3 ref 1 bind 0
> > > > index 4 ref 1 bind 0
> > > > index 5 ref 1 bind 0
> > > > index 6 ref 1 bind 0
> > > > ..
> > > > ...
> > > > ....
> > > > index 95 ref 1 bind 0
> > > > index 96 ref 1 bind 0
> > > > index 97 ref 1 bind 0
> > > > index 98 ref 1 bind 0
> > > > index 99 ref 1 bind 0
> > > > index 100 ref 1 bind 0
> > > > $
> > > > 
> > > > > Look at af_netlink.c:
> > > > > /* NLMSG_GOODSIZE is small to avoid high order allocations being
> > > > > * required, but it makes sense to attempt a 32KiB allocation
> > > > > * to reduce number of system calls on dump operations, if user
> > > > > * ever provided a big enough buffer.
> > > > > */
> > > > > ...
> > > > > /* Trim skb to allocated size. User is expected to provide buffer as
> > > > > * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
> > > > > * netlink_recvmsg())). dump will pack as many smaller messages as
> > > > > * could fit within the allocated skb. skb is typically allocated
> > > > > * with larger space than required (could be as much as near 2x the
> > > > > * requested size with align to next power of 2 approach). Allowing
> > > > > * dump to use the excess space makes it difficult for a user to have a
> > > > > * reasonable static buffer based on the expected largest dump of a
> > > > > * single netdev. The outcome is MSG_TRUNC error.
> > > > > */
> > > > > 
> > > > > This is where I am currently, but I have seen these bugs appear throughout all my iterations, including what's in the tree currently. If you show me better alternatives that solve my problems, I'll gladly accept them.
> > > > > https://github.com/torvalds/linux/compare/master...jopamo:linux:net-stable-upstream-v4
> > > > 
> > > > I don't see a problem with "dump" as you seem to be suggesting. I asked
> > > > earlier whether it is possible that you can create some single entry - not
> > > > 100 as shown above - that will consume more than NLMSG_GOODSIZE. My
> > > > limited knowledge is not helping me see such a scenario.
> > > 
> > > Aha. I think there is a terminology mixup ;->
> > > 
> > > "dump" (a very unfortunate use of that word in the netlink world ;->)
> > > 
> > > is a very special word. So when you take a dump in this world you are
> > > GETing a whole table. In this case all the gate actions.
> > > 
> > > If I am not mistaken, in your case this is not a dump - rather, you are
> > > CREATing a single entry which is bigger than NLMSG_GOODSIZE, as I
> > > suspected. I don't believe iproute2 will allow you to do that.
> > > What's happening then is that the generated netlink event notification
> > > for that single entry is too big to fit in NLMSG_GOODSIZE.
> > > Let me try to craft something for that...
> > > 
> > > cheers,
> > > jamal
