Message-ID: <CAGXJAmxbv=0Aw61cfOg+mtcrheV7y3db7xYcwTOfjvLYT61P7g@mail.gmail.com>
Date: Tue, 10 Dec 2024 17:59:00 -0800
From: John Ousterhout <ouster@...stanford.edu>
To: "D. Wythe" <alibuda@...ux.alibaba.com>
Cc: netdev@...r.kernel.org
Subject: Re: [PATCH net-next v3 11/12] net: homa: create homa_plumbing.c homa_utils.c
On Mon, Dec 9, 2024 at 10:08 PM D. Wythe <alibuda@...ux.alibaba.com> wrote:
>
> > +	if (first_packet)
> > +		first_packet = 0;
> Looks useless.
In the full GitHub version of Homa there is diagnostic code that
executes when first_packet is true; I have a script that automatically
strips out the diagnostic code, but it isn't smart enough to optimize
out the check (and the variable declaration). I refactored the code to
work around the limitations of the stripper, so this code won't appear
in future versions of the patch series.
> > +
> > +	/* Process the packet now if it is a control packet or
> > +	 * if it contains an entire short message.
> > +	 */
> > +	if (h->type != DATA || ntohl(((struct data_header *)h)
> > +			->message_length) < 1400) {
>
> Why 1400? Maybe compare with skb->len?
1400 is a somewhat arbitrary constant to separate short messages from
longer ones; it's roughly one packet. skb->len isn't exactly one
packet either (a packet can't hold quite skb->len of message data, so
this would allow messages longer than one packet to get the fast
path). I could compute *exactly* one packet's worth of data, but
that's a nontrivial calculation and I don't think it's worth either
the code complexity or the runtime. The exact value of the constant
doesn't really matter...
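To make the intent concrete, the check amounts to something like the
following (the constant and helper names here are made up, they aren't
in the patch):

/* Illustration only: the threshold is approximate, roughly one
 * packet's worth of message data.
 */
#define HOMA_SHORT_MSG_BYTES 1400

static inline bool homa_pkt_fast_path(struct common_header *h)
{
	if (h->type != DATA)
		return true;
	return ntohl(((struct data_header *)h)->message_length) <
			HOMA_SHORT_MSG_BYTES;
}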
> > +		if (h2->sender_id == h->sender_id) {
> > +			saddr2 = skb_canonical_ipv6_saddr(skb2);
> > +			if (ipv6_addr_equal(&saddr, &saddr2)) {
>
> Must two skbs with the same ID and source address come from the same
> RPC? If so, why is there an additional check for dport in
> homa_find_server_rpc()?
It used to be that RPC ids were only unique to a port, not to a node;
when I changed Homa to make ids unique within a client node I somehow
forgot to remove the dport argument from homa_find_server_rpc. I have
now removed that argument.
> In any case, the check for whether an skb comes from the same RPC
> should be consistent, implemented as a single shared function or macro.
I agree with this in principle, but this case is a bit special: it's
on the fast path, and I'd like to not invoke skb_canonical_ipv6_saddr
if the ids don't match; if I move this code to a shared function, I'll
have to pass in the canonical address which will require the
conversion even when the ids don't match. I don't think there are any
other places in the code that compare packets for "same RPC" except
the lookup code in homa_rpc.c, so a shared function would probably
only be used in this one place. Thus, I'm inclined to leave this code
as is. If you still feel that I should take the more expensive but
more modular path let me know and I'll bite the bullet. Another option
would be to write a special-case inline function that does this check
efficiently, but move it to homa_rpc.h so that it's grouped with other
RPC-related code; would that mitigate your concern (its interface will
be a bit awkward)?
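Concretely, that inline helper would look roughly like this (the name
is illustrative, and part of the awkwardness is that it takes the first
packet's already-canonicalized source address):

/* Sketch only: true if skb2 appears to belong to the same RPC as a
 * packet with header h and canonical source address saddr. The cheap
 * id comparison comes first, so the address conversion is skipped
 * whenever the ids don't match.
 */
static inline bool homa_pkts_same_rpc(struct common_header *h,
				      const struct in6_addr *saddr,
				      struct common_header *h2,
				      struct sk_buff *skb2)
{
	struct in6_addr saddr2;

	if (h2->sender_id != h->sender_id)
		return false;
	saddr2 = skb_canonical_ipv6_saddr(skb2);
	return ipv6_addr_equal(saddr, &saddr2);
}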
> > +				*prev_link = skb2;
> > +				prev_link = &skb2->next;
> > +				continue;
> > +			}
> > +		}
> > +		*other_link = skb2;
> > +		other_link = &skb2->next;
> > +	}
> > +	*prev_link = NULL;
> > +	*other_link = NULL;
> > +	homa_dispatch_pkts(packets, homa);
> Is it really necessary to gather packets belonging to the same RPC and
> process them together, just to save some time looking up the RPC?
>
> If each packet is independent, this instead adds a lot of unnecessary
> cycles.
Grouping the packets for an RPC doesn't matter in this stripped-down
version of Homa, but when the grant mechanism gets added in later
patches it will make a significant difference. Batching the packets
for an RPC allows Homa to invoke homa_grant_check_rpc (which checks to
see whether additional grants need to be sent out) only once, after
all of the packets for the RPC have been processed. This is a
significant optimization because (a) homa_grant_check_rpc is fairly
expensive and (b) we can often just send a single grant packet,
whereas calling homa_grant_check_rpc after each individual packet
could result in multiple grants being sent. I have added a short
comment to motivate the batching.
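Roughly, once grants exist the per-RPC processing has this shape
(homa_grant_check_rpc is the real function; the other names in this
sketch are just illustrative):

/* Sketch: do the per-packet work for one RPC's batch, then a single
 * pass of grant bookkeeping for the whole batch.
 */
static void homa_dispatch_batch(struct homa_rpc *rpc,
				struct sk_buff *packets)
{
	struct sk_buff *skb, *next;

	for (skb = packets; skb; skb = next) {
		next = skb->next;
		homa_data_pkt(skb, rpc);	/* per-packet work */
	}

	/* Called once per batch; typically results in at most one
	 * grant packet instead of one per data packet.
	 */
	homa_grant_check_rpc(rpc);
}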
> > +int homa_err_handler_v4(struct sk_buff *skb, u32 info)
> > +{
> > +	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> > +	const struct iphdr *iph = ip_hdr(skb);
> What's this for? It is only used in the ICMP_PORT_UNREACH case, and
> only after being reassigned.
This is another case where diagnostic code in the GitHub version
caused ugliness after stripping. I've refactored to eliminate the
problem.
> > +
> > +		iph = (struct iphdr *)(icmp + sizeof(struct icmphdr));
> > +		h = (struct common_header *)(icmp + sizeof(struct icmphdr)
> > +				+ iph->ihl * 4);
>
> You need to ensure that the common_header can be accessed linearly;
> icmp_socket_deliver() only guarantees that the full IP header plus 8
> bytes can be accessed linearly.
This code only accesses the destination port, which is within the
first 4 bytes of the common_header (the same position as in a TCP
header). Thus I think it's safe without calling pskb_may_pull?
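For what it's worth, that assumption could also be stated explicitly in
the code; something like the following (sketch only, it would sit at
the top of homa_err_handler_v4):

/* dport must lie within the 8 bytes beyond the inner IP header that
 * icmp_socket_deliver() guarantees to be linear.
 */
BUILD_BUG_ON(offsetof(struct common_header, dport) +
	     sizeof_field(struct common_header, dport) > 8);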
> > +int homa_err_handler_v6(struct sk_buff *skb, struct inet6_skb_parm *opt,
> > +			u8 type, u8 code, int offset, __be32 info)
> > +{
> > +	const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
> > +	struct homa *homa = global_homa;
> > +
> > +	if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_PORT_UNREACH) {
> > +		char *icmp = (char *)icmp_hdr(skb);
> > +		struct common_header *h;
> > +
> > +		iph = (struct ipv6hdr *)(icmp + sizeof(struct icmphdr));
> > +		h = (struct common_header *)(icmp + sizeof(struct icmphdr)
> > +				+ HOMA_IPV6_HEADER_LENGTH);
> > +		homa_abort_rpcs(homa, &iph->daddr, ntohs(h->dport), -ENOTCONN);
>
> This cannot be right; ICMPv6 and ICMP differ. At this point there is
> no icmphdr anymore; you should obtain the common header from
> skb->data + offset.
Yow, you're right. I've completely refactored this (and tested it more
thoroughly...).
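The fix goes in the direction you describe: for ICMPv6 the common
header has to come from skb->data + offset. Very roughly (this is a
sketch of the idea, not the actual refactored code):

if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_PORT_UNREACH) {
	/* skb->data points at the IPv6 header of the packet we
	 * originally sent; its transport header starts at offset.
	 */
	const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
	struct common_header *h;

	h = (struct common_header *)(skb->data + offset);
	homa_abort_rpcs(homa, &iph->daddr, ntohs(h->dport), -ENOTCONN);
}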
> You also need to ensure that the common_header can be accessed
> linearly; only 8 bytes of it were checked in icmpv6_notify().
Same situation as for IPv4: only the destination port is used.
> > +__poll_t homa_poll(struct file *file, struct socket *sock,
> > +		   struct poll_table_struct *wait)
> > +{
> > +	struct sock *sk = sock->sk;
> > +	__u32 mask;
> > +
> > +	/* It seems to be standard practice for poll functions *not* to
> > +	 * acquire the socket lock, so we don't do it here; not sure
> > +	 * why...
> > +	 */
> > +
> That's true. sock_poll_wait() first adds the socket to a wait queue,
> so even if the poll reads a stale state afterwards, once the state is
> updated, wake_up_interruptible_sync_poll() (called from sk_data_ready
> etc.) will cause the socket to be polled again.
Thanks for the explanation; I have now removed the comment.
> > +	sock_poll_wait(file, sock, wait);
> > +	mask = POLLOUT | POLLWRNORM;
> > +
> > +	if (!list_empty(&homa_sk(sk)->ready_requests) ||
> > +	    !list_empty(&homa_sk(sk)->ready_responses))
> > +		mask |= POLLIN | POLLRDNORM;
> > +	return (__poll_t)mask;
> > +}
>
> Always writable? At the very least, you should check the state of the
> socket. For example, is a socket that has already been shut down still
> writable? Also, how does Homa notify the user when it receives an ICMP
> error? You need to consider the return value of this poll carefully;
> it is very important for the behavior of the application.
I think it's OK for Homa sockets to be writable always. If the socket
has been shut down and an application attempts to write to it, it will
get ESHUTDOWN at that point. Message sends in Homa are always
asynchronous, so there is no notion of them blocking. ICMP errors are
reflected at the level of RPCs (on the server side, the RPC is
discarded if an ICMP error occurs; on the client side, an error will be
returned as the RPC's response, which will make the socket "readable").
However, your comment makes me wonder about polling for read on a
socket that has been shut down. If I don't return -ESHUTDOWN from
homa_poll in this case, I believe the application could block waiting
for reads on a socket that has been shut down, and of course those
reads will never complete. So, should I check for shutdown and return
-ESHUTDOWN immediately in homa_poll, before invoking sock_poll_wait?
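Or, since poll handlers return a mask rather than an errno, maybe the
right thing is just to fold the shutdown state into the mask so that
blocked readers wake up. A rough sketch of what I mean (the per-socket
shutdown flag is hypothetical and the exact bits are debatable):

sock_poll_wait(file, sock, wait);
mask = POLLOUT | POLLWRNORM;
if (homa_sk(sk)->shutdown)		/* hypothetical flag */
	mask |= POLLIN | POLLRDNORM | POLLHUP;
else if (!list_empty(&homa_sk(sk)->ready_requests) ||
	 !list_empty(&homa_sk(sk)->ready_responses))
	mask |= POLLIN | POLLRDNORM;
return (__poll_t)mask;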
-John-