lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Fri, 25 Nov 2022 07:15:33 +0100 From: Jan Karcher <jaka@...ux.ibm.com> To: Alexandra Winter <wintera@...ux.ibm.com>, Tony Lu <tonylu@...ux.alibaba.com> Cc: David Miller <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org, linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>, Wenjia Zhang <wenjia@...ux.ibm.com>, Thorsten Winkler <twinkler@...ux.ibm.com>, Stefan Raspl <raspl@...ux.ibm.com>, Karsten Graul <kgraul@...ux.ibm.com> Subject: Re: [PATCH net] net/smc: Fix expected buffersizes and sync logic On 24/11/2022 15:07, Alexandra Winter wrote: > > > On 24.11.22 14:00, Alexandra Winter wrote: >> >> [ ... ]>>>>> On Wed, Nov 23, 2022 at 11:49:07AM +0100, Jan Karcher wrote: >>>>>> The fixed commit changed the expected behavior of buffersizes >>>>>> set by the user using the setsockopt mechanism. >>>>>> Before the fixed patch the logic for determining the buffersizes used >>>>>> was the following: >>>>>> >>>>>> default = net.ipv4.tcp_{w|r}mem[1] > Jan, you explained to me: "the minima is 16Kib. This is enforced in smc_compress_bufsize > which would move any value <= 16Kib into bucket 0 - which is 16KiB " > net.ipv4.tcp_wmem[1] defaults to 8Kib. So in the default case (unchanged net.ipv4.tcp_wmem[1]) > the default for the send path is not net.ipv4.tcp_wmem[1]. Should be clarified here. The default value is still set to the net.ipv4.tcp_{w|r}mem[1]. This is a *very* top level overview about what is happening and *not* a documentation. I don't really want to explain the full code flow here. What we still should do - as Tony aggreed on - is documenting the SMC behavior. This is a follow up on my list. >>>>>> sockopt = the setsockopt mechanism >>>>>> val = the value assigned in default or via setsockopt >>>>>> sk_buf = short for sk_{snd|rcv}buf >>>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create) >>>>>> >>>>>> exposed | net/core/sock.c | af_smc.c | smc_core.c >>>>>> | | | >>>>>> +---------+ | | +------------+ | +-------------------+ >>>>>> | default |----------------------| sk_buf=val |---| real_buf=sk_buf/2 | >>>>>> +---------+ | | +------------+ | +-------------------+ >>>>>> | | | ^ >>>>>> | | | | >>>>>> +---------+ | +--------------+ | | | >>>>>> | sockopt |---| sk_buf=val*2 |-----------------------| >>>>>> +---------+ | +--------------+ | | >>>>>> | | | >>>>>> >>>>>> The fixed patch introduced a dedicated sysctl for smc >>>>>> and removed the /2 in smc_core.c resulting in the following flow: >>>>>> >>>>>> default = net.smc.{w|r}mem (which defaults to net.ipv4.tcp_{w|r}mem[1]) >>>>>> sockopt = the setsockopt mechanism >>>>>> val = the value assigned in default or via setsockopt >>>>>> sk_buf = short for sk_{snd|rcv}buf >>>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create) >>>>>> >>>>>> exposed | net/core/sock.c | af_smc.c | smc_core.c >>>>>> | | | >>>>>> +---------+ | | +------------+ | +-----------------+ >>>>>> | default |----------------------| sk_buf=val |---| real_buf=sk_buf | >>>>>> +---------+ | | +------------+ | +-----------------+ >>>>>> | | | ^ >>>>>> | | | | >>>>>> +---------+ | +--------------+ | | | >>>>>> | sockopt |---| sk_buf=val*2 |-----------------------| >>>>>> +---------+ | +--------------+ | | >>>>>> | | | >>>>>> >>>>>> This would result in double of memory used for existing configurations >>>>>> that are using setsockopt. >>>>> >>>>> Firstly, thanks for your detailed diagrams :-) >>>>> >>>>> And the original decision to use user-provided values rather than >>>>> value/2 to follow the instructions of the socket manual [1]. >>>>> >>>>> SO_RCVBUF >>>>> Sets or gets the maximum socket receive buffer in bytes. >>>>> The kernel doubles this value (to allow space for >>>>> bookkeeping overhead) when it is set using setsockopt(2), >>>>> and this doubled value is returned by getsockopt(2). The >>>>> default value is set by the >>>>> /proc/sys/net/core/rmem_default file, and the maximum >>>>> allowed value is set by the /proc/sys/net/core/rmem_max >>>>> file. The minimum (doubled) value for this option is 256. >>>>> >>>>> [1] https://man7.org/linux/man-pages/man7/socket.7.html >>>>> >>>>> The user of SMC should know that setsockopt() with SO_{RCV|SND}BUF will >>>> >>>> I totally agree that an educated user of SMC should know about that behavior >>>> if they decide to use it. >>>> We do provide our users preload libraries where they can pass preferred >>>> buffersizes via arguments and we handle the Sockopts for them. >>>> >>>>> double the values in kernel, and getsockopt() will return the doubled >>>>> values. So that they should use half of the values which are passed to >>>>> setsockopt(). The original patch tries to make things easier in SMC and >>>>> let user-space to handle them following the socket manual. >>>>> >>>>>> SMC historically decided to use the explicit value given by the user >>>>>> to allocate the memory. This is why we used the /2 in smc_core.c. >>>>>> That logic was not applied to the default value. >>>>> >>>>> Yep, let back to the patch which introduced smc_{w|r}mem knobs, it's a >>>>> trade-off to follow original logic of SMC, or follow the socket manual. >>>>> We decides to follow the instruction of manuals in the end. >>>> >>>> I understand the point. I spend a lot of time trying to decide what to do. >>>> >>>> Since it was an intentional decision to not follow the general socket >>>> option, and we do not have anyone complaining we do not really have a reason >>>> to change it. >>>> Changing it means that users with existing configurations would have to >>>> change their configs on an update or suddenly expect double the memory >>>> consumption. >>>> That's why we in the end preffered to stay with the current logic. >>> >>> I can't agree with you more with the points to follow the historic logic >>> and not break the user-space applications. >>> >>>> I'm thinking that maybe - if we stay with the historic logic - we should >>>> document that desicion somewhere. So that in the future, if a user that >>>> expects the man page behavior, has a way to understand what SMC is doing. >>>> What do oyu think? >>> >>> Yep, we _really_ need to document it if we change the convention. >>> Actually, I spent a lot of time to find the history about the logic of >>> buffer (/2 and *2) in SMC. So I'm really in favor of adding >>> documentation, at least code comments to help others to understand them. >>> >>> Cheers, >>> Tony Lu >> Iiuc you are changing the default values in this a patch and your other patch: >> Default values for real_buf for send and receive: >> >> before 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") >> real_buf=net.ipv4.tcp_{w|r}mem[1]/2 send: 8k recv: 64k > see above: send: 16k recv: 64k >> >> after 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") >> real_buf=net.ipv4.tcp_{w|r}mem[1] send: 16k (16*1024) recv: 128k (131072) >> >> after net/smc: Fix expected buffersizes and sync logic >> real_buf=net.ipv4.tcp_{w|r}mem[1] send: 16k (16*1024) recv: 128k (131072) >> >> after net/smc: Unbind smc control from tcp control >> real_buf=SMC_*BUF_INIT_SIZE send: 16k (16384) recv: 64k (65536) >> >> If my understanding is correct, then I nack this. >> Defaults should be restored to the values before 0227f058aa29. >> Otherwise users will notice a change in memory usage that needs to >> be avoided or announced more explicitely. (and don't change them twice) > See above, I stand corrected. However this patch fixes/restores the buffersize > for setsockopt, but not for the default recieve path. > Could you please clarify that in the title and description? > I am trying to keep the commit title as crisp as possible while providing enough information and set the context in the commit message: "The fixed commit changed the expected behavior of buffersizes set by the user using the setsockopt mechanism." + There is now a whole e-mail thread to consult in case of any further questions. Thank you for your comments - Jan > Reviewed-by: Alexandra Winter <wintera@...ux.ibm.com> >>> >>>> - Jan >>>> >>>>> >>>>> Cheers, >>>>> Tony Lu >>>>> >>>>>> Since we now have our own sysctl, which is also exposed to the user, >>>>>> we should sync the logic in a way that both values are the real value >>>>>> used by our code and shown by smc_stats. To achieve this this patch >>>>>> changes the behavior to: >>>>>> >>>>>> default = net.smc.{w|r}mem (which defaults to net.ipv4.tcp_{w|r}mem[1]) >>>>>> sockopt = the setsockopt mechanism >>>>>> val = the value assigned in default or via setsockopt >>>>>> sk_buf = short for sk_{snd|rcv}buf >>>>>> real_buf = the real size of the buffer (sk_buf_size in __smc_buf_create) >>>>>> >>>>>> exposed | net/core/sock.c | af_smc.c | smc_core.c >>>>>> | | | >>>>>> +---------+ | | +-------------+ | +-----------------+ >>>>>> | default |----------------------| sk_buf=val*2|---|real_buf=sk_buf/2| >>>>>> +---------+ | | +-------------+ | +-----------------+ >>>>>> | | | ^ >>>>>> | | | | >>>>>> +---------+ | +--------------+ | | | >>>>>> | sockopt |---| sk_buf=val*2 |------------------------| >>>>>> +---------+ | +--------------+ | | >>>>>> | | | >>>>>> >>>>>> This way both paths follow the same pattern and the expected behavior >>>>>> is re-established. >>>>>> >>>>>> Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") >>>>>> Signed-off-by: Jan Karcher <jaka@...ux.ibm.com> >>>>>> Reviewed-by: Wenjia Zhang <wenjia@...ux.ibm.com> >>>>>> --- >>>>>> net/smc/af_smc.c | 9 +++++++-- >>>>>> net/smc/smc_core.c | 8 ++++---- >>>>>> 2 files changed, 11 insertions(+), 6 deletions(-) >>>>>> >>>>>> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c >>>>>> index 036532cf39aa..a8c84e7bac99 100644 >>>>>> --- a/net/smc/af_smc.c >>>>>> +++ b/net/smc/af_smc.c >>>>>> @@ -366,6 +366,7 @@ static void smc_destruct(struct sock *sk) >>>>>> static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, >>>>>> int protocol) >>>>>> { >>>>>> + int buffersize_without_overhead; >>>>>> struct smc_sock *smc; >>>>>> struct proto *prot; >>>>>> struct sock *sk; >>>>>> @@ -379,8 +380,12 @@ static struct sock *smc_sock_alloc(struct net *net, struct socket *sock, >>>>>> sk->sk_state = SMC_INIT; >>>>>> sk->sk_destruct = smc_destruct; >>>>>> sk->sk_protocol = protocol; >>>>>> - WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(net->smc.sysctl_wmem)); >>>>>> - WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(net->smc.sysctl_rmem)); >>>>>> + buffersize_without_overhead = >>>>>> + min_t(int, READ_ONCE(net->smc.sysctl_wmem), INT_MAX / 2); >>>>>> + WRITE_ONCE(sk->sk_sndbuf, buffersize_without_overhead * 2); >>>>>> + buffersize_without_overhead = >>>>>> + min_t(int, READ_ONCE(net->smc.sysctl_rmem), INT_MAX / 2); >>>>>> + WRITE_ONCE(sk->sk_rcvbuf, buffersize_without_overhead * 2); >>>>>> smc = smc_sk(sk); >>>>>> INIT_WORK(&smc->tcp_listen_work, smc_tcp_listen_work); >>>>>> INIT_WORK(&smc->connect_work, smc_connect_work); >>>>>> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c >>>>>> index 00fb352c2765..36850a2ae167 100644 >>>>>> --- a/net/smc/smc_core.c >>>>>> +++ b/net/smc/smc_core.c >>>>>> @@ -2314,10 +2314,10 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) >>>>>> if (is_rmb) >>>>>> /* use socket recv buffer size (w/o overhead) as start value */ >>>>>> - sk_buf_size = smc->sk.sk_rcvbuf; >>>>>> + sk_buf_size = smc->sk.sk_rcvbuf / 2; >>>>>> else >>>>>> /* use socket send buffer size (w/o overhead) as start value */ >>>>>> - sk_buf_size = smc->sk.sk_sndbuf; >>>>>> + sk_buf_size = smc->sk.sk_sndbuf / 2; >>>>>> for (bufsize_short = smc_compress_bufsize(sk_buf_size, is_smcd, is_rmb); >>>>>> bufsize_short >= 0; bufsize_short--) { >>>>>> @@ -2376,7 +2376,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) >>>>>> if (is_rmb) { >>>>>> conn->rmb_desc = buf_desc; >>>>>> conn->rmbe_size_short = bufsize_short; >>>>>> - smc->sk.sk_rcvbuf = bufsize; >>>>>> + smc->sk.sk_rcvbuf = bufsize * 2; >>>>>> atomic_set(&conn->bytes_to_rcv, 0); >>>>>> conn->rmbe_update_limit = >>>>>> smc_rmb_wnd_update_limit(buf_desc->len); >>>>>> @@ -2384,7 +2384,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb) >>>>>> smc_ism_set_conn(conn); /* map RMB/smcd_dev to conn */ >>>>>> } else { >>>>>> conn->sndbuf_desc = buf_desc; >>>>>> - smc->sk.sk_sndbuf = bufsize; >>>>>> + smc->sk.sk_sndbuf = bufsize * 2; >>>>>> atomic_set(&conn->sndbuf_space, bufsize); >>>>>> } >>>>>> return 0; >>>>>> -- >>>>>> 2.34.1
Powered by blists - more mailing lists