lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <32f56a2e-8142-4391-916a-65fe51a57933@ssi.bg>
Date: Wed, 17 Apr 2024 16:02:14 +0300 (EEST)
From: Julian Anastasov <ja@....bg>
To: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>
cc: horms@...ge.net.au, netdev@...r.kernel.org, lvs-devel@...r.kernel.org,
        netfilter-devel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Stéphane Graber <stgraber@...raber.org>,
        Christian Brauner <brauner@...nel.org>,
        Pablo Neira Ayuso <pablo@...filter.org>,
        Jozsef Kadlecsik <kadlec@...filter.org>,
        Florian Westphal <fw@...len.de>
Subject: Re: [PATCH net-next] ipvs: allow some sysctls in non-init user
 namespaces


	Hello,

On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls visible and RO even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be writable:
> - conntrack
> - conn_reuse_mode
> - expire_nodest_conn
> - expire_quiescent_template
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This list of sysctls was chosen because I can't
> see any security risks allowing them and also
> Kubernetes uses [2] these specific sysctls.
> 
> This patch is motivated by user request in the LXC
> project [1].
> 
> [1] https://github.com/lxc/lxc/issues/4278
> [2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> 
> Cc: Stéphane Graber <stgraber@...raber.org>
> Cc: Christian Brauner <brauner@...nel.org>
> Cc: Julian Anastasov <ja@....bg>
> Cc: Simon Horman <horms@...ge.net.au>
> Cc: Pablo Neira Ayuso <pablo@...filter.org>
> Cc: Jozsef Kadlecsik <kadlec@...filter.org>
> Cc: Florian Westphal <fw@...len.de>
> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@...onical.com>
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..92a818c2f783 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4285,10 +4285,22 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)

	As the list of privileged vars is short I prefer
to use a bool and to make only some vars read-only:

	bool unpriv = false;

>  		if (tbl == NULL)
>  			return -ENOMEM;
>  
> -		/* Don't export sysctls to unprivileged users */
> +		/* Let's show all sysctls in non-init user namespace-owned
> +		 * net namespaces, but make them read-only.
> +		 *
> +		 * Allow only a few specific sysctls to be writable.
> +		 */
>  		if (net->user_ns != &init_user_ns) {

	Here we should just set: unpriv = true;

> -			tbl[0].procname = NULL;
> -			ctl_table_size = 0;
> +			for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> +				if (!tbl[idx].procname)
> +					continue;
> +
> +				if (!((strcmp(tbl[idx].procname, "conntrack") == 0) ||
> +				      (strcmp(tbl[idx].procname, "conn_reuse_mode") == 0) ||
> +				      (strcmp(tbl[idx].procname, "expire_nodest_conn") == 0) ||
> +				      (strcmp(tbl[idx].procname, "expire_quiescent_template") == 0)))
> +					tbl[idx].mode = 0444;
> +			}
>  		}
>  	} else
>  		tbl = vs_vars;

	And below at every place to use:

	if (unpriv)
		tbl[idx].mode = 0444;

	for the following 4 privileged sysctl vars:

- sync_qlen_max:
	- allocates messages in kernel context
	- this needs better tunning in another patch

- sync_sock_size:
	- allocates messages in kernel context

- run_estimation:
	- for now, better init ns to decide if to use est stats

- est_nice:
	- for now, better init ns to decide the value

- debug_level:
	- already set to 0444

	I.e. these vars allocate resources (mem, CPU) without
proper control, so for now we will just copy them from init ns
without allowing writing. And they are vars that are not tuned
often. Also we do not know which netns is supposed to be the
privileged one, some solutions move all devices out of init_net,
so we can not decide where to use lower limits.

	OTOH, "amemthresh" is not privileged but needs single READ_ONCE 
for sysctl_amemthresh in update_defense_level() due to the possible
div by zero if we allow writing to anyone, eg.:

	int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
	...
	nomem = availmem < amemthresh;
	... use only amemthresh

	All other vars can be writable.

Regards

--
Julian Anastasov <ja@....bg>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ