linux-kernel - Re: [PATCH 9/9] habanalabs: allow multiple processes to open FD

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190728114435.GB12033@kroah.com>
Date:   Sun, 28 Jul 2019 13:44:35 +0200
From:   Greg KH <gregkh@...uxfoundation.org>
To:     Oded Gabbay <oded.gabbay@...il.com>
Cc:     linux-kernel@...r.kernel.org, oshpigelman@...ana.ai,
        ttayar@...ana.ai
Subject: Re: [PATCH 9/9] habanalabs: allow multiple processes to open FD

On Sun, Jul 28, 2019 at 02:28:18PM +0300, Oded Gabbay wrote:
> This patch removes the limitation of a single process that can open the
> device.
> 
> Now, there is no limitation on the number of processes that can open the
> device and have a valid FD.
> 
> However, only a single process can perform compute operations. This is
> enforced by allowing only a single process to have a compute context.
> 
> Signed-off-by: Oded Gabbay <oded.gabbay@...il.com>
> ---
>  drivers/misc/habanalabs/context.c          | 100 +++++++++++++++------
>  drivers/misc/habanalabs/device.c           |  18 ++--
>  drivers/misc/habanalabs/habanalabs.h       |   1 -
>  drivers/misc/habanalabs/habanalabs_drv.c   |   8 --
>  drivers/misc/habanalabs/habanalabs_ioctl.c |   7 +-
>  5 files changed, 85 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/misc/habanalabs/context.c b/drivers/misc/habanalabs/context.c
> index 57bbe59da9b6..f64220fc3a55 100644
> --- a/drivers/misc/habanalabs/context.c
> +++ b/drivers/misc/habanalabs/context.c
> @@ -56,7 +56,7 @@ void hl_ctx_do_release(struct kref *ref)
>  	kfree(ctx);
>  }
>  
> -int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
> +static int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
>  {
>  	struct hl_ctx_mgr *mgr = &hpriv->ctx_mgr;
>  	struct hl_ctx *ctx;
> @@ -89,9 +89,6 @@ int hl_ctx_create(struct hl_device *hdev, struct hl_fpriv *hpriv)
>  	/* TODO: remove for multiple contexts per process */
>  	hpriv->ctx = ctx;
>  
> -	/* TODO: remove the following line for multiple process support */
> -	hdev->compute_ctx = ctx;
> -
>  	return 0;
>  
>  remove_from_idr:
> @@ -206,13 +203,22 @@ bool hl_ctx_is_valid(struct hl_fpriv *hpriv, bool requires_compute_ctx)
>  	int rc;
>  
>  	/* First thing, to minimize latency impact, check if context exists.
> -	 * Also check if it matches the requirements. If so, exit immediately
> +	 * This is relevant for the "steady state", where a process context
> +	 * already exists, and we want to minimize the latency in command
> +	 * submissions. In that case, we want to see if we can quickly exit
> +	 * with a valid answer.
> +	 *
> +	 * If a context doesn't exists, we must grab the mutex. Otherwise,
> +	 * there can be nasty races in case of multi-threaded application.
> +	 *
> +	 * So, if the context exists and we don't need a compute context,
> +	 * that's fine. If it exists and the context we have is the compute
> +	 * context, that's also fine. Other then that, we can't check anything
> +	 * without the mutex.
>  	 */
> -	if (hpriv->ctx) {
> -		if ((requires_compute_ctx) && (hdev->compute_ctx != hpriv->ctx))
> -			return false;
> +	if ((hpriv->ctx) && ((!requires_compute_ctx) ||
> +					(hdev->compute_ctx == hpriv->ctx)))
>  		return true;
> -	}
>  
>  	mutex_lock(&hdev->lazy_ctx_creation_lock);
>  
> @@ -222,35 +228,73 @@ bool hl_ctx_is_valid(struct hl_fpriv *hpriv, bool requires_compute_ctx)
>  	 * creation of a context
>  	 */
>  	if (hpriv->ctx) {
> -		if ((requires_compute_ctx) && (hdev->compute_ctx != hpriv->ctx))
> +		if ((!requires_compute_ctx) ||
> +					(hdev->compute_ctx == hpriv->ctx))
> +			goto unlock_mutex;
> +
> +		if (hdev->compute_ctx) {
>  			valid = false;
> -		goto unlock_mutex;
> -	}
> +			goto unlock_mutex;
> +		}
>  
> -	/* If we already have a compute context, there is no point
> -	 * of creating one in case we are called from ioctl that needs
> -	 * a compute context
> -	 */
> -	if ((hdev->compute_ctx) && (requires_compute_ctx)) {
> +		/* If we reached here, it means we have a non-compute context,
> +		 * but there is no compute context on the device. Therefore,
> +		 * we can try to "upgrade" the existing context to a compute
> +		 * context
> +		 */
> +		dev_dbg_ratelimited(hdev->dev,
> +				"Non-compute context %d exists\n",
> +				hpriv->ctx->asid);
> +
> +	} else if ((hdev->compute_ctx) && (requires_compute_ctx)) {
> +
> +		/* If we already have a compute context in the device, there is
> +		 * no point of creating one in case we are called from ioctl
> +		 * that needs a compute context
> +		 */
>  		dev_err(hdev->dev,
>  			"Can't create new compute context as one already exists\n");
>  		valid = false;
>  		goto unlock_mutex;
> -	}
> +	} else {
> +		/* If we reached here it is because there isn't a context for
> +		 * the process AND there is no compute context or compute
> +		 * context wasn't required. In any case, must create a context
> +		 * for the process
> +		 */
>  
> -	rc = hl_ctx_create(hdev, hpriv);
> -	if (rc) {
> -		dev_err(hdev->dev, "Failed to create context %d\n", rc);
> -		valid = false;
> -		goto unlock_mutex;
> +		rc = hl_ctx_create(hdev, hpriv);
> +		if (rc) {
> +			dev_err(hdev->dev, "Failed to create context %d\n", rc);
> +			valid = false;
> +			goto unlock_mutex;
> +		}
> +
> +		dev_dbg_ratelimited(hdev->dev, "Created context %d\n",
> +					hpriv->ctx->asid);
>  	}
>  
> -	/* Device is IDLE at this point so it is legal to change PLLs.
> -	 * There is no need to check anything because if the PLL is
> -	 * already HIGH, the set function will return without doing
> -	 * anything
> +	/* If we reached here then either we have a new context, or we can
> +	 * upgrade a non-compute context to a compute context. Do the upgrade
> +	 * only if the caller required a compute context
>  	 */
> -	hl_device_set_frequency(hdev, PLL_HIGH);
> +	if (requires_compute_ctx) {
> +		WARN(hdev->compute_ctx,
> +			"Compute context exists but driver is setting a new one");

This will trigger syzbot and will reboot machines that have
'panic-on-warn' set (i.e. all cloud systems).  So be _VERY_ careful
about this.

If a user can trigger this, do not use WARN(), that's not what it is
for.

thanks,

greg k-h