[lvc-project] [PATCH] [RFC] net: smc: fix fasync leak in smc_release()

Thu Mar 7 11:58:51 MSK 2024

On 06/03/2024 19:07, Dmitry Antipov wrote:
> 
> Well, the whole picture is somewhat more complicated. Consider the
> following diagram (an underlying kernel socket is in [], e.g. [smc->sk]):
> 
> Thread 0                        Thread 1
> 
> ioctl(sock, FIOASYNC, [1])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->sk]);
> sock_fasync(sock, ..., 1)       ; new fasync_struct linked to smc->sk
> release_sock(sock [smc->sk]);
>                                  ...
>                                  lock_sock([smc->sk]);
>                                  ...
>                                  smc_switch_to_fallback()
>                                  ...
>                                  smc->clcsock->file->private_data = 
> smc->clcsock;
>                                  ...
>                                  release_sock([smc->sk]);
> ioctl(sock, FIOASYNC, [0])
> ...
> sock = filp->private_data;
> lock_sock(sock [smc->clcsock]);
> sock_fasync(sock, ..., 0)       ; nothing to unlink from smc->clcsock
>                                  ; since fasync entry was linked to smc->sk
> release_sock(sock [smc->clcsock]);
>                                  ...
>                                  close(sock [smc->clcsock]);
>                                  __fput(...);
>                                  file->f_op->fasync(sock, [0])   ; 
> always failed -
>                                                                  ; 
> should use
>                                                                  ; 
> smc->sk instead
>                                  file->f_op->release()
>                                     ...
>                                     smc_restore_fallback_changes()
>                                     ...
>                                     file->private_data = smc->sk.sk_socket;

Thank you Dmitry for your detailed explanations.
It helped me a lot trying to understand the problem.
And I'm still in the progress of trying to understand it.
 From my naive point of view:
Would it help if we catch the ioctl and check if there is an active 
fallback and either stall or route the ioctl to the correct socket?
I've seen that there is an ioctl function handle in the proto_ops.
So on a very abstract level we could do the following:
1. Indicate an active Fallback at the start of the fallback to tcp.
2. Catch ioctls.
3. Check if there is an active fallback.
	NO: Pass the ioctl.
	YES: Wait for the fallback to complete and pass after.

If this blocks too much we can obviously do some finer checks there.
E.g.: Just check if the private data is already at attached to the socket.

Do you think this would be a suiteable solution?
I'm going to look into your proposal Dmitry to see how you solved the 
problem and to understand it better.

Thanks
- Jan

> 
> That is, smc_restore_fallback_changes() restores filp->private_data to
> smc->sk. If __fput() would have called file->f_op->release() _before_
> file->f_op->fasync(), the fix would be as simple as adding
> 
> smc->sk.sk_socket->wq.fasync_list = smc->clcsock->wq.fasync_list;
> 
> to smc_restore_fallback_changes(). But since file->f_op->fasync() is called
> before file->f_op->release(), the former always makes an attempt to 
> unlink fasync
> entry from smc->clcsock instead of smc->sk, thus introducing the memory 
> leak.
> 
> And an idea with shared wait queue was intended in attempt to eliminate
> this chicken-egg lookalike problem completely.
> 
> Dmitry
>