[lvc-project] [PATCH v3 1/2] wifi: ath9k: fix races between ath9k_wmi_cmd and ath9k_wmi_ctrl_rx
Fedor Pchelkin
pchelkin at ispras.ru
Thu May 18 18:44:24 MSK 2023
On Thu, May 18, 2023 at 06:24:37PM +0800, Hillf Danton wrote:
> Fedor Pchelkin <pchelkin at ispras.ru> writes:
>
> > On Wed, Apr 26, 2023 at 07:07:08AM +0800, Hillf Danton wrote:
> >> Given similar wait timeout[1], just taking lock on the waiter side is not
> >> enough wrt fixing the race, because in case job done on the waker side,
> >> waiter needs to wait again after timeout.
> >>
> >
> > As I understand you correctly, you mean the case when a timeout occurs
> > during ath9k_wmi_ctrl_rx() callback execution. I suppose if a timeout has
> > occurred on a waiter's side, it should return immediately and doesn't have
> > to care in which state the callback has been at that moment.
> >
> > AFAICS, this is controlled properly with taking a wmi_lock on waiter and
> > waker sides, and there is no data corruption.
> >
> > If a callback has not managed to do its work entirely (performing a
> > completion and subsequently waking waiting thread is included here), then,
> > well, it is considered a timeout, in my opinion.
> >
> > Your suggestion makes a wmi_cmd call to give a little more chance for the
> > belated callback to complete (although timeout has actually expired). That
> > is probably good, but increasing a timeout value makes that job, too. I
> > don't think it makes any sense on real hardware.
> >
> > Or do you mean there is data corruption that is properly fixed with your patch?
>
> Given complete() not paired with wait_for_completion(), what is the
> difference after this patch?
The main thing in the patch is making ath9k_wmi_ctrl_rx() release wmi_lock
after calling ath9k_wmi_rsp_callback() which does copying data into the
shared wmi->cmd_rsp_buf buffer. Otherwise there can occur a data
corrupting scenario outlined in the patch description (added it here,
too).
On Tue, 25 Apr 2023 22:26:06 +0300, Fedor Pchelkin wrote:
> CPU0 CPU1
>
> ath9k_wmi_cmd(...)
> mutex_lock(&wmi->op_mutex)
> ath9k_wmi_cmd_issue(...)
> wait_for_completion_timeout(...)
> ---
> timeout
> ---
> /* the callback is being processed
> * before last_seq_id became zero
> */
> ath9k_wmi_ctrl_rx(...)
> spin_lock_irqsave(...)
> /* wmi->last_seq_id check here
> * doesn't detect timeout yet
> */
> spin_unlock_irqrestore(...)
> /* last_seq_id is zeroed to
> * indicate there was a timeout
> */
> wmi->last_seq_id = 0
> mutex_unlock(&wmi->op_mutex)
> return -ETIMEDOUT
>
> ath9k_wmi_cmd(...)
> mutex_lock(&wmi->op_mutex)
> /* the buffer is replaced with
> * another one
> */
> wmi->cmd_rsp_buf = rsp_buf
> wmi->cmd_rsp_len = rsp_len
> ath9k_wmi_cmd_issue(...)
> spin_lock_irqsave(...)
> spin_unlock_irqrestore(...)
> wait_for_completion_timeout(...)
> /* the continuation of the
> * callback left after the first
> * ath9k_wmi_cmd call
> */
> ath9k_wmi_rsp_callback(...)
> /* copying data designated
> * to already timeouted
> * WMI command into an
> * inappropriate wmi_cmd_buf
> */
> memcpy(...)
> complete(&wmi->cmd_wait)
> /* awakened by the bogus callback
> * => invalid return result
> */
> mutex_unlock(&wmi->op_mutex)
> return 0
So before the patch the wmi->last_seq_id check in ath9k_wmi_ctrl_rx()
wasn't helpful in case wmi->last_seq_id value was changed during
ath9k_wmi_rsp_callback() execution because of the next ath9k_wmi_cmd()
call.
With the proposed patch the wmi->last_seq_id check in ath9k_wmi_ctrl_rx()
accomplishes its job as:
- the next ath9k_wmi_cmd call changes last_seq_id value under lock so
it either waits for a belated ath9k_wmi_ctrl_rx() to finish or updates
last_seq_id value so that the timeout check in ath9k_wmi_ctrl_rx()
indicates that the waiter side has timeouted and we shouldn't further
process the callback.
- memcopying in ath9k_wmi_rsp_callback() is made to a valid place if
the last_seq_id check was successful under the lock.
More information about the lvc-project
mailing list