[lvc-project] [PATCH] net/9p: fix infinite loop in p9_client_rpc on fatal signal
Vasiliy Kovalev
kovalev at altlinux.org
Mon Jun 22 01:06:21 MSK 2026
On 6/21/26 16:00, Dominique Martinet wrote:
> Dominique Martinet wrote on Fri, Apr 17, 2026 at 07:52:52AM +0900:
>>> While the ideal long-term goal is the asynchronous implementation (as
>>> seen in your 9p-async-v2 branch [2]), this patch serves as a reliable
>>> intermediate solution for a critical regression.
>>> [2] https://github.com/martinetd/linux/commits/9p-async-v2
>>
>> iirc one of the problem with the async branch is that the process would
>> quit immediately on, say, ^C, before the IO has completed, but it's
>> possible for the server to process the IO (and not the flush) afterwards
>> and you'd get something that's not supposed to happen e.g.
>>
>> p1 p2
>>
>> write(1)
>> ^C/sigkill
>> flush sent but process exit without waiting for server ack
>> 1 not written yet
>> write(2) in same spot
>> write(2) done
>> write(1) completes
>> data isn't 2 as expected after p2 completed
>>
>>
>> So it's quite possible async isn't the way to go, but that there is no
>> good solution for this
>> (given this is true even without async on sigkill: if we have something
>> that works safely, there's no reason to wait only for non-fatal signals...)
>
> Sorry to come back to this after two months but I'm still a bit worried
> about this patch, and just came back to it as I'm about to send the PR
> to Linus...
> And I'm still thinking about the problem above, or rather possible
> variants involving cache (e.g. write going through the server, but
> client believing it didn't because the response didn't make it in time)
>
> .. But the thing is, I couldn't actually hit the `if
> (fatal_signal_pending(current))` you added (adding some print
> statement):
> - if cache is enabled, the actual I/Os are done by the vfs in the
> background, so any kill to user processes won't have any impact (and
> thus I guess my main worry about cache is alleviated there)
> - with cache=none I'm not sure why I can't hit it, I tried with an
> external server, breaking on the write() call while running dd, and
> killing dd with SIGKILL a few times but that doesn't appear to be
> enough? (task still stuck in write > rpc > flush > rpc, but it doesn't
> appear to ever get out of io_wait_event_killable() even when I hammer it
> with more signals?)
>
> So, given that my worry with cache is irrelevant (runs in background &
> won't ever hit this), I can't seem to hit this with what I consider
> to be normal workloads, and assuming it does fix your problems given you
> were able to test it... I'll leave it in and send to Linus now but I'd
> appreciate clarifications on how to test this more thoroughly as time
> permits...
> (I honestly probably should drop the patch at this point, but it'll
> still be time to revert if I figure something out in the next few weeks
> given it's been in -next for almost 2 months already)
>
> Thanks,
Quoting myself from April: "Severity is low and likely unreachable in
production, but it slows down syzkaller — the hung process ties up
a worker slot until the harness kills it by timeout (143s on our
setup)."
The deterministic path is the syzkaller C reproducer:
https://syzkaller.appspot.com/x/repro.c?x=156aa534580000
What it does:
1) mounts 9p with trans=fd, rfdno/wfdno pointing to open fds
with nothing speaking the 9p protocol on the other side
- RFLUSH can never arrive;
2) the 9p rpc from mount parks a thread in io_wait_event_killable;
3) another thread triggers SIGSEGV via prctl(PR_SET_MM) + brk()
corruption -> coredump_wait;
4) the harness's kill_and_wait() fires 5s later.
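For reference, step 1 can be approximated by hand with something like
the sketch below. This is not taken from the syzkaller reproducer: the
use of pipes as the "silent" fds and both function names are
assumptions made for illustration. The mount call is expected to block
in the kernel (step 2), so only run it in a throwaway VM.

```c
#include <stdio.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mount.h>

/* Hypothetical helper: format the 9p option string for the fd transport.
 * Returns 0 on success, -1 if the buffer is too small. */
int build_9p_fd_opts(char *buf, size_t len, int rfd, int wfd)
{
	int n = snprintf(buf, len, "trans=fd,rfdno=%d,wfdno=%d", rfd, wfd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: mount 9p over two pipes that nobody services.
 * Assumption: pipes stand in for the reproducer's "open fds with nothing
 * speaking the 9p protocol on the other side". Needs root; the mount()
 * call is expected to block waiting for a reply that never comes. */
int mount_9p_on_silent_pipes(const char *target)
{
	int to_srv[2], from_srv[2];
	char opts[64];

	if (pipe(to_srv) < 0 || pipe(from_srv) < 0)
		return -1;
	/* The kernel reads replies from rfdno and writes requests to wfdno. */
	if (build_9p_fd_opts(opts, sizeof(opts), from_srv[0], to_srv[1]) < 0)
		return -1;
	return mount("none", target, "9p", 0, opts);
}
```

Calling mount_9p_on_silent_pipes("/mnt") from one thread and then
delivering a fatal signal to the process is only a rough outline of
steps 2 to 4; the reproducer linked above remains the reliable path.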
To make both branches visible, debug diff on top of the patch:
diff --git a/net/9p/client.c b/net/9p/client.c
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -600,8 +600,12 @@ p9_client_rpc(...)
if (err == -ERESTARTSYS && c->status == Connected &&
type == P9_TFLUSH) {
- if (fatal_signal_pending(current))
+ pr_info("9p-dbg: TFLUSH retry hit, fatal=%d\n",
+ fatal_signal_pending(current));
+ if (fatal_signal_pending(current)) {
+ pr_info("9p-dbg: bailing out via recalc_sigpending\n");
goto recalc_sigpending;
+ }
sigpending = 1;
clear_thread_flag(TIF_SIGPENDING);
goto again;
In the VM:
# gcc repro.c -o repro
# ./repro
The debug messages show up in dmesg on every iteration:
[root@localhost repro]# ./repro
executing program
[ 126.254054] repro[363]: segfault at 558a42e9ff30 ip 0000558a42e9ff30
sp 00007f3225ee4e80 error 14 likely on CPU 0 (core 0, socket 0)
[ 126.258095] Code: Unable to access opcode bytes at 0x558a42e9ff06.
[ 131.199937] 9pnet: 9p-dbg: TFLUSH retry hit, fatal=1
[ 131.201868] 9pnet: 9p-dbg: bailing out via recalc_sigpending
executing program
[ 131.270955] repro[366]: segfault at 558a42e9ff30 ip 0000558a42e9ff30
sp 00007f3225ee4e80 error 14 likely on CPU 3 (core 3, socket 0)
[ 131.275131] Code: Unable to access opcode bytes at 0x558a42e9ff06.
[ 136.219066] 9pnet: 9p-dbg: TFLUSH retry hit, fatal=1
[ 136.221359] 9pnet: 9p-dbg: bailing out via recalc_sigpending
executing program
[ 136.290772] repro[369]: segfault at 558a42e9ff30 ip 0000558a42e9ff30
sp 00007f3225ee4e80 error 14 likely on CPU 2 (core 2, socket 0)
[ 136.295901] Code: Unable to access opcode bytes at 0x558a42e9ff06.
[ 141.237955] 9pnet: 9p-dbg: TFLUSH retry hit, fatal=1
[ 141.239800] 9pnet: 9p-dbg: bailing out via recalc_sigpending
executing program
...
Without the patch the second pr_info never appears and the task
hangs in D-state.
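To spell out the difference between the two cases, here is a small
userspace model of the branch in the diff above. It is not kernel
code: the enum, the function name and the "patched" switch are all
made up for illustration, and the booleans stand in for
err == -ERESTARTSYS, c->status == Connected, type == P9_TFLUSH and
fatal_signal_pending(current).

```c
#include <stdbool.h>

/* Hypothetical outcome of the interrupted-wait check. */
enum wait_action {
	WAIT_FALLTHROUGH,	/* not the TFLUSH special case */
	WAIT_RETRY,		/* clear the pending flag and "goto again" */
	WAIT_BAIL		/* "goto recalc_sigpending" and give up */
};

/* Model of the decision taken after the wait is interrupted.
 * With patched == false the fatal-signal check is absent, so an
 * interrupted TFLUSH on a connected client is always retried. */
enum wait_action tflush_wait_action(bool interrupted, bool connected,
				    bool is_tflush, bool fatal_pending,
				    bool patched)
{
	if (!(interrupted && connected && is_tflush))
		return WAIT_FALLTHROUGH;
	if (patched && fatal_pending)
		return WAIT_BAIL;
	return WAIT_RETRY;
}
```

With a fatal signal pending and no RFLUSH ever arriving, the unpatched
model returns WAIT_RETRY every time, which corresponds to the hang;
the patched model returns WAIT_BAIL, which corresponds to the second
pr_info in the log.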
On a real server I couldn't reproduce this by hand. The reproducer
hits the branch deterministically (logs above); why hand-issued
SIGKILLs don't get there is a kernel signal-delivery question
outside the path this patch touches, and I didn't dig into it.
Feel free to revert if anything turns up in the next few weeks.
--
Thanks,
Vasiliy