• Web server leaks client slots: client list (sbbsctrl/MQTT) shows ~100

    From Rob Swindell@1:103/705 to GitLab issue in main/sbbs on Wed Jun 3 21:50:51 2026
    open https://gitlab.synchro.net/main/sbbs/-/issues/1155

    ## Summary

    The Web Server's client list (as shown in `sbbsctrl` on Windows, and the MQTT client topics) accumulates entries that do **not** correspond to live TCP sessions. On a production host (`vert`) under a crawler flood, the list showed **~100 web clients** while the OS TCP table for the web ports held only **~23 sockets**, of which exactly **one** was `ESTABLISHED`. The web slot table (`MaxClients`) fills with dead/phantom entries, which starves real connections — a self-inflicted availability problem that looks like the scrape attack "winning."

    ## Evidence (ground truth vs. the client list)

    Host runs the server (listeners on `:80`/`:443` owned by the `sbbsctrl`-hosted process). `Get-NetTCPConnection -LocalPort 80,443`, non-listen states:

    | State | Count | Meaning |
    |-------|------:|---------|
    | `ESTABLISHED` | 1 | the only genuinely live web session |
    | `CLOSE_WAIT` | ~22 | peer sent FIN (client hung up); the app still holds the socket/thread/slot |
    | `TIME_WAIT` | ~7 | already closed by the app; kernel cooldown (app no longer owns these) |

    vs. **~100** web clients listed in `sbbsctrl`/MQTT (`MaxClients = 100`). So of the listed ~100:

    - **1** maps to a live (`ESTABLISHED`) socket,
    - **~22** map to a half-closed (`CLOSE_WAIT`) socket — client gone, not reaped,
    - **~75+** have **no socket in the OS table at all** — pure leaked/zombie client-list entries.

    The real sockets are almost entirely one crawler operator: the single `ESTABLISHED` is `220.181.108.110` (Baidu), and the `CLOSE_WAIT` pile is 17× `116.179.37.x` + 3× `116.179.33.x` (Baidu) plus a couple of stragglers. All current web sockets are on `:443` (TLS).

    A `CLOSE_WAIT` socket persists only while the application has **not** called `close()` — i.e. the owning session thread is alive but not progressing to the point where it would notice the peer's FIN (`recv`==0) and tear down. `MaxInactivity` (60s here) should reap an idle keep-alive, but these are not being reaped, so the threads are stuck somewhere upstream of the read/timeout path.

    ## Prior fix attempt (insufficient)

    Commit `ead5ccf16` ("Detect disconnection in JavaScript callback", song-11-earn) added a disconnect check inside `js_OperationCallback()` (and the equivalent in `services.cpp`): if `js_callback.auto_terminate` is set and `session_check()` reports the socket offline for 10 consecutive callbacks, the script is aborted. This correctly fixes the case it targeted — a runaway SSJS/XJS (e.g. webv4 user/system stats) that loops without checking for disconnection.

    But the leak persists in the field (the ~100-vs-23 numbers above were observed after that commit), so it's not the whole story. Remaining gaps (hypotheses, not yet confirmed — see below):

    1. **The operation callback only fires while the script executes JS bytecode.** A session thread blocked in a *native* call — a record-lock retry loop on the SMB-mounted `user.tab` (see #1153), a blocking `recv`/`SSL_read`, etc. — never reaches `js_OperationCallback`, so the disconnect check can't run.
    2. **It's gated on `auto_terminate`;** sessions without it set are unaffected. 3. **It depends on `session_check()` actually detecting a half-closed (`CLOSE_WAIT`) TLS socket.** If a peer FIN on a TLS connection isn't surfaced until the next `SSL_read`, a stalled thread never sees it.
    4. **It only addresses sessions that are running a script on a still-present socket.** It cannot explain the **~75+ listed clients with no socket at all** — those are a separate `client_on()`/`client_off()` (or retained-MQTT `client/action/connect`) accounting leak, independent of any running script.

    ## Impact

    The web slot table (`MaxClients`) is consumed by corpses and phantoms rather than real load, so legitimate connections are refused/starved. Under a steady crawler (Baidu here, on persistent TLS keep-alive — cf. #1154), this compounds quickly.

    ## Relationship to other issues

    - **#1153** (Windows/SMB exclusive read locks serializing `user.tab` reads): the lock convoy is a strong candidate for *why* threads stall long enough to never reap their sockets — stuck threads hold slots and CLOSE_WAIT sockets.
    - **#1154** (no max-requests / max-age cap on HTTP keep-alive): long-lived crawler connections are what get stuck in the first place.

    ## Status

    Symptom and measurement are confirmed (above). Root cause is **under active investigation** — specifically: where the stuck threads are blocked (native lock path vs. TLS read vs. script), whether `session_check()` detects a TLS half-close, and how a client-list/slot entry can outlive its socket without `client_off()`. This issue tracks the leak itself; findings to follow.

    ---
    *Measured on the live `vert` host during a Baidu (`116.179.37.x`) crawl, while investigating SMB `user.tab` contention. Numbers are a point-in-time sample; the discrepancy (≈100 listed vs ≈23 sockets) is the stable signal.*

    — *Authored by Claude (Claude Code), on behalf of @rswindell*
    * Origin: Vertrauen - [vert/cvs/bbs].synchro.net (1:103/705)
  • From Rob Swindell@1:103/705 to GitLab note in main/sbbs on Wed Jun 3 23:23:17 2026
    https://gitlab.synchro.net/main/sbbs/-/issues/1155#note_9182

    ## Counter-evidence: not reproducing on the Linux host (`git.synchro.net`)

    The Linux box that backs `/share` also runs a public-facing Synchronet web server (`sbbs d w!`, the `git.synchro.net` site, `MaxClients=200`). It is **not** exhibiting this leak, which helps narrow the root cause.

    ### Live measurement (point-in-time, same method as the OP)

    | Signal | Linux `git` host | `vert` (from OP) | |--------|-----------------:|-----------------:|
    | Web clients (MQTT retained `…/server/web/client`) | **4 / 200** | 100 / 100 (maxed) |
    | Requests served | **3,927,093** | 15,563 |
    | Session-thread reaping (web log) | normal — *"… N clients and M threads remain"* | stuck at cap |
    | `CLOSE_WAIT` sockets on web ports (OS TCP table) | **0** | ~22 |

    After ~3.9M served requests the client count sits at **4**, threads terminate cleanly, and there are **zero `CLOSE_WAIT`** sockets on the web ports. A balanced-but-leaks-on-stall design that *were* stalling would show a climbing client count and a `CLOSE_WAIT` pile; neither is present.

    ### Why — the accounting is balanced by construction

    Reading `http_session_thread()` in `websrvr.cpp`: `client_on()` and `client_off()` are **structurally paired**. Between `client_on()` (~line 7003) and `client_off()` (~line 7175) the entire request loop falls through to a single linear teardown with **no early `return`**; the early-exit guards (host_can / ip_can / banned, ~6951/6983) all sit *before* `client_on()`, so they never create an unbalanced entry. `active_clients` is incremented (~6992) and decremented (~7173) in the same span. So there is **no "forgot to call `client_off()`" bug** — the only way a client-list/slot entry can outlive its socket is if the thread **never reaches the teardown**, i.e. it is wedged between `client_on()` and `client_off()`.

    That reframes both leak variants in the OP as **stall-driven**, not an accounting defect:

    - The `CLOSE_WAIT`-holding entries → thread blocked upstream of `recv`/timeout (the #1153 `user.tab` lock-retry path, a blocking `SSL_read`, etc.).
    - The "~75+ entries with no socket at all" → note that `close_session_socket()` runs (~7167) *before* `client_off()` (~7175), with `sem_wait(&session.output_thread_terminated)` (~7168) in between. If the output-thread join never completes, the socket is already gone but `client_off()` is never called — the exact "phantom entry, no socket" signature, still a stall, just in the join.

    ### Implication for root-causing

    The `websrvr.cpp` code is identical cross-platform, so the leak is **latent on Linux too** — but it is **entirely stall-driven**, and on the Linux host nothing is making session threads wedge, so entries are reaped normally. The differentiator is the **stall source**: VERT serves the heavily-crawled forum and does its per-page `user.tab` lookups over the SMB mount (#1153) under the Baidu flood; the Linux host serves different content and is not under that contention. This is consistent with #1153 being the dominant trigger rather than a platform-independent `client_on()`/`client_off()` accounting bug.

    (Caveat: on this host `/sbbs` is itself a loopback CIFS mount of `/share`, so a #1153-style lock convoy is not *theoretically* impossible here either — it would just be over loopback. It simply isn't being triggered, per the numbers above.)

    — *Authored by Claude (Claude Code), on behalf of @rswindell*
    --- SBBSecho 3.37-Linux
    * Origin: Vertrauen - [vert/cvs/bbs].synchro.net (1:103/705)
  • From Rob Swindell@1:103/705 to GitLab note in main/sbbs on Wed Jun 3 23:29:17 2026
    https://gitlab.synchro.net/main/sbbs/-/issues/1155#note_9183

    ## Root cause determined (live-dump analysis)

    Took a non-invasive full minidump of the running `sbbsctrl` (the process owns the `:80`/`:443` listeners) and analyzed it offline. The dominant leak is now fully explained.

    ### What the stuck threads are doing

    Of 221 threads, ~24 are web `http_session` threads stuck identically:

    ```
    ntdll!NtDelayExecution -> KERNELBASE!Sleep(1000ms)
    sbbs!js_mswait <- the SSJS script called mswait(1000) mozjs185!JS_ExecuteScript
    websrvr!exec_ssjs
    websrvr!http_session_thread
    ```

    i.e. an SSJS page (webv4 user/system stats) looping on `mswait()` after the client has gone. **Not** the lock convoy from #1153 — `0` threads were in `lockuserdat`/`readuserdat`/`xp_lockfile`.

    ### Why every backstop fails

    Live `js_callback` of a stuck session: `auto_terminate=1`, `counter=146958` (operation callback *is* firing constantly), **`offline_counter=0`**, `is_tls=true`, **`tls_pending=true`**.

    1. **Infinite-loop detector** (`js_internal.cpp:280`) — gated `if (cb->limit && cb->counter > cb->limit)`; `cb->limit == 0`, so disabled. The counter just climbs unbounded.
    2. **`terminated` flag** — never set (nothing signals the script the client left).
    3. **The `ead5ccf16` disconnect check** (added in `js_OperationCallback`, song-11-earn) — confirmed *present in the loaded binary* (disassembly shows `call websrvr!session_check` + `cmp …,0Ah`), so it's running. But `offline_counter` stays 0 after 146,958 callbacks, which means `session_check()` returned "connected" every single time.

    ### The bug: `session_check()` can't see a TLS disconnect

    In `session_check()`'s `is_tls` branch, once `tls_pending` latches `true` it returns "connected" via the early-return **without probing the socket** — and a peer's TLS `close_notify` arrives as readable bytes, which is exactly what set `tls_pending=true` (or a raw-readable event did). So after the client closes, `session_check()` reports connected forever, the `ead5ccf16` abort never arms, and the looping script never terminates — pinning the thread, a `MaxClients` slot, and the `CLOSE_WAIT` socket.

    This is **HTTPS-only**: the latch lives on the TLS path. Plain HTTP re-probes via `socket_check()` every call, which detects a FIN (`MSG_PEEK==0`/`POLLHUP`). Consistent with the evidence — every stuck/`CLOSE_WAIT` socket was on `:443`.

    ### Two distinct leaks (reminder)

    - **(a) Stuck looping-SSJS threads (~24)** — explained above; fix in progress (below).
    - **(b) Phantom client-list entries** — only ~26 web threads and ~23 sockets exist, vs ~100 in the client list, so ~75 listed clients have neither a thread nor a socket. This is a separate `client_on`/`client_off` (or stale retained-MQTT) accounting leak and is **still open**.

    ## Fix in progress (leak (a))

    Rework `session_check()`'s TLS path: drop the `tls_pending` liveness latch, and when the raw socket is readable, probe via `cryptPopData(1 byte)` — which distinguishes application data (connected, byte cached in the existing `peeked`/`peeked_valid` slot so nothing is lost), `CRYPT_ERROR_TIMEOUT` (connected, no data yet), and `CRYPT_ERROR_COMPLETE` (peer closed → disconnected). The probe is safe: `CRYPT_OPTION_NET_READTIMEOUT==0` (set at session setup, `websrvr.cpp:6922`) makes it non-blocking, and the op-callback runs in the same `http_session_thread`, so there's no concurrent reader. Also adjusted `recvbufsocket()` to close in place when `session_check()` reports a disconnect.

    Builds clean (MSVC Release). Pending: Debug rebuild + `sbbsctrl` **restart** (not recycle — the DLL must reload) + a live HTTPS repro (`curl -k --max-time 2` against the stats page, watch the thread reap within ~10 callbacks). Not committed until verified in the field.

    Residual edge (rare for this attack): a client that pipelines another request *then* closes can hide the `close_notify` behind unread data; the #1154 max-connection-age cap is the backstop for that.

    — *Authored by Claude (Claude Code), on behalf of @rswindell*
    --- SBBSecho 3.37-Linux
    * Origin: Vertrauen - [vert/cvs/bbs].synchro.net (1:103/705)
  • From Rob Swindell@1:103/705 to GitLab issue in main/sbbs on Fri Jun 5 10:17:45 2026
    close https://gitlab.synchro.net/main/sbbs/-/issues/1155
    --- SBBSecho 3.37-Linux
    * Origin: Vertrauen - [vert/cvs/bbs].synchro.net (1:103/705)
  • From Rob Swindell@1:103/705 to GitLab note in main/sbbs on Fri Jun 5 10:17:45 2026
    https://gitlab.synchro.net/main/sbbs/-/issues/1155#note_9259

    Fixed.
    --- SBBSecho 3.37-Linux
    * Origin: Vertrauen - [vert/cvs/bbs].synchro.net (1:103/705)