open
https://gitlab.synchro.net/main/sbbs/-/issues/1155
## Summary
The Web Server's client list (as shown in `sbbsctrl` on Windows, and the MQTT client topics) accumulates entries that do **not** correspond to live TCP sessions. On a production host (`vert`) under a crawler flood, the list showed **~100 web clients** while the OS TCP table for the web ports held only **~23 application-owned sockets** (excluding `TIME_WAIT`), of which exactly **one** was `ESTABLISHED`. The web slot table (`MaxClients`) fills with dead/phantom entries, which starves real connections: a self-inflicted availability problem that looks like the scrape attack "winning."
## Evidence (ground truth vs. the client list)
Host runs the server (listeners on `:80`/`:443` owned by the `sbbsctrl`-hosted process). `Get-NetTCPConnection -LocalPort 80,443`, non-listen states:

| State | Count | Meaning |
|-------|------:|---------|
| `ESTABLISHED` | 1 | the only genuinely live web session |
| `CLOSE_WAIT` | ~22 | peer sent FIN (client hung up); the app still holds the socket/thread/slot |
| `TIME_WAIT` | ~7 | already closed by the app; kernel cooldown (app no longer owns these) |

vs. **~100** web clients listed in `sbbsctrl`/MQTT (`MaxClients = 100`). So of the listed ~100:
- **1** maps to a live (`ESTABLISHED`) socket,
- **~22** map to a half-closed (`CLOSE_WAIT`) socket — client gone, not reaped,
- **~75+** have **no socket in the OS table at all** — pure leaked/zombie client-list entries.
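The breakdown above is simple subtraction, and can be sketched as follows. This is a minimal illustration, not a tool used in the investigation: the `reconcile` function is hypothetical, the state counts are the approximate values from the table, and it assumes each application-owned socket backs at most one client-list entry.

```python
from collections import Counter

def reconcile(listed_clients, tcp_states):
    """Split a client-list count into live, half-closed and phantom entries.

    listed_clients: number of entries shown in the client list.
    tcp_states: one state string per non-listening socket on the web ports.
    """
    counts = Counter(tcp_states)
    live = counts["ESTABLISHED"]
    half_closed = counts["CLOSE_WAIT"]
    # TIME_WAIT sockets were already closed by the application, so they
    # cannot be backing a client-list entry and are deliberately ignored.
    phantom = listed_clients - live - half_closed
    return {"live": live, "half_closed": half_closed, "phantom": phantom}

# Approximate point-in-time sample from the table above.
sample_states = ["ESTABLISHED"] + ["CLOSE_WAIT"] * 22 + ["TIME_WAIT"] * 7
result = reconcile(100, sample_states)
print(result)
```

With these approximate inputs the phantom count comes out at 77, consistent with the "~75+" figure.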
The real sockets are almost entirely one crawler operator: the single `ESTABLISHED` is `220.181.108.110` (Baidu), and the `CLOSE_WAIT` pile is 17× `116.179.37.x` + 3× `116.179.33.x` (Baidu) plus a couple of stragglers. All current web sockets are on `:443` (TLS).
A `CLOSE_WAIT` socket persists only while the application has **not** called `close()` — i.e. the owning session thread is alive but not progressing to the point where it would notice the peer's FIN (`recv`==0) and tear down. `MaxInactivity` (60s here) should reap an idle keep-alive, but these are not being reaped, so the threads are stuck somewhere upstream of the read/timeout path.
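The `CLOSE_WAIT` behaviour described above can be reproduced on a loopback connection. The sketch below uses plain TCP (no TLS) and Python's standard `socket` module purely for illustration; it is not Synchronet code. After the peer closes, the accepted socket sits in `CLOSE_WAIT` until the application reads the end-of-stream and calls `close()` itself.

```python
import socket

def peer_close_demo():
    """Return what recv() yields on the server side after the peer hangs up."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # The peer hangs up: its FIN moves server_side into CLOSE_WAIT.
    client.close()

    # An application that reaches its read path sees end-of-stream here.
    server_side.settimeout(5.0)
    data = server_side.recv(1024)  # empty bytes means the peer closed

    # Only this close() releases the socket; a thread that never gets
    # here leaves the socket in CLOSE_WAIT indefinitely.
    server_side.close()
    listener.close()
    return data

print(repr(peer_close_demo()))
```

A thread that is blocked before the `recv` call never observes the empty read, which matches the symptom of sockets lingering in `CLOSE_WAIT`.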
## Prior fix attempt (insufficient)
Commit `ead5ccf16` ("Detect disconnection in JavaScript callback", song-11-earn) added a disconnect check inside `js_OperationCallback()` (and the equivalent in `services.cpp`): if `js_callback.auto_terminate` is set and `session_check()` reports the socket offline for 10 consecutive callbacks, the script is aborted. This correctly fixes the case it targeted — a runaway SSJS/XJS (e.g. webv4 user/system stats) that loops without checking for disconnection.
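The logic of that check, as described, can be modelled roughly as below. This is a Python sketch of the described behaviour only, not the code from the commit: the `DisconnectGuard` class and its method names are hypothetical, and resetting the counter when the socket is seen online again is an assumption drawn from the word "consecutive".

```python
class DisconnectGuard:
    """Model of a per-script disconnect check run from an operation callback."""

    def __init__(self, auto_terminate=True, threshold=10):
        self.auto_terminate = auto_terminate
        self.threshold = threshold
        self.offline_count = 0

    def on_callback(self, socket_online):
        """Return True when the running script should be aborted."""
        if not self.auto_terminate:
            return False  # the check is gated on auto_terminate
        if socket_online:
            self.offline_count = 0  # assumed: any online result resets the run
            return False
        self.offline_count += 1
        return self.offline_count >= self.threshold

guard = DisconnectGuard()
decisions = [guard.on_callback(socket_online=False) for _ in range(10)]
print(decisions)
```

The guard only ever advances when `on_callback` is invoked, so a thread that stops invoking it cannot be terminated this way.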
But the leak persists in the field (the ~100-vs-23 numbers above were observed after that commit), so it's not the whole story. Remaining gaps (hypotheses, not yet confirmed — see below):
1. **The operation callback only fires while the script executes JS bytecode.** A session thread blocked in a *native* call — a record-lock retry loop on the SMB-mounted `user.tab` (see #1153), a blocking `recv`/`SSL_read`, etc. — never reaches `js_OperationCallback`, so the disconnect check can't run.
2. **It's gated on `auto_terminate`;** sessions without it set are unaffected.
3. **It depends on `session_check()` actually detecting a half-closed (`CLOSE_WAIT`) TLS socket.** If a peer FIN on a TLS connection isn't surfaced until the next `SSL_read`, a stalled thread never sees it.
4. **It only addresses sessions that are running a script on a still-present socket.** It cannot explain the **~75+ listed clients with no socket at all** — those are a separate `client_on()`/`client_off()` (or retained-MQTT `client/action/connect`) accounting leak, independent of any running script.
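To make the accounting-leak hypothesis in item 4 concrete, the sketch below shows one generic way a list entry can outlive its session: an exit path that skips the deregistration call. It is purely illustrative and says nothing about where (or whether) this happens in the web server. `ClientList` is a hypothetical stand-in, with methods named after `client_on()`/`client_off()` only for readability.

```python
from contextlib import contextmanager

class ClientList:
    """Hypothetical stand-in for the server's client list."""
    def __init__(self):
        self.entries = set()
    def client_on(self, sock_id):
        self.entries.add(sock_id)
    def client_off(self, sock_id):
        self.entries.discard(sock_id)

def leaky_session(clients, sock_id, fail_early):
    clients.client_on(sock_id)
    if fail_early:
        return  # early exit: client_off() is never reached
    clients.client_off(sock_id)

@contextmanager
def registered(clients, sock_id):
    """Pair client_on/client_off so that every exit path deregisters."""
    clients.client_on(sock_id)
    try:
        yield
    finally:
        clients.client_off(sock_id)

def paired_session(clients, sock_id, fail_early):
    with registered(clients, sock_id):
        if fail_early:
            return  # the finally block still runs client_off()

clients = ClientList()
leaky_session(clients, 1, fail_early=True)
paired_session(clients, 2, fail_early=True)
print(sorted(clients.entries))
```

Only the entry from the leaky path remains, even though neither session is still running.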
## Impact
The web slot table (`MaxClients`) is consumed by corpses and phantoms rather than real load, so legitimate connections are refused/starved. Under a steady crawler (Baidu here, on persistent TLS keep-alive — cf. #1154), this compounds quickly.
## Relationship to other issues
- **#1153** (Windows/SMB exclusive read locks serializing `user.tab` reads): the lock convoy is a strong candidate for *why* threads stall long enough to never reap their sockets — stuck threads hold slots and CLOSE_WAIT sockets.
- **#1154** (no max-requests / max-age cap on HTTP keep-alive): long-lived crawler connections are what get stuck in the first place.
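For context on the kind of cap #1154 refers to, a keep-alive limit is typically a check made after each response. The sketch below is a generic illustration, not a proposal for the server's configuration: the function name, parameters, and default values are all hypothetical.

```python
def should_close_keepalive(requests_served, opened_at, now,
                           max_requests=100, max_age=300.0):
    """Decide whether a persistent connection should be closed.

    requests_served: responses already sent on this connection.
    opened_at, now: timestamps in seconds (same clock).
    max_requests, max_age: hypothetical caps, not real server settings.
    """
    if requests_served >= max_requests:
        return True  # request-count cap reached
    return (now - opened_at) >= max_age  # connection-age cap reached

print(should_close_keepalive(requests_served=3, opened_at=0.0, now=10.0))
```

Either cap bounds how long a single crawler connection can occupy a slot, independent of whether the stall itself is fixed.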
## Status
Symptom and measurement are confirmed (above). Root cause is **under active investigation** — specifically: where the stuck threads are blocked (native lock path vs. TLS read vs. script), whether `session_check()` detects a TLS half-close, and how a client-list/slot entry can outlive its socket without `client_off()`. This issue tracks the leak itself; findings to follow.
---
*Measured on the live `vert` host during a Baidu (`116.179.37.x`) crawl, while investigating SMB `user.tab` contention. Numbers are a point-in-time sample; the discrepancy (≈100 listed vs ≈23 sockets) is the stable signal.*
— *Authored by Claude (Claude Code), on behalf of @rswindell*
* Origin: Vertrauen - [vert/cvs/bbs].synchro.net (1:103/705)