Revisiting this 8 years later (Dec 2016), I realize I never posted the Exciting Conclusion. Our "solution" was to switch the network connection from TCP to UDP. We concluded that there was a bug in the VxWorks TCP stack, but we couldn't reproduce the
problem reliably. Our application didn't need TCP's guarantees (reliable, in-order delivery of the data stream), and it could tolerate a few dropped packets. We switched to UDP and the problem went away. We were under intense schedule pressure, so we
notified Wind River about our fix and moved on :).
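For anyone reading this later, here is a rough sketch of what "tolerating a few dropped packets" over UDP can look like on the receiving side. It is not the code from our application. It assumes each datagram starts with a 4-byte sequence number, which is a hypothetical format chosen for illustration, and it ignores sequence-number wraparound.

```python
import struct

# Hypothetical header: 4-byte big-endian sequence number, then the payload.
HEADER = struct.Struct("!I")


class GapTracker:
    """Counts missing and late datagrams from their sequence numbers."""

    def __init__(self):
        self.expected = None  # next sequence number we expect to see
        self.dropped = 0      # datagrams skipped over (presumed lost)
        self.stale = 0        # late or duplicate datagrams we discarded

    def accept(self, datagram):
        """Return the payload, or None if the datagram is late or a duplicate."""
        (seq,) = HEADER.unpack_from(datagram)
        if self.expected is not None and seq < self.expected:
            self.stale += 1
            return None
        if self.expected is not None:
            # Any gap between what we expected and what arrived counts as lost.
            self.dropped += seq - self.expected
        self.expected = seq + 1
        return datagram[HEADER.size:]
```

The application just keeps going after a gap; the counters are there so that a rising loss rate is visible rather than silent.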
Bill & James: Thanks so much for your help 8 years ago. That was a really tough problem, especially for a new engineer. I was so grateful for your time and attention.
Best,
Justin
On Thursday, August 7, 2008 at 1:56:40 PM UTC-7, [email protected] wrote:
On Aug 7, 3:13 am, James Cunnane <[email protected]> wrote:
On Tue, 5 Aug 2008 16:07:34 -0700 (PDT), [email protected] wrote:
Oh, and I just remembered another piece of the puzzle: The VxWorks machine is also exchanging data with another box on the network over
UDP. We have timers in the VxWorks app that make it panic if it stops receiving UDP packets. It appears that during each of these anomalies, the VxWorks box continues to receive UDP packets just fine. That is,
it appears as though it stops hearing from the TCP stream, but
continues to receive UDP packets as normal.
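The kind of receive timer described above can be sketched as a small per-stream watchdog, which makes it easy to see one stream stall while another stays alive. This is an illustration only, not the VxWorks application's code; the class name, stream labels, and timeout are hypothetical, and the caller supplies the clock value.

```python
class StreamWatchdog:
    """Tracks when each stream last delivered data and reports stalled ones."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # stream label -> time of last received data

    def feed(self, stream, now):
        """Call whenever data arrives on a stream."""
        self.last_seen[stream] = now

    def stalled(self, now):
        """Return the streams that have been silent for longer than the timeout."""
        return sorted(
            stream
            for stream, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

Feeding both the TCP and the UDP receive paths into one watchdog would report only the TCP stream as stalled in the situation described here.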
Perhaps your ARP cache has become corrupt. I had a system which, after about 26 days of continuous connection, would respond to ping but not
to telnet; it turned out that the ARP cache had become corrupted by a nanosecond timer overflow. The mechanism of corruption is probably
not timer-related in your case but the end result seems similar. Can
you devise ARP diagnostics that can run periodically on the sending
device, both before and after the TCP failure?
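One possible shape for such a diagnostic is to snapshot the ARP table periodically and compare the snapshot taken before the failure with one taken after. The sketch below covers only the comparison step. How the snapshot is collected is platform-specific and left out, and the dict layout (IP address mapped to MAC address) is an assumption made for illustration.

```python
def diff_arp(before, after):
    """Compare two ARP snapshots (dicts of IP -> MAC).

    Returns a dict of IP -> (old_mac, new_mac) for every entry that was
    added, removed, or changed. A missing entry is reported as None.
    """
    changes = {}
    for ip in before.keys() | after.keys():
        old, new = before.get(ip), after.get(ip)
        if old != new:
            changes[ip] = (old, new)
    return changes
```

An entry whose MAC address changes or disappears across the failure would point toward the ARP cache; an empty result would point away from it.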
Hmm... In your case you said the system would respond to ping, but
not telnet. It's hard to classify that as a problem with the ARP
cache, _if_ you tried to ping the target from the same host that you
also tried to telnet to it from. If you can ping target A from host
B, then ARP resolution between A and B is working (or at least, the
ARP entries haven't timed out yet). Ping (ICMP over IP) and telnet
(TCP over IP) both rely on ARP, so if it worked for one, it should
have worked for the other.
However, if you tried to ping target A from host B, and that worked,
but trying to telnet to target A from host C did not work, that could
be an ARP problem. (The target still had an unexpired ARP entry for
host B, but was unable to perform ARP resolution for the previously
unknown host C.)
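The host B versus host C distinction can be made concrete with a toy model of an ARP cache. This is only an illustration of the reasoning, not how any real stack is implemented: the class, the addresses, and the 1200-second lifetime are all made up.

```python
class ToyArpCache:
    """Toy model: unexpired entries keep working even if new resolution fails."""

    def __init__(self, network, ttl_s=1200.0):
        self.network = network       # IP -> MAC that a working ARP exchange would learn
        self.ttl_s = ttl_s           # arbitrary entry lifetime for this model
        self.entries = {}            # IP -> (MAC, expiry time)
        self.resolver_works = True   # set False to model broken ARP resolution

    def lookup(self, ip, now):
        """Return the MAC for ip, or None if it cannot be resolved."""
        entry = self.entries.get(ip)
        if entry is not None and entry[1] > now:
            return entry[0]          # served from cache, no ARP request needed
        if not self.resolver_works:
            return None              # a new "who has" exchange would be needed, and fails
        mac = self.network.get(ip)
        if mac is not None:
            self.entries[ip] = (mac, now + self.ttl_s)
        return mac
```

With host B cached before resolution breaks, lookups for B keep succeeding until the entry expires, while a first lookup for host C fails. Any protocol from B (ping or telnet) then gets its replies, and nothing from C does.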
In Justin's case, he said once his app got into its error state, he
could see the target still sending TCP segments to his Windows host
using Wireshark (but not responding to ACKs from the Windows host).
This implies the target's ARP entry for the Windows host was still
valid (otherwise it would have started sending ARP "who has" requests instead).
-Bill
Regards
James Cunnane