On Tue, Feb 14, 2023 at 09:42:09PM +0100, Paul Gevers wrote:
Hi Phil,
On 13-02-2023 08:57, Philipp Kern wrote:
On 12.02.23 22:38, Paul Gevers wrote:
I have munin [1], but as said, I'm not a trained sysadmin. I don't
know what I'm looking for if you ask "statistics on the network".
This is more of a software development / devops question than a sysadmin question, but alas.
I acknowledge that my reach out was broad and didn't only cover s390x.
What I am interested in is *application-level* logging on reconnects. Presumably the connection to RabbitMQ is outbound?
Our configuration can be seen here: https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb
Is it tunneled? Does your application log somewhere when a reconnect happens? Does it say when it successfully connected?
I'd expect good software to log something like this:
[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"... [10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
And also:
[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"... [10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345"
failed: Connection refused
@terceiro; I haven't seen these kind of logs on the worker hosts. Do you
know if they exist or if we can generate them?
The worker does log it's initial connection, see below.
I think I'm seeing something on the main host. admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163 [email protected] | grep -v '\[info\]' | grep -v '\[warning\]'
2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):
and a lot more warnings (220 times in 20 hours) as well; like:
2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user: 'guest'):
And a lot (around 544) (obviously I don't know if that's only or even includes the s390x host):
client unexpectedly closed TCP connection
root@ci-worker-s390x-01:~# journalctl -u
[email protected]e --since='2 days ago' -t debci -b 0 | grep amqp
Feb 15 15:10:21 ci-worker-s390x-01 debci[663]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 17:58:53 ci-worker-s390x-01 debci[2740543]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:23:40 ci-worker-s390x-01 debci[1855652]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 19:28:12 ci-worker-s390x-01 debci[1939916]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 15 20:50:51 ci-worker-s390x-01 debci[783145]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 15 21:36:25 ci-worker-s390x-01 debci[1966510]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 15 22:21:43 ci-worker-s390x-01 debci[3243793]: retry: amqp-publish returned 1, backing off for 10 seconds and trying again...
Feb 16 00:29:41 ci-worker-s390x-01 debci[4119188]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 01:21:26 ci-worker-s390x-01 debci[2097411]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 03:02:20 ci-worker-s390x-01 debci[1133799]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 06:40:46 ci-worker-s390x-01 debci[953820]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 08:00:24 ci-worker-s390x-01 debci[2875496]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 09:59:09 ci-worker-s390x-01 debci[3864527]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 11:47:09 ci-worker-s390x-01 debci[2310984]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 14:08:01 ci-worker-s390x-01 debci[1968077]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 16:58:24 ci-worker-s390x-01 debci[2496027]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 18:40:23 ci-worker-s390x-01 debci[1074224]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 20:35:15 ci-worker-s390x-01 debci[824124]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 16 22:23:07 ci-worker-s390x-01 debci[3567266]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 01:16:26 ci-worker-s390x-01 debci[40757]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 04:22:34 ci-worker-s390x-01 debci[1275755]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 06:24:45 ci-worker-s390x-01 debci[448310]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
Feb 17 08:26:47 ci-worker-s390x-01 debci[830390]: I: Connecting to AMQP queue debci-tests-s390x-lxc on amqps://ci-master.debian.net
So, for example, we sometimes see errors when publishing results (the
retry: lines), so maybe we are also having problems getting jobs.
Also, it seems the client connects again several times a day, what is definitively not normal (the amqp connection is supposed to be
persistent). I checked for example one of the arm64 on the Huawei cloud
-- since it's also not inside AWS, but on a different network path --
and it didn't need to reconnect even once in the last full days, having processed > 500 jobs in that time.
So there is for sure something wrong with the client-server connection
there. Reworking the client for robustness is on my TODO list for a
while.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEst7mYDbECCn80PEM/A2xu81GC94FAmPvpXcACgkQ/A2xu81G C97zWQ/+NopnKocwuJNU1p5nYUbk7xGpxjo2qR7rxKu97124OOmJh4o0LrjtqyPk xhxJ2gNdZHeYptW4E+sAA0MfgCJNBDugn7QnmKs+oIn0lkAD2XztPm7ybLvNxJnN iPOghTC6q/ZwTr+pEskxqZW6CN4yRsxTx6PjD2UdVNeXF8X07Why0FQw+xx3rhMb VCM4bXp2Fnt4D4fBdu5ifLu7tQ1jB4bfH2mW/QPhht6PlIbhWGqyWAsaSWRDXn3M KfR6h0JPcoIuk8UZnbKGL2IFmy+4QxEXUjXT+HEvlEV8pS0j6ioIJ3Ji0wh6nhi5 JIhy1nNzmQisEgjTgknOP/cJF2X5UA4E09H2GXw48ugIu1mGu0WGtw1CADOIF1Qg nfbYkGhj13UmqhJ37OscqOTrpIgvPl6Z3Tk2tNUEzfqH0vhyZF2tH+8jjtxxpFqP XRcSx1HC7GRie746xYBgjektumuWMMsLUr0dhH0WunChTot9ULtptnewobPWmwAk BUCdEwhmiCWVfZ0QJnW6GsyKSbJJ2QfzeHZam1FyfVI7Fb7T5XyCQX3tvoxcRzLx eWxDhy66DaumgexVB1tHNF40N81Ue/CVS7DCVnLVBB1mMxeOKFjXGZy2yqwqDwnS hxKAydCAU4qet/LzAjf0xPWOQ3hojnNU+8d9bjdfD9sJKI94KP0=
=myM7
-----END PGP SIGNATURE-----
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)