Validator 'sync-state' turned into 'Stalled', although internet was online and services were running

Hello everyone,

I am running a PulseChain validator on mainnet.
Having used the scripts and procedure outlined by RHMax on GitHub, all went well for the first few days.
After a little more than a day I was synced and the first yield was coming in.

Some time ago, however, I checked the status of the services for Geth, LH-Beacon and LH-Validator; they were all running.
But when I checked the JSON output for LH-Health, I noticed that the value for sync_state was “Stalled”, whereas in a normal, healthy state it should be:

"sync_state": "Synced"
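
For anyone who wants to read that field from a script rather than by eye, here is a minimal Python sketch. It assumes the health output is a JSON object with sync_state either at the top level or nested under a "data" wrapper; the function name and the sample document are made up for illustration and are not taken from a real node.

```python
import json


def read_sync_state(raw_json: str) -> str:
    """Return the sync_state string from a health JSON document, or "Unknown"."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict):
        return "Unknown"
    # Assumption: the field sits at the top level or under a "data" wrapper.
    data = payload.get("data", payload)
    if not isinstance(data, dict):
        return "Unknown"
    return str(data.get("sync_state", "Unknown"))


# Made-up sample document, only to show the call.
sample = '{"data": {"sync_state": "Stalled"}}'
print(read_sync_state(sample))
```

The raw JSON would come from wherever the LH-Health output is read on your setup; that source is not specified here, so the sketch only covers the parsing step.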

Checking my validator nodes, I saw that they had already been offline for some days, ‘making’ negative income.
Damn… merely checking that the services report ‘running’ had proved not to be effective.

So, what does this status ‘Stalled’ mean and what might have been the cause?

So far I have only found some suspicious log output in this location:

/opt/lighthouse/data/beacon/beacon/logs/beacon.log

This is what I found:

“Subnet peer discovery did not find sufficient peers. Reached max retry limit.”
service: libp2p
module: lighthouse_network::discovery:591
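
To see how often and when this message shows up, a small Python sketch like the one below could scan the log. The log path and the message text are the ones quoted above; the function names are hypothetical, and the script only does a plain substring match without assuming anything about the log line format.

```python
from typing import Iterable, List

LOG_PATH = "/opt/lighthouse/data/beacon/beacon/logs/beacon.log"
NEEDLE = "Subnet peer discovery did not find sufficient peers"


def find_discovery_failures(lines: Iterable[str]) -> List[str]:
    """Return every line that contains the peer discovery warning."""
    return [line.rstrip("\n") for line in lines if NEEDLE in line]


def main(path: str) -> None:
    try:
        with open(path, errors="replace") as handle:
            hits = find_discovery_failures(handle)
    except OSError as exc:
        # Missing file or insufficient permissions: report and stop.
        print(f"could not read {path}: {exc}")
        return
    print(f"{len(hits)} matching line(s)")
    if hits:
        print("first:", hits[0])
        print("last: ", hits[-1])


if __name__ == "__main__":
    main(LOG_PATH)
```

Comparing the first and last match against the time the firewall rules were reloaded would show whether the two events line up.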

I checked my router, and around that time the internet connection was functioning normally.

Of course I also tried to remember what I might have changed on the validator after it was activated on the chain.
There is only one thing: I added some firewall rules to allow an extra home computer to access the server.
After that I ran sudo ufw reload, which should apply the extra rules without having to restart anything.
Could this have caused the ‘stalling’?

A server reboot brought my validators back online (slowly).

========================================================

I took some serious negative income because of this, so I am thinking of running an intelligent script (bash/python)
that does this (in words):

Test the value in the JSON dictionary: if it’s “Synced” or “Syncing”, do nothing (all is good).
If it’s not “Synced” or “Syncing”:
Stop all 3 services gracefully via systemd. (Maybe add some extra ‘pause’ time.)
Then reboot the machine.

Then we add such a script to cron to run once a day, or at some other frequency.
Even better (as an extra monitoring safeguard) would be to send an email if the status is not “Synced” or “Syncing”.
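
The plan above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the systemd unit names, the email addresses and the 30-second pause are placeholders I made up, and the script runs in dry-run mode so it only prints what it would do. The set of healthy states is exactly the two named above; if the client reports other syncing variants, the set would need extending.

```python
import subprocess
import sys
from email.message import EmailMessage

HEALTHY_STATES = {"Synced", "Syncing"}
# Assumption: placeholder unit names; replace with the real ones on the server.
SERVICES = ["lighthouse-validator", "lighthouse-beacon", "geth"]


def is_healthy(sync_state: str) -> bool:
    return sync_state in HEALTHY_STATES


def plan_recovery(sync_state, services=SERVICES, pause_seconds=30):
    """Return the commands to run; an empty list means all is good."""
    if is_healthy(sync_state):
        return []
    commands = [["systemctl", "stop", name] for name in services]
    commands.append(["sleep", str(pause_seconds)])  # extra pause before reboot
    commands.append(["systemctl", "reboot"])
    return commands


def build_alert(sync_state, sender, recipient):
    """Build (but do not send) an alert mail about an unhealthy state."""
    msg = EmailMessage()
    msg["Subject"] = f"Validator sync_state is {sync_state}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"sync_state reported as {sync_state!r}; recovery planned.")
    return msg


def run_plan(commands, dry_run=True):
    for cmd in commands:
        print("would run:" if dry_run else "running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # The state is passed as the first argument, e.g. "Stalled".
    state = sys.argv[1] if len(sys.argv) > 1 else "Synced"
    plan = plan_recovery(state)
    if plan:
        # Placeholder addresses; sending (e.g. via smtplib) is left out here.
        print(build_alert(state, "node@example.org", "me@example.org")["Subject"])
    run_plan(plan, dry_run=True)
```

One design point to consider: if the node is still “Stalled” after the reboot, a cron job that runs this often would reboot the machine again on every run, so a once-a-day schedule or a "last reboot" marker file may be the safer choice.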

========================================================

Maybe there are validator-runners out there who have implemented better or simpler solutions.

Greetz, Laurens (5555 to all)


Hi Laurens,

I never felt the need to develop such a script, as the 3 services typically never stalled: by design, they pick up their duties on their own after an interruption.

In your case, you wrote the services were offline for some days, which is clearly too long. My geth or lighthouse clients might lose their peers every once in a while, perhaps once or twice a month, but they reconnect automatically within a couple of minutes.

Such brief network drops can happen due to temporary bandwidth issues or lag on the ISP side and are completely normal, but not when they last for hours or days.

It’s difficult to diagnose remotely, but my guess is that your validator configs are not optimized for your setup, specifically with regard to your firewall or port settings. For example, check that UPnP is enabled on your router and check port forwarding:

Lighthouse peering 9000 TCP/UDP
Go-Pulse/geth 30303 TCP/UDP used for P2P protocol running the network
Go-Pulse/geth peering 30304 UDP used for P2P protocol new peer discovery overlay

Also check that your local firewall is not blocking any traffic that is necessary for peering and communication.
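
As a quick first check from the server itself, something like this Python sketch can tell whether the TCP side of those ports is being listened on at all. It is a rough sketch only: it tests 127.0.0.1, so it says nothing about port forwarding on the router or about the UDP ports, and the helper name is made up.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# TCP ports from the list above; UDP is not covered by this check.
TCP_PORTS = {"Lighthouse peering": 9000, "Go-Pulse/geth P2P": 30303}

if __name__ == "__main__":
    for label, port in TCP_PORTS.items():
        status = "listening" if tcp_port_open("127.0.0.1", port) else "not reachable"
        print(f"{label} (TCP {port}): {status}")
```

If a port shows as not reachable locally while the service is running, the client configuration is the first place to look; if it is listening locally but peers still drop, the firewall or router is the more likely suspect.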

Good luck
Marculix
