The "ctrld start" command runs slowly and uses much more CPU than
necessary. There are several causes:
1. The ctrld process waits for 5 seconds before marking listeners up.
That adds those seconds to the self-check process, even though the
listeners may already be available.
2. While creating the socket control client, "s.Status()" is called to
obtain the ctrld service status, so we can terminate early if the
service failed to run. However, that makes a lot of syscalls in a
hot loop, eating CPU constantly while the command is running. On
Windows, that call becomes slower with each call. The same effect can
be seen in the Windows services manager GUI: pressing the
start/stop/restart buttons fast enough raises a timeout.
3. The socket control server is started late, after all the listeners
are up. That makes the loop for creating the socket control client run
longer and use more resources than necessary.
Fixes for these problems are quite obvious:
1. Remove the hard-coded 5-second wait. NotifyStartedFunc is enough to
ensure that listeners are ready to accept requests.
2. Check "s.Status()" only once before the loop. There is already a
30-second timeout, so if anything goes wrong, the self-check process
is terminated and won't hang forever.
3. Start the socket control server earlier, so newSocketControlClient
can connect to the server in the fewest attempts, then query the
"/started" endpoint to ensure the listeners are ready.
With these fixes, "ctrld start" now runs much faster on modern machines,
taking ~1-2 seconds (previously ~5-8 seconds) to finish. On a dual-core
VM, it takes ~5-8 seconds (previously a few dozen seconds or a timeout).
---
While at it, there are two refactorings to make the code easier to
read/maintain:
- PersistentPreRun is now used in the root command to init console
logging, so we don't have to initialize it in sub-commands.
- NotifyStartedFunc now uses a channel for synchronization instead of a
mutex, which removes the ugly asymmetric lock calls, makes the code
more idiomatic, and theoretically gives better performance.
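A minimal sketch of channel-based "started" notification follows. The
names startedNotifier, Notify, and Started are hypothetical and do not
reflect the real NotifyStartedFunc signature. Closing a channel
broadcasts to every waiter, and sync.Once makes repeated notifications
safe.

```go
package main

import "sync"

// startedNotifier is a hypothetical type for illustration only.
// The channel is closed once when the listeners are up; receiving from
// a closed channel returns immediately, so all waiters are released.
type startedNotifier struct {
	once sync.Once
	ch   chan struct{}
}

func newStartedNotifier() *startedNotifier {
	return &startedNotifier{ch: make(chan struct{})}
}

// Notify marks the listeners as started. Calling it more than once is
// safe because the close is guarded by sync.Once.
func (n *startedNotifier) Notify() {
	n.once.Do(func() { close(n.ch) })
}

// Started returns a channel that is closed after Notify is called.
func (n *startedNotifier) Started() <-chan struct{} {
	return n.ch
}
```

Waiters can combine Started() with a timeout in a select statement,
which is awkward to express with a mutex.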
When an Android device joins the network, it uses ctrld to resolve its
private DNS once and never reaches ctrld again. Each time, it uses a
different IPv6 address, which causes hundreds/thousands of different
client IDs to be created for the same device, which is pointless.
Some users mentioned that when there is an Internet outage, ctrld fails
to recover, crashing or locking up the router. When requests start
failing, clients emit more queries, creating a resource death spiral
that can brick the device entirely.
To guard against this case, this commit implements an upstream monitor
approach:
- Mark an upstream as down after 100 consecutive failed queries.
- Start a goroutine to check when the upstream is back again.
- While the upstream is down, answer all queries with SERVFAIL.
- The checking process uses backoff retry to reduce the request rate.
- As soon as a query succeeds, mark the upstream as alive and resume
normal operation.
So ctrld can record the raw/original client IP instead of looking it up
from MAC to IP, which may not be the right choice in some network
setups, such as using wireguard/vpn on a Merlin router.
Use the same approach as in cd mode, but do it only once, when ctrld
runs for the first time; the config is then re-used afterwards.
While at it, also add Dockerfile.debug for better troubleshooting
with an alpine base image.