For safety reason, ctrld will create a backup of the current binary when
running upgrade command.
However, on systems where ctrld status is got by parsing ps command
output, the current binary path is important and must be the same with
the original binary. Depends on kernel version, using os.Executable may
return new backup binary path, aka "ctrld_previous", not the original
"ctrld" binary. This causes upgrade command see ctrld as not running
after restart -> upgrade failed.
Fixing this by recording the binary path before creating new service, so
the ctrld service status can be checked correctly.
It seems to be a Windows bug when removing a forwarder and adding a new
one immediately then causing both of them to be added to forwarders
list. This could be verified easily using powershell commands.
Since the forwarder will be removed when ctrld stop/uninstall, ctrld run
could avoid that action, not only help mitigate above bug, but also not
waste host resources.
On low resources Windows Server VM, profiling shows the bottle neck when
interacting with Windows DNS server to add/remove forwarders using by
calling external powershell commands. This happens because ctrld try
setting DNS before it runs.
However, it would be better if ctrld only sets DNS after all its
listeners ready. So it won't block ctrld from receiving requests.
With this change, self-check process on dual Core Windows server VM now
runs constantly fast, ~2-4 seconds when running multiple times in a row.
The "ctrld start" command is running slow, and using much CPU than
necessary. The problem was made because of several things:
1. ctrld process is waiting for 5 seconds before marking listeners up.
That ends up adding those seconds to the self-check process, even
though the listeners may have been already available.
2. While creating socket control client, "s.Status()" is called to
obtain ctrld service status, so we could terminate early if the
service failed to run. However, that would make a lot of syscall in a
hot loop, eating the CPU constantly while the command is running. On
Windows, that call would become slower after each calls. The same
effect could be seen using Windows services manager GUI, by pressing
start/stop/restart button fast enough, we could see a timeout raised.
3. The socket control server is started lately, after all the listeners
up. That would make the loop for creating socket control client run
longer and use much resources than necessary.
Fixes for these problems are quite obvious:
1. Removing hard code 5 seconds waiting. NotifyStartedFunc is enough to
ensure that listeners are ready for accepting requests.
2. Check "s.Status()" only once before the loop. There has been already
30 seconds timeout, so if anything went wrong, the self-check process
could be terminated, and won't hang forever.
3. Starting socket control server earlier, so newSocketControlClient can
connect to server with fewest attempts, then querying "/started"
endpoint to ensure the listeners have been ready.
With these fixes, "ctrld start" now run much faster on modern machines,
taking ~1-2 seconds (previously ~5-8 seconds) to finish. On dual cores
VM, it takes ~5-8 seconds (previously a few dozen seconds or timeout).
---
While at it, there are two refactoring for making the code easier to
read/maintain:
- PersistentPreRun is now used in root command to init console logging,
so we don't have to initialize them in sub-commands.
- NotifyStartedFunc now use channel for synchronization, instead of a
mutex, making the ugly asymetric calls to lock goes away, making the
code more idiom, and theoretically have better performance.
The systemd-networkd-wait-online is only required if systemd-networkd
is managing any interfaces. Otherwise, it will hang and block ctrld from
starting.
See: https://github.com/systemd/systemd/issues/23304
This commit implements upgrade command which will:
- Download latest version for current running arch.
- Replacing the binary on disk.
- Self-restart ctrld service.
If the service does not start with new binary, old binary will be
restored and self-restart again.
On BSD, the service is made un-killable since v1.3.4 by using daemon
command "-r" option. However, when reading remote config, the ctrld will
fatally exit if the config is malformed. This causes daemon respawn new
ctrld process immediately, causing the "ctrld start" command hang
forever because of restart loop.
Since "ctrld start" already fetch the resolver config for validating
uid, it should validate the remote config, too. This allows better error
message printed to users, let them know that the config is invalid.
Further, if the remote config was invalid, we should disregard it and
generating the default working one in cd mode.
- short control socket name.(in IOS max length is 11)
- wait for control server to reply before checking for deactivation pin.
- Added separate name for control socket for mobile.
- Added stop channel reference to Control client constructor.
On BSD platform, using "daemon -r" may fool the status check that ctrld
is still running while it was terminated unexpectedly. This may cause
the check in newSocketControlClient hangs forever.
Using a sane timeout value of 30 seconds, which should be enough for the
ctrld service started in normal condition.
If the server is running old version of ctrld, the deactivation pin
check will return 404 not found, the client should consider this as no
error instead of returning invalid pin code.
This allows v1.3.5 binary `ctrld start` command while the ctrld server
is still running old version. I discover this while testing v1.3.5
binary on a router with old ctrld version running.
The loop is run after the main interface DNS was set, thus the error
would make noise to users. This commit removes the noise, by making
currentStaticDNS returns an additional error, so it's up to the caller
to decive whether to emit the error or not.
Further, the physical interface loop will now only log when the callback
function runs successfully. Emitting the callback error can be done in
the future, until we can figure out how to detect physical interfaces in
Go portably.
The save/restore DNS functionality always perform its job, even though
the DNS is not static, aka set by DHCP. That may lead to confusion to
users. Since DHCP settings was changed to static settings, even though
the namesers set are the same.
To fix this, ctrld should save/restore only there's actual static DNS
set. For DHCP, thing should work as-is like we are doing.