Once per minute, ctrld checks whether the DNS settings were changed. If
so, it re-applies the proper settings to the system interfaces.
For now, this is only done when deactivation_pin is set.
Attempt to reset DNS before starting a new ctrld process. This way,
ctrld reads the correct system DNS settings before changing them itself.
While at it, some optimizations are made:
- "ctrld start" won't set DNS anymore. Since "ctrld run" has already done
this, the start command can just query the socket control server and emit
a proper message to users.
- The gateway won't be included as nameservers on Windows anymore,
since the GetAdaptersAddresses Windows API always returns the correct
DNS servers of the interfaces.
- The list of nameservers that the OS resolver is using is shown during
ctrld startup, making debugging easier.
On a low-resource Windows Server VM, profiling shows a bottleneck when
interacting with the Windows DNS server to add/remove forwarders by
calling external PowerShell commands. This happens because ctrld tries
to set DNS before it runs.
However, it would be better if ctrld only set DNS after all its
listeners are ready, so setting DNS won't block ctrld from receiving
requests.
With this change, the self-check process on a dual-core Windows Server VM
now runs consistently fast, ~2-4 seconds when run multiple times in a row.
The "ctrld start" command runs slowly and uses much more CPU than
necessary. The problem has several causes:
1. The ctrld process waits for 5 seconds before marking listeners up.
That adds those seconds to the self-check process, even though the
listeners may already be available.
2. While creating the socket control client, "s.Status()" is called to
obtain the ctrld service status, so we can terminate early if the
service failed to run. However, that makes a lot of syscalls in a
hot loop, eating CPU constantly while the command is running. On
Windows, that call becomes slower after each call. The same effect can
be seen in the Windows services manager GUI: pressing the
start/stop/restart buttons fast enough raises a timeout.
3. The socket control server is started late, after all the listeners
are up. That makes the loop for creating the socket control client run
longer and use more resources than necessary.
The fixes for these problems are straightforward:
1. Remove the hard-coded 5-second wait. NotifyStartedFunc is enough to
ensure that listeners are ready to accept requests.
2. Check "s.Status()" only once, before the loop. There is already a
30-second timeout, so if anything goes wrong, the self-check process
is terminated and won't hang forever.
3. Start the socket control server earlier, so newSocketControlClient can
connect to the server with the fewest attempts, then query the "/started"
endpoint to ensure the listeners are ready.
With these fixes, "ctrld start" now runs much faster on modern machines,
taking ~1-2 seconds (previously ~5-8 seconds) to finish. On a dual-core
VM, it takes ~5-8 seconds (previously a few dozen seconds or a timeout).
---
While at it, there are two refactorings that make the code easier to
read/maintain:
- PersistentPreRun is now used in the root command to init console
logging, so we don't have to initialize it in sub-commands.
- NotifyStartedFunc now uses a channel for synchronization instead of a
mutex, removing the ugly asymmetric lock calls, making the code more
idiomatic, and theoretically improving performance.
- Shorten the control socket name (on iOS the max length is 11).
- Wait for the control server to reply before checking for the
deactivation pin.
- Add a separate control socket name for mobile.
- Add a stop channel reference to the Control client constructor.
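
For the channel-based NotifyStartedFunc synchronization mentioned earlier,
one common Go pattern is a channel that is closed exactly once. The type
and method names below are hypothetical and only illustrate the idea; the
actual ctrld implementation may differ.

```go
package main

import "sync"

// startNotifier signals "listeners are up" by closing a channel.
// Closing is a broadcast: every current and future waiter is released.
type startNotifier struct {
	once    sync.Once
	started chan struct{}
}

func newStartNotifier() *startNotifier {
	return &startNotifier{started: make(chan struct{})}
}

// notifyStarted is safe to call multiple times; only the first call
// closes the channel.
func (n *startNotifier) notifyStarted() {
	n.once.Do(func() { close(n.started) })
}

// done returns a channel that is closed once the listeners are up.
func (n *startNotifier) done() <-chan struct{} { return n.started }

// isStarted reports the current state without blocking.
func (n *startNotifier) isStarted() bool {
	select {
	case <-n.started:
		return true
	default:
		return false
	}
}
```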
The loop runs after the main interface DNS was set, thus the error
would be noise to users. This commit removes the noise by making
currentStaticDNS return an additional error, so it's up to the caller
to decide whether to emit the error or not.
Further, the physical interface loop now only logs when the callback
function runs successfully. Emitting the callback error can be done in
the future, once we figure out how to detect physical interfaces in
Go portably.
The save/restore DNS functionality always performs its job, even when
the DNS is not static, i.e. set by DHCP. That may confuse users, since
DHCP settings are changed to static settings, even though the
nameservers set are the same.
To fix this, ctrld should save/restore only when there is actual static
DNS set. For DHCP, things should keep working as they do now.
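
The rule can be illustrated with a small filter. The ifaceDNSState type
and its Static flag are assumptions made for this sketch; how ctrld
actually detects static versus DHCP-provided DNS is platform specific and
not shown here.

```go
package main

// ifaceDNSState is a hypothetical snapshot of one interface's DNS setup.
type ifaceDNSState struct {
	Name        string
	Static      bool // true when nameservers were set manually, not by DHCP
	Nameservers []string
}

// snapshotStaticDNS returns the nameservers worth saving: only those of
// interfaces with static DNS. DHCP-managed interfaces are left out, so a
// later restore never turns a DHCP setup into a static one.
func snapshotStaticDNS(states []ifaceDNSState) map[string][]string {
	saved := make(map[string][]string)
	for _, s := range states {
		if !s.Static || len(s.Nameservers) == 0 {
			continue
		}
		// Copy so later changes to the input do not affect the snapshot.
		saved[s.Name] = append([]string(nil), s.Nameservers...)
	}
	return saved
}
```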
After installing as a system service, "ctrld start" runs an end-to-end
test to ensure DNS can be resolved correctly. However, in case the
system is misconfigured (by a firewall, other software, ...) and the test
query cannot be sent to the ctrld listener, the current error message is
not helpful, causing confusion from the user's perspective.
To improve this, the selfCheckStatus function now returns the actual
status and error from its process. The caller can now rely on the service
status and the error to produce a more useful/friendly message to users.
Windows may raise WSAEHOSTUNREACH instead of WSAENETUNREACH when the
network is not available after resuming from sleep or switching networks,
so checkUpstream never kicks in for this type of error.
Generating the nextdns config must happen after stopping the current
ctrld process. Otherwise, config processing may pick the wrong IP+Port.
While at it, also improve logging when updating the listener config:
- Change warn to info, to avoid suggesting that "something is wrong".
- Do not emit info when generating the working default config, which may
cause duplicated messages to be printed.
Otherwise, network changes may not be seen on some platforms, causing
ctrld to fail to recover and fail all requests.
While at it, also run the DNS check in a separate goroutine, to prevent
it from blocking ctrld from notifying others that it "started". The issue
was seen when ctrld is configured as a direct listener and requests are
flooded before ctrld has started, causing the health check process to
fail.
Some users mentioned that when there is an Internet outage, ctrld fails
to recover, crashing or locking up the router. When requests start
failing, the clients emit more queries, creating a resource death spiral
that can brick the device entirely.
To guard against this case, this commit implements an upstream monitor
approach:
- Mark the upstream as down after 100 consecutive failed queries.
- Start a goroutine to check when the upstream is back again.
- While the upstream is down, answer all queries with SERVFAIL.
- The checking process uses backoff retry to reduce the request rate.
- As soon as a query succeeds, mark the upstream as alive and resume
normal operation.
The current approach to getting the default route IP is to find the LAN
interface with the same MAC address. However, there could be multiple
such interfaces, which confuses ctrld.
This commit fixes the issue by listing all possible private IPs, then
sorting them and using the smallest one for router self queries.
For reporting router queries, ctrld uses the private IP of the default
route interface. However, when the default route is connected directly to
the ISP, the interface will have a public IP, and another interface with
the same MAC address will be created for the LAN IP. So when no private
IP is found for the default route interface, ctrld must look at the other
interface to find the correct LAN IP.