When reading base64 config, either via the command line or via custom
config from the Control D API, we want the new config to replace the
old one entirely instead of mixing with it. So a new viper instance
should be created before reading in the new config.
That also simplifies the self-check process, because the config is now
always set correctly, instead of having to watch for changes made by
the "ctrld run" command.
However, the log file and listener config need special handling,
because they could be changed/unset from the Control D API:
- The log file can change dynamically each time ctrld runs, so the
logging init process needs to re-initialize if the log setup changed.
- For listener setup, users could leave ip and port empty, and ctrld
will pick a random loopback address 127.0.0.x:53. However, on Linux
systems which use systemd-resolved, the stub listener won't forward
queries from its address 127.0.0.53 to 127.0.0.x, so ctrld will use
the default router interface address instead.
When running on routers, ctrld leverages the default setup, letting
dnsmasq run on port 53 and forward queries to the ctrld listener on
port 5354. However, this setup is not serialized to the config file,
which confuses users.
Fixing this by writing the correct router setup to the config file.
While at it, updating the documentation to reflect that, and also
adding a note that changing the default router setup could break things.
UniFi Gateway (USG) uses its own DNS forwarding rule, which is
configured by default in the /etc/dnsmasq.conf file. Adding ctrld's own
config in /etc/dnsmasq.d won't take effect. Instead, we must make
changes directly to /etc/dnsmasq.conf, configuring ctrld as the only
upstream.
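
A minimal sketch of what "ctrld as the only upstream" could look like in
dnsmasq terms. The helper name dnsmasqUpstream and the 127.0.0.1 address
are assumptions for illustration; port 5354 is the router listener port
mentioned earlier in this log. The exact lines ctrld writes may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// dnsmasqUpstream (hypothetical helper) renders dnsmasq directives that
// make ip:port the only upstream:
//   - "no-resolv" stops dnsmasq from reading upstreams from resolv.conf
//   - "server=<ip>#<port>" forwards queries to the given address and port
func dnsmasqUpstream(ip string, port int) string {
	var sb strings.Builder
	sb.WriteString("no-resolv\n")
	fmt.Fprintf(&sb, "server=%s#%d\n", ip, port)
	return sb.String()
}

func main() {
	// Assumed listener address; adjust to the actual ctrld listener.
	fmt.Print(dnsmasqUpstream("127.0.0.1", 5354))
}
```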
The current state of ctrld is very "high stakes" and easy to mess up,
and is unforgiving when "ctrld start" fails. That would leave the
router in a broken, unrecoverable state.
This commit makes these changes to improve the state:
- Moving the router setup process to after the ctrld listeners are
ready, so dnsmasq won't flood ctrld with requests while the listeners
are not yet ready to serve them.
- On routers, when ctrld is stopped, restore the router DNS setup. That
leaves the router in a good state on reboot/startup, and helps remove
the custom DNS server for NTP synchronization on some routers.
- If self-check fails, uninstall ctrld to restore the router to a good
state, preventing confusion that the ctrld process is still running
even though self-check reports it did not start.
The current transport setup uses a mutex lock for synchronization.
This works fine on a normal device, but on low capacity routers, the
high contention may affect performance, causing ctrld to hang.
Instead of using a mutex lock, using atomic operations for
synchronization yields better performance:
- There's no lock, so other requests won't be blocked. And even if these
requests use the old broken transport, it would be fine, because the
client will retry them later.
- The transport setup is now done once, on demand when the transport is
accessed, or when re-bootstrapping is signaled. The first call to
dohTransport will block others, but the transport is warmed up before
ctrld starts serving requests, so client requests won't be affected.
That helps ctrld handle requests better when running on low capacity
devices.
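
The lock-free approach can be sketched roughly as below. Apart from
dohTransport, all names here (upstream, setupTransport, rebootstrap, the
struct fields) are hypothetical, and the real ctrld code may be organized
differently; this only illustrates the once-plus-atomic pattern.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

// upstream is a hypothetical stand-in for ctrld's upstream config.
type upstream struct {
	transport atomic.Pointer[http.Transport] // read without locking
	once      sync.Once                      // guards the first, on-demand setup
	rebooting atomic.Bool                    // true while a re-bootstrap is in flight
}

// setupTransport builds a new transport and publishes it atomically.
func (u *upstream) setupTransport() {
	u.transport.Store(&http.Transport{})
}

// dohTransport returns the current transport, creating it on first use.
// Only the very first call can block other callers (inside once.Do).
func (u *upstream) dohTransport() *http.Transport {
	u.once.Do(u.setupTransport)
	return u.transport.Load()
}

// rebootstrap replaces the transport. Concurrent callers are never
// blocked: at most one goroutine rebuilds, the rest return immediately
// and keep using the old transport until the new one is stored.
func (u *upstream) rebootstrap() {
	if !u.rebooting.CompareAndSwap(false, true) {
		return // another goroutine is already re-bootstrapping
	}
	defer u.rebooting.Store(false)
	u.setupTransport()
}

func main() {
	u := &upstream{}
	first := u.dohTransport()
	u.rebootstrap()
	fmt.Println("transport replaced:", first != u.dohTransport())
}
```

Requests that loaded the old pointer just before the swap keep using it,
which matches the note above that clients simply retry on failure.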
Furthermore, the transport configuration is also tweaked for better
default performance:
- MaxIdleConnsPerHost is set to 100 (default is 2), which allows more
connections to be reused, reducing the load of opening/closing
connections on demand. See [1] for a real example.
- Due to the raising of MaxIdleConnsPerHost, once the transport is
GC-ed, it must explicitly close its idle connections.
- TLS client session cache is now enabled.
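
A sketch of the three settings above using only the standard library.
The constructor name newDoHTransport is hypothetical, and using
runtime.SetFinalizer to close idle connections is one possible way to
implement the second bullet, not necessarily how ctrld does it.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"runtime"
)

// newDoHTransport (hypothetical name) applies the tweaks listed above.
func newDoHTransport() *http.Transport {
	t := &http.Transport{
		MaxIdleConnsPerHost: 100, // net/http default is 2
		TLSClientConfig: &tls.Config{
			// Enable TLS session resumption; a capacity below 1
			// selects the package's default capacity.
			ClientSessionCache: tls.NewLRUClientSessionCache(0),
		},
	}
	// Assumption: when the transport is garbage collected, close its
	// idle connections so up to 100 per-host connections do not linger.
	runtime.SetFinalizer(t, func(t *http.Transport) { t.CloseIdleConnections() })
	return t
}

func main() {
	tr := newDoHTransport()
	fmt.Println("max idle conns per host:", tr.MaxIdleConnsPerHost)
}
```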
Last but not least, the upstream ping process is also reworked. DoH
transport is an HTTP transport, so doing a HEAD request is enough to
warm up the transport, instead of doing a full DNS query.
[1]: https://gitlab.com/gitlab-org/gitlab-pages/-/merge_requests/274
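
The HEAD-based warm-up could look roughly like this. The names warmUp
and demo are hypothetical, and the demo uses a local test server rather
than a real DoH endpoint.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// warmUp sends a HEAD request so the client's transport establishes a
// connection to the endpoint. The response status is not inspected.
func warmUp(ctx context.Context, c *http.Client, endpoint string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodHead, endpoint, nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// demo runs warmUp against a local test server and returns the HTTP
// method that the server observed.
func demo() (string, error) {
	got := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got <- r.Method
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := warmUp(ctx, srv.Client(), srv.URL); err != nil {
		return "", err
	}
	return <-got, nil
}

func main() {
	fmt.Println(demo())
}
```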
Currently, there's no upper bound on how many requests ctrld will
handle at a time. This could be a problem on some low capacity routers,
where CPU/RAM is very limited.
This commit adds a configuration to limit how many requests will be
handled concurrently. The default is 256, which should work well for
most routers (the default concurrent requests of dnsmasq is 150).
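
One common way to enforce such a limit in Go is a buffered channel used
as a counting semaphore. The sketch below assumes hypothetical names
(limiter, peakConcurrency); only the default of 256 comes from this
commit, and the actual implementation may use a different mechanism.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limiter bounds the number of handlers running at once, using a
// buffered channel as a counting semaphore.
type limiter struct{ slots chan struct{} }

func newLimiter(max int) *limiter {
	return &limiter{slots: make(chan struct{}, max)}
}

// do blocks until a slot is free, runs fn, then releases the slot.
func (l *limiter) do(fn func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	fn()
}

// peakConcurrency sends n requests through a limiter of size max and
// returns the highest number of handlers observed running at once.
func peakConcurrency(max, n int) int64 {
	l := newLimiter(max)
	var cur, peak atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				v := cur.Add(1)
				for { // record the maximum seen so far
					p := peak.Load()
					if v <= p || peak.CompareAndSwap(p, v) {
						break
					}
				}
				cur.Add(-1)
			})
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println("within limit:", peakConcurrency(256, 1000) <= 256)
}
```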
The OpenWrt version in old GL-inet devices does not support checking
status using /etc/init.d/<service_name>, so the sysV wrapping trick
won't work. Instead, we need to parse "ps" command output to check
whether the ctrld process is running or not.
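
A simplified sketch of the "ps" parsing idea. It assumes a BusyBox-style
five-column layout (PID USER VSZ STAT COMMAND); the function name
isRunning and the sample output are made up for illustration, and real
"ps" output on a given device may need different column handling.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strings"
)

// isRunning reports whether psOutput contains a process whose command
// base name equals name. It assumes the command is the 5th column, as in
// BusyBox "ps" output (PID USER VSZ STAT COMMAND).
func isRunning(psOutput, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(psOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 {
			continue // header fragments or malformed lines
		}
		if path.Base(fields[4]) == name {
			return true
		}
	}
	return false
}

func main() {
	// Made-up sample output for illustration only.
	out := "  PID USER       VSZ STAT COMMAND\n" +
		"    1 root      1544 S    /sbin/procd\n" +
		" 1234 root      7000 S    /usr/sbin/ctrld run\n"
	fmt.Println(isRunning(out, "ctrld"))
}
```

Matching on the command column rather than the whole line avoids
counting something like "grep ctrld" as a running ctrld process.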
While at it, making newService a wrapper of the service.New function,
preventing the caller from calling the latter without a following call
to the former, which would cause a mismatch in service operations.
To prevent abusive responses from malicious DNS servers, ctrld ignores
the response if the target does not match the question domain.
However, that would break CNAME chains, where such a mismatch is
allowed.
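
One possible way to allow CNAME chains while still rejecting unrelated
answers is sketched below. The record type and function names are
hypothetical and use plain strings instead of a DNS library; the actual
check in ctrld may be implemented differently.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a simplified, hypothetical view of a DNS answer record.
type record struct {
	name   string // owner name
	rtype  string // "A", "AAAA", "CNAME", ...
	target string // CNAME target, empty for other types
}

// canonical lowercases a name and strips the trailing dot.
func canonical(s string) string {
	return strings.ToLower(strings.TrimSuffix(s, "."))
}

// answersMatchQuestion reports whether every answer is owned either by
// the question name or by a name reachable from it through CNAME
// records in the same answer section.
func answersMatchQuestion(question string, answers []record) bool {
	allowed := map[string]bool{canonical(question): true}
	// Repeat until no new CNAME target is found, so that the order of
	// records in the answer does not matter.
	for changed := true; changed; {
		changed = false
		for _, rr := range answers {
			if rr.rtype != "CNAME" {
				continue
			}
			t := canonical(rr.target)
			if allowed[canonical(rr.name)] && !allowed[t] {
				allowed[t] = true
				changed = true
			}
		}
	}
	for _, rr := range answers {
		if !allowed[canonical(rr.name)] {
			return false
		}
	}
	return true
}

func main() {
	chain := []record{
		{name: "example.com.", rtype: "CNAME", target: "www.example.net."},
		{name: "www.example.net.", rtype: "A"},
	}
	fmt.Println(answersMatchQuestion("example.com.", chain))
}
```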
On some platforms, like pfSense, ntpd is not a problem, so do not spawn
the DNS server for it, which may conflict with the default DNS server.
While at it, also make sure that ctrld runs last on startup.
On DD-WRT v3.0-r52189, dnsmasq version 2.89 lease format looks like:
1685794060 <mac> <ip> <hostname> 00:00:00:00:00:04 9
It has 6 fields, while the current parser only looks for lines with
exactly 5 fields, which is too restrictive. In fact, the parser should
just skip lines with fewer than 4 fields, because the 4th field is the
hostname, which is the last piece of client info that ctrld needs.
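
The relaxed parsing rule can be sketched as follows. The lease type and
parseLeases name are hypothetical, and the MAC/IP/hostname values in the
sample are placeholders, not real data.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// lease holds the client info taken from one dnsmasq lease line.
type lease struct{ mac, ip, hostname string }

// parseLeases reads dnsmasq lease lines of the form
//
//	<expiry> <mac> <ip> <hostname> [extra fields...]
//
// and skips only lines with fewer than 4 fields, so both the 5-field
// and 6-field variants are accepted.
func parseLeases(data string) []lease {
	var out []lease
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 4 {
			continue
		}
		out = append(out, lease{mac: f[1], ip: f[2], hostname: f[3]})
	}
	return out
}

func main() {
	// Placeholder values for illustration.
	data := "1685794060 aa:bb:cc:dd:ee:ff 192.168.1.10 laptop 00:00:00:00:00:04 9\n" +
		"1685794061 11:22:33:44:55:66 192.168.1.11 phone *\n" +
		"garbage line\n"
	for _, l := range parseLeases(data) {
		fmt.Println(l.mac, l.ip, l.hostname)
	}
}
```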
Currently, on routers that require NTP waiting, ctrld runs the cleanup
process and restarts dnsmasq to restore the default DNS config, so ntpd
can query the NTP servers. It did work, but the code depends on the
router platform.
Instead, we can spawn a plain DNS listener before PreRun on routers.
This listener will serve NTP DNS queries, and once NTP is configured,
the listener is terminated and ctrld will start serving using its
configured upstreams.
While at it, also fix the userHomeDir function on FreshTomato, which
must return the binary directory for routers that require JFFS.
In commit 670879d1, the backoff was changed to be passed a real error,
instead of a placeholder. However, the test query may return a failed
response with a nil error, causing the backoff to never fire.
Fixing this by ensuring the error is wrapped, so the backoff always sees
a non-nil error.
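
The idea can be illustrated with a small helper that always yields a
non-nil error when the test query did not succeed. The names
selfCheckErr, rcodeSuccess and errTimeout are hypothetical; the actual
fix may wrap the error in a different way.

```go
package main

import (
	"errors"
	"fmt"
)

// rcodeSuccess mirrors the DNS RCODE value for NOERROR.
const rcodeSuccess = 0

// errTimeout is a sample transport error used for demonstration.
var errTimeout = errors.New("timeout")

// selfCheckErr converts the result of the test query into an error for
// the backoff. A failed response with a nil transport error still
// produces a non-nil error, so the backoff always fires on failure.
func selfCheckErr(rcode int, err error) error {
	if err != nil {
		return fmt.Errorf("self-check query: %w", err)
	}
	if rcode != rcodeSuccess {
		return fmt.Errorf("self-check query: unexpected rcode %d", rcode)
	}
	return nil
}

func main() {
	fmt.Println(selfCheckErr(2, nil))          // failed response, nil error
	fmt.Println(selfCheckErr(0, errTimeout))   // transport error
	fmt.Println(selfCheckErr(0, nil) == nil)   // success
}
```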
On WSL 1, the routing table does not contain a default route, causing
ctrld to fail to get the default iface for setting DNS. However, WSL 1
only uses /etc/resolv.conf for setting up DNS, so the interface does not
matter, because the setting is applied globally anyway.
To fix it, just return "lo" as the default interface name on WSL 1.
While at it, also removing the useless service.Logger call, which is not
unified with the current logger, and may cause false positives on
systems where syslog is not configured properly (like WSL 1).
Also passing the real error to backoff when doing self-check, so we
don't have to use a placeholder error.
In split mode, the code must check for IPv6 availability to return the
correct network stack. Otherwise, we may end up using "tcp6-tls" even
though the upstream IP is an IPv4 address.