17 Commits

Author SHA1 Message Date
Alex
41a00c68ac fix down state handling 2025-02-11 19:27:41 +07:00
Alex
e3b99bf339 mark upstream as down after 10s of no successful queries 2025-02-11 19:27:36 +07:00
Alex
60e65a37a6 do the reset after recovery finished 2025-02-10 18:56:09 +07:00
Alex
d37d0e942c fix countHealthy locking 2025-02-10 18:55:48 +07:00
Alex
98042d8dbd remove leaking logic in favor of recovery logic. 2025-02-10 18:55:36 +07:00
Alex
fb49cb71e3 debounce upstream failure checking and failure counts 2025-02-10 18:41:48 +07:00
Alex
cf6d16b439 set new dialer on every request
debugging

debugging

debugging

debugging

use default route interface IP for OS resolver queries

remove retries

fix resolv.conf clobbering on MacOS, set custom local addr for os resolver queries

remove the client info discovery logic on network change, this was overkill just for the IP, and was causing service failure after switching networks many times rapidly

handle ipv6 local addresses

guard ciTable from nil pointer

debugging failure count
2025-02-06 15:40:41 +07:00
Alex
2687a4a018 remove leaking timeout, fix blocking upstreams checks, leaking is per listener, OS resolvers are tested in parallel, reset is only done is os is down
fix test

use upstreamIS var

init map, fix watcher flag

attempt to detect network changes

attempt to detect network changes

cancel and rerun reinitializeOSResolver

cancel and rerun reinitializeOSResolver

cancel and rerun reinitializeOSResolver

ignore invalid inferaces

ignore invalid inferaces

allow OS resolver upstream to fail

dont wait for dnsWait group on reinit, check for active interfaces to trigger reinit

fix unused var

simpler active iface check, debug logs

dont spam network service name patching on Mac

dont wait for os resolver nameserver testing

remove test for osresovlers for now

async nameserver testing

remove unused test
2025-01-20 15:03:27 +07:00
Cuong Manh Le
89600f6091 cmd/cli: new flow for leaking queries to OS resolver
The current flow involves marking OS resolver as down, which is not
right at all, since ctrld depends on it for leaking queries.

This commits implements new flow, which ctrld will restore DNS settings
once leaking marked, allowing queries go to OS resolver until the
internet connection is established.
2025-01-20 14:57:23 +07:00
Cuong Manh Le
f986a575e8 cmd/cli: log upstream name if endpoint is empty 2025-01-20 14:57:09 +07:00
Cuong Manh Le
a95d50c0af cmd/cli: ensure set/reset DNS is done before checking OS resolver
Otherwise, new DNS settings could be reverted by dns watchers, causing
the checking will be always false.
2025-01-14 14:33:15 +07:00
Cuong Manh Le
6046789fa4 cmd/cli: re-initializing OS resolver before doing check upstream
Otherwise, the check will be done for old stale nameservers, causing it
never succeed.
2025-01-14 14:32:15 +07:00
Cuong Manh Le
3ea69b180c cmd/cli: use config timeout when checking upstream
Otherwise, for slow network connection (like plane wifi), the check may
fail even though the internet is available.
2025-01-14 14:32:01 +07:00
Cuong Manh Le
3e388c2857 all: leaking queries to OS resolver instead of SRVFAIL
So it would work in more general case than just captive portal network,
which ctrld have supported recently.

Uses who may want no leaking behavior can use a config to turn off this
feature.
2024-09-30 18:20:27 +07:00
Cuong Manh Le
5c24acd952 cmd/cli: fix bug causes checkUpstream run only once
To prevent duplicated running of checkUpstream function at the same
time, upstream monitor uses a boolean to report whether the upstream is
checking. If this boolean is true, then other calls after the first one
will be returned immediately.

However, checkUpstream does not set this boolean to false when it
finishes, thus all future calls to checkUpstream won't be run, causing
the upstream is marked as down forever.

Fixing this by ensuring the boolean is reset once checkUpstream done.
While at it, also guarding all upstream monitor operations with a mutex,
ensuring there's no race condition between marking upstream state.
2023-12-18 21:30:36 +07:00
Cuong Manh Le
d88cf52b4e cmd/cli: always rebootstrap when check upstream
Otherwise, network changes may not be seen on some platforms, causing
ctrld failed to recover and failing all requests.

While at it, also doing the check DNS in separate goroutine, prevent it
from blocking ctrld from notifying others that it "started". The issue
was seen when ctrld is configured as direct listener, requests are
flooded before ctrld started, causing the healtch process failed.
2023-11-06 20:01:25 +07:00
Cuong Manh Le
511c4e696f cmd/cli: add upstream monitor
Some users mentioned that when there is an Internet outage, ctrld fails
to recover, crashing or locks up the router. When requests start
failing, this results in the clients emitting more queries, creating a
resource spiral of death that can brick the device entirely.

To guard against this case, this commit implement an upstream monitor
approach:

 - Marking upstream as down after 100 consecutive failed queries.
 - Start a goroutine to check when the upstream is back again.
 - When upstream is down, answer all queries with SERVFAIL.
 - The checking process uses backoff retry to reduce high requests rate.
 - As long as the query succeeded, marking the upstream as alive then
   start operate normally.
2023-09-22 18:45:59 +07:00