DNS & Anycast (Part 2): Monitoring DNS over Anycast
DNS is a foundational pillar of the StackPath platform, and engineering it soundly on top of Anycast is part of our approach to delivering performant edge services. In honor of this, we decided to write about what makes DNS and Anycast great as a pair. This is part two. You can find part one here.
There’s a special challenge in monitoring a DNS service that is hosted in multiple PoPs (Points of Presence) across the globe.
For performance, reliability, and scalability reasons, the DNS service is announced on top of an Anycast network. When run correctly, this results in shorter query times, a properly distributed load, and graceful failover. However, when run incorrectly, there’s a downside to Anycast.
I’ll talk about this below, but first let’s review how Anycast enables efficient Internet routing.
Internet and Anycast routing
The Internet works by aggregating multiple networks together, allowing them to communicate with one another. All those autonomous networks (each identified by an autonomous system number, or ASN) have no problem sending information within themselves; it’s the exchange points and interconnection points that require a routing technology like Anycast to allow autonomous networks to efficiently cross over to other networks.
For Anycast to work, each ASN must know to route to its closest PoP. This is determined primarily but not entirely through “AS path length.” The shortest path is often the one chosen, and the shortest path is often the one with the fewest networks. Which leaves us with a special case…
Given a PoP and an ASN within the same city, any traffic originating from that ASN is sent to the PoP within that same city. That geographic closeness usually means that there are fewer networks between the source ASN and the PoP, so the best path—the shortest path—should lie within the same city.
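The AS path rule described above can be sketched in a few lines. This is a minimal illustration rather than a model of full BGP route selection, which weighs other attributes as well. The Route type, the PoP labels, and the private-use ASNs are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    pop: str                  # PoP that announced the anycast prefix
    as_path: Tuple[int, ...]  # ASNs traversed to reach that PoP


def best_route(routes: List[Route]) -> Route:
    """Pick the candidate whose AS path contains the fewest networks.

    Only the AS path length step is modeled here; other route
    selection criteria are deliberately ignored.
    """
    return min(routes, key=lambda r: len(r.as_path))


# Hypothetical candidates seen by one source network.
# 64512-65534 are private-use ASNs, used purely for illustration.
candidates = [
    Route(pop="remote-city", as_path=(64601, 64602, 64700)),
    Route(pop="same-city", as_path=(64700,)),
]

chosen = best_route(candidates)
print(chosen.pop)  # prints "same-city"
```

With a PoP in the same city, the local path usually crosses fewer networks, so it wins the comparison.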
Testing the routing
We now have a test we can apply to see if our Anycast network is routing correctly. If we can find test nodes in every city where we have a PoP, we can execute a DNS query from each of those test nodes. If they all get a response from their geographically closest PoP, we know that everything is working as desired.
We need a very special DNS record to accomplish this, though: one for which every PoP’s DNS server returns a unique response to the same query. Ideally, the PoP’s name and city are included in the response. This is the opposite of the usual case, where we want every server to return the same response for the same record.
There are lots of ways to do this. One is to ask each PoP to execute a special override and insert a custom response for an identical request.
It looks something like this:
dig @anycastdns.stackpath.com -t CNAME whichcityamiin.whichcity.stackpath.com
;; ANSWER SECTION:
whichcityamiin.whichcity.stackpath.com. 60 IN CNAME dfw1.dallas.whichcity.stackpath.com.
We ask our DNS server for a CNAME record, whose value is a fully qualified domain name (FQDN). Each PoP is told to return a different CNAME via a special override, but only for that one special record: whichcityamiin.whichcity.stackpath.com.
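To make the response machine-readable, the PoP and city can be pulled out of the answer line. The sketch below assumes the CNAME target always follows the layout seen in the single example above, that is, <pop>.<city>.whichcity.stackpath.com. The function name parse_pop_from_cname is hypothetical.

```python
from typing import Tuple

# Assumed suffix shared by every per-PoP CNAME target.
SUFFIX = "whichcity.stackpath.com"


def parse_pop_from_cname(answer_line: str) -> Tuple[str, str]:
    """Return (pop, city) from a dig-style CNAME answer line."""
    # dig prints answers as: NAME TTL CLASS TYPE RDATA
    fields = answer_line.split()
    if len(fields) != 5 or fields[3].upper() != "CNAME":
        raise ValueError(f"not a CNAME answer line: {answer_line!r}")

    # Normalize case and drop the trailing root dot, if present.
    target = fields[4].rstrip(".").lower()
    if not target.endswith("." + SUFFIX):
        raise ValueError(f"unexpected CNAME target: {target!r}")

    # Whatever precedes the suffix should be exactly "<pop>.<city>".
    labels = target[: -(len(SUFFIX) + 1)].split(".")
    if len(labels) != 2:
        raise ValueError(f"unexpected label layout: {target!r}")
    pop, city = labels
    return pop, city


line = ("whichcityamiin.whichcity.stackpath.com. 60 IN CNAME "
        "dfw1.dallas.whichcity.stackpath.com.")
print(parse_pop_from_cname(line))  # prints ('dfw1', 'dallas')
```

Rejecting anything that does not match the expected layout keeps a misconfigured override from being silently read as a valid city.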
Completing the test
There is no shortage of third-party DNS testing providers with test nodes across multiple networks in different cities. In this example, we take a collection of these test nodes, covering every city where we have a PoP, and have each node run a test every day that queries our special domain name.
Using the third-party provider’s API, we download the responses to a local server for processing. Now it’s just a matter of extracting and comparing two values: the city of the test node that executed the query and the city name in the returned CNAME record.
Immediately we can tell from the test results which ASNs are routing correctly, that is, which ones have their DNS queries answered by the PoP in their own city. One mismatch stood out in our results. So how did a test node on Bell Canada’s network in Toronto route to Chicago instead of remaining local in Toronto?
Investigating the AS path taken, we found that Bell Canada (ASN 577) peers with StackPath (ASN 12989) in Chicago. This is the shortest AS path because there is a direct circuit between Bell Canada in Toronto and StackPath in Chicago. So Bell Canada was correctly choosing to send traffic originating in Toronto to be served from Chicago, with about 14 ms of latency. Herein lies one of the challenges with Anycast: the shortest AS path is not always the geographically closest one.
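The comparison step might look like the sketch below. The record fields (node_city, network, cname) are assumptions, since the provider's API format is not described here, and the sample data, including the PoP labels, is made up. The CNAME layout <pop>.<city>.whichcity.stackpath.com is inferred from the example response.

```python
from typing import Dict, List


def find_misrouted(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the results answered by a PoP outside the test node's city."""
    mismatches = []
    for result in results:
        # Assumed layout: <pop>.<city>.whichcity.stackpath.com
        labels = result["cname"].rstrip(".").lower().split(".")
        answered_city = labels[1]
        if answered_city != result["node_city"].lower():
            mismatches.append({**result, "answered_city": answered_city})
    return mismatches


# Hypothetical downloaded results; every value is illustrative.
results = [
    {"node_city": "Dallas", "network": "example-isp-a",
     "cname": "dfw1.dallas.whichcity.stackpath.com"},
    {"node_city": "Toronto", "network": "example-isp-b",
     "cname": "pop1.chicago.whichcity.stackpath.com"},
]

misrouted = find_misrouted(results)
for entry in misrouted:
    print(entry["network"], entry["node_city"], "->", entry["answered_city"])
```

City names with spaces or alternate spellings would need normalizing before the comparison. That is left out here for brevity.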
To solve this, a routing change is required. We’ll adjust the routing so that Bell Canada prefers a longer AS path with a shorter geographical route. This also requires us to add a network that peers with StackPath in Toronto, which allows Bell Canada’s DNS traffic to remain in Toronto.
Once done, we can re-investigate the traceroute to look at the new AS path that now goes from Bell Canada (ASN 577) in Toronto, to Cogent (ASN 174) in Toronto, to StackPath (ASN 12989) in Toronto.
Running DNS on top of an Anycast-based network results in the best of both worlds with improved reliability and performance. It is more complex than simply operating a standalone DNS server, but understanding the nuances and challenges also means delivering a superior solution.
Being able to learn from the monitoring feedback loop ensures that optimal decisions are made. The modern Internet is a dynamic and constantly changing place that imposes as many challenges on service providers as service providers impose on the Internet. Without constant monitoring and alerting, that rate of change would quickly overwhelm the service itself.