Dropbox traffic infrastructure edge network – software engineering daily gas efficient cars under 10000

###########

In this post we will describe the Edge network part of Dropbox traffic infrastructure. This is an extended transcript of our NginxConf 2018presentation. e payment electricity bill mp Around the same time last year we described low-level aspects of our infra in the Optimizing web servers for high throughput and low latency post. This time we’ll cover higher-level things like our points of presence around the world, GSLB, RUM DNS, L4 loadbalancers, nginx setup and its dynamic configuration, and a bit of gRPC proxying. Dropbox scale

Dropbox has more than half a billion registered users who trust us with over an exabyte of data and petabytes of corresponding metadata. For the Traffic team this means millions of HTTP requests and terabits of traffic. To support all of that we’ve built an extensive network of points of presence (PoPs) around the world that we call Edge. Why do we need Edge?

The current PoP selection procedure is human guided but algorithm-assisted. Even with a small number of PoPs without assistive software it may be challenging to choose between, for example, a PoP in Brazil and a PoP in Australia. The problem persists as the number of PoPs grows: e.g. what location will benefit Dropbox users better, Vienna or Warsaw?

By “population” one can use pretty much any metric we want to optimize, for example total number of people in the area, or number of existing/potential users. As for the loss function to determine the score of each placement one can use something standard like L1 or L2 loss. In our case we try to overcompensate for the effects of latency on the TCP throughput.

Some of you may see that the problem here that can be solved by more sophisticated methods like Gradient Descent or Bayesian Optimization. 101 gas station This is indeed true, but because our problem space is so small (there are less than 100K 7th level s2 cells) we can just brute-force through it and get a definitively optimal result instead of the one that can get stuck on a local optimum. GSLB

This is true for a small and medium number of PoPs. But there is a conjecture that “critical” misrouting probability (e.g. probability of routing user to a different continent) in an anycasted network drops sharply with number of PoPs. Therefore it is possible that with increasing number of PoPs, anycast may eventually start outperforming GeoDNS. We’ll continue looking at how our anycast performance scales with the number of PoPs.

Our DNS setup evolved quite a bit over last few years: we started with a simple continent→PoP mappings, then switched to country→PoP with a per-state mapping data for serving network traffic to large countries like the US, Canada, etc. At the moment, we are juggling relatively complex LatLong-based routing with AS-based overrides to work around quirks in internet connectivity and peering. Hybrid unicast/anycast GSLB

Such an approach gives us the ability to quickly switch between unicast and anycast addresses and therefore immediate fallback without waiting for DNS TTL to expire. This also allows graceful PoP draining and all the other benefits of DNS traffic steering. All of that comes at a relatively small operational cost of a more complicated setup and may cause scalability problems once you reach the high thousands of VIPs. On the bright side, all PoP configs now become more uniform. Real User Metrics

All the GSLB methods discussed up until now have one critical problem: none of them uses actual user-perceived performance as a signal, but instead rely on some approximations: BGP uses number of hops as a signal, while GeoIP uses physical proximity. We want to fix that by using Real User Metrics (RUM) collection pipeline based on performance data from our desktop clients.

Years ago we invested in an availability measurement framework in our Desktop Clients to help us estimate the user-perceived reliability of our Edge network. The system is pretty simple: once in a while a sample of clients run availability measurements against all of our PoPs and report back the results. We extended this system to also log latency information, which gave us sufficient data to start building our own map of the internet.

Specifics of map generation are up in the air right now: we’ve tried (and continue trying out) different approaches: from simple HiveQL query that does per-/24 aggregation to ML-based solutions like Random Forests, stacks of XGBoosts, and DNNs. Sophisticated solutions are giving slightly better, but ultimately comparable results, at the cost of way longer training and reverse engineering complexity.

Once a RUM-based map is constructed, it is crucial to be able to estimate how good it is by using a single value, something like an F1 score used for binary classification or BLEU score used for evaluating machine translation. That way, one can not only automatically prevent bad maps from going live, but also numerically compare the quality of different map iterations and construction algorithms.

Another way of visualizing IP maps is to skip the whole GeoDNS mapping step and plot IP addresses on the 2D plane by mapping them on a space filling curve, e.g. electricity voltage in india Hilbert curve. electricity jewels One can also place additional data in the height and color dimensions. This approach will require some heavy regularization for it to be consumable by humans and even more ColorBrewer2 magic to be aesthetically pleasing.

RUM-based DNS is an actively evolving project, and we have not shipped it to our main VIPs yet, but the data we’ve collected from our GSLB experiments shows that it is the only way we can properly utilize more than 25-30 PoPs. This project will be one of our top priorities in 2019, because even metrics collected from an early map prototypes show that it can improve effectiveness of our Edge network by up to 30% using RUM DNS.

A quick note about another more explicit way of routing users to PoPs. All these dances with guessing users’ IP address based on their resolver, GeoIP effectiveness, optimality of decisions made by BGP, etc. are all no longer necessary after a request arrives at the PoP. Because at that point in time, we know the users’ IP and even have an RTT measurement to them. At that point, we can route users on a higher level, like for example embedding a link to a specific PoP in the html, or handing off a different domain to a desktop client trying to download files.

This loadbalancing method allows for a very granular traffic steering, even based on per-resource information, like user ID, file size, and physical location in our distributed storage. Another benefit is almost immediate draining of new connections, though references to resources that were once given out may live for extended periods of time.

PoPs consist of network equipment and sets of Linux servers. An average PoP has good connectivity: backbone, multiple transits, public and private peering. By increasing our network connectivity, we decrease the time packets spend in the public internet and therefore heavily decrease packet loss and improve TCP throughput. Currently about half of our traffic comes from peering.

Do packet processing early in network stack. This allows in-kernel data structures and TCP/IP parsing routines to be reused. For quite a while now, Linux has IPVS and netfilter modules that can be used for connection-level loadbalancing. Recent kernels have eBPF/XDP combo which allows for a safer and faster way to process packets in kernel space. Tight coupling with kernel though has some downsides: upgrade of such LB may require reboot, very strict requirements on kernel version, and difficult integration testing.

Create a virtual NIC PCIe device with SRIO-V, bypass the kernel through DPDK/netmap/etc, and get RX/TX queues in an application address space. This gives programmers full control over the network, but tcp/ip parsing, data structures, and even memory management must be done manually (or provided by a 3rd party library). gas news today Testing this kind of setup is also much easier.

We currently use our homebrew version of consistent hashing module, but starting from linux-4.18 there is a Maglev Hash implementation: [ip_vs_mh]( https://github.com/torvalds/linux/blob/master/net/netfilter/ipvs/ip_vs_mh.c). Compared to Ketama, Maglev Hash trades off some of the hash resiliency for more equal load distribution across backends and lookup speed.

We hash incoming packets based on 5-tuple (proto, sip, dip, sport, dport) which improves load distribution even further. This sadly means that any server-side caching becomes ineffective since different connections from the same client will likely end up on different backends. If our Edge did rely on local caching, we could use 3-tuple hashing mode where we would only hash on (protocol, sip, dip).

Another interesting fact is that L4LB will need to do some special handling of ICMP’s Packet Too Big replies, since they will originate from a different host and therefore can’t use plain outer header hashing, but instead must be hashed based on the tcp/ip headers in the ICMP packet payload. Cloudflare uses another approach for solving this problem with its [pmtud]( https://github.com/cloudflare/pmtud): broadcast incoming ICMP packets to all the boxes in the PoP. This can be useful if you do not have a separate routing layer and are ECMP’ing packets straight to your L7 proxies.

As for the future work, we have a number of things we want to try. First, replace the routing dataplane with either a DPDK or XDP/eBPF solution, or possibly just integrating an open-source project like Katran. Second, we currently use IP-in-IP for packet encapsulation and it’s about time we switch it to something more modern like GUE which is way more NIC-friendly in terms of steering and offload support. L7 proxies (nginx)

Having PoPs close to our users decreases the time needed for both TCP and TLS handshakes, which essentially leads to a faster TTFB. But owning this infrastructure instead of renting it from a CDN provider allows us to easily experiment and iterate on emerging technologies that optimize latency and throughput sensitive workloads even further. Let’s discuss some of them.

Very quick note about how we build and deploy nginx: like everything else in Dropbox, we use Bazel to reproducibly and hermetically build a static nginx binary, copy over configs, package all of this into a squshfs, use torrent to distribute resulting package to all the servers, mount it (read-only), switch symlink, and finally run nginx upgrade.

Our nginx configuration is static and bundled with the binary therefore we need a way to dynamically configure some aspects of the configuration without full redeploy. Here is where Upstream Management Service kicks in. p gaskell UMS is basically a look-aside external loadbalancer for nginx which allows us to reconfigure upstreams on the fly. One can create such system by:

All of this pretty much covers the external part of Traffic Infrastructure, but there is another half that is not directly visible to our users: gRPC-based service mesh, scalable and robust service discovery, and a distributed filesystem for config distribution with notification support. All of that is coming soon in the next series of blog posts. We’re hiring!