High Performance Browser Networking by Ilya Grigorik


Link to the full book

Talk by Ilya Grigorik about improving the performance of the web - performance.now().reject(reasons)


Chapter 1 - Four components of latency

  • Transmission delay. Time taken to write the entire packet onto the link; a function of the link's data rate
  • Propagation delay. Time taken for the packet to travel from sender to receiver; a function of distance and the signal speed of the medium
  • Processing delay. Time taken to verify the packet, check for bit-level errors, process the header, and determine the packet's destination
  • Queuing delay. Time the packet spends waiting in a buffer (e.g., on a router) until it can be processed.

If packets arrive faster than a router can process them, it should drop them so that TCP's congestion control kicks in. In practice, many routers have very large buffers and delay those drops, which defeats this feedback loop (bufferbloat).

Speed of light in fibre optic cable is approximately 2*10^8 m/s or 66% of the speed of light in vacuum. Reducing the refractive index of fibre optic cable is an ongoing effort.

The latency from New York to London is 21 ms, but the last-mile latency between the ISP and the home can be anywhere from 10 ms (good) to 65 ms (bad).

The available bandwidth to the user is a function of the lowest-capacity link between the client and the destination server. Theoretically we can keep improving this, either by laying more fibre links between nodes or by improving how we multiplex data onto existing links.

The latency between two nodes cannot be improved indefinitely. It's possible to improve the fibre so that the speed of light within it increases, but there is a hard limit: c. Alternatively, it's possible to lay cable that takes a shorter route.


Chapter 2 - TCP (RFC 793) and IP (RFC 791)

  • Features
    1. Retransmission of lost data (waiting for ACKs)
    2. In-order delivery (through unique sequence IDs)
    3. Flow control, congestion control, congestion avoidance.
    4. Data integrity (through checksums)

Optimized for accurate delivery, not timely delivery.

  • 3 way Handshake

    1. Client sends a SYN with seq = x (random)
    2. Server replies with a SYN-ACK: ack = x + 1, seq = y (random)
    3. Client sends an ACK with seq = x + 1, ack = y + 1; application data can follow immediately after
  • TCP Fast Open (TFO) - since the 3-way handshake takes at least one round trip, this can be optimized by sending application data with the first SYN packet. Limitations: only idempotent requests are safe (the data in the SYN may be replayed), it only works for resumed sessions because a cryptographic cookie from a previous connection is required, the size of the first payload is limited, and both the client and server OS must support it before the client application can opt in.
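
A rough sketch of how both sides opt in, assuming Linux with net.ipv4.tcp_fastopen enabled and a Python build that exposes the MSG_FASTOPEN/TCP_FASTOPEN constants (port and payload are illustrative):

```python
import socket

# Server: enable TFO with a queue of 16 pending cookie-validated SYNs.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, 16)
srv.bind(("0.0.0.0", 8080))
srv.listen()

# Client: sendto() with MSG_FASTOPEN connects *and* carries the payload
# in the SYN; it silently falls back to a normal handshake if no TFO
# cookie from a previous connection is cached.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.sendto(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n",
           socket.MSG_FASTOPEN, ("127.0.0.1", 8080))
```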

  • Congestion collapse

    • happens when networks of differing bandwidth are connected
    • when the round-trip time exceeds the retransmission timeout, hosts start retransmitting datagrams that are merely delayed, injecting duplicate copies of the same datagrams into the network
    • this fills up buffers and packets need to be dropped, increasing retransmission
    • eventually all buffers are full and the network operates at a degraded level.

To avoid this - flow control, congestion control, congestion avoidance.

  • Flow control - With every ACK, the receiver advertises rwnd, the size of its receive window, i.e. the buffer that will hold the incoming data. If the receiver is unable to keep up, it advertises a smaller rwnd. Initially this was a 16-bit value because who would ever have a buffer larger than 65,535 bytes? RFC 1323 adds a window scaling option that left-shifts this 16-bit value so much larger buffer sizes can be advertised.
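
A small worked example of the scaling arithmetic (the shift count here is illustrative; the actual value is negotiated during the handshake):

```python
# RFC 1323 window scaling: the advertised 16-bit window is left-shifted
# by a per-connection scale factor (at most 14), allowing windows up to ~1 GB.
advertised = 0xFFFF          # 65,535 -- the largest raw 16-bit value
scale = 7                    # illustrative shift count from the SYN options
rwnd = advertised << scale   # effective receive window in bytes
print(f"effective rwnd = {rwnd:,} bytes (~{rwnd / 2**20:.1f} MiB)")
# -> effective rwnd = 8,388,480 bytes (~8.0 MiB)
```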

  • Congestion control. Slow start. The sender starts out by sending cwnd (congestion window) worth of data, which is 10 network segments as of 2013 (RFC 6928). cwnd grows exponentially: for every ACK received, two more packets can be sent. The maximum amount of data in flight is min(cwnd, rwnd). It often takes a few round trips before maximum throughput is reached, which is a problem for small downloads (see the sketch below). If a TCP connection has been idle, cwnd is reset to its initial value; this slow-start-restart behaviour can be disabled on the server (net.ipv4.tcp_slow_start_after_idle on Linux).
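
A back-of-the-envelope sketch of how many round trips slow start needs for a given transfer, assuming the 10-segment initial cwnd, a 1,460-byte segment size, no losses, and rwnd never being the limit:

```python
MSS = 1460                      # bytes per segment (typical Ethernet MSS)
cwnd = 10                       # initial congestion window, in segments
target = 64 * 1024              # e.g. a 64 kB response

rtts, sent = 0, 0
while sent < target:
    sent += cwnd * MSS          # one full window delivered per round trip
    cwnd *= 2                   # slow start: cwnd doubles every RTT
    rtts += 1
print(f"{rtts} round trips to deliver {target:,} bytes")
# -> 3 round trips to deliver 65,536 bytes
```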

  • Congestion avoidance. cwnd keeps doubling until a packet is dropped by the network. At that point cwnd is cut and then regrown according to an avoidance algorithm such as AIMD (this shows up as a sawtooth in throughput graphs). The assumption is that the packet loss occurred because a router along the way was overwhelmed, so reducing cwnd decreases the chance of further loss

  • The bandwidth-delay product - the maximum amount of un-ACKed data that can be on the wire, i.e. the effective window size. Since achievable throughput is window size divided by round-trip time, latency caps throughput regardless of link capacity

    • min(cwnd, rwnd) = 16 kB, delay = 100 ms. Max achievable bandwidth = 16 kB / 100 ms ≈ 1.31 Mbps
    • Client's bandwidth = 10 Mbps, delay = 100 ms. Ideal window size = 10 × 1,000,000 / 8 × 0.1 = 125,000 bytes ≈ 122.1 kB (both worked in the sketch below)
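
A quick sketch reproducing both numbers (the function names are mine):

```python
# Bandwidth-delay product: window / RTT caps throughput, and conversely
# the target bandwidth times the RTT gives the window needed to fill it.
def max_throughput_mbps(window_bytes: int, rtt_s: float) -> float:
    return window_bytes * 8 / rtt_s / 1_000_000

def ideal_window_bytes(bandwidth_mbps: float, rtt_s: float) -> float:
    return bandwidth_mbps * 1_000_000 / 8 * rtt_s

print(max_throughput_mbps(16 * 1024, 0.1))  # ~1.31 Mbps for a 16 kB window
print(ideal_window_bytes(10, 0.1))          # 125,000 bytes (~122.1 kB)
```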
  • To summarize congestion on TCP networks: if a link between client and server isn't being saturated, it could be because of

    • packet loss triggering congestion avoidance
    • small default windows in the config of client or server
    • explicit traffic shaping
  • Tuning TCP

    1. Disable slow-start restart for idle connections
    2. Keep a high initial cwind
    3. TCP Fast Open (TFO)
    4. Window scaling for larger buffers
  • Other optimizations

    1. Less data on the wire - either send less data, or compress
    2. Use CDNs so latency is lower
    3. Upgrade to latest kernels for the best algos
    4. Reuse TCP connections wherever possible
  • Why not TCP for some applications - to avoid head-of-line blocking: if a single packet is dropped, all subsequent packets can't be delivered to the application until it is retransmitted and received. Some applications (audio, video, games) would rather process packets as they arrive and handle ordering themselves, and some don't need ordering at all.
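A toy illustration of the in-order constraint: a TCP-like receiver can only release the contiguous prefix of the segment stream, so everything queued behind a missing segment stalls (sequence numbers simplified to small integers):

```python
def deliver_in_order(arrivals):
    """Release segments to the application only in sequence order."""
    next_seq, buffered, delivered = 0, {}, []
    for seq, data in arrivals:          # segments as they come off the wire
        buffered[seq] = data
        while next_seq in buffered:     # drain only the contiguous prefix
            delivered.append(buffered.pop(next_seq))
            next_seq += 1
    return delivered

# Segment 1 arrives last (e.g. it was retransmitted); 2 and 3 sit in the
# buffer, invisible to the application, until it shows up.
print(deliver_in_order([(0, "a"), (2, "c"), (3, "d"), (1, "b")]))
# -> ['a', 'b', 'c', 'd']
```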


Chapter 3 - UDP (RFC 768)

  1. No retransmission
  2. No ordering
  3. No congestion control
  4. No connection establishment or teardown
  5. No nothing! I guess you will never be anything more than a mere 🐵

DNS and WebRTC use UDP

UDP and NAT

  • UDP is a stateless protocol, but NAT requires state to decide how to route traffic. When there is outbound UDP traffic, the NAT box adds a translation record with some TTL describing how to route the response. Since this record expires after some time, it's the de facto best practice to send keep-alive packets regularly. This is true even for TCP, because some poorly implemented NAT boxes drop TCP translation records too.
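
A minimal sketch of that keep-alive practice; the peer address and 15-second interval are illustrative, not prescribed:

```python
import socket
import time

PEER = ("203.0.113.10", 9999)   # hypothetical peer / rendezvous endpoint

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
    # A 1-byte datagram is enough to refresh the NAT's translation record.
    sock.sendto(b"\x00", PEER)
    time.sleep(15)              # stay well under typical NAT UDP timeouts
```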

  • Worse, some p2p applications (games, VoIP, file sharing) need to act as servers too. These applications don't know their public IP, and even if they found it out and communicated it to a peer, an inbound packet to that address would still be dropped because the NAT box wouldn't know how to route it. Workarounds

    1. STUN (RFC 5389) - Session Traversal Utilities for NAT. The would-be server asks a STUN server on the public internet what its public IP:port mapping is; once the application learns this tuple, it communicates it to its peers (a minimal binding request is sketched after this list). Works 92% of the time.
    2. TURN (RFC 5766) - Traversal Using Relays around NAT. If STUN doesn't work, all traffic in both directions is routed through a relay, so the traffic is no longer p2p. Covers the remaining 8%.
    3. ICE (RFC 5245) - Interactive Connectivity Establishment. Establishes the most efficient tunnel between two participants, whether that's direct, through STUN, or through TURN.
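
A hedged sketch of the STUN binding request from item 1, hand-rolled against RFC 5389 rather than using a library; the server shown is Google's public STUN endpoint:

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442

def stun_public_endpoint(server=("stun.l.google.com", 19302)):
    """Send a STUN Binding Request and return our public (ip, port)."""
    txn_id = os.urandom(12)
    # Header: type=0x0001 (Binding Request), length=0, cookie, transaction ID.
    request = struct.pack("!HHI", 0x0001, 0, MAGIC_COOKIE) + txn_id

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(3)
    sock.sendto(request, server)
    data, _ = sock.recvfrom(2048)  # a real client would verify txn_id here

    # Walk the attributes looking for XOR-MAPPED-ADDRESS (type 0x0020).
    pos = 20  # skip the 20-byte header
    while pos + 4 <= len(data):
        attr_type, attr_len = struct.unpack_from("!HH", data, pos)
        if attr_type == 0x0020:
            # Value layout: reserved(1), family(1), xor-port(2), xor-ip(4).
            port = struct.unpack_from("!H", data, pos + 6)[0] ^ (MAGIC_COOKIE >> 16)
            raw_ip = struct.unpack_from("!I", data, pos + 8)[0] ^ MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", raw_ip)), port
        pos += 4 + attr_len + (-attr_len % 4)  # attributes are 32-bit aligned
    raise RuntimeError("no XOR-MAPPED-ADDRESS in response")

print(stun_public_endpoint())
```
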
  • Recommendations for UDP (RFC 5405)

    1. The application should handle most of what is handled by TCP. It should therefore implement congestion control, congestion avoidance, flow control, error checking, and handle loss, duplication and re-ordering
    2. It should use similar bandwidth to TCP
    3. Size of the datagram shouldn't exceed the Maximum Transmission Unit (MTU)
    4. Should enable checksums

Chapter 4 - TLS

  • TLS 1.2 (RFC 5246)
  • TLS 1.3

Refer to the existing note on TLS (TODO link)


Skipping Chapters 5, 6, 7


Chapter 8 - Optimizing for mobile networks

  • 3 main considerations - presentation given the form factor, battery life, and the performance of the radio
  • For presentation, read some other book.
  • Radio and battery
    • Radio use at full power can drain the battery in hours
    • Each successive generation of radio requires more battery
    • Radio power consumption is non-linear in the amount of data transferred: even a little data requires bringing the radio to full power
  • Simple rules
    1. Avoid polling
    2. Push notifications should be used instead. However, high-frequency pushes consume a similar amount of battery as polling, so pushes should be aggregated on the server
    3. Inbound and outbound requests should be coalesced
    4. Non-critical requests should be deferred until the radio is already active
  • Energy conversions
    • 3 Ah × 5 V = 15 Wh (phones nowadays are ~3,000 mAh and the supply voltage is 5 V)
    • converting to joules: 15 Wh × 3,600 J/Wh = 54,000 J
    • a radio state transition to high power costs ~10 J; polling at a 1-minute interval for an hour = 600 J. That's roughly 1% of total battery capacity per polling application per hour (worked through in the sketch after this list)
  • Energy consumption
    • With WiFi, each device sets its own transmit power, which is usually in the 30–200 mW range. By comparison, the transmit power of the 3G/4G radio is managed by the network and can consume as low as 15 mW when in an idle state. However, to account for larger range and interference, the same radio can require 1,000–3,500 mW when transmitting in a high-power state!
    • In practice, when transferring large amounts of data, WiFi is often far more efficient if the signal strength is good. But if the device is mostly idle, then the 3G/4G radio is more effective.
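
The battery arithmetic above, redone in code (the 10 J wake-up cost and 1-minute interval come from the notes; the rest is unit conversion):

```python
capacity_j = 3 * 5 * 3600          # 3 Ah x 5 V = 15 Wh = 54,000 J
wakeup_cost_j = 10                 # one idle -> high-power radio transition
polls_per_hour = 60                # one-minute polling interval

hourly_j = wakeup_cost_j * polls_per_hour          # 600 J per hour
print(f"{hourly_j / capacity_j:.1%} of the battery per polling app, per hour")
# -> 1.1% of the battery per polling app, per hour
```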

Latencies on 3G (200ms/RTT) and 4G (100ms/RTT)

| Leg                    | 3G           | 4G         |
| ---------------------- | ------------ | ---------- |
| Control plane          | 200–2,500 ms | 50–100 ms  |
| DNS lookup             | 200 ms       | 100 ms     |
| TCP handshake          | 200 ms       | 100 ms     |
| TLS handshake          | 200–400 ms   | 100–200 ms |
| HTTP request           | 200 ms       | 100 ms     |
| Total latency overhead | 200–3,500 ms | 100–600 ms |

The initial (control-plane) delay for 3.5G+ networks is 150–500 ms

  • Recommendations
    • While streaming, if the entire file will definitely be consumed, download it in one shot and shut off the radio. If it might not be, stream it, preferably with adaptive bitrate streaming
    • Do not cache or attempt to guess the state of the network.
    • Dispatch the request, listen for failures, and diagnose what happened.
    • Transient errors will happen; plan for them, and use a retry strategy.
    • Listen to connection state to anticipate the best request strategy.
    • Use a backoff algorithm for request retries; do not spin forever (a sketch follows this list)
    • If offline, log and dispatch the request later if possible.
    • Leverage HTML5 AppCache and localStorage for offline mode.
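
A minimal sketch of the backoff recommendation; `send_request` stands in for whatever transport call the application makes, and the caps and jitter scheme are illustrative:

```python
import random
import time

def send_with_backoff(send_request, max_attempts=5, base_delay=1.0):
    """Retry a flaky network call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except OSError:                      # transient network failure
            if attempt == max_attempts - 1:
                raise                        # out of attempts; surface it
            delay = base_delay * 2 ** attempt
            time.sleep(delay + random.uniform(0, delay))  # add jitter
```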

Chapter 9 - Primer

  • HTTP 0.9
    • only one document would be requested
    • 1 TCP connection per request.
  • HTTP 1.0
    • multiple documents would be requested on a single page; features were introduced as a result:
    • the concept of headers (request and response metadata)
  • HTTP 1.1
    • byte range requests
    • keepalive! (reusing existing TCP connection, reducing latency)
    • caching
    • request pipelining (failed due to lack of support)
  • HTTP 2.0
    • header compression
    • request multiplexing

HTTP: The Definitive Guide has further details about the protocol

Optimizing tips from High Performance Web Sites

  1. Make fewer requests
  2. Make fewer DNS requests
  3. Use gzip
  4. Configure ETags and the Expires header
  5. Use CDNs
  6. Avoid HTTP redirects

Request pipelining doesn't work in practice. Workarounds for making fewer requests:

  1. Spriting - increases memory consumption because the entire image is decoded even when only part of it is needed. Even a small change invalidates the cache for the whole sprite
  2. Concatenation of all resources - delays processing, since the browser can't act on the bundle until it has fully arrived, and hurts caching: a change to any one file invalidates the cached bundle