Online Tracking As Fast As Possible
How online tracking works and the different parties involved
October 27, 2018
Technology today is so invasive that buzzwords like privacy, cookies and tracking appear on every news outlet that covers the web. But the coverage usually just scratches the surface, leaving much of the underlying machinery in the dark. This leads the average internet denizen to fear or blame the wrong thing. In this article, I'll attempt to describe the different pieces that make up online tracking, in a way that goes beyond cookies and scripts.
The HTTP protocol
The web is built on top of HTTP. It's a protocol where the client requests what it wants from the server, and the server responds with what it has. It's also, at its most basic, a stateless protocol: an individual request-response pair has no association with any other request-response pair. This makes the protocol simple - client sends a bunch of text, server returns a bunch of text, rinse and repeat. But without the ability to associate the multiple requests of a single client together, things like authentication, sessions and contextually-correct responses are impossible.
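To make the statelessness concrete, here's a minimal sketch of a server that answers a request using only the text of that request. The `handle_request` function and the `example.com` host are made up for illustration; a real server does far more parsing than this.

```python
def handle_request(raw_request: str) -> str:
    """Answer one request using only what that request contains."""
    # The first line looks like "GET /inbox HTTP/1.1".
    request_line = raw_request.split("\r\n", 1)[0]
    _method, path, _version = request_line.split(" ")
    body = f"You asked for {path}"
    return (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
        f"{body}"
    )


# Two requests from the same client: the server keeps nothing between
# calls, so it has no way to know they came from the same person.
first = handle_request("GET /inbox HTTP/1.1\r\nHost: example.com\r\n\r\n")
second = handle_request("GET /inbox HTTP/1.1\r\nHost: example.com\r\n\r\n")
print(first == second)
```

Both responses are identical because nothing in either request tells the server who is asking.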
Browser cookies
Cookies are small pieces of text stored in the browser. What makes cookies special is that they are sent to the server during an HTTP request, and received from the server during an HTTP response. If a request is made to a domain, any cookies present for that domain are attached to the headers of the request, making them readable by the server. On the flip side, when a server responds, it can optionally set cookie information in the headers of the response, which the client uses to update its local stash of cookies. This makes cookies a popular way to store and sync state between client and server.
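The round trip can be sketched with Python's standard `http.cookies` module. The cookie name `theme` and the `client_jar` dictionary are assumptions for the example; a real browser's cookie store also tracks domains, paths and expiry.

```python
from http.cookies import SimpleCookie

# Server side: prepare the value of a Set-Cookie response header.
response_cookie = SimpleCookie()
response_cookie["theme"] = "dark"
response_cookie["theme"]["path"] = "/"
set_cookie_header = response_cookie["theme"].OutputString()

# Client side: read the Set-Cookie value and update the local stash.
client_jar = {}
received = SimpleCookie()
received.load(set_cookie_header)
for name, morsel in received.items():
    client_jar[name] = morsel.value

# Client side, next request: attach the stored cookies as a Cookie header.
cookie_header = "; ".join(f"{name}={value}" for name, value in client_jar.items())
print(cookie_header)
```

The server only ever sees what the client sends back in `cookie_header`, which is how state survives across otherwise unrelated requests.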
Unique identifiers
The most notorious use of cookies is tracking. However, it's not actually the cookie that does the tracking. It only facilitates tracking by allowing tracking domains to store in a cookie a unique ID that can be associated with the client. It starts when the client makes a request to a tracking domain. When a server on the tracking domain sees that the request does not carry a cookie containing an ID, it sets a cookie in the response containing a newly generated ID. This cookie is saved on the client upon receipt of the response. From there, future requests to the tracking domain carry that ID, allowing the tracker to associate multiple requests with a specific client, and therefore a specific user.
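Here's a rough sketch of that logic from the tracking server's point of view. The cookie name `uid`, the `track` function and the in-memory `seen_requests` log are all hypothetical; real trackers use their own names and storage.

```python
import uuid
from http.cookies import SimpleCookie

# Hypothetical in-memory log: ID -> pages requested under that ID.
seen_requests = {}


def track(cookie_header, page):
    """Return (uid, set_cookie), where set_cookie is None for known clients."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    if "uid" in cookies:
        # Returning client: reuse the ID it sent us.
        uid = cookies["uid"].value
        set_cookie = None
    else:
        # New client: mint an ID and ask the browser to store it.
        uid = uuid.uuid4().hex
        set_cookie = f"uid={uid}; Path=/"
    seen_requests.setdefault(uid, []).append(page)
    return uid, set_cookie


# First request carries no cookie, so the server hands out an ID.
uid, set_cookie = track("", "/news")
# Second request sends the ID back, linking it to the first.
same_uid, no_cookie = track(f"uid={uid}", "/sports")
print(seen_requests[uid])
```

After the second call, both pages sit under the same ID, even though each request on its own says nothing about who made it.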
Widget scripts
Tracking only happens if something on the page makes a request to a tracking domain, carrying with it the unique ID of the client and optional metadata. If nothing on the page makes this request, then the page is effectively a dead zone as far as the tracker is concerned. This is where ads, analytics, social media buttons and similar widget scripts come in. In order for these scripts to work, they make requests to their respective domains to load things like functionality, assets and data. These same requests can also be used to pass tracking information to their respective domains, effectively making them the trackers.
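As an illustration, this is roughly what such a request looks like when the metadata rides along in the query string. The endpoint `tracker.example`, the parameter names and both helper functions are invented for the sketch; the ID itself would arrive separately in the cookie header.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TRACKER_ENDPOINT = "https://tracker.example/collect"  # hypothetical endpoint


def build_beacon_url(page_url, event):
    """What a widget script might request while the page loads."""
    query = urlencode({"page": page_url, "event": event})
    return f"{TRACKER_ENDPOINT}?{query}"


def read_beacon(url, cookie_uid):
    """What the tracking server can recover from that request."""
    params = parse_qs(urlsplit(url).query)
    return {
        "uid": cookie_uid,  # taken from the cookie sent with the request
        "page": params["page"][0],
        "event": params["event"][0],
    }


beacon = build_beacon_url("https://news.example/article", "pageview")
hit = read_beacon(beacon, cookie_uid="abc123")
print(hit)
```

The same widget embedded on a different site would produce a different `page` value but the same `uid`, since the cookie belongs to the tracking domain and not to the site being visited.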
User movement
Tracking is only effective if you can follow users wherever they go. This means having tracker scripts on as many websites as possible, which would normally require code-level access to each site to add the script. But trackers don't need that access, because website owners have an incentive to add the scripts themselves. Analytics scripts provide user insights, social media buttons provide instant access to an audience, and ads provide additional profit. Because of this, website operators voluntarily add these scripts to their websites, effectively doing the work for the trackers and spreading their presence across the internet. This ubiquity, coupled with the unique client ID, allows trackers to follow a user virtually anywhere on the internet.
Better content
The biggest reason why tracking exists is to gather data about the user. This data can then be used to power personalized content, improve marketing campaigns, improve user experience, improve website performance, and so on. One goal in particular is increasing conversion - the practice of making a drive-by user take action beyond just viewing the content. This could be a purchase, a subscription, a user registration, a newsletter sign-up, whatever. Because higher conversion rates are highly sought after, targeted advertising is very profitable, and user data even more valuable.
Real-time bidding
Real-time bidding is a protocol between ad publisher and ad supplier which facilitates automated buying and selling of ads. It starts when a user visits a page that requests an ad from an ad publisher. The ad publisher broadcasts this request to ad suppliers for bidding, together with metadata about the user. Ad suppliers select their most relevant ad based on the request metadata and any historical data they have on the user, and submit a bid. The ad publisher then selects the highest bidder and renders their ad. In addition, the winning ad supplier is allowed to make a client request back to its own servers for its own tracking. The data used in the process, both from the ad publisher and from the ad supplier's own tracking, is then used by the ad supplier to submit better bids in the future.
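The auction step can be sketched in a few lines. Everything here is a toy: the `Bid` class, the two supplier functions, the prices and the `interests` metadata are made up, and real exchanges add pricing rules, timeouts and much richer bid requests.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    supplier: str
    price: float
    ad: str


def shoe_supplier(user_metadata):
    # Bids higher when the user looks like a good match.
    price = 2.5 if "running" in user_metadata["interests"] else 0.4
    return Bid("shoes", price, "Running shoe ad")


def travel_supplier(user_metadata):
    # Declines to bid when the user shows no interest in travel.
    if "travel" not in user_metadata["interests"]:
        return None
    return Bid("travel", 1.8, "Flight deal ad")


def run_auction(user_metadata, suppliers):
    """Broadcast the request, collect bids, and return the highest one."""
    bids = [supplier(user_metadata) for supplier in suppliers]
    valid_bids = [bid for bid in bids if bid is not None]
    return max(valid_bids, key=lambda bid: bid.price, default=None)


winner = run_auction(
    {"interests": ["running", "travel"]},
    [shoe_supplier, travel_supplier],
)
print(winner)
```

Note how each supplier's bid depends on what it knows about the user, which is why better user data translates directly into better bids.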
Tracker blockers
Most major browsers have some form of tracker blocker. Safari has its Intelligent Tracking Prevention, powered by on-device machine learning. Firefox has its own Tracking Protection, powered by Disconnect's block lists. Chrome blocks ads on sites that do not follow the Better Ads Standards. Opera comes with an ad blocker that's compatible with Adblock Plus block lists. And of course, there are third-party extensions like Adblock Plus. While they all take different approaches and degrees of blocking, they all have one common goal: prevent the client from being given a persistent ID. Because once an ID is established, a rogue request can send it to the tracking domain, and it's game over.
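The list-based approach is the easiest one to picture: before a request leaves the browser, its hostname is checked against a block list. This sketch assumes a tiny made-up list and a hypothetical `is_blocked` helper; real block lists have their own rule syntax, and Safari's machine-learning approach works differently altogether.

```python
from urllib.parse import urlsplit

BLOCKED_DOMAINS = {"tracker.example", "ads.example"}  # hypothetical list


def is_blocked(url):
    """True if the URL's host is a listed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return any(
        host == domain or host.endswith("." + domain)
        for domain in BLOCKED_DOMAINS
    )


print(is_blocked("https://cdn.tracker.example/pixel.gif"))
print(is_blocked("https://news.example/article"))
```

A blocked request never reaches the tracking domain, so the server never gets the chance to set or read an ID.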
Regulation
Tracker blocking is just a short-term solution to prevent tracking, a bandage on the bigger problem: corporations aggressively farming user data, to the point of invading privacy. The proper solution to this issue is regulation of how these corporations operate, transparency about what data is being collected, accountability in the event of breaches, as well as fine-grained user-level controls that let users opt their data in or out. The EU's General Data Protection Regulation (GDPR) is one such regulation, which I believe is the right way to approach this problem. Not more ad blocker filters.
Conclusion
Whether it's used for serving authenticated content or just relevant ads, knowing the user behind the browser is essential to the web. Otherwise, we'd all just be staring at the same, pre-generated, non-dynamic content all the time. However, one must also exercise caution when on the web. Until data regulations become global or become part of the technology itself, what goes on the web stays indefinitely on the web.
Hopefully this gave you a 10,000-foot overview of how the web works, how tracking works, how your data is being used on the internet, and how tracking can be addressed. As always, let me know in the comments what you think about the article.