HTTP headers
Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers.
HTTP headers let the client and the server pass additional information with an HTTP request or response. In code, headers are usually represented by an object whose keys are header names and whose values are the header values. Headers can also carry authentication tokens.
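As a minimal sketch of that name-to-value shape, the snippet below builds a request object with Python's standard library without sending it. The URL and the token value are placeholders chosen for illustration, not values from any real site.

```python
from urllib.request import Request

# A header set is a mapping of header names to values.
# The token below is a made-up placeholder, not a real credential.
headers = {
    "Accept": "text/html",
    "Authorization": "Bearer example-token-123",
}

# Build the request object only; nothing is sent over the network here.
request = Request("https://example.com/", headers=headers)

# List the headers the request would carry.
for name, value in request.header_items():
    print(f"{name}: {value}")
```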
In general, there are four different paths you'll find yourself on when scraping a website and dealing with headers:
No headers
For some websites, you won't need to worry about modifying headers at all, as there are no checks or verifications in place.
Some default headers required
Some websites require certain default browser headers to work properly, such as User-Agent (though this header is becoming less relevant as a check, since there are more sophisticated ways to detect and block a suspicious user).
Another example of such a "default" header is Referer. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it.
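One way to supply such defaults is a small helper that returns a browser-like header set. This is only a sketch: the function name `browser_like_headers`, the User-Agent string, and the shop URL are assumptions made for illustration, and the values a given site expects may differ.

```python
# Example value only; copy a real one from your own browser's DevTools.
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def browser_like_headers(referer=None, extra=None):
    """Return a header mapping with browser-style defaults."""
    headers = {"User-Agent": EXAMPLE_USER_AGENT}
    if referer is not None:
        # Tells a shared platform which website the request comes from.
        headers["Referer"] = referer
    # Caller-supplied headers take precedence over the defaults.
    headers.update(extra or {})
    return headers


# Hypothetical shop whose data is loaded from a shared platform.
headers = browser_like_headers(referer="https://shop.example.com/")
print(headers["Referer"])
```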
Custom headers required
A custom header is a non-standard HTTP header used by a specific website. For example, an imaginary website, cool-stuff.com, might have a header named X_Cool_Stuff_Token which is required for every single request to a product page.
Dealing with cases like these usually isn't difficult, but can sometimes be tedious.
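A rough sketch of handling the imaginary cool-stuff.com case: a helper that attaches the custom header to every product-page request and fails early if the token is missing. The helper name and the token value are hypothetical, and how the token is obtained is site-specific and not shown here.

```python
# Header name taken from the imaginary cool-stuff.com example.
CUSTOM_HEADER = "X_Cool_Stuff_Token"


def product_page_headers(token, base=None):
    """Return a copy of `base` with the site's custom token header added."""
    if not token:
        # Failing here is easier to debug than a rejected request later.
        raise ValueError(f"{CUSTOM_HEADER} is required for product pages")
    headers = dict(base or {})
    headers[CUSTOM_HEADER] = token
    return headers


# "abc123" is a placeholder; a real token would come from the site itself.
headers = product_page_headers("abc123", base={"Accept": "text/html"})
print(headers)
```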
Very specific headers required
The most challenging websites to scrape are the ones that require a full set of site-specific headers to be included with the request. For example, not only would they potentially require proper User-Agent and Referer headers mentioned above, but also Accept, Accept-Language, Accept-Encoding, etc. with specific values.
Another big one to mention is the Cookie header. We cover this in more detail within the cookies lesson.
You can use Chrome DevTools to inspect request headers, and Insomnia or Postman to test how the website behaves with or without specific headers.
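The same "with or without" testing can be scripted. The sketch below only generates the header variants, each with one header removed; sending them and comparing the responses is left out. The function name `header_variants` and all header values are illustrative assumptions.

```python
def header_variants(headers):
    """Yield (removed_name, remaining_headers), dropping one header each time."""
    for name in headers:
        remaining = {k: v for k, v in headers.items() if k != name}
        yield name, remaining


# Example values only; real ones should be copied from DevTools.
full_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Referer": "https://shop.example.com/",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

for removed, remaining in header_variants(full_headers):
    # Each variant could be sent to see whether the site still responds normally.
    print(f"without {removed}: {sorted(remaining)}")
```

If a request starts failing once a particular header is removed, that header is likely one the site checks.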
HTTP/1.1 vs HTTP/2 headers
HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of:
- HTTP/2 responses do not include status messages (reason phrases such as OK or Not Found). They only contain status codes.
- Certain headers are no longer used in HTTP/2, such as Connection and a few others related to it, like Keep-Alive. In HTTP/2, connection-specific headers are prohibited. While some browsers will ignore them, Safari and other WebKit-based browsers will outright reject any response that contains them. Including them is easy to do by accident and can be a big problem.
- While HTTP/1.1 header names are case-insensitive and may be sent by browsers with capital letters (e.g. Accept-Encoding, Cache-Control, User-Agent), HTTP/2 header names must be lowercase (e.g. accept-encoding, cache-control, user-agent).
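The last two differences can be illustrated with a small conversion helper. This is a sketch, not a complete HTTP/2 implementation: the function name `to_http2_headers` is made up, and the set of dropped names follows the connection-specific fields listed in the HTTP/2 specification (Connection, Proxy-Connection, Keep-Alive, Transfer-Encoding, Upgrade).

```python
# Connection-specific header names that must not appear in HTTP/2 messages.
CONNECTION_SPECIFIC = frozenset(
    {"connection", "proxy-connection", "keep-alive", "transfer-encoding", "upgrade"}
)


def to_http2_headers(headers):
    """Lowercase header names and drop connection-specific headers."""
    result = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in CONNECTION_SPECIFIC:
            continue  # prohibited in HTTP/2
        result[lowered] = value
    return result


http1_headers = {
    "Accept-Encoding": "gzip",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Keep-Alive": "timeout=5",
}

print(to_http2_headers(http1_headers))
# prints {'accept-encoding': 'gzip', 'cache-control': 'no-cache'}
```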
To learn more about the difference between HTTP/1.1 and HTTP/2 headers, check out this article