I hope you'll forgive me if I start with a story.. it kind of gives a bit of context.
A long time ago (around 2001/2002) I decided I needed to learn more about web programming. I already knew how to write HTML and load it in the browser. I had heard about this thing called CGI but it required me to install a server.. something I thought I did not have time to learn. So armed with my bachelor's degree and the 1 or 2 years of "professional" programming experience I thought the obvious thing to do was learn how HTTP works.
So I started reading the HTTP RFC: https://www.rfc-editor.org/rfc/rfc2616
Turns out that HTTP is really simple but has a lot of details.. most of which are optional. I needed a way to know what I could ignore and what I had to implement because I didn't have time to implement the full spec (I had a day job to do). Google had only come out a few years earlier (too late to help me with my dissertation at uni but just in time for my first job) so I thought I'd google "minimal http implementation". These days you'll probably get different results but back then it gave me James Marshall's excellent HTTP Made Really Easy (which is still online today).
So the answer you're looking for is to read the HTTP Made Really Easy page.
TLDR
Due to stackoverflow's policy it would be rude for me to just leave you with a link. However I still strongly suggest you read that article. In any case, here's the TLDR:
HTTP requests are just plain text. A request has this simple format:
Request Line
Header
Header
Header
- Each part of an HTTP request is separated by a newline.
  Note: Technically they should be \r\n but you are strongly encouraged to also accept \n as a newline.
- An HTTP request is terminated by two newlines.
  Note: Technically they should be 4 bytes: \r\n\r\n but you are strongly encouraged to also accept the 2-byte terminator: \n\n.
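To make the terminator rule concrete, here is a minimal sketch of finding where the header block ends while accepting both \r\n\r\n and \n\n. The function name findHeaderEnd and the shape of its return value are my own choices for illustration, not part of any spec or library.

```javascript
// Hypothetical helper: locate the blank line that ends the header block.
// Accepts the strict terminator (\r\n\r\n) as well as the lenient one (\n\n).
function findHeaderEnd(buffer) {
    const match = /\r?\n\r?\n/.exec(buffer);
    if (match === null) {
        return null; // terminator not received yet, keep reading
    }
    return {
        headerEnd: match.index,                  // request line + headers end here
        bodyStart: match.index + match[0].length // anything after this is the body
    };
}
```

Returning null when the terminator is missing lets the caller keep buffering input and simply try again when more data arrives.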
The format of the request line is:
METHOD path PROTOCOL_VERSION
METHOD is the HTTP method such as POST, GET, PUT, DELETE etc. Typically they should be upper case.
The path is the URL path. Traditionally it was the path of the file you're requesting, but in more modern times it is more typically an endpoint processed by a web framework.
The protocol version is in the format:
HTTP/1.1
Normally you can ignore this.
The parts of the request line are separated by a space character. Technically there should be only one space, though I've seen badly malformed requests that send multiple spaces. Browsers will never send more than one space.
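As a rough sketch of the request line rules above, the following splits on runs of spaces so that the malformed multi-space case is tolerated too. parseRequestLine is a name I made up for this example, and returning null for anything that doesn't have exactly three parts is an assumption about how strict you want to be.

```javascript
// Hypothetical helper: split "METHOD path PROTOCOL_VERSION" into its parts.
function parseRequestLine(line) {
    // Split on one or more spaces so sloppy clients with extra spaces still parse
    const parts = line.trim().split(/ +/);
    if (parts.length !== 3) {
        return null; // not a well-formed request line
    }
    const [method, path, version] = parts;
    return { method, path, version };
}
```

For example, parseRequestLine('GET /index.html HTTP/1.1') gives back an object with method 'GET', path '/index.html' and version 'HTTP/1.1'.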
Headers are in the format:
Header-name: header value
Header names can be title-case, lowercase or mixed; all are valid, so treat them as case-insensitive.
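One detail worth showing for headers: the value itself may contain colons (think Host: localhost:8080), so only the first colon separates the name from the value. A small sketch, with parseHeaderLine being a hypothetical name and lowercasing the header name being just one way to handle the case-insensitivity:

```javascript
// Hypothetical helper: split "Header-name: header value" at the FIRST colon only.
function parseHeaderLine(line) {
    const colon = line.indexOf(':');
    if (colon === -1) {
        return null; // not a header line
    }
    return {
        name: line.slice(0, colon).trim().toLowerCase(), // normalise the case
        value: line.slice(colon + 1).trim()              // drop surrounding whitespace
    };
}
```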
Knowing this, parsing HTTP is actually fairly simple:
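// Pseudocode: read() stands in for whatever gives you the next chunk of input
let headers = {};
let method = '';
let path = '';
let buffer = '';
while (1) {
    buffer += read();
    // Look for the blank line that ends the headers (\r\n\r\n or \n\n)
    let end = buffer.search(/\r?\n\r?\n/);
    if (end !== -1) {
        // Split on \r\n or \n so no stray \r is left on each line
        let raw_request = buffer.slice(0, end).split(/\r?\n/);
        let request_line = raw_request[0].split(' ');
        method = request_line[0];
        path = request_line[1];
        for (let i=1; i<raw_request.length; i++) {
            let header = raw_request[i].split(':');
            // Re-join the rest because the value may itself contain ':'
            headers[header[0].toLowerCase()] = header.slice(1).join(':').trim();
        }
        break;
    }
}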
Obviously you shouldn't use a blocking while loop to read from I/O in javascript because it wouldn't work but you get the general idea.
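For what it's worth, here is how the same logic might look without a blocking loop: you feed chunks in as they arrive and get a callback once the headers are complete. This is only a sketch. createRequestParser, the onRequest callback and the rest field are all names I invented for the example.

```javascript
// Hypothetical helper: returns a push(chunk) function. Call it with each chunk
// of text as it arrives; onRequest fires once when the header block is complete.
function createRequestParser(onRequest) {
    let buffer = '';
    let done = false;
    return function push(chunk) {
        if (done) return;
        buffer += chunk;
        const match = /\r?\n\r?\n/.exec(buffer);
        if (match === null) return; // headers incomplete, wait for more data
        done = true;
        const lines = buffer.slice(0, match.index).split(/\r?\n/);
        const [method, path] = lines[0].split(/ +/);
        const headers = {};
        for (const line of lines.slice(1)) {
            const colon = line.indexOf(':');
            if (colon === -1) continue; // skip anything that isn't a header
            const name = line.slice(0, colon).trim().toLowerCase();
            headers[name] = line.slice(colon + 1).trim();
        }
        // rest holds whatever arrived after the blank line (start of the body)
        const rest = buffer.slice(match.index + match[0].length);
        onRequest({ method, path, headers, rest });
    };
}
```

In Node you would typically call push from a socket's 'data' event handler (after converting the chunk to a string), but nothing in the sketch depends on Node.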
There are additional things you need to handle, such as reading the Content-Length header to determine when a POST request is complete, how to parse POST form-data (most js frameworks don't do this - they require you to use an additional module to parse the request body) etc. But this should get you started with a minimal viable implementation that you can keep adding features to until it handles all the different request types.
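To illustrate the Content-Length part, a sketch of working out how much of the body is still missing. It assumes the headers were stored with lowercase names, as in the parsing code above, and that you count the bytes received (not characters), since Content-Length is a byte count. bodyBytesRemaining is a made-up name.

```javascript
// Hypothetical helper: how many body bytes are still to come.
// headers: object keyed by lowercase header name
// bytesReceived: number of body bytes read so far
function bodyBytesRemaining(headers, bytesReceived) {
    const raw = headers['content-length'];
    if (raw === undefined) {
        return 0; // no Content-Length: treat the request as having no body
    }
    if (!/^\d+$/.test(raw)) {
        throw new Error('invalid Content-Length: ' + raw);
    }
    return Math.max(0, Number(raw) - bytesReceived);
}
```

The request is complete once this returns 0. Treating a missing header as "no body" is a simplification that ignores chunked transfer encoding.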
parse-raw-http has no dependencies and is about 100 lines of code.