doc/implementation.md
Update: using the following design, it is actually difficult to correctly support HTTP pipelining. I've come up with a new design, inspired by Naruil, which should be much cleaner and make HTTP pipelining easier to support. But as all major browsers except Opera do not enable HTTP pipelining by default, I don't think it's worth the effort to support it now. I'll try to support it with the new design if the performance benefit of HTTP pipelining becomes significant in the future.
The final design evolved from several previous implementations. The following subsections describe how it evolved.
COW uses separate goroutines to read client requests and server responses.
One client has exactly one request goroutine, and may have multiple response goroutines. A response goroutine is created whenever a server connection is created.
This makes it possible for COW to support HTTP pipelining. (Not very sure about this.) COW does not pack multiple requests and send them in a batch, but it can send a request before the response to the previous request is received. If the client (browser) and the web server support HTTP pipelining, COW will not, in effect, force them back to waiting for a response after each request.
But this design does make COW more complicated. I must be careful to avoid concurrency problems between the request and response goroutines.
Here are things worth noting (a rough sketch of this structure follows the list):

- The web server connection for each host is stored in a map
- The request and response goroutines may need to notify each other to stop
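Below is a minimal sketch of this structure, not COW's actual code: names such as `client`, `serverConn` and the `stop` channel are assumptions made for illustration, and the response goroutine here just copies bytes back to the client (the final design parses responses, as described later).

```go
package proxy

import (
	"io"
	"net"
	"sync"
)

// serverConn pairs a connection to a web server with a channel that is
// closed to tell other goroutines to stop using it.
type serverConn struct {
	conn net.Conn
	stop chan struct{}
}

// client holds the per-client state: the client connection plus the map of
// host -> server connection mentioned above.
type client struct {
	conn    net.Conn
	mu      sync.Mutex
	servers map[string]*serverConn
}

// getServerConn returns the cached connection for host (e.g. "example.com:80"),
// dialing and starting a response goroutine if no connection exists yet.
func (c *client) getServerConn(host string) (*serverConn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.servers == nil {
		c.servers = make(map[string]*serverConn)
	}
	if sv, ok := c.servers[host]; ok {
		return sv, nil
	}
	conn, err := net.Dial("tcp", host)
	if err != nil {
		return nil, err
	}
	sv := &serverConn{conn: conn, stop: make(chan struct{})}
	c.servers[host] = sv
	// Response goroutine: created together with the server connection.
	// Here it blindly copies server bytes to the client; closing the stop
	// channel notifies the request goroutine that this connection is gone.
	// (In real code, concurrent writes to c.conn from multiple response
	// goroutines would need coordination.)
	go func() {
		io.Copy(c.conn, sv.conn)
		close(sv.stop)
	}()
	return sv, nil
}
```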
Of course we need to parse the HTTP request to know the address of the web server.
Besides, HTTP requests sent to proxy servers are a little different from those sent directly to web servers, so the proxy needs to reconstruct the HTTP request (see the sketch after the list):

- A GET request has a request URI like '/index.html', but when sent to a proxy it is something like 'host.com/index.html'
- A CONNECT request requires special handling by the proxy (send a 200 back to the client)
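As an illustration of both points, here is a rough sketch (the function names are my own, not COW's) of rewriting the request line a client sends to a proxy into the form a web server expects, and of answering a CONNECT request:

```go
package proxy

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// rewriteRequestLine turns a proxy-style request line such as
// "GET http://host.com/index.html HTTP/1.1" into the origin form
// "GET /index.html HTTP/1.1" and reports which host to connect to.
func rewriteRequestLine(line string) (newLine, host string, err error) {
	parts := strings.SplitN(line, " ", 3)
	if len(parts) != 3 {
		return "", "", fmt.Errorf("malformed request line: %q", line)
	}
	u, err := url.Parse(parts[1])
	if err != nil {
		return "", "", err
	}
	// u.RequestURI() is the path plus query, e.g. "/index.html".
	return parts[0] + " " + u.RequestURI() + " " + parts[2], u.Host, nil
}

// handleConnect answers a CONNECT request: the proxy tells the client that
// the tunnel is established (the 200 mentioned above) and afterwards only
// relays raw bytes in both directions.
func handleConnect(clientConn net.Conn) error {
	_, err := clientConn.Write([]byte("HTTP/1.1 200 Connection established\r\n\r\n"))
	return err
}
```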
The initial implementation serves client requests one by one. We need to know whether a response is finished so we can start to serve another request. (This is the opposite of HTTP pipelining.) That's why we need to parse the Content-Length header and chunked encoding.
Parsing responses also allows the proxy to put server connections back into a pool, which lets different clients reuse server connections.
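Here is a simplified sketch of that logic, assuming the response headers have already been parsed into an `http.Header` (the function name is illustrative, not COW's actual code):

```go
package proxy

import (
	"io"
	"net/http"
	"net/http/httputil"
	"strconv"
)

// responseBodyReader returns a reader that reports EOF exactly where the
// response body ends, so the server connection can go back into the pool
// and be reused for another request.
func responseBodyReader(header http.Header, r io.Reader) (io.Reader, error) {
	if header.Get("Transfer-Encoding") == "chunked" {
		// Chunked encoding: the body ends after the zero-length chunk.
		return httputil.NewChunkedReader(r), nil
	}
	if cl := header.Get("Content-Length"); cl != "" {
		// Content-Length: the body is exactly this many bytes.
		n, err := strconv.ParseInt(cl, 10, 64)
		if err != nil {
			return nil, err
		}
		return io.LimitReader(r, n), nil
	}
	// Neither header is present: the body ends only when the server closes
	// the connection, so this connection cannot be reused.
	return r, nil
}
```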
After supporting CONNECT, I realized that I can use a separate goroutine to read the HTTP response from the server and pass it directly back to the client. With this approach there is no need to parse the response to find out where it ends before starting to process another request.
Update: not parsing the HTTP response does have some problems. Refer to the section "But response parsing is necessary".
This approach has several implications that need to be considered. At first, I chose not to parse the response.
I got a bug in handling HTTP 302 responses when not parsing the response.
When trying to visit "youku.com", the site returns a 302 response with "Connection: close". The browser doesn't close the connection and still tries to get more content from the server after seeing the response.
I tried polipo and saw that it sends back the 302 response along with "Content-Length: 0" to indicate to the client that the response has finished.
To add this kind of response editing capability to my proxy, I have to parse the HTTP response.
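A rough sketch of that kind of editing in Go follows; the function name and the exact condition checked are my assumptions, not COW's actual code. It uses the standard library's response parser and forwards only the status line and headers:

```go
package proxy

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
)

// editResponse reads one response from the server and, for a bodyless 302
// that lacks a Content-Length, adds "Content-Length: 0" before forwarding
// the status line and headers to the client, so the client knows the
// response is complete even though the server said "Connection: close".
func editResponse(srv io.Reader, client io.Writer, req *http.Request) error {
	resp, err := http.ReadResponse(bufio.NewReader(srv), req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusFound && resp.ContentLength < 0 {
		// ContentLength < 0 means the server did not say how long the body
		// is; mark the empty 302 body explicitly, as polipo does.
		resp.Header.Set("Content-Length", "0")
	}
	// Re-serialize status line and headers for the client.
	if _, err := fmt.Fprintf(client, "HTTP/%d.%d %s\r\n", resp.ProtoMajor, resp.ProtoMinor, resp.Status); err != nil {
		return err
	}
	if err := resp.Header.Write(client); err != nil {
		return err
	}
	if _, err := io.WriteString(client, "\r\n"); err != nil {
		return err
	}
	// For responses that do have a body, it would be forwarded here.
	return nil
}
```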
So the current solution is to parse the response in a separate goroutine, which doesn't require many code changes compared with the non-parsing approach.
When blocked sites are detected because of errors like connection resets and read timeouts, we can either redo the HTTP request through a parent proxy, or just return an error page and let the browser refresh.
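A hedged sketch of how such errors could be classified (the function name is hypothetical and the checks are simplified; COW's actual detection logic may differ):

```go
package proxy

import (
	"errors"
	"net"
	"syscall"
)

// maybeBlocked reports whether an error looks like the failure pattern of a
// blocked site: a reset connection or a read timeout.
func maybeBlocked(err error) bool {
	// Connection reset by peer (Unix-like systems).
	if errors.Is(err, syscall.ECONNRESET) {
		return true
	}
	// Read timeout, e.g. from a connection with a read deadline set.
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return true
	}
	return false
}
```

The caller could then decide between retrying the request through the parent proxy and returning an error page.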
I tried to support auto refresh. But because I want to support HTTP pipelining, client requests and server responses are read in separate goroutines. The response-reading goroutine needs to send the redo request to the client request goroutine while maintaining the correct request handling order. The resulting code is very complex and difficult to maintain. Besides, the extra code to support auto refresh may incur performance overhead.
As blocked sites are recorded, the refresh is only needed on the first access to a blocked site; auto refresh is just an optimization for a minor case.
So I chose not to support auto refresh, as the benefit is small.
The goal is to make it easy to find the exact error location.