This past Tuesday, ParkWhiz was crawled by a bot that was making 10-20 requests per second, despite a 1s crawl-delay setting in the robots.txt file for the domain (you do have a robots.txt file, right?). I ended up blocking the bot’s IP address using iptables, but that’s hardly a long-term solution. So how should you defend a site against overly aggressive bots?
Turns out nginx makes this really simple – just use the limit_req module (ngx_http_limit_req_module) and its limit_req_zone and limit_req directives. Set a rate limit, like 2 requests per second or 30 requests per minute, and nginx will make sure no individual website visitor (identified by a key you choose, typically the client IP address) exceeds that limit. A “burst” setting can be used to provide some flexibility, allowing a visitor to make a number of requests in rapid succession before imposing the rate limit.
By default, nginx will attempt to slow down the visitor by delaying requests that exceed the rate, up to the burst size; requests beyond the burst are rejected with an HTTP 503 error. With the “nodelay” option, requests within the burst are served immediately instead of being delayed, and anything beyond it gets the 503 right away. I think this is preferable, since it lets the offending visitor know something is wrong, rather than just making the site seem slow.
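To make the two modes concrete, here is a minimal sketch of the directives involved. The zone name "app", the 10 MB zone size, the 2r/s rate and the burst of 5 are illustrative assumptions, not the ParkWhiz values.

```nginx
# Fragment for the http block: define a shared-memory zone keyed by client IP.
# "app", 10m and 2r/s are made-up example values.
limit_req_zone $binary_remote_addr zone=app:10m rate=2r/s;

# Then, inside a location block, use ONE of the following:

# Default behavior: requests over the rate are delayed, with up to 5 allowed
# to queue; anything beyond the burst is rejected with a 503.
# limit_req zone=app burst=5;

# With nodelay: up to 5 excess requests are served immediately;
# anything beyond the burst is rejected with a 503.
# limit_req zone=app burst=5 nodelay;
```

A rate per minute is written the same way, for example rate=30r/m.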
These are the settings I used for ParkWhiz:
Here’s the important part: put the “limit_req” directive in the location block that passes requests to your app server (mongrel, php-fpm, etc.). If you put it in the server block, it will count all requests toward the rate limit, including images and other static files, and a visitor will hit the limit simply by downloading the assets of a single page. We only want to limit requests to the app server.
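As a rough sketch of that placement (not the actual ParkWhiz configuration), something like the following would limit only the requests that reach the app server. The zone name, rate, burst, server name, paths and upstream address are all hypothetical placeholders.

```nginx
http {
    # Hypothetical zone: "app", 10 MB of state, 2 requests/second per client IP.
    limit_req_zone $binary_remote_addr zone=app:10m rate=2r/s;

    server {
        listen 80;
        server_name example.com;          # placeholder domain

        # Static files are served directly by nginx and are NOT rate limited,
        # because this location has no limit_req directive.
        location /assets/ {
            root /var/www/example/public; # placeholder path
        }

        # Only requests handed to the app server count toward the limit.
        location / {
            limit_req zone=app burst=5 nodelay;
            proxy_pass http://127.0.0.1:3000;  # placeholder app server address
        }
    }
}
```

With php-fpm the same idea applies: the limit_req line would sit in the location that does the fastcgi_pass rather than one that does a proxy_pass.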
Try it out; go to http://www.parkwhiz.com and ctrl-click a link a bunch of times in rapid succession.
- jonthornton posted this