Using a Bash Script to Track Abusive Bot and Crawler Traffic in NGINX Logs
When managing a web server, it's common to encounter traffic from bots, spiders, and crawlers. While many of these bots (e.g., search engines) are beneficial, some can become abusive, overwhelming your server with requests, consuming bandwidth, or even probing for vulnerabilities. Tracking down and mitigating abusive bot traffic is an essential task for system administrators.
NGINX logs can provide valuable insights into the behavior of bots and crawlers, but sifting through logs manually is a tedious process. This article demonstrates how to use a custom Bash script to filter and track down abusive bot and crawler traffic in your NGINX logs. With this script, you can filter logs by timeframe and request type, helping you identify potential threats more easily.
Why Monitor Bots and Crawlers?
Bots are automated programs that browse the web, and they serve a variety of purposes:
- Good Bots: Search engine bots (e.g., Googlebot, Bingbot) help index your content for search results.
- Malicious Bots: Some bots crawl your site to find vulnerabilities, steal content, or cause Distributed Denial of Service (DDoS) attacks.
- Abusive Traffic: Even non-malicious bots can cause problems if they request resources too frequently, leading to performance degradation.
By tracking bots and crawlers, you can determine which ones are behaving aggressively and mitigate their impact on your server.
The Script: Tracking Abusive Traffic
Below is a Bash script designed to analyze your NGINX logs and track bot and crawler activity. The script supports filtering by request method (GET or POST), timeframe, and user agents that match common bot patterns.
#!/bin/bash
# Default timeframe is 1 hour
timeframe="1 hour"
request_method=""
# Convert a value like "12h" or "7d" into a phrase GNU date understands (e.g. "12 hour")
calculate_timeframe() {
    unit=${1: -1}       # last character is the unit
    amount=${1::-1}     # everything before it is the amount
    case $unit in
        h) echo "$amount hour";;
        d) echo "$amount day";;
        m) echo "$amount month";;
        y) echo "$amount year";;
        *) echo "Invalid time unit. Use 'h' for hours, 'd' for days, 'm' for months, or 'y' for years." >&2; exit 1;;
    esac
}
# Parse the flags
while [[ "$#" -gt 0 ]]; do
    case $1 in
        --time)
            if [[ -z "$2" ]]; then
                echo "Error: --time requires a value."
                exit 1
            fi
            timeframe=$(calculate_timeframe "$2") || exit 1
            shift 2
            ;;
        --GET)
            request_method="GET"
            shift
            ;;
        --POST)
            request_method="POST"
            shift
            ;;
        *)
            echo "Unknown parameter: $1"
            exit 1
            ;;
    esac
done
# Prompt the user for the log file location
read -p "Enter the log file location (you can use * as a wildcard, e.g., /path/to/logs/access*.log): " log_location
# Work out the cutoff for the timeframe as epoch seconds (GNU date)
cutoff=$(date -d "-$timeframe" +%s)
# Base command: user agents containing bot, spider, or crawl, minus common browser strings
# (-s silences missing-file errors, -h keeps filenames out of the output when the wildcard matches several files)
grep_command="grep -shi \"bot\|spider\|crawl\" $log_location | grep -v \"Mozilla\|Chrome\|Safari\""
# Add the request method filter if --GET or --POST is specified
# (the request field starts with a double quote, e.g. "GET /index.html HTTP/1.1")
if [ "$request_method" == "GET" ]; then
    grep_command="$grep_command | grep '\"GET '"
elif [ "$request_method" == "POST" ]; then
    grep_command="$grep_command | grep '\"POST '"
fi
# Run the pipeline, keep only entries newer than the cutoff, then count method + user agent
# (the timestamp comparison needs GNU awk for mktime)
eval "$grep_command" | awk -v cutoff="$cutoff" '
{
    split($4, t, "[[/:]")    # $4 looks like [18/Oct/2024:13:45:02
    mon = (index("JanFebMarAprMayJunJulAugSepOctNovDec", t[3]) + 2) / 3
    ts = mktime(t[4] " " mon " " t[2] " " t[5] " " t[6] " " t[7])
    if (ts < cutoff) next
    sub(/^"/, "", $6)        # drop the quote in front of the request method
    print $6, $12            # method + first token of the user agent
}' | sort | uniq -c | sort -nr
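Before running it, save the script as track_abusive_bots.sh (the name used in the examples below) and make it executable:
chmod +x track_abusive_bots.sh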
How the Script Works
1. Timeframe Filtering
By default, the script filters logs from the last hour, but you can specify a custom timeframe using the --time flag. The script supports the following formats:
- h for hours (e.g., --time 12h for 12 hours)
- d for days (e.g., --time 7d for 7 days)
- m for months (e.g., --time 3m for 3 months)
- y for years (e.g., --time 1y for 1 year)
The script converts the timeframe into a cutoff relative to the current time (using GNU date) and keeps only the log entries whose timestamps are newer than that cutoff.
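For example, the cutoff behind --time 12h boils down to a single GNU date call, which you can try on its own:
date -d "-12 hour" +%s    # prints the epoch timestamp for twelve hours ago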
2. Filtering Bots and Crawlers
The script uses grep to search for user agents containing bot, spider, or crawl, which are common patterns for identifying web crawlers. It then uses the -v flag to exclude user agents containing Mozilla, Chrome, or Safari so that ordinary browser traffic is left out of the results. Keep in mind that some bots also include Mozilla in their user agent string, so they will be filtered out too; adjust the exclusion list if you want to see them.
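As a quick sanity check, you can run the same filter chain by hand against a single log file; the log line in the comment is a made-up example of the kind of entry it keeps:
# 203.0.113.7 - - [18/Oct/2024:13:45:02 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "ExampleBot/1.0 (+https://example.com/bot.html)"
grep -i "bot\|spider\|crawl" /var/log/nginx/access.log | grep -v "Mozilla\|Chrome\|Safari"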
3. Request Method Filtering
You can filter the logs by request method (GET or POST) using the --GET or --POST flags:
- --GET: Filters for GET requests, typically used to retrieve resources (e.g., pages or images).
- --POST: Filters for POST requests, often used for submitting forms or sending data to the server.
If neither flag is specified, the script does not filter by method and shows requests of every type (GET, POST, HEAD, and so on).
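If you ever need to check a method by hand, match the start of the quoted request field rather than the bare word, so that user agents or URLs containing the letters GET or POST do not slip in:
grep '"GET ' /var/log/nginx/access.log    # use '"POST ' or '"HEAD ' for other methods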
4. Log Location
The script prompts the user to enter the log file location. It supports the use of wildcards (*) to match multiple files (e.g., access*.log), making it flexible for different NGINX log structures.
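If you are not sure what a pattern will match, let the shell expand it first; the file names here are only examples:
ls /var/log/nginx/access*.log
# access.log  access-ssl.log  (whatever the pattern expands to on your server)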
5. Final Output
Once the script gathers the relevant log entries, awk keeps only the entries inside the requested timeframe and prints the HTTP method together with the leading token of the user agent string, sort groups identical pairs, and uniq -c counts the occurrences. The results are displayed in descending order based on the frequency of requests.
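The counting stage is plain sort and uniq; a tiny standalone example shows how identical method/agent pairs collapse into counts:
printf 'GET ExampleBot/1.0\nGET ExampleBot/1.0\nGET Googlebot/2.1\n' | sort | uniq -c | sort -nr
#   2 GET ExampleBot/1.0
#   1 GET Googlebot/2.1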
Example Usage
Default Usage
To run the script with default settings (last hour, both GET and POST requests):
./track_abusive_bots.sh
Filtering by Timeframe
To filter logs from the last 24 hours:
./track_abusive_bots.sh --time 24h
To filter logs from the last 7 days:
./track_abusive_bots.sh --time 7d
Filtering by Request Method
To filter only GET requests from the last 12 hours:
./track_abusive_bots.sh --time 12h --GET
To filter only POST requests from the last week:
./track_abusive_bots.sh --time 7d --POST
Log File Location with Wildcards
If your NGINX log files are named with a pattern (e.g., access-ssl.log, access-error.log), you can specify a wildcard to match multiple files:
Enter the log file location (you can use * as a wildcard, e.g., /path/to/logs/access*.log): /var/log/nginx/access*.log
Identifying Abusive Traffic
Once the script runs, you'll see output similar to the following:
15 GET "Googlebot/2.1"
11 GET "ZoominfoBot"
9 POST "AdsBot-Google"
6 GET "DuckDuckBot/1.1"
3 GET "facebookexternalhit/1.1"
This output tells you how many requests were made by each bot. Bots with unusually high request counts (e.g., thousands of requests within an hour) may be considered abusive and should be investigated further. You can take action by blocking these bots in your NGINX configuration or firewall.
How to Handle Abusive Bots
Once you've identified bots that are generating excessive or abusive traffic, you can mitigate their impact by:
- Blocking them in NGINX: You can add rules in your NGINX configuration to block specific user agents or IP addresses (see the sketch after this list).
- Using a Web Application Firewall (WAF): Tools like Cloudflare or ModSecurity can help automatically block or throttle abusive bots.
- Rate Limiting: NGINX can be configured to limit the rate of requests from certain IP addresses, preventing bots from overwhelming your server.
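As a starting point for the first and third options, here is a minimal NGINX configuration sketch; the user agent pattern, zone name, and rate values are placeholders you would replace with the offenders and thresholds you actually observe in the script's output:
# In the http {} block: flag unwanted user agents and define a per-IP rate-limit zone
map $http_user_agent $blocked_bot {
    default        0;
    ~*examplebot   1;    # placeholder pattern; list the bots you identified
}
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    # ...
    if ($blocked_bot) {
        return 403;
    }
    location / {
        limit_req zone=perip burst=10 nodelay;
        # ...
    }
}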
Conclusion
By using this Bash script, you can efficiently track down which bots and crawlers are contributing to abusive traffic on your NGINX server. With the ability to filter by timeframe and request method, this script provides flexibility and control over your log analysis, allowing you to quickly identify problematic bots. Once identified, you can take steps to block or mitigate their impact, helping to protect your server from unnecessary load and potential security risks.
Feel free to customize this script to suit your specific needs, such as adding more user agents to exclude or integrating automated actions based on the results!