
Using a Bash Script to Track Abusive Bot and Crawler Traffic in NGINX Logs

When managing a web server, it's common to encounter traffic from bots, spiders, and crawlers. While many of these bots (e.g., search engines) are beneficial, some can become abusive, overwhelming your server with requests, consuming bandwidth, or even probing for vulnerabilities. Tracking down and mitigating abusive bot traffic is an essential task for system administrators.

NGINX logs can provide valuable insights into the behavior of bots and crawlers, but sifting through logs manually is a tedious process. This article demonstrates how to use a custom Bash script to filter and track down abusive bot and crawler traffic in your NGINX logs. With this script, you can filter logs by timeframe and request type, helping you identify potential threats more easily.

Why Monitor Bots and Crawlers?

Bots are automated programs that browse the web, and they serve a variety of purposes: search engines crawl pages to index them, monitoring and SEO tools check availability and rankings, and social platforms fetch link previews. Others are less welcome, scraping content wholesale or probing for vulnerabilities.

By tracking bots and crawlers, you can determine which ones are behaving aggressively and mitigate their impact on your server.

The Script: Tracking Abusive Traffic

Below is a Bash script designed to analyze your NGINX logs and track bot and crawler activity. The script supports filtering by request method (GET or POST), timeframe, and user agents that match common bot patterns.

#!/bin/bash

# Default timeframe is 1 hour
timeframe="1 hour"
request_method=""

# Function to calculate the time difference in the correct format
calculate_timeframe() {
    unit=${1: -1}
    amount=${1::-1}

    case $unit in
        h) echo "$amount hour";;
        d) echo "$amount day";;
        m) echo "$amount month";;
        y) echo "$amount year";;
        *) echo "Invalid time unit. Use 'h' for hours, 'd' for days, 'm' for months, or 'y' for years." >&2; exit 1;;
    esac
}

# Parse the flags
while [[ "$#" -gt 0 ]]; do
    case $1 in
        --time)
            if [[ -z "$2" ]]; then
                echo "Error: --time requires a value."
                exit 1
            fi
            timeframe=$(calculate_timeframe "$2") || exit 1
            shift 2
            ;;
        --GET)
            request_method="GET"
            shift
            ;;
        --POST)
            request_method="POST"
            shift
            ;;
        *) 
            echo "Unknown parameter: $1"
            exit 1
            ;;
    esac
done

# Prompt the user for the log file location
read -p "Enter the log file location (you can use * as a wildcard, e.g., /path/to/logs/access*.log): " log_location

# Start of the time window, as epoch seconds
start_epoch=$(date -d "-$timeframe" +%s)

# Optional request-method filter: in the combined log format, the method
# appears immediately after the opening quote of the request field
method_filter="."
if [ "$request_method" == "GET" ]; then
    method_filter='"GET '
elif [ "$request_method" == "POST" ]; then
    method_filter='"POST '
fi

# Keep bot/spider/crawl user agents, drop obvious browsers, apply the method
# filter, then let awk (GNU awk, for mktime) discard entries older than the
# start time and print the method and user agent of each remaining request.
# $log_location is left unquoted on purpose so the shell expands wildcards.
grep -his "bot\|spider\|crawl" $log_location \
    | grep -v "Mozilla\|Chrome\|Safari" \
    | grep "$method_filter" \
    | awk -F'"' -v start="$start_epoch" '
        BEGIN { split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
                for (i = 1; i <= 12; i++) mon[m[i]] = i }
        { split($1, f, " "); split(substr(f[4], 2), t, /[\/:]/)
          if (mktime(t[3] " " mon[t[2]] " " t[1] " " t[4] " " t[5] " " t[6]) >= start) {
              split($2, req, " ")                 # request: METHOD /path HTTP/x
              print req[1], "\"" $6 "\""          # method + full user agent
          } }' \
    | sort | uniq -c | sort -nr

How the Script Works

1. Timeframe Filtering

By default, the script filters logs from the last hour, but you can specify a custom timeframe using the --time flag. The script supports the following formats:

- h for hours (e.g., --time 12h)
- d for days (e.g., --time 7d)
- m for months (e.g., --time 2m)
- y for years (e.g., --time 1y)

The script calculates the start time based on the current date and filters logs accordingly.
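The conversion can be checked in isolation. The snippet below reuses the script's calculate_timeframe function and feeds its result to GNU date, the same date -d call the script relies on:

```shell
# Same helper as in the script: "24h" -> "24 hour", "7d" -> "7 day", ...
calculate_timeframe() {
    unit=${1: -1}
    amount=${1::-1}
    case $unit in
        h) echo "$amount hour";;
        d) echo "$amount day";;
        m) echo "$amount month";;
        y) echo "$amount year";;
    esac
}

calculate_timeframe 24h    # prints: 24 hour
calculate_timeframe 7d     # prints: 7 day

# GNU date accepts the result directly as a relative offset
date -d "-$(calculate_timeframe 24h)" +"%d/%b/%Y:%H:%M:%S"
```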

2. Filtering Bots and Crawlers

The script uses grep to search for user agents containing bot, spider, or crawl, which are common patterns for identifying web crawlers. It then excludes user agents containing Mozilla, Chrome, or Safari with the -v flag so that ordinary browser traffic does not show up in the results. Note that this is a heuristic: some legitimate crawlers (Googlebot, for instance) include Mozilla in their full user-agent string and will be filtered out along with the browsers, so adjust the exclusion list to suit your traffic.
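You can sanity-check this filter chain against hand-written sample lines (both entries below are fabricated for illustration):

```shell
# One bot entry and one browser entry, in NGINX combined log format
printf '%s\n' \
    '1.2.3.4 - - [18/Jun/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ZoominfoBot (zoominfobot at zoominfo dot com)"' \
    '5.6.7.8 - - [18/Jun/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"' \
    | grep -i "bot\|spider\|crawl" \
    | grep -v "Mozilla\|Chrome\|Safari"
# Only the ZoominfoBot line survives both filters
```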

3. Request Method Filtering

You can filter the logs by request method (GET or POST) using the --GET or --POST flags:

- --GET limits the output to GET requests
- --POST limits the output to POST requests

If no flag is specified, the script shows both GET and POST requests by default.
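Because the request field of a combined-format entry begins with a quote followed by the method, the match can be anchored on that quote rather than on the start of the line. A small check (the sample entry and the ExampleBot name are made up):

```shell
line='1.2.3.4 - - [18/Jun/2024:10:00:01 +0000] "POST /login HTTP/1.1" 403 128 "-" "ExampleBot/1.0"'

echo "$line" | grep '"POST '                            # matches: the request is a POST
echo "$line" | grep '"GET ' || echo "not a GET request" # grep finds nothing here
```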

4. Log Location

The script prompts the user to enter the log file location. It supports the use of wildcards (*) to match multiple files (e.g., access*.log), making it flexible for different NGINX log structures.
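One subtlety worth knowing: for the wildcard to work, the variable holding the path must be expanded unquoted, which is how the script hands log_location to grep. A quick demonstration:

```shell
# Create two sample files matching the pattern
tmpdir=$(mktemp -d)
touch "$tmpdir/access.log" "$tmpdir/access-ssl.log"

log_location="$tmpdir/access*.log"

echo $log_location      # unquoted: the shell expands the glob to both files
echo "$log_location"    # quoted: the literal pattern, asterisk and all
```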

5. Final Output

Once the script gathers the relevant log entries, it uses awk to extract the HTTP method and user agent, sort to group them, and uniq -c to count the occurrences. The results are displayed in descending order based on the frequency of requests.
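The counting stage can be watched in isolation by feeding it a few method/user-agent pairs by hand:

```shell
printf '%s\n' \
    'GET "ZoominfoBot"' \
    'GET "ZoominfoBot"' \
    'POST "AdsBot-Google"' \
    'GET "ZoominfoBot"' \
    | sort | uniq -c | sort -nr
# Most frequent pair first: 3 GET "ZoominfoBot", then 1 POST "AdsBot-Google"
```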

Example Usage

Default Usage

To run the script with default settings (last hour, both GET and POST requests):

./track_abusive_bots.sh

Filtering by Timeframe

To filter logs from the last 24 hours:

./track_abusive_bots.sh --time 24h

To filter logs from the last 7 days:

./track_abusive_bots.sh --time 7d

Filtering by Request Method

To filter only GET requests from the last 12 hours:

./track_abusive_bots.sh --time 12h --GET

To filter only POST requests from the last week:

./track_abusive_bots.sh --time 7d --POST

Log File Location with Wildcards

If your NGINX log files are named with a pattern (e.g., access-ssl.log, access-error.log), you can specify a wildcard to match multiple files:

Enter the log file location (you can use * as a wildcard, e.g., /path/to/logs/access*.log): /var/log/nginx/access*.log

Identifying Abusive Traffic

Once the script runs, you'll see output similar to the following:

   15 GET "Googlebot/2.1"
   11 GET "ZoominfoBot"
    9 POST "AdsBot-Google"
    6 GET "DuckDuckBot/1.1"
    3 GET "facebookexternalhit/1.1"

This output tells you how many requests were made by each bot. Bots with unusually high request counts (e.g., thousands of requests within an hour) may be considered abusive and should be investigated further. You can take action by blocking these bots in your NGINX configuration or firewall.

How to Handle Abusive Bots

Once you've identified bots that are generating excessive or abusive traffic, you can mitigate their impact by:

- Blocking their user agents or source IP addresses in your NGINX configuration
- Rate limiting their requests with NGINX's limit_req module
- Adding firewall rules (for example, with iptables or fail2ban)
- Asking well-behaved crawlers to back off via robots.txt (abusive bots typically ignore it)
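Blocking by user agent in the NGINX configuration is a common first step. A minimal sketch using a map variable; the variable name and the bot pattern here are illustrative, not part of the script:

```nginx
# http {} context: flag requests whose user agent matches a listed pattern
map $http_user_agent $block_bot {
    default          0;
    ~*ZoominfoBot    1;   # case-insensitive regex match
}

server {
    listen 80;
    server_name example.com;

    # Refuse flagged requests before they reach your application
    if ($block_bot) {
        return 403;
    }

    # ... rest of your server configuration ...
}
```

Reload NGINX after editing, and extend the map with each offender the script surfaces.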

Conclusion

By using this Bash script, you can efficiently track down which bots and crawlers are contributing to abusive traffic on your NGINX server. With the ability to filter by timeframe and request method, this script provides flexibility and control over your log analysis, allowing you to quickly identify problematic bots. Once identified, you can take steps to block or mitigate their impact, helping to protect your server from unnecessary load and potential security risks.

Feel free to customize this script to suit your specific needs, such as adding more user agents to exclude or integrating automated actions based on the results!