Using python to analyze bots from apache logs

By admin on Mar 9, 2013

Apache logs contain useful information about the visitors and bots coming to your site. Here is what a typical Apache log entry looks like:

66.249.73.146 - - [09/Mar/2013:19:18:18 +0000] "GET / HTTP/1.1" 200 7266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The last field is the User-agent field, which tells us about the visitor. In this case it is Googlebot. Some fields contain spaces, so splitting on spaces alone is not enough; parsing an Apache log needs something like a CSV parser that uses a space as the delimiter and honors double quotes. Here is a quick Python snippet to find the top User-agents visiting your site:

cat /etc/httpd/logs/access_log.2013-03-10 | python3 -c "import csv,sys; f=csv.reader(sys.stdin, delimiter=' '); print('\n'.join(r[9] for r in f if len(r) > 9))" | sort | uniq -c | sort -rn

The logfile name may be different depending upon your setup.
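To see why a CSV reader with a space delimiter works here, the following minimal sketch parses the sample entry shown earlier and prints each field with its index. The function name `parse_line` is made up for illustration, and the sketch assumes every line has the same layout as the sample (quoted request, referrer, and User-agent).

```python
import csv

# The sample log entry from this article.
SAMPLE = (
    '66.249.73.146 - - [09/Mar/2013:19:18:18 +0000] "GET / HTTP/1.1" '
    '200 7266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Split one log line on spaces while keeping double-quoted fields whole.

    parse_line is a hypothetical helper name, not part of any library.
    """
    return next(csv.reader([line], delimiter=' '))


fields = parse_line(SAMPLE)
for index, value in enumerate(fields):
    print(index, value)
```

With this layout the User-agent ends up at index 9, which is why the one-liner reads `r[9]`. Note that the timestamp is not quoted, so it is split into two fields (indexes 3 and 4).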

Here is the output for the last few days of my Apache log file (I have kept only the bot entries):

    999 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    368 Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
    291 Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+)
    158 Mediapartners-Google
     98 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
     92 msnbot/2.0b (+http://search.msn.com/msnbot.htm)
     81 Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=9370706305395655573)
     61 msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

It is interesting to see that Google's crawlers are a lot more active on this site than other search engines' crawlers. That suggests Google's results may be more comprehensive and fresher than those of other engines.
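The counting and the manual bot selection can also be done in Python itself. The sketch below is one possible approach, not the method used to produce the list above: it counts User-agents with `collections.Counter` and keeps those that match a few substrings. The names `BOT_HINTS`, `count_user_agents`, and `looks_like_bot` are made up for illustration, and the substring list is an assumption that will miss some bots and may flag some ordinary clients.

```python
import csv
import sys
from collections import Counter

# Assumed hints for spotting bots; adjust for your own traffic.
BOT_HINTS = ("bot", "spider", "crawl", "mediapartners", "feedfetcher")


def count_user_agents(lines):
    """Count User-agent strings (field 9) in an iterable of log lines."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=' '):
        if len(row) > 9:  # skip lines that do not have the expected layout
            counts[row[9]] += 1
    return counts


def looks_like_bot(user_agent):
    """Heuristic check: does the User-agent contain any of the hints?"""
    lowered = user_agent.lower()
    return any(hint in lowered for hint in BOT_HINTS)


# Only runs when a log file path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], newline="", errors="replace") as log:
        counts = count_user_agents(log)
    for user_agent, hits in counts.most_common():
        if looks_like_bot(user_agent):
            print(f"{hits:7d} {user_agent}")
```

Saved under a hypothetical name such as `top_bots.py`, it could be run as `python3 top_bots.py /etc/httpd/logs/access_log.2013-03-10`, with the path adjusted to your setup.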
