InfoHeap
Tech tutorials, tips, tools and more

Using Python to analyze bots from Apache logs

By admin on Mar 9, 2013

Apache logs contain useful information about the visitors and bots coming to your site. Here is what a typical Apache log entry looks like:

66.249.73.146 - - [09/Mar/2013:19:18:18 +0000] "GET / HTTP/1.1" 200 7266 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The last field is the User-agent field, which identifies the visitor. In this case it is Googlebot. Some fields contain spaces and are wrapped in double quotes, so parsing an Apache log requires some sort of CSV parser. Here is a quick Python one-liner to find the top User-agents visiting your site:

cat /etc/httpd/logs/access_log.2013-03-10 | python3 -c "import csv,sys; f=csv.reader(sys.stdin, delimiter=' '); print('\n'.join(r[9] for r in f if len(r) > 9))" | sort | uniq -c | sort -rn

The logfile name may be different depending upon your setup.
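The same counting can be done entirely in Python, without sort and uniq. The following is a minimal sketch, assuming the log uses the layout of the entry shown above, so that the User-agent is field 9 when the line is split on spaces with quoted fields kept whole. The names count_user_agents and UA_FIELD are made up for this illustration, and the sample list simply reuses the example entry.

```python
import csv
from collections import Counter

UA_FIELD = 9  # position of the User-agent field in the example entry (assumed layout)


def count_user_agents(lines):
    """Count User-agent strings in an iterable of access log lines."""
    counts = Counter()
    # delimiter=' ' splits on spaces; csv keeps double-quoted fields in one piece
    for row in csv.reader(lines, delimiter=" "):
        if len(row) > UA_FIELD:  # skip short or malformed lines
            counts[row[UA_FIELD]] += 1
    return counts


# The example log entry from this article, used as sample input.
sample = [
    '66.249.73.146 - - [09/Mar/2013:19:18:18 +0000] "GET / HTTP/1.1" 200 7266 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Print the agents from most to least frequent, similar to `uniq -c | sort -rn`.
for agent, hits in count_user_agents(sample).most_common():
    print(f"{hits:7d} {agent}")
```

To run it over a real log, pass an open file object instead of the sample list, since csv.reader accepts any iterable of lines.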

Here is the output for the last few days of my Apache log file (I have picked only the bot entries):

    999 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    368 Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
    291 Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+)
    158 Mediapartners-Google
     98 Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
     92 msnbot/2.0b (+http://search.msn.com/msnbot.htm)
     81 Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=9370706305395655573)
     61 msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

It is interesting to see that Google's crawlers are a lot more active than those of other search engines. That suggests Google's results may be more comprehensive and fresher than those of other engines.
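The bot-only list above was picked out of the full output. One way to automate that step is a keyword check on the User-agent string. This is only a rough sketch: the BOT_MARKERS list and the is_probable_bot helper are assumptions made for this illustration, chosen to match the agents listed above, and they will miss bots that do not identify themselves.

```python
# Hypothetical substrings that mark a User-agent as a bot (not an exhaustive list).
BOT_MARKERS = ("bot", "spider", "crawler", "mediapartners", "feedfetcher")


def is_probable_bot(user_agent):
    """Return True if the User-agent contains one of the assumed bot markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mediapartners-Google",
    "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
    # A made-up browser User-agent, included for contrast.
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) "
    "Chrome/25.0.1364.152 Safari/537.22",
]

# Keep only the entries that look like bots.
for agent in agents:
    if is_probable_bot(agent):
        print(agent)
```

A substring match is deliberately simple. It can be fooled by a spoofed User-agent, so treat the result as a first pass and not as proof of who is crawling.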
