One of the main activities a webmaster has to perform is monitoring and fixing 404 (not found) pages on a website. When your web server cannot find a requested page on your site, it returns the HTTP 404 status code. You can find more about all HTTP status codes on Wikipedia's List_of_HTTP_status_codes page.
404 page example
To see which headers a server returns, you can use netcat (nc). For example, run this command in a Linux/Mac terminal (or an equivalent command on Windows):
printf "GET /foo/non-existing-page/ HTTP/1.1\r\nHost: www.google.com\r\n\r\n" | nc www.google.com 80
Here is the output of the command:
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Date: Thu, 11 Apr 2013 07:38:31 GMT
Server: sffe
Content-Length: 953
X-XSS-Protection: 1; mode=block

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 404 (Not Found)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}
  </style>
  <a href=//www.google.com/><img src=//www.google.com/images/errors/logo_sm.gif alt=Google></a>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p>The requested URL <code>/foo/non-existing-page/</code> was not found on this server. <ins>That’s all we know.</ins>
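If netcat is not available, roughly the same check can be sketched in Python with a plain socket. This is a minimal illustration, not a full HTTP client: the function names `build_request`, `parse_head`, and `fetch` are made up for this example, and the request adds a `Connection: close` header (not part of the command above) so that the server closes the connection once the response is sent.

```python
import socket


def build_request(host, path):
    """Build a minimal HTTP/1.1 GET request; every line ends with CRLF."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"  # assumption: ask the server to close when done
        "\r\n"
    ).encode("ascii")


def parse_head(raw):
    """Return (status_code, headers) parsed from a raw HTTP response."""
    # Everything before the first blank line is the status line plus headers.
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    status_line, *header_lines = head.split("\r\n")
    # Status line looks like: HTTP/1.1 404 Not Found
    status = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def fetch(host, path, port=80, timeout=10.0):
    """Send the request over a TCP socket and return the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)
```

Calling `parse_head(fetch("www.google.com", "/foo/non-existing-page/"))` returns the status code and a dictionary of lower-cased header names, so a missing page can be detected by checking whether the status equals 404.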
Reasons for 404 pages
There can be multiple reasons for 404 pages. Some of these are:
- A wrong link on one of your own pages
- A page might have been moved or deleted
- An external site might link to one of your pages with a wrong URL
Using Apache/web server logs to find 404 pages
One option for monitoring 404 pages is to check the Apache access log regularly. Here is what one such log entry looks like:
122.167.14.16 - - [11/Apr/2013:07:06:43 +0000] "GET /non-existing-page/ HTTP/1.1" 404 16003 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 431 like Mac OS X; en-US) AppleWebKit/533.17.9 (KHTML like Gecko) Version/5.0.2 Mobile/8G4 Safari/6533.18."
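Checking the log by hand gets tedious, so the scan can be automated. The sketch below assumes the log uses the same layout as the entry above (client address, timestamp, request line, status, size, referrer, user agent); the regular expression and the function name `count_404s` are illustrative and may need adjusting if your server uses a different log format.

```python
import re
from collections import Counter

# Matches entries laid out like the sample above (assumed log layout).
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_404s(lines):
    """Count 404 responses per (requested path, referrer) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        # Lines that do not fit the assumed layout are skipped silently.
        if match and match.group("status") == "404":
            counts[(match.group("path"), match.group("referer"))] += 1
    return counts
```

To use it, open your access log (its location depends on your server setup), pass the file object to `count_404s`, and call `.most_common(10)` on the result to list the most frequently requested missing pages. The referrer, when present, shows which page carried the broken link.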
Using Google Webmaster Tools to find 404 pages
Google Webmaster Tools provides an excellent report on the 404 error pages on your website. You can use it to find such pages and take appropriate corrective action. To see crawl errors (404 pages), log in to Google Webmaster Tools and go to “Health” -> “Crawl Errors”.
The tool lists all the error page URLs and the date it tried to fetch each of them. Once you fix a page, you can mark it as fixed, which helps you keep track of such pages.