Search engines

A website can be listed by well-known search engines and directories such as Google, Yahoo and the Netscape Open Directory Project (ODP). A website can be optimised to rank well in search results, and it also helps if many other sites link to it.

Basic stuff

When crawling a web page, search engine bots are guided by HTML tags in the page, such as the title, meta, a and img tags. It helps the search engine if the web pages have valid HTML/XHTML code (that is, they pass validation by the W3C Markup Validation Service). Search engines like Google might have problems with dynamically generated content, DHTML or Flash movies. To get an idea of how the Google bot sees a web page, one can view that page with a text browser like Lynx. The W3C editor/browser Amaya offers, among many other features, an "alternate view" of a web page (as a text browser would show it).

Access logs, if available, give information about visits from search engine bots, show the HTTP response for each requested page (hopefully 200 OK for a valid URI), and can reveal problems such as loops in a bot's access pattern.
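
As a rough illustration of reading an access log for bot visits, the sketch below counts requests per crawler and HTTP status. It assumes the server writes the common "combined" log format; the sample lines, the BOT_MARKERS list and the function name bot_hits are made up for this example and are not taken from any particular server.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings used to recognise crawlers; illustrative, not exhaustive.
BOT_MARKERS = ("Googlebot", "Slurp", "msnbot")


def bot_hits(lines):
    """Count (bot, status) pairs for requests whose user agent names a known bot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for bot in BOT_MARKERS:
            if bot.lower() in agent:
                counts[(bot, int(match.group("status")))] += 1
                break
    return counts


# Hypothetical log lines, for demonstration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2006:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [10/Oct/2006:13:55:40 +0000] "GET /old.html HTTP/1.1" 404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [10/Oct/2006:14:01:02 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (X11; Linux) Firefox/1.5"',
]

for (bot, status), hits in sorted(bot_hits(SAMPLE_LOG).items()):
    print(bot, status, hits)
```

A 404 count for a bot usually points to a stale inbound link or a renamed page, which is worth fixing with a redirect.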

Google recently introduced a "viewing statistics and errors for your site" feature in Google Sitemaps, which gives detailed feedback on how Googlebot crawls a website. A Google sitemap file is not necessary for viewing this feedback.

SERP

The SERP (Search Engine Results Page), the list of results that appears after a query is submitted to a search engine, is obviously very important to websites. The SERP structure for Google is described in Google Help Center - Search Results Page. To see which pages from a website, for example domain.com, appear in Google search results, one can use the advanced operator query site:domain.com or the Google advanced search page. If a page is listed in these results only by its URL (without a description or snippet), the page is not fully indexed by Google and usually would not appear in a keyword search.

Once a web page is indexed by Google, it can appear in a SERP for keyword queries with the page title, a snippet (portions of text from the page containing the keywords) and, if it exists, a link to the cached version of the page. The keywords in the query can be anywhere in the page: in the title tag, in the meta keywords and description tags, or in the page content.

Search engines periodically store copies of indexed pages so that a cached version is available if the current version is temporarily unavailable. Google and MSN display the date when a page was last cached. Google recently started to show in the SERP, for selected websites, a group of links to important pages from that website.

The position of a page in the SERP is very important. The default number of results (each from a different domain) in a SERP is 10, so it is important for a website to appear among the first 10 results. The position of a web page in the SERP is determined in a complex way by Google (see Google - Technology Overview), using the well-known PageRank algorithm, which is based on inbound links and page importance, together with hypertext-matching analysis based on the page content.
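
To give a feeling for how inbound links can translate into a score, here is a toy version of the published, simplified PageRank iteration. It is only a sketch of the textbook formulation and certainly not Google's actual ranking; the four-page link graph, the damping value 0.85 and the function name pagerank are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps every page to the list of pages it links to; each link
    target must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base score independent of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page without outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: three pages link to "home", nothing links to "orphan".
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "orphan": ["home"],
}

for page, score in sorted(pagerank(LINKS).items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

In this toy graph the page with the most inbound links ends up with the highest score, and the page that nothing links to keeps only the base score.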

Meta Tags

Some meta tags can be included in the head section of an HTML document to help search engines, such as the meta tags for keywords, description, robots and refresh. The words listed in the keywords meta tag have a better chance of improving search engine results if they also appear in the content of the HTML document, including the title or headings (h1, h2 etc.).

The robots meta tag indicates to visiting robots whether a document may be indexed (with content index) or used to obtain more links (with content follow). When this meta tag is absent, the default is no restriction on indexing the page or harvesting its links. The rules in the robots.txt file take precedence over the robots meta tag: when a file or directory is declared out of bounds to polite robots by the robots.txt file, the robots meta tag cannot change that, although it is best that the robots.txt file and the robots meta tag do not give conflicting instructions. A very good presentation of the robots meta tag is robotstxt.org - HTML Author's Guide to the Robots META tag. Search engines usually assume by default <meta name="robots" content="index, follow">, which tells all robots that the page can be indexed and its links followed, so it is not necessary to include this tag in the page. Google gives some guidelines on using the robots meta tag in Google Information for Webmasters - Removals.
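
The default behaviour described above can be sketched with Python's standard html.parser module. This is a minimal illustration of how a crawler might read the tag, not how any particular search engine implements it; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_policy(html):
    """Return (may_index, may_follow); both are True when no tag restricts them."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


print(robots_policy('<head><title>No robots tag</title></head>'))
print(robots_policy('<head><meta name="robots" content="noindex, follow"></head>'))
```

Note that a crawler only sees this tag if it is allowed to fetch the page in the first place, which is why robots.txt has the final word.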

The /robots.txt file

The robots.txt file specifies restriction rules for compliant web robots and is placed in a website's root directory. An excellent site explaining web robots and the robots.txt file is robotstxt.org - Web Server Administrator's Guide to the Robots Exclusion Protocol.

For a website that has Google sitemaps submitted to Google, the Google sitemap account panel provides a tool that shows how Google search agents interpret the robots.txt file, reports whether specified URLs are blocked from Googlebot access by robots.txt, and allows testing new content for the robots.txt file. The Google sitemap documentation explains this at Analyzing a robots.txt file.

The robots.txt file names robots or search engines in a text line starting with User-Agent: and lists out-of-bounds files or directories in text lines starting with Disallow:. The out-of-bounds URLs are specified by a full or partial path; any URL whose path starts with this string will not be retrieved by the specified user agents. An empty value for the path indicates that all URLs can be retrieved. The way URLs can be specified depends on the search engine: Google accepts paths with the wildcard * to match any sequence of characters, and with $ to indicate the end of the string. For example, if the robots.txt file with URL www.mydomain.com/robots.txt contains the lines


User-agent: Googlebot
Disallow: /admin
Disallow: /*.gif$

User-Agent: W3C-checklink
Disallow:

this will prevent Google from crawling pages with URLs starting with www.mydomain.com/admin and all URLs ending in .gif, but the W3C Link Checker can access all documents (see Robots exclusion - W3C Link Checker). A line User-Agent: * means all compliant robots. Note that excluding some URLs by too precise a path in the robots.txt file might attract the attention of unwelcome robots to those pages.
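
The prefix, * and $ matching described above can be approximated with a few lines of Python. This is a simplified sketch under stated assumptions: it handles only Disallow patterns (no Allow lines or precedence rules), and the function names and rule lists are invented for the example rather than taken from any crawler.

```python
import re


def pattern_to_regex(path_pattern):
    """Translate a Disallow path using * and a trailing $ into a regular expression."""
    anchored_end = path_pattern.endswith("$")
    if anchored_end:
        path_pattern = path_pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every *.
    body = ".*".join(re.escape(piece) for piece in path_pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def is_blocked(path, disallow_patterns):
    """True if the URL path matches any non-empty Disallow pattern."""
    return any(
        pattern_to_regex(pattern).search(path)
        for pattern in disallow_patterns
        if pattern  # an empty Disallow value blocks nothing
    )


# The rules from the example robots.txt above.
GOOGLEBOT_RULES = ["/admin", "/*.gif$"]
W3C_CHECKLINK_RULES = [""]

for path in ["/admin/login.html", "/images/logo.gif", "/index.html"]:
    print(path, is_blocked(path, GOOGLEBOT_RULES))
```

Because a plain path is a prefix match, /admin also blocks /administrator; a trailing slash (/admin/) restricts the rule to the directory.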

Some search engines like Yahoo, MSN and BecomeBot (BecomeBot searches for shopping-related websites) state that they also respect

Crawl-Delay: xx

(where xx is the delay in seconds between successive crawler accesses), which is useful when the crawl rate is a problem for a web server.
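
As an illustration of how a polite crawler could read this directive, the sketch below uses the urllib.robotparser module from the Python standard library (its crawl_delay method is available from Python 3.6). The robots.txt content and the agent name OtherBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string instead of fetched.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /admin

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay in seconds requested for this agent, or None if none is specified.
print(parser.crawl_delay("Slurp"))
print(parser.crawl_delay("OtherBot"))

# The same parser also answers whether a path may be fetched.
print(parser.can_fetch("Slurp", "/admin/users"))
print(parser.can_fetch("OtherBot", "/admin/users"))
```

Note that this standard-library parser does plain prefix matching and does not interpret the Google-style * and $ extensions.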

The way in which compliant robots respect this exclusion protocol is usually documented, with examples, on their websites.

Direct submission

Major search engines have online pages for direct submission of websites.

Website submission does not guarantee indexing by the search engine. A submitted site should not have broken links (check for example with the W3C Link Checker) or be under construction.

Site Structure

The web pages included in a site's main navigation bar may be indexed by search engines, listed in online directories or saved in other people's bookmarks, so it is preferable not to rename or delete an HTML document. A well-planned site structure makes it possible to modify the site without affecting good search results or useful listings in online directories.

If a page is changed from an HTML document to a PHP script, the file's .html extension can be kept by adding a line like the following to the .htaccess file:

AddType application/x-httpd-php .html

A good presentation of using the .htaccess file to manage file extensions is Philip Olson's phpbuilder.com/tips/item.php?id=79.

Links and search results

Search engines (and people) find new URLs by following hyperlinks. Search results can be improved if a website has good-quality links from other websites that are relevant to its content. A word of caution: large link-exchange programs that give a site too many unrelated links too fast can damage a website's search results and can even get a site banned from the Google index.

It is also important how pages from a website link to each other (the navigation inside the website). Search engines follow clear links defined by <a href="http://some_url.html">some anchor text</a> more easily than complex JavaScript navigation. The link element inside the head element can be used to indicate to browsers and search engines the relationship between pages that are part of an ordered series forming a larger document, like chapters of an article written in separate HTML pages; see the W3C HTML 4.01 Specification, Document relationships: the LINK element.

The rel attribute of the a tag can tell search engines that a page does not vouch for a link. <a rel="nofollow" href="some_url.html">some anchor text</a> tells search engines not to use that occurrence of the link to add weight to that URL in the search engine index.
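
To see a page's links roughly the way a crawler might, one can extract the a elements and separate those marked nofollow. The sketch below uses only the Python standard library; the class name, function name and sample markup are invented for illustration, and real crawlers are considerably more elaborate.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, split by the presence of rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return  # named anchors without href are not links
        # rel holds a space-separated list of link types.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def collect_links(html):
    """Return (followed_links, nofollow_links) found in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollow


# Hypothetical page fragment.
SAMPLE = (
    '<p><a href="chapter2.html">Next chapter</a> '
    '<a rel="nofollow" href="http://example.org/ad">Sponsor</a> '
    '<a name="top">anchor without href</a></p>'
)

print(collect_links(SAMPLE))
```

Links that exist only in JavaScript navigation would not show up in such a listing, which is a quick way to spot navigation that crawlers may miss.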

Pages on other sites that link to a website some_site.com can be found in Google by using the advanced operator link:, for example link:some_site.com. Yahoo Site Explorer shows, for Yahoo, the inbound links to a site from other sites.

Redirects

It is good practice to use HTTP redirects to indicate to search engines and users the new replacement URL for a page. A 301 redirect indicates a permanent redirect and a 302 a temporary one. Google can find a new page if a 301 redirect is used (see Google Information for Webmasters), and Yahoo explains in How does the Yahoo! Web Crawler handle redirects? how it handles redirects and the refresh meta tag.
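
The difference between the two status codes can be shown with a very small WSGI application. This is only a sketch: the paths in the two tables and the name redirect_app are hypothetical, and on a real site redirects are more commonly configured in the web server itself.

```python
# Hypothetical tables of moved pages: old path -> replacement path.
MOVED_PERMANENTLY = {"/old-page.html": "/new-page.html"}
MOVED_TEMPORARILY = {"/sale.html": "/holding-page.html"}


def redirect_app(environ, start_response):
    """Answer 301 or 302 with a Location header for moved paths, 200 otherwise."""
    path = environ.get("PATH_INFO", "/")
    if path in MOVED_PERMANENTLY:
        # 301: the old URL should be replaced by the new one in indexes.
        start_response("301 Moved Permanently",
                       [("Location", MOVED_PERMANENTLY[path])])
        return [b""]
    if path in MOVED_TEMPORARILY:
        # 302: the old URL remains the one to keep.
        start_response("302 Found",
                       [("Location", MOVED_TEMPORARILY[path])])
        return [b""]
    body = b"This page has not moved.\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

To try it locally, the application can be served with wsgiref.simple_server.make_server("", 8000, redirect_app).serve_forever() and requested with any browser or HTTP client.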

URL rewrite

Search engines usually have some problems finding URLs of dynamically generated pages that include query strings, especially if the query string contains a large number of parameter/value pairs. For this reason many websites use URL rewriting: when the hosting server allows the same page content to be accessed with more than one URL, the URL of a dynamically generated page can be rewritten to look like the URL of a static page, without the & or = characters. For example www.mysite.com/page_1_2.html can represent the content of www.mysite.com/page.php?ref=1&item=2. The mod_rewrite Apache module permits this rewriting; see Apache HTTP Server - URL Rewriting Guide. URL rewriting that involves parsing the URL string can be programmed in the .htaccess file or by using the $ENV{PATH_INFO} variable in the CGI scripts generating the web page.
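
The mapping in the example above can be sketched in a few lines of Python. This only illustrates the pattern matching involved and is not mod_rewrite itself; the regular expression and the function name rewrite are assumptions based on the page_1_2.html example.

```python
import re
from urllib.parse import urlencode

# Static-looking URL of the form /page_<ref>_<item>.html (assumed pattern).
STATIC_URL = re.compile(r"^/page_(?P<ref>\d+)_(?P<item>\d+)\.html$")


def rewrite(path):
    """Map a static-looking path to the dynamic URL it stands for.

    Paths that do not match the pattern are returned unchanged.
    """
    match = STATIC_URL.match(path)
    if match is None:
        return path
    query = urlencode({"ref": match.group("ref"), "item": match.group("item")})
    return "/page.php?" + query


print(rewrite("/page_1_2.html"))
print(rewrite("/about.html"))
```

Restricting the pattern to digits keeps arbitrary strings from being passed through to the script as parameters.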

www.w3.org/TR/html401/appendix/notes.html#h-B.4
W3C Notes on helping search engines index your Web site
www.google.com/webmasters/
Google Information for Webmasters
help.yahoo.com/help/us/ysearch/basics/index.html
Yahoo Search Help
help.yahoo.com/help/us/ysearch/slurp/slurp-11.html
How does the Yahoo! Web Crawler handle redirects?
www.robotstxt.org/wc/robots.html
robotstxt.org, the reference website for the robots.txt file and robots meta tags
httpd.apache.org/docs/2.0/misc/rewriteguide.html
URL Rewriting Guide - Apache HTTP Server
www.w3.org/QA/Tips/uri-choose
Choose URIs wisely - W3C Quality Web Tips
www.w3.org/QA/Tips/uri-manage
Managing URIs - W3C Quality Web Tips