Search engines
A website can be listed by well-known search engines and directories such as Google, Yahoo and the Netscape Open Directory Project (ODP). A website can be optimised for good search results, and it also helps if many other sites link to it.
Basic stuff
When crawling a web page, search engine bots are guided by HTML tags in the page, such as
the title, meta, a and img tags. It helps the search engine if the web pages have valid
HTML/XHTML code (that is, code that passes the W3C Markup Validation Service).
Search engines like Google might have problems with dynamically generated content, DHTML or Flash movies. To get an idea of how
the Google bot sees a web page, one can
view that page with a text browser like Lynx.
The W3C editor/browser
Amaya offers, among many other features,
an "alternate view" of a web page (as a text browser would display the page).
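As a rough illustration of what a text-only view of a page contains, the following Python sketch strips the markup and keeps the visible text plus the alt text of images. It is only an approximation under simple assumptions: the class name TextView and the function text_view are made up for this example, and real text browsers and search engine bots do considerably more.

```python
from html.parser import HTMLParser


class TextView(HTMLParser):
    """Collects the visible text of a page, roughly as a text browser would."""

    SKIPPED = {"script", "style"}  # content of these elements is not shown

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            # A text browser shows the alt text in place of the image.
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append("[" + alt + "]")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def text_view(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


SAMPLE_PAGE = (
    "<html><head><title>My page</title>"
    "<script>var x = 1;</script></head>"
    '<body><h1>Hello</h1><img src="a.gif" alt="logo">'
    "<p>Some <b>bold</b> text</p></body></html>"
)
print(text_view(SAMPLE_PAGE))
```

An image without alt text, or content produced only by a script, leaves no trace in this view, which is the point of checking a page this way.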
Access logs, if available, record the visits from search engine bots, show the HTTP response for each requested page (ideally 200 OK for a valid URI), and can reveal loops in a bot's crawl.
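A minimal Python sketch of this kind of log inspection is shown below. It assumes the server writes the common "combined" log format, and it counts HTTP status codes per bot. The sample log lines, the IP addresses and the short list of bot names in BOT_MARKERS are invented for the example; a real log may use a different format and contain many more user agents.

```python
import re
from collections import Counter

# One line of a "combined" format access log (assumed format).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings used to recognise bots in the User-Agent field (illustrative only).
BOT_MARKERS = ("Googlebot", "Slurp", "msnbot")


def bot_status_counts(lines):
    """Count (bot, HTTP status) pairs found in access log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in the expected format
        agent = match.group("agent").lower()
        for bot in BOT_MARKERS:
            if bot.lower() in agent:
                counts[(bot, int(match.group("status")))] += 1
                break
    return counts


# Made-up sample data.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2006:13:55:36 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2006:13:56:01 -0700] "GET /old.html HTTP/1.1" '
    '404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '10.0.0.5 - - [10/Oct/2006:14:00:00 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

for (bot, status), count in sorted(bot_status_counts(SAMPLE_LOG).items()):
    print(bot, status, count)
```

A high count of 404 responses for one bot suggests stale links somewhere, and the same path requested over and over can point to a crawl loop.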
Google recently introduced a "viewing statistics and errors for your site" feature in Google Sitemaps, with detailed feedback on how Googlebot crawls a website. A Google sitemap file is not necessary for viewing this feedback.
SERP
The SERP (Search Engine Results Page), the page of results that appears after a query is submitted
to a search engine, is very important to websites.
The SERP structure for Google is described in
Google Help Center - Search Results Page.
To see which pages from a website, for example domain.com, appear in Google search results,
one can use the advanced operator query site:domain.com or the
Google advanced search page. If a page in the results of this search is listed only by its URL
(without a description or snippet), the page is not fully indexed by Google
and usually would not appear in a keyword search.
Once a web page is indexed by Google, it can appear in a SERP for keyword queries, with the page title and a snippet
(portions of text from the page containing the keywords) and, if it exists, a link to the cached version of the page.
From time to time search engines store a copy of the indexed pages so that a cached version is available when the
current version is temporarily unavailable.
Google and MSN display the date when a page was last cached.
Google recently started to show in the SERP, for selected websites,
a group of links to important pages from that website.
The keywords in the query can be anywhere in the page:
in the title tag, the meta keywords and description tags, or in the page content.
The page position in the SERP is very important. The default number of results (each from a different domain) in a SERP is 10, so it is important for a website to appear among the first 10 results. Google determines the position of a web page in the SERP in a complex way (see Google - Technology Overview), combining the PageRank algorithm, which is based on inbound links and page importance, with hypertext-matching analysis based on the page content.
Meta Tags
Meta tags can be included
in the head section of an HTML document
to help search engines, for example the meta tags for keywords, description, robots
and refresh.
The words listed in the keywords meta tag have a better chance of improving
search engine results
if they also appear in the content of the HTML document, for example in the title
or headings (h1, h2 etc.).
The robots meta tag
indicates to visiting robots
whether a document may be indexed (with the content value index) or used to obtain more links
(with the content value follow).
When this meta tag is not used, search engines by default assume no restrictions
on indexing
or harvesting more links.
The rules specified in the robots.txt file take precedence over
the robots meta tag.
When the robots.txt file declares a file or directory out of bounds to polite robots, the robots meta tag cannot change that. In any case it is best that
the robots.txt file and the robots meta tag do not give conflicting instructions.
A very good presentation of the robots meta tag is
robotstxt.org - HTML Author's Guide
to the Robots META tag.
Search engines usually assume <meta name="robots" content="index, follow"> by default, which tells
all robots that the page can be indexed and the links in it followed, so it is not necessary to include this tag in the page.
Google has some guidelines on using the robots meta tag in
Google Information for Webmasters - Removals.
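The way a robot might read the robots meta tag can be sketched in Python as follows. This is an illustration only, not the behaviour of any particular search engine. The names RobotsMetaParser and robots_policy are invented for the example, and the sketch assumes the values described in the robotstxt.org guide (index, noindex, follow, nofollow, all, none), with no restrictions when the tag is absent.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Finds the content of <meta name="robots" ...> in a page."""

    def __init__(self):
        super().__init__()
        self.directives = None  # None means no robots meta tag was found

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives = {
                value.strip().lower()
                for value in content.split(",")
                if value.strip()
            }


def robots_policy(html):
    """Return (may_index, may_follow) for a page; the default is (True, True)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives or set()
    may_index = "noindex" not in directives and "none" not in directives
    may_follow = "nofollow" not in directives and "none" not in directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_policy(page))
```

For a page without the tag, robots_policy returns (True, True), matching the default described above.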
The /robots.txt file
The robots.txt file
specifies restriction rules for compliant web robots
and is placed in a website's root directory.
An excellent site explaining web robots and the robots.txt file is
robotstxt.org - Web Server Administrator's Guide
to the Robots Exclusion Protocol.
For a website that has Google sitemaps submitted to Google,
the Google sitemap account panel provides a tool that shows how Google search agents interpret the robots.txt file,
reports whether specified URLs are blocked to Googlebot by robots.txt, and allows testing new content
for the robots.txt file. The Google sitemap documentation explains this at
Analyzing a robots.txt file.
The robots.txt file names robots or search engines
in a text line starting with User-Agent:
and lists out of bounds files or directories in text lines starting with
Disallow:
The out of bounds URLs are specified by a full path or a partial path.
Any URL that starts with this string will not be retrieved by the specified user agents.
An empty value for the path indicates that all URLs can be retrieved.
The way URLs can be specified depends on the search engine. Google accepts paths
specified with the wildcard * to match any sequence of characters, and with $ to indicate the end of the string.
For example, if the robots.txt file with URL www.mydomain.com/robots.txt contains the lines
User-agent: Googlebot
Disallow: /admin
Disallow: /*.gif$

User-Agent: W3C-checklink
Disallow:
this will prevent Google from crawling pages with URLs starting with www.mydomain.com/admin
and all URLs ending in .gif, but the W3C Link Checker can access all documents
(see Robots exclusion - W3C Link Checker).
A line User-Agent: * means all compliant robots.
Note that listing an overly precise path in the robots.txt file might draw the attention of unwelcome robots to those pages.
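The matching of Disallow paths described above, including the * and $ extensions, can be sketched in Python as below. This is a simplified reading of the rules, not Google's implementation. The function names pattern_to_regex and is_disallowed are invented for the example, and the sketch handles only Disallow values for a single user agent.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Disallow path into a regular expression.

    '*' matches any sequence of characters, and a trailing '$' anchors the
    pattern at the end of the URL path. Without '$' the pattern is a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """Return True if the URL path matches any non-empty Disallow pattern."""
    for pattern in disallow_patterns:
        if not pattern:
            continue  # an empty Disallow value blocks nothing
        if pattern_to_regex(pattern).match(path):
            return True
    return False


# The Disallow values of the Googlebot record in the example above.
googlebot_rules = ["/admin", "/*.gif$"]

for path in ("/admin/login.html", "/images/logo.gif", "/index.html"):
    print(path, is_disallowed(path, googlebot_rules))
```

With these rules /admin/login.html and /images/logo.gif are blocked, while /index.html is not.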
Some crawlers, such as those of Yahoo and MSN, and BecomeBot (which searches for shopping-related websites), state that they also respect
Crawl-Delay: xx
(where xx is the delay in seconds between successive crawler accesses), which is useful when the
crawl rate is a problem for a web server.
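Python's standard library includes a robots.txt parser, urllib.robotparser, that understands User-agent, Disallow and Crawl-delay lines. The sketch below feeds it a small made-up robots.txt and asks what two robots may do. The file content and the name SomeOtherBot are assumptions for the example; Slurp is the name used by the Yahoo crawler. Note that this parser matches paths by prefix only and does not implement the * and $ extensions.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10
Disallow: /admin

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay in seconds requested for the Yahoo crawler.
slurp_delay = parser.crawl_delay("Slurp")

# The /admin prefix is out of bounds for Slurp only.
admin_url = "http://www.mydomain.com/admin/index.html"
slurp_admin = parser.can_fetch("Slurp", admin_url)
other_admin = parser.can_fetch("SomeOtherBot", admin_url)

print(slurp_delay, slurp_admin, other_admin)
```

A polite crawler would wait the returned number of seconds between successive requests to the same site.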
The way in which compliant robots respect this exclusion protocol is usually described, with examples, on their websites, for example
- Google Information for Webmasters - Removals,
- Yahoo Help - How do I prevent my site from being crawled or prevent certain subdirectories from being crawled?,
- Alexa Web Search - For Webmasters (Alexa's crawler identifies itself as ia_archiver),
- Girafa - Remove URL (the Girafa thumbnail service for search engines and web directories delivers thumbnail-sized images of web pages for display next to the textual links to these pages),
- Boitho.com distributed web crawler (beta),
- Ipselonbot (beta),
- NextGenSearchBot ZoomInfo crawler.
Direct submission
Major search engines have online pages for direct submission of websites, for example
- submit a website URL to Google via the "add your URL to Google" page, or by submitting a Google sitemap (beta),
- submit a website to Yahoo as a site URL or a text list of site URLs via the "submit your site for free to Yahoo Search" page,
- suggest a URL to dmoz.org.
Website submission does not guarantee indexing by the search engine. A submitted site should not have broken links (check for example with the W3C Link Checker) or be under construction.
Site Structure
The web pages in a site's main navigation bar may be indexed by search engines or online directories and saved in other people's bookmarks, so it is preferable not to rename or delete an HTML document. A well-planned site structure makes it possible to handle modifications without affecting good search engine results and useful listings in online directories.
If a page is changed from an HTML document to a PHP script, the
file's .html extension
can be kept by adding to the .htaccess file a line like
AddType application/x-httpd-php .html.
A good presentation of using the .htaccess file to manage file extensions is
Philip Olson's phpbuilder.com/tips/item.php?id=79.
Links and search results
Search engines (and people) find new URLs by following hyperlinks. Search results can be improved if a website has good quality links from other websites that are relevant to its content. A word of caution: large link exchange programs that give a site too many unrelated links too quickly can damage a website's search results and can even get a site banned from the Google index.
How pages from a website link to each other (the navigation inside the website) is also important.
Search engines follow plain links of the form <a href="http://some_url.html">some anchor text</a> more easily than complex JavaScript navigation.
The link element inside the head element can be used to indicate
to browsers and search engines the relationship between pages if they are part of an ordered series forming a
larger document, like chapters of an article written in separate HTML pages,
see the W3C HTML 4.01 Specification,
Document relationships: the LINK element.
The rel attribute of the a tag can indicate to search engines
that a page does not vouch for a link.
<a rel="nofollow" href="some_url.html">some anchor text</a>
tells search engines not to use that occurrence of the link to add weight to that URL in the search engine index.
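To check which links on a page carry rel="nofollow", the page can be scanned with a short Python sketch like the one below. The names LinkCollector and collect_links and the sample markup are invented for this example. It reports what is in the markup and says nothing about how a given search engine weighs the links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, nofollow) pairs for every a tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return  # a named anchor, not a link
        # rel holds a space-separated list of values.
        rel_values = (attrs.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_values))


def collect_links(html):
    """Return a list of (href, nofollow) tuples in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


sample = (
    '<a href="a.html">A</a> '
    '<a rel="nofollow" href="b.html">B</a> '
    '<a name="top">top of page</a>'
)
for href, nofollow in collect_links(sample):
    print(href, "nofollow" if nofollow else "followed")
```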
URLs from other sites containing links to a website some_site.com
can be queried in Google by using the advanced operator link:
for example link:some_site.com.
Yahoo Site Explorer shows,
for Yahoo, the inbound links to a site from other sites.
Redirects
It is good practice to use HTTP redirects to tell search engines and users the new replacement URL for a page: a 301 redirect indicates a permanent redirect and a 302 a temporary one. Google can find a new page if a 301 redirect is used (see Google Information for Webmasters), and Yahoo explains how it handles redirects and the refresh meta tag in How does the Yahoo! Web Crawler handle redirects?
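What a 301 or 302 response looks like on the server side can be sketched with Python's standard http.server module. This is a toy illustration, not a recommended production setup, since redirects are normally configured in the web server itself. The paths in the two tables and the names redirect_for and RedirectHandler are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical moved pages: old path -> new path.
MOVED_PERMANENTLY = {"/old-page.html": "/new-page.html"}
MOVED_TEMPORARILY = {"/offers.html": "/offers-this-week.html"}


def redirect_for(path):
    """Return (status, new_path) for a moved page, or None if it has not moved."""
    if path in MOVED_PERMANENTLY:
        return 301, MOVED_PERMANENTLY[path]
    if path in MOVED_TEMPORARILY:
        return 302, MOVED_TEMPORARILY[path]
    return None


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers GET requests for moved pages with a redirect response."""

    def do_GET(self):
        target = redirect_for(self.path)
        if target is None:
            self.send_error(404)
            return
        status, location = target
        self.send_response(status)
        # The Location header carries the replacement URL.
        self.send_header("Location", location)
        self.end_headers()


print(redirect_for("/old-page.html"))
```

The handler could be served with http.server.HTTPServer for local experiments. The important part is the status code together with the Location header.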
URL rewrite
Search engines often have problems finding URLs of dynamically generated pages
that include query strings, especially if the query string contains a large number of parameter/value pairs.
Because of this, many websites use URL rewriting:
when the hosting server allows the same page content to be accessed through more than one URL,
the URL of a dynamically generated page can be rewritten to look like the URL of a static page,
without the & or = characters.
For example
www.mysite.com/page_1_2.html
can represent the content of www.mysite.com/page.php?ref=1&item=2.
The mod_rewrite
Apache module permits this rewriting
(see Apache HTTP Server - URL Rewriting Guide). URL rewriting that involves parsing the URL string
can be programmed from
the .htaccess file or by using the $ENV{PATH_INFO} variable in the CGI scripts
that generate the web page.
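The mapping in the example above can be sketched in Python to show the idea behind such a rewrite rule. This is not mod_rewrite itself, only an illustration of the pattern matching involved. The function names rewrite and static_url are invented, and the URL scheme is the one from the example.

```python
import re

# Static-looking URL of the form /page_<ref>_<item>.html
STATIC_URL = re.compile(r"^/page_(\d+)_(\d+)\.html$")


def rewrite(path):
    """Map a static-looking path to the dynamic URL that produces the content.

    Paths that do not match the pattern are returned unchanged.
    """
    match = STATIC_URL.match(path)
    if match is None:
        return path
    ref, item = match.groups()
    return "/page.php?ref={}&item={}".format(ref, item)


def static_url(ref, item):
    """Build the static-looking URL to use in links on the site."""
    return "/page_{}_{}.html".format(ref, item)


print(rewrite("/page_1_2.html"))
print(static_url(1, 2))
```

The site's own links should use the static-looking form, so that search engines only ever see URLs without query strings.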
Reference Links
- W3C Notes on helping search engines index your Web site: www.w3.org/TR/html401/appendix/notes.html#h-B.4
- Google Information for Webmasters: www.google.com/webmasters/
- Yahoo Search Help: help.yahoo.com/help/us/ysearch/basics/index.html
- How does the Yahoo! Web Crawler handle redirects?: help.yahoo.com/help/us/ysearch/slurp/slurp-11.html
- robotstxt.org, the reference website for the robots.txt file and robots meta tags: www.robotstxt.org/wc/robots.html
- URL Rewriting Guide - Apache HTTP Server: httpd.apache.org/docs/2.0/misc/rewriteguide.html
- Choose URIs wisely - W3C Quality Web Tips: www.w3.org/QA/Tips/uri-choose
- Managing URIs - W3C Quality Web Tips: www.w3.org/QA/Tips/uri-manage