Elements of a good web page

From KallestadWiki

I decided to expound on my comments about elements of a good web page - so that I have clarity of purpose when coming up with designs, content, layout, and supporting application software.

Google's Webmaster Guidelines are as good a place to start as any, although working through them has made this page extraordinarily long. I also have my own set of guidelines beyond what is written here, covering performance, content quality, usability, and so on. They are based on my own experience on the web, on other important sites that I have read, and on common sense. I'll expound on those on a few other pages and link to them from here when they are available, but frankly this page alone has taken all day to consider.

Design and Layout Guidelines

Regarding design and content, Google has the following to say:

 * Make a site with a clear hierarchy and text links. 
   Every page should be reachable from at least one static text link.
 * Offer a site map to your users with links that point to the important parts of your site. 
   If the site map is larger than 100 or so links, you may want to break the site map into separate pages.
 * Create a useful, information-rich site, and write pages that 
   clearly and accurately describe your content.
 * Think about the words users would type to find your pages, 
   and make sure that your site actually includes those words within it.
 * Try to use text instead of images to display important names, content, or links. 
   The Google crawler doesn't recognize text contained in images.
 * Make sure that your TITLE tags and ALT attributes are descriptive and accurate.
 * Check for broken links and correct HTML.
 * If you decide to use dynamic pages (i.e., the URL contains a "?" character), 
   be aware that not every search engine spider crawls dynamic pages as well as static pages. 
   It helps to keep the parameters short and the number of them few.
 * Keep the links on a given page to a reasonable number (fewer than 100). 

Let's expound on those a bit to really see what Google is getting at, and how we can best build a design that supports each element.

Clear Hierarchy and Text Links

 * Make a site with a clear hierarchy and text links. 
   Every page should be reachable from at least one static text link.


This tells Google which pages are important in your site, and gives Googlebot a way to spider it. Once a site grows beyond 100 pages or so, this becomes harder to accomplish in a reasonable fashion. Certainly, providing a text link to each page within a given category is reasonable, but some pages are more important than others. Most search engines determine the internal value of a given page by the number of intra-site links pointing at it. The home page would be most important, because it is linked from everywhere, but some subject matter within a site can be quite extensive, and you need to decide how your internal links will "vote" for the pages that matter most.

How can we accomplish this?

  1. New pages being linked in a common "recent items" list - this accomplishes the task of making your most recently published information more important than other pages.
  2. Category pages linking to each individual page within a given category - this accomplishes a way to drill down to specific topics within a given category.
  3. Related Links between relevant pages - this accomplishes the task of making your most frequently discussed topics your most important ones.
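
To get a feel for how those three mechanisms hand out internal "votes", here is a minimal sketch that counts inbound intra-site links per page. The `site_links` map and its page names are made up for illustration, and counting raw inbound links is only my rough stand-in for whatever the search engines actually compute.

```python
from collections import Counter

def inbound_link_counts(site_links):
    """Count how many distinct pages link to each page (self-links ignored)."""
    counts = Counter()
    for source, targets in site_links.items():
        for target in set(targets):  # a page votes at most once per target
            if target != source:
                counts[target] += 1
    return counts

# Hypothetical site: a home page, a "recent items" list, one category page,
# and two articles that cross-link as related content.
site_links = {
    "/": ["/recent", "/category/seo", "/page-a"],
    "/recent": ["/", "/page-a", "/page-b"],
    "/category/seo": ["/", "/page-a", "/page-b"],
    "/page-a": ["/", "/page-b"],
    "/page-b": ["/", "/page-a"],
}

counts = inbound_link_counts(site_links)
for page, votes in counts.most_common():
    print(page, votes)
```

Running something like this against a real link graph would show quickly whether the pages I consider important are the ones collecting the links.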

But really, this seems a bit dry. It implies that every page on your site will be dynamic, and that's just not an option if you want to scale. It also implies a certain page-voting method within your site that may or may not be valid. I'll have to think about this one for a while.

Site Map

 * Offer a site map to your users with links that point to the important parts of your site. 
   If the site map is larger than 100 or so links, you may want to break the site map into separate pages.

This is the easy one to accomplish. The big three (Google, Yahoo, and MSN) all have sitemap recommendations that are easy to follow. Generating the sitemap file is more a background application task than anything else. Still, a human-readable site map is important, and producing it should also be an automated task.
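
As a sketch of what that background task might look like, the snippet below builds an XML sitemap in the sitemaps.org format from a list of pages. The `build_sitemap` helper, the example URLs, and the dates are placeholders of mine, not part of any existing application.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs; lastmod is 'YYYY-MM-DD'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder pages; a real job would pull these from the content store.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2008-01-01"),
    ("http://www.example.com/articles/good-web-page.html", "2008-01-02"),
])
print(sitemap_xml)
```

The same page list could feed the human-readable site map, so both stay in sync without manual work.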

Information Rich

 * Create a useful, information-rich site, and write pages that 
   clearly and accurately describe your content.

Easier said than done :).

But really, the more information a site has, the better. While there has been a big drive towards images and video, textual content is more frequently accessed and sought out by far.

Keyword Relevance

* Think about the words users would type to find your pages, 
  and make sure that your site actually includes those words within it.

SEO 101. You aren't going to rank for terms you don't try to rank for. On-page SEO is a long and difficult conversation, but having keywords specifically located on a page, especially in high-profile areas, is important and necessary. Header tags should be used appropriately, and so should title tags.

Textual Descriptions

* Try to use text instead of images to display important names, content, or links. 
  The Google crawler doesn't recognize text contained in images.

This is again SEO 101, but it's not just important for search engines; it's important for people too.

Proper Tag Use

* Make sure that your TITLE tags and ALT attributes are descriptive and accurate.

This is something we have to attack from a template perspective, but we can also attack it from an application perspective - ensuring that all elements within a given content stream are described and tagged properly.
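
From the application side, a check like the following could run over rendered pages and flag a missing title or images without ALT text. This is only a sketch using Python's standard `html.parser`; the `TagAudit` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class TagAudit(HTMLParser):
    """Collect the page title and the src of every image lacking ALT text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images_missing_alt = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            # Note: this also flags alt="", which is legitimate for purely
            # decorative images, so treat the result as a review list.
            self.images_missing_alt.append(attrs.get("src", "(no src)"))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def audit(html):
    parser = TagAudit()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.images_missing_alt

sample = (
    "<html><head><title>Blue Widgets</title></head><body>"
    '<img src="a.png" alt="A blue widget"><img src="b.png">'
    "</body></html>"
)
title, missing = audit(sample)
print(title, missing)
```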

Content Relevant URLs

* If you decide to use dynamic pages (i.e., the URL contains a "?" character), 
  be aware that not every search engine spider crawls dynamic pages as well as static pages. 
  It helps to keep the parameters short and the number of them few.

Parameterized pages are so 1998. The problem with parameterized pages is that the parameters can come in any order, and they may or may not affect the content on the page. Parameterized pages are also a site-security risk, in that it's easy for a competitor to link to your site with made-up parameters so that it appears to serve a great deal of duplicate content.
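
If parameters can't be avoided entirely, one defensive measure is to reduce every incoming URL to a canonical form: drop parameters the application doesn't know about and sort the rest. A minimal sketch follows, in which the `KNOWN_PARAMS` whitelist is an assumption standing in for whatever a given application really accepts.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical whitelist: the only parameters that change page content.
KNOWN_PARAMS = {"id", "page"}

def canonical_url(url):
    """Drop unknown parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    params = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in KNOWN_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )

a = canonical_url("http://example.com/view?page=2&id=7&junk=1")
b = canonical_url("http://example.com/view?id=7&page=2&sessionid=abc")
print(a)
print(a == b)
```

A request whose URL differs from its canonical form could then be redirected to the canonical one, so that every variant collapses to a single address.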

Limit the # of Links

* Keep the links on a given page to a reasonable number (fewer than 100).

I've fallen into a trap before where my layout itself exceeded this number. No doubt there is no hard rule against sites with extensive intra-site link structures, but there's building an in-site link structure, and there's building an intuitive navigation system for your users.

Technical Guidelines

   *  Use a text browser such as Lynx to examine your site, 
      because most search engine spiders see your site much as Lynx would. 
      If fancy features such as JavaScript, cookies, session IDs, frames, 
      DHTML, or Flash keep you from seeing all of your site in a text browser, 
      then search engine spiders may have trouble crawling your site.
   * Allow search bots to crawl your sites without session IDs or arguments 
     that track their path through the site. These techniques are useful for 
     tracking individual user behavior, but the access pattern of bots is entirely 
     different. Using these techniques may result in incomplete indexing of your site, 
     as bots may not be able to eliminate URLs that look different but actually 
     point to the same page.
   * Make sure your web server supports the If-Modified-Since HTTP header. 
     This feature allows your web server to tell Google whether your content 
     has changed since we last crawled your site. Supporting this feature 
     saves you bandwidth and overhead.
   * Make use of the robots.txt file on your web server. This file tells crawlers 
     which directories can or cannot be crawled. Make sure it's current for your 
     site so that you don't accidentally block the Googlebot crawler. Visit 
     http://www.robotstxt.org/wc/faq.html to learn how to instruct robots when 
     they visit your site. You can test your robots.txt file to make sure you're 
     using it correctly with the robots.txt analysis tool available in 
     Google Webmaster Tools.
   * If your company buys a content management system, make sure that the system 
     can export your content so that search engine spiders can crawl your site.
   * Use robots.txt to prevent crawling of search results pages or other 
     auto-generated pages that don't add much value for users coming from search engines.

Browse with Lynx

   *  Use a text browser such as Lynx to examine your site, 
      because most search engine spiders see your site much as Lynx would. 
      If fancy features such as JavaScript, cookies, session IDs, frames, 
      DHTML, or Flash keep you from seeing all of your site in a text browser, 
      then search engine spiders may have trouble crawling your site.

This is a great guideline and one that I haven't checked out myself very frequently. It's a great idea, and one that I hope to remember this time around.

No Session IDs for Search Bots

   * Allow search bots to crawl your sites without session IDs or arguments 
     that track their path through the site. These techniques are useful for 
     tracking individual user behavior, but the access pattern of bots is entirely 
     different. Using these techniques may result in incomplete indexing of your site, 
     as bots may not be able to eliminate URLs that look different but actually 
     point to the same page.

This one is a bit outdated, but it's still advice to adhere to. One area where I see people trying to track things via URLs is with ads. It's the most stable way to keep track of referring sites, but it's also something you want to stay away from if you can reasonably get by without it.

Cache Control

   * Make sure your web server supports the If-Modified-Since HTTP header. 
     This feature allows your web server to tell Google whether your content 
     has changed since we last crawled your site. Supporting this feature 
     saves you bandwidth and overhead.

Cache control is an application level issue, and it should always be adhered to - not just for search engines, but for site performance.
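
At the application level, the core of If-Modified-Since handling is a date comparison. The sketch below shows only that decision, with `conditional_status` as a hypothetical helper; wiring it into a real web server or framework is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def conditional_status(if_modified_since, last_modified):
    """Return 304 if the content is unchanged since the header date, else 200.

    if_modified_since: raw header value, or None if the header is absent.
    last_modified: timezone-aware datetime of the content's last change.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: serve the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200

last_change = datetime(2008, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status("Tue, 01 Jan 2008 12:00:00 GMT", last_change))
print(conditional_status("Mon, 31 Dec 2007 12:00:00 GMT", last_change))
```

A 304 response carries no body, which is where the bandwidth saving comes from.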

Robots.txt

   * Make use of the robots.txt file on your web server. This file tells crawlers 
     which directories can or cannot be crawled. Make sure it's current for your 
     site so that you don't accidentally block the Googlebot crawler. Visit 
     http://www.robotstxt.org/wc/faq.html to learn how to instruct robots when 
     they visit your site. You can test your robots.txt file to make sure you're 
     using it correctly with the robots.txt analysis tool available in 
     Google Webmaster Tools.

If you don't have a robots.txt file, you'll be sure to get a lot of 404s in your web server logs. Keep in mind that robots.txt is more a guideline than an enforcement mechanism - there are bad bots out there that ignore it. You shouldn't rely on robots.txt to hide sensitive areas of your site, since listing them there only exposes them; instead, ensure that they are not linked to by the outside world.

Exportable CMS

   * If your company buys a content management system, make sure that the system 
     can export your content so that search engine spiders can crawl your site.

This looks like an old guideline, but it makes sense for scalability. If you have > 100K pages, and your site is popular, chances are you will be asked to export your data at some point.

Only Expose Value Pages

   * Use robots.txt to prevent crawling of search results pages or other 
     auto-generated pages that don't add much value for users coming from search engines.

This goes without saying, but last I checked, this was a big one for Google. Some big sites have been knocked down a notch by the search engines for violating this basic rule of etiquette.
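
One way to make sure the rules do what I think they do is to test them with Python's standard `urllib.robotparser` before deploying. The rules and paths below (`/search`, `/tags/`) are hypothetical examples of auto-generated pages, not a recommendation for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking search results and auto-generated tag pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /tags/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

search_allowed = parser.can_fetch(
    "Googlebot", "http://www.example.com/search?q=widgets"
)
article_allowed = parser.can_fetch(
    "Googlebot", "http://www.example.com/articles/widgets.html"
)
print(search_allowed, article_allowed)
```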

Quality Guidelines

   *  Make pages for users, not for search engines. 
      Don't deceive your users or present different content to search engines than 
      you display to users, which is commonly referred to as "cloaking."
   * Avoid tricks intended to improve search engine rankings. 
     A good rule of thumb is whether you'd feel comfortable explaining what you've 
     done to a website that competes with you. Another useful test is to ask, "Does 
     this help my users? Would I do this if search engines didn't exist?"
   * Don't participate in link schemes designed to increase your site's ranking or 
     PageRank. In particular, avoid links to web spammers or "bad neighborhoods" on 
     the web, as your own ranking may be affected adversely by those links.
   * Don't use unauthorized computer programs to submit pages, check rankings, etc. 
     Such programs consume computing resources and violate our Terms of Service. 
     Google does not recommend the use of products such as WebPosition Gold™ that 
     send automatic or programmatic queries to Google.
   *  Avoid hidden text or hidden links.
   * Don't use cloaking or sneaky redirects.
   * Don't send automated queries to Google.
   * Don't load pages with irrelevant keywords.
   * Don't create multiple pages, subdomains, or domains with substantially duplicate content.
   * Don't create pages with malicious behavior, such as phishing or 
     installing viruses, trojans, or other badware.
   * Avoid "doorway" pages created just for search engines, or other "cookie cutter" 
     approaches such as affiliate programs with little or no original content.
   * If your site participates in an affiliate program, make sure that your site adds 
     value. Provide unique and relevant content that gives users a reason to visit 
     your site first.

User Targeted Content

   *  Make pages for users, not for search engines. 
      Don't deceive your users or present different content to search engines than 
      you display to users, which is commonly referred to as "cloaking."

This is by far the single most valuable guideline - not because Google will ignore your site if you break it (on the contrary, many such sites have historically ranked very well) - but because you will never get a repeat visitor if your site spends more energy on rankings than on the end user experience.

Avoid the easy way

   * Avoid tricks intended to improve search engine rankings. 
     A good rule of thumb is whether you'd feel comfortable explaining what you've 
     done to a website that competes with you. Another useful test is to ask, "Does 
     this help my users? Would I do this if search engines didn't exist?"

Gaming search engines is the most talked about strategy in the SEO world, and many a site has seen itself destroyed in the rankings because of such activity. I know I think about on-page SEO when coming up with designs, but I do so not just because of the big guys, but also because I want my pages to rank appropriately in whatever site-search utility I decide to implement. Documenting the decision tree I walk down will help bring out any bad habits I may have picked up.

No Link Farming

   * Don't participate in link schemes designed to increase your site's ranking or 
     PageRank. In particular, avoid links to web spammers or "bad neighborhoods" on 
     the web, as your own ranking may be affected adversely by those links.

This is as much a real world problem as it is a search engine issue. I'm not going to build in anything that will support link scheming as an out of the box feature - but it is a grey area to support automated link exchanges. More on that feature as I get around to it, but I find that the little guy is at a huge disadvantage to the spamming community because of the communication issues involved in exchanging quality links.

No Spam Software

   * Don't use unauthorized computer programs to submit pages, check rankings, etc. 
     Such programs consume computing resources and violate our Terms of Service. 
     Google does not recommend the use of products such as WebPosition Gold™ that 
     send automatic or programmatic queries to Google.

Old School and not very relevant anymore. At least not in my mind, but it's important to keep this in the guidelines.

No Hidden Links

   * Avoid hidden text or hidden links.

Hidden links were a big problem in the link farming days, and they still are a problem on some sites. More important for you and me is ensuring that, no matter what happens, every link is clearly visible. Links displayed in a low-contrast color will be frowned upon and could draw a substantial penalty. People aren't making these decisions (for the most part); computers are. Consider things like how your text will look if images or stylesheets are disabled.
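
I don't know how the search engines actually measure "low contrast", but the WCAG accessibility guidelines define a contrast ratio that makes a reasonable sanity check for link colors. The sketch below implements that formula; using it as a stand-in for a search engine's judgment is purely my assumption.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel (0-255) to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """WCAG relative luminance of an (r, g, b) color."""
    r, g, b = (_linear(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1 (identical colors) up to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)

white = (255, 255, 255)
light_gray = (204, 204, 204)  # a link color that nearly vanishes on white
print(round(contrast_ratio(light_gray, white), 2))
```

WCAG asks for a ratio of at least 4.5 for normal body text, so a link color that falls well under that is worth a second look.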

No Cloaking

   * Don't use cloaking or sneaky redirects.

This is the one guideline I have a real problem with. Yes, cloaking and showing entirely different content to search engines is a problem. The problem in my mind has to do with ad and content rotation to push things that end users become blind to into areas where they are more likely to get attention. For instance, occasionally moving an ad from the header area into the content area.

There is also the issue of showing initial visitors content that would help push them to register, but not showing that to search engines because it isn't relevant to them.

Still - I think I can accomplish my goals while still keeping this guideline in mind.

No Bugging Google

   * Don't send automated queries to Google.

Funny - there are scrapers galore out there. Every SEO company has their own in-house program to track rankings in the search engines. Odd that with all that brain power working on a problem nobody noticed that all the information you need about rankings ends up in your web server logs.
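
Here is a rough sketch of what I mean. It assumes logs in the common "combined" format, and it assumes the search referrer carries the query in a `q` parameter and the result offset in a `start` parameter; the log line itself is invented for the example.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches the request, status, size, and referrer fields of a combined log line.
LOG_PATTERN = re.compile(
    r'"(?:GET|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)

def search_referral(log_line):
    """Return landing page, keywords, and result offset for a Google referral."""
    match = LOG_PATTERN.search(log_line)
    if not match:
        return None
    referrer = urlsplit(match.group("referrer"))
    query = parse_qs(referrer.query)
    if "google." not in referrer.netloc or "q" not in query:
        return None
    return {
        "landing_page": match.group("path"),
        "keywords": query["q"][0],
        # Offset of the results page the visitor clicked from (0 = first page).
        "start": int(query.get("start", ["0"])[0]),
    }

line = (
    '1.2.3.4 - - [10/Oct/2008:13:55:36 -0700] "GET /page.html HTTP/1.1" '
    '200 2326 "http://www.google.com/search?q=good+web+page&start=10" '
    '"Mozilla/5.0"'
)
hit = search_referral(line)
print(hit)
```

The offset only says which page of results the click came from, not the exact position, but aggregated over time that is enough to track how a page is ranking for a given phrase.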

No Irrelevant Keywords

   * Don't load pages with irrelevant keywords.

This is a problem mainly with people trying to do things like including hidden content about mesothelioma so that they would get high paying ads to show up on their site.

No Duplicate Content

   * Don't create multiple pages, subdomains, or domains with substantially duplicate content.

This is another guideline I have a problem with. I should be able to publish multiple versions of the same page without fear of retribution. For instance - I might want a downloadable PDF file, an HTML file, a word document, and an open office file available all with substantially the same content.

I tried asking Matt Cutts about ways to structure things so that they would not violate this guideline, but my question was never answered (in his defense, he was surrounded by plenty of people asking questions that were easier to answer). I asked other Google engineers and they flat out said "Don't even try it".

Aside from the case of useful duplicate content, I don't have a problem with the guideline. Of course there are issues with higher ranking sites stealing your content, but this isn't the place to go off on that topic. IMO, links to printable and downloadable content should be fair game, but I'll try to respect this guideline by putting the alternate versions in subdirectories, or something similar, that can be addressed with robots.txt.

No Malicious Behavior

   * Don't create pages with malicious behavior, such as phishing or 
     installing viruses, trojans, or other badware.

Goes without saying.

No Doorway Pages

   * Avoid "doorway" pages created just for search engines, or other "cookie cutter" 
     approaches such as affiliate programs with little or no original content.

They aren't called doorway pages anymore, and the structure has substantially changed, but these pages abound like never before. I refuse to participate.

No Earning Money

   * If your site participates in an affiliate program, make sure that your site adds 
     value. Provide unique and relevant content that gives users a reason to visit 
     your site first.

Placing ads on a site before a user base is apparent is a bad idea in general - not only does it hurt your rankings, but it turns people off before they get a chance to know you. The flip side is that a large end user base might revolt when you try to earn money.
