Tuesday, April 6, 2010

On Search Engine Optimization: Why canonical URLs matter

With the explosion of the amount of information organizations make accessible over the Internet, searchability and navigability are increasingly prominent on CMS wish lists. How do you structure site navigation when you have a bit more content than the average company?


Gov.nl

Consider the case of the Dutch national government: To provide convenient access to all governmental information citizens may need on a daily basis, it decided to bring information from 16 different ministries into one CMS.

To make centralization effective, it needs to be paired with smart navigation. Larger websites typically provide more than one path to the same content. Technically this is achieved by tagging content instead of dropping it into fixed folders. Hippo CMS helps authors create these tags quickly, and as of April 2010 citizens of the Netherlands have a number of centralized starting points to a large chunk of central government information (in Dutch):




Canonical URLs
Flexible navigation may be great for users, but it is not ideal for search engines. To offer compact results, search engines try to condense the list of URLs gathered from crawling through different paths to the same piece of content. Using a variety of algorithms, a search engine then decides which URL is the leading road to the content.

Leaving this decision to the search engine is a risky choice. Various URLs for the same content compete with each other in the rankings, and the site owner loses control over incoming links. For faceted search paths the number of navigation routes to a single piece of content is practically infinite. If not managed carefully, offering faceted paths could even lead to accidental blacklisting, as the search engine might conclude you're trying to game its ranking algorithm.
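To get a feel for how quickly faceted paths multiply, here is a rough Python sketch. It rests on an assumption that is not taken from any particular CMS: every facet can be applied at most once, and the order in which a visitor picks facets ends up in the URL, so each ordering counts as a separate path. The function name count_facet_paths is made up for this illustration.

```python
from math import factorial


def count_facet_paths(facet_count):
    """Count the ordered selections of one or more facets out of facet_count.

    Assumption for this sketch: each ordering of facets yields its own URL.
    """
    return sum(
        # Number of ways to pick `chosen` facets in a specific order.
        factorial(facet_count) // factorial(facet_count - chosen)
        for chosen in range(1, facet_count + 1)
    )


for facets in (3, 5, 10):
    print(facets, "facets ->", count_facet_paths(facets), "possible paths")
```

Under these assumptions three facets already give 15 paths to the same article, and ten facets give close to ten million.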



This is where canonical URLs come in. Tagging pages with a canonical URL helps a search engine understand that multiple URLs should be listed only once. Hippo builds canonicalization into its CMS design, making sure that the dominant URL is provided to the search engine for each unique piece of content. This approach combines multiple flexible navigation paths with optimized search results and consistency for incoming links.
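The idea can be sketched in a few lines of Python. This is not how Hippo CMS implements it. The lookup table, the example.org site and the paths are all hypothetical, and a real CMS would derive the canonical path from its content repository rather than from a hard-coded dictionary.

```python
from html import escape

# Hypothetical site and lookup table: every navigation path a visitor can
# take, mapped to the single canonical path of the document it displays.
SITE = "http://www.example.org"
CANONICAL_PATHS = {
    "/topics/alcohol/news/2010/04/01/tougher-dui-laws.html": "/news/2010/04/01/tougher-dui-laws.html",
    "/ministries/justice/news/2010/04/01/tougher-dui-laws.html": "/news/2010/04/01/tougher-dui-laws.html",
    "/news/2010/04/01/tougher-dui-laws.html": "/news/2010/04/01/tougher-dui-laws.html",
}


def canonical_link_tag(requested_path):
    """Return the <link> tag to place in the <head> of the requested page."""
    # Paths missing from the table are treated as their own canonical path.
    canonical_path = CANONICAL_PATHS.get(requested_path, requested_path)
    href = escape(SITE + canonical_path, quote=True)
    return f'<link rel="canonical" href="{href}" />'


for path in CANONICAL_PATHS:
    print(path, "->", canonical_link_tag(path))
```

Whichever of the three paths a visitor or crawler follows, the page carries the same tag, so incoming links and rankings are credited to one URL.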


Example
Let's have a more detailed look at the government example. The following are just two sample URLs leading to an article about tougher laws for repeat DUIs:

http://www.rijksoverheid.nl/onderwerpen/alcohol/nieuws/2010/04/01/rijden-onder-invloed-harder-aangepakt.html


In the source of both pages, invisible to the casual browser, a tag tells the search engine where to place this piece of content in its results.

<link rel="canonical" href="http://www.rijksoverheid.nl/nieuws/2010/04/01/rijden-onder-invloed-harder-aangepakt.html" />
And this is what Google returns when searching for a few of the article's keywords - Mission accomplished!
  1. Rijden onder invloed harder aangepakt | Nieuwsbericht ...

    1 april 2010 ... Rijden onder invloed harder aangepakt ... Ministerraad. Bel 0800-8051 voor vragen aan de Rijksoverheid ... Zoek binnen rijksoverheid.nl ...
    www.rijksoverheid.nl/.../rijden-onder-invloed-harder-aangepakt.html - Cached
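
To check for yourself which canonical URL a page announces, you can read the tag out of the page source. The following is a minimal sketch using only Python's standard library. The class and function names are made up for this example, the page source is a trimmed-down stand-in for the real page, and an actual search engine crawler will of course do far more than this.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(page_source):
    """Return the canonical URL declared in page_source, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(page_source)
    return finder.canonical


# Trimmed-down stand-in for the page source of the article above.
PAGE = """<html><head>
<title>Rijden onder invloed harder aangepakt</title>
<link rel="canonical" href="http://www.rijksoverheid.nl/nieuws/2010/04/01/rijden-onder-invloed-harder-aangepakt.html" />
</head><body></body></html>"""

print(find_canonical(PAGE))
```

Running the same check against every navigation path to an article should produce one and the same URL each time.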



For a more in-depth description of how you can create canonical URLs with Hippo CMS, check out our public wiki.





Tuesday, January 26, 2010

Why WCM/Portal convergence is a good thing

Stephen Powers' blog post on convergence between WCM and Portal sparked a nice little controversy about the alleged trade-off between integration and separation of concerns. It ain't necessarily so: if integration is done right, the trade-off does not need to be there at all.

Convergence is a two-way street. From the WCM perspective we like to think of Portals as a way to offer 'self service', personalization, security and integration with other applications / widgets / iframes and the like. From the Portal perspective we need WCM to provide tools to work with the portal content that does not reside in other applications.

Both are, of course, nothing new. Neither is the fact that vendors (Hippo with Hippo CMS and Apache Jetspeed Portal is the open source example) have been offering integrated WCM and Portal packages for a number of years. Given the challenges and costs involved in true integration, it clearly makes sense to offer integrated solutions so as to keep such projects manageable. And who is better placed to ensure integration is done right than the vendor itself?

New in Stephen's post is the fact that he, or rather IBM, sees the package of WCM & Portal becoming such a common combination that distinguishing the two markets would no longer be meaningful. This does not mean that WCM and Portals will become an indistinguishable mesh (or mess?) with the risk of losing all we gained from separating content from the presentation layer in the first place.

It does mean, just like before, that buyers should be careful not to select a package that restricts their choice in where and how to manage and publish their content. It also means that buyers should be critical when making decisions about such things as collaboration platforms. Collaboration systems put in place today may be with us for many years to come. Where does the content reside? Can it be accessed, altered and integrated in other applications? What about the source code? Remember Lotus Notes?

For maximum flexibility and future readiness in purchasing an integrated WCM & Portal, there are four basic questions to ask:
1. Can the WCM system stand on its own? Would you buy it for its WCM functionality?
2. Can the Portal stand on its own? Would you buy it for its Portal functionality?
3. Are the two really integrated? Think of integrated user management, security, administration, URL mapping, ease of development, etc.
4. Is it Open Source (i.e., are you free to use and change the software as you wish)?