2010 Exceptional Web Experience

(1)

TECH-B21

Search Engine Optimization and WebSphere Portal - Best Practices

Andreas Prokoph, Lead architect – Search in Portal and WCM, Portal development

([email protected])

(2)

Agenda

Introduction

User patterns using Internet search engines

What makes a webpage 'relevant'?

What features does WebSphere Portal provide to support SEO?

Reference materials

(3)

(4)

“Why aren’t our pages on the top in Google?”

… asks your boss

• You did your best – you thought ...

• Even you were wondering what was going on

• … Keywords are there …

• … Metadata polished …

• … beautiful URLs …

(5)

Search marketing versus ‘organic search’

Paid placement

Organic search result

(6)

“Search and your Internet presence”

• Typically two steps are involved in attracting traffic and keeping or winning users and customers:

– The first step is to attract users and get them to your website

• Internet search

– The second step starts once they are at your site: they might already have found what they were looking for, or they might look for more

• Site search

Google attracts users to a website

If the site is interesting enough, they are likely to ‘search’ for more information at that website

(7)

(8)

The golden triangle from an eye-tracking study

Aggregate map: all consumer search activity

Red is most-viewed; black is unviewed.

Source: Enquiro

(9)

Ideally this is where you want your page to score: in the Golden Triangle

Visitors view their results for an average of 6.3 seconds before clicking on a link.

Just enough time to scan the first three to five results and the top one or two ads. Chances are: if

(10)

What makes a webpage a good webpage?

• Good content and information

• Focused content and information

• … and finally .. interesting enough for others (external)

(11)

Why do search engines keep their ranking metrics a secret?

• If the metrics were published, too many people would take advantage of them to try to boost their pages

• In the end: Internet search engines would be worthless to the broad community!

• A bad example from the past:

– AltaVista emphasized keywords in titles and metadata

– Within 1 year AltaVista was ‘dead’ because the quality of its search results was declining rapidly

➢ Imagine you were ‘Google’ – would you take chances to ruin your

(12)

What does not work ….

• Metadata usage …

– stuffing metadata fields like ‘Keywords’ and ‘Description’ with all kinds of (unrelated) keywords

• ‘Alt tag’ stuffing …

– the alt attribute is meant to describe what a certain image is about ..

• example:

<img alt="windows, ABC consulting, windows, developer, tutorials, ABC consulting, developer, windows, tutorial, tutorial, tutorials, resources, windows, tutorials, developer" />

• .. and there is also: ‘search engine friendly URLs’

– a widespread misconception as to how search engines work

– for a crawler: if it gets a return code '200', then that URL is OK

(13)

A word about metadata usage by Internet search engines

• Title and description information is important for the initial representation of a webpage in the search result list

• However: neither these two nor any other metadata fields play any part in determining the webpage's relevancy

– one could still argue about ‘Title’ – again: in the past that was one of the metrics documented by AltaVista and ... we know how that went ...

• For an example of 'official' webmaster recommendations, see:

(14)

… User friendly URLs .. why would you think they are relevant?

• When looking at the presentation of a Google search result, the assumption is that Google highlights whatever has contributed to a webpage’s relevance ranking

• Truth is: the highlighting is straightforward and simply marks in bold everything which matches one of the keywords, assuming the user will then be better able to judge a page's relevance for themselves...

(15)

How search engines work –

having discussed what DOESN'T work,

(16)

The following are the main metrics that get applied

• Page or document relevancy

– term frequency times inverse document frequency

– tf x idf

– To some degree … Hypertext-Matching Analysis: analyzes the full content of a page and also analyzes the content of linked local pages

• PageRank

– link popularity - how important other Internet users think a specific page is

• Note that Google specifically has a set of more than 100 rules that potentially can get applied (for various purposes)

– Note further: if one of the rules is seen to be misused, then its
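To make 'tf x idf' concrete, here is a minimal sketch of one common textbook variant. It is not the formula any particular search engine uses; the three sample documents and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Textbook tf x idf: relative term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)   # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Illustrative mini corpus (assumption, not taken from the slides)
corpus = [
    "websphere portal search engine optimization".split(),
    "portal pages and portal themes".split(),
    "search engine crawlers follow links".split(),
]

rare = tf_idf("websphere", corpus[0], corpus)    # term occurs in 1 of 3 documents
common = tf_idf("portal", corpus[0], corpus)     # term occurs in 2 of 3 documents
absent = tf_idf("links", corpus[0], corpus)      # term does not occur in this document
```

A term that is rare across the corpus contributes more to a page's score than one found on most pages, which is why repeating generic keywords does little.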

(17)

Basic relevance – about ‘precision’ and ‘recall’

Comparison between two search engines: one is real-world, the other is the ideal one

[Figure labels: optimal, good]
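As a reminder of what the two terms measure: precision is the share of returned results that are relevant, and recall is the share of relevant documents that were returned. The sketch below uses invented page identifiers to compare a plausible real-world result list with an ideal one.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against the set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four relevant pages exist for the query
relevant = {"p1", "p2", "p3", "p4"}

real_world = precision_recall(["p1", "p2", "p7", "p8", "p9"], relevant)
ideal = precision_recall(["p1", "p2", "p3", "p4"], relevant)
```

Here the real-world engine reaches a precision of 0.4 and a recall of 0.5, while the ideal engine reaches 1.0 on both.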

(18)

PageRank – the obvious …

[Figure: link graph of pages A, B and C]

Which webpage would you think is the more important one?

(19)

How PageRank is calculated

PageRank formula, put simply:

PR(A) = (1 - d)/N + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))

Definition:

PR(A) is the PageRank of document A

PR(T_i) is the PageRank of document T_i, which includes a link to document A

N is the count of qualifying documents

C(T_i) is the total count of links on page T_i

d is the damping factor, where 0 ≤ d ≤ 1

PageRank calculation for the three pages shown on the left, with d = 0.85:

PR(A) = 0.15 + 0.85 * PR(C)

PR(B) = 0.15 + 0.85 * (PR(A)/2)

PR(C) = 0.15 + 0.85 * (PR(B) + PR(A)/2)

Note: this calculation uses the constant term (1 - d) = 0.15 rather than (1 - d)/N.
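The three equations for A, B and C can be solved by simple iteration. The sketch below assumes the link structure those equations imply (A links to B and C, B links to C, C links to A) and, like the worked calculation, uses the constant term (1 - d) without dividing by N. The function name and iteration count are illustrative choices.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over all pages q linking to p."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(pr[q] / len(links[q]) for q in links if page in links[q])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Outgoing links per page, as implied by the three equations
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

The iteration settles at roughly PR(A) = 1.16, PR(B) = 0.64 and PR(C) = 1.19, so C comes out slightly ahead of A, and B, which receives only half of A's vote, ranks lowest.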

(20)

Can I influence my PageRank?

[Figure: link graph of pages A, B and C]

With this: again, which is the more important webpage?

… and also 100 links on it, then a referenced (linked) page would only receive 2/100ths of its PageRank score … which is .. just about nothing ..

(21)

OK, so what can I do? – Part 1

• Ensure proper crawling of your website

– Redirects only if required!

– Don't even think of redirecting only crawlers

– ‘What the crawler gets is the same as what the user gets’!

– no JavaScript to generate content or URLs

– have good navigation – e.g. crawlers like a Sitemap!

• Sitemaps 0.90 protocol support

OK, so what can I do? – Part 2

• Publish appropriate content

– not too little and not too much information on a web page

– note that Flash objects and images might hide essential information from the crawlers

– Focus on what you want to tell your users or customers first!

– Then think about what keywords users (not you!) might choose to find

• reconsider in cases of mismatch to adjust the keywords on your web

(23)

What MORE can I do to improve my PageRank?

• Let’s face it: Not much!

• Seek relationships with trusted web sites and share information with one another

• Register your web site with Yahoo!

• Make your web pages easy to cross-link

– Note that more and more web sites, depending on the platform (like WebSphere Portal), have URLs which somehow reference the user’s history (e.g. session object/ID)

– consider making use of 'user friendly URLs' for users to pick up and use

(24)

A brief word on the Sitemaps 0.90 protocol ..

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2019-06-09</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

No … it doesn't influence the PageRank score of that page
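A sitemap like the one shown can be generated with a few lines of code. This is a minimal sketch using Python's standard library; the function name `build_sitemap` and the dictionary-based input are assumptions made for illustration, and the entry repeats the example values from the slide.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Serialize a list of URL entries (dicts) into a Sitemaps 0.90 <urlset> document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # <loc> is required by the protocol; the other three elements are optional
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2019-06-09",
     "changefreq": "weekly", "priority": 0.8},
])
```

The `priority` value only tells the crawler how pages rank relative to each other within the same site.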

(25)

(26)

How do I promote my pages?

• Question: how will users know what information there is at my web site if they don’t find it easily (top) through Google?

• As mentioned in the beginning: this is also a two-phase approach:

– get the high-level pages in good shape with all the important keywords you have selected

– once users get to your homepage, they might be inspired to look for more information

– How: does your site have good search?

– If so: they’ll most likely find the information – if it’s there!

• If what they find is good – they might consider pointing to those

(27)

WebSphere Portal and SEO enablement

• Search engine crawler awareness since Portal V6.0

– Portal Server will externalize ‘normalized URLs’ for the crawler

• URLs which do not maintain e.g. navigational state

• Sitemaps 0.90 protocol support

– www.sitemaps.org

– developerWorks article: http://www.ibm.com/developerworks/library/x-sitemaps/index.html

• Search Sitemap Utility portlet download: https://greenhouse.lotus.com/plugins/plugincatalog.nsf/assetDetails.xsp?action=editDocument&documentId=A1FF51D2C2E82CBE852576AB006ED590

• Remains: bookmark support for

(28)

… almost forgot .. for all those still insisting on getting 'search engine friendly URLs'

(29)

search engine

User

friendly-URLs

• Friendly-URLs result in human readable URL prefixes that lead to portal pages

• Each content node might have a friendly name assigned

• The friendly-URL is a hierarchical path constructed from these names based on the content topology (see URL mappings)

• Every URL that is generated by WP APIs will contain the friendly-path automatically

– It is even guaranteed that every URL that leads to a particular page will start with the page’s friendly-path

Content Nodes: root, home, shop, info, shoes

/wps/portal/home

/wps/portal/home/shop
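How a hierarchical friendly path follows from the content topology can be sketched as below. This is not the WebSphere Portal API. The parent relationships are an assumption: the slide only lists the node names and the two paths, so the placement of 'info' and 'shoes' is a guess made for illustration.

```python
# Assumed topology (child -> parent); only home and home/shop are confirmed by the slide
PARENT = {"home": "root", "shop": "home", "info": "home", "shoes": "shop"}

def friendly_path(node, parents=PARENT, base="/wps/portal"):
    """Build the friendly path by walking from the node up to the root."""
    names = []
    while node in parents:          # the root has no parent and contributes no name
        names.append(node)
        node = parents[node]
    if not names:
        return base
    return base + "/" + "/".join(reversed(names))

home_url = friendly_path("home")
shop_url = friendly_path("shop")
```

With this topology the two calls reproduce the paths from the slide, /wps/portal/home and /wps/portal/home/shop.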

(30)

(31)

Public Portal pages and how Internet search engines work

Web crawlers

Search indexes

Homepage or Sitemap

Crawler follows ‘hrefs’

only 'GET' requests

no JavaScript interpreted

Portal Server recognizes the crawler and triggers URLs to be normalized.
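The effect of 'follows hrefs, no JavaScript interpreted' can be illustrated with a toy link extractor. It is a simplified stand-in for a crawler, not how any specific search engine is implemented, and the sample page and its URLs are invented.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href values of <a> tags, the only links this toy crawler follows."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

# Hypothetical page: two plain links and one link that exists only via JavaScript
page = """
<a href="/wps/portal/home">Home</a>
<a href="/wps/portal/home/shop">Shop</a>
<span onclick="location.href='/wps/portal/home/info'">Info</span>
"""

collector = HrefCollector()
collector.feed(page)
found = collector.hrefs
```

Only the two plain links end up in `found`; the page reachable solely through the onclick handler stays invisible to this crawler.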

(32)

The fundamental problem –

for web-crawlers!

[Figure: the Welcome page links to Page A (URL-A) and Page B (URL-B); Page A links to Page C (URL-C); Page C links to Page A (URL-D) and Page B (URL-E)]

Portal encodes in URLs additional information about the navigational state of the user, such as which page they come from and how they left it (e.g. a portlet was maximized)

Information encoded within URLs:

URL-A – Target: Page A, coming from Welcome page

URL-B – Target: Page B, coming from Welcome page

URL-C – Target: Page C, coming from Page A

URL-D – Target: Page A, coming from Page C

URL-E – Target: Page B, coming from Page C

A crawler would want to assume: URL-A and URL-D to be identical
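The duplicate problem and its fix can be sketched as follows. The URL layout is purely hypothetical: real Portal URLs encode navigational state differently, and the `state` query parameter and the host name are stand-ins chosen only to make the idea visible.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Drop the (hypothetical) navigational-state query string and keep only the page path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Five links as a crawler would see them, mirroring URL-A to URL-E
urls = {
    "URL-A": "http://www.example.com/wps/portal/pageA?state=from-welcome",
    "URL-B": "http://www.example.com/wps/portal/pageB?state=from-welcome",
    "URL-C": "http://www.example.com/wps/portal/pageC?state=from-pageA",
    "URL-D": "http://www.example.com/wps/portal/pageA?state=from-pageC",
    "URL-E": "http://www.example.com/wps/portal/pageB?state=from-pageC",
}

raw_pages = set(urls.values())                       # what a naive crawler counts as pages
unique_pages = {normalize(u) for u in urls.values()}  # what actually exists
```

Without normalization the crawler counts five distinct pages; after dropping the state it finds the three that actually exist, with URL-A and URL-D collapsing to the same address.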

(33)

A thing of the past - How Internet search engines had seen a Portal site

Web crawlers

Search indexes

This set of pages represents the structure of the Portal site.

This set of pages the crawler retrieves and assumes to be unique based on the link structure of the site.

(34)

WebSphere Portal V6 – crawlability enablement!

Web crawlers

Search indexes

Portal Server recognizes the crawler and triggers URLs to be normalized.

Result:

• no more ‘duplicate’ pages

• all linked and public Portal pages are crawled and indexed

Normalized URL = all navigational state information is discarded from the URL

(35)

Crawlability enablement for Search Engines

Crawler awareness - the Portal Server recognizes a crawler by its user agent identifier. A default list already covers the 50 most popular crawlers (matching is by pattern, so potentially more are covered). The Portal then outputs all URLs on the pages as so-called normalized URLs, thus making them unique. In addition, action URLs are nullified, so crawlers cannot execute actions such as 'delete document' or 'login'.
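Recognizing a crawler by pattern matching on its user agent string might look like the sketch below. The three patterns are an illustrative selection of well-known crawler tokens, not Portal's actual default list, and the function name is hypothetical.

```python
import re

# Illustrative patterns only; Portal ships its own default list
CRAWLER_PATTERNS = [r"googlebot", r"slurp", r"msnbot"]
CRAWLER_RE = re.compile("|".join(CRAWLER_PATTERNS), re.IGNORECASE)

def is_crawler(user_agent):
    """Return True if the User-Agent header matches one of the crawler patterns."""
    return bool(CRAWLER_RE.search(user_agent or ""))

bot = is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
human = is_crawler("Mozilla/5.0 (Windows NT 6.1; rv:1.9.2) Gecko/20100115 Firefox/3.6")
```

Because the match is a substring pattern rather than an exact comparison, variants of a crawler's user agent string are covered without listing each one.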

A Sitemap portlet - a page a crawler can be pointed at. For efficiency reasons it might be advisable to also place appropriate robot directives into the theme, so that the crawler only follows the links available in the Sitemap and does not have to re-evaluate the links found on each of the pages.

Search Engine Utility portlet – provides support for the ‘Sitemap 0.90’ protocol, which is supported by Google, Yahoo! and Microsoft Live Search

In Summary – Portal provides the means for a complete crawl of a portal site (public pages) and the tools for adequate linking of portal pages from external sites to support PageRank.

(36)

Export as Google Sitemap

WebSphere Portal – Search Engine Utility portlet

Added feature – Export to Google Sitemap

Export the Sitemap to a Google Sitemap XML file.

Click on the Browse button to specify where in the filesystem to store the output XML sitemap file.

(37)

Search Engine Utility portlet – configuration/editing option

Default values:

Update frequency: Weekly

Priority: 0.5

(38)

Dynamic and personalized content – what crawlers will not get hold of …

• Crawlers try not to fetch dynamic or personalized content

– there might be spoofing involved !?!

– in the past this was the main reason for truncating URLs after the first ‘?’

• What can be done:

– have the Web content management system generate a link list of the non-default (dynamic) content

– append it to, or reference it from, the website’s sitemap

(39)

Summary

• WebSphere Portal allows for safe and efficient crawling of Portal sites

• Efficiency is increased through support for the Sitemaps 0.90 protocol

• A good page is determined by its content

• Link popularity is the additional boost factor

• Consult the 'webmaster guides' that the search engine sites publish

• If an SEO consultant suggests metadata 'spamming' or 'pretty URLs'

→ ask them for proof: either webpages applying such techniques (before and after), or official documentation from Google et al.

• 'dynamic Portal URLs' prevent from getting adequate ranking

(40)

Excellent book on ‘Search Engine Marketing’!

• Search Engine Marketing, Inc.: Driving Search Traffic to Your Company's Web Site, Mike Moran, Bill Hunt, IBM Press

– http://www.amazon.de/Search-Marketing-Driving-Traffic-Companys/dp/0131852922/ref=sr_1_1?ie=UTF8&s=books-intl-de&qid=1202128301&sr=1-1

– Acknowledgement: Overview slides taken from Mike’s SEO presentation

• developerWorks articles – Basics on SEO – Part 1-4

– http://www.ibm.com/developerworks/search/searchResults.jsp?searchType=1&pageLang=&displaySearchScope=dW&searchSite=dW&lastUserQuery1=search+engine+optimization&lastUserQuery2=&lastUserQuery3=&lastUserQuery4=&query=search+engine+optimization+basics&searchScope=dW&Go.x=0&Go.y=0

(41)

For More Information (1)

WebSphere Portal – IBM Site

http://www-3.ibm.com/software/genservers/portal/

WebSphere Portal Information Center

http://www.ibm.com/developerworks/websphere/zones/portal/proddoc.html

WebSphere Portal Business Solutions Catalog (on Lotus Greenhouse)

https://greenhouse.lotus.com/catalog/home_full.xsp?fProduct=WebSphere%20Portal

WebSphere and Lotus Web Content Management Portal Open Beta

https://www14.software.ibm.com/iwm/web/cc/earlyprograms/lotus/portalopenbeta/

WebSphere Portal Blog

(42)

For More Information (2)

IBM Lotus Connections

http://www.ibm.com/software/lotus/products/connections

IBM Lotus Forms

http://www.ibm.com/software/lotus/forms

IBM Lotus Quickr

http://www.ibm.com/lotus/quickr

IBM Lotus Sametime

http://www.ibm.com/lotus/sametime

WebSphere Commerce

http://www.ibm.com/websphere/commerce

WebSphere Process Server and Business Process Automation

(43)


(44)

The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.