TECH-B21
Search Engine Optimization and WebSphere Portal - Best Practices
Andreas Prokoph, Lead architect – Search in Portal and WCM, Portal development
Agenda
Introduction
User patterns using Internet search engines
What makes a webpage 'relevant'?
What features does WebSphere Portal provide to support SEO
Reference materials
“Why aren’t our pages on the top in Google?”
… asks your boss
• You did your best – you thought ...
• Even you were wondering what was going on
• … Keywords are there …
• … Metadata polished …
• … beautiful URLs …
Search marketing versus ‘organic search’
Paid placement
Organic search result
“Search and your Internet presence”
• Typically two steps are involved to attract traffic and keep/win users/customers:
  – First step is to attract and get users to your website
    • Internet Search
  – Second step is once they are at your site, they might have already found what they were looking for, or even look for more
    • Site search
Internet search attracts users to a website
If interesting enough they are likely to ‘search’ for more information at that website
The golden triangle from an eye-tracking study
Aggregate map: all consumer search activity. Red is most-viewed; black is unviewed.
Source: Enquiro
Ideally this is where you want your page to score: in the Golden Triangle
Visitors view their results for an average of 6.3 seconds before clicking on a link. Just enough time to scan the first three to five results and the top one or two ads. Chances are: if
What makes a webpage a good webpage?
• Good content and information
• Good content and information
• Focused content and information
• … and finally .. interesting enough for others (external)
Why do search engines keep their ranking metrics a secret?
• If the metrics were published, too many would take advantage of them, trying to boost their pages
• In the end: Internet search engines would be worthless to the broad community!!
• Bad example of the past:
  – AltaVista – placed emphasis on keywords in titles and metadata
  – Within 1 year AltaVista was ‘dead’ because its search result quality was declining rapidly
➢ Imagine you were ‘Google’ – would you take chances to ruin your
What does not work ….
• Metadata usage …
  – stuffing metadata fields like ‘Keywords’ and ‘Description’ with all kinds of (unrelated) keywords
• ‘Alt tag’ stuffing …
  – used to describe what a certain image is about ..
    • example:
      <img alt="windows, ABC consulting, windows, developer, tutorials, ABC consulting, developer, windows, tutorial, tutorial, tutorials, resources, windows, tutorials, developer" />
• .. and there is also: ‘search engine friendly URLs’
  – a widespread misconception as to how search engines work
  – for a crawler – if it gets a return code '200', then that URL is OK
A word about metadata usage by Internet search engines
• Title and description information is important for the initial representation of a webpage in the search result list
• However: these two and any other metadata are not relevant in any way for determining the webpage's relevancy
  – one could still argue about ‘Title’ – again: in the past that was one of the metrics documented by AltaVista and ... we know how that went ...
• For an example of 'official' webmaster recommendations, see:
… User friendly URLs .. why would you think it’s relevant?
• When looking at the presentation of a Google search result, the assumption is that Google highlights whatever has contributed to a webpage’s relevance ranking
• Truth is: the highlighting is straightforward and simply marks in bold everything which matches one of the keywords, assuming the user will then be better able to judge a page's relevance for themselves...
How search engines work – having discussed what DOESN'T work
The following are the main metrics that get applied
• Page or document relevancy
  – term frequency times inverse document frequency
  – tf x idf
  – To some degree … Hypertext-Matching Analysis: analyzes the full content of a page and also analyzes the content of linked local pages
• PageRank
  – link popularity - how important other Internet users think a specific page is
• Note that Google specifically has a set of more than 100 rules that potentially can get applied (for various purposes)
  – Note further: if one of the rules is seen to be mis-used, then its
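The 'tf x idf' metric named above can be illustrated with a minimal sketch. It assumes the textbook definitions (tf = term count divided by document length, idf = log of corpus size divided by the number of documents containing the term); real search engines use their own, unpublished variants. The function name and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Score `term` for one tokenized `document` within a tokenized `corpus`.

    Assumption: textbook tf x idf, not any search engine's actual formula.
    """
    if not document:
        return 0.0
    # Term frequency: how often the term occurs, relative to document length.
    tf = Counter(document)[term] / len(document)
    # Document frequency: in how many documents the term occurs at all.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    # Inverse document frequency: rare terms count more than common ones.
    idf = math.log(len(corpus) / df)
    return tf * idf


# Hypothetical three-document corpus, already split into words.
corpus = [
    "websphere portal search sitemap".split(),
    "portal pages and portal themes".split(),
    "search engine ranking basics".split(),
]

rare = tf_idf("sitemap", corpus[0], corpus)   # occurs in one document only
common = tf_idf("portal", corpus[0], corpus)  # occurs in two documents
```

In this corpus 'sitemap' scores higher than 'portal' for the first document, because it appears in fewer documents overall. Stuffing a page with a keyword raises tf only; it does nothing for the other factors.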
Basic relevance – about ‘precision’ and ‘recall’
Comparison between two search engines: one is real world, the other the ideal one
[Figure labels: optimal, good]
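Precision and recall can be made concrete with a small sketch. It assumes the standard definitions: precision is the share of returned results that are relevant, recall is the share of relevant pages that were returned. The page names and both result lists are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one result list.

    Assumption: standard set-based definitions; ranking order is ignored.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four pages are actually relevant to the query.
relevant = {"page1", "page2", "page3", "page4"}
real_world = ["page1", "page2", "page9", "page8"]  # two hits, two misses
ideal = ["page1", "page2", "page3", "page4"]       # exactly the relevant set

real_scores = precision_recall(real_world, relevant)
ideal_scores = precision_recall(ideal, relevant)
```

The ideal engine reaches 1.0 on both measures; the real-world one trades some of each.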
PageRank – the obvious …
[Figure: three linked webpages A, B and C]
Which webpage would you think is the more important one?
How PageRank is calculated
PageRank formula put simply:
PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Definition:
PR(A) is the PageRank of document A
PR(Ti) is the PageRank of document Ti, which includes a link to document A
N is the count of qualifying documents
C(Ti) is the total count of links on page Ti, and
d is a confidence value (damping factor), where 0 ≤ d ≤ 1
PageRank calculation for the three pages shown on the left (with d = 0.85 and the constant term written as 1-d = 0.15, i.e. not divided by N):
PR(A) = 0.15 + 0.85 * PR(C)
PR(B) = 0.15 + 0.85 * (PR(A)/2)
PR(C) = 0.15 + 0.85 * (PR(B) + PR(A)/2)
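The three equations for pages A, B and C can be solved by simple iteration. The sketch assumes the link structure implied by those equations (A links to B and C, B links to C, C links to A) and uses the same constant term 1-d as the example, without the division by N. The function name, tolerance and iteration limit are illustrative choices.

```python
def pagerank(links, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over pages q linking to p.

    `links` maps each page to the list of pages it links to.
    Assumption: un-normalized constant term (1 - d), as in the slide example.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(max_iter):
        new = {}
        for p in pages:
            # Each linking page q passes on its score divided by its link count.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) + d * incoming
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr


# Link graph inferred from the three equations (an assumption about the figure).
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

With this graph, C ends up with the highest score, A comes second and B last: C is linked from both other pages, while B only receives half of A's score.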
Can I influence my PageRank?
[Figure: webpages A, B and C]
With this: again, which is the more important webpage?
… and also 100 links on it, then a referenced (linked) page would only receive 2/100ths of its PageRank score … which is .. just about nothing ..
OK, so what can I do? – Part 1
• Ensure proper crawling of your website
  – Redirects only if required!
  – Don't even think of redirecting only crawlers
  – ‘What the crawler gets is the same as what the user gets’!!
  – no JavaScript to generate content or URLs
  – have good navigation – e.g. crawlers like a Sitemap!
    • Sitemaps 0.90 protocol support
  – see also: WebSphere Portal ‘Search Sitemap Utility’ portlet available
OK, so what can I do? – Part 2
• Publish appropriate content
  – not too little, not too much information on a web page
  – note that Flash objects and images might hide essential information from the crawlers
  – Focus on what you want to tell your users or customers first!
  – Then think about what keywords users (not you!) might choose to find similar content/information elsewhere
    • reconsider in cases of mismatch to adjust the keywords on your web
What MORE can I do to improve my PageRanks?
• Let’s face it: Not much!
• Seek relationships with trusted web sites and share information with one another
• Register your web site with Yahoo!
• Make your web pages easy to cross-link
  – Note that even more web sites – based on the platform, like WebSphere Portal – have URLs which somehow reference the user’s history (e.g. session object/ID)
  – consider making use of 'user friendly URLs' for users to pick up and use
A brief word on the Sitemaps 0.90 protocol ..
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2019-06-09</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
No … it doesn't influence the PageRank score of that page
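A Sitemaps 0.90 file like the one above can be generated with a few lines of script. This is a minimal sketch using only the Python standard library; the function name and the dictionary layout of the entries are assumptions made for illustration, and it is not how the Portal portlet produces its output.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize a list of page entries as a Sitemaps 0.90 document.

    Each entry is a dict with a required 'loc' and optional 'lastmod',
    'changefreq' and 'priority' keys (a hypothetical layout).
    """
    # Emit the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                child = ET.SubElement(url, "{%s}%s" % (SITEMAP_NS, tag))
                child.text = str(entry[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


xml_text = build_sitemap([
    {
        "loc": "http://www.example.com/",
        "lastmod": "2019-06-09",
        "changefreq": "weekly",
        "priority": 0.8,
    },
])
```

The values for changefreq and priority are hints to the crawler about crawl scheduling within the site; as stated above, they do not influence the PageRank score of the page.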
How do I promote my pages?
• Question: how will users know what information there is at my web site if they don’t find it easily (top) through Google?
• As mentioned in the beginning: this is also a two-phase approach:
  – get the high-level pages in good shape with all the important keywords you have selected
  – once users get to your homepage, they might be inspired to look for more information
  – How: does your site have good search?
  – If so: they’ll most likely find the information – if it’s there!
• If what they find is good – they might consider pointing to those
WebSphere Portal and SEO enablement
• Search engine crawler awareness since Portal V6.0
  – Portal Server will externalize for the crawler ‘normalized URLs’
    • URLs which do not maintain e.g. navigational state
• Sitemaps 0.90 protocol support
  – www.sitemaps.org
  – developerWorks article: http://www.ibm.com/developerworks/library/x-sitemaps/index.html
• Search Sitemap Utility portlet download: https://greenhouse.lotus.com/plugins/plugincatalog.nsf/assetDetails.xsp?action=editDocument&documentId=A1FF51D2C2E82CBE852576AB006ED590
• Remains: bookmark support for
… almost forgot .. for all those still insisting on getting 'search engine friendly URLs'
search engine / User friendly-URLs
• Friendly-URLs result in human readable URL prefixes that lead to portal pages
• Each content node might have a friendly name assigned
• The friendly-URL is a hierarchical path constructed from these names based on the content topology (see URL mappings)
• Every URL that is generated by WP APIs will contain the friendly-path automatically
  – It is even guaranteed that every URL that leads to a particular page will start with the page’s friendly-path
[Figure: Content Nodes – root, home, shop, info, shoes; example friendly-paths: /wps/portal/home and /wps/portal/home/shop]
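How a friendly-path is put together from content node names can be sketched as follows. This is an illustration only, not the WebSphere Portal API: the function names are hypothetical, and it assumes the root node maps to the context path /wps/portal with the friendly names of the nodes below it appended in order.

```python
PORTAL_CONTEXT = "/wps/portal"  # taken from the example paths on the slide


def friendly_path(node_names, context=PORTAL_CONTEXT):
    """Join the friendly names along a path of content nodes below the root.

    Hypothetical helper; nodes without a friendly name are skipped.
    """
    segments = [name.strip("/") for name in node_names if name.strip("/")]
    return "/".join([context.rstrip("/")] + segments)


def leads_to_page(url, node_names):
    """Check the stated guarantee: a URL to a page starts with its friendly-path."""
    return url.startswith(friendly_path(node_names))


home = friendly_path(["home"])
shop = friendly_path(["home", "shop"])
```

Anything the portal appends after the friendly-path (for example navigational state) leaves the readable prefix intact, which is what makes such URLs easy for users to pick up and cross-link.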
Public Portal pages and how Internet search engines work
[Figure: Web crawlers start from the Homepage or Sitemap and feed the Search indexes]
Crawler follows ‘hrefs’
only 'GET' requests
no JavaScript interpreted
Portal Server recognizes the crawler and triggers URLs to be normalized.
The fundamental problem –
for web-crawlers!
[Figure: Welcome page, Page A, Page B and Page C, connected by links URL-A to URL-E]
Portal encodes in URLs additional information about the navigational state of the user: like which page he comes from and how he left it – e.g. a portlet was maximized
Information encoded within URLs:
URL-A – Target: Page A, coming from Welcome page
URL-B – Target: Page B, coming from Welcome page
URL-C – Target: Page C, coming from Page A
URL-D – Target: Page A, coming from Page C
URL-E – Target: Page B, coming from Page C
A crawler would want to assume: URL-A and URL-D to be identical
A thing of the past - How Internet search engines had seen a Portal site
Web crawlers
Search indexes
This set of pages represents the structure of the Portal site.
This is the set of pages the crawler retrieves and assumes to be unique based on the link structure of the site.
WebSphere Portal V6 – crawlability enablement!
Web crawlers
Search indexes
Portal Server recognizes the crawler and triggers URLs to be normalized.
Result: no more ‘duplicate’ pages; all linked and public Portal pages are crawled and indexed
Normalized URL = all navigational state information is discarded from the URL
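The effect of normalization can be shown with the five URLs from the 'fundamental problem' slide. The sketch models each URL as a target page plus navigational state; this data model is an assumption for illustration and does not reflect how Portal actually encodes state in its URLs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortalUrl:
    """Hypothetical model of a portal URL: target page plus navigational state."""
    name: str
    target: str
    came_from: str


def normalize(url):
    """Discard all navigational state; only the target page remains."""
    return url.target


def unique_pages(urls):
    """Group URL names by their normalized form."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url.name)
    return groups


URLS = [
    PortalUrl("URL-A", "Page A", "Welcome page"),
    PortalUrl("URL-B", "Page B", "Welcome page"),
    PortalUrl("URL-C", "Page C", "Page A"),
    PortalUrl("URL-D", "Page A", "Page C"),
    PortalUrl("URL-E", "Page B", "Page C"),
]

groups = unique_pages(URLS)
```

Without normalization the crawler sees five distinct URLs; after normalization they collapse to three pages, with URL-A and URL-D recognized as the same page.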
Crawlability enablement for Search Engines
Crawler awareness - the Portal Server will recognize a crawler by its web agent identifier. A default list is already available, covering the 50 most popular crawlers (via pattern matching - thus potentially more are enabled).
The Portal will then transform all URLs that are output on the pages into so-called normalized URLs, thus making them unique. In addition, action URLs are nullified, thus not allowing crawlers to execute actions such as 'delete document' or 'login', etc.
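Recognizing a crawler by pattern matching on its agent identifier can be sketched in a few lines. The pattern list here is a short illustrative sample and is not the default list shipped with WebSphere Portal; the function name is also made up.

```python
import re

# Illustrative patterns only (assumption): not the Portal default list.
CRAWLER_PATTERNS = ["googlebot", "slurp", "msnbot"]
_CRAWLER_RE = re.compile("|".join(CRAWLER_PATTERNS), re.IGNORECASE)


def is_crawler(user_agent):
    """Return True if the User-Agent header matches a known crawler pattern."""
    return bool(_CRAWLER_RE.search(user_agent or ""))


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser = "Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"
```

Because the match is a substring pattern rather than an exact string, new versions of a crawler (or related crawlers from the same vendor) are typically covered without updating the list.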
A Sitemap portlet - a crawler can be pointed at it. For efficiency reasons it might be advisable to also place appropriate robot directives into the theme to ensure that the crawler will only follow the links available in the Sitemap, and thus does not have to re-evaluate links found on each of the pages.
Search Engine Utility portlet – provides support for the ‘Sitemap 0.90’ protocol, which is supported by Google, Yahoo! and Microsoft Live Search
In Summary – Portal will provide the means to allow for a complete crawl of a portal site (public pages) and the tools to allow for adequate linking of portal pages from an external site to support PageRanks.
Export as Google Sitemap
WebSphere Portal – Search Engine Utility portlet
Added feature – Export to Google Sitemap
Export the Sitemap to a Google Sitemap XML file.
Click on the Browse button to specify where in the filesystem to store the output XML sitemap file.
Search Engine Utility portlet – configuration/editing option
Default values: Update frequency: Weekly; Priority: 0.5
Dynamic and personalized content – what crawlers will not get hold of …
• Crawlers try not to fetch dynamic or personalized content
  – there might be spoofing involved !?!
  – in the past this was the main reason for truncation of URLs after the first ‘?’
• What can be done:
  – have the Web content management system generate a link list of the non-default (dynamic) content
  – append or reference via the website’s sitemap
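Why truncation after the first '?' hides dynamic content can be seen in a tiny sketch. The URLs and the function name are invented for illustration; this mimics the historical crawler behaviour described above, not any specific crawler's implementation.

```python
def truncate_at_query(url):
    """Cut a URL at the first '?', as older crawlers did."""
    return url.split("?", 1)[0]


# Two hypothetical dynamic pages that differ only in their query string.
dynamic_urls = [
    "http://www.example.com/catalog?item=1",
    "http://www.example.com/catalog?item=2",
]

seen_by_crawler = {truncate_at_query(u) for u in dynamic_urls}
```

Both pages collapse into a single URL from the crawler's point of view, which is why listing the dynamic content explicitly (link list or sitemap) is the suggested remedy.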
Summary
• WebSphere Portal allows for safe and efficient crawling of Portal sites
• Efficiency increased through support for Sitemaps 0.90 protocol
• Good pages are determined by their content
• Link popularity is the additional boost factor
• Consult the 'webmaster guides' that the search engine sites publish
• If an SEO consultant suggests metadata 'spamming' or 'pretty URLs'
  → ask them for proof: either webpages applying such techniques (before and after), or official documentation by Google et al.
• 'dynamic Portal URLs' prevent pages from getting adequate ranking
Excellent book on ‘Search Engine Marketing’!
• Search Engine Marketing, Inc.: Driving Search Traffic to Your Company's Web Site, Mike Moran, Bill Hunt, IBM Press
  – http://www.amazon.de/Search-Marketing-Driving-Traffic-Companys/dp/0131852922/ref=sr_1_1?ie=UTF8&s=books-intl-de&qid=1202128301&sr=1-1
  – Acknowledgement: Overview slides taken from Mike’s SEO presentation
• developerWorks articles – Basics on SEO – Part 1-4
  – http://www.ibm.com/developerworks/search/searchResults.jsp?searchType=1&pageLang=&displaySearchScope=dW&searchSite=dW&lastUserQuery1=search+engine+optimization&lastUserQuery2=&lastUserQuery3=&lastUserQuery4=&query=search+engine+optimization+basics&searchScope=dW&Go.x=0&Go.y=0
For More Information (1)
WebSphere Portal – IBM Site
http://www-3.ibm.com/software/genservers/portal/
WebSphere Portal Information Center
http://www.ibm.com/developerworks/websphere/zones/portal/proddoc.html
WebSphere Portal Business Solutions Catalog (on Lotus Greenhouse)
https://greenhouse.lotus.com/catalog/home_full.xsp?fProduct=WebSphere%20Portal
WebSphere and Lotus Web Content Management Portal Open Beta
https://www14.software.ibm.com/iwm/web/cc/earlyprograms/lotus/portalopenbeta/
WebSphere Portal Blog
For More Information (2)
IBM Lotus Connections
http://www.ibm.com/software/lotus/products/connections
IBM Lotus Forms
http://www.ibm.com/software/lotus/forms
IBM Lotus Quickr
http://www.ibm.com/lotus/quickr
IBM Lotus Sametime
http://www.ibm.com/lotus/sametime
WebSphere Commerce
http://www.ibm.com/websphere/commerce
WebSphere Process Server and Business Process Automation
© IBM Corporation 2010. All Rights Reserved.