• No results found

A STUDY ON DESIGN AND ANALYSIS OF WEB MART MINING AND ITS RELEVANCE TODAY

N/A
N/A
Protected

Academic year: 2020

Share "A STUDY ON DESIGN AND ANALYSIS OF WEB MART MINING AND ITS RELEVANCE TODAY"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

A STUDY ON DESIGN AND ANALYSIS

OF WEB MART MINING AND ITS

RELEVANCE TODAY

Ravikumar G K*

*Dr. MGR University, Chennai, Tamilnadu, INDIA, Manjunath T. N+

+

Bharathiar University, Coimbatore, Tamil Nadu, INDIA, Ravindra S. Hegadi#

#Karnatak University,Dharwad,Karnataka,INDIA, Archana R A++

++SJB Institute of Technology,Bangalore,Karnataka,INDIA, Abstract:

Data warehousing is one of the latest trends in computing environment and information technology applications. A data warehouse is a system that extracts, cleans and delivers source data into dimensional data store and then supports and implements querying and analysis for the purpose of decision making. From a data warehouse, data flows to various departments for their customized decision support systems. These individual departmental components are called data marts. A data mart is a set of dimensional tables supporting a business process. Data marts contain all atomic detail needed to support drilling down to the lowest level. Every company or organization in the world has a website. Beneath each web site are web logs that record every object either posted to or served from the web server. Web logs are important because they reveal the user traffic on the web site. The activity of parsing web logs and storing the results in a data mart to analyze customer activity is known as click stream data warehousing. The web mart - database schema is designed to make the underlying data structure more comprehensible to users and to simplify the query process. The recommended approach for data warehouse data modeling is to follow a Dimensional Modeling approach - Star Schema. We explore the design and analysis of web mart and its relevance today at minute level.

Keywords: Data warehousing, ETL, Web log, Data mart, Web mart.

1. Introduction - Star Schema of the Web Mart

The web mart - database schema is designed to make the underlying data structure more comprehensible to users and to simplify the query process. The recommended approach for data warehouse data modeling is to follow a Dimensional Modeling approach-called Star Schema. The star schema has a central fact table with dimension tables at the points of the star. The single fact table’s composite primary key requires a foreign key field corresponding to the primary key field of each dimension table. The dimension tables are hierarchical and thus highly denormalised [4] .A fact table is a primary table in the web mart that contain the business facts, and dimension tables are companion tables to the fact table that represent the business critical dimensions and contain the attributes for the business critical dimensions. The central fact table provides users the ability to do analysis on business facts, and dimensional tables provide users the ability to do analysis on these business facts in various business critical dimensions[10].

(2)

Fig-1: Star Schema of the Web Mart Design -Click Stream Fact and its Associated Dimensions 2. Description -Web Mart Objects

This section explains about the detailed analytical capabilities of the model by giving the listing of the basic fact that the user will be able to analyze and the corresponding dimensions which gives user the capability of drill up and drill down and slicing and dicing on the base fact. Before the design of specific click stream data marts, there is a need to collect together as many dimensions as one can think of that may have relevance in a click stream environment. The unique dimensions of the click stream data warehouses are page, visitor, session and referral. The page dimension describes the page context for a web page event. It contains attributes like page key, page source, page function. The visitor dimension gives the details regarding visitor. The main attributes are userId, CookieId, Operating System and Browser. The session dimension provides one or more levels of diagnosis for the visitor’s session as a whole. For example, the local context of the session might be requesting product information, but the overall session context might be ordering a product. The referral dimension describes how the customer arrived at the current page [9] [11].

2.1 Facts and Dimensions in the Web Mart

The following table-1 presents the objects i.e. fact and dimensions available in the web mart for the analysis purpose.

Table name Fact/Dimension Levels Click Stream Fact Fact -

Universal Date Dimension Year, Quarter, Month, Week, Day Universal TOD Dimension Period of the day, Hour, Minute, Second Date Dimension Year, Quarter, Month, Week, Day TOD Dimension Period of the day, Hour, Minute, Second Visitor Dimension IP Address (or) Visitor Id (or) Cookie Id Page Dimension Object Type, File Type, Page Type, URL (a)

Domain, Site, Directory, URL

Session Dimension Session Type

Click Stream Fact

Object Dimension Universal

Date

TOD Dimension

OS Dimension

Browser Dimension Universal

TOD

Visitor Dimension

Page Dimension

Session Dimension

Referrer Dimension

Date Dimension

Status Dimension

(3)

Referrer Dimension Referrer Type, URL (a) Domain, Site, URL

Status Dimension Type of the Status, Status description

Visit Dimension -

Content Page Dimension -

Table-1: Objects of Web Mart 2.2 Click Stream Fact Table

2.3 The Table-2 presents various business measures on which the user will be able to do analysis using the click stream fact table and the associated dimension tables.

Field Name Description

UniversalDatekey Foreign key for the Universal Date dimension

UTODkey Foreign key for the Universal TOD dimension

Datekey Foreign key for the Date dimension

TODkey Foreign key for the TOD dimension

Visitorkey Foreign key for the Visitor dimension

Pagekey Foreign key for the Page dimension

Sessionkey Foreign key for the Session dimension

Referrerkey Foreign key for the Referrer dimension

Statuskey Foreign key for the Status dimension

Visitkey Foreign key for the Visit dimension

ContentPagekey Foreign key for the Content Page dimension

TimeViewed The time spent in seconds, by the Visitor on a particular object like page, file …

BytesTransferred The bytes transferred to the client machine. Table-2: Click Stream Fact Table Description

2.3 Dimensions

The dimension table gives users the ability to analyze the business measures in different dimensions by allowing the users to drill up and drill down and slice and dice with the attributes of the dimensions. Drilling down is adding detailed rows to an existing request and is nothing more than requesting to give more detail. Drilling up is subtracting row headers and is nothing more than looking at the data at more aggregated/consolidated form. Slicing is constraining the data that is displayed on an attribute found in a dimension and dicing is constraining the data that is displayed by attributes in multiple dimensions [4].

2.3.1 Universal Date Dimension

(4)

Figure-2: Universal Data Dimension with the hierarchy The following table describes the structure of the Universal date dimension table

Field Name Description Values/Example

UniversalDatekey Primary key for the dimension (Surrogate key) 1,2 …

UniversalDate Date 25/01/2000

UniversalDayOfWeek Day of the week Sunday

UniversalDOWNumber Day of week number 1-7, Sunday being 1

UniversalWeekOfMonth Week number in the month 1-5

UniversalWeekOfYear Week number in the year 1-52

UniversalDayOfMonth Day number in the month 1-31

UniversalMonthNumber Month of the year in number 1-12

UniversalMonth Month of the year January

UniversalQuarter Quarter of the year 1-4

UniversalYear Year 2000

Table-3: Column Description of the Universal Date Dimension 2.3.2 Universal TOD Dimension

The figure-3 presents the universal TOD dimension with all its attributes. This universal TOD dimension facilitates analysis along the time of the day with respect to the Greenwich mean time. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. Seconds is the lowest grain level that the user will be able to drill down to, and period of the day is the highest level that the user will be able to drill up to. Drill down path can be identified by following the arrow headings [10] [1].

Figure-3: Universal TOD Dimension

DATE

DAY NUMBER IN

MONTH NUMBER

YEAR

WEEK NUMBER IN

YEAR

QUARTER

MONTH

WEEK OF THE

DAY OF THE

DAY NUMBER IN WEEK

PERIOD OF THE DAY

HOUR

MINUTE

(5)

The table-4 describes the structure of the Universal TOD dimension table.

Field Name Description Values/Example

UTODkey Primary key for the dimension (Surrogate key) 1,2 …

UniversalSecond Second of a Minute 1-60

UniversalMinute Minutes of a hour 1-60

UniversalHour Hours of a day 10-11, 11-12 …

UniversalTODPeriodofDay Collection of hours in a day Morning, Evening … Table-4: Column Description of Universal TOD Dimension

2.3.3 Date Dimension

The figure-4 presents the date dimension with all its attributes. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. Date is the lowest grain level that the user will be able to drill down to, and year is the highest level that the user will be able to drill up to. Drill down path can be identified by following the arrow headings [10] [1].

Figure-4: Date Dimension with the hierarchy The table-5 describes the structure of the date dimension table.

Field Name Description Values/Example

Datekey Primary key for the dimension (Surrogate key) 1,2 …

Date Date 25/01/2000

DayOfWeek Day of the week Sunday

DayOfWeekNumber Day of the week number 1-7, 1 being Sunday

WeekNumber Week number in the month 1-5

Week Week number in the year 1-52

MonthDay Day number in the month 1-31

MonthNumber Month of the year in number 1-12

Month Month of the year January

Quarter Quarter of the year 1-4

Year Year of the date 2000

Table-5: Column Description of Date dimension

DATE

DAY NUMBER IN MONTH

MONTH NUMBER

YEAR

WEEK NUMBER IN YEAR YEAR

QUARTER

MONTH

WEEK OF THE MONTH

DAY OF THE WEEK

(6)

2.3.4 TOD Dimension

The figure-5 presents the TOD dimension with all its attributes. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. Seconds is the lowest grain level that the user will be able to drill down to, and period of the day is the highest level that the user will be able to drill up to. Drill down path can be identified by following the arrow headings [10] [1].

Figure-5: TOD dimension with its attributes The table-6 describes the structure of the TOD dimension table.

Field Name Description Values/Example

TODkey Primary key for the dimension (Surrogate key) 1,2 …

Second Second of a Minute 1-60

Minute Minutes of a hour 1-60

Hour Hours of a day 10-11, 11-12 …

TimeOfDay Time of the day 12:00:55;19:15:25

PeriodofDay Collection of hours in a day Morning, Evening …

Table-6: Column description of TOD dimension 2.3.5 Visitor Dimension

The figure-6 presents the Visitor dimension with all its attributes. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. User id/ Cookie id/ Domain name is the lowest grain level that the user will be able to drill down to, and country is the highest level that the user will be able to drill up to. The lowest granularity will be decided at the client site. Drill down path can be identified by following the arrow headings [10] [1].

Figure-6: Visitor Dimension with its attributes

* - Demographics are collection of many fields. It is also possible to form a hierarchy in the demographic information. The table-7 describes the structure of the visitor dimension table.

PERIOD OF THE DAY

HOUR

MINUTE

SECOND

USER ID

OPERATING SYSTEM BROWSER

COOKIE ID

DEMOGRAPHICS * COUNTRY

(7)

Field Name Description Values/Example

Visitorkey Primary key for the dimension (Surrogate key) 1,2 … UserId Identification of the Visitor (name or login user

id).

CookieId Value of the Cookie A string

IPAddress IP address of the requesting client 164.164.22.91 Country The country of the visitor. Predicted from the

domain of the visitor

USA, India … OperatingSystem The name of the operating system with version Windows NT 4.0, SCO

Unix 7…

Browser The name of the browser with version Netscape Navigator 4.6, Internet Explorer 5.1 …

*FirstName The first name of the Visitor John, Philip…

*LastName The Last name of the Visitor Smith, Jacob…

*DateOfBirth The Date of birth of the Visitor 12/07/1076

*AgeGroup The age group of the Visitor 18-25, 25-40 …

*Gender The Gender of the Visitor Male, Female

*Occupation The occupation of the Visitor. Limited to a set of categories.

Engineering, Computer related…

*IncomeGroup The income group of the Visitor. The groups are defined by the business.

*ZipCode The Zip code of the place of the visitor

*State The state of the visitor.

*VisitorCountry The country of the Visitor specified by the Visitor.

USA, India …

Table-7: Column Description of Visitor Dimension * Optional fields. Collected from the web site visitor through registration forms. 2.3.6 Page Dimension

The figure-7 presents the page dimension with all its attributes. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. URL is the lowest grain level that the user will be able to drill down to, and the domain and object type are the highest levels that the user will be able to drill up to. Drill down path can be identified by following the arrow headings [10] [1].

Figure-7: Page Dimension with this attributes

URL

FILE NAME

SITE

DIRECTORY DOMAIN OBJECT TYPE

FILE TYPE

PAGE TYPE

(8)

The following table-8 describes the various fields of the page dimension.

Field Name Description Values/Example

Pagekey Primary key for the dimension (Surrogate key) 1,2 …

URL Full path of the page in the server C:\..\index.html

PageName Name of the web page Welcome page, Product

info page … PageType Classification of pages in the web site News pages, Jobs &

Career pages …

FileType The file type of the object Gif, au, ra, html …

ObjectType The type of the object Multimedia files,

Application, Content pages

FileName Name of the file being accessed by the user Index.html,

ProductInfo.html … Directory The directory in the server, of the accessed file C:\Inetpub\doc … Site The site where the particular page is available

Domain The domain of the site where the page resides Table-8: Column description of Page Dimension

2.3.7. Session Dimension

The figure-8 presents the session dimension with all its attributes. The general attributes are represented using straight-line connections; the connection with a circle at one end denotes that the specified item is a collection of fields. Session type assigns a meaning to a visit [1]

Figure-8: Session Dimension with its attributes

* - Session parameters are collection of fields, which describes the conditions for characterizing a session type.

The following table-9 describes the various fields of the session dimension.

Field Name Description Values/Example

Sessionkey Primary key for the dimension (Surrogate key) 1,2 … SessionType The type of the user session. Session is defined

bases on the business rules

Quick hit and gone, Product Ordering … SessionDescription The description of a particular session

SessionParameters The parameters that characterize the particular session. It can be split into multiple fields, based on the business rules provided by the customer.

Ex: If Time Spent is in the range of 1-10 min and the pages visited in general info or product info, then it is a ‘Looking for Info’ session.

Table-9: Column Description of Session dimension

SESSION TYPE

(9)

2.3.7 Referrer Dimension

The figure-9 presents the session dimension with all its attributes. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. URL is the lowest grain level that the user will be able to drill down to, and the domain and referrer type are the highest levels that the user will be able to drill up to. Drill down path can be identified by following the arrow headings [1].

Figure-8: Session Dimension with its attributes

The following table-10 describes the fields of the referrer dimension.

Field Name Description Values/Example

Referrerkey Primary key for the dimension (Surrogate key) 1,2 …

ReferringURL The URL of the referring page C:\..\index.html

ReferringSite The Site of the referring page ReferringDomain The domain of the referring page

Keyword The keyword given by the user as search criteria to reach the page.

Web mining, warehousing

ReferrerType The type of the referrer Ad banner, Search

engine … Table-10: Column description of referrer dimension

2.3.8 Status Dimension

The figure-9 presents the status dimension with all its attributes. The hierarchical attributes of the dimension are represented using the arrow head connections and general attributes are represented using straight-line connections. Status id is the lowest grain level that the user will be able to drill down to, and the status type is the highest level that the user will be able to drill up to. Status description provides a description for the status id. Drill down path can be identified by following the arrow headings [10] [1].

Figure-9: Status dimension with its attributes

URL

KEY WORD

SITE DOMAIN

REFERRER TYPE

STATUS DESCRIPTION STATUS ID

(10)

The following table-11 describes the fields of the status dimension.

Field Name Description Values/Example

Statuskey Primary key for the dimension (Surrogate key) 1,2 …

StatusId The Status Code 101, 201 …

StatusDescription Description of the Status Successful, File not found error …

StatusType Type of the Status File errors …

Table-11: Column Description of Status dimension

2.3.9 Visit Dimension

This dimension has no hierarchy. This dimension is used for identifying the start and end of a visit, it is show in table-12

Field Name Description Values/Example

Visitkey Primary key for the dimension (Surrogate key) 1,2 … Description The value or description. For the start of the visit

it is 'Start' for an end of the visit page 'End'

Start, End …

Table-12: Column description of visit dimension

2.3.10 Content Page Dimension

This dimension has no hierarchy. This dimension is used for identifying a page as a content page or not. it is shown in table-13

Field Name Description Values/Example

ContentPagekey Primary key for the dimension (Surrogate key) 1,2 … Description 'Yes' to indicate a content page. 'No' to indicate

other files

'Yes', 'No' …

Table-13: Column description of content page dimension

3. The Data Modeling-Star Schema

(11)

Figure-10: Star Schema model design for Web mart

4. Results

Now we can proceed to the interesting part of our data warehouse: relieving information. 4.1 The average number of minutes from login to order

4.2 The average number of days from first being invited to the site by email to the first order.

5.Conclusions

Understanding the behavior of users on your website is as valuable as following a customer around a store and recording his or her every move. Imagine how much more organized your store can be and how many opportunities you can have to dell merchandise if you know every move customers make while navigating your store. The ETL process in Clickstream data warehousing is significantly different from any other source you are likely to encounter.

References

[1] Ralph Kimball, The Data Warehouse ETL Toolkit, Wiley India Pvt Ltd., 2006 [2] Dr. K.V.K.K Prasad, Data warehouse Development Tools, Dreamtech Press, 2006

[3] White paper by Vivek R Gupta, Senior Consultant, System Services Corporation,“An Introduction to data warehousing”.

[4] Manjunath T.N, Ravindra S Hegadi, Ravikumar G K."Analysis of Data Quality Aspects in DataWarehouse Systems", International Journal of Computer Science and Information Technologies, Vol. 2 (1), 2010, 477-485

[5] Manjunath T.N, Ravindra S Hegadi, Ravikumar G K." A Survey on Multimedia Data Mining and Its Relevance Today", [6] International journal of Computer Science and Network Security. Vol. 10 No. 11 pp. 165-170.

[7] Sanjeevkumar R. Jadhav, and Praveen Kumar Kumbargoudar, “Multimedia Data Mining in Digital Libraries: Standards and [8] Features” in Proc. READIT-2007, p. 54.

(12)

ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 78-85, August 26, 2001, San Francisco, CA, USA.

[10] Valery A. Petrushin and Latifur Khan, “Multimedia Data Mining and Knowledge Discovery”, 2007 - London: Springer-Verlag, pp. [11] 3- 17

[12] S. Kotsiantis, D. Kanellopoulos, P. Pintelas, Multimedia Mining, SEAS Transactions on Systems, Issue 10, Volume 3, December 2004, pp. 3263-3268

[13] Sanjiv Purba “Data Management Handbook” Published by CRC Press, 1999

[14] Bhavani M. Thuraisingham, “Data Management Systems: Evolution and Interoperation”, Published by CRC Press, 1997 [15] Jiawei Han, Micheline Kamber “Data Mining: Concepts and Techniques” Published by Morgan Kaufmann, 2001

[16] Sanjeevkumar R. Jadhav, and Praveen Kumar Kumbargoudar, Multimedia Data Mining in Digital Libraries: Standards and Features, ACVIT- 07, Dr. Babasaheb Ambedkar MarathWada University, Aurangabad,MS-India

[17] Mori Y, Takahashi H, Oka R. Image-to-word transformation based on dividing and vector quantizing images with words. In: MISRM’99 First International Workshop on Multimedia Intelligent Stotage and Retrievel management, 1999.

[18] Ordenoz C, Omiecinski E. Discovering association rules based on image content. In: ADL ’99: Proceedings of the IEEE Forum on Research and Technology Advances in Digital libraries. Washington, DC: IEEE Computer Society; 1999, p.38.

[19] Chakrabarti, S. (2000): Data mining for hypertext: A tutorial survey. SIGKDD explorations, 1(2), pp. 1–11.

[20] Ravikumar G K, Manjunath T. N, Ravindra S. Hegadi, Umesh I.M, Cross Industry Survey on Data mining Applications, International Journal of Computer Science and Information Technologies, Vol. 2 (2) , 2011, 624-628.

Authors Profile

Ravikumar GK. received his Bachelor’s degree from Siddaganga Institute of Technology, Tumkur (Bangalore University) during the year 1996 and M. Tech in Systems Analysis and Computer Application from Karnataka Regional Engineering College Surthakal (NITK) during the year 2000. He is currently working towards his PhD degree in the Area of Data mining. He has published several papers in International and national level conferences. He is having around 14 years of Professional experienced which includes Software Industry and teaching experience. His area of interests are Data Warehouse & Business Intelligence, multimedia and Databases.

Manjunath T N. received his Bachelor’s Degree in computer Science and Engineering from Bangalore University, Bangalore, Karnataka, India during the year 2001 and M. Tech in computer Science and Engineering from VTU, Belgaum, Karnataka, India during the year 2004. Currently pursing Ph.D degree in Bharathiar University, Coimbatore. He is having total 10 years of Industry and teaching experience. His areas of interests are Data Warehouse & Business Intelligence, multimedia and Databases. He has published and presented papers in journals, international and national level conferences.

Dr.Ravindra S Hegadi received his Master of Computer Applications (MCA) & M.Phil and Doctorate of Philosophy (Ph.D). in year 2007 in computer science from Gurbarga University, Karnataka; He is having 15 years of Experience. He has visited overseas to various universities as SME.His area of interests are Image Mining, Image Processing and Databases and business intelligence. He has published and presented papers in journals, international and national level conferences.

Figure

Table name Click Stream  Fact

References

Related documents

If using a fair value basis for financial assets and liabilities, it will provide an initial overview of a company’s liquidity, debt leverage, foreign exchange exposure, interest

It is clear from this summary that a break occurs between sections (c) and (d): Gjuki's sons slay Sigurd, eir sister Gudrun's husband, this central event being preceded by an account

In addition to controlling the access to the network itself, in order to control the access of users to the multicast session, an Authentication, Authorization and Accounting

This observation has an important implication: Any policy instrument, such as R&D Subsidy, R&D Tax Credit or Intellectual Property Rights Policies, which targets

The vast majority of studies in this review (thirty of thirty-six) were cross-sectional in design. This design has several advantages, particularly with rare populations, such as

Data confirmed a decreased functional connectivity (FC) and task-induced deactivation of the DMN during the aging process and in subjects with lower mood; on the contrary, an

High density provides the population and activity basis for a sustainable neighbourhood; adequate street density is the material basis; mixed land-use and social mix shape the

In this paper we show that fi nancial contracts, the success probability, valuation, and value created in start-up fi rms depend on characteristics of the capital market in which