How structured data (Linked Data) help in Big Data Analysis --- Expand Patent Data with Linked Data Cloud

Lishan Zhang

Electrical Engineering and Computer Sciences

University of California at Berkeley

Technical Report No. UCB/EECS-2013-96

http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-96.html

May 17, 2013

Copyright © 2013, by the author(s).

All rights reserved.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission.

M.Eng Program

Lishan Zhang

24106243

Outline

Abstract

Introduction

Literature Review

Unveil the underlying information among Big data

Previous solutions

Approaches

Conclusion

Methodology

SPARQL: query language for RDF data

SPARQL Endpoint query

HTTP request

User Interface design

Discussion

Results

Explanation of Results

Evaluation

User Study

Heuristic Evaluation

Future Work

Conclusions or Impact Statement

Bibliography

Appendix

Abstract

Big Data is currently a major topic. The term commonly describes data that exceeds the processing capacity of on-hand database management tools. Its characteristics are often summarized as the 4Vs: Volume, Variety, Velocity and Value. Big Data can be structured or unstructured, and it carries potential value. It is therefore vitally important to extract and analyze the valuable information in Big Data.

Linked Data, on the other hand, is a new concept for most people. Linked Data refers to a collection of interrelated datasets that can be published and shared on the web. Unlike Big Data, Linked Data is highly structured. It is used to build the Semantic Web, in which large amounts of data on the web are available in a standard format. These technologies enable people to answer more advanced analytical questions by querying the data and drawing inferences using vocabularies.

In our project, we explore the potential use of Linked Data in analyzing Big Data. We build a search engine that combines information from Linked Data with Patent Data to see whether we can uncover more information about each patent. There is already a huge Linked Data cloud that contains a large amount of published open data, and we see the potential to connect this public data with patent data to answer advanced questions. When a user searches for an inventor name or a certain patent in the search interface, we query the Linked Data Cloud and the Patent database separately and return the results. In this way, we can combine the patent itself with

Introduction

Nowadays, we are generating much more data than at any point in history.

The explosion of data is driven by two particular sources: social networks sharing information about our activities, and a variety of sensors collating information on our environment. [1]

Needless to say, there could be priceless value hidden in this booming data. If we make good use of it, we may gain valuable information and discover patterns inside the data. However, it will also become a threat if we cannot handle this ever-increasing amount of data.

Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. [2] Big Data is often characterized by four main attributes: Volume, Velocity, Variety and Value. It can be structured or unstructured, and it carries potential value. The McKinsey Global Institute describes Big Data as “The next frontier for innovation, competition and productivity.” [3] But processing these big raw datasets poses challenges in both data management and algorithms. It is vitally important to extract and analyze the valuable information in Big Data.

The major difficulties in processing Big Data include capture, storage, search, sharing, analytics and visualization. [4] There are already several approaches to analyzing Big Data. For example, MapReduce is a programming model and an implementation for processing and generating large data sets. It runs on a large

Data. Some institutes and companies have also developed their own mathematical models and algorithms to extract useful information from Big Data.

We will mainly focus on the variety of data in this thesis. Variety means that Big Data contains different types of data with various degrees of structure that do not fit into neat relational structures. It is a mix of structured, semi-structured and unstructured data such as text, sensor data, video, log files and more. Such data cannot be integrated into an application directly. [2]

Current approaches to Big Data, such as MapReduce and NoSQL, emphasize the ability to deal with volume and velocity. In this paper, we take a different approach: we are concerned with the variety of Big Data. Since most data is unstructured, it is hard to interlink different datasets and create valuable context from them. We see potential value in linking different datasets and expanding the value of a single dataset with the help of Linked Data.

Linked Data is used to organize and publish highly structured data with globally unique identifiers, which makes it easy to combine various datasets. Richard Cyganiak and Anja Jentzsch created the Linking Open Data cloud diagram, which shows the datasets that have been published on the web. [5] As the Linked Data cloud grows constantly, data integration is becoming more important in this field.

Fig 1: The Linking Open Data cloud diagram

In this paper, we explore the potential use of Linked Data in Big Data analysis by building a prototype of our concept. We use the U.S. utility patent dataset and link it with the public Linked Data cloud. We build a search engine for Patent Graph search that queries endpoints in the Linked Data Cloud, such as DBpedia and Freebase, simultaneously queries the SQL data from the patent dataset, and shows the combined results in the interface. The diagram below illustrates the querying process:

Fig 2: The querying process of Patent Search Engine

In this way we can add more related information about a patent and even provide recommendations for patent search. We see much potential value in this interconnection, and we expect Linked Data to become increasingly valuable in Big Data analysis.

Literature Review

Unveil the underlying information among Big data

Big Data has become one of the hottest topics in the industry. In this data-booming world, some traditional technologies can no longer serve the need to analyze such large volumes of data. New approaches must be introduced in order to keep up with the pace of Big Data. The Linked Data concept is a useful way to unveil useful information, especially in data on the Internet.

Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. We are generating much more data than before with the boom of social networks and media, mobile devices, Internet transactions, and networked devices and sensors.

Big Data is too big, too fast and doesn’t fit conventional database architectures. Due to the unique nature of Big Data, the first question we need to answer is whether we can find an alternative way to process the data. More importantly, can we extract useful information from it?

Big Data requires exceptional technologies to efficiently process large quantities of data. A huge number of valuable patterns and pieces of information are hidden in Big Data, waiting to be extracted. Usually, four issues arise when it comes to Big Data: Volume, Velocity, Variety and Value (4V) [6].

Volume and Velocity

In this data-booming world, data grows exponentially. In particular, with the increasing popularity of social media, user-generated content has started to dominate. For example, roughly 60 hours of video are uploaded to YouTube every minute [7]. It is also astonishing that over 340 million tweets were generated daily in May 2012 [8]. To put this in perspective, the amount of information in the world doubles every five years [9]. There is more information in the daily edition of The New York Times than an individual man or woman in the 16th century had to process in their whole lives.

Such a huge amount of data requires tremendous storage space and extremely fast processing. Dealing with it has always been challenging for any company, government or individual.

Variety and Value

Big Data relates not just to new information sources: it’s equally applicable for gaining new insights from data that was previously inaccessible and to accelerating and easing existing analytical processes [10]. In fact, most big data is low value until rolled up and analyzed, at which point it becomes valuable.

The challenge comes from Big Data’s variety. Big Data has different structures and shapes, which makes it very difficult to analyze with traditional technologies such as MySQL or Oracle. Integrating these data sources is a very expensive operation [11]. In addition, correlating different pieces of data and reconnecting them to make them more valuable, readable and accessible has always been an interesting problem.

Previous solutions

There are several existing ways to process and analyze Big Data. Usually, they utilize advanced hardware and parallel processing techniques to break the speed bottleneck. Others employ non-relational data storage systems to deal with unstructured and semi-structured Big Data. Meanwhile, many companies have been trying to apply their own math models, advanced analytics and data visualization technology to extract insights from Big Data.

Approaches

MapReduce

MapReduce is a breakthrough concept announced by Google. It is a programming model and an implementation for processing and generating large data sets [12]. It is able to run on a large cluster of machines and is highly scalable.

MapReduce is not only successful at Google; the model is also available to the public in the open-source Hadoop project, a highly scalable compute and storage platform [13]. Hadoop breaks a huge chunk of data into pieces and processes and analyzes them in parallel.
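
To make the programming model more concrete, the following is a minimal single-process sketch of a MapReduce-style word count. It only illustrates the map, shuffle and reduce phases; the function names are our own, and it does not use Hadoop or any cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit one (word, 1) pair per word, as a mapper would."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = []
    # On a cluster, each document could be mapped on a different machine.
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))
```

Because each call to map_phase depends only on its own document, the map work can be split across machines, which is where the scalability of the model comes from.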

NoSQL

The term NoSQL was first used by Carlo Strozzi for a database that did not expose the standard SQL interface [14]. NoSQL databases work in conjunction with Hadoop to serve up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications [15].

Digging useful information

Various companies have taken action to extract useful information from the varied data on the web. For example, Splunk is a small company that has been in business for less than 5 years. Splunk’s mission is to make ambiguous big data more readable, useful and valuable to everyone. For example, one of its partners, Amazon, is asking Splunk to find out the habits of its customers.

Another company, Jive, works in the social business software industry. It also tries to help its customers consolidate the big data they are dealing with. One example is price information for merchandise: what price should be set in order to be the best price.

Downsides

However, none of these approaches is perfect. For example, Hadoop is a very young technology and is still developing. A Hadoop system is very hard to manage, and it does not support real-time data processing and analysis.

The drawback of NoSQL, on the other hand, is that most NoSQL databases trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. NoSQL also suffers from its ‘youth’: it has no mature management and monitoring tools.

Conclusion

Key results

Big Data holds tremendous value, and it is beneficial to understand what it really means. Many new technologies, such as MapReduce and NoSQL, have been applied to this problem. However, it is never safe to say that we already have the perfect tools for the job. As data continues to grow exponentially, new technology such as Linked Data is likely to be a key to the next-generation analytics platform and data management system.

Shortcomings

Linked Data applications usually follow different architectures and patterns. For instance, one pattern requires the data to be replicated, so applications may work with stale data. Another pattern, the On-The-Fly Dereferencing Pattern, works very slowly when dealing with complex operations.

Additional Work

fact that we are in a data-exploding era cannot be reversed. More and more data is coming to us, and technology must keep evolving in order to keep up with the pace.

Artificial intelligence can be applied when dealing with Big Data. A ‘databot’ that can crawl the Linked data, infer relationships, and figure out what information can be extracted will definitely be useful.

Methodology

For this thesis, we build a use case to explore the potential use of Linked Data with Patent Data. More specifically, we build a search engine that we named “Patent Graph”. When a user types a certain patent number or inventor name, we show relevant information such as a picture of the inventor and his or her workplace, alma mater, doctoral advisor and biography. This information is obtained from DBpedia, which provides structured data extracted from Wikipedia. DBpedia makes this information available on the web so that people can easily link to the data. In addition, a user can start a new search from the result simply by clicking the related information on the page. For example, if we are interested in a co-worker or the advisor of an inventor we searched for, we can click the name, and the engine returns a new search around that person and his or her patents. We can also provide recommendations based on the search results. If time allows, we would also like to convert the Patent Data into RDF format and publish it on the web so that more people can benefit from it. In this way, Linked Data helps us analyze the Patent Data by expanding our patent datasets with related data and finding more useful information.

The Patent Data that we use is the Patent Inventor Database from the Fung Institute. The database disambiguates all inventor names from the U.S. utility patent database from 1979 to 2010. The Linked Data we use is DBpedia. The DBpedia dataset extracts structured content from the information created by Wikipedia and it can be

Since we are building a search engine that extracts information from both the Linked Data Cloud and a relational database, we build it as a web service and use a Model-View-Controller (MVC) software architecture.

My part of the work includes implementing the search interface and querying the Linked Data Cloud. The techniques involved are SPARQL endpoint queries, HTTP requests and User Interface design.

SPARQL: query language for RDF data

Resource Description Framework (RDF) is a directed, labeled graph data format for describing resources on the web. It is designed to be read and understood by computers rather than people. Most RDF documents are written in XML, which can easily be exchanged between different computers and platforms. RDF is also a part of “The Semantic Web”. The Semantic Web is a set of standards and best practices for sharing data and the semantics of that data over the web for use by applications. [16] Rather than just putting data on the web, the Semantic Web is about making links so that a person or machine can explore the web of data. [17] An RDF statement is a triple of the form (Subject, Predicate, Object) that uses uniform resource identifiers (URIs) to name the data objects. For example, to express “Tom is a man”, we represent it as Tom (Subject), sex (Predicate), man (Object). The data stored in the Linked Data Cloud is RDF data. SPARQL stands for SPARQL Protocol and RDF Query Language. SPARQL is a standard query language designed for querying RDF databases. There are four different query forms (SELECT, CONSTRUCT, ASK and DESCRIBE), and we use the SELECT form most of the time. [18] The main idea of SPARQL is pattern matching, so it is easy to traverse relationships by querying collections of triples. The syntax of SPARQL is quite similar to SQL. A simple SPARQL query example is as follows:

PREFIX dbont: <http://dbpedia.org/ontology/>

SELECT ?musician ?place

WHERE {

?musician dbont:birthPlace ?place .

}

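First we declare a namespace prefix, in this case http://dbpedia.org/ontology/. The query then matches every resource that has a birth place, binds the resource to ?musician and its birth place to ?place, and returns both. As written, the pattern does not restrict the results to musicians, so the result also contains other people. A partial result is shown below. We can type the example query into the DBpedia endpoint to get the full list.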

musician place

http://dbpedia.org/resource/Federico_Garc%C3%ADa_Lorca http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Trinidad_Jim%C3%A9nez http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Ibn_Tufail http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Fran_Perea http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Ver%C3%B3nica_S%C3%A1nchez http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Berni_Rodr%C3%ADguez http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Jos%C3%A9_Celestino_Mutis http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Pepe_Marchena http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Antonio_de_Olivares http://dbpedia.org/resource/Andalusia

http://dbpedia.org/resource/Tanya_Anne_Crosby http://dbpedia.org/resource/Andalusia
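
The pattern-matching idea behind such a query can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the triples are made-up short names rather than real DBpedia URIs, the `match` function is hypothetical, and a real SPARQL engine does far more (joins over several patterns, filters, indexes).

```python
# Tiny in-memory triple store; the names are illustrative, not real DBpedia data.
TRIPLES = [
    ("Tom", "sex", "man"),
    ("Tom", "birthPlace", "Andalusia"),
    ("Ana", "birthPlace", "Andalusia"),
]


def is_variable(term):
    """Terms that start with '?' are variables, as in SPARQL."""
    return term.startswith("?")


def match(pattern, triples):
    """Return one binding dict per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if is_variable(want):
                # A variable that appears twice must bind to the same value.
                if binding.get(want, have) != have:
                    break
                binding[want] = have
            elif want != have:
                break
        else:
            results.append(binding)
    return results
```

Calling `match(("?person", "birthPlace", "?place"), TRIPLES)` plays the role of the WHERE clause above: constants must be equal, and variables are bound to whatever value appears in their position.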

SPARQL Endpoint query

specific protocol and data format. [19] A SPARQL endpoint enables users to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats such as XML or JSON. For simplicity, we can say that a SPARQL endpoint is the place where you send your SPARQL query and receive the result. Commonly used SPARQL endpoints are listed below (SparqlEndpoints, 2013):

Data Source Endpoint Address

DBpedia http://dbpedia.org/sparql

U.S. Census http://www.rdfabout.com/sparql

FactForge http://factforge.net/sparql

data.gov.uk http://data.gov.uk/sparql

In our project, we query the biographical information of the patent inventor from DBpedia through its SPARQL endpoint. The information about a certain person is the same as what we see in Wikipedia, but it is in a different format. For example, for our professor David A. Patterson, the Wikipedia page and DBpedia page are shown below. They represent the same information quite differently. In DBpedia, the data is machine-readable: each property is listed on the left side with its value next to it. We just need to select the properties we need in the SPARQL query to get the corresponding values conveniently.

Fig 3: Screenshot of an example of Wikipedia
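
As an illustration of selecting properties for one inventor, the sketch below builds a SPARQL query string in Python. It is not the query used in our prototype. The function name is hypothetical, and the property names (dbo:abstract, dbo:almaMater, dbo:doctoralAdvisor, dbo:thumbnail) and the lookup by rdfs:label are assumptions about the DBpedia data that should be checked against the person's DBpedia page.

```python
def build_inventor_query(full_name, lang="en"):
    """Build a SPARQL SELECT query for one person's biography fields."""
    # Escape backslashes and quotes so the name is a valid SPARQL string literal.
    safe_name = full_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?abstract ?almaMater ?advisor ?thumbnail
WHERE {{
  ?person rdfs:label "{safe_name}"@{lang} ;
          dbo:abstract ?abstract .
  OPTIONAL {{ ?person dbo:almaMater ?almaMater . }}
  OPTIONAL {{ ?person dbo:doctoralAdvisor ?advisor . }}
  OPTIONAL {{ ?person dbo:thumbnail ?thumbnail . }}
  FILTER (lang(?abstract) = "{lang}")
}}
LIMIT 1
""".strip()
```

The OPTIONAL blocks keep a person in the result even when a property such as the doctoral advisor is missing, which matters because not every DBpedia entry has every field.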

HTTP request

The Hypertext Transfer Protocol (HTTP) works as a request-response protocol between a client and a server. An HTTP request consists of a request method, a request URL, header fields and a body. The request methods include GET, HEAD, POST, PUT, DELETE, OPTIONS and TRACE. [20] The two most commonly used methods are GET and POST. While the two methods have similar functions, GET requests data from a specified resource, whereas POST submits data to be processed by a specified resource. We use the POST method here to avoid caching. In our case, the client is the search interface, which uses JavaScript to submit an HTTP request containing the SPARQL query to the server endpoint. The server then returns a response to the client. The response contains status and content information about the request. Because JavaScript is not well suited to dealing with RDF data, we set the return format to JSON.
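
The shape of such a request can be sketched with Python's standard library. Our prototype sends the request from JavaScript, so this is only an illustration of the same idea in another language; the function name is hypothetical, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"  # DBpedia endpoint address from the table above


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a POST request that asks for JSON results."""
    # The query travels in the body as a URL-encoded form field named "query".
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual network call; the Accept header is what asks the endpoint for JSON instead of its default output.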

User Interface design

The User Interface (UI) design for our prototype is simple and clean. It looks like a simplified Wikipedia. We query both the Patent Data and the Linked Data Cloud and display the output in the interface. The interface is structured as the patent information surrounded by information about the inventor of the patent. A screenshot is shown below:

Fig 5: Screenshot of Patent Search Interface

The left side contains the inventor’s basic information, including the profile picture, workplace, alma mater and doctoral advisor. The upper right side is a biography of the inventor, followed by the inventor’s patent information retrieved from the relational database. Clicking a link on the left side leads to the corresponding Wikipedia page for more information. The UI design emphasizes the patent part while surrounding it with the relevant information.

The procedure

The procedure works as follows:

On the client side, when a user searches for a keyword, an HTTP request message is sent to the DBpedia web server. We wrote a wrapper class, “SPARQLWrapper.js”, in JavaScript that is similar to the SPARQL Endpoint interface to Python. [21]

The SPARQL endpoint is http://dbpedia.org/sparql. We send the request with the searched title and some properties, such as abstract and workplaces, to the server endpoint. By default it returns an HTML page, which is not what we need. So we set the Accept field in the request header to specify the return data type, which here is JSON. We use the GET and POST methods to send the SPARQL query.

The web server then provides the resources and returns a response message to the client. The response message is read by JavaScript, written into HTML and displayed in the User Interface.
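
The step of reading the response and writing it into HTML can be sketched as follows. The sketch assumes the response follows the standard SPARQL JSON results layout (a "head" with "vars" and a "results" object with "bindings"). The function names and the HTML markup are hypothetical, and our prototype performs this step in JavaScript rather than Python.

```python
import json
from html import escape


def parse_sparql_json(text):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    document = json.loads(text)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # A variable is absent from a binding when it is unbound (e.g. OPTIONAL).
        rows.append({
            name: binding[name]["value"] if name in binding else None
            for name in variables
        })
    return rows


def render_row(row):
    """Render one row as a minimal HTML definition list (hypothetical markup)."""
    items = "".join(
        f"<dt>{escape(key)}</dt><dd>{escape(value or '')}</dd>"
        for key, value in row.items()
    )
    return f"<dl>{items}</dl>"
```

Escaping the values before writing them into the page keeps text from the remote endpoint from being interpreted as markup.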

For the Patent Data part, there are two main approaches. One approach is to keep the Patent Data in a relational database and query it locally. The other approach is to convert it to RDF format and store it in a triple store or even publish it on the web. The first approach is efficient because we only need to look up the patent information for the search keyword, and a relational database is convenient for that. The bottleneck is how to store the data: the whole dataset could be saved locally or uploaded to Google Datastore.

The second approach is more complex because we need to pre-process the whole dataset and convert it to RDF format. Since the Patent Data is quite large, many existing tools such as Google Refine cannot hold such a large amount of data. The advantage of the second approach is that the Patent Data can be interlinked with other Linked Data, which makes the Patent Data more widely available.
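
To give a feel for what the conversion in the second approach involves, the sketch below turns one relational row into RDF statements in the N-Triples syntax. Everything specific in it is an assumption for illustration: the example.org namespaces, the column layout (patent number, title, inventor) and the function names are hypothetical and do not describe the actual Patent Inventor Database schema.

```python
from urllib.parse import quote

BASE = "http://example.org/patent/"   # hypothetical namespace for patent URIs
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary for predicates


def literal(value):
    """Serialize a Python string as an N-Triples string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def row_to_ntriples(row):
    """Turn one (patent_number, title, inventor) row into N-Triples lines."""
    patent_number, title, inventor = row
    subject = f"<{BASE}{quote(patent_number)}>"
    return [
        f"{subject} <{VOCAB}title> {literal(title)} .",
        f"{subject} <{VOCAB}inventor> {literal(inventor)} .",
    ]
```

Each row becomes one triple per column, so the output grows with both the number of rows and the number of columns, which is part of why converting the full dataset is costly.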

Since the large amount of data is always a problem, we will begin with a small subset and go from there. For example, we can use the Patent Data from Berkeley

Discussion

In this section, I mainly discuss the use case in which we bring Linked Data into Patent Data search. I also discuss how Linked Data helps in patent search, what its limitations are and how Linked Data can be used in a broader context. Finally, I evaluate the User Interface of the search engine and test it with real users.

Results

Explanation of Results

For our Capstone Project, we explore the potential use of Linked Data to help Big Data analysis, and we build a patent search engine based on these two concepts. Linked Data has many advantages: it is highly structured, machine-readable and interlinked across different data sources. We take advantage of the structured format of Linked Data and use it to expand the search results for patents and add more value to them. Our prototype supports the hypothesis that Linked Data works in this situation, and it suggests many other applications.

Fig 6: Screenshot of Patent Search Result

The screenshot shows that associated information has been added to the patent search result. Here we add Wikipedia information about the inventor. In this way, a user can easily identify the exact inventor by looking at the biography or at related information such as workplace, alma mater and doctoral advisor. This helps disambiguate patents, since many people share the same name but work in different areas and hold totally different patents.

In addition, users can search for the patents of co-workers by clicking their names on the page. If users are interested in the workplace or alma mater, they can click the link, which leads them to the Wikipedia page of that item.

With the help of Open Linked Data, we have a new kind of patent association search that disambiguates the patent search and provides a broader context of patent-related information.

What is different

Our implementation differs from our initial ideas in several ways. First, we keep the patent data in a relational database and query it with SQL rather than converting it into RDF format. We did build a small prototype that converts the data using Google Refine, but the conversion becomes really complex with a large amount of data, and converting the data format is not necessary in our use case. So we decided to query the relational database directly and combine the result with inventor information from Linked Data.

We also decided to store the patent data locally and use PHP to query the relational database and send the result back to the client side in JSON format. We found this to be the most efficient way at this stage. If time permits, we would put the data on a cloud server so that we can run the search engine remotely.
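
The combination of a relational lookup with Linked Data fields can be sketched as below. Our prototype does this in PHP; the sketch uses Python with an in-memory SQLite database purely for illustration. The table layout, the column names, the sample rows and the function names are all made up and do not reflect the real patent database.

```python
import json
import sqlite3


def find_patents(connection, inventor_name):
    """Look up patents by exact inventor name with a parameterized query."""
    cursor = connection.execute(
        "SELECT patent_number, title FROM patents "
        "WHERE inventor = ? ORDER BY patent_number",
        (inventor_name,),
    )
    return [{"patent_number": number, "title": title} for number, title in cursor]


def combined_result(connection, inventor_name, linked_data):
    """Merge relational patent rows with inventor facts from Linked Data."""
    return json.dumps(
        {
            "inventor": inventor_name,
            "profile": linked_data,  # e.g. fields returned by the DBpedia query
            "patents": find_patents(connection, inventor_name),
        },
        sort_keys=True,
    )


def demo_connection():
    """Create an in-memory database with made-up rows for illustration."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE patents (patent_number TEXT, title TEXT, inventor TEXT)"
    )
    connection.executemany(
        "INSERT INTO patents VALUES (?, ?, ?)",
        [
            ("0000001", "Example widget", "Jane Doe"),
            ("0000002", "Example gadget", "John Roe"),
        ],
    )
    return connection
```

The exact-match WHERE clause mirrors the behavior of the prototype, in which the user has to type the inventor's full name; the parameterized query also keeps user input out of the SQL text.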

Limitation of this approach

There are also some limitations of our patent search.

Firstly, we assume that the inventor has a Wikipedia page so that we can find the corresponding information in DBpedia. However, this is not always the case for all the people who hold patents. In such cases, we cannot find their information in the Linked Data Cloud, which is a problem for the search.

Secondly, the user needs to type the full name of the inventor in order to match both the name in DBpedia and the inventor name in the patent database. Compared with Google Patent Search, this is limiting, because Google can find a lot of information based on ranking even if we do not type the full name.

Thirdly, we use the patent data in its original format and run two queries, one against DBpedia and one against the relational database. This does not make the best use of Linked Data, because the advantage of Linked Data over other formats is that everything is in the same format and different datasets can be interlinked. It would be better to convert the patent data into RDF format and even publish it in the Open Linked Data Cloud. The patent data would then be interlinked with all the other data sources in the cloud and make better use of the Linked Data concept.

Evaluation

In the evaluation part, I mainly discuss the User Interface we built for patent search and the effectiveness and convenience of the search experience for real users.

User Study

Most of them think that the patent association search result is better than that of the traditional approach. When they search for patents, they often wonder whether they have found the right one. With our prototype they can easily get information about the inventor and therefore gain a correct and comprehensive understanding of the information they retrieve.

They think that our patent search has clear output with the associated information and that it can also run related searches. But they also point out a limitation of the approach: we only have basic information about the patent itself. If users would like to know more details of the patent, we cannot provide them because that information is not in the Patent Database.

Heuristic Evaluation

We examine our User Interface with the well-known 10 Usability Heuristics introduced by Jakob Nielsen. Heuristic evaluation is a usability engineering method for finding the usability problems in a user interface design. [22] A small set of evaluators examine the interface against the recognized usability principles and score each principle from one to ten, and we combine the results of the evaluation.

We asked our users to go through a set of tasks we designed in our search interface, explained the goals of the system to the evaluators and allowed them to do their own tasks. After that, they filled out the Heuristic Evaluation sheet.

Heuristic Evaluation principles    Points (1-10)    Comments

Visibility of system status

Match between system and the real world

User control and freedom

Consistency and standards

Error prevention

Recognition rather than recall

Flexibility and efficiency of use

Aesthetic and minimalist design

Help users recognize, diagnose, and recover from errors

Help and documentation

We analyzed the results that the real users provided and explain the evaluation result below. A principle is rated Good if the average score is more than 6 out of 10; otherwise it needs to improve.
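
The rating rule can be written down as a small Python sketch. The function name is hypothetical and any individual scores passed to it are placeholders, since only the averages are reported here.

```python
from statistics import mean

THRESHOLD = 6  # a principle is "Good" when its average is above 6 out of 10


def rate(scores):
    """Average the evaluators' scores and apply the threshold rule."""
    average = mean(scores)
    label = "Good" if average > THRESHOLD else "Need to improve"
    # Report the average with one decimal place, as in the list of results.
    return round(average, 1), label
```

Note that an average of exactly 6 is rated "Need to improve", because the rule requires the average to be more than 6.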

(1). Visibility of system status: Good (8.7)

Our interface has a clear layout, and different components do not overlap when displayed. Users can easily see whether they have obtained the search result and what the information looks like.

(2). Match between system and the real world: Good (8.2)

(3). User control and freedom: Good (7.1)

Users can search for a new patent by using the textbox in the upper left corner or by simply clicking information on the page.

(4). Consistency and standards: Need to improve (5.8)

The search textbox only supports searching for existing patent numbers and some inventor information, so users may be confused at first about what they should enter.

(5). Error prevention: Need to improve (5.0)

We did not build auto-completion or auto-correction, so users need to type correctly in order to get a result.

(6). Recognition rather than recall: Good (7.5)

We have minimized the user’s memory load by making objects and actions visible. Users don’t have to remember information; they can just click in the previous result.

(7). Flexibility and efficiency of use: Good (7.2)

The differences between novice and expert users will not be large because the search feature requires no complicated actions.

(8). Aesthetic and minimalist design: Good (6.8)

The interface contains the most relevant and needed information and gives extra information low visibility.

(9). Help users recognize, diagnose, and recover from errors: Need to improve (5.7)

If users type a name that does not exist in Wikipedia or make a typo, there is no error message to indicate the problem precisely.

(10). Help and documentation: Need to improve (5.5)

We did not implement documentation to help users understand the functionality of the search engine. Normally people will understand it because the interface looks like other search engines.

Future Work

Enriching the functionality of the Patent Search

So far we have only focused on how to combine Linked Data and relational data to make the patent search more convenient, so we only use limited information collected from a single source in the Open Linked Data Cloud. In fact, there are many more things we can do to enrich the functionality of the Patent Search. For example, we can obtain the geographic information in the Patent Data and visualize it using the Geo Names data from the Linked Data Cloud. We can even visualize a Patent Search Graph to show the relationships between different inventors and their patents more explicitly.

Querying a Collection of Datasets in Linked Data

For this project we query data only from DBpedia. Because Linked Data is interlinked, however, we may be able to query a collection of datasets through an existing SPARQL endpoint that provides access to copies of the relevant datasets. For example, OpenLink Software hosts a majority of the datasets from the LOD cloud behind a SPARQL endpoint [23].
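As a rough sketch of what querying more than one endpoint could look like, the snippet below prepares the same query for two endpoints without sending it. The helper names (buildSparqlRequest, inventorQuery), the example inventor, and the choice of the rdfs:label and dbo:abstract properties are assumptions made for illustration. The second endpoint URL is assumed to be the OpenLink LOD cloud cache and should be verified before use.

```javascript
"use strict";

// Describe (but do not send) an HTTP POST request for a SPARQL query.
// Hypothetical helper, written to mirror the headers used in SPARQLWrapper.js.
function buildSparqlRequest(endpoint, query) {
  return {
    url: endpoint,
    method: "POST",
    headers: {
      "Content-Type": "application/x-www-form-urlencoded",
      "Accept": "application/sparql-results+json"
    },
    // encodeURIComponent escapes '&', '=', '+' and '#', all of which
    // can occur inside a SPARQL query or an IRI.
    body: "query=" + encodeURIComponent(query)
  };
}

// Build a lookup query for a person by English label.
// The properties used here are illustrative, not the prototype's actual query.
function inventorQuery(name) {
  // Escape backslashes and double quotes so the name is a valid SPARQL literal.
  var escaped = name.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return 'SELECT ?person ?abstract WHERE { ' +
         '?person <http://www.w3.org/2000/01/rdf-schema#label> "' + escaped + '"@en . ' +
         '?person <http://dbpedia.org/ontology/abstract> ?abstract . ' +
         'FILTER (lang(?abstract) = "en") } LIMIT 5';
}

// Assumed endpoint list: DBpedia plus the OpenLink LOD cloud cache.
var ENDPOINTS = [
  "https://dbpedia.org/sparql",
  "http://lod.openlinksw.com/sparql"
];

// One request description per endpoint, all carrying the same query.
var requests = ENDPOINTS.map(function (endpoint) {
  return buildSparqlRequest(endpoint, inventorQuery("Nikola Tesla"));
});

console.log(requests.length + " requests prepared");
```

Each request description could then be sent with XMLHttpRequest, as in the Appendix, and the result sets merged on the client side.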

Applying the concept to other topics

Currently we combine the patent data with DBpedia from the Linked Data Cloud. There are many other sources in the Linked Data Cloud that we could use, such as GeoNames data, IMDB data, BBC Music and so on. We may make use of these sources and find other applications. For example, we could search for a certain singer and get the relevant biographical information along with their albums and songs from different data sources.

Conclusions or Impact Statement

Our capstone project is a research project that explores the potential use of Linked Data in Big Data. We have done some research on Big Data, surveying the existing approaches to analyzing Big Data and their strengths and weaknesses. We concluded that highly structured Linked Data might be a potential solution for analyzing unstructured Big Data and for extracting more of the value behind it.

Based on that, we are building a search engine to demonstrate how Linked Data helps in Big Data analysis by expanding the Patent Data with the Open Linked Data Cloud. In this way, we may be able to find information associated with a patent through the Linked Data Cloud and combine it with the patent search to give a comprehensive answer.

Although we have learned a lot about the mechanism of Linked Data and used it in our prototype, there remains more to learn. For example, we query only a single source in the Linked Data Cloud; we could explore multiple queries across different sources, or directly convert the Patent Data into RDF format and publish it in the Linked Data Cloud.
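To give an idea of what converting the Patent Data into RDF might involve, the sketch below serializes one patent record as N-Triples. It is an illustration only: the record fields (number, title, inventors), the example.org namespaces, and the function names are all hypothetical, and a real conversion would need an agreed vocabulary and links to existing resources such as DBpedia.

```javascript
"use strict";

// Hypothetical namespaces, for illustration only.
var PATENT_BASE = "http://example.org/patent/";
var PATENT_VOCAB = "http://example.org/vocab#";

// Escape a value for use inside a double-quoted N-Triples literal.
function escapeLiteral(value) {
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// Serialize one patent record as N-Triples:
// one triple for the title and one per inventor name.
function patentToNTriples(patent) {
  var subject = "<" + PATENT_BASE + encodeURIComponent(patent.number) + ">";
  var lines = [
    subject + " <" + PATENT_VOCAB + "title> \"" + escapeLiteral(patent.title) + "\" ."
  ];
  patent.inventors.forEach(function (name) {
    lines.push(
      subject + " <" + PATENT_VOCAB + "inventor> \"" + escapeLiteral(name) + "\" ."
    );
  });
  return lines.join("\n");
}

// Made-up sample record; it does not come from the Patent Data.
var samplePatent = {
  number: "US1234567",
  title: "A \"smart\" widget",
  inventors: ["Ada Example", "Bob Example"]
};

var sampleTriples = patentToNTriples(samplePatent);
console.log(sampleTriples);
```

Inventor names are kept as plain literals here; linking them to DBpedia resources would be the step that actually connects the patents to the Linked Data Cloud.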

The strength of Linked Data is its structured and uniform format, which lets information be shared among different datasets and read automatically by computers. Yet we still need to address drawbacks such as complicated pre-processing procedures and the question of how to protect data that is available on the web.

Our prototype has proven that linked data has many advantages and can be used in data analysis in different situations. We can see a bright future for making better use

Bibliography

1. Ian Mitchell, Mark Wilson. Linked Data: Connecting and exploiting big data. London : Fujitsu UK, 2012.

2. Dumbill, Edd. What is big data? An introduction to the big data landscape. [Online] January 11, 2012. http://strata.oreilly.com/2012/01/what-is-big-data.html.

3. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. Big Data: The next frontier for innovation, competition, and productivity. s.l. : McKinsey Global Institute, 2011. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.

4. Roebuck, Kevin. Big Data: High-impact Strategies – What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. s.l. : Lightning Source Incorporated, 2011.

5. Richard Cyganiak, Anja Jentzsch. Linking Open Data cloud diagram. [Online] 2011. http://lod-cloud.net/.

6. Hopkins, Brian and Evelson, Boris. Expand Your Digital Horizon with Big Data. s.l. : Forrester, 2011.

7. Oreskovic, Alexei. YouTube, Google Inc's video website, is streaming 4 billion online videos every day, a 25 percent increase in the past eight months, according to the company. [Online] Jan. 23, 2012. [Cited: Nov. 30, 2012.] http://www.reuters.com/article/2012/01/23/us-google-youtube-idUSTRE80M0TS20120123.

8. twittersearch. The Engineering Behind Twitter’s New Search Experience. [Online] May 31, 2011. [Cited: Nov. 30, 2012.] http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.

9. O'Brien, Kevin. Why Media Literacy? A Catholic Reflection. [Online] [Cited: Nov. 30, 2012.] http://www.medialit.org/reading-room/why-media-literacy-catholic-reflection.

10. IDC European Software Predictions. Woodward, Alys, et al. 2012, IDC.

11. IDC Worldwide Big Data Taxonomy. Woo, Benjamin, et al. 2011.

12. Dean, Jeffrey and Ghemawat, Sanjay. 2004, OSDI, p. 13.

16. DuCharme, Bob. Learning SPARQL. s.l. : O'Reilly, 2011.

17. Berners-Lee, Tim. Linked Data Design Issues. [Online] June 18, 2009. http://www.w3.org/DesignIssues/LinkedData.html.

18. Matthews, Andrew. Understanding SPARQL. [Online] 2008. http://www.ibm.com/developerworks/xml/tutorials/x-sparql/section3.html.

19. SPARQL endpoint. [Online] 2011. http://semanticweb.org/wiki/SPARQL_endpoint.

20. HTTP Requests. [Online] http://docs.oracle.com/javaee/1.4/tutorial/doc/HTTP2.html.

21. Ivan Herman, Sergio Fernandez, Carlos Tejo. SPARQL Endpoint interface to Python. [Online] 2008. http://sparql-wrapper.sourceforge.net/.

22. Nielsen, Jakob. 10 Usability Heuristics for User Interface Design. [Online] 1995. http://www.nngroup.com/articles/ten-usability-heuristics/.

23. Hartig, Olaf. Querying Linked Data with SPARQL. [Online] 2009. http://www.slideshare.net/olafhartig/querying-linked-data-with-sparql.

24. Public Data Sets on AWS. [Online] http://aws.amazon.com/publicdatasets.

25. SparqlEndpoints. [Online] 2013. http://esw.w3.org/topic/SparqlEndpoints.

Appendix

Here I list the code described in the Methodology section.

SPARQLWrapper.js

(function(root, factory) {
    if (typeof define === "function") {
        define("SPARQLWrapper", factory); // AMD || CMD
    } else {
        root.SPARQLWrapper = factory(); // <script>
    }
}(this, function() {
    'use strict';

    function SPARQLWrapper(endpoint) {
        this.endpoint = endpoint;
        this.queryPart = "";
        this.type = "json";
    }

    SPARQLWrapper.prototype = {
        constructor: SPARQLWrapper,

        setQuery: function(query) {
            // encodeURIComponent also escapes '&', '+', '=' and '#',
            // which encodeURI would leave intact in the form body
            this.queryPart = "query=" + encodeURIComponent(query);
        },

        setType: function(type) {
            this.type = type.toLowerCase();
        },

        // Call as query(callback) or query(type, callback)
        query: function(type, callback) {
            callback = callback === undefined ? type : this.setType(type) || callback;

            var xhr = new XMLHttpRequest();
            xhr.open('POST', this.endpoint, true);
            xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');

            // Map the short type name to the SPARQL results media type
            switch (this.type) {
                case "json":
                    type = "application/sparql-results+json";
                    break;
                case "xml":
                    type = "application/sparql-results+xml";
                    break;
                default:
                    type = "application/sparql-results+json";
                    break;
            }
            xhr.setRequestHeader("Accept", type);

            xhr.onreadystatechange = function() {
                if (xhr.readyState == 4) {
                    var sta = xhr.status;
                    if (sta == 200 || sta == 304) {
                        callback(xhr.responseText);
                    } else {
                        console && console.error("Sparql query error: " + xhr.status + " " + xhr.responseText);
                    }
                    // Release the request object once the handler has finished
                    window.setTimeout(function() {
                        xhr.onreadystatechange = new Function();
                        xhr = null;
                    }, 0);
                }
            };
            xhr.send(this.queryPart);
        }
    };

    return SPARQLWrapper;
}));
