
How structured data (Linked Data) help in Big Data Analysis --- Expand Patent Data with Linked Data Cloud

Lishan Zhang

Electrical Engineering and Computer Sciences

University of California at Berkeley

Technical Report No. UCB/EECS-2013-96

http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-96.html

May 17, 2013

Copyright © 2013, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.


M.Eng  Program  

Lishan  Zhang  

24106243  


Outline  

 

Abstract
Introduction
Literature Review
Unveil the underlying information among Big data
Previous solutions
Approaches
Conclusion
Methodology
SPARQL: query language for RDF data
SPARQL Endpoint query
HTTP request
User Interface design
Discussion
Results
Explanation of Results
Evaluation
User Study
Heuristic Evaluation
Future Work
Conclusions or Impact Statement
Bibliography
Appendix

Abstract

Big Data is currently a major topic. The term is commonly used to describe data that exceeds the processing capacity of on-hand database management tools. We often use the 4Vs (Volume, Variety, Velocity and Value) to describe its characteristics. Big Data can be structured or unstructured data with potential value behind it. It is of vital importance to extract and analyze the valuable information in Big Data.

On the other hand, Linked Data is a new concept for most people. Linked Data refers to a collection of interrelated datasets that can be published and shared on the web. Unlike Big Data, Linked Data is highly structured. It is used to build the Semantic Web, in which a huge amount of data on the web is available in a standard format. These technologies enable people to answer more advanced analytical questions by querying the data and drawing inferences using vocabularies.

In our project, we would like to explore the potential use of Linked Data in analyzing Big Data. We will build a search engine that combines information from Linked Data with patent data to see whether we can dig out more information about each patent. There is already a huge Linked Data cloud that contains a large amount of published open data, and we see potential in connecting this public data with patent data to answer advanced questions. When a user searches for an inventor name or a certain patent in the search interface, we query the Linked Data Cloud and the patent database separately and return the results. In this way, we can combine the patent itself with

Introduction

Nowadays, we are generating far more data than at any other point in history.

The explosion of data is driven by two particular sources: social networks sharing information about our activities, and a variety of sensors collating information on our environment. [1]

Needless to say, there could be priceless value hidden in this booming data. If we make good use of it, we may discover valuable information and patterns inside the data. However, it will also become a threat if we cannot handle this ever-increasing amount of data.

Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. [2] We often characterize Big Data by four main attributes: Volume, Velocity, Variety and Value. Big Data can be structured or unstructured data with potential value behind it. The McKinsey Global Institute describes Big Data as “The next frontier for innovation, competition and productivity.” [3] But processing these big raw datasets poses challenges in both data management and algorithms. It is of vital importance to extract and analyze the valuable information in Big Data.

The major difficulties in processing Big Data include capture, storage, search, sharing, analytics and visualization. [4] There are already several approaches to analyzing Big Data. For example, MapReduce is a programming model and an implementation for processing and generating large data sets. It runs on a large

Data. Some institutes and companies have also developed their own mathematical models and algorithms to dig out useful information from Big Data.

We will mainly focus on the variety of data in this thesis. Variety means that Big Data has different types of data and various degrees of structure that do not fit into neat relational structures. It is a mix of structured, semi-structured and unstructured data such as text, sensor data, video, log files and more. Such data cannot be integrated into an application directly. [2]

The current approaches to Big Data, such as MapReduce and NoSQL, emphasize the ability to deal with volume and velocity. In this paper, we take a different approach: we are concerned with the variety of Big Data. Since most data is unstructured, it is hard to interlink different datasets and create valuable context from them. We see potential value in linking different datasets and expanding the value of a single dataset with the help of Linked Data.

Linked Data is used to organize and publish highly structured data with globally unique identifiers, which makes it easy to combine various datasets. Richard Cyganiak and Anja Jentzsch created the Linking Open Data cloud diagram, which shows the datasets that have been published on the web. [5] As the Linked Data cloud grows constantly, data integration is becoming more important in this field.

Fig 1: The Linking Open Data cloud diagram

In this paper, we try to identify the potential use of Linked Data in Big Data analysis by building a prototype of our concept. We use the U.S. utility patent dataset and link it with the public Linked Data cloud. We will build a search engine for Patent Graph search that queries Linked Data Cloud endpoints such as DBpedia and Freebase, simultaneously queries the patent dataset with SQL, and shows the combined results in the interface. The diagram below illustrates the querying process:

Fig 2: The querying process of Patent Search Engine

In this way we can add more related information about a patent and even provide some recommendations for patent search. We expect this interconnection to create considerable value, and Linked Data is likely to become valuable in Big Data analysis.

Literature Review

Unveil the underlying information among Big data

Big Data has become one of the hottest topics in the industry. In this data-booming world, some traditional technologies can no longer serve the need to analyze large volumes of data. New approaches must be introduced in order to keep up with the pace of Big Data. The Linked Data concept is a useful way to unveil useful information, especially from data on the Internet.

Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. We are generating much more data than before with the boom of social networks and media, mobile devices, Internet transactions, and networked devices and sensors.

Big Data is too big, too fast and does not fit conventional database architectures. Due to the unique nature of Big Data, the first question we need to answer is whether we can find an alternative way to process the data. More importantly, can we dig useful information out of Big Data?

Big Data requires exceptional technologies to efficiently process large quantities of data. A huge amount of valuable patterns and information is hidden in Big Data and needs to be extracted. Usually, four characteristics are discussed when it comes to Big Data: Volume, Velocity, Variety and Value (4V) [6].

Volume and Velocity

In this data-booming world, data grows exponentially. In particular, with the increasing popularity of social media, user-generated content has started to dominate. For example, roughly 60 hours of video are uploaded to YouTube every minute [7]. It is also astonishing that over 340 million tweets were generated daily as of May 2012 [8]. To put this in perspective, the amount of information in the world doubles every five years [9]. There is more information in the daily edition of The New York Times than an individual man or woman in the 16th century had to process in their whole life.

Huge amounts of data require tremendous storage space and extremely fast processing. This has always been challenging for any company, government or individual.

Variety and Value

Big Data relates not just to new information sources: it is equally applicable to gaining new insights from data that was previously inaccessible and to accelerating and easing existing analytical processes [10]. In fact, most Big Data is of low value until rolled up and analyzed, at which point it becomes valuable.

This is challenging due to Big Data’s variety. Big Data has different structures and shapes, which makes it very difficult to analyze with traditional technologies such as MySQL or Oracle. Integrating these data sources is a very expensive operation [11]. In addition, correlating different pieces of data and reconnecting them to make them more valuable, readable and accessible has always been an interesting problem.

 

Previous solutions

Previously, there have been several ways of processing and analyzing Big Data. Usually, they utilize advanced hardware and parallel processing techniques to break the speed bottleneck. Others have employed non-relational data storage systems to deal with unstructured and semi-structured Big Data. Meanwhile, a lot of companies have been trying to apply unique mathematical models, advanced analytics and data visualization technology to dig insights out of Big Data.

Approaches

MapReduce

MapReduce is a breakthrough concept announced by Google. It is a programming model and an implementation for processing and generating large data sets [12]. It is able to run on a large cluster of machines and is highly scalable.

MapReduce is not only successful at Google; an open-source implementation is also available to the public under the name Hadoop, a highly scalable compute and storage platform [13]. Hadoop breaks a huge chunk of data into pieces and processes them in parallel.
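
As a rough illustration of the programming model only (not of Google's or Hadoop's actual implementation), the map, shuffle and reduce phases can be mimicked inside a single Python process. The word-count task and all function names below are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values that share a key into one result."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster, map and reduce calls would run on many machines.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "linked data", "Big Data analysis"])
```

Because each call to the map function sees only one record and each call to the reduce function sees only one key, the work can be spread across machines without the functions needing to know about each other.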

NoSQL

The term NoSQL was first used by Carlo Strozzi for a database that did not expose the standard SQL interface [14]. NoSQL databases work in conjunction with Hadoop to serve up discrete data stored among large volumes of multi-structured data to end-user and automated Big Data applications [15].

Digging useful information

Various companies have taken action to dig useful information out of the varied data on the web. For example, Splunk is a small company that has been in the business for less than 5 years. Splunk’s mission is to make ambiguous Big Data more readable, useful and valuable to everyone. For example, one of its partners, Amazon, is asking Splunk to find out the habits of its customers.

Another company, Jive, is a software company in the social business software industry. It is also trying to help its customers consolidate the Big Data they are dealing with. One example is price information for merchandise: what price should be set in order to be the best price.

Downsides

However, none of these approaches is perfect. For example, Hadoop is a very young technology and is still developing. A Hadoop system is very hard to manage, and it does not support real-time data processing and analysis.

The downside of NoSQL, on the other hand, is that most NoSQL databases trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. NoSQL also suffers from its ‘youth’: it has no mature management and monitoring tools.

 

Conclusion

Key results

Big Data holds tremendous value, and it will be beneficial to understand what it really means. Many new technologies, such as MapReduce and NoSQL, have been applied to this problem. However, it is never safe to say that we already have the perfect tools for the job. As data continues to grow exponentially, new technology such as Linked Data may become key to the next generation of analytics platforms and data management systems.

Shortcomings

Linked Data applications usually follow different architectures and patterns. For instance, one pattern requires the data to be replicated, so applications may work with stale data. Another pattern, named the On-The-Fly Dereferencing pattern, works very slowly when dealing with complex operations.

Additional Work

fact that we are in a data-exploding era cannot be reversed. More and more data is coming to us, and technology must keep evolving in order to keep up with the pace.

Artificial intelligence can be applied when dealing with Big Data. A ‘databot’ that can crawl Linked Data, infer relationships, and figure out what information can be extracted would be useful.

Methodology

For this thesis, we build a use case in order to explore the potential use of Linked Data with patent data. More specifically, we build a search engine that we named “Patent Graph”. When people type a certain patent number or an inventor name, we show them relevant information such as the inventor’s picture, workplace, alma mater, doctoral advisor and biography. This information is obtained from DBpedia, which provides structured data extracted from Wikipedia. DBpedia makes this information available on the web so that people can easily link to the data. In addition, users can start a new search from a result simply by clicking the related information on the page. For example, if we are interested in a co-worker or the advisor of an inventor in the patent we searched for, we can just click the name, which returns a new search around that person and his or her patents. We can also provide recommendations based on the search results. If time allows, we would also like to convert the patent data into RDF format and publish it on the web so that more people can benefit from it. In this way, Linked Data helps us analyze the patent data by expanding our patent datasets with related data and finding more useful information.

The patent data that we use is the Patent Inventor Database from the Fung Institute. The database disambiguates all inventor names from the U.S. utility patent database from 1979 to 2010. The Linked Data we use is DBpedia. The DBpedia dataset extracts structured content from the information created by Wikipedia and it can be

Since we are building a search engine that extracts information from both the Linked Data Cloud and a relational database, we build a web service around it and use a Model-View-Controller (MVC) software architecture.

My part of the work includes implementing the search interface and querying the Linked Data Cloud. The techniques involved are SPARQL endpoint queries, HTTP requests and User Interface design.

SPARQL: query language for RDF data

Resource Description Framework (RDF) is a directed, labeled graph data format for describing resources on the web. It is designed to be read and understood by computers rather than people. Most RDF documents are written in XML, which can easily be exchanged between different computers and platforms. RDF is also a part of “The Semantic Web”. The Semantic Web is a set of standards and best practices for sharing data and the semantics of that data over the web for use by applications. [16] Rather than just putting data on the web, the Semantic Web is about making links so that a person or machine can explore the web of data. [17]

An RDF statement is a triple of the form (Subject, Predicate, Object) that uses uniform resource identifiers (URIs) to name the data objects. For example, to express “Tom is a man”, we represent it as Tom (Subject), sex (Predicate), man (Object). The data stored in the Linked Data Cloud is RDF data.

SPARQL stands for SPARQL Protocol and RDF Query Language. It is a standard query language designed for querying RDF databases. There are four different forms of SPARQL query (SELECT, CONSTRUCT, ASK and DESCRIBE), and the SELECT form is used most of the time. [18] The main idea of SPARQL is pattern matching, so it is easy to traverse relationships by querying collections of triples. The syntax of SPARQL is quite similar to SQL. A simple SPARQL query example is as follows:

PREFIX dbont: <http://dbpedia.org/ontology/>
SELECT ?musician ?place
WHERE {
  ?musician dbont:birthPlace ?place .
}

 

First we declare a namespace prefix, in this case http://dbpedia.org/ontology/. The query then matches every resource that has a birthPlace property and returns the resource as ?musician and its birth place as ?place. Note that the variable name alone does not restrict the matches to musicians. A partial result is shown below. We can run the example query against the DBpedia endpoint to get the full list.

musician    place
http://dbpedia.org/resource/Federico_Garc%C3%ADa_Lorca    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Trinidad_Jim%C3%A9nez    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Ibn_Tufail    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Fran_Perea    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Ver%C3%B3nica_S%C3%A1nchez    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Berni_Rodr%C3%ADguez    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Jos%C3%A9_Celestino_Mutis    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Pepe_Marchena    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Antonio_de_Olivares    http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Tanya_Anne_Crosby    http://dbpedia.org/resource/Andalusia
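
To make the idea of pattern matching over triples concrete, the following minimal sketch stores a few statements as Python tuples and matches a single triple pattern against them. It is a toy model, not how a SPARQL engine is implemented; the shortened names stand in for full URIs, and the `match` function is a hypothetical helper.

```python
# Toy triple store: each RDF statement is a (subject, predicate, object) tuple.
# The short names below are stand-ins for full URIs.
TRIPLES = [
    ("Tom", "sex", "man"),
    ("Fran_Perea", "birthPlace", "Andalusia"),
    ("Pepe_Marchena", "birthPlace", "Andalusia"),
]

def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits the pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break  # same variable would need two different values
            elif term != value:
                break  # constant term does not match this triple
        else:
            results.append(binding)
    return results

rows = match(("?musician", "birthPlace", "?place"), TRIPLES)
```

Each entry in `rows` corresponds to one row of a SELECT result, with the variable names as column headers.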

SPARQL Endpoint query

specific protocol and data format. [19] A SPARQL endpoint enables users to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats such as XML or JSON. For simplicity, we can say that a SPARQL endpoint is the place where you send your SPARQL query and receive the result. Commonly used SPARQL endpoints are listed below (SparqlEndpoints, 2013):

Data Source    Endpoint Address
DBpedia        http://dbpedia.org/sparql
U.S. Census    http://www.rdfabout.com/sparql
FactForge      http://factforge.net/sparql
data.gov.uk    http://data.gov.uk/sparql

In our project, we need to query the biographical information of a patent inventor from DBpedia through a SPARQL endpoint. The information about a person is the same as what we see in Wikipedia, but in a different format. For example, for our professor David A. Patterson, the Wikipedia page and the DBpedia page are shown below. They represent the same information quite differently. In DBpedia, the data is machine-readable: each property is listed on the left side with its value next to it. We just need to select the properties we need in a SPARQL query to get the corresponding values conveniently.
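
A query for an inventor's profile might be assembled along the following lines. This is a sketch under assumptions rather than the query used in the project: it assumes the person can be found by an exact `rdfs:label` match (which may fail when the DBpedia label differs from the patent record), and the selected `dbo:` properties are illustrative choices. The function name `build_inventor_query` is hypothetical.

```python
def build_inventor_query(full_name, lang="en"):
    """Build a SELECT query for a person's profile properties on DBpedia.

    Assumptions: the resource is located by an exact rdfs:label match, and
    the dbo: properties listed here are illustrative, not the project's list.
    """
    # Escape backslashes and double quotes so the name stays a valid literal.
    escaped = full_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?abstract ?almaMater ?advisor ?thumbnail
WHERE {{
  ?person rdfs:label "{escaped}"@{lang} .
  OPTIONAL {{ ?person dbo:abstract ?abstract .
             FILTER (lang(?abstract) = "{lang}") }}
  OPTIONAL {{ ?person dbo:almaMater ?almaMater . }}
  OPTIONAL {{ ?person dbo:doctoralAdvisor ?advisor . }}
  OPTIONAL {{ ?person dbo:thumbnail ?thumbnail . }}
}}
LIMIT 1"""

query = build_inventor_query("David A. Patterson")
```

The OPTIONAL blocks keep a result row even when a property is missing, which matters because not every person resource carries every property.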

Fig 3: Screenshot of an example of Wikipedia

HTTP request

The Hypertext Transfer Protocol (HTTP) works as a request-response protocol between a client and a server. An HTTP request consists of a request method, a request URL, header fields and a body. The request methods include GET, HEAD, POST, PUT, DELETE, OPTIONS and TRACE. [20] The two most commonly used methods are GET and POST. GET requests data from a specified resource, while POST submits data to be processed by a specified resource. We use the POST method here to avoid caching.

In our case, the client is the search interface, which uses JavaScript to submit an HTTP request containing the SPARQL query to the server endpoint. The server then returns a response to the client. The response contains status and content information about the request. Because JavaScript is not well suited to handling RDF data, we set the return format to JSON.
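
The shape of such a request can be sketched as follows. The project's client is written in JavaScript; Python's standard library is used here purely for illustration, and the helper name `build_sparql_request` is an assumption. The sketch only prepares the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"  # DBpedia endpoint from the table of endpoints

def build_sparql_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a POST request carrying a SPARQL query."""
    # The query travels in the body as a form-encoded 'query' parameter.
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        # Ask the endpoint for SPARQL results serialized as JSON.
        "Accept": "application/sparql-results+json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(endpoint, data=body, headers=headers, method="POST")

request = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
# urllib.request.urlopen(request) would send it; omitted to keep the sketch offline.
```

Setting the Accept header is what selects the JSON result format instead of the default page that an endpoint may return to a browser.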

 

User  Interface  design  

The User Interface (UI) design for our prototype is simple and clean. It looks like a simplified Wikipedia page. We query both the patent data and the Linked Data Cloud and display the output in the interface. The interface places the patent information at the center, surrounded by information about the inventor of the patent. The screenshot is shown below:

Fig 5: Screenshot of Patent Search Interface

The left side contains the inventor’s basic information, including profile picture, workplace, alma mater and doctoral advisor. The upper right side is a biography of the inventor, followed by his or her patent information obtained from the relational database. Clicking a link on the left side leads to the corresponding Wikipedia page for more information. The UI design emphasizes the patent part while arranging the relevant information around it.

 

The  procedure    

The  procedure  works  as  below:  

On the client side, when people search for a keyword, an HTTP request message is sent to the DBpedia web server. We wrote a wrapper class, “SPARQLWrapper.js”, in JavaScript that is similar to the SPARQL Endpoint interface to Python. [21]

The SPARQL endpoint address is http://dbpedia.org/sparql. We send the request with the searched title and some properties, such as abstract and workplaces, to the server endpoint. By default it returns an HTML page, which is not what we need, so we set the Accept field in the request header to specify the return data type; here we ask for JSON. We use the GET and POST methods to send the SPARQL query.

The web server then provides the resources and returns a response message to the client. The response message is read by JavaScript, written into HTML and displayed in the User Interface.
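
Reading such a response can be sketched as below. The layout with `head.vars` and `results.bindings` follows the standard SPARQL JSON results format; the sample document is hand-written for illustration and is not real endpoint output, and `flatten_bindings` is a hypothetical helper name. Python is used for illustration although the project reads the response in JavaScript.

```python
import json

def flatten_bindings(response_text):
    """Turn a SPARQL JSON result document into a list of plain dicts."""
    document = json.loads(response_text)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # A variable left unbound (e.g. by OPTIONAL) is absent from the binding.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows

# Hand-written sample in the standard result layout, not real endpoint output.
SAMPLE = """{
  "head": {"vars": ["person", "almaMater"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "http://example.org/Inventor_1"}}
  ]}
}"""

rows = flatten_bindings(SAMPLE)
```

Flattening the bindings into simple key-value rows makes it straightforward to fill the fields of the interface, with missing properties showing up as empty values.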

For the patent data, we have two main candidate approaches. One approach is to keep the patent data in a relational database and query it from the local database. The other approach is to convert it to RDF format and store it in a triple store, or even publish it on the web. The first approach is efficient because we just need to obtain the patent information for the search keyword, and a relational database is convenient for that. The bottleneck would be how to store the data: the whole dataset could be saved locally or uploaded to Google Datastore.

The second approach is more complex because we need to pre-process the whole dataset and convert it to RDF format. Since the patent data is quite large, many existing tools such as Google Refine cannot handle such a large amount of data. The advantage of the second approach is that the patent data can be interlinked with other Linked Data, making the patent data more widely available.
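
For the second approach, converting one relational record into RDF could look roughly like the sketch below, which emits N-Triples lines. The namespaces, the column names and the helper functions are all hypothetical; a real conversion would map columns onto an agreed vocabulary and link inventors to existing URIs.

```python
from urllib.parse import quote

BASE = "http://example.org/patent/"   # hypothetical namespace for patents
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary for columns

def escape_literal(text):
    """Escape characters that N-Triples does not allow raw inside a literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def row_to_ntriples(row):
    """Convert one patent record (a dict) into N-Triples statements."""
    subject = f"<{BASE}{quote(str(row['patent_number']))}>"
    lines = []
    for column, value in row.items():
        if column == "patent_number" or value is None:
            continue  # the key becomes the subject; empty cells yield no triple
        lines.append(f'{subject} <{VOCAB}{column}> "{escape_literal(str(value))}" .')
    return lines

# Made-up record with hypothetical column names.
triples = row_to_ntriples(
    {"patent_number": "1234567", "inventor": "Jane Doe", "year": 1999}
)
```

Processing the dataset row by row in this way avoids loading everything into one tool at once, which is the difficulty noted above for large datasets.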

Since the large amount of data is always a problem, we will begin with a small subset and go from there. For example, we can use the Patent Data from Berkeley

Discussion

In this section, I mainly discuss the use case in which we bring Linked Data into patent search. I also discuss how Linked Data helped in patent search, what the limitations are, and how Linked Data can be used in a broader context. I also evaluate the User Interface of the search engine and test it with real users.

Results

Explanation of Results

For our Capstone Project, we wanted to explore the potential use of Linked Data to help Big Data analysis, so we built a patent search engine based on these two concepts. Linked Data has many advantages: it is highly structured, machine-readable and interlinked across different data sources. We take advantage of the structured format of Linked Data and use it to expand patent search results and add more value to them. Our prototype supports the hypothesis that Linked Data works in this situation, which has many other implications.

 

Fig 6: Screenshot of Patent Search Result

From the screenshot we can see that associated information has been added to the patent search result. Here we add some Wikipedia information about the inventor. In this way a user can easily identify the exact inventor by looking at the biography or at related information such as workplace, alma mater and doctoral advisor. This helps disambiguate patents, since many people share the same name but work in different areas and hold totally different patents.

In addition, users can search for the patents of co-workers by clicking their names on the page. If users are interested in the workplace or alma mater, they can click the link, which leads them to the corresponding Wikipedia page.

With the help of Open Linked Data, we have a new kind of patent association search that disambiguates the patent search and provides a broader context of patent-related information.

 

What  is  different  

We   have   many   some   changes   compare   to   our   initial   ideas   in   our   implementation.   First  for  the  patent  data,  we  retain  its  format  as  relational  database  and  query  with   SQL   rather   than   converting   it   into   RDF   format.   Actually   we   have   worked   in   some   small   prototype   to   convert   the   data   using   Google   Refine.   But   it   becomes   really   complex  when  we  use  a  large  amount  of  data.  And  it  is  not  necessary  to  covert  data   format  in  our  use  case.  So  we  decided  to  query  the  relational  database  directly  and   combine  the  result  with  inventor  information  from  Linked  Data.  

We also decided to store the patent data locally and use PHP to query the relational database and send the result back to the client side in JSON format. We found this to be the most efficient approach at this stage. If time permits, we would move the data to a cloud server so that the search engine can run remotely.
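The combination step can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's actual code: the function name mergePatentWithInventor, the row fields (patent_number, title, inventor) and the sample variable almaMater are hypothetical. Only the results.bindings layout follows the standard SPARQL JSON results format.

```javascript
// Hypothetical shapes (assumptions for illustration):
//   patentRows - the JSON array returned by the PHP script for the SQL query
//   sparqlJson - a parsed SPARQL JSON result returned by the DBpedia endpoint
function mergePatentWithInventor(patentRows, sparqlJson) {
  var bindings =
    (sparqlJson && sparqlJson.results && sparqlJson.results.bindings) || [];

  // Flatten the first binding ({ name: { type, value } }) into { name: value }.
  // inventorInfo stays null when DBpedia has no entry for the inventor.
  var inventorInfo = null;
  if (bindings.length > 0) {
    inventorInfo = {};
    Object.keys(bindings[0]).forEach(function (name) {
      inventorInfo[name] = bindings[0][name].value;
    });
  }

  // Attach the same inventor information to every patent row of that inventor.
  return patentRows.map(function (row) {
    return {
      patentNumber: row.patent_number,
      title: row.title,
      inventor: row.inventor,
      inventorInfo: inventorInfo
    };
  });
}
```

Keeping the merge on the client side means the relational query and the Linked Data query stay independent, which matches the decision not to convert the patent data into RDF.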

 

Limitation  of  this  approach  

 

There  are  also  some  limitations  of  our  patent  search.    

Firstly, we assume that the inventor has a Wikipedia page so that we can find the corresponding information in DBpedia. However, this will not be the case for all people who hold patents. In such cases, we cannot find their information in the Linked Data Cloud, which is a problem for our approach.

Secondly, the user needs to type the full name of the inventor so that it matches both the name in DBpedia and the inventor name in the patent database. Compared with Google Patent Search, this is limiting, because Google can return a lot of information based on ranked results even if we do not type the full name.

Thirdly, we use the patent data in its original format and run two queries, one against DBpedia and one against the relational database. This does not make the best use of Linked Data, whose advantage over other formats is that all data share the same format and different datasets can be interlinked. It would be better to convert the patent data into RDF format and even publish it to the Open Linked Data Cloud. The patent data would then be interlinked with all the other data sources in the cloud and make better use of the Linked Data concept.
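The second limitation can be illustrated with a sketch of an exact-match lookup. The query below is an assumption for illustration and not necessarily the query used in the prototype: the function name buildInventorQuery is hypothetical, while rdfs:label and dbo:abstract are existing properties in DBpedia. Because the label is compared as an exact string literal, a partial name returns no result.

```javascript
// Builds a SPARQL query that looks up a person in DBpedia by exact English label.
// Hypothetical helper for illustration only.
function buildInventorQuery(fullName) {
  // Escape backslashes and double quotes so the name stays a valid
  // SPARQL string literal.
  var escaped = fullName.replace(/\\/g, "\\\\").replace(/"/g, '\\"');

  return [
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
    "PREFIX dbo: <http://dbpedia.org/ontology/>",
    "SELECT ?person ?abstract WHERE {",
    '  ?person rdfs:label "' + escaped + '"@en ;',
    "          dbo:abstract ?abstract .",
    '  FILTER (lang(?abstract) = "en")',
    "}",
    "LIMIT 1"
  ].join("\n");
}
```

The resulting string could be passed to setQuery of the SPARQLWrapper listed in the Appendix.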

 

Evaluation    

In this evaluation, I mainly discuss the User Interface we built for patent search and the effectiveness and convenience of the search experience for real users.

User  Study  


Most of the users think that the patent association search result is better than that of the traditional approach. When they search for patents, they often cannot tell whether they have found the right one. With our prototype they can easily get information about the inventor and therefore gain a correct and comprehensive understanding of the information they retrieve.

They think that our patent search presents clear output with the associated information and that it can also run related searches. But they also point out a limitation of the approach: we only have basic information about the patent itself. If users want more details about a patent, we cannot provide them because that information is not in our patent database.

 

Heuristic  Evaluation  

 

We examine our User Interface with the well-known 10 Usability Heuristics introduced by Jakob Nielsen. Heuristic evaluation is a usability engineering method for finding usability problems in a user interface design. [22] A small set of evaluators examines the interface against the recognized usability principles and scores each principle from one to ten, and we then combine the results of the evaluation.

We asked our users to go through a set of tasks that we designed for our search interface. We also explained the goals of the system to the evaluators and allowed them to carry out their own tasks. Afterwards, they filled out the Heuristic Evaluation sheet.


Heuristic Evaluation principles                            Points (1-10)    Comments
Visibility of system status
Match between system and the real world
User control and freedom
Consistency and standards
Error prevention
Recognition rather than recall
Flexibility and efficiency of use
Aesthetic and minimalist design
Help users recognize, diagnose, and recover from errors
Help and documentation

 

We analyzed the results that the users provided and explain the evaluation result below. A principle is rated Good if its average score is more than 6 out of 10; otherwise it is rated Need to improve.
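The rating rule can be written as a short function. This is a minimal sketch: the function name classifyHeuristic is hypothetical, and the individual evaluator scores are not reported in this document, so any input array is illustrative. An average of exactly 6 is rated Need to improve, since the rule requires more than 6.

```javascript
// Averages the evaluators' scores (1-10) for one heuristic and applies
// the threshold: strictly more than 6 is "Good".
function classifyHeuristic(scores) {
  var sum = scores.reduce(function (total, score) {
    return total + score;
  }, 0);
  var average = sum / scores.length;
  return {
    average: average,
    rating: average > 6 ? "Good" : "Need to improve"
  };
}
```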

 (1).  Visibility  of  system  status:  Good  (8.7)  

Our interface has a clear layout, and the different components do not overlap when they are displayed. Users can easily see whether they have obtained a search result and what the information looks like.

(2).  Match  between  system  and  the  real  world:  Good  (8.2)  


(3).  User  control  and  freedom:  Good  (7.1)  

Users can search for a new patent by using the textbox in the upper left corner or by simply clicking on information in the page.

(4).  Consistency  and  standards:  Need  to  improve  (5.8)  

The search textbox only supports searching by existing patent numbers and some inventor information, so users may be confused at first about what they should enter.

(5).  Error  prevention:  Need  to  improve  (5.0)  

We did not build auto-completion or auto-correction, so users need to type their query correctly in order to get a result.

(6).  Recognition  rather  than  recall:  Good  (7.5)  

We have minimized the user's memory load by making objects and actions visible. Users do not have to remember information; they can simply click on items in an earlier result.

(7).  Flexibility  and  efficiency  of  use:  Good  (7.2)  

The difference between novice and expert users will not be large because the search feature requires no complicated actions.

(8).  Aesthetic  and  minimalist  design:  Good  (6.8)  

The interface shows the most relevant and needed information and gives additional information lower visibility.


(9).  Help  users  recognize,  diagnose,  and  recover  from  errors:  Need  to  improve   (5.7)  

If users type a name that does not exist in Wikipedia or make a typo, there are no error messages that indicate the problem precisely.

(10).  Help  and  documentation:  Need  to  improve  (5.5)  

We did not implement documentation to help users understand the functionality of the search engine. Users will normally understand it anyway because the interface looks like other search engines.
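The gap noted under heuristic (9) could be narrowed by turning an empty or failed Linked Data lookup into an explicit message. The following is a sketch under stated assumptions, not part of the prototype: the function name describeSparqlOutcome and the message texts are hypothetical, and the response is assumed to be in the SPARQL JSON results format.

```javascript
// Maps the outcome of a SPARQL request to a user-facing message.
// Hypothetical helper; message wording is illustrative only.
function describeSparqlOutcome(httpStatus, responseText, searchedName) {
  // The Appendix wrapper treats 200 and 304 as success; do the same here.
  if (httpStatus !== 200 && httpStatus !== 304) {
    return {
      ok: false,
      message: "The Linked Data endpoint returned HTTP " + httpStatus + "."
    };
  }

  var parsed;
  try {
    parsed = JSON.parse(responseText);
  } catch (e) {
    return {
      ok: false,
      message: "The endpoint response could not be parsed as JSON."
    };
  }

  var bindings = (parsed && parsed.results && parsed.results.bindings) || [];
  if (bindings.length === 0) {
    // Covers both a typo and an inventor without a Wikipedia page.
    return {
      ok: false,
      message: 'No DBpedia entry matches "' + searchedName +
        '". Check the spelling and enter the full name of the inventor.'
    };
  }

  return { ok: true, message: "", bindings: bindings };
}
```

Such a function could be called from the callback passed to query, with the message shown next to the search textbox.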

 

Future  Work    

 

Enriching  the  functionality  of  the  Patent  Search  

So far we have focused only on combining Linked Data and relational data to make the patent search more convenient, so we use only a limited amount of information collected from a single source in the Open Linked Data Cloud. There is much more we can do to enrich the functionality of the patent search. For example, we can take the geographic information in the patent data and visualize it with the GeoNames data from the Linked Data Cloud. We could also visualize a patent search graph that shows the relationships between different inventors and their patents more explicitly.


Querying  a  Collection  of  Datasets  in  Linked  Data  

We query data only from DBpedia in this project. But since Linked Data is interlinked, we may be able to query a collection of datasets through an existing SPARQL endpoint that provides access to copies of the relevant datasets. For example, OpenLink SW hosts a majority of the datasets from the LOD cloud behind a SPARQL endpoint. [23]

 

Applying  the  concept  to  other  topics  

Currently we combine the patent data with DBpedia in the Linked Data Cloud. There are many other sources in the Linked Data Cloud that we could use, such as GeoNames data, IMDB data, BBC Music and so on. We could use these sources to find other applications. For example, we could search for a certain singer and get the relevant biographical information along with their albums and songs from different data sources.

 

                                                                                                     


Conclusions  or  Impact  Statement                          

Our capstone project is a research project that explores the potential use of Linked Data for Big Data. We researched Big Data, including the existing approaches to analyzing it and their strengths and weaknesses. We concluded that highly structured Linked Data might be a potential solution for unstructured Big Data analytics and could uncover more of the value behind Big Data.

Based on that, we built a search engine to demonstrate how Linked Data helps in Big Data analysis by expanding the patent data with the Open Linked Data Cloud. In this way, we can find patent association information through the Linked Data Cloud and combine it with the patent search to give a comprehensive answer.

Although we have learned a lot about the mechanism of Linked Data and used it in our prototype, much remains to be learned. For example, we query only a single source in the Linked Data Cloud; we could explore multiple queries across different sources, or convert the patent data into RDF format and publish it in the Linked Data Cloud.

The strength of Linked Data is its structured and uniform format, which allows information to be shared among different datasets and read automatically by computers. Yet we still need to address drawbacks such as complicated pre-processing procedures and the question of how to protect the data available on the web.

Our  prototype  has  proven  that  linked  data  has  many  advantages  and  can  be  used  in   data  analysis  in  different  situations.  We  can  see  a  bright  future  for  making  better  use  


Bibliography  

 

1. Ian Mitchell, Mark Wilson. Linked Data: Connecting and exploiting big data. London : Fujitsu UK, 2012.

2. Dumbill, Edd. What is big data? An introduction to the big data landscape. [Online] January 11, 2012. http://strata.oreilly.com/2012/01/what-is-big-data.html.

3. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. Big Data: The next frontier for innovation, competition, and productivity. s.l. : McKinsey Global Institute, 2011. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.

4. Roebuck, Kevin. Big Data: High-impact Strategies – What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. s.l. : Lightning Source Incorporated, 2011.

5. Richard Cyganiak, Anja Jentzsch. Linking Open Data cloud diagram. [Online] 2011. http://lod-cloud.net/.

6. Hopkins, Brian and Evelson, Boris. Expand Your Digital Horizon with Big Data. s.l. : Forrester, 2011.

7. Oreskovic, Alexei. YouTube, Google Inc's video website, is streaming 4 billion online videos every day, a 25 percent increase in the past eight months, according to the company. [Online] Jan. 23, 2012. [Cited: Nov. 30, 2012.] http://www.reuters.com/article/2012/01/23/us-google-youtube-idUSTRE80M0TS20120123.

8. twittersearch. The Engineering Behind Twitter's New Search Experience. [Online] May 31, 2011. [Cited: Nov. 30, 2012.] http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.

9. O'Brien, Kevin. Why Media Literacy? A Catholic Reflection. [Online] [Cited: Nov. 30, 2012.] http://www.medialit.org/reading-room/why-media-literacy-catholic-reflection.

10. IDC European Software Predictions. Woodward, Alys, et al. 2012, IDC.

11. IDC Worldwide Big Data Taxonomy. Woo, Benjamin, et al. 2011.

12. Dean, Jeffrey and Ghemawat, Sanjay. 2004, OSDI, p. 13.

16. DuCharme, Bob. Learning SPARQL. s.l. : O'Reilly, 2011.

17. Berners-Lee, Tim. Linked Data Design Issues. [Online] June 18, 2009. http://www.w3.org/DesignIssues/LinkedData.html.

18. Matthews, Andrew. Understanding SPARQL. [Online] 2008. http://www.ibm.com/developerworks/xml/tutorials/x-sparql/section3.html.

19. SPARQL endpoint. [Online] 2011. http://semanticweb.org/wiki/SPARQL_endpoint.

20. HTTP Requests. [Online] http://docs.oracle.com/javaee/1.4/tutorial/doc/HTTP2.html.

21. Ivan Herman, Sergio Fernandez, Carlos Tejo. SPARQL Endpoint interface to Python. [Online] 2008. http://sparql-wrapper.sourceforge.net/.

22. Nielsen, Jakob. 10 Usability Heuristics for User Interface Design. [Online] 1995. http://www.nngroup.com/articles/ten-usability-heuristics/.

23. Hartig, Olaf. Querying Linked Data with SPARQL. [Online] 2009. http://www.slideshare.net/olafhartig/querying-linked-data-with-sparql.

24. Public Data Sets on AWS. [Online] http://aws.amazon.com/publicdatasets.

25. SparqlEndpoints. [Online] 2013. http://esw.w3.org/topic/SparqlEndpoints.

 

 


Appendix  

 

Here  I  will  list  some  code  snippets  described  in  methodology.    

SPARQLWrapper.js  

 

(function(root, factory) {
  if (typeof define === "function") {
    define("SPARQLWrapper", factory); // AMD || CMD
  } else {
    root.SPARQLWrapper = factory(); // <script>
  }
}(this, function() {
  'use strict';

  function SPARQLWrapper(endpoint) {
    this.endpoint = endpoint;
    this.queryPart = "";
    this.type = "json";
  }

  SPARQLWrapper.prototype = {
    constructor: SPARQLWrapper,

    // Stores the query as a form-encoded request body.
    setQuery: function(query) {
      // encodeURIComponent (rather than encodeURI) also escapes '&', '+',
      // '#' and '=', which would otherwise corrupt the form-encoded body.
      this.queryPart = "query=" + encodeURIComponent(query);
    },

    // Sets the result format: "json" or "xml".
    setType: function(type) {
      this.type = type.toLowerCase();
    },

    // Sends the query; call as query(callback) or query(type, callback).
    query: function(type, callback) {
      callback = callback === undefined ? type : this.setType(type) || callback;

      var xhr = new XMLHttpRequest();
      xhr.open('POST', this.endpoint, true);
      xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');

      // Choose the Accept header from the requested result format.
      switch (this.type) {
        case "json":
          type = "application/sparql-results+json";
          break;
        case "xml":
          type = "application/sparql-results+xml";
          break;
        default:
          type = "application/sparql-results+json";
          break;
      }
      xhr.setRequestHeader("Accept", type);

      xhr.onreadystatechange = function() {
        if (xhr.readyState == 4) {
          var sta = xhr.status;
          if (sta == 200 || sta == 304) {
            callback(xhr.responseText);
          } else {
            console && console.error("Sparql query error: " + xhr.status + " " + xhr.responseText);
          }

          // Release the request object once the handler has finished.
          window.setTimeout(function() {
            xhr.onreadystatechange = new Function();
            xhr = null;
          }, 0);
        }
      };

      xhr.send(this.queryPart);
    }
  };

  return SPARQLWrapper;
}));
