(1)

XML, Semantic Web and Content Analytics

XML Prague Pre-conference 2014

Felix Sasaki

(2)

What do you need to follow this session?

• Ideal: a computer with internet access, to be able to provide feedback

(3)

What is LIDER?

• Project funding this session
• Aim: bring people like you together with
  – Linked data experts
  – Language technology people

(4)

What is Content Analytics?

• A set of technologies to make sense of data
• Quite broad, as are the usage scenarios:
  – Sentiment analysis
  – Business intelligence
  – “Intelligent” web search
  – ...

(5)

What is Content Analytics?

• Some basic technologies
  – Named entity recognition
    • “Welcome to Prague!” (= city)
  – Relation extraction

(6)

Demo

• DBpedia Spotlight
• Basic named entity recognition
• Data extracted from Wikipedia
• Available as a RESTful service: http://tinyurl.com/dbspotlight-demo
• Deployed in oXygen automatic annotation implementation; see oXygen users session and
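
A minimal sketch of calling such a RESTful entity recognition service from Python. The endpoint URL, the `confidence` parameter, and the JSON field names (`Resources`, `@URI`, `@surfaceForm`) are assumptions based on the public DBpedia Spotlight service, not something the slides specify; the sample response is hand-written for illustration and is not real service output.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: public DBpedia Spotlight endpoint (the slides only give a short URL).
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request that asks the service for a JSON answer."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{SPOTLIGHT_URL}?{query}",
                   headers={"Accept": "application/json"})


def extract_entities(response_json):
    """Return (surface form, DBpedia URI) pairs from a parsed response."""
    return [(r["@surfaceForm"], r["@URI"])
            for r in response_json.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send the text to the service (needs network access)."""
    with urlopen(build_request(text, confidence), timeout=10) as resp:
        return extract_entities(json.load(resp))


# Hand-written sample in the assumed response shape, for offline use.
SAMPLE_RESPONSE = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Prague",
         "@surfaceForm": "Prague"}
    ]
}

if __name__ == "__main__":
    print(extract_entities(SAMPLE_RESPONSE))
```

With network access, `annotate("Welcome to Prague!")` would send the sentence to the service; the offline sample only shows how the recognized entity and its identifier are read from the response.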

(7)

Issues with content analytics

• Tooling is language specific
  – Most support is, of course, for English
  – Task: “Get content analytics for your language!”
• Content = not only text!
  – E.g. more and more multimedia content
  – Signal analysis has its limits
    “Find me all movies with a kiss scene at the end!”: won’t work :(
  – Again, textual metadata (e.g. subtitles, closed captions) may help

(8)

WHY AM I TELLING YOU ABOUT CONTENT ANALYTICS?

(9)

Vision: better content analytics via structured data!

• More and more structured data on the Web
  – Textual data with additional information (markup, metadata)
• Huge knowledge bases
  – Wikipedia / DBpedia is just an example
  – Also: Wikidata, Freebase, BabelNet, ...
• Not necessarily created with content analytics in mind – you just need to bring things

(10)

What is XML?

• Details can be skipped here ...
• From the point of view of content analytics:
  – A way to store (semi-)structured data
  – A way to store annotations of content that may link to structured data

<span ...
  its-ta-class-ref="http://nerd.eurecom.fr/ontology#Location"
  its-ta-ident-ref="http://dbpedia.org/resource/Prague">
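
To illustrate how such annotations link content to structured data, here is a small sketch that reads the two attributes with Python's standard `html.parser`. The complete sample markup (the surrounding `<p>` and the annotated text "Prague") is an assumption made for illustration, since the slide shows only the opening tag; the class and function names are hypothetical, and nested annotations are not handled.

```python
from html.parser import HTMLParser

# Assumed complete markup; the slide shows only the opening <span ...> tag.
SAMPLE = (
    '<p>Welcome to <span '
    'its-ta-class-ref="http://nerd.eurecom.fr/ontology#Location" '
    'its-ta-ident-ref="http://dbpedia.org/resource/Prague">Prague</span>!</p>'
)


class ItsAnnotationParser(HTMLParser):
    """Collects text spans that carry its-ta-* annotation attributes."""

    def __init__(self):
        super().__init__()
        self.annotations = []
        self._current = None   # annotation being collected, if any
        self._tag = None       # tag name of the annotated element
        self._depth = 0        # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if "its-ta-ident-ref" in attrs or "its-ta-class-ref" in attrs:
                self._current = {
                    "text": "",
                    "class": attrs.get("its-ta-class-ref"),
                    "ident": attrs.get("its-ta-ident-ref"),
                }
                self._tag, self._depth = tag, 1
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.annotations.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def extract_annotations(html):
    parser = ItsAnnotationParser()
    parser.feed(html)
    parser.close()
    return parser.annotations


if __name__ == "__main__":
    for ann in extract_annotations(SAMPLE):
        print(ann["text"], ann["class"], ann["ident"])
```

Each result pairs the annotated text with an entity class and an identifier in a knowledge base, which is the link between markup and structured data that the slide points to.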

(11)

XML aware content analytics

• Content analytics algorithms process pure text
• Knowledge about XML structures improves content analytics quality

<para>Mr. Obama<footnote><para>Barack Obama is ...</para></footnote> came to Prague yesterday.</para>

Without this knowledge, a content analytics tool “sees”:
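
The effect can be sketched with Python's standard `xml.etree.ElementTree`, using the example markup from this slide. The function names are hypothetical, and treating `footnote` as the only element to set aside is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<para>Mr. Obama<footnote><para>Barack Obama is ...</para></footnote>"
    " came to Prague yesterday.</para>"
)


def naive_text(elem):
    """Flatten all text nodes in document order, ignoring the markup."""
    return "".join(elem.itertext())


def main_text(elem, skip=("footnote",)):
    """Flatten the text but leave out the content of skipped elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(main_text(child, skip))
        # The tail is the text after the child and belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


def footnotes(elem, tag="footnote"):
    """Return the text of each footnote as a separate unit."""
    return ["".join(fn.itertext()) for fn in elem.iter(tag)]


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(naive_text(root))  # Mr. ObamaBarack Obama is ... came to Prague yesterday.
    print(main_text(root))   # Mr. Obama came to Prague yesterday.
    print(footnotes(root))   # ['Barack Obama is ...']
```

The naive flattening runs the footnote into the middle of the sentence, while the structure-aware version hands the analytics tool one clean sentence plus the footnote as its own unit.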

(12)

XML aware content analytics

• Need: process different parts of XML content differently
  – Headings: often not sentences in the linguistic sense
  – Footnotes: embedded in text (see last slide)
  – Non-textual content: XML database structures constitute a mix of structured (non-)textual + semi-structured data

XML knowledge & tooling could lead to new content analytics approaches

(13)

WHAT IS SEMANTIC WEB?

SHORT INTRODUCTION

(14)

Building blocks of the Web

<p>All content on this site is licensed under
<a href="http://creativecommons.org/licenses/by/3.0/">a Creative Commons License</a>.</p>

(15)

Content

<p>All content on this site is licensed under <a

(16)

Links (or “identifiers”)

<p>All content on this site is licensed under
<a href="http://creativecommons.org/licenses/by/3.0/">a Creative Commons License</a>.</p>

(17)

Easy: “Find all content that links to http://creativecommons.org/licenses/by/3.0/”

<p>All content on this site is licensed under
<a href="http://creativecommons.org/licenses/by/3.0/">
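
As a rough sketch of why this query is easy: with plain links, it reduces to an exact string comparison on `href` values. The page names and the second page's markup are hypothetical, invented only for this illustration.

```python
from html.parser import HTMLParser

TARGET = "http://creativecommons.org/licenses/by/3.0/"

# Hypothetical pages; only the first one uses the markup from the slides.
PAGES = {
    "site-a.html": (
        '<p>All content on this site is licensed under '
        '<a href="http://creativecommons.org/licenses/by/3.0/">'
        'a Creative Commons License</a>.</p>'
    ),
    "site-b.html": (
        '<p>Licensed under '
        '<a href="http://creativecommons.org/licenses/by-sa/4.0/">'
        'CC BY-SA</a>.</p>'
    ),
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to(html, target):
    """True if the page contains a link whose href equals target exactly."""
    collector = LinkCollector()
    collector.feed(html)
    return target in collector.hrefs


def pages_linking_to(pages, target):
    return sorted(name for name, html in pages.items() if links_to(html, target))


if __name__ == "__main__":
    print(pages_linking_to(PAGES, TARGET))
```

The comparison finds the first page only. It has no way of knowing that the second page's link also points to a Creative Commons license, because the link itself says nothing about what kind of thing it points to.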

(18)

Still difficult: “Find all content that links to a Creative Commons license”

<p>All content on this site is licensed under
<a href="http://creativecommons.org/licenses/by/3.0/">a Creative Commons License</a>.</p>

(19)

Semantic Web to the rescue = Providing machine readable information on the Web

<p>All content on this site is licensed under
<a property="http://creativecommons.org/ns#license"
   href="http://creativecommons.org/licenses/by/3.0/">
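
A minimal sketch of how the `property` attribute makes the license question answerable by a program. This is not a full RDFa processor: it only pairs `property` and `href` on the same element, and it ignores prefixes, vocabularies and subject resolution. The closing tags in the sample and the class and function names are assumptions added for illustration.

```python
from html.parser import HTMLParser

CC_LICENSE = "http://creativecommons.org/ns#license"

# Slide markup, completed with assumed link text and closing tags.
SAMPLE = (
    '<p>All content on this site is licensed under '
    '<a property="http://creativecommons.org/ns#license" '
    'href="http://creativecommons.org/licenses/by/3.0/">'
    'a Creative Commons License</a>.</p>'
)

PLAIN = (
    '<p>All content on this site is licensed under '
    '<a href="http://creativecommons.org/licenses/by/3.0/">'
    'a Creative Commons License</a>.</p>'
)


class PropertyLinkCollector(HTMLParser):
    """Collects (property, href) pairs found on the same element."""

    def __init__(self):
        super().__init__()
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("property") and attrs.get("href"):
            self.statements.append((attrs["property"], attrs["href"]))


def licenses(html):
    """Return every link target that is explicitly marked as a license."""
    collector = PropertyLinkCollector()
    collector.feed(html)
    return [href for prop, href in collector.statements if prop == CC_LICENSE]


if __name__ == "__main__":
    print(licenses(SAMPLE))  # ['http://creativecommons.org/licenses/by/3.0/']
    print(licenses(PLAIN))   # []
```

The query no longer depends on knowing each license URL in advance: any link carrying the license property is found, whichever license it points to.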

(20)

Semantic Web = Providing machine readable information on the Web

(21)

What is Semantic Web - summary

• A set of technologies to work with interlinked data:
  – 1) represent; 2) model; 3) store; 4) process.
  – 1) RDF; 2) RDF Schema, OWL; 3) Turtle / RDFa / ...; 4) SPARQL.
• Also (but not critical here): a means to infer new information from data

(22)
(23)

Aim of this session

• Gather your thoughts on the relation between XML, Semantic Web and content analytics
• Together: go through a LIDER survey & find out: “What are your use cases for content analytics?”
  http://tinyurl.com/co-an-survey
• That may be related to XML – or not
• Also: gather feedback in free text form: please join
