(1)

Advanced Archive-It Application Training:
Archiving Social Networking and Social Media Sites

(2)

Agenda

• Overview of Social Networking/Media sites
• Why archive these sites?
• Typical Challenges
• Best Practices: Twitter, Facebook, YouTube, Flickr

(3)

Why Archive These Sites?

• State Agencies: An increasing number have decided that the content on these sites is a record and needs to be archived. "A tweet is a record"
• University libraries: Used to share information with students and alumni; these sites contain important records about a school's culture, student body and campus events.
• Non-Government, Non-Profit Organizations: Used to record online presence and impact
• Researchers: Used to preserve valuable social reactions and change on topics of interest

(4)

Archive-It and Social Media

Overview

• Capturing social media sites is becoming more necessary for Archive-It partners
• Still focused on: Flickr, Facebook, Twitter, and YouTube
• On our radar: Vimeo, LinkedIn, others?
• Join the Archive-It social media listserv to hear breaking news, including fixes and adjustments within Archive-It

(5)

Social Media Crawling Notes

• Content behind log-ins cannot currently be archived (feature in the 4.8 release, April 2013)
• Some parts of sites are not "archive-friendly" (e.g., complicated JavaScript)
• These sites tend to change both their technical structure and their policies quickly and often.

(6)

Scoping Social Media Sites

• Because of the way many of these sites are structured, scoping crawls correctly is very important when archiving them.
– Each site has its own unique structure
– Scoping incorrectly can result in crawling much more than you intend, or in missing the content you want to archive.

(7)

Scoping - Overall Approaches

• Trial and Error: Try to harvest with a variety of settings and a variety of seeds
• Quality Review: review archived content thoroughly
• Collaborate: compare approaches and results with other Archive-It users
• Document detailed instructions, lessons

(8)

Best Practices

• Best practices for various social networking and social media sites are documented on the Archive-It Help Wiki:

https://webarchive.jira.com/wiki/display/ARIH/Archiving+Social+Networking+Sites+with

(9)

Best Practices

• Be specific with your seed URLs: list only the page you would like to archive as a seed. Do NOT use the larger site as a seed (for example, do NOT use www.facebook.com or www.twitter.com as seeds; DO use http://twitter.com/internetarchive/).
• Double-check your seed: Do you need an ending slash ( / )?
• Ignore robots.txt as needed: Some sites block
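The seed checks on this slide can be sketched as a small pre-flight script. This is only an illustration, not part of Archive-It: the function name `check_seed`, the `BROAD_HOSTS` set, and the warning wording are hypothetical, and the ending-slash test is a rough heuristic that only prompts you to look again.

```python
from urllib.parse import urlsplit

# Hosts named on the slide as too broad to use as seeds on their own.
# (Hypothetical helper data; extend for your own collection.)
BROAD_HOSTS = {"facebook.com", "twitter.com"}


def check_seed(seed):
    """Return a list of warnings for a candidate seed URL (empty if none)."""
    warnings = []
    # Accept seeds written without a scheme, e.g. "www.facebook.com".
    parts = urlsplit(seed if "://" in seed else "http://" + seed)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    if host in BROAD_HOSTS and path in ("", "/"):
        warnings.append("seed is the whole site; use a specific page instead")
    elif path and not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        # Heuristic only: directory-like paths often need an ending slash.
        warnings.append("no ending slash; confirm whether this seed needs one")
    return warnings


if __name__ == "__main__":
    for candidate in ("www.facebook.com", "http://twitter.com/internetarchive/"):
        print(candidate, check_seed(candidate))
```

A seed that returns no warnings can still be scoped badly, so a test crawl is still needed.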

(10)

Best Practices

• ALWAYS run a test crawl when first setting up these seeds to avoid using more of your document budget than expected. You may need to run more than one until you get it right.

(11)

Best Practices

• After your first crawl…
– Review post-crawl reports (did you crawl too much?)
– Review archived content in Wayback
  • Did you capture all the areas you expected?
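One way to answer "did you crawl too much?" is to tally captured URLs per host. The sketch below assumes you have a plain list of crawled URLs from a report; that input format, the function names, and the `limit` value are assumptions for illustration, not an Archive-It report format.

```python
from collections import Counter
from urllib.parse import urlsplit


def documents_per_host(crawled_urls):
    """Count how many crawled URLs belong to each host."""
    counts = Counter()
    for url in crawled_urls:
        host = urlsplit(url).hostname
        if host:
            counts[host.lower()] += 1
    return counts


def hosts_over_budget(crawled_urls, limit):
    """Return {host: count} for hosts with more than `limit` documents."""
    counts = documents_per_host(crawled_urls)
    return {host: n for host, n in counts.items() if n > limit}


if __name__ == "__main__":
    sample = [
        "http://twitter.com/internetarchive/",
        "http://twitter.com/archiveitorg/",
        "http://t.co/abc",
    ]
    print(documents_per_host(sample))
    print(hosts_over_budget(sample, limit=1))
```

Hosts that appear here unexpectedly, or with very large counts, are the first places to review scoping rules.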

(12)

Reviewing Scoping Rules

 

(13)

Twitter – Sample URLs

– Individual user feeds
  • https://twitter.com/archiveitorg/
– Searches
  • https://twitter.com/search?q=web%20archiving&src=typd
– Lists
  • https://twitter.com/smithsonian/smithsonian/
– A specific tweet
  • https://twitter.com/archiveitorg/status/294819565320413184

(14)

Twitter - Scoping

Expand Scope (using SURTs) to capture dynamically loading content:

– Individual Twitter feed:
  • +http://(com,twitter,)/i/profiles/show/BrowardCollege/
– Multiple Twitter feeds:
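The SURT in the rule above is the URL with its host labels reversed and comma-separated. A minimal sketch of that conversion follows; the function name `to_surt_prefix` is hypothetical, and the code ignores ports, user info, and query strings, so it is a simplification rather than the crawler's full canonicalization.

```python
from urllib.parse import urlsplit


def to_surt_prefix(url):
    """Convert a URL to a simplified SURT prefix, e.g.
    http://twitter.com/a/ -> http://(com,twitter,)/a/
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host labels and join with commas, with a trailing comma.
    reversed_labels = ",".join(reversed(host.split("."))) + ","
    return f"{parts.scheme}://({reversed_labels}){parts.path}"


if __name__ == "__main__":
    print(to_surt_prefix("http://twitter.com/i/profiles/show/BrowardCollege/"))
```

Prefix the result with "+" when entering it as an Expand Scope rule, as on the slide.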

(15)

Links in Tweets

• Can I archive a URL linked to using a "URL shortener"?
– Yes! Use an Expand Scope rule for http://t.co/ (all URLs posted on Twitter redirect through that domain)
– Note: just the one page that the url shortener link

(16)

Twitter

(17)

Facebook – Sample URLs

– Individual User Profiles – Timeline view
  • http://www.facebook.com/tonyforsenate/
– Pages - Timeline view
  • http://www.facebook.com/ArchiveIt/
– Events
  • http://www.facebook.com/events/265897963430841/
– Albums
  • https://www.facebook.com/media/set/?set=a.13499334573.18616.6193904573&type=3

(18)

Facebook - Scoping

– Ignoring robots.txt:
  • www.facebook.com
  • fbcdn.net
  • akamaihd.net
– Document limit on www.facebook.com (recommended: 2000 for each seed). Note: you cannot limit a crawl to capture content from *just* one Facebook account

(19)

Facebook

• Currently we can capture the initial content on a Facebook timeline. However, the dynamically loading content can be difficult to capture because Facebook frequently changes the way that content is served
• Our engineers are working to keep up to date with these changes, and we are also investigating alternate methods for capturing Facebook pages

(20)

Facebook  

(21)

YouTube - Sample URLs

– Channel/User pages
  • http://www.youtube.com/whitehouse
– Watch pages - individual videos
  • http://www.youtube.com/watch?v=5lVIuW8vJ_E
– Uploaded Document RSS Feed
  • http://gdata.youtube.com/feeds/api/users/whitehouse/uploads/
– Embedded YouTube Videos on other sites:
  • http://www.whitehouse.gov/photos-and-video/video/2013/01/29/president-obama-speaks-comprehensive-

(22)

YouTube - Scoping

• For all YouTube content, ignore robots.txt for:
– youtube.com
– ytimg.com
• For Watch pages - individual videos
– Use "One Page Only" Seed Type
• For Channel/User pages
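The watch-page rule above can be sketched as a helper that spots YouTube watch URLs in a seed list. The function names are hypothetical, and the helper deliberately returns None for anything that is not a watch page rather than guessing at guidance for other page types.

```python
from urllib.parse import parse_qs, urlsplit


def is_youtube_watch_page(url):
    """True for URLs shaped like http://www.youtube.com/watch?v=<id>."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    on_youtube = host == "youtube.com" or host.endswith(".youtube.com")
    return on_youtube and parts.path == "/watch" and "v" in parse_qs(parts.query)


def suggested_seed_type(url):
    """Return "One Page Only" for watch pages, else None (no suggestion)."""
    return "One Page Only" if is_youtube_watch_page(url) else None


if __name__ == "__main__":
    for seed in (
        "http://www.youtube.com/watch?v=5lVIuW8vJ_E",
        "http://www.youtube.com/whitehouse",
    ):
        print(seed, "->", suggested_seed_type(seed))
```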

(23)

YouTube

• Viewing YouTube videos:
– YouTube videos for Watch pages and most embedded YouTube videos will play back normally in Wayback
– For Channel/User pages or other pages where videos do not play back within the page, view the videos from the video report or the public video page for that seed.

(24)

YouTube  

(25)

Flickr

What types of pages can be archived?

– Photo streams
  • Ex: http://www.flickr.com/photos/whitehouse/
– Individual photos
  • Ex: http://www.flickr.com/photos/whitehouse/8390033709/in/photostream

(26)

Flickr  

(27)

Other Sites

• Can sites other than those already mentioned be archived?
– Yes! There are many more sites out there that can be archived. Please send us sites you are interested in archiving.
– Other sites mentioned by partners currently are

(28)

Moving Forward

• These best practices will change as the sites themselves make changes. Please be sure to check the Help Wiki page for updates
• We continue to focus on working with our partners to improve the capture and display of archived social networking sites
• The Archive-It team is exploring capture mechanisms other than a traditional crawler (Heritrix):
  • Headless browsers
  • Hybrid architecture
  • API

(29)

Thank you!

• Questions? Discussion?
• Please take our quick survey:
