Advanced Archive-It Application Training:
Archiving Social Networking and
Social Media Sites
Agenda
• Overview of Social Networking/Media sites
• Why archive these sites?
• Typical Challenges
• Best Practices:
– Twitter, Facebook, YouTube, Flickr
Why Archive These Sites?
• State Agencies: An increasing number have decided that the content on these sites is a record and needs to be archived. "A tweet is a record"
• University libraries: These sites are used to share information with students and alumni, and they contain important records about a school's culture, student body and campus events.
• Non-Government, Non-Profit Organizations: Used to record online presence and impact
• Researchers: Used to preserve valuable social reactions and change on topics of interest
Archive-It and Social Media
Overview
• Capturing social media sites is becoming more necessary for Archive-It partners
• Still focused on: Flickr, Facebook, Twitter, and YouTube
• On our radar: Vimeo, LinkedIn, Others?
• Join the Archive-It social media listserv to hear breaking news, including fixes and adjustments within Archive-It
Social Media Crawling Notes
• Content behind log-ins cannot currently be archived
– Feature in 4.8 Release, April 2013
• Some parts of sites are not "archive-friendly" (e.g. complicated JavaScript)
• These sites tend to change both their technical structure and policy quickly and often.
Scoping Social Media Sites
• Because of the way many of these sites are structured, scoping crawls correctly is very important if you are archiving these sites.
– Each site has its own unique structure
– Not scoping correctly can result in crawling far more than you intend, or in not capturing the content you want to archive.
Scoping - Overall Approaches
• Trial and Error: Try to harvest with a variety of settings and a variety of seeds
• Quality Review: review archived content thoroughly
• Collaborate: compare approaches and results with other Archive-‐It users
• Document detailed instructions, lessons
Best Practices
• Best practices for various social networking and social media sites are documented on the Archive-It Help Wiki:
https://webarchive.jira.com/wiki/display/ARIH/Archiving+Social+Networking+Sites+with
Best Practices
• Be specific with your seed URLs - list only the page you would like to archive as a seed. Do NOT use the larger site as a seed (for example, do NOT use www.facebook.com or www.twitter.com as seeds. DO use: http://twitter.com/internetarchive/).
• Double-check your seed: Do you need an ending slash / ?
• Ignore robots.txt as needed: Some sites block
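The seed advice above can be sketched as a small pre-flight check. This is only an illustration, not an Archive-It feature: the function name `seed_warnings` and its heuristics are assumptions made for this example. It flags seeds that point at a whole site and seeds whose last path segment has no ending slash.

```python
from typing import List
from urllib.parse import urlparse


def seed_warnings(seed: str) -> List[str]:
    """Return warnings for a candidate seed URL (hypothetical helper)."""
    # Accept seeds written without a scheme, e.g. "www.facebook.com".
    parsed = urlparse(seed if "://" in seed else "http://" + seed)
    path = parsed.path
    warnings = []

    if path in ("", "/"):
        # A bare host means the larger site was used as the seed.
        warnings.append("seed is a whole site; list the specific page instead")
    elif (
        not path.endswith("/")
        and "." not in path.rsplit("/", 1)[-1]  # skip file-like segments
        and not parsed.query                    # skip URLs such as /watch?v=...
    ):
        warnings.append("no ending slash; double-check whether one is needed")

    return warnings
```

With this sketch, `seed_warnings("http://twitter.com/internetarchive/")` returns an empty list, while `seed_warnings("www.facebook.com")` returns the whole-site warning. A clean result does not replace a test crawl.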
Best Practices
• ALWAYS run a test crawl when first setting up these seeds to avoid using more of your document budget than expected. You may need to run more than one until you get it right.
Best Practices
• After your first crawl…
– Review post-crawl reports (did you crawl too much?)
– Review archived content in Wayback
• Did you capture all the areas you expected?
Reviewing Scoping Rules
Twitter – Sample URLs
– Individual user feeds
• https://twitter.com/archiveitorg/
– Searches
• https://twitter.com/search?q=web%20archiving&src=typd
– Lists
• https://twitter.com/smithsonian/smithsonian/
– A specific tweet
• https://twitter.com/archiveitorg/status/294819565320413184
Twitter - Scoping
Expand Scope (using SURTs) to capture dynamically loading content:
– Individual Twitter feed:
• +http://(com,twitter,)/i/profiles/show/BrowardCollege/
– Multiple Twitter feeds:
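A SURT writes the host name with its labels reversed and comma-separated, so that related URLs sort together. The sketch below shows roughly how a plain URL maps to the form used in the rule above. It is a simplified illustration and not the crawler's own implementation: the function name `surt_prefix` is an assumption, the query string is dropped, and the leading "+" from the slide is left for you to add when entering the rule.

```python
from urllib.parse import urlsplit


def surt_prefix(url: str) -> str:
    """Convert a URL to a simplified SURT-style prefix (illustrative only)."""
    parts = urlsplit(url)
    # "twitter.com" -> ["twitter", "com"] -> "com,twitter,"
    labels = (parts.hostname or "").split(".")
    reversed_host = ",".join(reversed(labels)) + ","
    # Keep the scheme and path as given; the query string is not included.
    return f"{parts.scheme}://({reversed_host}){parts.path}"
```

For example, `surt_prefix("http://twitter.com/i/profiles/show/BrowardCollege/")` gives `http://(com,twitter,)/i/profiles/show/BrowardCollege/`, and `surt_prefix("http://t.co/")` gives `http://(co,t,)/`.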
Links in Tweets
• Can I archive a URL linked to using a 'URL shortener'?
– Yes! Use an Expand Scope rule for http://t.co/ - all URLs posted on Twitter redirect through that domain
– Note: just the one page that the URL shortener link
Twitter
Facebook – Sample URLs
– Individual User Profiles – Timeline view
• http://www.facebook.com/tonyforsenate/
– Pages - Timeline view
• http://www.facebook.com/ArchiveIt/
– Events
• http://www.facebook.com/events/265897963430841/
– Albums
• https://www.facebook.com/media/set/?set=a.13499334573.18616.6193904573&type=3
Facebook - Scoping
– Ignoring robots.txt:
• www.facebook.com
• fbcdn.net
• akamaihd.net
– Document limit on www.facebook.com (recommended 2000 for each seed)
– Note, you cannot limit to *just* capture content from one Facebook account
• Currently we can capture the initial content on a Facebook timeline; however, the dynamically loading content can be difficult to capture due to the frequent changes in the way that content is served by Facebook
• Our engineers are working on keeping up to date with these changes and we are also investigating alternate methods for capturing Facebook pages
YouTube - Sample URLs
– Channel/User pages
• http://www.youtube.com/whitehouse
– Watch pages - individual videos
• http://www.youtube.com/watch?v=5lVIuW8vJ_E
– Uploaded Document RSS Feed
• http://gdata.youtube.com/feeds/api/users/whitehouse/uploads/
– Embedded YouTube Videos on other sites:
• http://www.whitehouse.gov/photos-and-video/video/2013/01/29/president-obama-speaks-comprehensive-
YouTube - Scoping
• For all YouTube content, ignore robots.txt for:
– youtube.com
– ytimg.com
• For Watch pages - individual videos
– Use "One Page Only" Seed Type
• For Channel/User pages
YouTube
• Viewing YouTube videos:
– YouTube videos for Watch pages and most embedded YouTube videos will play back normally in Wayback
– For Channel/User Pages or other pages where
videos are not playing back within the page, view videos from the video report or the public video page for that seed.
YouTube
Flickr
What types of pages can be archived?
– Photo streams
• Ex: http://www.flickr.com/photos/whitehouse/
– Individual photos
• Ex: http://www.flickr.com/photos/whitehouse/8390033709/in/photostream
Flickr
Other Sites
• Can sites other than those already mentioned be archived?
– Yes! There are many more sites out there that
can be archived. Please send us sites you are interested in archiving.
– Other sites mentioned by partners currently are
Moving Forward
• These best practices will change as the sites themselves make changes. Please be sure to check the Help Wiki page for updates
• We continue to focus on working with our partners to improve the capture and display of archived social networking sites
• The Archive-It team is exploring other capture mechanisms besides using a traditional crawler resource (Heritrix)
– Headless browsers
– Hybrid architecture
– API
Thank you!
• Questions? Discussion?
• Please take our quick survey: