LIS 839: Topic: Archiving Web & Social Media Content (1 cr.)
School of Library and Information Studies, University of Wisconsin-Madison
4238 Helen C. White, 600 N. Park Street, Madison, WI 53726
Instructor: • Bertram Lyons, MA, CA
Class Time: • Monday, June 2nd — Friday, June 6th, 9:00am–11:30am
Location: • Primary: SLIS, Room 4191F (Cat Lab) • Secondary: SLIS, Room 4160 (Computer Lab)
Office Hours: • Following class from 11:30am–12:30pm or via appointment
Contact: • Email: [email protected] • Twitter: @bertramlyons • Phone: 202-430-4457 (in emergencies only, please)
Overview Description This project-based course focuses on issues and challenges related to born-digital curation in an archives environment. Topics covered may include archives best practices; changing information life cycle; assessment, curation and preservation in a digital environment; and outreach strategies.
Websites and social media are inherently digital. To archive web and social media requires a basic understanding of digital information (e.g., formats, processes, and information structures).
This course will combine coverage of general digital preservation issues with those specifically related to the acquisition and preservation of web and social media data. We will couple selected readings with in-person discussions, and students will work alone and in small groups to complete daily projects that introduce technical skills and reinforce intellectual concepts.
In order to establish a framework for this intensive, we will use the SAA Guidelines for College and University Archives as a reference point for core archival functions and expectations. Such guidelines and institutional frameworks separate the work of archives from that of academic digital humanities, records management, software development, or commerce, all areas where web and social media data are of great interest. Using core archival functions as guideposts, we will work through the process of developing a comprehensive archival approach to the activity of collecting web and social media content.
In order to collect content from social media websites and applications it is imperative that archivists understand the information architectures that support their use and development as well as the data systems that store and manage the underlying information. We will explore and learn to use existing tools that support access to web and social media content and we will discuss the general environment of web technologies as they exist today.
SLIS Program Goals and Objectives
This course is designed to assess student progress in the following SLIS program-level outcomes (note: Goal 2 is not applicable to this course and does not appear below):
Goal 1. Theory and history
Students have a critical grounding in theoretical and historical perspectives that draw on research in other fields of knowledge as well as on LIS, and that inform their professional practices, including research, with respect to the organization and management of information and providing access to information.
1a. Students apply key concepts with respect to the relationship between power, knowledge, and information.
Goal 3. Techniques and technologies
Students are competent and knowledgeable in the core skills of the innovative information professional, and in any chosen area of specialization.
3a. Students organize and describe print and digital information resources.
3b. Students select and evaluate print and digital information resources.
3c. Students analyze information needs of diverse individuals and communities.
3d. Students understand and use appropriate information technologies.
Goal 4. Professionalism and leadership
Students are reflective, creative, problem-solving leaders, able to communicate, collaborate, and instruct effectively.
4a. Students evaluate, problem solve and think critically, both individually and in teams.
4b. Students demonstrate good oral and written communication skills.
Assessment of Learning Outcomes

Course Learning Objective: Students have a critical grounding in theoretical and historical perspectives that draw on research in other fields of knowledge as well as on LIS, and that inform their professional practices, including research, with respect to the organization and management of information and providing access to information.
Official Program-Level Learning Outcomes: 1a. Students apply key concepts with respect to the relationship between power, knowledge, and information.
Evidence of Learning Outcomes: Daily Précis; Daily Reading Discussions; Final Project
Assessing Mastery of Learning Outcome: Students participate in discussion and analyze key concepts. Students effectively incorporate some theoretical concepts into thesis and argument.

Course Learning Objective: Students are competent and knowledgeable in the core skills of the innovative information professional, and in any chosen area of specialization.
Official Program-Level Learning Outcomes: 3a. Students organize and describe print and digital information resources. 3b. Students select and evaluate print and digital information resources. 3c. Students analyze information needs of diverse individuals and communities. 3d. Students understand and use appropriate information technologies.
Evidence of Learning Outcomes: Practicum Exercises; Group Exercises; Final Project
Assessing Mastery of Learning Outcome: Students participate in activities and understand the purpose of the exercise. Students analyze set of possible tools and understand the strengths and weaknesses of each. Students provide justification for proposed project.

Course Learning Objective: Students are reflective, creative, problem-solving leaders, able to communicate, collaborate, and instruct effectively.
Official Program-Level Learning Outcomes: 4a. Students evaluate, problem solve and think critically, both individually and in teams. 4b. Students demonstrate good oral and written communication skills.
Evidence of Learning Outcomes: Practicum Exercises; Group Exercises; Final Project
Assessing Mastery of Learning Outcome: Students work in teams to complete technical exercises. Students summarize readings clearly and effectively critique the issues therein. Written assignments are clear, terse, and well-articulated.
Course Objectives Upon completion of this course students will have gained an in-depth appreciation of the challenges presented to archivists by the widespread adoption of online communication and knowledge sharing tools. Students will be knowledgeable about ways of dealing with such challenges through proper planning and strategizing and will be able to demonstrate an understanding of a variety of recommended and/or implemented methods for ensuring the preservation of web and social media data. Additionally, students will be familiar with recent and current research and literature on the preservation of digital records and they will be able to approach new and unforeseen digital records issues from a solid knowledge of concepts and principles.
Course Goal To give students an opportunity to explore in-depth issues concerning the preservation of web and social media data by the creating organization/individual and/or its successor(s), such as an archival program or institution.
Policies & Grading Absence and Attendance Policy Due to the limited length of this course, absences will not be permitted. If you believe you may need to miss a class, I kindly ask you to drop the course. Roll will be taken at the beginning of class. If you’re late, your participation score will decrease.
Academic Integrity As an academic discipline and professional field, library and information studies holds the legal rights of authors and artists in the highest regard and respects and protects those rights. As future LIS professionals, it is your responsibility to uphold these principles in your own work as you draw on the ideas, findings, and works of others. Using others’ ideas, findings, or works directly or indirectly in your own creative works (including papers, projects, blog posts, etc.) requires you to cite the original source. Not doing so will indicate to me that you have plagiarized the original author(s) and the assignment will receive no credit.
You should cite your sources strictly following the guidelines published in the American Psychological Association’s (APA) publication manual. If you are using a Web source in a blog post, a simple link to the source will suffice. For the University’s full academic integrity policy, please see: http://students.wisc.edu/doso/acadintegrity.html
Grade Distribution The grade distribution includes a general description of what a certain grade indicates of your progress. Please review the available rubrics for grading information specific to each assignment.
A 94-100 Outstanding achievement. Student performance demonstrates full command of course materials and evinces a high degree of originality and/or creativity that far surpasses course expectations.
AB 88-93 Very good achievement. Student performance demonstrates thorough knowledge of course materials and exceeds course expectations by completing all course requirements in a superior manner.
B 82-87 Good work. Student performance meets designated course expectations, demonstrates understanding of the course materials, and performs at an expected and acceptable level. A “B” is a normal grade.
BC 77-81 Marginal work. Student performance demonstrates incomplete understanding of course materials.
C 72-76 Unsatisfactory work and inadequate understanding of course materials.
Late Work Policy To avoid a penalty for late work, contact me on or before the due date to request a later due date. If you do not, your assignment will receive a half-grade deduction for each day it is late. This policy applies only to the final project; daily reading summaries are not accepted late.
Reasonable Accommodation of Disabilities It is my desire to fully include students with disabilities in this course in such a way that maximizes their learning experiences and meets any specific needs they may have. To accomplish this goal, please communicate with me as early as possible regarding any accommodations that may need to be made in my instruction. I will do my utmost to protect your confidentiality.
Information for students with disabilities is available at the McBurney Disability Resource Center: • Address: 1305 Linden Drive, Madison, WI 53706 • Phone: 608-263-2741 • Phone (tty): 608-263-6392 • URL: http://mcburney.wisc.edu
Time Commitment For graduate level classes, a typical 3-credit course requires 9-10 hours of outside work per week plus the 3 hours for class each week, meaning over a term you're putting in approximately 180–195 hours. In theory for one credit you'd put in 1/3 of that. In a short format class that is not realistic. But you should plan that this class is a full-time commitment for the week. Plan your week wisely to build in time to read, study, and come refreshed to class each day.
Readings Readings are online. There are no set textbooks. Digital preservation and web and social media archiving are fields where there is a wealth of high-quality information available on the web. These are also fields where there is tremendous change and where there are, as yet, few common understandings. These factors make it very important to keep up to date. Further readings may be advised during the course.
Student Responsibilities Reading Summaries For each day, you will need to write a brief one-page précis of one reading of your choice from the list of assigned readings for that day, and submit it to me via email ([email protected]) before class. Late summaries will not be accepted. The summaries will be used as a starting point for discussions during that class day. Reading summaries consist of brief answers to the following questions, as well as two separate comments or questions you would like to bring to the attention of your classmates.
- What is the issue addressed by the reading, and why is it important in the context of this class?
- What is the most important takeaway from the reading?
- What critiques (if any) do you have for the author?
Each written précis is limited to one page (single- or double-spaced, font no larger than 12 point, standard one-inch margins).
Summaries will be graded on a 2 point scale, with 2 points corresponding to a good summary, 1 corresponding to a minimal summary, and 0 corresponding to a summary not turned in. These summaries are worth a total of 30% of your final grade.
Daily Class Participation Using a common methodology from software development, we will have an informal “stand up” session each morning at the beginning of class. Each student will deliver a brief (no more than 2 minute) reflection on something from the previous night’s reading assignments at the beginning of class. These informal round-robin discussions will be our starting point for each day. Participation is required.
Daily practicum exercises will be a large part of this class. In these practicums we will learn skills and new technologies useful for web and social media archiving. There will be no formal grades for successful use of the technologies, but there will be grades for participation and effort. Participation in these exercises is required.
Group exercises will also be a daily event in this class where we will work in pairs to accomplish specified tasks. Teamwork is central to digital collections management, especially web and social media archiving. No archivist is an island and we are constantly in need of the skills and expertise of others as we work towards accomplishing common goals. These exercises are designed to encourage teamwork and clear communication. No formal grades for successful completion of exercises will be given, but there will be grades for participation and effort. Participation in these exercises is required.
All class participation will be assessed by the instructor and will form 20% of your total grade.
Final Project Develop a written plan for your personal web or social media archive (choose one website or social media account that you manage as the test data for this project). If you don’t have a website or a social media account, you will be free to select a website of your choice to use in this project. Think of this paper as a thought experiment more than a research paper. You are welcome to incorporate readings and other sources as support for your arguments, but you are not obliged to do so. I will expect you to work on this paper in sections each day, building up to the final product that will be due on Saturday. This paper will be worth 50% of your final grade.
In your personal, academic, or professional life you most likely manage at least one website, blog, or social media account. This project will take the skills and issues we address in this class and apply them to a real-life scenario in order to develop a collection plan for your selected set of data. Select one target data set from your life (e.g., a website you manage, a blog you write, or a social media account you use often). Building on assigned activities each day, you will turn in a paper at the end of this class that documents the reasoning behind your selection process, the data you will target, the methods you will use to acquire the selected data, the frequency with which you will collect your data, the methods you will use to document and store your data, the avenues you will provide for access to your archive, and the long-term plan you will follow to ensure no data is lost in the future.
Students will be encouraged to consider the effects of the scale of the project they propose and they will need to be realistic about what is a manageable project for an emerging social media archiving program.
The project will build in daily increments.
The purpose of this assignment is to have you consider how to apply the principles, standards, approaches, and activities discussed in this course. The report should be 7-9 pages and should be organized around the following issues:
Appraisal
- Identify the target data you will collect
- Discuss the reasoning behind your selection (why is it important to preserve this data?)
Original format of the data
- Describe the categories of data that you will collect from your target site
- Describe the method(s) that you will use to collect the target data
- Determine and document the frequency with which you will collect the data
Metadata: how will information about the data be recorded and maintained?
- Discuss the method(s) you will use to document and store the data you collect
Preservation
- Describe the approach(es) you will take to provide long term preservation to this data in order to ensure no data loss in the future, or at least to document any necessary data changes moving forward
- Describe all files, if any, for which you cannot provide long term access using current standards, principles, approaches, and best practices. What will you do for those data files?
Authenticity and evidential integrity
- Discuss what changes, if any, you think you may have to make in the future in order to continue to grow and preserve your archive
Access
- Determine how (and if) you will provide access to the data in your archive
Are there any other issues which are important for understanding how the project was designed and implemented?
For example, a student could decide to build his or her own personal Twitter archive. If you were to select this path, how would you create an archive of all your tweets from the day you started until the day you stop using Twitter? What is the anatomy of the data that you would collect? What method would you use to collect your data, and at what frequency would you collect it? How would you document your permissions information for future access? How would you plan for storage of your data? How would you plan to provide access to your archive for researchers, for yourself, or for online public consumption?
Strongly recommended prerequisite readings/activities (at least skim these resources)
Preserving contemporary news applications in the news: http://www.pbs.org/mediashift/2014/04/future-proofing-news-apps/
Read through the Cornell tutorial on Digital Preservation Management: http://dpworkshop.org/dpm-eng/eng_index.html.
Read Ed Summers’ talk here (Web as a Preservation Medium): http://inkdroid.org/journal/2013/11/26/the-web-as-a-preservation-medium/.
Follow the Brooklyn Museum & Flickr takedown conversations… Ed Summers has had much to say: http://inkdroid.org/journal/.
Read basic information about how the internet works: http://www.w3.org/wiki/How_does_the_Internet_work#Introduction.
Read basic information about HTML (wikipedia): http://en.wikipedia.org/wiki/HTML. Read Dave Raggett’s introduction to HTML: http://www.w3.org/MarkUp/Guide/.
Read basic information about CSS (wikipedia): http://en.wikipedia.org/wiki/CSS. Read tutorial on CSS (HTML Dog): http://htmldog.com/guides/css/beginner/.
Read basic information about web archiving (wikipedia): http://en.wikipedia.org/wiki/Web_archiving.
Schedule
Day 1 | June 2nd, Monday The story of the Twitter archive at LC | Introduction to the theme
We’ll open with a quick case study, including an overview of issues associated with the establishment of a full archive for Twitter at the Library of Congress:
- understanding the data model(s)
- determining what is in the archive (selection)
- for an ongoing business, such as Twitter, how do you define a delivery frequency?
- what’s the mechanism for delivering the content (how does LC make sure not to miss a single tweet?)
- process for inventorying the data as delivered
- understanding the legal issues associated with social data
- when do you think about access and how do you do it? what are the issues you might face?
Required Readings
Maureen Pennock. March 2013. Web-Archiving: DPC Technology Watch Report 13-01. http://dx.doi.org/10.7207/twr13-01.
Lyman, Peter. “Archiving the World Wide Web” in Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving. Washington, DC: Council on Library and Information Resources, 2002, pp. 38-51. http://www.clir.org/pubs/reports/pub106/web.html
John, Jeremy Leighton, Ian Rowlands, Peter Williams, and Katrina Dean. "Digital Lives: Personal Digital Archives for the 21st Century >> an Initial Synthesis." 2010. [Read: pages vi-xviii] http://britishlibrary.typepad.co.uk/files/digital-lives-synthesis02-1.pdf
U.S. National Archives and Records Administration. “Guidance on Managing Records in Web 2.0/Social Media Platforms,” October 20, 2010, http://www.archives.gov/records-mgmt/bulletins/2011/2011-02.html.
Smithsonian Institution Archives Blog. “To preserve or not to preserve: Social Media.” 2012. http://siarchives.si.edu/blog/preserve-or-not-preserve-social-media.
Society of American Archivists. “Archiving Social Media in Senators’ Offices.” 2012. http://www2.archivists.org/sites/all/files/Archiving_social_media_senators_apx2_drft.pdf.
National Digital Stewardship Alliance / Library of Congress. “Keeping Personal Websites, Blogs and Social Media.” 2012. http://www.digitalpreservation.gov/personalarchiving/websites.html.
Skimmable Readings
Farrell, Susan ed. “A guide to web preservation.” 2010. http://jiscpowr.jiscinvolve.org/wp/files/2010/06/Guide-2010-final.pdf
Toyoda, M., Kitsuregawa, M. (2012). "The History of Web Archiving". Proceedings of the IEEE 100 (special centennial issue). doi:10.1109/JPROC.2012.2189920. http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6182575
Prom, Chris. “Facilitating the Generation of Archives in the Facebook Era.” 2012. http://e-records.chrisprom.com/draft-facilitating-archives-in-facebook-era/.
O’Sullivan, Catherine. “Diaries, On-line Diaries, and the Future Loss to Archives; or, Blogs and the Blogging Bloggers Who Blog Them.” American Archivist 68 (Spring/Summer): 53-73, 2005. http://archivists.metapress.com/content/7k7712167p6035vt/.
JISC-PoWR Team. PoWR: The Preservation of Web Resources Handbook. 2008. [READ SELECTIVELY] http://www.jisc.ac.uk/media/documents/programmes/preservation/powrhandbookv1.pdf.
Optional Readings
Hoffman, Starr. “Preserving Access to Government Websites: Development and Practice in the CyberCemetery.” World Library and Information Congress: 74th IFLA General Conference and Council (10-14 August 2008, Québec, Canada). http://www.ifla.org/IV/ifla74/papers/130-Hoffman-en.pdf.
Glenn, Valerie D. (2007) ‘Preserving Government and Political Information: The Web–at–Risk Project’, First Monday, v.12 no.7: http://journals.uic.edu/ojs/index.php/fm/article/view/1917/1799
Digital Preservation Coalition - handbook on Web Archives: http://www.dpconline.org/advice/web-archiving
Europe’s Blog Forever project has an interesting repository design paper: https://zenodo.org/record/7494/#.U1W75uZdXv0
Madhava, Rakesh, “10 things to know about preserving social media”, 2011, ARMA (from the perspective of a Records Manager), accessible at http://content.arma.org/IMM/September-October2011/10thingstoknowaboutpreservingsocialmedia.aspx.
National Archives and Records Administration. “2004 Presidential Term Web Harvest.” 2005. http://www.webharvest.gov.
PADI Web Archiving Sections 1 and 2; dip into section 3: http://www.nla.gov.au/padi/topics/92.html
Group Activity
Work in groups of two. Draw workflow models detailing the steps taken for web archiving by LC and the British Library, with a separate map for each institution. We will use these models to draw a single model that we all agree illustrates the processes of LC and the British Library.
How the Library of Congress does it: http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html http://www.loc.gov/webarchiving/faq.html http://www.loc.gov/webarchiving/technical.html http://iwaw.europarchive.org/10/IWAW2010.pdf#page=17
How the British Library does it: http://www.webarchive.org.uk/ukwa/
(interesting tool: http://www.archiveready.com/)
Final Project Steps
Determine what web or social media platform you will collect for your final project. Begin researching and documenting the options for collecting data from your selected platform. Begin drafting a narrative for this section of your paper:
Appraisal
- Identify the target data you will collect
- Discuss the reasoning behind your selection (why is it important to preserve this data?)
In prep for Day 2 - request archive downloads from Facebook and Twitter (if you have accounts)
Day 2 | June 3rd, Tuesday Behind the scenes | What are we archiving?
Required Readings
Internet Archive. “Wayback Machine Hits 400,000,000,000 web pages.” 2014. http://blog.archive.org/2014/05/09/wayback-machine-hits-400000000000/.
Kahle, Brewster. “Preserving Wordpress Blogs.” Video. 2013. http://wordpress.tv/2013/08/26/brewster-kahle-internet-archive-and-preserving-wordpress-blogs/.
Hedstrom, Margaret and Christopher A. Lee. "Significant properties of digital objects: definitions, applications, implications." In Proceedings of the DLM-Forum 2002, Barcelona, 6-8 May 2002 , 218-227. Luxembourg: Office for Official Publications of the European Communities, 2002. http://www.ils.unc.edu/callee/sigprops_dlm2002.pdf
Understanding JSON: http://code.tutsplus.com/tutorials/understanding-json--active-8817.
Perez, Sarah. “This is What a Tweet Looks Like.” 2010. http://readwrite.com/2010/04/19/this_is_what_a_tweet_looks_like#awesm=~oE6AvtuJaYBlvl.
Internet Archive Frequently Asked Questions. First section: “The Wayback Machine.” http://archive.org/about/faqs.php#The_Wayback_Machine.
Skimmable Readings
Baker, Mary, Kimberly Keeton, Sean Martin. “Why Traditional Storage Systems Don’t Help Us Save Stuff Forever.” 2005. http://www.hpl.hp.com/techreports/2005/HPL-2005-120.pdf.
Brown, Adrian. “Selecting File Formats for Long-Term Preservation.” Digital Preservation Guidance Note 1. London: The National Archives, August 2008. http://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf
Caroline R. Arms and Carl Fleischhauer. “Sustainability of Digital Formats: Planning for Library of Congress Collections.” [Familiarize yourself with this website.] http://www.digitalpreservation.gov/formats/index.shtml.
Optional Readings
Kirschenbaum, Matthew G., Richard Ovenden, and Gabriela Redwine. "Digital Forensics and Born-Digital Content in Cultural Heritage Collections." Washington, DC: Council on Library and Information Resources, 2010. http://clir.org/pubs/reports/pub149/pub149.pdf
Washington State Government. “Records Management Advice for Blogs, Twitter, and other social media accounts.” 2013. http://www.sos.wa.gov/_assets/archives/RecordsManagement/Blogs-Twitter-and-Managing-Public-Records-Nov-2013.PDF.
InSPECT, “Investigating the Significant Properties of Electronic Content over Time,” 2009, http://www.significantproperties.org.uk/inspect-finalreport.pdf
Timmer, John. “Preserving science: what data do we keep?” http://arstechnica.com/science/2010/11/preserving-science-choosing-what-data-to-discard/.
Practicum
- Compare/contrast social media archive exports: Follow Twitter and Facebook instructions to download your personal archive. Evaluate the results. We will all discuss the type of data these exports return, the quantity of files, the file and data types, and the method of access provided by the packages.
- Use your browser's Save As feature to archive a complex Web page, such as The New York Times home page. Or choose a URL on the Internet Archive's Wayback Machine. Compare the file structure of the original and archived version. Operating your computer offline, try to reconstruct the page in its original form, and explain what if any obstacles you encountered. We will discuss together in class.
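One obstacle you are likely to meet in the Save As exercise is that a saved page still points at files that live on the network. The following is a minimal Python sketch, not a required tool, for listing what a saved page asks the browser to load and separating local references from remote ones. The sample HTML, the `page_files/` folder name, and the `example.com` addresses are made up for illustration, and the set of tags checked is an assumption that covers only the most common cases.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceLister(HTMLParser):
    """Collect the URLs that a page asks the browser to load."""

    # (tag, attribute) pairs that commonly pull in other files.
    # This short list is an assumption; real pages use more.
    RESOURCE_ATTRS = {("img", "src"), ("script", "src"),
                      ("link", "href"), ("iframe", "src")}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.RESOURCE_ATTRS:
                self.resources.append(value)


def split_local_remote(html_text):
    """Return (local, remote) lists of resource references."""
    parser = ResourceLister()
    parser.feed(html_text)
    local, remote = [], []
    for url in parser.resources:
        # A reference that names a host cannot be loaded offline.
        (remote if urlparse(url).netloc else local).append(url)
    return local, remote


# Made-up sample standing in for a page saved with Save As.
sample = """<html><head>
<link rel="stylesheet" href="page_files/style.css">
<script src="http://example.com/ads.js"></script>
</head><body>
<img src="page_files/logo.png">
<img src="//cdn.example.com/photo.jpg">
</body></html>"""

local, remote = split_local_remote(sample)
print("local:", local)
print("remote:", remote)
```

Anything in the remote list is a candidate explanation for a page that looks broken when you open it offline. The sketch does not look inside CSS or JavaScript, which can also load remote files.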
Group Activity
- Data analysis: Download three data packages provided to you by the instructor. Answer questions about each dataset.
- DP1 (csv): file count, file sizes, file types, creation dates, record count
- DP2 (xml): file count, file sizes, file types, creation dates, record count
- DP3 (website): file count, file sizes, file types, creation dates, record count
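If you would rather script the inventory than count by hand, the questions above can be sketched in Python roughly as follows. This is only one possible approach: the function names are hypothetical, file extensions stand in for file types, the last-modified time stands in for a creation date (which is not available on every operating system), and the record count assumes a CSV file whose first row is a header. The demonstration runs on a throwaway package, not on the instructor's data.

```python
import csv
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone


def inventory(root):
    """Walk a directory tree and summarize its files."""
    sizes = []
    types = Counter()
    oldest = None
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            info = os.stat(os.path.join(folder, name))
            sizes.append(info.st_size)
            # The file extension is a rough stand-in for file type.
            types[os.path.splitext(name)[1].lower() or "(no extension)"] += 1
            # st_mtime is the last-modified time, not a creation date.
            if oldest is None or info.st_mtime < oldest:
                oldest = info.st_mtime
    return {
        "file_count": len(sizes),
        "total_bytes": sum(sizes),
        "types": dict(types),
        "oldest_modified": (
            datetime.fromtimestamp(oldest, tz=timezone.utc).isoformat()
            if oldest is not None else None
        ),
    }


def count_csv_records(path):
    """Count data rows in a CSV file, assuming the first row is a header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return max(sum(1 for _row in csv.reader(handle)) - 1, 0)


# Throwaway demonstration package with one small CSV file.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_csv = os.path.join(demo_dir, "tweets.csv")
    with open(demo_csv, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(
            [["id", "text"], ["1", "hello"], ["2", "world"]]
        )
    summary = inventory(demo_dir)
    records = count_csv_records(demo_csv)

print(summary)
print("records:", records)
```

For the XML and website packages the record count needs a different definition (for example, the number of a particular element, or the number of HTML pages), which is part of what the exercise asks you to decide.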
Final Project Steps
Begin thinking and writing about the following section of your paper:
Original format of the data
- Describe the categories of data that you will collect from your target site
Metadata: how will information about the data be recorded and maintained?
- Discuss the method(s) you will use to document and store the data you collect
Day 3 | June 4th, Wednesday Technology and Tools | How do we do it?
Required Readings
Grotke, Abbie. NDSA National Agenda Digital Content Area: Web and Social Media. 2014. http://blogs.loc.gov/digitalpreservation/2014/01/ndsa-national-agenda-digital-content-area-web-and-social-media/
Grotke, Abbie. NDSA National Agenda Digital Content Area: Web and Social Media, 2014. http://blogs.loc.gov/digitalpreservation/files/2014/01/NDSACWG_WebSocialMedia_Overview_Grotke.pdf.
Internet Archive. Challenges of Collecting and Preserving the Social Web. 2013. http://blogs.loc.gov/digitalpreservation/files/2014/01/NDSA_CWG_120413_Carpenter.pdf.
UK National Archives. Social media archiving policy Press Release. 2014. http://blog.nationalarchives.gov.uk/blog/archiving-social-media/ & http://www.nationalarchives.gov.uk/news/929.htm
Wikipedia. “Web crawler.” http://en.wikipedia.org/wiki/Web_crawler.
International Organization for Standardization. ISO 28500. WARC format specification. http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
Milligan, Ian. “WARC Files: A Challenge for Historians and Finding Needles in Haystacks.” 2012. http://ianmilligan.ca/2012/12/12/warc-files-a-challenge-for-historians-and-finding-needles-in-haystacks/.
SAA Web Archiving Roundtable. Guest post by Alex Duryee on Web Archiving, 2013. http://webarchivingrt.wordpress.com/2013/05/07/113/.
SAA Web Archiving Roundtable. Guest post by Nicholas Taylor: Personal Digital (web) Archiving, 2014. http://webarchivingrt.wordpress.com/2014/04/18/personal-digital-web-archiving-guest-post-by-nicholas-taylor/.
Optional Readings
Reyes Ayala, Brenda. “Web Archiving @ UNT: Web Archiving Bibliography 2013.” 2013. http://digital.library.unt.edu/ark:/67531/metadc172362/.
Conversations about Archives Working with Writers to Preserve Their Social Media Content. 2013 Archives Next blog. http://www.archivesnext.com/?p=3691.
Archive-it blog: http://blog.archive-it.org/tag/archive-social-media/.
Archive-it help pages (browse to see issues associated with using Archive-it): https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=3113092
Practicum
- Using the command line interface (CLI) on your personal computer
  - Navigation (pwd, cd)
  - Listing (ls)
  - Printing ( >/)
  - Run programs
  - Create directories/files (mkdir)
  - Read files (vi or cat)
  - Delete files (rm or rmdir)
  - DOS commands (http://en.wikipedia.org/wiki/List_of_DOS_commands): TREE, TYPE, MOVE, MKDIR, CD, DIR, RMDIR, XCOPY
- Compare/contrast web archiving tools/strategies
- curl (http://curl.haxx.se/); Mac OS X install notes: http://www.vettyofficer.com/2013/06/how-to-install-curl-in-mac-os-x.html
- get Homebrew (http://brew.sh/)
- wget (http://en.wikipedia.org/wiki/Wget; manual: http://www.gnu.org/software/wget/manual/). On Mac OS, install it with Homebrew (brew install wget). Examples of commands: http://www.tecmint.com/10-wget-command-examples-in-linux/ and http://www.kossboss.com/linux---wget-full-website. We will use wget to generate a WARC file.
Group Activity
Use Ian Milligan’s three-step instructions to create a WARC file and analyze it. (http://ianmilligan.ca/2012/12/13/warc-files-part-two-using-warc-tools/)
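To see what is inside a WARC file without any special tools, it helps to know that each record is a version line, a block of named header fields, a blank line, and then a content block whose size is given by the Content-Length field. The Python sketch below reads records that way. It is a teaching aid under stated assumptions, not a replacement for the WARC tools in the instructions above: it handles only uncompressed files, it does no validation, and the sample records are built in memory with a simplified set of fields (real records carry more, such as a record ID and a date).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly as many bytes as the record says it contains.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def make_record(record_type, uri, body):
    """Build a simplified sample record (illustration only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two made-up records standing in for a crawl of one page.
sample = (
    make_record("request", "http://example.com/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://example.com/",
                  b"HTTP/1.1 200 OK\r\n\r\nhello")
)

summary = [
    (headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
    for headers, payload in iter_warc_records(io.BytesIO(sample))
]
for row in summary:
    print(row)
```

To try this on a file of your own, open it in binary mode and pass the file object to `iter_warc_records`. If your file ends in .warc.gz, opening it with Python's `gzip.open` in binary mode instead should work, though that path is untested here.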
Final Project Steps
Begin thinking and writing about the following questions:
Original format of the data
- Describe the method(s) that you will use to collect the target data
- Determine and document the frequency with which you will collect the data
Day 4 | June 5th, Thursday Making a case for selection | Why do we collect?
Required Readings
National Archives and Records Administration. “White Paper on Best Practices for the Capture of Social Media Records.” 2013. http://www.archives.gov/records-mgmt/resources/socialmediacapture.pdf.
Lynch, Clifford. "Authenticity and Integrity in the Digital Environment: An Exploratory Analysis of the Central Role of Trust." In Authenticity in a Digital Environment. Washington, DC: Council on Library and Information Resources, 2000. http://www.clir.org/pubs/reports/pub92/lynch.html.
Duranti, Luciana and Kenneth Thibodeau. “The Concept of Record in Interactive, Experiential and Dynamic Environments: the View of InterPARES,” Archival Science 6(1): 13-68, 2006. http://www.interpares.org/ip2/display_file.cfm?doc=ip2_book_appendix_02.pdf.
Hirtle, Peter B. “The History and Current State of Digital Preservation in the United States.” In: Metadata and Digital Collections: A Festschrift in Honor of Thomas P. Turner. Ithaca, NY: Cornell University, 2010, pp. 121-140. http://cip.cornell.edu/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=cul.pub/123860930.
Hirtle, Peter B. “Digital Preservation and Copyright.” 2003. http://fairuse.stanford.edu/2003/11/10/digital_preservation_and_copyr/.
Skimmable Readings
Coyle, Karen. “Rights in the PREMIS Data Model: A Report for the Library of Congress.” Washington, D.C.: Library of Congress, December 2006. http://www.loc.gov/standards/premis/Rights-in-the-PREMIS-Data-Model.pdf
Optional Readings
O’Brien, Jeff. “Electronic records: Basic concepts in preservation and access.” 1998. http://scaa.usask.ca/e-paper.html.
Rothenberg, Jeff. “Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation.” Washington, D.C.: CLIR, 1999. http://www.clir.org/PUBS/reports/rothenberg/pub77.pdf.
Thibodeau, K. Overview of technological approaches to digital preservation and challenges in coming years. In The state of digital preservation: An international perspective (pp. 4- 31). Washington, DC: Council on Library and Information Resources, 2002. http://www.clir.org/pubs/reports/pub107/pub107.pdf.
Beagrie et al. “Digital preservation policies study.” 2008. http://www.jisc.ac.uk/media/documents/programmes/preservation/jiscpolicy_p1finalreport.pdf.
Practicum
- Continued use of tools:
We did not create a WARC file with wget yesterday, so we will start there:
wget "http://www.archiveteam.org/" --mirror --warc-file="at"
(instructions: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output)
- HTTrack (http://www.httrack.com/page/2/; mostly Windows only). On Mac OS, install it with Homebrew (brew install httrack); httrack -h prints help and examples in the terminal.
Create a mirror: httrack http://www.reddit.com/r/AskReddit/
Update a mirror from within the same directory (might be very useful for a site like Reddit): httrack --update
- Heritrix (Internet Archive tool): really only for Linux (difficult to manage on PC or Mac), which is why we also look at:
- Web-based options: http://archive.today/ (you can download a zip of the output, but it only goes ONE page deep)
- WARCreate (http://warcreate.com/; Chrome only): creates a WARC file but offers no way to view it, and as far as I can tell it only goes one page deep
- Web Archiving Integration Layer (WAIL): http://matkelly.com/wail/
- Exploring APIs:
Lynda API tutorial available if anyone wants to do it
- Using the Giphy API: https://github.com/giphy/GiphyAPI
- Using the DPLA API - learning APIs challenge: (Scalar API Explorer: http://scalar.usc.edu/tools/apiexplorer/ )
If you’re feeling very advanced, look at the instructions here to set up a very useful tool for grabbing Twitter feeds, Social Feed Manager (Dan Chudnov’s software): http://dicarve.blogspot.com/2014/04/an-relatively-easy-way-for-installing.html
- Twitter API (streaming vs rest) [maybe we demonstrate how to extract all the tweets for a given hashtag, e.g., #QR1863] https://github.com/tweetstream/tweetstream
https://dev.twitter.com/docs/streaming-apis https://dev.twitter.com/
Codecademy on the Twitter API: http://www.codecademy.com/tracks/twitter
Codecademy on SoundCloud: http://www.codecademy.com/tracks/soundcloud
https://github.com/twitter/twurl
use baxtwit
Using subscription services vs. doing it yourself:
http://www.reedtech.com/business-needs/web-social-media-archiving/
http://www.ostermanresearch.com/whitepapers/orwp_or_201204a.pdf
http://nexgate.com/solutions/intelligent-social-content-archiving/
http://www.iterasi.com/features
http://www.smcapture.com/smc4_capture_archive.php
http://perma.cc/
Group Activity: Discussion Questions
Critique this: http://www.digitalpreservation.gov/meetings/documents/ndiipp11/Workshop3_Slides-ODU-B.pdf
Critique this: DataONE http://mule1.dataone.org/ArchitectureDocs-current/design/PreservationStrategy.html#keep-the-bits-safe
What are the significant properties of social media archives? http://www.ijdc.net/index.php/ijdc/article/view/110
Where in the Digital Curation Lifecycle Model does this project fall? Digital Curation Lifecycle Model: http://www.dcc.ac.uk/resources/curation-lifecycle-model
Final Project Steps
Begin thinking and writing about the following questions:
Access
Determine how (and if) you will provide access to the data in your archive
Day 5 | June 6th, Friday Sustaining the collection | Ensuring preservation and understanding the costs
Required Readings
Day, M. “The long-term preservation of Web Content.” In J. Masanes (Ed.), Web Archiving. Berlin: Springer, 2006. http://www.ukoln.ac.uk/preservation/publications/2006/webarchiving/md-final-draft.pdf.
Archive-It. “The Web Archiving Lifecycle Model.” 2013. https://archive-it.org/static/files/archiveit_life_cycle_model.pdf.
Richard Wright, Ant Miller, and Matthew Addis. “The Significance of Storage in the ‘Cost of Risk’ of Digital Preservation.” International Journal of Digital Curation 4/3 (2009). http://www.ijdc.net/index.php/ijdc/article/view/138
McGovern, Nancy Y., Anne R. Kenney, Richard Entlich, William R. Kehoe, and Ellie Buckley. "Virtual Remote Control: Building a Preservation Risk Management Toolbox for Web Resources." D-Lib Magazine 10, no. 4 (2004). http://dlib.org/dlib/april04/mcgovern/04mcgovern.html.
Sheldon, Madeline. “Digital preservation policies analysis.” 2013. http://blogs.loc.gov/digitalpreservation/2013/08/analysis-of-current-digital-preservation-policies-archives-libraries-and-museums/.
Besser, Howard. “Archiving Occupy Movements.” Video. 2013. http://vimeo.com/43603604.
BagIt specification. http://www.digitalpreservation.gov/documents/bagitspec.pdf. [Please read the specification and try to understand how it works.]
Skimmable Readings
The Blue Ribbon Task Force on Sustainable Digital Preservation and Access. “Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information.” 2010. http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf [Read the Executive Summary; browse the rest of the document. See also the BRTF website: http://brtf.sdsc.edu/about.html]
Jantz, Ronald and Michael Giarlo. “Architecture and Technology for Trusted Digital Repositories.” D-Lib Magazine, 2005. http://www.dlib.org/dlib/june05/jantz/06jantz.html.
California Digital Library. “Guidelines for Digital Objects.” http://www.cdlib.org/inside/diglib/guidelines/.
Optional Readings
CCSDS Reference Model for an Open Archival Information System (OAIS), pp. 10-90 only < http://public.ccsds.org/publications/archive/650x0b1.pdf >
Practicum
Ingest and Data Migration challenge:
Using checksums: create a checksum for a file. Change the file. Create another checksum for the changed version of the original file. Compare the two checksums.
md5 [filename]
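The same exercise can be scripted. The following is a minimal Python sketch that uses only the standard library's hashlib module; the helper md5_of_file and the file name sample.txt are hypothetical names chosen for illustration. It creates a file, records a checksum, changes the file, and compares the two checksums.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Work in a throwaway directory; "sample.txt" is a hypothetical file name.
workdir = tempfile.mkdtemp()
sample_path = os.path.join(workdir, "sample.txt")

with open(sample_path, "w") as handle:
    handle.write("original content\n")
checksum_before = md5_of_file(sample_path)

# Change the file, then create a second checksum.
with open(sample_path, "a") as handle:
    handle.write("one more line\n")
checksum_after = md5_of_file(sample_path)

print("before:", checksum_before)
print("after: ", checksum_after)
print("file changed:", checksum_before != checksum_after)
```

Even a one-character change produces a completely different checksum, which is why stored checksums are useful for detecting later changes to a file.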
Create a BagIt bag using Bagger or the BagIt library. We will create bags and dissect them in order to understand their structure. (http://sourceforge.net/projects/loc-xferutils/files/loc-bagger/2.1.3/bagger-2.1.3.zip/download) (http://project.wdl.org/arab_peninsula/workshop2012/en/doha_workshop_2012_bagger_en.pdf)
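Before opening Bagger, it may help to see how little a bag needs. The sketch below builds one by hand with the Python standard library. It does not use Bagger or the Library of Congress BagIt library, and it assumes version 0.97 of the specification and MD5 manifests. The function names make_minimal_bag and verify_bag and the payload file names are hypothetical.

```python
import hashlib
import os
import tempfile


def make_minimal_bag(bag_dir, payload):
    """Write a bare-bones bag: bagit.txt, data/, and manifest-md5.txt.

    `payload` maps file names to byte content (an input shape chosen
    for this sketch, not something the specification prescribes).
    """
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir)

    manifest_lines = []
    for name in sorted(payload):
        content = payload[name]
        with open(os.path.join(data_dir, name), "wb") as handle:
            handle.write(content)
        # Each manifest line pairs a checksum with a path relative to the bag.
        checksum = hashlib.md5(content).hexdigest()
        manifest_lines.append(checksum + "  data/" + name)

    manifest_path = os.path.join(bag_dir, "manifest-md5.txt")
    with open(manifest_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(manifest_lines) + "\n")

    # The bag declaration: assumed here to be version 0.97.
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as handle:
        handle.write("BagIt-Version: 0.97\n")
        handle.write("Tag-File-Character-Encoding: UTF-8\n")


def verify_bag(bag_dir):
    """Return True if every payload file matches its manifest checksum."""
    manifest_path = os.path.join(bag_dir, "manifest-md5.txt")
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            expected, relative_path = line.split(None, 1)
            file_path = os.path.join(bag_dir, *relative_path.strip().split("/"))
            with open(file_path, "rb") as payload_file:
                actual = hashlib.md5(payload_file.read()).hexdigest()
            if actual != expected:
                return False
    return True


bag_dir = os.path.join(tempfile.mkdtemp(), "example_bag")
make_minimal_bag(bag_dir, {"tweets.json": b"[]", "page.html": b"<html></html>"})
print(sorted(os.listdir(bag_dir)))
print("bag verifies:", verify_bag(bag_dir))
```

A bag created with Bagger typically contains additional tag files, such as bag-info.txt, alongside these required pieces; comparing the two is a good way to dissect the structure.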
Fixity: ls -i [filename] (get inode)
Interesting tool: https://en.wikipedia.org/wiki/Google_Takeout
Group Activity: Discussion Questions
What does “creating durable digital objects” mean?
What does this mean: “Create appropriate metadata for digital objects for access, management, and preservation purposes.”?
How would you talk about the costs of a web-archiving or social media archiving project? Determine the costs of digitization projects and plan appropriate facilities and resources.
Final Project Steps
Spend the rest of class time thinking and writing about the following questions:
Preservation
Describe the approach(es) you will take to provide long term preservation to this data in order to ensure no data loss in the future, or at least to document any necessary data changes moving forward
Describe all files, if any, for which you cannot provide long term access using current standards, principles, approaches, and best practices. What will you do for those data files?
Authenticity and evidential integrity
Discuss what changes, if any, you think you may have to make in the future in order to continue to grow and preserve your archive
Are there any other issues which are important for understanding how the project was designed and implemented?
Final Project Due Saturday by 5pm ([email protected])
ADDITIONAL GENERAL RESOURCES:
SAA Core Archival Functions (http://www2.archivists.org/node/14804)
Digital Curation Centre: http://www.dcc.ac.uk/resource/curation-manual/chapters/
National Digital Information Infrastructure and Preservation program, http://www.digitalpreservation.gov/ndiipp/
Research Libraries Group, RLG DigiNews, http://www.rlg.org/preserv/diginews/
Digital Preservation Europe: http://www.digitalpreservationeurope.eu/
Digital Preservation Coalition: http://www.dpconline.org/
Preserving Access to Digital Information (PADI), http://www.nla.gov.au/padi/
InterPARES (International Research on Permanent Authentic Records in Electronic Systems), http://www.interpares.org
SAA Web Archiving Roundtable website (http://webarchivingrt.wordpress.com/)
International Internet Preservation Consortium website (http://www.netpreserve.org/)
Abbie Grotke’s overview: http://www.infotoday.com/cilmag/dec11/Grotke.shtml