The Perma team has landed back in the US after our trip to the International Internet Preservation Consortium’s Web Archiving Conference. This year the IIPC met in Paris, at the Bibliothèque Nationale de France.
This annual gathering brings together colleagues from around the globe who work in the web archiving space, ranging from institutions responsible for legal deposits, to researchers working with collections, to people building the core tools used for web archiving.
A major theme of the conference was the introduction of AI technologies into the web archiving space.
The Library of Congress is investigating how machine learning can address some of the difficulties associated with searching and accessing PDF collections, which are becoming more and more important to the historical record. You can read their paper, “Grappling with the Scale of Born-Digital Government Publications: Toward Pipelines for Processing and Searching Millions of PDFs”, on arxiv.org.
Folks at the University of North Texas have been using machine learning from a different angle: to help teams identify collection-relevant materials in large web archive troves. They say their work is close to being available to libraries, which could then ingest their historical collection policy and run it against their own metadata. You can read their paper, “Identifying Documents In-Scope of a Collection from Web Archives”, on arxiv.org.
Others were exploring ways to create a map of the web based on semantic similarity instead of traditional hyperlinks, and using LLMs to navigate large news media archives.
For our part, the Perma team was represented at the Tools session on Friday. Kristi and Matteo shared their work on WARC-GPT and our developing concept of the librarianship of AI. Their world tour of talks on WARC-GPT continues this month, and once they’ve all wrapped up we will post slides and any session recordings we have available. But in the meantime, trust us - it was great :)
Some other sessions we enjoyed tuning into included a workshop from friends at Webrecorder who were sharing new QA functionality for Browsertrix, and a great presentation from librarians at the National Library of the Netherlands who had taught themselves R in order to automate their validation and policy-checking workflow when processing new material. Their options for automation were somewhat limited by what they were allowed to run on their government-issued computers. We salute a team dedicated to skill building and working with what their IT departments mandate!
As always, spending time with the international community brought together by IIPC was a pleasure and we look forward to next year in Oslo!