Using Warcbase to Generate a Link Graph of the Wide Web Scrape | Ian Milligan
Submitted by Clyde on Tue, 17/03/2015 - 22:13
In this short post, I want to take you from a collection of WARC files to a webgraph, which you can see pictured at left.
At the end of 2013 the Trove team at the National Library of Australia embarked on an exciting project to bring Trove’s current affairs coverage into the twenty-first century. The Australian Broadcasting Corporation (ABC) Radio National (RN) website exposes a wealth of contemporary content on cultural and political life in Australia. We knew that if we included these resources in Trove we could give users a current affairs discovery experience starting with the first Australian newspaper printed in 1803 and continuing all the way up to the podcasts of the present day.
By Dorothea Salo on February 5, 2015
I had to put together the introductory lecture for my “XML and Linked Data” course early this time around, because I’ll be out of town for the first class meeting owing to a service obligation. Since I’m starting with linked data instead of XML this time, I found myself having to think harder about the question nearly every student carries into nearly every first class meeting: “Why should I be here?” Why, among all the umpty-billion things a library school could be teaching, teach linked data? Why does it matter?
The Earth may not be flat, but the web certainly is.
“There is no ‘top’ to the World-Wide Web,” declared a 1992 foundational document from the World Wide Web Consortium—meaning that there is no central server or organizational authority to determine what does or does not get published. It is, like Borges’ famous Library of Babel, theoretically infinite, stitched together with hyperlinks rather than top-down, Dewey Decimal-style categories.1 It is also famously open—built atop a set of publicly available industry standards.
To explain the utility of semantic search and linked data, Jeff Penka, director of channel and product development for information management solutions provider Zepheira, uses a simple exercise. Type “Chevy Chase” into Google’s search box, and in addition to a list of links, a panel appears on the right of the screen, displaying photos of the actor, a short bio, date of birth, height, full name, spouses and children, and a short list of movies and TV shows in which he has starred.
The white paper was commissioned by NISO's Discovery to Delivery (D2D) Topic Committee as part of its ongoing examination of areas in the discovery landscape that the information community could potentially standardize. Included in the paper is an overview of the current discovery environment; descriptions of how these technologies, methodologies, and products may be able to adapt to potential future change; and a look beyond current models of discovery to explore possible alternatives, especially those related to linked data.
Scholarly communication is undergoing fundamental changes, in particular with new requirements for open access to research outputs, new forms of peer review, and alternative methods for measuring impact. In parallel, technical developments, especially in communication and interface technologies, facilitate bi-directional data exchange across related applications and systems. The aim of this roadmap is to identify important trends and their associated action points so that the repository community can determine priorities for further investments in interoperability.
Data need to be more than just available; they need to be discoverable and understandable. Iain Hrynaszkiewicz introduces Nature’s new published data paper format, a Data Descriptor. Peer review and curation of these data papers will facilitate open access to knowledge and interdisciplinary research, pushing the boundaries of discovery. Some of the most tangible benefits of open data stem from social and interdisciplinary sciences, as these fields require effective cross-disciplinary communication.
Purpose
This paper will focus on a highly significant yet under-recognised concern: the huge growth in the volume of digital archival information, and the implications of this shift for information professionals.
Design/methodology/approach
Though data loss and format obsolescence are often considered to be the major threats to digital records, the problem of scale remains underacknowledged. The paper will discuss this issue and the challenges it brings, using a case study of a set of Second World War service records.