Estimating Information Production and the Size of the Public Domain
Here we’re going to look at using library catalogue data as a source for estimating information production (over time) and the size of the public domain.
Library Catalogues
Cultural institutions, primarily libraries, have long compiled records of the material they hold in the form of catalogues. Furthermore, most countries have had one or more libraries (usually the national library) whose task included an archival component and, hence, whose collections should be relatively comprehensive, at least as regards published material.
The catalogues of those libraries then provide an invaluable resource for charting, in the form of publications, levels of information production over time (subject, of course, to the obvious caveats about coverage and the relationship of general “information production” to publications).
Furthermore, library catalogue entries record (almost) the right sort of information for computing public domain status. In particular, a given record usually has a) a publication date and b) unambiguously identified author(s) with birth date(s) (though unfortunately not death dates). Thus, we can also use this catalogue data to estimate the size of the public domain, with size equated here to the total number of items currently in the public domain.
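As a rough illustration of how such a record might be turned into a public domain estimate, here is a minimal sketch. It assumes a life-plus-70-years term (the standard term for authored works in the UK and EU) and, because the records lack death dates, substitutes a deliberately conservative assumed lifespan. The function name, the `assumed_lifespan` parameter and its default of 100 years are hypothetical choices for illustration, not the method used in this research.

```python
from typing import Optional, Sequence


def likely_public_domain(
    pub_year: int,
    birth_years: Sequence[Optional[int]],
    as_of_year: int,
    term: int = 70,
    assumed_lifespan: int = 100,
) -> bool:
    """Conservatively guess whether an item is in the public domain.

    Assumptions (none of these come from the catalogue itself):
      * protection lasts `term` years after the death of the last author;
      * an author with a known birth year died no later than
        birth year + `assumed_lifespan`;
      * an author with no birth year died no later than
        publication year + `assumed_lifespan`.
    """
    latest_deaths = []
    for born in birth_years:
        if born is None:
            latest_deaths.append(pub_year + assumed_lifespan)
        else:
            latest_deaths.append(born + assumed_lifespan)
    if not latest_deaths:
        # No author recorded: treat like an author with unknown birth year.
        latest_deaths.append(pub_year + assumed_lifespan)
    # The item counts as public domain only once the term after the
    # latest possible death has fully elapsed.
    return max(latest_deaths) + term < as_of_year
```

For example, `likely_public_domain(1850, [1800], as_of_year=2012)` returns `True` under these assumptions: the author is assumed dead by 1900, so protection would have lapsed by 1970. A rule this cautious will undercount the public domain, which is the safer direction for an estimate of its size.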
Results
To illustrate, here are some results based on the catalogue of Cambridge University Library, one of the UK’s “copyright libraries” (i.e. it has a right to obtain, though not an obligation to hold, one copy of every book published in the UK). The first plot shows the number of publications per year, as determined by the publication date recorded in the catalogue, up until 1960 (when the dataset ends).
A major concern when basing an analysis on these kinds of trends is that fluctuations over time may derive not from changes in underlying production and publication rates but from changes in the acquisition policies of the library concerned. To check for this, we present a second plot which shows the same information but derived from the British Library’s catalogue. Reassuringly, though there are differences, the basic patterns look remarkably similar.
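The comparison in this post is made by eye from the two plots. As a complementary check, one could count items per year in each catalogue and correlate the two series. The sketch below does this on log counts, since the series grow roughly exponentially. The function names are hypothetical, and the correlation step is an addition for illustration rather than part of the original analysis.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def items_per_year(pub_years: Iterable[int]) -> Dict[int, int]:
    """Count catalogue items by publication year."""
    return dict(Counter(pub_years))


def log_correlation(a: Dict[int, int], b: Dict[int, int]) -> float:
    """Pearson correlation of log counts over years present in both series."""
    years = sorted(y for y in a if y in b and a[y] > 0 and b[y] > 0)
    if len(years) < 2:
        raise ValueError("need at least two overlapping years")
    xs = [math.log(a[y]) for y in years]
    ys = [math.log(b[y]) for y in years]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a series with constant counts has no correlation")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would suggest that both catalogues track the same underlying trend. It would not rule out acquisition effects common to both libraries.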

Number of items (books etc) Per Year in the Cambridge University Library Catalogue (1600-1960).

Number of items (books etc) Per Year in the British Library Catalogue (1600-1960).
What do we learn from these graphs?
- In total there were over a million “Items” in this dataset. Parsing, cleaning, loading and analyzing this data took on the order of days, while the preparation work to develop and perfect these algorithms took weeks if not months.
- The main trend is a fairly consistent, and approximately exponential, increase in the number of publications (items) per year. At the start of our time period in 1600 we have around 400 items a year in the catalogue, while by 1960 the number is over 16,000.
- This is a forty-fold increase and corresponds to an annual growth rate of approximately 0.8%. Assuming “growth” began only around the time of the industrial revolution (around 1750), when output was around 1,000 items a year (10-year moving average), gives a fairly similar growth rate of around 0.89%.
- There are some fairly noticeable fluctuations around this basic trend:
  - There appears to be a burst in publications in the decade or decade and a half before 1800. One can conjecture several, more or less intriguing, reasons for this: the cultural impact of the French Revolution (esp. on radicalism), the effect of loosening copyright laws after Donaldson v. Beckett, etc. However, without substantial additional work, for example to examine the content of the publications in that period, these must remain little more than conjectures.
  - The two world wars appear dramatically in our dataset as sharp dips: the pre-1914 level of around 7k+ falls by over a third during the war to around 4.5k and then rises rapidly again to reach, and pass, 7k per year in the early 1920s. Similarly, the late 1930s level of around 9.5k per year drops sharply upon the outbreak of war, reaching a low of 5350 in 1942 (a drop of 45%), and then rebounds rapidly at the war’s end: from 5.9k in 1945 to 8k in 1946, 9k in 1947 and 11k in 1948!
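The arithmetic behind figures like these can be sketched as follows. This is a minimal illustration that uses only the rounded endpoint values quoted above, not the underlying data, and the function names are hypothetical.

```python
from typing import List, Sequence


def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two endpoint values."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must all be positive")
    return (end / start) ** (1.0 / years) - 1.0


def moving_average(values: Sequence[float], window: int = 10) -> List[float]:
    """Trailing moving average; shorter windows are used at the start."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def percent_drop(before: float, after: float) -> float:
    """Fall from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


# Rounded endpoint values taken from the text above.
print(f"1600-1960: {annual_growth_rate(400, 16000, 360):.2%} per year")
print(f"1750-1960: {annual_growth_rate(1000, 16000, 210):.2%} per year")
print(f"late 1930s to 1942: {percent_drop(9500, 5350):.0f}% drop")
```

Computed from the rounded endpoints alone, the growth rates come out at roughly 1.0% and 1.3% per year, somewhat higher than the figures quoted above. The quoted figures presumably rest on smoothed values or a fitted trend rather than on raw endpoints, though that is an assumption.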
To do next (but in separate entries — this post is already rather long!):
- Estimates for the size of the public domain: how many of those catalogue items are in the public domain?
- Distinguishing Publications (“Items”) from “Works” — i.e. production of new material versus the reissuance of old (see previous post for more on this).
Colophon: Background to this Research
I’m working on an EU-funded project on the Public Domain in Europe, with a particular focus on the size and value of the public domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non-public-domain items?
I’ve also been involved for several years in Public Domain Works, a project to create a database of works that are in the public domain.
Colophon: Data and Code
All the code used in parsing, loading and analysis is open and available from the Public Domain Works Mercurial repository. Unfortunately, the library catalogue data is not. Library catalogue data, at least in the UK, appears to be largely proprietary, and the raw data kindly made available to us for the purposes of this research by the British Library and Cambridge University Library was provided only on a strictly confidential basis.