Category Archives: EUPD

The Size of the Public Domain (Without Term Extensions)

We’ve looked at the size of the public domain extensively in earlier posts.

The basic take away from the analysis was that, based on library catalogue data, for books in the UK approximately 15-20% of works are in the public domain — with public domain works being pretty old (70+ years, due to the life+70 nature of copyright).

An interesting question to ask then is: how large would the public domain be if copyright had not been extended from its original length of 14 years with a (possible) 14-year renewal (14+14), as set out in the Statute of Anne back in 1710? And how does this compare with the situation back when 14+14 was in “full swing”, say in 1795?

Furthermore, what if copyright today were a simple 15 years — the point estimate for the optimal term of copyright found in a paper on this subject? Well, here’s the answer:

                     Today    1795 (14+14)    Today (14+14)    Today (15y)
Total Items          3.46m    179k            3.46m            3.46m
No. Public Domain    657k     140k            1.2m             2.59m
% Public Domain      19       78              52               75

Number and percentage of public domain works under various scenarios, based on Cambridge University Library catalogue data.

That’s right folks: based on the data available, if copyright had stayed at its Statute of Anne level, 52% of the books available today would be in the public domain, compared to an actual level of 19%. That’s around 600,000 additional items that would be in the public domain, including works like Virginia Woolf’s (d. 1941) The Waves, Salinger’s The Catcher in the Rye (pub. 1951) and Marquez’s Chronicle of a Death Foretold (pub. 1981).

For comparison, in 1795 78% of all extant works were in the public domain, a figure we would be close to today if copyright were a simple 15 years (in that case the public domain would be a substantial 75%).
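
The counterfactual calculation here is mechanical: under 14+14 a work is out of copyright at most 28 years after publication, and under a flat 15-year term after 15 years, whereas the actual rule turns on the author's death date. A minimal sketch of the three tests (the function names and 2009 reference year are illustrative, not taken from the project code):

    def pd_life_plus_70(death_year, year=2009):
        """Actual EU rule: PD once the author has been dead more than 70 years."""
        return death_year is not None and year - death_year > 70

    def pd_14_plus_14(pub_year, year=2009):
        """Statute of Anne counterfactual: at most 14 + 14 = 28 years from publication."""
        return year - pub_year > 28

    def pd_flat_15(pub_year, year=2009):
        """Flat 15-year term, the optimal-term point estimate."""
        return year - pub_year > 15

    # e.g. The Waves (author d. 1941, pub. 1931) as seen from 2009:
    print(pd_life_plus_70(1941), pd_14_plus_14(1931), pd_flat_15(1931))
    # -> False True True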

To put this in visual terms, what the public domain is missing out on as a result of copyright term extension is the yellow region in the following figure: those are the works that would be in the public domain under 14+14 but aren’t under current copyright!

PD Stats

The Public Domain of books today (red), under 14+14 (yellow), and published output (black)

Update: I’ve posted the main summary statistics file including per-year counts. I’ve also started a CKAN data package: eupd-data for this EUPD-related data.

Size of the Public Domain III

Here we are going to apply the results on Public Domain “proportions” derived in our previous post and thereby obtain best estimates of the UK public domain.

The logic is simple, and similar to that in our first post in the series: we will take the Public Domain proportions from Table 3 of our last post and combine with our (conservative) estimates for output based on library catalogues. Here are the results:

Pub. Date    Items      % PD    No. PD
1400-1850    304587     100     304587
1850-1860    40970      100     40970
1860-1870    43734      100     43734
1870-1880    50564      95      48035
1880-1890    66857      90      60171
1890-1900    66883      85      56850
1900-1910    70360      65      45734
1910-1920    60489      40      24195
1920-1930    78670      25      19667
1930-1940    90576      10      9057
1940-1950    72692      6       4361
1950-1960    118251     0       0
1960-1970    262974     0       0
1970-2009    2130509    0       0
Total        3458116    19      657361

UK Public Domain Totals Based on Cambridge University Library Data. Note that, as discussed in previous posts, figures from the British Library are approximately 3x larger (both for public domain and total items).

culbooks_counts_annual_1600-2001

Total (Black) and Public Domain (Red) Items per year based on the CUL Catalogue.

Zooming in to the pre-1960 period to get more detail:

culbooks_counts_annual_1600-1960

Total (Black) and Public Domain (Red) Items per Year based on the CUL Catalogue for pre-1960 period.

Author “Significance” From Catalogue Data

Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance”, as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).

I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:

  • Prolificness — how many distinct works an author produced (since usually each work will get an item)
  • Popularity — this influences how many times the same work gets reissued as a new ‘item’ and the library’s decision to keep the item
  • Merit — as for popularity

The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:

  • Shakespeare is number 1 (2)
  • Classics (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
  • Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith, Gilbert Burnet (perhaps accounted for by his prolificness).
  • Also surprising is the limited number of entries from the 19th-century UK, with only Scott (26), Dickens (28) and Byron (41)

Rank    Items    Name
1       3112     Great Britain. Parliament.
2       1154     Shakespeare, William
3       1076     Church of England.
4       973      Cicero, Marcus Tullius
5       825      Great Britain.
6       766      Catholic Church.
7       721      Erasmus, Desiderius
8       654      Defoe, Daniel
9       620      Horace
10      599      Aristotle
11      547      Voltaire
12      539      Virgil
13      527      Swift, Jonathan
14      520      Goethe, Johann Wolfgang Von
15      486      Rousseau, Jean-Jacques
16      479      Homer
17      444      Milton, John
18      388      Sterne, Laurence
19      387      England and Wales. Sovereign (1660-1685 : Charles II)
20      386      Euripides
21      372      Ovid
22      358      Goldsmith, Oliver
23      358      Plato
24      351      Wang
25      349      Alighieri, Dante
26      338      Scott, Walter (Sir)
27      326      More, Hannah
28      322      Dickens, Charles
29      315      Aeschylus
30      304      Burnet, Gilbert
31      302      Luther, Martin
32      295      Dryden, John
33      290      Xenophon
34      280      Sophocles
35      262      Pope, Alexander
36      259      Fielding, Henry
37      258      Li
38      250      Calvin, Jean
39      248      Zhang
40      247      Aristophanes
41      247      Byron, George Gordon Byron (Baron)
42      247      Bacon, Francis
43      247      Chen
44      245      Terence
45      241      Euclid
46      235      Augustine (Saint, Bishop of Hippo.)
47      232      Burke, Edmund
48      223      Johnson, Samuel
49      222      Bunyan, John
50      222      De la Mare, Walter

Top 50 authors based on CUL Catalogue 1400-1960

The other thing we could look at is the overall distribution of titles per author (and how it varies with rank — a classic “is it a power law” question). Below is the histogram (NB log scale for counts) together with a plot of rank against count (which equates, v. crudely, to a transposed plot of the tail of the histogram …). In both cases it looks (!) like a power law is a reasonable fit given the (approximate) linearity, but this should be backed up with a proper K-S test.
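
For reference, the rank-versus-count plot is just the per-author item counts sorted in descending order and drawn on log-log axes. A rough sketch of how it can be produced (the placeholder records and variable names are mine; the actual code lives in stats.py in the Public Domain Works repository, as noted in the colophon below):

    import collections
    import matplotlib.pyplot as plt

    # Placeholder records: in the real analysis the (author, title) pairs come
    # from the CUL catalogue via the Public Domain Works loader.
    items = [
        ("Shakespeare, William", "Hamlet"),
        ("Shakespeare, William", "Macbeth"),
        ("Defoe, Daniel", "Robinson Crusoe"),
    ]

    # Count items per author, then sort counts descending so rank 1 = most items.
    counts = collections.Counter(author for author, _title in items)
    sorted_counts = sorted(counts.values(), reverse=True)
    ranks = range(1, len(sorted_counts) + 1)

    plt.loglog(ranks, sorted_counts, marker=".", linestyle="none")
    plt.xlabel("Author rank")
    plt.ylabel("No. of items")
    plt.title("Rank versus no. of items (log-log)")
    plt.show()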

culbooks_person-item-hist-logxlogy.png

Histogram of items-per-author distribution (log-log)

culbooks_person-item-by-rank-logxlogy.png

Rank versus no. of items (log-log)

TODO

  • K-S tests
  • Extend data to present day
  • Check against other catalogue data
  • Look at occurrence of people in title names
  • Look at when items appear over time

Colophon

Code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py

How Long Should Copyright Last? Talk at Oxford IP Seminar

Last week I was in Oxford to give a talk at the IP Seminar on “How Long Should Copyright Last?”. I have now posted the slides from the talk online.

In addition to covering the basic outline of the optimal term calculation, I was also able to give some results from the recent research on the public domain (see slide 58 onwards).

Size of the Public Domain II

This follows up my previous post. Here we are going to calculate public domain numbers based directly on authorial birth/death date information rather than on guesstimated weightings. We’re going to focus on the Cambridge University Library (CUL) data we used previously.

Pub. Date    Total    No Author      Any Date       Death Date
1870-1880    50564    6634 (13%)     23016 (45%)    21876 (43%)
1880-1890    66857    8225 (12%)     31135 (46%)    28570 (42%)
1890-1900    66883    8733 (13%)     32169 (48%)    28971 (43%)
1900-1910    70360    8594 (12%)     35401 (50%)    29922 (42%)
1910-1920    60489    7722 (12%)     31336 (51%)    24608 (40%)
1920-1930    78670    9023 (11%)     44219 (56%)    32658 (41%)
1930-1940    90576    11004 (12%)    46849 (51%)    29372 (32%)
1940-1950    72692    7638 (10%)     36495 (50%)    22155 (30%)

Table 1: PD Relevant Information Availability

Table 1 presents a summary of how much relevant information is available for items (books) of particular vintages in the CUL catalogue — we only show data from 1870 to 1950 on the presumption that (almost) all pre-1870 publications are PD (their authors would have had to live for more than 70 years post-publication for this not to be the case) and almost all publications post 1950 are in copyright today (their authors would have to have died before 1940 for this not to be the case).

As the table shows, at best only just over 40% of items have a recorded authorial death date, and extending to include birth dates only raises this proportion to, at best, the mid-to-low fifties. Taking account of items which lack any associated author raises these figures somewhat further, to around 60%, though we should note that the reason for the lack of an associated author is not clear — is it because the works are genuinely anonymous or simply because the information has not been recorded? Thus, even for the earliest items listed, a large proportion (50% or more) lack the necessary information for direct computation of public domain status.

At the same time, we can take some heart, and some interesting facts, from this table. First, a reasonable proportion of items, amounting to many thousands, did have associated death dates. Second, at least for older items, the majority of those with any date had a death date (95% for 1870-1880 and still over 70% for 1920-1930). Third, and this is a more general observation, the proportions were surprisingly constant over time. For example, the proportion of ‘anonymous’ items lies in a narrow band between 10% and 13% across all periods. Similarly, the proportion of items with any date information ranges only from 45% to 56%. At the same time, and reassuringly, though the proportion with death dates is relatively constant for the oldest periods, in the more recent ones it falls substantially, as one would expect given that some of the authors from those more recent eras are still alive.

Pub. Date    Total    PD             Not PD         ?              Prop 1    Prop 2
1870-1880    50565    22157 (43%)    68 (0%)        28340 (56%)    99%       96%
1880-1890    66858    28325 (42%)    649 (0%)       37884 (56%)    97%       90%
1890-1900    66884    26723 (39%)    2418 (3%)      37743 (56%)    91%       83%
1900-1910    70362    24032 (34%)    5838 (8%)      40492 (57%)    80%       67%
1910-1920    60491    16200 (26%)    8306 (13%)     35985 (59%)    66%       51%
1920-1930    78671    16127 (20%)    16351 (20%)    46193 (58%)    49%       36%
1930-1940    90583    8973 (9%)      20835 (23%)    60775 (67%)    30%       19%
1940-1950    72696    5000 (6%)      19316 (26%)    48380 (66%)    20%       13%

Table 2: PD Status by Decade. ‘?’ indicates items where PD status could not be computed. Prop(ortion) 1 equals total PD divided by total for which status could be computed (sum of total PD and Not PD). Prop(ortion) 2 equals total PD divided by number of items for which any author date was known (‘Any Date’ in previous table).

Table 2 reports the results of direct computation of PD status based on the information available. Note that, in doing these computations, we have augmented the basic life plus 70 rule with the additional assumptions that a) all items published in 1870 or before are PD, b) no author is older than 100 (so if a birth date is more than 170 years ago the item is PD) and c) every author lives at least until 30 (so that any work published by an author born less than 100 years ago is automatically not PD).
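
In code, the augmented rule set amounts to something like the following (a minimal sketch assuming a 2009 reference year; the real implementation lives in the Public Domain Works repository and runs against the database):

    def pd_status(pub_year, birth_year=None, death_year=None, year=2009):
        """Return True (PD), False (not PD) or None (unknown) for a single item."""
        # (a) anything published in 1870 or before is treated as PD
        if pub_year is not None and pub_year <= 1870:
            return True
        # basic life + 70 rule where a death date is known
        if death_year is not None:
            return year - death_year > 70
        if birth_year is not None:
            # (b) no author lives past 100: born more than 170 years ago
            # implies a death more than 70 years ago, hence PD
            if year - birth_year > 170:
                return True
            # (c) every author lives to at least 30: born less than 100 years
            # ago implies a death (if any) less than 70 years ago, hence not PD
            if year - birth_year < 100:
                return False
        return None  # insufficient information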

As is to be expected, for the majority of the periods the availability of PD status (either PD or Not PD) closely tracks the availability of death date information — the total for which PD status can be determined (the sum of PD and Not PD) almost exactly equals the total for which death date information is available. It is only in the last period, 1940-1950, that the birth date appears to make any contribution. More interesting is how the numbers of PD and Not PD items vary over time, especially relative to each other (and as a proportion of the records for which any date is available).

These two proportions/ratios are recorded in the last two columns, which give, respectively: 1) the PD total relative to the number of items for which any status could be computed (i.e. the sum of PD and Not PD); 2) the PD total relative to the total number of items for which any date information is available. These ratios change dramatically over the periods shown: starting in the high 90%s for the 1870-1880 period, by the 1940s they are down to 20% or below.

Pub. Date    % PD
0000-1870    100
1870-1880    95
1880-1890    90
1890-1900    85
1900-1910    65
1910-1920    40
1920-1930    25
1930-1940    10
1940-1950    6
1950-Now     0
1950-Now0

Table 3: Suggested PD Proportions

The key question for us is how to extrapolate these PD proportions to the full set of records — i.e. from the set of records for which there is the necessary birth/death date information to those for which there is not. The simplest, and most obvious, approach is to assume that the proportions are identical and therefore that the PD proportions calculated on the partial dataset apply to the whole. However, there are some obvious deficiencies in this approach.

In particular, our ability to compute a PD status is largely linked to the existence of a death date and it is likely that the presence of this information is itself correlated with authorial age — after all a death date can only exist once that person has died! This correlation, and the bias it gives rise to, is probably small in the early periods — the authors of any pre 1930 work are almost certainly no longer alive today. However, for the later periods, the bias may be more substantial — it is in these last two periods (1930-1940 and 1940-1950) that there is a significant reduction in the number of records with a death date and (relatedly) a significant increase in the number of records for whom the PD status is unknown.

Thus, in converting the partial PD proportions to full PD proportions it seems sensible to revise the partial figures down somewhat, with the revision being greater in later periods. Moreover, we have a lower bound for any downwards revision, provided by the total PD as a proportion of all records — which even in the 1940-1950 period stood at 6%. In light of these considerations, Table 3 gives fairly conservative figures for PD proportions to use when estimating PD size based on publication dates. Interestingly, even with our conservative assumptions, these proportions are rather higher than those used in our previous analysis.

Algorithm Speed and the Challenge of Large Datasets

In doing research for the EU Public Domain project (as here and here) we are often handling large datasets, for example one national library’s list of pre-1960 books stretched to over 4 million items. In such a situation, an algorithm’s speed (and space) can really matter. To illustrate, consider our ‘loading’ algorithm — i.e. the algorithm to load MARC records into the DB, which had the following steps:

  1. Do a simple load: i.e. for each catalogue entry create a new Item and new Persons for any authors listed
  2. “Consolidate” all the duplicate Persons, i.e. Persons who are really the same but for whom we created duplicate DB entries in step 1 (we can do this because MARC cataloguers try to uniquely identify authors based on name + birth date + death date; see the sketch after this list).
  3. [Not discussed here] Consolidate “items” to “works” (associate multiple items (i.e. distinct catalogue entries) of, say, a Christmas Carol, to a single “work”)
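
As mentioned in step 2, the consolidation key is simply the (name, birth date, death date) triple taken from the MARC record. A minimal in-memory sketch of the grouping (the real code works against the DB rather than a Python list, and the dictionary field names are illustrative):

    import collections

    def consolidate(persons):
        """Group duplicate Person records by their MARC-style identity key."""
        groups = collections.defaultdict(list)
        for person in persons:
            key = (person["name"], person.get("birth_date"), person.get("death_date"))
            groups[key].append(person)
        # Each group of duplicates would then be merged into one canonical
        # Person and the associated items re-pointed at it.
        return groups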

The first part of this worked great: on a 1 million record load we averaged between 8s and 25s (depending on hardware, DB backend etc) per thousand records, with speed fairly constant throughout (so that’s between 2.5 and 7.5 hours to load the whole lot). Unfortunately, at the consolidation stage we ran into problems: for a 1 million item DB there were several hundred thousand consolidations and we were averaging 900s per 1000 consolidations! (This also scaled significantly with DB size: a 35k record DB averaged 55s per 1000.) This would mean a full run would require several days! Even worse, because of the form of the algorithm (all the consolidations for a given person were done as a batch) we ran into memory issues on big datasets with some machines.

To address this we switched to performing “consolidation” on load, i.e. when creating each Item for a catalogue entry we would search for existing authors who matched the information on that record. Unfortunately this had a huge impact on load times: time grew superlinearly and had already reached 300s per 1000 records at the 100k mark, having started at 40s — Figure 1 plots this relationship. By extrapolation, 1M records would take 100 hours plus — almost a week!

At this point we went back to the original approach and tried optimizing the consolidation, first by switching to pure SQL and then by adding some indexes on the join tables (I’d always thought that foreign keys were automatically indexed, but it turned out not to be the case!). The first of these changes solved the memory issues, while the second resolved the speed problems, providing a speedup of more than 30x (30s per 1000 rather than 900s) and reducing the processing time from several days to a few hours.
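
The index fix was of the "create an index on the join-table foreign key columns" variety. A sketch of the idea using sqlite3, with illustrative table and column names (the real schema differs):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # illustrative; the real DB is on disk

    # Illustrative join table between items and persons (not the real schema).
    conn.execute("CREATE TABLE item_person (item_id INTEGER, person_id INTEGER)")

    # Foreign key columns are not indexed automatically, so without these the
    # consolidation joins were scanning the whole join table.
    conn.execute("CREATE INDEX idx_item_person_item ON item_person (item_id)")
    conn.execute("CREATE INDEX idx_item_person_person ON item_person (person_id)")
    conn.commit()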

Many more examples of this kind of issue could be provided. However, this one already serves to illustrate the two main points:

  • With large datasets speed really matters
  • Even with optimization algorithms can take a substantial time to run

Both of these have a significant impact on the speed, and form, of the development process. First, because one has to spend time optimizing and profiling — which, like all experimentation, is time-consuming. Second, because longer run-times directly impact the rate at which results are obtained and development can proceed — often bugs or improvements only become obvious once one has run on a large dataset, and any change to an algorithm that alters its output requires that it be rerun.

speed.png

Figure 1: Load time when doing consolidation on load

The Size of the Public Domain

This post continues the work begun in this earlier post on “Estimating Information Production and the Size of the Public Domain”. Update: 2009-07-17 there is now a follow-up post.

Having already obtained estimates of the number of items (publications) produced each year based on library catalogue data our next step is to convert this into an estimate of the “size” of the public domain. (NB: as already discussed, “size” could mean several different things. Here, at least to start with, we’re going to take the simplest and crudest approach and equate size with number of publications/items.)

The natural, and most obvious, approach here is to go through our 1 million+ items and compute their public domain status (as discussed in this earlier post). Unfortunately, as detailed there, this is problematic because we often have insufficient information in library catalogues with which to compute PD status with certainty — in particular, author death dates are frequently absent. Thus, it will be necessary to fall back on some approximate method.

For example, we can base PD status on simple publication dates: if a book was published, say, 140 years ago it is very likely in the public domain — for it to be in copyright its author must have lived more than 70 years after the book came out (remember copyright lasts for life plus 70 years in the EU)! Conversely, any publication less than 70 years old is almost certainly not in the public domain. For periods in between we can assume some proportion of publications are PD, starting close to zero for more recent items and rising towards one for older ones. A calculation along those lines is provided in the following table:

Start    End     Items     % PD    Number PD
1400     1870    389291    100     389291
1870     1880    50564     95      48035
1880     1890    66857     90      60171
1890     1900    66883     80      53506
1900     1910    70360     50      35180
1910     1920    60489     30      18146
1920     1930    78670     10      7867
1930     1940    90576     5       4528
Total            873690    71      616724

Number of UK Public Domain Publications (Based on Cambridge University Library Catalogue Data)

So, based on the assumptions regarding PD proportions given in the table, there are somewhat over 600 thousand PD books in the holdings of Cambridge University Library (of which just over half, approx 390k, are from before 1870). The British Library dataset is approx 4x as big as Cambridge University Library’s and the numbers scale up roughly proportionately, giving a total of over 2.4 million items.
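
The arithmetic behind the table is simply a proportion-weighted sum over publication periods. A minimal sketch using the figures above:

    # (start, end, items, assumed PD proportion), as in the table above
    periods = [
        (1400, 1870, 389291, 1.00),
        (1870, 1880, 50564, 0.95),
        (1880, 1890, 66857, 0.90),
        (1890, 1900, 66883, 0.80),
        (1900, 1910, 70360, 0.50),
        (1910, 1920, 60489, 0.30),
        (1920, 1930, 78670, 0.10),
        (1930, 1940, 90576, 0.05),
    ]

    total_items = sum(items for _start, _end, items, _prop in periods)
    total_pd = sum(items * prop for _start, _end, items, prop in periods)
    # roughly matches the table: ~617k PD items, ~0.71 of the pre-1940 total
    print(int(total_pd), round(total_pd / total_items, 2))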

Of course this is a fairly crude approach based purely on publication date, and it could be improved in a variety of ways, most notably by using the authorial birth date information which is usually present in catalogue data (we can also use death date information where present). This will be the subject of the next post. (2009-07-17: the post is up here.)

Estimating Information Production and the Size of the Public Domain

Here we’re going to look at using library catalogue data as a source for estimating information production (over time) and the size of the public domain.

Library Catalogues

Cultural institutions, primarily libraries, have long compiled records of the material they hold in the form of catalogues. Furthermore, most countries have had one or more libraries (usually the national library) whose task included an archival component and, hence, whose collections should be relatively comprehensive, at least as regards published material.

The catalogues of those libraries then provide an invaluable resource for charting, in the form of publications, levels of information production over time (subject, of course, to the obvious caveats about coverage and the relationship of general “information production” to publications).

Furthermore, library catalogue entries record (almost) the right sort of information for computing public domain status; in particular a given record usually has a) a publication date and b) unambiguously identified author(s) with birth date(s) (though unfortunately often not a death date). Thus, we can also use this catalogue data to estimate the size of the public domain — size being equated here to the total number of items currently in the public domain.

Results

To illustrate, here are some results based on the catalogue of Cambridge University Library, which is one of the UK’s “copyright libraries” (i.e. it has a right to obtain, though not an obligation to hold, one copy of every book published in the UK). This first plot shows the number of publications per year, based on the publication date recorded in the catalogue, up until 1960 (when the dataset ends).

A major concern when basing an analysis on these kinds of trends is that fluctuations over time may derive not from changes in underlying production and publication rates but from changes in the acquisition policies of the library concerned. To check for this, we present a second plot which shows the same information but derived from the British Library’s catalogue. Reassuringly, though there are differences, the basic patterns look remarkably similar.
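
The per-year counts behind both plots are produced in the same way, so the robustness check is just a matter of drawing the two catalogues on the same axes. A sketch of the idea (the placeholder year lists stand in for publication years parsed from the MARC records):

    import collections
    import matplotlib.pyplot as plt

    # Placeholder data: in the real analysis these lists are the publication
    # years parsed from the CUL and BL MARC records respectively.
    cul_years = [1914, 1915, 1915, 1916, 1918, 1918]
    bl_years = [1914, 1914, 1915, 1916, 1917, 1918]

    def counts_per_year(pub_years):
        """Return (years, counts) for the number of items per publication year."""
        c = collections.Counter(pub_years)
        years = sorted(c)
        return years, [c[y] for y in years]

    for label, years in [("CUL", cul_years), ("BL", bl_years)]:
        xs, ys = counts_per_year(years)
        plt.plot(xs, ys, label=label)

    plt.legend()
    plt.xlabel("Publication year")
    plt.ylabel("Items per year")
    plt.show()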

CUL data 1600-1960

Number of items (books etc) Per Year in the Cambridge University Library Catalogue (1600-1960).

BL data 1600-1960

Number of items (books etc) Per Year in the British Library Catalogue (1600-1960).

What do we learn from these graphs?

  • In total there were over a million “Items” in this dataset (and parsing, cleaning, loading and analyzing this data took on the order of days — while the preparation work to develop and perfect these algorithms took weeks if not months)
  • The main trend is a fairly consistent, and approximately exponential, increase in the number of publications (items) per year. At the start of our time period in 1600 we have around 400 items a year in the catalogue while by 1960 the number is over 16000.
  • This is a forty-fold increase and corresponds to an annual growth rate of approx 0.8%. Assuming “growth” began only around the time of the industrial revolution (~ 1750) when output was around 1000 (10-year moving average) gives a fairly similar growth rate of around 0.89%.
  • There are some fairly noticeable fluctuations around this basic trend:
    1. There appears to be a burst in publications in the decade or decade and a half before 1800. One can conjecture several, more or less intriguing, reasons for this: the cultural impact of the French revolution (esp. on radicalism), the effect of loosening copyright laws after Donaldson v. Beckett, etc. However, without substantial additional work, for example to examine the content of the publications in that period these must remain little more than conjectures.
    2. The two world wars appear dramatically in our dataset as sharp dips: the pre-1914 level of around 7k+ falls by over a third during the war to around 4.5k and then rises rapidly again to reach, and pass, 7k per year in the early 20s. Similarly, the late 1930s level of around 9.5k per year drops sharply upon the outbreak of war reaching a low of 5350 in 1942 (a drop of 45%), and then rebounding rapidly at the war’s end: from 5.9k in 1945 to 8k in 1946, 9k in 1947 and 11k in 1948!

To do next (but in separate entries — this post is already rather long!):

  • Estimates for the size of the public domain: how many of those catalogue items are in the public domain
  • Distinguishing Publications (“Items”) from “Works” — i.e. production of new material versus the reissuance of old (see previous post for more on this).

Colophon: Background to this Research

I’m working on an EU-funded project on the Public Domain in Europe, with particular focus on the size and value of the public domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?

I’ve also been involved for several years in Public Domain Works, a project to create a database of works which were in the public domain.

Colophon: Data and Code

All the code used in parsing, loading and analysis is open and available from the Public Domain Works mercurial repository. Unfortunately, the library catalogue data is not: library catalogue data, at least in the UK, appears to be largely proprietary and the raw data kindly made available to us for the purposes of this research by the British Library and Cambridge University Library was provided only on a strictly confidential basis.

Computing Copyright (or Public Domain) Status of Cultural Works

Background

I’m working on an EU-funded project to look at the size and value of the Public Domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?

I’ve also been involved for several years in Public Domain Works, a project to create a database of works which were in the public domain (especially recordings).

The Problem

Suppose we have data on cultural items such as books and recordings. For a given item we wish to:

  1. Identify the underlying work(s) that item contains.
  2. Identify the copyright status of that work, in particular whether it is Public Domain (PD)

Putting 1 and 2 together allows us to assign a ‘copyright status’ to a given item.

Aside: We have to be a bit careful here since the copyright status of an item and its work may not be exactly the same: for example, even books containing pure public domain texts may have copyright in their typesetting — or there may be additional non-PD material such as an introduction or commentaries (though, in this case, at least theoretically, we should say the item contains 2 works a) the original PD text b) the non-PD introduction).

Note our terminology here (based on FRBR): by an ‘item’ we mean something like a publication, be that a book, a recording or whatever. By a ‘work’ we mean the underlying material (text, sounds etc) contained within that. So, for example, Shakespeare’s play “Hamlet” is a single work but there are many associated items (publications). (Note that we would count a translation of a work as a new work — though one derived from the original work.)

Almost all the data available on cultural material is about items. For example, library catalogues list items, databases listing sales (such as Nielsen) list items and online sites providing information on currently available material (along with prices) such as booksinprint, muze or even Amazon list items.
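
To make the item/work/person distinction concrete, here is one way it might be modelled in code (a sketch only; the actual Public Domain Works schema is richer than this):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Person:
        name: str
        birth_year: Optional[int] = None
        death_year: Optional[int] = None   # often missing in catalogue data

    @dataclass
    class Work:
        title: str                         # e.g. "Hamlet"
        authors: List[Person] = field(default_factory=list)

    @dataclass
    class Item:
        title: str                         # title as printed on this publication
        pub_year: Optional[int] = None
        works: List[Work] = field(default_factory=list)  # usually one, sometimes more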

Determining Copyright (or Public Domain) Status

With our terminology in place determining copyright status is, in theory, simple:

  1. Given information on an item match it to a work (or works).
  2. For each work obtain relevant information such as date work first published (as an item) and death dates of author(s)
  3. Compute copyright status based on the copyright laws for your jurisdiction.

While copyright law is not always simple, step three is generally fairly straightforward, especially if one is willing to accept something that is almost but not quite 100% accurate (say 99.99% accurate).1

What is not so straightforward are the first two steps especially step 1. This is because most datasets give only a limited amount of information on the items they contain.

Frequently, information on authors will be limited or non-existent, and authors certainly may not be unambiguously identified (this is especially true of datasets containing ‘commercial’ information such as prices and availability). Often the exact form of the title, even for the same item, will vary between datasets, and that leaves aside the possibility of varying titles for different items related to the same work (is it “Hamlet” or “William Shakespeare’s Hamlet” or “Hamlet by William Shakespeare” or “Hamlet, Prince of Denmark” etc).

At the same time, speed matters because the datasets involved are fairly substantial. For example, there were approx 64 thousand titles that sold more than 5 copies in 2007 in the UK. If computing public domain status for each title takes 1 second then a full run will take 18 hours. If it takes 30s per title it will take 22 days.

Some Examples

To illustrate the difficulties, here I present the results of two different attempts at computing the PD status of the list of 64k titles which sold at least 5 copies in the UK in 2007.

Example 1: Open Library

I ran this algorithm (by_work method) against the Open Library database via their web API. This was a very slow process: first, because web APIs are relatively slow, and second because, perhaps due to overloading, the OL API would stop responding at some point and a manual reboot would be required (to try to avoid overloading the API we had already added a significant delay between requests — another reason the process was quite slow). Overall it took around 10 days to run through the whole 64k item dataset. The results were as follows:

Total PD: 2206.0
Total Items: 63937
Fraction PD: 0.0345027136087
Total Matched: 0.588469900058

As this shows, matching was not that successful, with only around 3/5 of items successfully matched. Part of this may be due to the fact that:

  • I limit the number of title matches to 10 in order to keep the time within reasonable bounds
  • The difficulty of allowing enough, but not too much, fuzziness in the matching process.

Overall, approximately 3.5% of all items were identified as PD (that being 5.8% of those actually matched). The PD determination algorithm was a conservative one with an item labelled as PD only if all authors were positively identified as PD.

Thus, this is likely to be a lower bound (at least assuming the match process was reasonable — and allowing for the fact that some PD items included non-PD material such as commentaries). It was certainly clear from basic eyeballing that a substantial number of PD works were either not matched or not computed as PD (because of incorrect authors or missing death dates).

Example 2

Our second algorithm ran against a local copy of Philip Harper’s NGCOBA database (data, code). The algorithm was as follows (a rough sketch in code is given after the list):

  1. Match by title and authors.
    • If match: compute PD status strictly (all death dates known and all less than 1937)
    • Else: continue
  2. Pick first author and find all (approx) matching authors (allow extra first names)
    • If no match: Not PD
    • Initialize PD score to 0
    • For each matched author, alter the score in the following manner:
      • If author PD: +1
      • If not PD: -3
      • If unknown (no death_date) -0.5
    • PD if score > 0 (Else: Not PD)
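
Here is the rough code sketch of the scoring step referred to above (field names are illustrative and not the NGCOBA schema; the 2007 reference year means "author PD" corresponds to a death date before 1937):

    def pd_by_author_score(matched_authors, year=2007):
        """Approximate PD call for an item from its (fuzzily) matched authors.

        matched_authors: list of dicts with an optional 'death_year' key.
        """
        if not matched_authors:
            return False          # no matching author found: treat as not PD
        score = 0.0
        for author in matched_authors:
            death = author.get("death_year")
            if death is None:
                score -= 0.5      # unknown death date: weak evidence against PD
            elif year - death > 70:
                score += 1.0      # author PD
            else:
                score -= 3.0      # author definitely not PD
        return score > 0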

This algorithm took a few hours to run (this could likely be much improved with a bit of DB optimization and a move from sqlite to something better). The results were:

Total PD: 6404.0
Total Items: 63917
Fraction PD: 0.100192437067

As can be seen, the fraction PD here was substantially higher, at around 10%. One might be concerned that this was due to our more lenient PD algorithm (the problem was that without such ‘leniency’ a very large number of PD works/authors were being misclassified as not PD). However, basic eyeballing indicates that the number of false positives is not particularly high (and that there are also some false negatives).

Summary

  1. Computing PD status is non-trivial largely because a) it is hard to match a given item to a work or person b) we lack data such as authorial death dates and dates of first publication that are required.
  2. As such we need to adopt approximate and probabilistic methods (such as the scoring approach)
  3. (Very) preliminary calculations suggest that between 3 and 10% of titles actively sold at any one time are public domain
    • NB: this does not mean 3-10% of sales were public domain (in fact this is very unlikely since few, if any of the best-selling items are PD)

  1. Not being 100% accurate means we can ignore some of the “special cases” and one-off exceptions in copyright law. For example, in the UK the Copyright, Designs and Patents Act, para 301, contains a special provision which means that “Peter Pan” by J.M. Barrie will never enter the Public Domain (royalties will be payable in perpetuity for the benefit of Great Ormond Street Hospital). 

Public Domain in Europe (EUPD) Research Project

I’m part of a team, led by Rightscom, which has won a bid to do a major analysis of the scope and nature of the public domain in Europe for the European Commission. As it says in the announcement:

We will assemble quantitative and qualitative data and produce a methodology for measuring the public domain which can be used and refined for future studies both within Europe and further a field. The objectives of the report are four fold:

  • To estimate the number of works in the public domain in the EU and calculate approximately the levels and ways of use and main users of published works
  • To estimate the current economic value of public domain works and estimate the value of works that in the next 10-20 years are to be released into the public domain and determine any change in its value whilst under copyright and once it is on the public domain

For my part, I’m going to be particularly focused on the size and value questions. This will involve getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?