Category Archives: Culture and Society

Open Shakespeare Annotation Sprint

Cross-posted from Open Knowledge Foundation blog.

Tomorrow we’re holding the first Open Shakespeare Annotation ‘Sprint’. We’ll be getting together online and in-person to collaborate on critically annotating a complete Shakespeare play with all our work being open.

All of Shakespeare’s texts are, of course, in the public domain, and therefore already ‘open’. However, most editions of Shakespeare that people actually use (and purchase) are ‘critical’ editions, that is, texts together with notes and annotations that explain or analyze the text, and for these critical editions no open version yet exists. This weekend we’re aiming to change that!

Using the annotator tool we now have a way to work collaboratively online to add and develop these ‘critical’ additions, and the aim of the sprint is to fully annotate one complete play. Anyone can get involved, from lay Shakespeare-lover to English professor; all you’ll need is a web browser and an interest in the Bard. And even if you can’t make it, you can vote right now on which play we should work on!

Using specially-designed annotation software we intend to print an edition of Shakespeare unlike any other, incorporating glosses, textual notes and other information written by anyone able to connect to the Open Shakespeare website.

Work begins with a full-day annotation sprint on Saturday 5th February, which will take place online as well as at in-person meetups. Anyone can organize a meetup, and we’re organizing one at the University of Cambridge English Faculty (if you’d like to hold your own, please just add it to the etherpad linked above).

The Public Domain in 2011

According to http://publicdomainworks.net/ (which I helped build) there were 661 people whose works entered the public domain in 2011:

http://publicdomainworks.net/stats/year/2011

Of course, I should immediately state that this is a fairly crude calculation based on a simple life+70 model and therefore not applicable to e.g. the US with its 1923 cut-off (for those interested in the details of computing public domain status, you can find lots more here: http://wiki.okfn.org/PublicDomainCalculators).
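For the purely mechanistic part, the rules are simple enough to sketch in a few lines of Python. This is a minimal, hypothetical illustration of the life+70 rule and the crude US pre-1923 cut-off mentioned above; real public domain calculators (see the wiki link) handle far more cases, and the function names here are my own.

    # Minimal sketch, not a real calculator: life+70 plus the crude US pre-1923 rule.
    # Real calculators handle corporate works, anonymous works, non-retroactive rules, etc.

    def is_pd_life_plus_70(death_year, current_year=2011):
        """EU-style rule: the work enters the public domain 70 full years after the author's death."""
        return death_year is not None and current_year > death_year + 70

    def is_pd_us_pre_1923(publication_year):
        """Crude US rule of thumb used here only for contrast: published before 1923."""
        return publication_year < 1923

    # e.g. is_pd_life_plus_70(1940) -> True in 2011 (cf. Nathanael West, d. 1940, mentioned below)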

The figure is also a significant underestimate — to do these calculations you need lots of information about authors, their death dates and their works. This kind of bibliographic metadata has, until fairly recently, been very hard to come by in an open data form, and so we have been limited to doing calculations with only a relatively small subset of all works (though, it should be said, we do have many of the most ‘important’ authors).

Thankfully this is now changing thanks to people like the British Library opening up their data so we should see a much extended list for 2011 some time in the next few months (if you’re interested in open bibliographic data, you should join the Open Knowledge Foundation’s Open Bibliographic Data Working Group).

Launch of the Public Domain Review

Lastly, I have an exciting announcement. Thanks to the work of my Open Knowledge Foundation colleague Jonathan Gray, we’re pleased to announce the launch of the Public Domain Review to celebrate Public Domain Day 2011:

http://publicdomainreview.okfn.org/

As Jonathan explains in the blog post:

The 1st of January every year is Public Domain Day, when new works enter the public domain in many (though unfortunately not all) countries around the world.

To celebrate, the Open Knowledge Foundation is launching the Public Domain Review, a web-based review of works which have entered the public domain:

Each week an invited contributor will present an interesting or curious work with a brief accompanying text giving context, commentary and criticism. The first piece takes a look at works by Nathanael West, whose works enter the public domain today in many jurisdictions.

You can sign up to receive the review in your inbox via email. If you’re on Twitter, you can also follow @publicdomainrev. Happy Public Domain Day!

Slouching Towards Bethlehem by Joan Didion

Some time ago I read Joan Didion’s extraordinary set of essays Slouching Towards Bethlehem1, a book filled with the sense of dislocation and anomie that seems so essential to the experience, at least in literature, of America itself.

The most penetrating of the set was that which lends its title to the book2 and I marked one particular section out of that essay, and out of the book as a whole:

[in discussion of Haight-Ashbury in summer 1967] But the peculiar beauty of this political potential, as far as the activists were concerned, was that it remained not clear at all to most of the inhabitants of the District, perhaps because the few seventeen-year-olds who are political realists tend not to adopt romantic idealism as a life style. Nor was it clear to the press, which at varying levels of competence continued to report “the hippie phenomenon” as an extended panty raid; an artistic avant-garde led by such comfortable YMHA regulars as Allen Ginsberg; or a thoughtful protest, not unlike joining the Peace Corps, against the culture which had produced Saran-Wrap and the Vietnam War. This last, or they’re-trying-to-tell-us-something approach, reached its apogee in a Time cover story which revealed that hippies “scorn money — they call it ‘bread’” and remains the most remarkable, if unwitting, extant evidence that the signals between the generations are irrevocably jammed.

Because the signals the press were getting were immaculate of political possibilities, the tensions of the District went unremarked upon, even during the period when there were so many observers on Haight Street from Life and Look and CBS that they were largely observing one another. …

Of course the activists — not those whose thinking had become rigid, but those whose approach to revolution was imaginatively anarchic — had long ago grasped the reality which still eluded the press: we were seeing something important. We were seeing the desperate attempt of a handful of pathetically unequipped children to create a community in a social vacuum. Once we had seen these children, we could no longer overlook the vacuum, no longer pretend that the society’s atomization could be reversed. This was not a traditional generational rebellion. At some point between 1945 and 1967 we had somehow neglected to tell these children the rules of the game we happened to be playing. Maybe we had stopped believing in the rules ourselves, maybe we were having a failure of nerve about the game. Maybe there were just too few people around to do the telling. These were children who grew up cut loose from the web of cousins and great-aunts and family doctors and lifelong neighbors who had traditionally suggested and enforced the society’s values. They are children who have moved around a lot, San Jose, Chula Vista, here. They are less in rebellion against the society than ignorant of it, able only to feed back certain of its most publicized self-doubts, Vietnam, Saran-Wrap, diet pills, the Bomb. [bold emphasis added]

They feed back exactly what is given them. … [pp. 121-123]

Colophon

From the closing paragraph of the preface:

My only advantage as a reporter is that I am so physically small, so temperamentally unobtrusive, and so neurotically inarticulate that people tend to forget that my presence runs counter to their best interests. And it always does. That is one last thing to remember: writers are always selling somebody out.


  1. Flamingo 1993, first published Farrar, Straus and Giroux 1968. 

  2. An entirely intentional choice. As Didion states in the preface:

    “[Slouching Towards Bethlehem] is also the title of one piece of the book, and that piece, which derived from some time spent in the Haight-Ashbury district of San Francisco, was for me both the most imperative of all these pieces to write and the only one that made me despondent after it was printed. It was the first time I had dealt directly and flatly with evidence of atomization, the proof that things fall apart: I went to San Francisco because I had not been able to work in some months, had been paralyzed by the conviction that writing was an irrelevant act, that the world as I had understood it no longer existed. If I was to work again at all, it would be necessary for me to come to terms with disorder. That was why the piece was important to me. And after it was printed I saw that, however directly and flatly I thought I had said it, I had failed to get through to many of the people who read and even liked the piece, failed to suggest that I was talking about something more general than a handful of children wearing mandalas on their foreheads. Disc jockeys telephoned my house and wanted to discuss (on the air) the incidence of “filth” in the Haight-Ashbury, and acquaintances congratulated me on having finished the piece “just in time”, because “the whole fad’s dead now, fini, kaput.” I suppose almost everyone who writes is afflicted some of the time by the suspicion that nobody out there is listening, but it seemed to me then (perhaps because the piece was important to me) that I had never gotten a feedback so universally beside the point. 

Papers on the Size and Value of EU Public Domain

I’ve just posted two new papers on the size and ‘value’ of the EU Public Domain. These papers are based on the research done as part of the Public Domain in Europe (EUPD) Research Project (which has now been submitted).

  • Summary Slides Covering Size and Value of the Public Domain – Talk at COMMUNIA in Feb 2010
  • The Size of the EU Public Domain

    This paper reports results from a large recent study of the public domain in the European Union. Based on a combination of catalogue and survey data our figures for the number of items (and works) in the public domain extend across a variety of media and provide one of the first quantitative estimates of the ‘size’ of the public domain in any jurisdiction. We find that for books and recordings the public domain is around 10-20% of published extant output and would consist of millions and hundreds of thousands of items respectively. For films the figure is dramatically lower (almost zero). We also establish some interesting figures relevant to the orphan works debate such as the number of catalogue entries without any identified author (approximately 10%).

  • The Value of the EU Public Domain

    This paper reports results from a large recent study of the public domain in the European Union. Based on a combination of catalogue, commercial and survey data we present detailed figures both on the prices (and price differences) of in-copyright and public domain material and on the usage of that material. Combined with the estimates for the size of the EU public domain presented in the companion paper, our results allow us to provide the first quantitative estimate for the ‘value’ of the public domain (i.e. welfare gains from its existence) in any jurisdiction. We also find clear, and statistically significant, differences between the prices of in-copyright and public-domain material in the two areas for which we have significant data: books and sound recordings in the UK. Patterns of usage indicate a significant demand for public domain material but limitations of the data make it difficult to draw conclusions on the impact of entry into the public domain on demand.

The results on price differences are particularly striking as, to my knowledge, this is by far the largest analysis done to date. More significantly, they clearly show that the claim in the Commission’s impact assessment that there was no price effect of copyright (compared to the public domain) was wrong. That claim was central to the impact assessment and to the proposal to extend copyright term in sound recordings (a claim that was based on a single study with a very small sample size, performed by PwC as part of a music-industry-sponsored piece of consultancy for submission to the Gowers review).

The Size of the Public Domain (Without Term Extensions)

We’ve looked at the size of the public domain extensively in earlier posts.

The basic takeaway from the analysis was the finding that, based on library catalogue data, approximately 15-20% of books in the UK are in the public domain — with public domain works being pretty old (70 years plus, due to the life+70 nature of copyright).

An interesting question to ask, then, is: how large would the public domain be if copyright had not been extended from its original length of 14 years with a (possible) 14-year renewal (14+14), as set out in the Statute of Anne back in 1710? And how does this compare with the situation back when 14+14 was in “full swing”, say in 1795?

Furthermore, what if copyright today were a simple 15 years — the point estimate for the optimal term of copyright found in a paper on this subject? Well, here’s the answer:

                     Today    1795 (14+14)   Today (14+14)   Today (15y)
Total Items          3.46m    179k           3.46m           3.46m
No. Public Domain    657k     140k           1.2m            2.59m
% Public Domain      19%      78%            52%             75%

Number and percentage of public domain works under various scenarios, based on Cambridge University Library catalogue data.

That’s right folks: based on the data available, if copyright had stayed at its Statute of Anne level, 52% of the books available today would be in the public domain, compared to an actual level of 19%. That’s around 600,000 additional items that would be in the public domain, including works like Virginia Woolf’s (d. 1941) The Waves, Salinger’s The Catcher in the Rye (pub. 1951) and Marquez’s Chronicle of a Death Foretold (pub. 1981).

For comparison, in 1795 78% of all extant works were in the public domain — a figure we would be close to today if copyright were a simple 15 years (in that case the public domain would be a substantial 75%).
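For readers who want to reproduce this kind of scenario comparison, here is a rough sketch of the underlying calculation. It assumes, purely for illustration, that terms run from the publication year and that under 14+14 every work was renewed for the full 28 years; the function and variable names are my own, not the project’s code.

    # Sketch of the fixed-term scenarios in the table above (assumptions mine:
    # term runs from publication, and 14+14 works are always renewed).

    def is_pd_fixed_term(pub_year, term_years, current_year=2010):
        """PD if the fixed term, counted from publication, has expired."""
        return current_year > pub_year + term_years

    def pd_share(pub_years, term_years, current_year=2010):
        """Fraction of a list of publication years that would be PD under the given term."""
        n_pd = sum(1 for y in pub_years if is_pd_fixed_term(y, term_years, current_year))
        return n_pd / float(len(pub_years))

    # e.g. pd_share(catalogue_pub_years, 28)  # 14+14 with renewal
    #      pd_share(catalogue_pub_years, 15)  # a flat 15-year term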

To put this in visual terms, what the public domain is missing out on as a result of copyright extension is the yellow region in the following figure: that is the set of works that would be in the public domain under 14+14 but aren’t under current copyright!

PD Stats

The Public Domain of books today (red), under 14+14 (yellow), and published output (black)

Update: I’ve posted the main summary statistics file including per-year counts. I’ve also started a CKAN data package: eupd-data for this EUPD-related data.

The Elusive Disappearance of Community

From Laslett, ‘Philippe Ariès and “La Famille”’, p. 83 (quoted in Eisenstein, p. 131):

The actual reality, the tangible quality of community life in earlier towns or villages … is puzzling … and only too susceptible to sentimentalisation. People seem to want to believe that there was a time when every one belonged to an active, supportive local society, providing a palpable framework for everyday life. But we find that the phenomenon itself and its passing — if that is what, in fact, happened — perpetually elude our grasp.

Author “Significance” From Catalogue Data

Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance” as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).

I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:

  • Prolificness — how many distinct works an author produced (since usually each work will get an item)
  • Popularity — this influences how many times the same work gets reissued as a new ‘item’, as well as the library’s decision to keep the item
  • Merit — as for popularity

The following table shows the top 50 authors by “significance” (a sketch of the underlying counting is given after the table). Some of the “authors” aren’t real people but entities such as “Great Britain. Parliament” and can be ignored for our purposes. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:

  • Shakespeare is number 1 (2)
  • Classics (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
  • Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith, Gilbert Burnet (perhaps accounted for by his prolificacy).
  • Also surprising is the limited showing of 19th-century UK authors, with only Scott (26), Dickens (28) and Byron (41)

Rank   No. of Items   Name
1      3112           Great Britain. Parliament.
2      1154           Shakespeare, William
3      1076           Church of England.
4      973            Cicero, Marcus Tullius
5      825            Great Britain.
6      766            Catholic Church.
7      721            Erasmus, Desiderius
8      654            Defoe, Daniel
9      620            Horace
10     599            Aristotle
11     547            Voltaire
12     539            Virgil
13     527            Swift, Jonathan
14     520            Goethe, Johann Wolfgang Von
15     486            Rousseau, Jean-Jacques
16     479            Homer
17     444            Milton, John
18     388            Sterne, Laurence
19     387            England and Wales. Sovereign (1660-1685 : Charles II)
20     386            Euripides
21     372            Ovid
22     358            Goldsmith, Oliver
23     358            Plato
24     351            Wang
25     349            Alighieri, Dante
26     338            Scott, Walter (Sir)
27     326            More, Hannah
28     322            Dickens, Charles
29     315            Aeschylus
30     304            Burnet, Gilbert
31     302            Luther, Martin
32     295            Dryden, John
33     290            Xenophon
34     280            Sophocles
35     262            Pope, Alexander
36     259            Fielding, Henry
37     258            Li
38     250            Calvin, Jean
39     248            Zhang
40     247            Aristophanes
41     247            Byron, George Gordon Byron (Baron)
42     247            Bacon, Francis
43     247            Chen
44     245            Terence
45     241            Euclid
46     235            Augustine (Saint, Bishop of Hippo.)
47     232            Burke, Edmund
48     223            Johnson, Samuel
49     222            Bunyan, John
50     222            De la Mare, Walter

Top 50 authors based on CUL Catalogue 1400-1960
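For what it’s worth, the counting behind a table like this is straightforward. The sketch below is illustrative only and is not the actual ‘person_work_and_item_counts’ method mentioned in the Colophon; it assumes, hypothetically, that the catalogue records are available as dicts with an ‘author’ field.

    # Illustrative only -- not the repository's 'person_work_and_item_counts' method.
    # Assumes (hypothetically) catalogue records as dicts with an 'author' field.
    from collections import Counter

    def top_authors(records, n=50):
        """Rank authors by the number of catalogue items ("significance")."""
        counts = Counter(r["author"] for r in records if r.get("author"))
        return counts.most_common(n)

    # for rank, (name, items) in enumerate(top_authors(records), start=1):
    #     print(rank, items, name)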

The other thing we can look at is the overall distribution of items per author (and how it varies with rank — a classic “is it a power law?” question). Below are the histogram (NB log scale for counts) together with a plot of rank against count (which equates, very crudely, to a transposed plot of the tail of the histogram). In both cases it looks (!) like a power law is a reasonable fit, given the (approximate) linearity, but this should be backed up with a proper K-S test (see the sketch after the figures).

culbooks_person-item-hist-logxlogy.png

Histogram of items-per-author distribution (log-log)

culbooks_person-item-by-rank-logxlogy.png

Rank versus no. of items (log-log)
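As a rough idea of the kind of check flagged above (and in the TODO list below), here is a minimal sketch of a maximum-likelihood power-law fit followed by a one-sample K-S test against the fitted distribution. It treats the counts as continuous and uses scipy’s Pareto parameterisation; the function name, the default xmin and the overall shortcut are my own, and a serious analysis would follow the Clauset, Shalizi and Newman approach.

    # Rough check only: MLE exponent for a continuous power law with lower bound
    # xmin, then a one-sample K-S test against the fitted Pareto.
    import numpy as np
    from scipy import stats

    def power_law_ks_check(counts, xmin=1.0):
        """counts: items-per-author values, e.g. [3112, 1154, 1076, ...]."""
        x = np.asarray([c for c in counts if c >= xmin], dtype=float)
        alpha = 1.0 + len(x) / np.sum(np.log(x / xmin))   # MLE exponent
        # scipy's pareto(b, loc=0, scale=xmin) has density ~ x**-(b+1) for x >= xmin
        d_stat, p_value = stats.kstest(x, "pareto", args=(alpha - 1.0, 0.0, xmin))
        return alpha, d_stat, p_value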

TODO

  • K-S tests
  • Extend data to present day
  • Check against other catalogue data
  • Look at occurrence of people in title names
  • Look at when items appear over time

Colophon

Code to generate the table and graphs is in the open Public Domain Works repository; specifically, see the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py

Size of the Public Domain II

This follows up my previous post. Here we are going to calculate public domain numbers based directly on authorial birth/death date information rather than on guesstimated weightings. We will focus on the Cambridge University Library (CUL) data we used previously.

Pub. Date    Total    No Author     Any Date      Death Date
1870-1880    50564    6634 (13%)    23016 (45%)   21876 (43%)
1880-1890    66857    8225 (12%)    31135 (46%)   28570 (42%)
1890-1900    66883    8733 (13%)    32169 (48%)   28971 (43%)
1900-1910    70360    8594 (12%)    35401 (50%)   29922 (42%)
1910-1920    60489    7722 (12%)    31336 (51%)   24608 (40%)
1920-1930    78670    9023 (11%)    44219 (56%)   32658 (41%)
1930-1940    90576    11004 (12%)   46849 (51%)   29372 (32%)
1940-1950    72692    7638 (10%)    36495 (50%)   22155 (30%)

Table 1: PD Relevant Information Availability

Table 1 presents a summary of how much relevant information is available for items (books) of particular vintages in the CUL catalogue — we only show data from 1870 to 1950 on the presumption that (almost) all pre-1870 publications are PD (their authors would have had to live for more than 70 years post-publication for this not to be the case) and almost all publications post 1950 are in copyright today (their authors would have to have died before 1940 for this not to be the case).

As the table shows, at best only just over 40% of items have a recorded authorial death date, and extending to include birth dates only raises this proportion to, at best, the mid-to-low fifties. Taking account of items which lack any associated author raises these figures somewhat further, to around 60%, though we should note that the reason for the lack of an associated author is not clear — is it because they are genuinely anonymous or simply because the information has not been recorded? Thus, even for the earliest items listed, a large proportion of items (50% or more) lack the necessary information for direct computation of public domain status.

At the same time, we can take some heart, and some interesting facts, from this table. First, a reasonable proportion of items, amounting to many thousands, did have associated death dates. Second, at least for older items, the majority of those with any date had a death date (95% for 1870-1880 and still over 70% for 1920-1930). Third, and this is a more general observation, the proportions are surprisingly constant over time. For example, the proportion of ‘anonymous’ items lies in a narrow band between 10% and 13% for the entire period. Similarly, the proportion of items with any date information ranges only from 45% to 56%. At the same time, and reassuringly, though the proportion with death dates is relatively constant for the oldest periods, in the more recent ones it falls substantially, as one would expect given that some of the authors from those more recent eras are still alive.

Pub. Date    Total    PD            Not PD        ?             Prop 1   Prop 2
1870-1880    50565    22157 (43%)   68 (0%)       28340 (56%)   99%      96%
1880-1890    66858    28325 (42%)   649 (0%)      37884 (56%)   97%      90%
1890-1900    66884    26723 (39%)   2418 (3%)     37743 (56%)   91%      83%
1900-1910    70362    24032 (34%)   5838 (8%)     40492 (57%)   80%      67%
1910-1920    60491    16200 (26%)   8306 (13%)    35985 (59%)   66%      51%
1920-1930    78671    16127 (20%)   16351 (20%)   46193 (58%)   49%      36%
1930-1940    90583    8973 (9%)     20835 (23%)   60775 (67%)   30%      19%
1940-1950    72696    5000 (6%)     19316 (26%)   48380 (66%)   20%      13%

Table 2: PD Status by Decade. ‘?’ indicates items where PD status could not be computed. Prop(ortion) 1 equals total PD divided by total for which status could be computed (sum of total PD and Not PD). Prop(ortion) 2 equals total PD divided by number of items for which any author date was known (‘Any Date’ in previous table).

Table 2 reports the results of direct computation of PD status based on the information available. Note that, in doing these computations, we have augmented the basic life+70 rule with the additional assumptions that a) all items published in 1870 or before are PD, b) no author is older than 100 (so if a birth date is more than 170 years ago the item is PD), and c) every author lives at least until 30 (so that any work published by an author born less than 100 years ago is automatically not PD).
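To make assumptions (a)-(c) concrete, here is a small sketch of the status computation as I read it; the record field names are hypothetical and this is not the project’s actual code.

    # Sketch of the augmented life+70 computation described above;
    # the record field names are hypothetical, not the real schema.

    def pd_status(record, current_year=2010):
        """Return True (PD), False (not PD) or None (undetermined)."""
        pub = record.get("pub_year")
        birth = record.get("birth_year")
        death = record.get("death_year")
        # (a) everything published in or before 1870 is treated as PD
        if pub is not None and pub <= 1870:
            return True
        # plain life+70 when a death date is available
        if death is not None:
            return current_year > death + 70
        if birth is not None:
            # (b) no author lives past 100: born more than 170 years ago => PD
            if current_year - birth > 170:
                return True
            # (c) every author lives to at least 30: born less than 100 years ago => not PD
            if current_year - birth < 100:
                return False
        return None  # not enough information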

As is to be expected, for the majority of the periods the availability of PD status (either PD or Not PD) closely tracks the availability of death date information — the total for which PD status can be determined (the sum of PD and Not PD) almost exactly equals the total for which death date information is available. It is only in the last period, 1940-1950, that the birth date appears to make any contribution. More interesting is how the numbers of PD and Not PD items vary over time, especially relative to each other (and as a proportion of the records for which any date is available).

These two proportions/ratios are recorded in the last two columns, which give, respectively: 1) the PD total relative to the number of items for which any status could be computed (i.e. the sum of PD and Not PD), and 2) the PD total relative to the total number of items for which any date information is available. These ratios change dramatically over the periods shown: starting in the high 90%s in the 1870-1880 period, by the 1940s they are down to 20% or below.

Pub. Date    % PD
0000-1870    100
1870-1880    95
1880-1890    90
1890-1900    85
1900-1910    65
1910-1920    40
1920-1930    25
1930-1940    10
1940-1950    6
1950-Now     0

Table 3: Suggested PD Proportions

The key question for us is how to extrapolate these PD proportions to the full set of records — i.e. from the set of records for which there is the necessary birth/death date information to those for which there is not. The simplest, and most obvious, approach is to assume that the proportions are identical and therefore that the PD proportions calculated on the partial dataset apply to the whole. However, there are some obvious deficiencies in this approach.

In particular, our ability to compute a PD status is largely linked to the existence of a death date, and it is likely that the presence of this information is itself correlated with authorial age — after all, a death date can only exist once that person has died! This correlation, and the bias it gives rise to, is probably small in the early periods — the authors of any pre-1930 work are almost certainly no longer alive today. However, for the later periods the bias may be more substantial — it is in the last two periods (1930-1940 and 1940-1950) that there is a significant reduction in the number of records with a death date and (relatedly) a significant increase in the number of records for which the PD status is unknown.

Thus, in converting the partial PD proportions to full PD proportions it seems sensible to revise the partial figures down somewhat, with the revision being greater in later periods. Moreover, we have a lower bound for any downwards revision in the total PD as a proportion of all records — which even in the 1940-1950 period stood at 6%. In light of these considerations, Table 3 gives fairly conservative figures for PD proportions to use when estimating PD size based on publication dates. Interestingly, even with our conservative assumptions, these proportions are rather higher than those used in our previous analysis.
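Applying the Table 3 proportions to estimate an overall public domain size then amounts to a weighted sum over per-decade item counts. A minimal sketch follows; the function name and input format are my own choices for illustration.

    # Sketch: apply the Table 3 proportions to per-decade item totals.
    # Input format (decade label -> item count) is assumed, not prescribed.

    PD_PROPORTION = {
        "0000-1870": 1.00, "1870-1880": 0.95, "1880-1890": 0.90,
        "1890-1900": 0.85, "1900-1910": 0.65, "1910-1920": 0.40,
        "1920-1930": 0.25, "1930-1940": 0.10, "1940-1950": 0.06,
        "1950-Now": 0.00,
    }

    def estimate_pd_items(items_per_decade):
        """Weighted sum of item counts using the assumed PD proportions."""
        return sum(PD_PROPORTION[decade] * count
                   for decade, count in items_per_decade.items())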

Estimating Information Production and the Size of the Public Domain

Here we’re going to look at using library catalogue data as a source for estimating information production (over time) and the size of the public domain.

Library Catalogues

Cultural institutions, primarily libraries, have long compiled records of the material they hold in the form of catalogues. Furthermore, most countries have had one or more libraries (usually the national library) whose task included an archival component and, hence, whose collections should be relatively comprehensive, at least as regards published material.

The catalogues of those libraries then provide an invaluable resource for charting, in the form of publications, levels of information production over time (subject, of course, to the obvious caveats about coverage and the relationship of general “information production” to publications).

Furthermore, library catalogue entries record (almost) the right sort of information for computing public domain status; in particular, a given record usually has a) a publication date and b) unambiguously identified author(s) with birth date(s) (though unfortunately not death dates). Thus, we can also use this catalogue data to estimate the size of the public domain — size being equated here with the total number of items currently in the public domain.
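As an aside on what “the right sort of information” looks like in practice: author headings in catalogue data often carry dates in a form like “Dickens, Charles, 1812-1870”. The parser below is a hypothetical illustration of extracting birth/death years from such headings; real MARC records need considerably more care.

    # Hypothetical illustration: pull birth/death years out of author headings
    # of the form "Dickens, Charles, 1812-1870" or "Woolf, Virginia, 1882-".
    import re

    AUTHOR_DATES = re.compile(r"(?P<birth>\d{4})\s*-\s*(?P<death>\d{4})?\s*\.?$")

    def parse_author_dates(heading):
        """Return (birth_year, death_year); either may be None."""
        match = AUTHOR_DATES.search(heading)
        if not match:
            return None, None
        birth = int(match.group("birth"))
        death = int(match.group("death")) if match.group("death") else None
        return birth, death

    # parse_author_dates("Dickens, Charles, 1812-1870")  -> (1812, 1870)
    # parse_author_dates("Woolf, Virginia, 1882-")       -> (1882, None)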

Results

To illustrate, here are some results based on the catalogue of Cambridge University Library, which is one of the UK’s “copyright libraries” (i.e. it has a right to obtain, though not an obligation to hold, one copy of every book published in the UK). The first plot shows the number of publications per year, as determined by the publication date recorded in the catalogue, up until 1960 (when the dataset ends).

A major concern when basing an analysis on these kinds of trends is that fluctuations over time derive not from changes in underlying production and publication rates but from changes in the acquisition policies of the library concerned. To check for this, we present a second plot which shows the same information but derived from the British Library’s catalogue. Reassuringly, though there are differences, the basic patterns look remarkably similar. (A rough sketch of how such per-year counts can be plotted is given after the figures.)

CUL data 1600-1960

Number of items (books etc) Per Year in the Cambridge University Library Catalogue (1600-1960).

BL data 1600-1960

Number of items (books etc) Per Year in the British Library Catalogue (1600-1960).
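For concreteness, here is roughly how a per-year plot of this kind might be produced once the catalogue has been parsed into a table. This is a sketch under assumptions: the DataFrame and its ‘pub_year’ column are my own naming, not the repository’s actual schema.

    # Rough sketch only; assumes a pandas DataFrame with a 'pub_year' column.
    import matplotlib.pyplot as plt
    import pandas as pd

    def plot_items_per_year(df, start=1600, end=1960):
        years = df.loc[df["pub_year"].between(start, end), "pub_year"]
        per_year = years.value_counts().sort_index()
        per_year.plot()                      # raw counts per year
        per_year.rolling(10).mean().plot()   # 10-year moving average, as mentioned below
        plt.xlabel("Publication year")
        plt.ylabel("Items per year")
        plt.show()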

What do we learn from these graphs?

  • In total there were over a million “Items” in this dataset (and parsing, cleaning, loading and analyzing this data took on the order of days — while the preparation work to develop and perfect these algorithms took weeks if not months)
  • The main trend is a fairly consistent, and approximately exponential, increase in the number of publications (items) per year. At the start of our time period in 1600 we have around 400 items a year in the catalogue, while by 1960 the number is over 16,000.
  • This is a forty-fold increase and corresponds to an annual growth rate of approx 0.8%. Assuming “growth” began only around the time of the industrial revolution (~ 1750) when output was around 1000 (10-year moving average) gives a fairly similar growth rate of around 0.89%.
  • There are some fairly noticeable fluctuations around this basic trend:
    1. There appears to be a burst in publications in the decade or decade and a half before 1800. One can conjecture several, more or less intriguing, reasons for this: the cultural impact of the French Revolution (esp. on radicalism), the effect of loosening copyright laws after Donaldson v. Beckett, etc. However, without substantial additional work, for example to examine the content of the publications in that period, these must remain little more than conjectures.
    2. The two world wars appear dramatically in our dataset as sharp dips: the pre-1914 level of around 7k+ falls by over a third during the war to around 4.5k and then rises rapidly again to reach, and pass, 7k per year in the early 20s. Similarly, the late 1930s level of around 9.5k per year drops sharply upon the outbreak of war reaching a low of 5350 in 1942 (a drop of 45%), and then rebounding rapidly at the war’s end: from 5.9k in 1945 to 8k in 1946, 9k in 1947 and 11k in 1948!

To do next (but in separate entries — this post is already rather long!):

  • Estimates for the size of the public domain: how many of those catalogue items are in the public domain?
  • Distinguishing Publications (“Items”) from “Works” — i.e. production of new material versus the reissuance of old (see previous post for more on this).

Colophon: Background to this Research

I’m working on an EU-funded project on the Public Domain in Europe, with a particular focus on the size and value of the public domain. This involves getting large datasets about cultural material and trying to answer questions like: how many of these items are in the public domain? What’s the difference in price and availability of public domain versus non-public-domain items?

I’ve also been involved for several years in Public Domain Works, a project to create a database of works which are in the public domain.

Colophon: Data and Code

All the code used in parsing, loading and analysis is open and available from the Public Domain Works mercurial repository. Unfortunately, the library catalogue data is not: library catalogue data, at least in the UK, appears to be largely proprietary and the raw data kindly made available to us for the purposes of this research by the British Library and Cambridge University Library was provided only on a strictly confidential basis.

Filesharing Costs: Dubious Figures Making the Rounds Again

The BBC ran a story yesterday headlined “Seven million ‘use illegal files'”. Its bolded first paragraph stated:

Around seven million people in the UK are involved in illegal downloads, costing the economy tens of billions of pounds, government advisers say. [emphasis added]

Seven million people involved in unauthorised file-sharing is plausible, but costs of tens of billions of pounds? It’s not unusual to see such figures bandied around by rightsholders, derived from wild guesstimates of download numbers and ludicrously unsound assumptions such as equating every download with a lost sale.

Here, however, it is according to “government advisers” — surely a much more reliable source! A quick read and we discover this isn’t the case at all: these figures are directly recycled from rightsholder sources — with an additional uplift from the BBC: a possible £10 billion or more a year has become tens (notice that extra “s”) of billions a year.

First off, the story is based on a report entitled “Copycats? Digital Consumers in an Online Age”, commissioned by the Strategic Advisory Board for Intellectual Property Policy (SABIP) from UCL’s Centre for Information Behaviour and the Evaluation of Research (CIBER). So this is CIBER’s report, not SABIP’s — SABIP need not even have endorsed it. That said, one can see how the BBC’s confusion came about, and this is a minor point (after all, CIBER is part of a university).

More important is a check of the actual evidence underlying these very large claimed costs to the economy. Let’s take a look at the report. Page 6, at the start of the Executive Summary, states (this is where I guess the BBC got its material from):

Industry reports [3] suggest that at least seven million British citizens have downloaded unauthorised content, many on a regular basis, and many also without ethical consideration. Estimates as to the overall lost revenues [4] if we include all creative industries whose products can be copied digitally, or counterfeited, reach £10 billion (IP Rights, 2004), conservatively, as our figure is from 2004, and a loss of 4,000 jobs. This is in the context of the “Creative Industries” providing around 8% of British GDP. And the situation is not solely a British problem, but a global one. …

But wait a moment: their only source here seems to be (IP Rights, 2004) and that turns out to be a single page press release from an IP (law) firm which simply states:

“Rights owners have estimated that last year alone counterfeiting and piracy cost the UK economy £10 billion and 4,000 jobs.”

So these are just the standard (and utterly unreliable) rightsholder-claimed figures (and not even first-hand!). To be fair, in footnote 4 the authors acknowledge that the phrase “lost revenues” is complex and that not all downloaded content would have been purchased. However, they then seem to backtrack on this by saying (rightsholder-provided figures again!):

Nevertheless, industries such as music and film do frequently publish estimated lost revenues, or ‘value gaps’. The BPI recently claimed that between 2008 and 2012 the music industry was looking at a ‘value gap’ of £1.2 billion. (Music Ally, 2008)

Furthermore, that claim that things are “complex” worries me, as things are, in fact, pretty simple: lost revenues mean lost revenues, i.e. the revenues the industry would have got if no unauthorised downloading had occurred. This will clearly be much, much lower than a figure based on assuming every unauthorised download is a lost sale.

Furthermore, looking at revenues in a single industry is dangerous here: we’ve got to look at the overall impact on the economy (and that’s still ignoring the welfare/income distinction). For example, if someone makes an unauthorised download rather than buying a CD they spend the money they would have spent on the CD on something else, be that a haircut, a meal, or going to a concert. If we want to count that as a loss to the music industry we need to count the gain it generates elsewhere.

Good evidence doesn’t get any thicker on the ground later on either, as far as I can tell. For example, in the first key finding section (entitled “The scale of the ‘problem’ is huge and growing”):

  • The only empirical study they cite on the impact of filesharing is that of Zentner, with no mention of other major studies such as that of Oberholzer and Strumpf.
  • The only figure on the film industry they quote is a claim of a $6 billion annual loss put forward by the UK film industry in interview, and “some research (Henning-Thurau et al., 2007) [which] appears to demonstrate evidence that consumers’ intention to pirate movies ‘cause them to forego theatre visits and legal DVD rentals and/or purchases’”. Looking up that citation one finds (it seems there was a typo in the date!): Henning-Thurau, T., Gwinner, K., Walsh, G. and Gremler, D. (2004) “Electronic Word of Mouth via Consumer-Opinion Platforms: What Motivates Consumers to Articulate Themselves on the Internet?”, Journal of Interactive Marketing, 18(1), pp. 38-52. While I haven’t actually read this article, the title (and journal) don’t suggest it is the most reliable source as to the actual effect of unauthorised downloads on film industry income.

To sum up: it turns out the BBC’s line that illegal downloads are “costing the economy tens of billions of pounds” is based on nothing more than the usual, and completely unreliable, rightsholders claims, recycled via CIBER’s report. This is a worrying example of how industry PR, via repetition in other, more “respected” and supposedly independent sources, can gain legitimacy.