Visualizing Technology Flows from Patent Data
One of the things I’ve been playing around with over the last few months is the NBER Patent dataset. This provides a listing of all US patents from 1963-1999 together with a full set of citations for patents in the period 1975-1999.
As it is an open dataset you’re able to get it and use it without seeking special permission, filling in forms or paying any money (so an especially big thank-you to Hall, Jaffe and Trajtenberg, who created it).
First step was loading the data into a PostgreSQL database using some Python and SQLAlchemy. This was not as trivial as it should have been, as some data cleaning was needed (a few duplicate patents and citations, citations with missing patents etc.), but once it was done I had a nice multi-GB db with approx 2 million patents and 16 million citations in it (on my machine, the full load, once properly coded, took around an hour).
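To give a flavour of the cleaning step, here is a minimal sketch in plain Python. It is not the loading code I actually used (that went through SQLAlchemy): the function name `clean` and the tuple layouts are hypothetical, and it assumes that a duplicate is an exact repeat of a patent id or of a (citing, cited) pair and that a citation is dropped when either end is missing from the patent table.

```python
def clean(patent_rows, citation_rows):
    """Deduplicate patents and citations and drop dangling citations.

    patent_rows: iterable of (patent_id, year, subcat) tuples (assumed layout).
    citation_rows: iterable of (citing_id, cited_id) tuples (assumed layout).
    """
    patents = {}
    for patent_id, year, subcat in patent_rows:
        # Keep the first record seen for a duplicated patent id.
        patents.setdefault(patent_id, (year, subcat))

    seen = set()
    citations = []
    for citing, cited in citation_rows:
        pair = (citing, cited)
        if pair in seen:
            continue  # duplicate citation
        seen.add(pair)
        # Drop citations where either patent is missing from the patent table.
        if citing in patents and cited in patents:
            citations.append(pair)
    return patents, citations
```

On the real data the same checks are probably better expressed as unique and foreign-key constraints in the database, but the logic is the same.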
Next step was the analysis. I’ve done a variety of things, but what I describe here are the efforts to visualize ‘knowledge flows’ between different technology areas. In essence each patent is given a classificatory technological ‘class’ (there is some ambiguity about exactly how a class relates to the actual technology in the patent but we will ignore that here). In the NBER data there are over 400 of these.
This turned out to be rather too many to conveniently visualize so instead we use groupings of these classes termed ‘subcategories’ (or just subcats) in the NBER dataset. There are 36 of these subcats. We’re interested in calculating and visualizing ‘technology (or knowledge) flows’ between different technological subcategories.
Of course we don’t directly observe knowledge flows: all we observe are citations. Thus what we are calculating and showing here are citation flows, which one can imagine provide some approximation of the underlying technology/knowledge flows. Specifically, for a given year we go through all the patents in that year and look at each patent’s citations. For each such citation we add 1/N to the flow from category i to category j, where the citing patent is in category j, the cited patent is in category i and N is the total number of citations that the citing patent makes (the reversal of i and j is because a cite i -> j corresponds to a flow j -> i).
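The 1/N weighting can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the figures: the name `citation_flows` and the in-memory data layout are hypothetical, and N is taken to be the number of a patent's citations whose cited patent is present in the data.

```python
from collections import defaultdict

def citation_flows(patents, citations, year):
    """Return {(i, j): flow} for citing patents in the given year.

    patents: dict mapping patent_id -> (year, subcat) (assumed layout).
    citations: iterable of (citing_id, cited_id) pairs (assumed layout).
    """
    # Collect the cited patents for each citing patent in the chosen year.
    cited_by = defaultdict(list)
    for citing, cited in citations:
        if citing in patents and cited in patents and patents[citing][0] == year:
            cited_by[citing].append(cited)

    flows = defaultdict(float)
    for citing, cited_list in cited_by.items():
        n = len(cited_list)      # N: citations made by this patent
        j = patents[citing][1]   # category of the citing patent
        for cited in cited_list:
            i = patents[cited][1]  # category of the cited patent
            flows[(i, j)] += 1.0 / n  # flow runs from cited (i) to citing (j)
    return dict(flows)
```

Because each citing patent contributes exactly 1 in total, the sum of all flows in a year equals the number of patents in that year that make at least one citation.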
The result of this is a flow matrix which we can then plot using standard tools (networkx and graphviz to be precise). Here are some of the results:
The Diagrams
- Size of nodes indicates total citation flows from that area in that year
- The yellow portion of a node is the share of citation flows back into that subcategory, while the black portion is the share going into other subcategories (the comparison is by area).
- Direction of flow is indicated by an arrow head (a rectangular block) with size of flow measured by width of edge and size of head.
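As a rough illustration of how these encodings can be derived from a flow matrix, here is a sketch that computes each node's total outflow and own-subcategory share and writes a Graphviz DOT description by hand. It is not the networkx-based code used for the figures: the function names, the `scale` parameter and the square-root sizing (so that node area tracks total flow) are my assumptions, and subcategory names are assumed not to contain double quotes.

```python
import math
from collections import defaultdict

def node_stats(flows):
    """Per subcategory: (total outflow, share flowing back into itself)."""
    total = defaultdict(float)
    own = defaultdict(float)
    for (i, j), value in flows.items():
        total[i] += value
        if i == j:
            own[i] += value
    return {i: (total[i], own[i] / total[i]) for i in total if total[i] > 0}

def to_dot(flows, scale=1.0):
    """Render flows as DOT text: node size from outflow, edge width from flow."""
    lines = ["digraph flows {"]
    for name, (total, _self_share) in sorted(node_stats(flows).items()):
        # Width grows with the square root so that node area tracks the total.
        width = scale * math.sqrt(total)
        lines.append(f'  "{name}" [shape=circle, fixedsize=true, width={width:.2f}];')
    for (i, j), value in sorted(flows.items()):
        if i != j:  # self-flows are shown inside the node, not as edges
            lines.append(f'  "{i}" -> "{j}" [penwidth={scale * value:.2f}, arrowhead=box];')
    lines.append("}")
    return "\n".join(lines)
```

The own-subcategory share returned by `node_stats` is what the yellow portion of a node would represent; drawing it is left out of the sketch.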
All Citations Flows (1994 Patents) (click through for full-size ~ 1.5MB)
This figure is only shown at a very low resolution in order to keep the image size down. As is to be expected, most categories have some level of flow to most other categories, and so the image is very busy. In addition the resulting (automated) node layout does not do a great job of clustering items (it is based on the simple adjacency matrix, which ignores flow sizes, and what we have here is, approximately, the complete graph).
To address this a threshold approach was used whereby all flows below a threshold were discarded. This threshold was set equal to a percentage of the total flows of that category in that period, so it varied across categories (experimentation indicated that 5% was a good cut-off to use). Here are the results:
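A minimal sketch of that filter, assuming the flows are held as a dict keyed by (source, target) subcategory and that "total flows of that category" means the total outflow from the source subcategory (the function name `threshold_flows` is hypothetical):

```python
from collections import defaultdict

def threshold_flows(flows, fraction=0.05):
    """Keep only flows that are at least `fraction` of their source's total outflow."""
    totals = defaultdict(float)
    for (i, _j), value in flows.items():
        totals[i] += value
    return {
        (i, j): value
        for (i, j), value in flows.items()
        if value >= fraction * totals[i]
    }
```

Because the cut-off is relative to each category's own total, a small category keeps its most important links even though they would vanish under a single absolute threshold.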
Citation Flows (1994 Patents) Above Threshold (click through for full size ~ 0.6 MB)
As you can see we now have a lot more structure coming through in the layout as well as a much cleaner visualization. In particular, two distinct groupings emerge: a ‘chemical’ one in the centre of the picture focused on the ‘Miscellaneous Chemical’ category and a second ‘Computing/Electronics’ in the top right focused on the ‘Computer Hardware and Software’ category.
One can also see natural bridging groups, for example various (high-tech) mechanical and measuring categories in the middle-to-top left which connect to both the computer group and the chemical group.
The next step is to watch how these flows, and the relationships implied by them, have evolved over time. We can do this by plotting the same graph, say, every 3 years from 1975 up until the present. However, as this is already a fairly long post and the images are fairly large, this will be left for a follow-up article.
Sources: NBER data is available from the link given above. The code used to do the analysis and produce the images is not yet available online as it isn’t yet packaged up into a ‘nice’ form. However, if anyone wants it I’d be more than happy to share it under an open licence, so just let me know.