Principal Investigator: Matt Marx, Questrom Strategy & Innovation
The goal of this work is to build a publicly available set of citations from U.S. patents (1947-2018) to scientific articles (1800-2018). From our work so far on the front-page material of patents, we have established approximately 15 million very high confidence patent citations to science (PCS). PCS are extremely difficult to work with because the citations to scientific articles in U.S. patents are given as unstructured text, and there are 36 million of them. On the other side of the comparison, we use the 166 million articles in the publicly available Microsoft Academic Graph (MAG). The MAG data is fully structured, including title, author(s), year of publication, journal of publication, volume number, issue number, and page range.
A full comparison of the US Patent Office data to the MAG data would take 5.9 x 10^15 comparisons (36 million citations against 166 million articles) and is therefore computationally impractical. Dr. Marx came to RCS hoping that our staff could help solve this problem. Aaron Fuegi suggested several ways to approach it and began a close collaboration with Dr. Marx. The original plan was to match on year, volume number, issue number, and first page. The first step was to look for four-digit numbers in the range 1800-2019 and treat them as years, so that a given article only needs to be compared against those patent citations that include the appropriate year. If a citation mentions two or more such numbers, it is considered for all of those years, which is acceptable. Then, using each year’s data, we essentially created a hash in the filesystem based on all of the numbers that appear in a given citation line. This creates a large amount of redundancy but gives a huge benefit: if a paper starts on page 77, we can go directly to the file associated with that number and find all patent citations that mention the number 77, and only those citations. We later extended this approach to also break each patent citation line into words, so that we could search for significant keywords, generally either of the two longest words in the paper title, in a computationally feasible way.
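The year and number/word hashing described above can be sketched as follows. This is a minimal illustration, not the project's code: it keeps the index in an in-memory dictionary rather than in files on the filesystem, and the function names, the tokenization rule, and the sample citations are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Size of the naive all-against-all comparison: about 5.98e15.
FULL_COMPARISONS = 36_000_000 * 166_000_000

# Four-digit numbers in the range 1800-2019 are treated as years.
YEAR_RE = re.compile(r"\b(1[89]\d\d|20[01]\d)\b")
# Assumed tokenization: runs of letters or runs of digits.
TOKEN_RE = re.compile(r"[a-z]+|\d+")


def candidate_years(citation):
    """Return every number in the citation that could be a year."""
    return {int(y) for y in YEAR_RE.findall(citation)}


def build_index(citations):
    """Map (year, token) to the set of citation ids containing both.

    A citation that mentions several possible years is filed under each
    of them, which is redundant but keeps every lookup a direct access.
    """
    index = defaultdict(set)
    for cid, text in enumerate(citations):
        tokens = set(TOKEN_RE.findall(text.lower()))
        for year in candidate_years(text):
            for tok in tokens:
                index[(year, tok)].add(cid)
    return index


# Made-up citation strings, for illustration only.
citations = [
    "Smith et al., Nature, vol. 12, pp. 77-81 (1998)",
    "Jones, J. Appl. Phys. 45, 1203 (1998); reprinted 2001",
    "Lee, Science 280, 77 (2003)",
]
index = build_index(citations)

# An article published in 1998 that starts on page 77 is compared only
# against the citations filed under (1998, "77").
hits_1998_p77 = index.get((1998, "77"), set())
```

With this layout, looking up an article costs one dictionary access per year and token instead of a scan over every citation.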
We currently match citations that contain the first author's surname and that also match either on the first page of the article (or, if the first page is missing, the volume) or on one of the two longest words in the article title. We also generally require the year to match: for the page-based approach it must match exactly, and for the word-based approach it must be within one year of the publication year in the structured data. Together, the two hashing steps, on year of publication and on the words and numbers in the reference, gave a speedup of at least a factor of 4000 (see figure above), and the problem became approachable.
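The candidate rule in the preceding paragraph can be written as a small predicate. The sketch below assumes a hypothetical article record with the keys surname, title, year, first_page, and volume. It also simplifies by requiring the surname to appear as a single token, which a hyphenated or multi-word surname would not satisfy.

```python
import re


def tokens(text):
    """Lowercased runs of letters or digits found in the text."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def years_in(text):
    """Four-digit numbers in the range 1800-2019, treated as years."""
    return {int(y) for y in re.findall(r"\b(?:1[89]\d\d|20[01]\d)\b", text)}


def two_longest_words(title):
    """The two longest distinct words of the title (ties broken alphabetically)."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted(words, key=lambda w: (-len(w), w))[:2]


def is_candidate(article, citation):
    toks = tokens(citation)
    yrs = years_in(citation)

    # The first author's surname must always be present.
    if article["surname"].lower() not in toks:
        return False

    # Page-based rule: first page (or volume if the page is missing)
    # appears in the citation and the year matches exactly.
    number = article.get("first_page")
    if number is None:
        number = article.get("volume")
    page_hit = (
        number is not None
        and str(number) in toks
        and article["year"] in yrs
    )

    # Word-based rule: one of the two longest title words appears and
    # some year in the citation is within one of the publication year.
    word_hit = (
        any(w in toks for w in two_longest_words(article["title"]))
        and any(abs(article["year"] - y) <= 1 for y in yrs)
    )
    return page_hit or word_hit


# Made-up article record, for illustration only.
article = {
    "surname": "Smith",
    "title": "Quantum tunnelling in superconducting junctions",
    "year": 1998,
    "first_page": 77,
    "volume": 12,
}
```

In the approach described above, this check would run only on citations already retrieved through the year and number/word hashing, not on the full set.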
We now also allow patent citations with no year given at all, matching them against the full set of articles in MAG using only the words/numbers hashing. These comparisons are not included in the reduced number of comparisons in the figure above, since they do not use the full approach.
In combination, these approaches create a loose set of 1.5 billion potential PCS matches. Having so massively narrowed down the search, we then examine each potential match in a more thorough and computationally expensive way, using a number of heuristics based on title, author, volume, issue, first page, last page, and journal to judge whether it is really a match, and we attach a confidence score to that judgment. We end up with over 15 million matches with an estimated accuracy of 99.5%.
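One way to picture the scoring step is as a weighted sum of field-level evidence. The sketch below is only an illustration of that idea: the weights, the evidence definitions, and the article record are hypothetical and are not the heuristics actually used in this work.

```python
import re

# Hypothetical weights chosen only for illustration; they sum to 1.
WEIGHTS = {
    "title": 0.40,
    "author": 0.20,
    "journal": 0.10,
    "volume": 0.10,
    "first_page": 0.10,
    "last_page": 0.05,
    "issue": 0.05,
}


def tokens(text):
    """Lowercased runs of letters or digits found in the text."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def confidence(article, citation):
    """Return a score between 0 and 1 for a candidate match."""
    toks = tokens(citation)

    def has_number(key):
        value = article.get(key)
        return float(value is not None and str(value) in toks)

    title_words = tokens(article["title"])
    journal_words = tokens(article["journal"])
    evidence = {
        # Fraction of the title's words that appear in the citation.
        "title": len(title_words & toks) / len(title_words) if title_words else 0.0,
        "author": float(article["surname"].lower() in toks),
        # Every word of the journal name must appear.
        "journal": float(bool(journal_words) and journal_words <= toks),
        "volume": has_number("volume"),
        "first_page": has_number("first_page"),
        "last_page": has_number("last_page"),
        "issue": has_number("issue"),
    }
    return sum(WEIGHTS[field] * evidence[field] for field in WEIGHTS)


# Made-up article record and citations, for illustration only.
article = {
    "surname": "Smith",
    "title": "Quantum tunnelling in superconducting junctions",
    "journal": "Nature",
    "volume": 12,
    "issue": 3,
    "first_page": 77,
    "last_page": 81,
}
strong = ("Smith et al., Quantum tunnelling in superconducting junctions, "
          "Nature, vol. 12, no. 3, pp. 77-81 (1998)")
weak = "Smith, Science 280, 77 (2003)"
```

A threshold on such a score would then separate accepted matches from rejected candidates. Here the strong citation agrees on every field, while the weak one agrees only on the surname and first page.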
This work also made extensive use of the BU Shared Computing Cluster (SCC), as the work was highly parallelizable and codes that would have taken many days to run serially could be run in less than an hour on the SCC. The entire process takes around a day to run on the SCC.
The current results of this work are available, including our paper and citations from U.S. patents 1926-2018 to articles in the Microsoft Academic Graph from 1800-2018.