Context
Our goal was to identify key opinion leaders in a life sciences research area through the analysis of the academic literature, and to do it at scale.
The planned source was a copy of the MEDLINE bibliographic database, which would have meant processing more than 22 million references to journal articles. The current source is search results from Web of Science.
To process the data, we chose to experiment with three open source technologies:
- Apache Spark, a general engine for large-scale data processing,
- Apache Zeppelin, a web-based notebook with built-in Spark integration,
- Scala, a functional and object-oriented programming language.
So, this is experimental work: it is neither finished nor production ready.
Method
Data collection
The data are collected by crawling the search results of a query run against Web of Science and are persisted as JSON files. We will come back to this crawling project later.
Data processing
Once the data (HTML web pages) have been collected, we process them in eight steps:
- import web pages,
- parse web pages,
- build entities (publication, author, institution, journal),
- disambiguate entities,
- identify the city mentioned in an affiliation,
- identify the institution mentioned in an affiliation,
- compute indicators for the defined entities,
- export indicators and entities for further analysis.
This is a data pipeline approach. It can be replicated for other sources, like press releases.
Challenges
During this project, we encountered several challenges.
Debugging
One of the most painful challenges was debugging the distributed execution of the processing code. To tackle it, we implemented three practices:
- use a notebook approach to separate the execution of blocks of code,
- for each step, do sanity checks to detect unexpected cases,
- at the end of each step, persist the data to prevent processed data loss.
This tends to ensure that an unexpected case won't break the following steps. It also makes debugging easier, as errors are more likely to be detected near the code to investigate.
Sanity checks and data persistence are costly but, considering the significant investigation and processing time otherwise wasted finding where an error occurred, they are worth it.
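As a minimal sketch of the last two practices, assuming a Spark 2.x SparkSession (`spark`) like the one Zeppelin provides, and hypothetical DataFrame names and paths:

```scala
// Persist the output of a step so later steps can restart from it
// (hypothetical DataFrame `parsedPages` and checkpoint path).
parsedPages.write.mode("overwrite").parquet("/data/steps/02_parsed")
val parsed = spark.read.parquet("/data/steps/02_parsed")

// Sanity check: fail fast, near the code to investigate, if the step
// produced unexpected records.
val missingTitles = parsed.filter(parsed("title").isNull).count()
require(missingTitles == 0, s"$missingTitles parsed pages have no title")
```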
Parsing
The HTML of the web pages is parsed with the Java library jsoup.
Web of Science web pages are not well-structured HTML, and we don't know in advance which fields will or won't be present on a given page. So, it's difficult to create a simple and generic parsing method.
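A minimal defensive-parsing sketch with jsoup; the CSS selectors and field names below are hypothetical, not the actual Web of Science markup:

```scala
import org.jsoup.Jsoup
import scala.collection.JavaConverters._

// Fields may be missing, so everything is wrapped in Option / Seq.
case class RawRecord(title: Option[String], authors: Seq[String])

def parsePage(html: String): RawRecord = {
  val doc = Jsoup.parse(html)
  // Hypothetical selectors: headOption guards against absent fields.
  val title = doc.select("div.title").asScala.headOption.map(_.text.trim)
  val authors = doc.select("a.author").asScala.map(_.text.trim).toSeq
  RawRecord(title, authors)
}
```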
The session provided by Web of Science becomes invalid several times during the crawling, and the crawling can only be restarted at the results page level. So, pages that have already been crawled are crawled one more time and need to be deduplicated.
Entity building
The scraped fields are grouped together to build entities (publications, authors, institutions, journals) and some are parsed to extract structured data, like the publication year. Each entity also has identifiers to keep track of the relations between the entities.
The challenge here is parsing author full names, which come in many forms depending on how they were entered or on cultural factors. This is especially true for distinguishing middle names from compound first names.
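As an illustrative sketch (not the notebook's actual rules), a parser covering only the common "Last, First Middle" pattern:

```scala
// Hypothetical entity type: the split between first and middle names is
// a guess, since e.g. "Jean Pierre" may be a compound first name.
case class AuthorName(lastName: String, givenNames: Seq[String])

def parseFullName(raw: String): Option[AuthorName] =
  raw.split(",").map(_.trim).toList match {
    case last :: givenPart :: Nil =>
      // "Doe, John M." -> AuthorName("Doe", Seq("John", "M."))
      Some(AuthorName(last, givenPart.split("\\s+").toSeq))
    case _ => None // single-token or unexpected forms need dedicated rules
  }
```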
Entity disambiguation
There are two notebooks for the entity disambiguation step.
First version
The aim of this proof of concept is to disambiguate entities according to their properties and their relationships with other entities, and to do it with GraphX, Spark's API for graph-parallel computation.
The main idea of the algorithm is to build a graph of the entity mentions to disambiguate (here, authors), linked to each other through properties (like the last name and the first name initial) and through related entities (like publications), and to compute a similarity score based on these relationships. Basically, two author mentions with the same last name and the same first name initial are more likely to be the same author. Conversely, two author mentions that share a publication can't be the same author, since an author appears only once per publication. The algorithm is extensible: other properties or other related entities, like affiliations, can be used.
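A toy GraphX sketch of this idea, assuming the SparkContext `sc` that Zeppelin provides; the mention data, graph layout and scoring rule are illustrative, not the notebook's actual code:

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Toy author mentions: (vertexId, (lastName, firstInitial, publicationId)).
val mentions: RDD[(VertexId, (String, String, Long))] = sc.parallelize(Seq(
  (1L, ("Doe", "J", 101L)),
  (2L, ("Doe", "J", 102L)),
  (3L, ("Doe", "J", 101L))  // shares publication 101 with mention 1
))

// Candidate pairs: mentions sharing last name + first name initial.
val keyed = mentions.map { case (id, (last, init, pub)) => ((last, init), (id, pub)) }
val edges: RDD[Edge[Double]] = keyed.join(keyed)
  .filter { case (_, ((id1, _), (id2, _))) => id1 < id2 } // keep each pair once
  .map { case (_, ((id1, pub1), (id2, pub2))) =>
    // Same publication => two distinct author slots => not the same person.
    val score = if (pub1 == pub2) 0.0 else 1.0
    Edge(id1, id2, score)
  }

val graph = Graph(mentions, edges)
// Pairs with a positive score are candidate duplicates: here (1, 2) and (2, 3).
graph.edges.filter(_.attr > 0.0).collect.foreach(println)
```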
Second version
The aim of this other version is to handle the case of authors who share a first name initial but have different first names or different middle names.
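For instance, a hypothetical compatibility test between given names, where an initial matches a full name but two different full names conflict:

```scala
// "J." is compatible with "John" (initial), but "John" and "James" are
// not, even though they share the initial "J".
def compatible(a: String, b: String): Boolean = {
  val (x, y) = (a.stripSuffix(".").toLowerCase, b.stripSuffix(".").toLowerCase)
  x.startsWith(y) || y.startsWith(x)
}

compatible("John", "J.")    // true
compatible("John", "James") // false
```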
Entity extraction and Entity linking
City
The knowledge base about the geographical names (cities, states, countries, continents) is GeoNames.
We created a rule-based algorithm in Scala which performs entity extraction and entity linking at the same time on affiliation strings. The algorithm uses Scala Option chaining with orElse: the most reliable rule is tried first, and each fallback rule only runs if the previous ones returned no result. The limits of the algorithm are the ambiguous and incomplete cases (the city or the country is missing).
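A minimal sketch of that orElse chaining, with a stand-in city set instead of a real GeoNames lookup:

```scala
// Stand-in for a lookup built from the GeoNames dumps.
val cities = Set("Paris", "Boston", "Geneva")

def findCity(affiliation: String): Option[String] = {
  val segments = affiliation.split(",").map(_.trim)
  // Rule 1: the segment just before the country is usually the city.
  segments.takeRight(2).headOption.filter(cities.contains)
    // Rule 2 (fallback): any segment matching a known city name.
    .orElse(segments.find(cities.contains))
}

findCity("Univ Paris, Paris, France") // Some("Paris")
findCity("Harvard Med Sch, Boston")   // Some("Boston"), via the fallback
```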
Institution
As institution names are almost always at the beginning of the affiliation string, we extract them from that position.
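In its simplest form (a sketch, not the actual rules), this amounts to taking the first comma-separated segment:

```scala
// Sketch: the first comma-separated segment as the institution name.
def extractInstitution(affiliation: String): Option[String] =
  affiliation.split(",").headOption.map(_.trim).filter(_.nonEmpty)

extractInstitution("Harvard Med Sch, Boston, MA 02115 USA") // Some("Harvard Med Sch")
```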
Due to a lack of time, there is no entity linking for the institutions. However, it was planned to link institutions to a knowledge base like DBpedia.
Web of Science does some entity linking. The result can be found in an "enhanced names" field for each affiliation. There can be multiple enhanced names for a single affiliation string, for example when multiple institutions are mentioned in the affiliation string or when a university system is added by Web of Science.
We can't reuse the associations between affiliation strings and enhanced names: some identical affiliation strings have different enhanced names, the rules behind Web of Science's entity linking don't seem to be trivial, and doing so would require taking into account other parameters, like the acquisition of one company by another.
It should be noted that the affiliation strings contain a lot of context-dependent abbreviations (contexts: research area, language, country, ...), which makes the institution identification harder.
Indicator computation
Basic indicators are computed:
- several counts,
- an impact score,
- the number of co-publications between two authors or two institutions.
We defined the impact of a publication as the number of times it has been cited divided by the number of years since its publication. If the publication year is the current year, we set the impact to zero (which also avoids a division by zero). Using this notion of impact, we define the impact of:
- an author as the sum of the impact of the publications authored by the author,
- an institution as the sum of the impact of the publications authored by the authors affiliated to the institution,
- a city, a country and a continent as the sum of the impact of the institutions which have been located in the location.
The institution and location indicators only consider publications whose publication year is the latest for the author.
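A minimal Scala sketch of this impact definition, with hypothetical types and names:

```scala
// Impact of a publication: citations divided by years since publication;
// zero for the current year, which also avoids a division by zero.
def impact(citations: Int, publicationYear: Int, currentYear: Int): Double =
  if (publicationYear == currentYear) 0.0
  else citations.toDouble / (currentYear - publicationYear)

// Impact of an author: sum over their publications, given here as
// (citations, publicationYear) pairs.
def authorImpact(publications: Seq[(Int, Int)], currentYear: Int): Double =
  publications.map { case (cites, year) => impact(cites, year, currentYear) }.sum

authorImpact(Seq((10, 2014), (4, 2015)), 2016) // 10/2 + 4/1 = 9.0
```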
Results
The data produced after the execution of all the steps are finally exported to CSV files. They can later be used for further analysis in other tools, like Tableau or R, or to create visualizations in collaboration with a designer.
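The export itself boils down to something like this sketch, assuming Spark 2.x's built-in CSV writer and a hypothetical `authorIndicators` DataFrame:

```scala
// Write one set of indicators to CSV, with a header row for the
// downstream tools (hypothetical DataFrame and path).
authorIndicators.write
  .mode("overwrite")
  .option("header", "true")
  .csv("/data/export/author_indicators")
```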
Next steps
Since information must be extracted, on the one hand, from poorly structured HTML pages and, on the other hand, from fields that were entered manually and further processed by Web of Science, there are many 'not seen yet' cases to handle and a critical need for efficient and accurate disambiguation algorithms. So, besides improving inline comments and harmonizing the coding style, the main next steps would be to detect and properly handle more 'unexpected' cases and to improve disambiguation, entity extraction and entity linking.
Conclusion
The methodological and technological foundations are here. One can reproduce or extend the analysis on a Web of Science corpus or on another bibliographic database.
The combination of Apache Spark, a distributed data processing framework with high-level interfaces (DataFrame, GraphX, ...), and Apache Zeppelin, a web-based notebook which lets one execute Spark code interactively, with direct access to the data and basic visualizations, demonstrates the usefulness, the flexibility and the scalability of a data pipeline built with these technologies. It delivers a playground for applying bibliometric methods to the academic literature at scale.