# README

> **Warning:** this is partly deprecated

## Build Project

```bash
git clone
cd patcit/
git submodule add https://github.com/kermitt2/grobid.git  # if not done yet
pipenv install --dev  # or `pip install -r requirements.txt` if you don't use pipenv
pipenv install -e .  # or `pip install -e .` if you don't use pipenv
```

## Prepare data

PatStat data are provided in large `.zip` chunks. We want:

- small chunks (for easier parallel processing)
- in `.gz` format (for easy streaming through `smart_open`)

```bash
cd PscUniverse/
sh prepare_tls214.sh data/path  # Nb: no trailing "/"
```

Take a 💤 ... or a ☕; this should take 15-20 min.
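Once prepared, the small `.gz` chunks can be streamed line by line without unpacking them to disk. A minimal sketch using the standard-library `gzip` module (`smart_open` decompresses `.gz` transparently in the same way); the function name is illustrative, not part of this repo:

```python
import gzip

def iter_citations(chunk_path):
    """Yield one raw citation per line from a .gz chunk.

    gzip.open in text mode decompresses on the fly, so a chunk can be
    streamed without ever materializing the uncompressed file.
    """
    with gzip.open(chunk_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```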

## Process data

### Grobid + CrossRef API

1. Start the Grobid server:

```bash
cd grobid/
./gradlew run
```

Nb: don't forget to fill in `grobid-home/config/grobid.properties` so that your requests to the CrossRef API can be properly identified. See here for more details.

2. Start processing the data:

```bash
python3 process-citations.py data/path
```
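Under the hood, each raw citation is sent to Grobid's `processCitation` endpoint as a form-encoded POST with a `citations` field (the same call as the curl example further down). A hedged, stdlib-only sketch; `build_request` and the hard-coded default port are assumptions for illustration, not the actual `process-citations.py` code:

```python
from urllib import parse, request

# Grobid's default port; adjust if your server runs elsewhere.
GROBID_URL = "http://localhost:8070/api/processCitation"

def build_request(raw_citation):
    """Encode one raw citation string as the form body Grobid expects."""
    data = parse.urlencode({"citations": raw_citation}).encode("utf-8")
    return request.Request(GROBID_URL, data=data, method="POST")

# With a Grobid server running, the response body is TEI XML:
# resp = request.urlopen(build_request("Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"))
# tei = resp.read().decode("utf-8")
```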

## Disclaimer

Although we use multithreading, don't expect to process very large amounts of data with this method.

Bottlenecks:

- Grobid supports 10 concurrent engines
- CrossRef API supports 30 requests per second

All in all, you can reasonably expect to process ~4 citations per second, i.e. 100,000 in about 7 hours.

### Grobid + Biblio-Glutton (on AWS)

1. Set up AWS:
   - Start an EC2 instance
   - Update the ES policy strategy with the EC2 IPv4
2. Start biblio-glutton:

```bash
cd biblio-glutton/lookup/
./gradlew clean build
java -jar build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar server data/config/config.yml
curl localhost:8080/service/data  # check that the service is running properly
```

3. Start Grobid:

```bash
cd SciCit/grobid/
./gradlew run
curl -X POST -d "citations=Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation
```

4. Start processing:

```bash
cd SciCit/
# pipenv install --dev
pipenv shell
python bin/ProcessCitations.py ~/data/small_chunks/
```
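The processing step above can be sketched as a single fan-out: distribute the chunk files over a bounded thread pool, capped at Grobid's 10 concurrent engines, and let each worker handle one chunk. `process_all` is an illustrative helper under those assumptions, not the actual `bin/ProcessCitations.py`:

```python
from concurrent.futures import ThreadPoolExecutor

# Grobid serves 10 concurrent engines by default, so a larger pool
# would only queue requests, not speed things up.
MAX_WORKERS = 10

def process_all(chunk_paths, process_chunk, max_workers=MAX_WORKERS):
    """Map process_chunk over chunk files with a bounded thread pool.

    Results come back in input order, one result per chunk.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunk_paths))
```

`process_chunk` would wrap the per-chunk streaming and the per-citation Grobid call; keeping it injectable makes the fan-out easy to test in isolation.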