Warning: this section is partly deprecated.
git clone <patcit-repository-url> # placeholder: use the patcit repository URL
cd patcit/
git submodule add https://github.com/kermitt2/grobid.git # If not done yet
pipenv install --dev # or `pip install -r requirements.txt` if you don't use pipenv
pipenv install -e . # or `pip install -e .` if you don't use pipenv
PatStat data are provided as large .zip chunks. We want:

- small chunks (for easier parallel processing)
- in .gz format (for easy streaming through `smart_open`)
cd PscUniverse/
sh prepare_tls214.sh data/path # NB: no trailing "/"
Take a 💤 ... or a ☕; this should take 15-20 min.
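Once the chunks are prepared, each .gz file can be streamed line by line without decompressing it to disk. A minimal sketch using `smart_open` (the chunk filename below is illustrative, not the actual output name):

```python
from smart_open import open  # smart_open's open handles .gz transparently

# Stream one prepared chunk line by line; the path is illustrative.
with open("data/path/small_chunks/tls214_chunk_0.gz", "r") as fin:
    for i, line in enumerate(fin):
        print(line.rstrip())
        if i == 4:  # preview the first 5 lines only
            break
```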
- Start Grobid server
cd grobid/
./gradlew run
NB: don't forget to fill in grobid-home/config/grobid.properties so that your requests to the CrossRef API can be properly identified. See here for more details.
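Before sending work to the server, you can verify it came up by hitting Grobid's liveness endpoint. A minimal sketch, assuming the default port 8070:

```python
import requests

# Grobid exposes a simple liveness endpoint; it returns "true" when the server is up.
resp = requests.get("http://localhost:8070/api/isalive")
resp.raise_for_status()
print("Grobid is alive:", resp.text)
```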
- Start processing data
python3 process-citations.py data/path
Although we use multithreading, don't expect to process very large amounts of data with this method.
Bottlenecks:
- Grobid supports 10 concurrent engines
- the CrossRef API supports 30 requests per second

All in all, you can reasonably expect to process ~4 citations per second, i.e. 100,000 in roughly 7 hours.
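For intuition, here is a minimal sketch of how such a bounded pipeline might look: a thread pool capped at Grobid's 10 concurrent engines, with crude submission pacing to stay under the CrossRef rate limit. `parse_citation` is a hypothetical stand-in for the real per-citation call, not the actual implementation in process-citations.py:

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10       # Grobid supports 10 concurrent engines
MIN_INTERVAL = 1 / 30  # stay under ~30 CrossRef requests per second

def parse_citation(raw):
    # Hypothetical stand-in for the real Grobid/CrossRef round trip.
    time.sleep(0.25)   # ~4 citations/second overall in the real pipeline
    return {"raw": raw, "parsed": None}

def process_all(citations):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = []
        for raw in citations:
            futures.append(pool.submit(parse_citation, raw))
            time.sleep(MIN_INTERVAL)  # crude pacing for the rate limit
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(process_all(["Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"]))
```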
- AWS set-up
- Start an EC2 instance
- Update the Elasticsearch access policy with the EC2 instance's IPv4 address
- Start biblio-glutton
cd biblio-glutton/lookup/
./gradlew clean build
java -jar build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar server data/config/config.yml
curl localhost:8080/service/data # Check that the service is running properly
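Once the service responds, you can query it for bibliographical matches from Python. A minimal sketch, assuming the default port 8080 and biblio-glutton's `biblio` query parameter on the lookup endpoint:

```python
import requests

# Ask biblio-glutton for a match on a raw bibliographical string.
resp = requests.get(
    "http://localhost:8080/service/lookup",
    params={"biblio": "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"},
)
print(resp.status_code)
print(resp.text[:500])  # JSON record of the best match, if any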
- Start Grobid
cd SciCit/grobid/
./gradlew run
curl -X POST -d "citations=Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation
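The same call can be scripted from Python; a minimal sketch using `requests` (the response is TEI XML):

```python
import requests

# POST a raw citation string to Grobid's processCitation endpoint.
resp = requests.post(
    "http://localhost:8070/api/processCitation",
    data={"citations": "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"},
)
print(resp.text)  # TEI XML <biblStruct> describing the parsed citation
```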
- Start Processing
cd SciCit/
# pipenv install --dev
pipenv shell
python bin/ProcessCitations.py ~/data/small_chunks/