Warning: this section is partly deprecated.
git clone <patcit-repository-url> # placeholder: use the patcit repository URL
cd patcit/
git submodule add https://github.com/kermitt2/grobid.git # If not done yet
pipenv install --dev # or `pip install -r requirements.txt` if you don't use pipenv
pipenv install -e . # or `pip install -e .` if you don't use pipenv
PatStat data are provided as large .zip chunks. We want:

- small chunks (for easier parallel processing)
- in .gz format (for easy streaming through `smart_open`)
cd PscUniverse/
sh prepare_tls214.sh data/path # NB: no trailing "/"
Take a 💤 ... or a ☕; this should take 15-20 min.
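Once the chunks are prepared, each .gz file can be streamed line by line without decompressing it to disk. A minimal sketch using `smart_open` (the chunk filename below is illustrative, not the actual output name):

```python
from smart_open import open  # smart_open's open handles .gz transparently

# Stream one prepared chunk line by line; the path is illustrative.
with open("data/path/small_chunks/tls214_chunk_0.gz", "r") as fin:
    for i, line in enumerate(fin):
        print(line.rstrip())
        if i == 4:  # preview the first 5 lines only
            break
```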
- Start Grobid server
cd grobid/
./gradlew run
NB: don't forget to fill in grobid-home/config/grobid.properties so that your requests to the CrossRef API can be properly identified. See here for more details.
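Before sending work to the server, you can verify it came up by hitting Grobid's liveness endpoint. A minimal sketch, assuming the default port 8070:

```python
import requests

# Grobid exposes a simple liveness endpoint; it returns "true" when the server is up.
resp = requests.get("http://localhost:8070/api/isalive")
resp.raise_for_status()
print("Grobid is alive:", resp.text)
```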
- Start processing data
python3 process-citations.py data/path
Although we use multithreading, don't expect to process very large amounts of data with this method.
Bottlenecks:
- Grobid supports 10 concurrent engines
- the CrossRef API supports 30 requests per second

All in all, you can reasonably expect to process ~4 citations per second, i.e. 100,000 in roughly 7 hours.
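For intuition, here is a minimal sketch of how such a bounded pipeline might look: a thread pool capped at Grobid's 10 concurrent engines, with crude submission pacing to stay under the CrossRef rate limit. `parse_citation` is a hypothetical stand-in for the real per-citation call, not the actual implementation in process-citations.py:

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10       # Grobid supports 10 concurrent engines
MIN_INTERVAL = 1 / 30  # stay under ~30 CrossRef requests per second

def parse_citation(raw):
    # Hypothetical stand-in for the real Grobid/CrossRef round trip.
    time.sleep(0.25)   # ~4 citations/second overall in the real pipeline
    return {"raw": raw, "parsed": None}

def process_all(citations):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = []
        for raw in citations:
            futures.append(pool.submit(parse_citation, raw))
            time.sleep(MIN_INTERVAL)  # crude pacing for the rate limit
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(process_all(["Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"]))
```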
- AWS set-up
- Start an EC2 instance
- Update the Elasticsearch access policy with the EC2 instance's IPv4 address
- Start biblio-glutton
cd biblio-glutton/lookup/
./gradlew clean build
java -jar build/libs/lookup-service-1.0-SNAPSHOT-onejar.jar server data/config/config.yml
curl localhost:8080/service/data # Check that the service is running properly
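Once the service responds, you can query it for bibliographical matches from Python. A minimal sketch, assuming the default port 8080 and biblio-glutton's `biblio` query parameter on the lookup endpoint:

```python
import requests

# Ask biblio-glutton for a match on a raw bibliographical string.
resp = requests.get(
    "http://localhost:8080/service/lookup",
    params={"biblio": "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"},
)
print(resp.status_code)
print(resp.text[:500])  # JSON record of the best match, if any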
- Start Grobid
cd SciCit/grobid/
./gradlew run
curl -X POST -d "citations=Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation
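The same call can be scripted from Python; a minimal sketch using `requests` (the response is TEI XML):

```python
import requests

# POST a raw citation string to Grobid's processCitation endpoint.
resp = requests.post(
    "http://localhost:8070/api/processCitation",
    data={"citations": "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"},
)
print(resp.text)  # TEI XML <biblStruct> describing the parsed citation
```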
- Start Processing
cd SciCit/
# pipenv install --dev
pipenv shell
python bin/ProcessCitations.py ~/data/small_chunks/