This repository contains a (partial) implementation of an entity linker.
In short, the entity linking consists of a few key components:
- linking/pipeline.py contains the Pipeline class, which ties the classes from the other scripts together and passes the input (WARC files) through them (a sketch of how these classes fit together follows this list).
- linking/main.py runs the entity linker from the command line (basically what run.sh runs).
- linking/part_of_pipeline/clean.py contains the Clean class, which cleans the HTML text.
- linking/part_of_pipeline/extract.py contains the Extract class, which extracts the entities from the "cleaned" text.
- linking/part_of_pipeline/search.py contains the Search class, which searches for the wiki links of the entities using Elasticsearch.
- linking/part_of_pipeline/decision.py contains the Decision class, which is used for disambiguation (a dummy class, since there is no actual implementation). Each script can be adjusted as long as its input stays the same (i.e. text, list, dict, etc.).
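The exact wiring lives in linking/pipeline.py; the following is only a minimal sketch of how the classes could be composed. The constructor parameters and the process/run method names are illustrative assumptions, not the actual API of the repository:

```python
# Sketch of the pipeline composition (illustrative names, not the actual API).
from part_of_pipeline.clean import Clean
from part_of_pipeline.extract import Extract
from part_of_pipeline.search import Search
from part_of_pipeline.decision import Decision

class Pipeline:
    def __init__(self, clean_text=1, extract_model="en_core_web_sm", query_size=20):
        self.clean = Clean(method=clean_text)          # html2text or BeautifulSoup
        self.extract = Extract(model=extract_model)    # spaCy NER model
        self.search = Search(query_size=query_size)    # Elasticsearch candidate lookup
        self.decision = Decision()                     # disambiguation (dummy)

    def run(self, warc_record):
        text = self.clean.process(warc_record)      # HTML -> plain text
        entities = self.extract.process(text)       # plain text -> entity mentions
        candidates = self.search.process(entities)  # mentions -> candidate wiki links
        return self.decision.process(candidates)    # pick one link per mention
```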
The entire pipeline is called from main.py. The file can be run with several different arguments (all of which have a default value, so the file can also be run without setting any arguments), namely:
| Argument | Default value | Description |
|----------|---------------|-------------|
| data_dir | /app/assignment/data/sample.warc.gz | Path to the xxxx.warc.gz file used for the entity linking. |
| clean_text | 1 | The cleaning procedure to use: 1 = html2text (default), 2 = BeautifulSoup (html.parser). |
| extract_model | en_core_web_sm | The model used for entity extraction: en_core_web_sm (default) or en_core_web_lg (not included, but can be downloaded after the build). |
| query_size_ES | 20 | The maximum number of hits a query can return. |
| search_ES | normal | Please do not alter this; the 'fast' implementation is very buggy. |
| batch_size_NER | 8 | The NER model parses n samples in parallel. We found 8 to be the best value (on 8 threads, Intel i7, 16 GB RAM). |
| n_threads | 8 | The number of threads to use. Please be careful: do not set this to -1, that will go wrong(!) |
| sim_cutoff_NER | 0.35 | To reduce the number of queries (= entities), we compute similarity/cross-referencing scores (no time to implement that in parallel) and use a threshold (the last one is kept). |
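For reference, the following argparse sketch mirrors the table above; the flag names, types, and help texts are assumptions derived from the argument names and defaults, not copied from the actual main.py:

```python
# Sketch of command-line parsing that mirrors the argument table above
# (argparse usage and flag spelling are assumptions, not the actual main.py).
import argparse

parser = argparse.ArgumentParser(description="Run the entity-linking pipeline")
parser.add_argument("--data_dir", default="/app/assignment/data/sample.warc.gz",
                    help="Path to the .warc.gz file to link")
parser.add_argument("--clean_text", type=int, default=1,
                    help="1 = html2text, 2 = BeautifulSoup html.parser")
parser.add_argument("--extract_model", default="en_core_web_sm",
                    help="spaCy model used for entity extraction")
parser.add_argument("--query_size_ES", type=int, default=20,
                    help="Maximum number of hits per Elasticsearch query")
parser.add_argument("--search_ES", default="normal",
                    help="Keep at 'normal'; the 'fast' implementation is buggy")
parser.add_argument("--batch_size_NER", type=int, default=8,
                    help="Number of samples the NER model parses in parallel")
parser.add_argument("--n_threads", type=int, default=8,
                    help="Number of threads to use (do not set to -1)")
parser.add_argument("--sim_cutoff_NER", type=float, default=0.35,
                    help="Similarity threshold used to reduce the number of queries")
args = parser.parse_args()
```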
An extra documentation/rationale file (WDPS - Assignment 1 - Entity Linking - Group 50) is included for clarification.
We assume that Elasticsearch and Trident are installed locally (we built from the given Docker image and installed Elasticsearch and Trident locally).
The following installation steps are also stated in the Dockerfile.dockerfile:
- Build the image from the Dockerfile (in VS Code with the Docker extension: right-click the file and choose the lowest option, "Build Image"). Make sure that the parent folder is called "webproc" (that is what we used during development). This step installs the needed packages.
- Next, run the image you just built and attach a volume to it, for example:
  docker run -ti -v C:/Users/Jensv/git/webproc:/app/assignment -p 9200:9200 webproc
- Next, run the following command in the Linux shell (inside the container) to start the Elasticsearch server:
  ./assets/elasticsearch-7.9.2/bin/elasticsearch -d
  Before going on, make sure the server is actually available by checking http://localhost:9200/.
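To verify availability from code rather than the browser, a minimal check could look like this (a sketch assuming the requests package is available in the container; the snippet itself is not part of the repository):

```python
# Quick availability check for the local Elasticsearch server (illustrative sketch;
# assumes the `requests` package is installed).
import requests

resp = requests.get("http://localhost:9200/")
resp.raise_for_status()                 # fails loudly if the server is not up yet
print(resp.json().get("cluster_name"))  # prints the cluster name once ES is reachable
```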
- Finally, to run the main script with the default settings (as described above), run the following command:
  python3 linking/main.py
  Use the arguments to change the input or output file(name)s. See main.py or the table above for the arguments and their values.
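For example, assuming argparse-style flags as in the sketch above (the exact flag spelling may differ in main.py), a run with fewer threads and a smaller query size could look like:

  python3 linking/main.py --n_threads 4 --query_size_ES 10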