jvhgit/webproc


Content

This repository contains a (partial) implementation of an entity linker. The key components are:

  • linking/pipeline.py defines the Pipeline class, which combines the classes from the other scripts and passes the input (WARC files) through them.
  • linking/main.py runs the entity linker from the command line (this is essentially what run.sh runs).
  • linking/part_of_pipeline/clean.py defines the Clean class, which cleans the HTML text.
  • linking/part_of_pipeline/extract.py defines the Extract class, which extracts the entities from the "cleaned" text.
  • linking/part_of_pipeline/search.py defines the Search class, which searches for the wikilinks of the entities using Elasticsearch.
  • linking/part_of_pipeline/decision.py defines the Decision class, which is intended for disambiguation (a dummy class, since there is no actual implementation).

Each script can be adjusted as long as the type of its input stays the same (i.e. text, list, dict, etc.).
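The way the stages fit together can be sketched as follows. This is a minimal illustration, not the code in linking/pipeline.py: the class names match the list above, but the run() method, its signatures, and all stage bodies are hypothetical stand-ins (no html2text, NER model, or Elasticsearch is used here).

```python
from typing import Dict, List


class Clean:
    """Stand-in for the Clean stage: HTML in, plain text out."""

    def run(self, html: str) -> str:
        # Hypothetical placeholder for the real HTML cleaning step.
        return html.replace("<p>", " ").replace("</p>", " ").strip()


class Extract:
    """Stand-in for the Extract stage: text in, list of mentions out."""

    def run(self, text: str) -> List[str]:
        # Placeholder: treat capitalised tokens as entity mentions.
        return [tok.strip(".,") for tok in text.split() if tok[:1].isupper()]


class Search:
    """Stand-in for the Search stage: mentions in, candidates per mention out."""

    def run(self, entities: List[str]) -> Dict[str, List[str]]:
        # Placeholder: made-up candidate ids instead of an Elasticsearch query.
        return {e: [f"candidate:{e.lower()}"] for e in entities}


class Decision:
    """Stand-in for the Decision stage: candidates in, one link per mention out."""

    def run(self, candidates: Dict[str, List[str]]) -> Dict[str, str]:
        # Dummy disambiguation: keep the first candidate.
        return {e: c[0] for e, c in candidates.items() if c}


class Pipeline:
    """Chains the stages; each stage consumes the previous stage's output."""

    def __init__(self, clean, extract, search, decision):
        self.stages = [clean, extract, search, decision]

    def run(self, html: str):
        data = html
        for stage in self.stages:
            data = stage.run(data)
        return data


pipeline = Pipeline(Clean(), Extract(), Search(), Decision())
result = pipeline.run("<p>Amsterdam is in the Netherlands.</p>")
```

Because each stage only depends on the type it receives (text, list, dict), any one of them can be swapped for another implementation without touching the rest.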

The entire pipeline is called from main.py. The file can be run with several arguments, all of which have a default value, so it can also be run without setting any of them:

Argument          Default value                          Description
data_dir          /app/assignment/data/sample.warc.gz    Path to the xxxx.warc.gz file used as input for the entity linking.
clean_text        1                                      Cleaning procedure to use: 1 = html2text (default), 2 = BeautifulSoup with html.parser.
extract_model     en_core_web_sm                         Model used for extraction: en_core_web_sm (default) or en_core_web_lg (not included, but can be downloaded after the build).
query_size_ES     20                                     Maximum number of hits a query can return.
search_ES         normal                                 Please do not change this; the 'fast' implementation is very buggy.
batch_size_NER    8                                      Number of samples the NER model parses in parallel. We found 8 to be the best value (on 8 threads, Intel i7, 16 GB RAM).
n_threads         8                                      Number of threads to use. Be careful: do not set this to -1, it will go wrong(!)
sim_cutoff_NER    0.35                                   Threshold on the similarity (cross-referencing) scores computed between entities, used to reduce the number of queries (one per entity); the last one is kept. The scores are not computed in parallel (no time to implement that).
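The arguments and defaults above could be declared with argparse roughly as follows. This is a sketch under assumptions: the argument names and defaults are taken from the table, but the `--name` flag style, the types, and the build_parser helper are hypothetical and may differ from what main.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the argument table (not copied from main.py)."""
    parser = argparse.ArgumentParser(description="Entity linker (sketch)")
    parser.add_argument("--data_dir", default="/app/assignment/data/sample.warc.gz",
                        help="path to the .warc.gz input file")
    parser.add_argument("--clean_text", type=int, choices=[1, 2], default=1,
                        help="1: html2text, 2: BeautifulSoup html.parser")
    parser.add_argument("--extract_model", default="en_core_web_sm",
                        help="model used for extraction")
    parser.add_argument("--query_size_ES", type=int, default=20,
                        help="maximum number of hits per query")
    parser.add_argument("--search_ES", default="normal",
                        help="search implementation; leave at 'normal'")
    parser.add_argument("--batch_size_NER", type=int, default=8,
                        help="samples parsed in parallel by the NER model")
    parser.add_argument("--n_threads", type=int, default=8,
                        help="number of threads; do not use -1")
    parser.add_argument("--sim_cutoff_NER", type=float, default=0.35,
                        help="similarity threshold for reducing queries")
    return parser


# An empty argument list yields the defaults from the table.
defaults = build_parser().parse_args([])

# Overriding a single value leaves the other defaults untouched.
custom = build_parser().parse_args(["--query_size_ES", "10"])
```

If main.py follows the same flag style (an assumption), an override would be passed as `python3 linking/main.py --query_size_ES 10`; check main.py for the exact form.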

An additional documentation/rationale file is included for clarification (WDPS - Assignment 1 - Entity Linking - Group 50).

Installation

We assume that Elasticsearch and Trident are installed locally (we built from the given Docker image and installed Elasticsearch and Trident locally).
The following installation steps are also stated in Dockerfile.dockerfile:

  1. Build the image from the Dockerfile (in VSCode with the Docker extension: right-click the file and choose the lowest option, "Build Image").
    Make sure that the parent folder is called "webproc" (that is what we used during development). This will install the required packages.

  2. Next, run the image you just built and mount a volume into it, for example:
    docker run -ti -v C:/Users/Jensv/git/webproc:/app/assignment -p 9200:9200 webproc

  3. Next, run the following command in the Linux shell (inside the container) to start the Elasticsearch server:
    ./assets/elasticsearch-7.9.2/bin/elasticsearch -d
    Before continuing, make sure the server is actually available by checking http://localhost:9200/ .

  4. Finally, to run the main script with the default settings (as described above), run the following command:
    python3 linking/main.py
    Use the arguments to change the input or output file (names). See main.py or the table above for the arguments and their values.
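The availability check in step 3 can also be scripted, since Elasticsearch takes a while to come up when started with -d. Below is a minimal sketch using only the standard library; the helper names is_up and wait_for_server are hypothetical, and it assumes the server answers plain HTTP on port 9200.

```python
import time
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(url: str, attempts: int = 30, delay: float = 1.0, probe=is_up) -> bool:
    """Poll the server until it responds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_server("http://localhost:9200/") inside the container before starting linking/main.py returns True once the server responds, or False after roughly 30 seconds without a response.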

About

For the course Web Data Processing Systems
