Add secrets #3

Open · wants to merge 3 commits into master
9 changes: 9 additions & 0 deletions README.md
@@ -3,3 +3,12 @@ A python script to detect duplicate documents in Elasticsearch. Once duplicates

 For a full description on how this script works including an analysis of the memory requirements, see: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
 
+The following file is expected to exist in the directory from which this module is executed:
+
+* secrets.py
+  - A collection of Elasticsearch variables and authentication credentials, with the expected definitions:
+    - ES_HOST = "URL_WITHOUT_SCHEMA"
+    - ES_USER = "elastic"
+    - ES_PASSWORD = "elastic"
+    - ES_PORT = "9200"
+    - ES_INDEX = "source-YYYY."
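
For anyone trying this branch locally, a filled-in secrets.py might look like the sketch below; the host, password, and index values are hypothetical placeholders, not values from this PR:

```python
# secrets.py -- example values only; keep this file out of version control
ES_HOST = "my-cluster.example.com"  # host only, without the http:// or https:// scheme
ES_USER = "elastic"
ES_PASSWORD = "changeme"
ES_PORT = "9200"
ES_INDEX = "source-2018.07"  # index (or index pattern) to scan for duplicates
```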
15 changes: 9 additions & 6 deletions deduplicate-elaticsearch.py → deduplicate-elasticsearch.py
@@ -5,12 +5,15 @@
 
 import hashlib
 from elasticsearch import Elasticsearch, helpers
+import secrets
 
-ES_HOST = 'localhost:9200'
-ES_USER = 'elastic'
-ES_PASSWORD = 'elastic'
+ES_HOST = secrets.ES_HOST
+ES_USER = secrets.ES_USER
+ES_PASSWORD = secrets.ES_PASSWORD
+ES_PORT = secrets.ES_PORT
+ES_INDEX = secrets.ES_INDEX
 
-es = Elasticsearch([ES_HOST], http_auth=(ES_USER, ES_PASSWORD))
+es = Elasticsearch([{'host': ES_HOST, 'port': ES_PORT, 'use_ssl': True}], http_auth=(ES_USER, ES_PASSWORD))
 dict_of_duplicate_docs = {}
 
 # The following line defines the fields that will be
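
As a quick sanity check on the new connection settings, a sketch like the following can confirm that the secrets.py values actually reach the cluster before a full deduplication run (same client construction as in this diff; the int() cast is an added precaution since the template stores ES_PORT as a string):

```python
# check_connection.py -- reviewer's sketch, not part of this PR
import secrets
from elasticsearch import Elasticsearch

es = Elasticsearch(
    [{'host': secrets.ES_HOST, 'port': int(secrets.ES_PORT), 'use_ssl': True}],
    http_auth=(secrets.ES_USER, secrets.ES_PASSWORD),
)
print(es.info())  # prints cluster metadata on success; raises on auth/TLS failure
```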
@@ -41,7 +44,7 @@ def populate_dict_of_duplicate_docs(hit):
 # Loop over all documents in the index, and populate the
 # dict_of_duplicate_docs data structure.
 def scroll_over_all_docs():
-    for hit in helpers.scan(es, index='stocks'):
+    for hit in helpers.scan(es, index=ES_INDEX):
         populate_dict_of_duplicate_docs(hit)
 
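Because helpers.scan streams every matching document through the scroll API, moving the index name into secrets.py also allows a wildcard pattern to deduplicate several time-based indices in one pass; a hypothetical example:

```python
# Hypothetical: scan all of 2018's daily indices in a single pass
for hit in helpers.scan(es, index="source-2018.*"):
    populate_dict_of_duplicate_docs(hit)
```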

Expand All @@ -52,7 +55,7 @@ def loop_over_hashes_and_remove_duplicates():
if len(array_of_ids) > 1:
print("********** Duplicate docs hash=%s **********" % hashval)
# Get the documents that have mapped to the current hasval
matching_docs = es.mget(index="stocks", doc_type="doc", body={"ids": array_of_ids})
matching_docs = es.mget(index=ES_INDEX, doc_type="doc", body={"ids": array_of_ids})
for doc in matching_docs['docs']:
# In order to remove the possibility of hash collisions,
# write code here to check all fields in the docs to
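
The truncated comment above asks for a field-by-field check before trusting the hash; one possible shape for that check, sketched with a generic hashed_fields list (a name assumed for illustration, not taken from this script):

```python
# Hypothetical collision guard: only treat two hits as duplicates when every
# field that contributed to the hash actually matches.
def is_true_duplicate(doc_a, doc_b, hashed_fields):
    return all(doc_a['_source'].get(f) == doc_b['_source'].get(f)
               for f in hashed_fields)
```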
5 changes: 5 additions & 0 deletions secrets.template
@@ -0,0 +1,5 @@
+ES_HOST = "URL_WITHOUT_SCHEMA"
+ES_USER = "elastic"
+ES_PASSWORD = "elastic"
+ES_PORT = "9200"
+ES_INDEX = "source-YYYY."
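
Two small follow-ups worth considering (reviewer notes, not part of this diff): secrets.py should be listed in .gitignore so real credentials are never committed, and since Python 3.6+ ships a standard-library module also named secrets, the local file will shadow it whenever this script runs (harmless here, but worth knowing).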