This project references to my master thesis work "Distributed computation of Approximate Nearest Neighbors" for my Master's Degree in Computer Science in external collaboration with Wikimedia Foundation. The work focuses on making the image upload service on Wikipedia more efficient, studying how to distribute the computation of Approximate Nearest Neighbors, developing ad hoc solutions, exploring existing possibilities and adapting them to perform distributed searches of similar images within the Wikimedia dataset.
This repository is intended to contain implementations of several libraries for performing similarity (cosine) searches in order to find the best one for my purpose. The various implementations are based on the previously exposed use case.
- KNN: this library is not approximated and is highly inefficient for my research use case, its implementation, with sklearn, has been provided only for comparison purposes with approximated libraries. Inside KNN you will also find a brute-force version that execute a linear comparison between a single vector and the entire dataset, without creating a data structure.
- Annoy: this library implements a version of ANN, written by Erik Bernhardsson on behalf of Spotify.
- Faiss: this library was created for efficient similarity searches and clustering of dense vectors by the facebook (now Meta) research team.
- SPTAG: this library is designed for large scale vector approximate nearest neighbor search, created by Microsoft as an alternative version to their previous library, called DiskANN. Unlike other technologies, this solution involves the use of RAM memory and the physical disk, efficiently managing which and how much information save on them.
- Distributed-Faiss: this library has already been covered, but not its distributed version. This algorithm, although very efficient, does not add much new to the proposals previously analyzed that runs on a single machine. What makes this library special, however, is that it can be distributed on multiple machines in a very simple way.
- Distributed-SPTAG: this library has already been covered, but not its distributed version. This algorithm, although very efficient, does not add much new to the proposals previously analyzed that runs on a single machine. What makes this library special, however, is that it can be distributed on multiple machines in a very simple way.
- Milvus: this library implements vector database that focuses on providing an efficient system and architecture for similarity searches and AI-powered applications. The need that Milvus tries to satisfy is to provide a complete system capable of operating on huge dynamic datasets, providing solutions that can also run on GPUs also in a distributed way. To do this Milvus was build on libraries that have already been explained as Faiss, making the necessary changes for its use case.
The project has been structured in a precise way to ensure easy organization of files and to make navigation within it simple.
- Algorithms: inside this folder you can find all the implementations
- for-each-library:
- config.py: this file contains the configurations necessary for the correct operation of the libraries
- init.py: this file contains the code for reading datasets, training data structures and creating the index
- search.py: this file contains the code for performing searches and returning the resulting items
- other files: chronometer.py, mocks.py
- for-each-library:
- Datasets: the datasets and data structures saved in memory will be inserted inside this folder
- Compressed: in this folder it is possible to insert the compressed datasets, as can be seen from the example inserted. you may also notice a file called links.txt containing all Wikimedia datasets
- Decompressed: in this folder it is possible to insert the decompressed datasets, from here they will be filtered and decomposed for a correct functioning of the libraries
- Images: inside this folder it is possible to find the images corresponding to the vectors extracted from the dataset
- Vectors: within this folder it is possible to find the vectors corresponding to the images extracted from the dataset
- formatDatasets.bash: this file, executable following this README.md, contains the code to decompose all the datasets from the Decompressed folder moving the vectors inside the Vectors/ folder and the corresponding images inside the Images/ folder. This file uses formatDatasets.py to convert .csv file into binary format to speed up all the project.
- Docs: documents obtained during the writing of the thesis
- Go to Compressed folder and execute
wget -i links.txt
to download the Wikimedia dataset (images, vectors) - Decompress archives into Decompressed, in Wikimedia dataset case
gunzip *.gz
- Go into Datasets folder and follow the README.md
First go into Algorithms folder and install general requirements with pip install -r requirements.txt
.
- KNN:
pip install scikit-learn
or don't install anything if you prefer to use the brute-force version - Annoy:
pip install annoy
- Faiss: follow the conda tutorial
- SPTAG: you can compile this library via source code, or use Docker. I personally recommend the latter solution, by running these commands:
set GIT_LFS_SKIP_SMUDGE=1 git clone --recurse-submodules https://github.com/microsoft/SPTAG cd SPTAG docker build -t sptag .
- Distributed-Faiss: clone the repository (if you want to use my PR for adding precision to search phase with
set_index_parameter
orset_index_parameters
methods clone this repository), enter the folder and then runpip install -e .
- Distributed-SPTAG: follow the instruction for SPTAG
- Milvus:
pip install pymilvus
and then milvus can be install with docker compose or Kubernetes (Milvus Standalone for testing) or with Helm + Kubernetes (Milvus Cluster for production)
Before running the libraries customize the various config.py files
python init.py
python search.py
Before running the libraries customize the various config.py files
- Distributed-Faiss:
- Customize the DISCOVERY_CONFIG.txt file.
- Execute on each server
python init_server.py
- Execute on the client
python init_client.py
- Execute on the client
python search.py
- Distributed-SPTAG:
- Go to each server, customize the config.py file (attention to START_METADATA_OFFSET) and run
python init.py
- Go to the servers hosting the previously created index. In the Release folder of the library, import and customize the service.ini file and run the command
./server -m socket -c service.ini
- Go to the machine you identify as the aggregator. In the Release folder of the library, import and customize the Aggregator.ini file and run the command
./aggregator
- Go to the client, customize the config.py file, and execute
python search.py
- Go to each server, customize the config.py file (attention to START_METADATA_OFFSET) and run
- Milvus:
python init.py
python search.py
Made by Christian Stingone