Welcome to Cancer Sleuth, where cutting-edge Artificial Intelligence meets medical science to tackle one of the most challenging diseases of our time: colon cancer. This project is a deep dive into the application of machine learning models to predict colon cancer presence from genetic information. Using a variety of algorithms, from Decision Trees to K-Nearest Neighbors, we navigate through complex data, uncover patterns, and strive for accurate predictions.
Colon cancer, a leading cause of cancer-related deaths worldwide, is notoriously difficult to detect in its early stages. Cancer Sleuth aims to change the narrative by leveraging the power of machine learning. Our mission? To provide researchers and healthcare professionals with tools that can predict colon cancer more efficiently and accurately.
This repository showcases the journey of applying different machine learning techniques to colon cancer datasets, exploring how each model performs and interacts with the genetic data at our disposal. Analyses like this are common, but the goal here is to let others compare the models side by side and take the best-performing one to test on other datasets.
The heart of our analysis lies in the datasets obtained from comprehensive genetic studies. These datasets contain genetic information from patients, annotated with whether colon cancer was detected.
To access the datasets used in this project, please visit the following links:
Please note: These links are provided for educational and research purposes. Ensure compliance with any usage restrictions.
In our quest, we delve into several models, each offering a unique perspective on the data:
- Decision Tree Classifier: A fundamental yet powerful model that makes decisions based on the data's attributes. The implementation includes parameter tuning using cross-validation to achieve the best possible results.
- K-Nearest Neighbors (KNN): This model predicts the classification of a sample based on the majority vote of its neighbors. Various distance metrics are evaluated to determine the most effective combination for the data.
- Support Vector Machines (SVM): A robust algorithm that finds the hyperplane separating the classes with the largest margin.
- Naive Bayes Classifier: A probabilistic model that applies Bayes' theorem under the assumption that the features are conditionally independent given the class.
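The four models above can be compared with a short scikit-learn script. This is a minimal sketch and not the repository's actual code: the data is a synthetic placeholder from `make_classification` (the sample and feature counts are arbitrary), and the parameter grids, the linear SVM kernel, the Gaussian variant of Naive Bayes, and the feature scaling step are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): swap in the real genetic feature matrix X
# and the cancer / no-cancer labels y.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=20, random_state=0
)

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Decision tree: tune depth and split criterion by cross-validation.
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None], "criterion": ["gini", "entropy"]},
    cv=cv,
)

# KNN: try several neighbor counts and distance metrics.
# Scaling first keeps high-variance features from dominating the distances.
knn_search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={
        "kneighborsclassifier__n_neighbors": [1, 3, 5, 7],
        "kneighborsclassifier__metric": ["euclidean", "manhattan", "chebyshev"],
    },
    cv=cv,
)

models = {
    "decision_tree": tree_search,
    "knn": knn_search,
    "svm": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "naive_bayes": GaussianNB(),
}

# Mean cross-validated accuracy per model. For the two grid searches this is
# nested cross-validation: tuning happens inside each outer training fold.
scores = {
    name: cross_val_score(model, X, y, cv=cv).mean()
    for name, model in models.items()
}

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best model:", best)
```

Wrapping the grid searches in `cross_val_score` keeps the tuning step from seeing the held-out fold, so the tuned and untuned models are scored on the same footing. On small genetic datasets the ranking can change with the random seed, so it is worth repeating the comparison with a few different splits.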
To embark on this exploration with us, you'll need Python and several libraries, including scikit-learn, PySpark, and others relevant to data processing and machine learning.
For detailed instructions on setting up your environment, running the scripts, and exploring the models, refer to the docs/ directory.
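Before running anything, a quick check that the main libraries are importable can save time. The snippet below is a hypothetical helper, not part of this repository; it assumes the usual import names (`sklearn` for scikit-learn, `pyspark` for PySpark) and only covers the two libraries named above.

```python
import importlib.util

# Import names are assumptions: scikit-learn installs as "sklearn",
# PySpark as "pyspark". Extend the tuple with whatever else the scripts need.
REQUIRED_MODULES = ("sklearn", "pyspark")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can usually be installed with pip; the docs/ directory remains the reference for the full setup.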
We're on a mission to make colon cancer detection more accessible and accurate. Your contributions and feedback are invaluable to this journey. Whether it's improving the models, refining the data processing, or providing insights, we welcome your input.
This project is open-source and available under the MIT License. See the LICENSE file for more details.
A heartfelt thank you to the researchers, patients, and communities who have made these datasets available for study. Together, we're making strides towards a future where colon cancer can be detected early and treated more effectively.