Codebase Digest is a command-line tool written in Python that helps you analyze and understand your codebase. It provides a structured overview of your project's directory structure, file sizes, token counts, and even consolidates the content of all text-based files into a single output for easy analysis with Large Language Models (LLMs).
- Features
- Installation
- Usage
- Configuration
- Ignore Functionality
- LLM Prompts for Enhanced Analysis
- Contributing
- License
- Directory Tree Visualization: Generate a hierarchical view of your project structure
- Codebase Statistics: Calculate total files, directories, code size, and token counts
- File Content Consolidation: Combine all text-based files into a single output
- Flexible Ignore System: Support for custom patterns, defaults, and .gitignore files
- Multiple Output Formats: Choose between text, JSON, Markdown, XML, or HTML
- Colored Console Output: Visually appealing and informative summaries
- LLM Analysis Support: Comprehensive prompt library for in-depth codebase analysis
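To make the directory tree feature concrete, here is a minimal sketch of how a depth-limited hierarchical view could be produced with the standard library. The function name `render_tree` and its indentation style are illustrative assumptions, not Codebase Digest's actual implementation, and the tool's real output may look different.

```python
import os
from typing import List, Optional


def render_tree(root: str, max_depth: Optional[int] = None) -> str:
    """Return an indented, text-only view of the directory tree under root.

    Hypothetical helper for illustration only; not part of Codebase Digest.
    """
    lines: List[str] = [os.path.basename(os.path.abspath(root)) + "/"]

    def walk(path: str, depth: int) -> None:
        # Stop descending once the requested depth is exceeded.
        if max_depth is not None and depth > max_depth:
            return
        # Sort entries so the output order is stable across runs.
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            indent = "    " * depth
            if os.path.isdir(full):
                lines.append(f"{indent}{name}/")
                walk(full, depth + 1)
            else:
                lines.append(f"{indent}{name}")

    walk(root, 1)
    return "\n".join(lines)
```

Calling `render_tree("/path/to/my_project", max_depth=1)` would list only the direct children of the project root, which mirrors the idea behind the tool's maximum depth option.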
pip install codebase-digest
git clone https://github.com/kamilstanuch/codebase-digest.git
cd codebase-digest
pip install -r requirements.txt
Basic usage:
cdigest [path_to_directory] [options]
Examples:
- Analyze a project with default settings:
  cdigest /path/to/my_project
- Analyze with custom depth and output format:
  cdigest /path/to/my_project -d 3 -o markdown
- Ignore specific files and folders:
  cdigest /path/to/my_project --ignore "*.log" "temp_folder" "config.ini"
- Show file sizes and include git directory:
  cdigest /path/to/my_project --show-size --include-git
- Analyze and copy output to clipboard:
  cdigest /path/to/my_project --copy-to-clipboard
Option | Description |
---|---|
path_to_directory | Path to the directory you want to analyze |
-d, --max-depth | Maximum depth for directory traversal |
-o, --output-format | Output format (text, json, markdown, xml, or html). Default: text |
-f, --file | Output file name |
--show-size | Show file sizes in directory tree |
--show-ignored | Show ignored files and directories in tree |
--ignore | Patterns to ignore (e.g., '*.pyc' '.venv' 'node_modules') |
--keep-defaults | Keep default ignore patterns when using --ignore |
--no-content | Exclude file contents from the output |
--include-git | Include .git directory in the analysis |
--max-size | Maximum allowed text content size in KB (default: 10240 KB) |
--copy-to-clipboard | Copy the output to clipboard |
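If you want to drive the tool from a script, the options above can be assembled into an argument list. The sketch below is one possible way to do that; `build_cdigest_command` is a hypothetical helper, and it only uses flags listed in the table. It builds the command without running it.

```python
import shlex
from typing import List, Optional, Sequence


def build_cdigest_command(
    path: str,
    output_format: str = "text",
    max_depth: Optional[int] = None,
    ignore: Sequence[str] = (),
    output_file: Optional[str] = None,
    no_content: bool = False,
) -> List[str]:
    """Assemble a cdigest invocation as an argument list (hypothetical helper)."""
    # The path goes first so it is not consumed by the multi-value --ignore option.
    cmd = ["cdigest", path, "-o", output_format]
    if max_depth is not None:
        cmd += ["-d", str(max_depth)]
    if ignore:
        cmd += ["--ignore", *ignore]
    if output_file:
        cmd += ["-f", output_file]
    if no_content:
        cmd.append("--no-content")
    return cmd


if __name__ == "__main__":
    command = build_cdigest_command(
        "/path/to/my_project", "json", max_depth=3, ignore=["*.log"]
    )
    # Print a shell-safe rendering of the command for inspection.
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once `cdigest` is installed and on your PATH.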
The following patterns are ignored by default:
DEFAULT_IGNORE_PATTERNS = [
    '*.pyc', '*.pyo', '*.pyd', '__pycache__',  # Python
    'node_modules', 'bower_components',        # JavaScript
    '.git', '.svn', '.hg', '.gitignore',       # Version control
    'venv', '.venv', 'env',                    # Virtual environments
    '.idea', '.vscode',                        # IDEs
    '*.log', '*.bak', '*.swp', '*.tmp',        # Temporary and log files
    '.DS_Store',                               # macOS
    'Thumbs.db',                               # Windows
    'build', 'dist',                           # Build directories
    '*.egg-info',                              # Python egg info
    '*.so', '*.dylib', '*.dll'                 # Compiled libraries
]
You can specify additional patterns to ignore using the --ignore option. These patterns will be added to the default ignore patterns unless --no-default-ignores is used.
Patterns can use wildcards (* and ?) and can be:
- Filenames (e.g., 'file.txt')
- Directory names (e.g., 'node_modules')
- File extensions (e.g., '*.pyc')
- Paths (e.g., '/path/to/ignore')
Example:
cdigest /path/to/my_project --ignore "*.txt" "temp" "/path/to/specific/file.py"
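The pattern kinds listed above (filenames, directory names, extensions, and paths) can be illustrated with Python's `fnmatch` module. This is a minimal sketch under the assumption that a pattern is tested against both the whole relative path and each path component; `is_ignored` is a hypothetical name, and Codebase Digest's own matching rules may differ.

```python
import fnmatch
import os
from typing import Iterable


def is_ignored(rel_path: str, patterns: Iterable[str]) -> bool:
    """Return True if rel_path, or any single component of it, matches a pattern.

    Hypothetical helper for illustration; not the tool's actual matcher.
    """
    normalized = rel_path.replace(os.sep, "/")
    parts = normalized.split("/")
    for pattern in patterns:
        # Path-style patterns are compared against the whole relative path.
        if fnmatch.fnmatchcase(normalized, pattern):
            return True
        # File names, directory names, and extensions are compared per component.
        if any(fnmatch.fnmatchcase(part, pattern) for part in parts):
            return True
    return False


if __name__ == "__main__":
    patterns = ["*.pyc", "node_modules", "temp"]
    for candidate in ["src/app.pyc", "node_modules/lib/index.js", "src/main.py"]:
        print(candidate, is_ignored(candidate, patterns))
```

Checking every component means a directory pattern such as `node_modules` also excludes everything beneath that directory.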
You can create a .cdigestignore file in your project root to specify project-specific ignore patterns. Each line in this file will be treated as an ignore pattern.
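As a rough illustration of the one-pattern-per-line format, the sketch below reads such a file into a list of patterns. The helper name `load_ignore_file` is hypothetical, and skipping blank lines and `#` comments is an assumption borrowed from .gitignore conventions rather than documented behavior of Codebase Digest.

```python
from pathlib import Path
from typing import List


def load_ignore_file(project_root: str, filename: str = ".cdigestignore") -> List[str]:
    """Read ignore patterns, one per line, from a file in the project root.

    Hypothetical helper for illustration; not the tool's actual loader.
    """
    ignore_path = Path(project_root) / filename
    if not ignore_path.is_file():
        # No ignore file means no project-specific patterns.
        return []
    patterns: List[str] = []
    for line in ignore_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments are skipped, as in .gitignore.
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns
```

A file containing the lines `*.log`, `temp_folder`, and `config.ini` would yield those three patterns in order.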
To use only your custom ignore patterns without the default ones, use the --no-default-ignores option:
cdigest /path/to/my_project --no-default-ignores --ignore "custom_pattern" "another_pattern"
Codebase Digest includes a comprehensive set of prompts in the prompt_library directory to help you analyze your codebase using Large Language Models. These prompts cover various aspects of code analysis and business alignment:
- Codebase Mapping and Learning: Quickly understand the structure and functionality of a new or complex codebase.
- Improving User Stories: Analyze existing code to refine or generate user stories.
- Initial Security Analysis: Perform a preliminary security assessment.
- Code Quality Enhancement: Identify areas for improvement in code quality, readability, and maintainability.
- Documentation Generation: Automatically generate or improve codebase documentation.
- Learning Tool: Use as a teaching aid to explain complex coding concepts or architectures.
- Business Alignment: Analyze how the codebase supports business objectives.
- Stakeholder Communication: Generate insights to facilitate discussions with non-technical stakeholders.
- Analysis:
  - Codebase Error and Inconsistency Analysis: Identify and analyze errors and inconsistencies in the codebase.
  - Codebase Risk Assessment: Evaluate potential risks within the codebase (e.g., security vulnerabilities, maintainability issues).
  - Code Complexity Analysis: Identify areas with high cyclomatic complexity, deep nesting, or excessive method lengths.
  - Code Duplication Analysis: Identify duplicated code fragments and suggest refactoring opportunities.
  - Code Style Consistency Analysis: Analyze the codebase for consistency in code style, naming conventions, and formatting.
  - Code Documentation Coverage Analysis: Determine the coverage and quality of code documentation.
- Generation:
  - Codebase Documentation Generation: Automatically generate or improve codebase documentation.
- Analysis:
  - Frontend Code Analysis: Analyze the frontend codebase to identify best practices, potential improvements, and common pitfalls.
  - Backend Code Analysis: Analyze the backend codebase to identify best practices, potential improvements, and common pitfalls.
  - Code Style and Readability Analysis: Evaluate the codebase's overall style and readability, providing suggestions for improvement.
  - Personal Development Recommendations: Analyze the codebase and provide personalized recommendations for areas where the engineer can improve their skills.
- Generation:
  - User Story Reconstruction from Code: Reconstruct and structure user stories based on the codebase.
  - Code-Based Mini-Lesson Generation: Create mini-lessons to explain complex coding concepts or architectures.
  - Algorithmic Storytelling: Generate engaging narratives that explain the logic and flow of key algorithms in the codebase.
  - Code Pattern Recognition and Explanation: Identify and explain design patterns, architectural patterns, and common coding idioms used in the codebase.
  - Socratic Dialogue Generation for Code Review: Generate Socratic-style dialogues that explore the reasoning behind code design decisions and encourage critical thinking during code reviews.
  - Code Evolution Visualization: Create visualizations that illustrate how the codebase has evolved over time, highlighting key milestones, refactorings, and architectural changes.
  - Codebase Trivia Game Generation: Generate trivia questions and answers based on the codebase to gamify learning and encourage team engagement.
  - Code-Inspired Analogies and Metaphors: Generate analogies and metaphors inspired by the codebase to help explain complex technical concepts to non-technical stakeholders.
  - Frontend Component Documentation: Generate documentation for frontend components, including props, usage examples, and best practices.
  - Backend API Documentation: Generate documentation for backend APIs, including endpoints, request/response formats, and authentication requirements.
  - Code Refactoring Exercises: Generate code refactoring exercises based on the codebase to help engineers improve their refactoring skills.
  - Code Review Checklist Generation: Generate a checklist of important points to consider during code reviews, based on the codebase's specific requirements and best practices.
- Analysis:
  - Codebase Best Practice Analysis: Analyze the codebase for good and bad programming practices.
- Generation:
  - Codebase Translation to Another Programming Language: Translate the codebase from one programming language to another.
  - Codebase Refactoring for Improved Readability and Performance: Suggest refactoring improvements for better readability and performance.
- Generation:
  - Unit Test Generation for Codebase: Generate unit tests for the provided codebase.
- Analysis:
  - Security Vulnerability Analysis of Codebase: Identify potential security vulnerabilities in the codebase.
- Analysis:
  - Business Impact Analysis: Identify key features and their potential business impact.
  - SWOT Analysis: Evaluate the codebase's current state and future potential.
  - Jobs to be Done (JTBD) Analysis: Understand core user needs and identify potential improvements.
  - OKR (Objectives and Key Results) Analysis: Align codebase features with potential business objectives and key results.
  - Value Chain Analysis: Understand how the codebase supports the larger value creation process.
  - Porter's Five Forces Analysis: Analyze competitive forces shaping the product's market.
  - Product/Market Fit Analysis: Evaluate how well the product meets market needs.
  - PESTEL Analysis: Analyze macro-environmental factors affecting the product.
- Generation:
  - Business Model Canvas Generation: Create a Business Model Canvas based on codebase analysis.
  - Value Proposition Canvas Generation: Generate a Value Proposition Canvas aligning technical features with user needs and benefits.
  - Lean Canvas Generation: Create a Lean Canvas to evaluate business potential and identify areas for improvement or pivot.
  - Customer Journey Map Creation: Generate a map showing how different parts of the codebase support various stages of the user's journey.
  - Blue Ocean Strategy Canvas: Create a strategy canvas to identify untapped market space and new demand.
  - Ansoff Matrix Generation: Produce an Ansoff Matrix to evaluate growth strategies for the product.
  - BCG Growth-Share Matrix Creation: Generate a BCG Matrix to assess the product portfolio and resource allocation.
  - Kano Model Diagram: Create a Kano Model diagram to prioritize product features based on customer satisfaction.
  - Technology Adoption Lifecycle Curve: Generate a curve showing the product's position in the adoption lifecycle.
  - Competitive Positioning Map: Create a visual map of the product's position relative to competitors.
  - McKinsey 7S Framework Diagram: Generate a diagram evaluating internal elements for organizational effectiveness.
  - Stakeholder Persona Generation: Infer and create potential stakeholder personas based on codebase functionalities.
- Analysis:
  - Identify Architectural Layers: Analyze the codebase and identify different architectural layers (e.g., presentation, business logic, data access), highlighting inconsistencies or deviations from common architectural patterns.
  - Analyze Coupling and Cohesion: Evaluate coupling and cohesion between modules or components, identifying areas with high coupling or low cohesion that might indicate design flaws.
  - Identify Design Patterns: Analyze the codebase for instances of common design patterns (e.g., Singleton, Factory, Observer), explaining their implementation and purpose.
  - Database Schema Review: Review the database schema for normalization, indexing, and potential performance bottlenecks, suggesting improvements based on best practices.
  - API Conformance Check: Given an API specification (e.g., OpenAPI), analyze the codebase to identify any inconsistencies or deviations from the defined API contract.
- Generation:
  - Generate Architectural Diagram: Based on codebase structure and dependencies, generate a visual representation of the system architecture, including components, layers, and interactions.
  - Suggest Refactoring for Design Patterns: Analyze the codebase and suggest opportunities to implement design patterns for improved maintainability, extensibility, or reusability.
  - Generate Database Schema Documentation: Create comprehensive documentation for the database schema, including table descriptions, relationships, indexes, and constraints.
  - Generate API Client Code: Based on an existing API specification or codebase implementation, generate client code (e.g., in JavaScript, Python) to interact with the API.
- Analysis:
  - Identify Performance Bottlenecks: Analyze the codebase for performance bottlenecks like inefficient algorithms, excessive database queries, or slow network requests, focusing specifically on performance-related issues.
  - Resource Usage Profiling: Analyze the codebase to identify areas with high CPU utilization, memory consumption, or disk I/O, providing insights into potential optimizations for efficient resource usage.
  - Scalability Analysis: Analyze the codebase and architectural choices to assess the system's scalability, identifying potential limitations and suggesting improvements for handling increased load.
  - Concurrency and Synchronization Analysis: Analyze the codebase for potential concurrency issues like race conditions or deadlocks. Suggest solutions to improve thread safety and synchronization mechanisms.
- Generation:
  - Suggest Code Optimization Techniques: Based on the analysis of potential bottlenecks, suggest specific code optimization techniques like caching, asynchronous operations, or algorithm improvements.
  - Generate Performance Test Scenarios: Create realistic performance test scenarios (e.g., using tools like JMeter or Gatling) to simulate high load and identify performance bottlenecks.
  - Suggest Configuration Tuning: Recommend optimal configuration settings for databases, application servers, or other infrastructure components to improve performance.
- Analysis:
  - Code Churn Hotspot Analysis: Analyze code commit history to identify areas of the codebase with high churn rates, which can indicate areas requiring refactoring or potentially problematic code.
  - Technical Debt Estimation: Based on code complexity, code smells, and other factors, estimate the amount of technical debt present in the codebase and prioritize areas for refactoring. Focus on estimations derived from historical code analysis rather than general code quality.
  - Impact Analysis of Code Changes: Analyze the potential impact of specific code changes (e.g., bug fixes, new features) on other parts of the system to identify potential regressions or conflicts.
- Generation:
  - Generate Code Evolution Report: Create a report summarizing the evolution of the codebase over time, including key metrics like code churn, code complexity, and contributor activity.
  - Generate Refactoring Recommendations (History-Based): Based on code evolution analysis and technical debt estimation, generate specific refactoring recommendations to improve code quality and reduce maintenance costs, focusing on areas identified through historical analysis.
  - Visualize Codebase Evolution: Generate visualizations (e.g., heatmaps, graphs) to represent the codebase's evolution, highlighting areas of frequent change, code complexity, and potential technical debt.
For detailed instructions on using these prompts, refer to the individual files in the prompt_library directory.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.