Thank you for your interest in contributing to Shekar! We welcome contributions from the community and are grateful for your support in making Persian NLP more accessible.
- Code of Conduct
- How Can I Contribute?
- Getting Started
- Development Setup
- Making Changes
- Testing
- Submitting Changes
- Style Guidelines
- Documentation
- Community
By participating in this project, you agree to maintain a respectful and inclusive environment. We expect all contributors to:
- Be respectful and considerate in communication
- Welcome newcomers and help them get started
- Focus on constructive feedback
- Respect differing viewpoints and experiences
- Accept responsibility and apologize for mistakes
Before creating bug reports, please check existing issues to avoid duplicates. When creating a bug report, include:
- A clear and descriptive title
- Steps to reproduce the issue
- Expected behavior vs actual behavior
- Python version and operating system
- Shekar version (pip show shekar)
- Sample code or text that demonstrates the problem
- Error messages or stack traces
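The version details requested above can be gathered in one step. The following is a minimal sketch using only the standard library; the helper name `collect_environment` is hypothetical and not part of Shekar.

```python
import platform
from importlib import metadata


def collect_environment(package: str = "shekar") -> dict[str, str]:
    """Gather the version details a bug report asks for."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        package_version = "not installed"
    return {
        "python": platform.python_version(),
        "os": f"{platform.system()} {platform.release()}",
        package: package_version,
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue saves maintainers a round of follow-up questions.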
Enhancement suggestions are welcome! Please provide:
- A clear description of the proposed feature
- Use cases and benefits
- Examples of how it would work
- Any relevant references or implementations in other libraries
We appreciate code contributions! Areas where you can help:
- Bug fixes
- New features or improvements
- Performance optimizations
- Better Persian language support
- Documentation improvements
- Test coverage expansion
- Model improvements
- Fork the repository on GitHub
- Clone your fork locally:
git clone https://github.com/YOUR_USERNAME/shekar.git
cd shekar
- Add upstream remote:
git remote add upstream https://github.com/amirivojdan/shekar.git
- Python 3.10 or higher
- uv - Fast Python package installer and resolver
- git
- (Optional) CUDA Toolkit for GPU support
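A quick way to confirm the required tools are available before setting up is sketched below. It only checks the interpreter version and whether the commands are on PATH; the function name `check_prerequisites` is an illustrative assumption.

```python
import shutil
import sys


def check_prerequisites(
    min_python: tuple[int, int] = (3, 10),
    tools: tuple[str, ...] = ("uv", "git"),
) -> dict[str, bool]:
    """Report which prerequisites are satisfied on this machine."""
    label = f"python>={min_python[0]}.{min_python[1]}"
    results = {label: sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the command is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```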
If you don't have uv installed:
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Or with pip
pip install uv
- Clone the repository (if you haven't already)
- Install development dependencies with uv:
uv sync --all-extras
shekar/
├── shekar/              # Main package
│   ├── preprocessing/   # Text preprocessing components
│   ├── tokenizers/      # Tokenization modules
│   ├── embeddings/      # Embedding models
│   ├── stemmer/         # Stemming functionality
│   ├── lemmatizer/      # Lemmatization functionality
│   ├── pos_tagger/      # POS tagging
│   ├── ner/             # Named entity recognition
│   └── ...
├── tests/               # Test suite
├── docs/                # Documentation
└── examples/            # Example scripts
- Create a new branch from main:
git checkout -b feature/your-feature-name
# or
git checkout -b fix/bug-description
- Use descriptive branch names:
  - feature/add-lemmatization
  - fix/tokenizer-unicode-issue
  - docs/update-embedding-examples
  - test/add-ner-tests
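The naming pattern in the example branch names can be expressed as a small check. This is a sketch only: treating feature, fix, docs, and test as the complete set of prefixes is an assumption drawn from the examples, not a rule the project states.

```python
import re

# Prefixes taken from the example branch names; the fixed set is an assumption.
BRANCH_PATTERN = re.compile(r"(feature|fix|docs|test)/[a-z0-9]+(?:-[a-z0-9]+)*")


def is_descriptive_branch_name(name: str) -> bool:
    """Return True for names shaped like 'prefix/lowercase-words'."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("fix/tokenizer-unicode-issue", "my-branch"):
        print(candidate, is_descriptive_branch_name(candidate))
```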
Write clear, concise commit messages:
- Use present tense ("Add feature" not "Added feature")
- Use imperative mood ("Move cursor to..." not "Moves cursor to...")
- First line: brief summary (50 chars or less)
- Follow with detailed explanation if needed
Examples:
Add support for custom stopword lists
- Allow users to provide custom stopword files
- Add validation for stopword format
- Update documentation with examples
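The commit message rules can be checked mechanically before committing. The sketch below covers the summary length limit and flags a few non-imperative opening words; the word list is illustrative and far from exhaustive, and `check_commit_message` is a hypothetical helper rather than project tooling.

```python
# Illustrative, non-exhaustive list of non-imperative openers (an assumption).
NON_IMPERATIVE_OPENERS = {
    "added", "adds", "fixed", "fixes",
    "updated", "updates", "moved", "moves",
}


def check_commit_message(message: str, max_summary: int = 50) -> list[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    lines = message.splitlines()
    summary = lines[0].strip() if lines else ""
    if not summary:
        return ["summary line is empty"]

    problems = []
    if len(summary) > max_summary:
        problems.append(
            f"summary has {len(summary)} characters (limit {max_summary})"
        )
    first_word = summary.split()[0].lower()
    if first_word in NON_IMPERATIVE_OPENERS:
        problems.append(
            f"summary starts with '{first_word}'; use the imperative mood"
        )
    return problems
```

A message such as "Add support for custom stopword lists" passes both checks, while "Added feature" is flagged for its opening word.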
- Write clean, readable code
- Follow PEP 8 style guidelines
- Add docstrings to functions and classes
- Keep functions focused and modular
- Avoid breaking existing APIs without discussion
When working on Persian NLP features:
- Test with various Persian texts including informal writing
- Consider Zero Width Non-Joiner (ZWNJ) usage
- Handle both Persian and Arabic character variants
- Test with different diacritic combinations
- Consider right-to-left text rendering issues
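To make two of these points concrete, the sketch below builds test inputs for ZWNJ handling and for Arabic versus Persian character variants. It uses only the standard library and does not reflect how Shekar itself handles these cases; the names `unify_characters` and `zwnj_variants` are hypothetical, and the mapping covers just two well-known letter pairs.

```python
ZWNJ = "\u200c"  # Zero Width Non-Joiner

# Arabic code points that often appear in place of their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # Arabic Yeh -> Farsi Yeh
    "\u0643": "\u06a9",  # Arabic Kaf -> Keheh
})


def unify_characters(text: str) -> str:
    """Map the Arabic letter variants above to their Persian forms."""
    return text.translate(ARABIC_TO_PERSIAN)


def zwnj_variants(text: str) -> list[str]:
    """Return spellings users commonly type for a word containing ZWNJ."""
    return [
        text,                     # form written with ZWNJ
        text.replace(ZWNJ, " "),  # ZWNJ typed as a regular space
        text.replace(ZWNJ, ""),   # ZWNJ omitted entirely
    ]


if __name__ == "__main__":
    word = "\u0645\u06cc\u200c\u0631\u0648\u0645"  # "می‌روم" written with ZWNJ
    for variant in zwnj_variants(word):
        print(repr(variant))
```

Feeding every variant to a new feature is a cheap way to see whether it copes with informal typing habits.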
# Run all tests
uv run pytest
# Run specific test file
uv run pytest tests/test_tokenizer.py
# Run with coverage
uv run pytest --cov=shekar tests/
- Write tests for new features and bug fixes
- Place tests in the tests/ directory
- Name test files as test_*.py
- Use descriptive test function names
- Include both positive and negative test cases
- Test edge cases and Persian-specific scenarios
Example test structure:
from shekar import Normalizer  # assumes Normalizer is exported at the package top level


def test_normalizer_removes_diacritics():
    normalizer = Normalizer()
    text = "سَلام"
    expected = "سلام"
    assert normalizer(text) == expected
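The guideline on positive, negative, and edge cases might look like the following in practice. To stay self-contained, the sketch tests a stand-in function, `remove_diacritics`, which is hypothetical and not Shekar's API; in a real contribution the tests would target the actual component.

```python
import re

# Arabic-script marks from fathatan (U+064B) through sukun (U+0652).
_DIACRITICS = re.compile("[\u064b-\u0652]")


def remove_diacritics(text: str) -> str:
    """Hypothetical stand-in for the component under test."""
    return _DIACRITICS.sub("", text)


def test_removes_diacritics_from_vocalized_word():
    # Positive case: a fatha after the first letter is dropped.
    assert remove_diacritics("\u0633\u064e\u0644\u0627\u0645") == "\u0633\u0644\u0627\u0645"


def test_leaves_plain_text_unchanged():
    # Negative case: nothing to remove, so the text comes back as-is.
    assert remove_diacritics("\u0633\u0644\u0627\u0645") == "\u0633\u0644\u0627\u0645"


def test_handles_empty_string():
    # Edge case: empty input.
    assert remove_diacritics("") == ""


def test_preserves_zwnj():
    # Persian-specific case: ZWNJ (U+200C) must survive the cleanup.
    text = "\u0645\u06cc\u200c\u0631\u064e\u0648\u0645"
    assert remove_diacritics(text) == "\u0645\u06cc\u200c\u0631\u0648\u0645"
```

Each function name states the behavior it verifies, so a failing test points directly at what broke.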
- Update your branch with latest upstream changes:
git fetch upstream
git rebase upstream/main
- Push to your fork:
git push origin feature/your-feature-name
- Create a Pull Request on GitHub with:
  - Clear title describing the change
  - Detailed description of what and why
  - Reference any related issues (e.g., "Fixes #123")
  - Screenshots or examples if applicable
  - Notes on testing performed
- Review process:
  - Maintainers will review your PR
  - Address any feedback or requested changes
  - Once approved, your PR will be merged
- Code follows the project's style guidelines
- Tests added/updated and passing
- Documentation updated if needed
- No breaking changes (or clearly documented)
- Commit messages are clear and descriptive
- Branch is up to date with main
- Follow PEP 8
- Maximum line length: 100 characters
- Use type hints where appropriate
- Use meaningful variable names
- Use Google-style docstrings
- Include parameter types and return values
- Provide usage examples
- Write in clear, simple English
Example:
def normalize_text(text: str, remove_diacritics: bool = True) -> str:
    """Normalize Persian text.

    Args:
        text: Input Persian text to normalize
        remove_diacritics: Whether to remove diacritical marks

    Returns:
        Normalized Persian text

    Example:
        >>> normalize_text("سَلام")
        'سلام'
    """
    pass
- Update README.md for user-facing changes
- Add docstrings to new code
- Update examples if APIs change
- Consider adding tutorials for complex features
Documentation is built using MkDocs:
# Build documentation
uv run mkdocs build
# Serve documentation locally (with auto-reload)
uv run mkdocs serve
Then open http://127.0.0.1:8000 in your browser to view the documentation.
- Open an issue for questions
- Check existing issues and documentation first
- Be patient and respectful
Contributors will be recognized in:
- Project README
- Release notes
- GitHub contributors page
If you have questions about contributing, feel free to:
- Open an issue with the "question" label
- Reach out to the maintainers
Thank you for contributing to Shekar! Your efforts help make Persian NLP more accessible to everyone. 🙏
Persian is Sugar
"فارسی شکر است"