A customized, lightweight spaCy pipeline for splitting Meitei Mayek (Manipuri) text into sentences. It uses a hybrid approach: SentencePiece tokenization followed by a contextual convolutional neural network (CNN) that detects sentence boundaries.
Evaluated on a held-out validation set of 4,062 documents (~40k sentences).
| Metric | Score | Description |
|---|---|---|
| F-Score | 94.71% | Harmonic mean of Precision & Recall |
| Precision | 93.94% | Accuracy of boundary predictions |
| Recall | 95.49% | % of actual boundaries found |
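The reported F-score is consistent with the precision and recall above, as a quick check shows:

```python
precision, recall = 0.9394, 0.9549  # scores from the table above

# The F-score is the harmonic mean of precision and recall.
f_score = 2 * precision * recall / (precision + recall)
print(f"{f_score:.2%}")  # → 94.71%
```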

| Feature | Specification |
|---|---|
| Model Size | ~252 KB (Extremely Lightweight) |
| Architecture | CNN (HashEmbedCNN v2) with Window Size 1 |
| Vocab Size | 8,000 Subword Unigrams |
| Tokenizer | SentencePiece (Unigram) |
| Pipeline | senter (Sentence Recognizer) |
| Language | Custom / Multilingual (xx) |
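Based on the table above, the senter model block in a spaCy `config.cfg` would look roughly like this sketch (only `lang = "xx"` and `window_size = 1` come from the table; the remaining hyperparameters are illustrative defaults, not the project's actual config):

```ini
[nlp]
lang = "xx"
pipeline = ["senter"]

[components.senter]
factory = "senter"

[components.senter.model]
@architectures = "spacy.Tagger.v2"

[components.senter.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
pretrained_vectors = null
```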
The system resolves the ambiguity of sentence boundaries in Meitei Mayek (where delimiters like ꯫ may also be used for other purposes) by learning local context.
```mermaid
graph LR
    A[Raw Text] --> B(SentencePiece Tokenizer);
    B --> C{Neural Context<br/>window = 3 tokens};
    C -->|End of Sent?| D[Doc / Sentence List];
    style B fill:#f9f,stroke:#333
    style C fill:#bbf,stroke:#333
```
- Tokenization: Splits complex Meitei words into subwords.
- Senter (CNN): Scans the token stream. It looks at the current token + neighbors to decide if a sentence ends.
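As a toy illustration of the second step, the sketch below replaces the trained CNN with a hypothetical fixed rule (split after the Meitei full stop ꯫); the real senter instead learns this decision from the surrounding token window:

```python
# Toy stand-in for the senter: a fixed rule instead of a trained CNN.
def split_sentences(tokens):
    """Group a token stream into sentences, splitting after "꯫"."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == "꯫":  # hypothetical rule; the CNN learns this from context
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a final delimiter
        sentences.append(" ".join(current))
    return sentences

tokens = "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫".split()
for sent in split_sentences(tokens):
    print(sent)
```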
Requires Python 3.11+.
```bash
# Clone the repository
git clone https://github.com/Okramjimmy/mni_tokenizer.git
cd mni_tokenizer

# Create environment
conda create -y --name mni_tokenizer python=3.11
conda activate mni_tokenizer

# Install dependencies
pip install -r requirements.txt
```

You can test the model interactively in your terminal:

```bash
python inference.py --interactive
```

Type any Meitei sentence (e.g., ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ...) and press Enter to see how it splits.
```python
import spacy
from meitei_tokenizer import MeiteiTokenizer

# Load the model
nlp = spacy.load("./output/model-best")

# Attach the custom tokenizer (required!)
nlp.tokenizer = MeiteiTokenizer("meitei_tokenizer.model", nlp.vocab)

# Process text
text = "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"
doc = nlp(text)

# Print sentences
for sent in doc.sents:
    print(sent.text)
```

Start the high-performance local web server:
```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

- Docs/Swagger UI: visit http://localhost:8000/docs
- API Endpoint: `POST /split`
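Besides curl, the endpoint can be called from Python with just the standard library. This sketch builds the JSON payload (`ensure_ascii=False` keeps the Meitei script human-readable) and sends it, assuming the server above is running on localhost:8000:

```python
import json
import urllib.request

# Build the JSON payload for the /split endpoint.
payload = json.dumps(
    {"text": "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"},
    ensure_ascii=False,
).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/split",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))
except OSError as err:  # e.g. the server is not running
    print(f"Request failed: {err}")
```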
Example Request (curl):
```bash
curl -X POST "http://localhost:8000/split" \
  -H "Content-Type: application/json" \
  -d '{"text": "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"}'
```

Key files:

- `output/model-best`: The trained spaCy model.
- `meitei_tokenizer.model`: The SentencePiece tokenizer model.
- `meitei_tokenizer.py`: The custom tokenizer wrapper code.
- `inference.py`: Script to run the model.
(Training scripts are included for reproducibility, but not required for usage.)