
# Meitei Mayek Sentence Splitter

A customized, lightweight spaCy pipeline for splitting Meitei Mayek (Manipuri) text into sentences. It combines SentencePiece tokenization with a contextual convolutional neural network to accurately detect sentence boundaries.


## 📊 Model Specifications & Results

### Performance Metrics

Evaluated on a held-out validation set of 4,062 documents (~40k sentences).

| Metric    | Score  | Description                           |
|-----------|--------|---------------------------------------|
| F-Score   | 94.71% | Harmonic mean of precision and recall |
| Precision | 93.94% | Accuracy of predicted boundaries      |
| Recall    | 95.49% | Percentage of actual boundaries found |
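As a quick sanity check (plain arithmetic, not part of the repo), the reported F-score is indeed the harmonic mean of the precision and recall above:

```python
# Reproduce the F-score from the reported precision and recall.
precision = 0.9394
recall = 0.9549

f_score = 2 * precision * recall / (precision + recall)
print(f"F-Score: {f_score:.2%}")  # → F-Score: 94.71%
```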

### Model Details

| Feature      | Specification                            |
|--------------|------------------------------------------|
| Model Size   | ~252 KB (extremely lightweight)          |
| Architecture | CNN (HashEmbedCNN v2) with window size 1 |
| Vocab Size   | 8,000 subword unigrams                   |
| Tokenizer    | SentencePiece (Unigram)                  |
| Pipeline     | `senter` (Sentence Recognizer)           |
| Language     | Custom / Multilingual (`xx`)             |

## 🔄 Workflow

The system resolves the ambiguity of sentence boundaries in Meitei Mayek (where delimiter characters such as || can also appear in non-boundary positions) by learning local context.

```mermaid
graph LR
    A[Raw Text] --> B(SentencePiece Tokenizer);
    B --> C{Neural Context\nwindow=3 tokens};
    C -->|End of Sent?| D[Doc / Sentence List];

    style B fill:#f9f,stroke:#333
    style C fill:#bbf,stroke:#333
```
  1. **Tokenization**: SentencePiece splits complex Meitei words into subwords.
  2. **Senter (CNN)**: Scans the token stream, looking at the current token plus its immediate neighbours to decide whether a sentence ends there.
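The scanning step can be illustrated with a toy sketch: slide a window over the token stream and close a sentence whenever a boundary classifier fires. In the real pipeline that classifier is the trained CNN; the `is_boundary` callable below is a hypothetical rule-based stand-in for illustration only.

```python
def split_sentences(tokens, is_boundary, window=1):
    """Group tokens into sentences using a windowed boundary decision.

    `is_boundary` sees the current token plus `window` neighbours on each
    side (window=1 -> 3 tokens of context, matching the model above).
    """
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        context = tokens[max(0, i - window): i + window + 1]
        if is_boundary(tok, context):
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final boundary
        sentences.append(current)
    return sentences

# Stand-in classifier: treat the Meitei full stop "꯫" as a boundary.
tokens = ["ꯆꯦꯔꯣꯀꯤ", "ꯑꯁꯤ", "꯫", "ꯃꯁꯤ", "ꯆꯥꯎꯏ", "꯫"]
sents = split_sentences(tokens, lambda tok, ctx: tok == "꯫")
print(sents)  # two sentences, each ending in "꯫"
```

The learned model goes beyond this rule by using the surrounding context, so it can also split where no explicit delimiter appears and skip delimiters used for other purposes.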

## 🚀 Quick Start

### 1. Installation

Requires Python 3.11+.

```bash
# Clone the repository
git clone https://github.com/Okramjimmy/mni_tokenizer.git
cd mni_tokenizer

# Create and activate a conda environment
conda create -y --name mni_tokenizer python=3.11
conda activate mni_tokenizer

# Install dependencies
pip install -r requirements.txt
```

### 2. Run Inference

You can test the model interactively in your terminal:

```bash
python inference.py --interactive
```

Type any Meitei sentence (e.g., ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ...) and press Enter to see how it splits.

### 3. Use in Python

```python
import spacy
from meitei_tokenizer import MeiteiTokenizer

# Load the trained model
nlp = spacy.load("./output/model-best")
# Attach the custom SentencePiece tokenizer (required!)
nlp.tokenizer = MeiteiTokenizer("meitei_tokenizer.model", nlp.vocab)

# Process text
text = "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"
doc = nlp(text)

# Print the detected sentences
for sent in doc.sents:
    print(sent.text)
```

### 4. Web API (FastAPI)

Start the high-performance local web server:

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

  • Docs/Swagger UI: visit http://localhost:8000/docs
  • API Endpoint: `POST /split`

Example Request (curl):

```bash
curl -X POST "http://localhost:8000/split" \
     -H "Content-Type: application/json" \
     -d '{"text": "ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫"}'
```
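The same request can be built from Python with only the standard library. This is a sketch, not part of the repo: the helper name `make_split_request` is hypothetical, and since the response schema isn't documented here, the `urlopen` call that would actually send it (and requires the server from step 4 to be running) is left commented out.

```python
import json
import urllib.request

def make_split_request(text, url="http://localhost:8000/split"):
    """Build a JSON POST request for the /split endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = make_split_request("ꯆꯦꯔꯣꯀꯤ ꯑꯁꯤ ꯑꯣꯀ꯭ꯂꯥꯍꯣꯃꯥꯒꯤ ꯁꯍꯔꯅꯤ ꯫ ꯃꯁꯤ ꯌꯥꯝꯅ ꯆꯥꯎꯏ ꯫")

# With the FastAPI server running, send it and print the JSON response:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```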

## 📂 Repository Structure

  • `output/model-best`: the trained spaCy model.
  • `meitei_tokenizer.model`: the SentencePiece tokenizer model.
  • `meitei_tokenizer.py`: the custom tokenizer wrapper.
  • `inference.py`: script to run the model.

(Training scripts are included for reproducibility, but not required for usage.)
