Extract and count emoji from text efficiently and accurately. Fully supports multi-part emoji (skin tones, zero-width joiners, flags).
pip install emoji_extractor
Usage examples: see this Jupyter notebook
You can use the top-level convenience functions to extract emoji using the default (latest) Unicode version:
from emoji_extractor import count_emoji, detect_emoji
# Returns a Counter object of emojis and their counts
counts = count_emoji("I love apples 🍎 and bananas 🍌🍌")
print(counts)
# Counter({'🍌': 2, '🍎': 1})
# Check if a string has emoji
has_emoji = detect_emoji("No emoji here") # False
By default, the package uses the latest available Unicode Emoji data.
If you need to extract emoji precisely as they were defined in a specific historical Unicode version, instantiate the Extractor class:
from emoji_extractor import Extractor
# Initialise an extractor for a specific version
ext_14 = Extractor(version='14.0')
ext_15 = Extractor(version='15.0')
# 🩷 Pink heart was introduced in 15.0
print(ext_14.detect_emoji("🩷")) # False
print(ext_15.detect_emoji("🩷")) # True
Available versions: 4.0, 5.0, 11.0, 12.0, 12.1, 13.0, 13.1, 14.0, 15.0, 15.1, 16.0, 17.0.
- Accurate Counting: Uses dynamically generated regular expressions to properly capture multi-codepoint sequences, including flags and ZWJ sequences (for example, a base emoji followed by a skin-tone modifier, a zero-width joiner and a gender sign).
- Historical Accuracy: Supports strict adherence to older Unicode specifications, avoiding false positives on newer emoji.
- Always Up to Date: Automatically checks for new Unicode releases via GitHub Actions and updates itself.
The package relies on official Unicode data parsed from emoji-test.txt. Inside the data/ folder for each version, it generates:
- possible_emoji.json: A set of all characters that could possibly be part of an emoji (used as a fast initial filter before checking the regex).
- big_regex.txt: A massive list of exact matching strings piped together in order of decreasing length. This guarantees multi-part emojis are matched before their individual components.
- tme_regex.txt: Regex definitions for Tone-Modifiable Emoji.
(Note: prior versions of this package used .pkl files, but it has since migrated to standard formats such as JSON and TXT for better security and cross-platform compatibility.)
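To illustrate why the ordering by decreasing length matters, here is a minimal, standard-library-only sketch of the general idea. It is not the package's actual code: the function names `parse_sequences`, `build_matcher` and `count_sketch` are hypothetical, and the three sample lines merely imitate the layout of emoji-test.txt rather than reading the real file.

```python
import re
from collections import Counter

# A few lines in the style of Unicode's emoji-test.txt (illustrative sample only).
SAMPLE = """\
# comment lines and blank lines are skipped
1F44D                 ; fully-qualified     # 👍 E0.6 thumbs up
1F44D 1F3FD           ; fully-qualified     # 👍🏽 E1.0 thumbs up: medium skin tone
1F1EB 1F1F7           ; fully-qualified     # 🇫🇷 E0.6 flag: France
"""

def parse_sequences(text):
    """Return the emoji strings listed in emoji-test.txt-style text."""
    sequences = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        codepoints = line.split(";", 1)[0].split()
        sequences.append("".join(chr(int(cp, 16)) for cp in codepoints))
    return sequences

def build_matcher(sequences):
    """Build a character pre-filter and a longest-first alternation regex."""
    possible = set("".join(sequences))
    # Longest sequences first, so a multi-part emoji wins over its components.
    ordered = sorted(set(sequences), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return possible, pattern

def count_sketch(text, possible, pattern):
    """Count emoji in text, skipping the regex when no candidate character occurs."""
    if not any(ch in possible for ch in text):
        return Counter()
    return Counter(pattern.findall(text))

possible, pattern = build_matcher(parse_sequences(SAMPLE))
print(count_sketch("nice \U0001F44D\U0001F3FD and \U0001F44D", possible, pattern))
```

If the alternatives were not sorted by length, the plain thumbs up would match first and the skin-tone modifier would be left behind as an unmatched character.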
Some emoji include the variation selector 0xFE0F, but some platforms strip it and still render the emoji form. The regex used here captures both forms (0xFE0F after each emoji codepoint, or no 0xFE0F at all). See Unicode's Full Emoji List and search for '0xFE0F' to see which emoji this potentially affects.
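One way to tolerate stripped selectors, shown here purely as an assumption about the general technique and not as the package's actual regex, is to make each 0xFE0F in a sequence optional. The helper name `optional_selector_pattern` is hypothetical.

```python
import re

VS16 = "\uFE0F"  # variation selector-16 (0xFE0F)

def optional_selector_pattern(sequence):
    """Return regex source matching `sequence` with or without its 0xFE0F selectors."""
    parts = []
    for ch in sequence:
        escaped = re.escape(ch)
        # Every selector becomes optional; all other codepoints stay mandatory.
        parts.append(escaped + "?" if ch == VS16 else escaped)
    return "".join(parts)

# Red heart is U+2764 followed by U+FE0F; some platforms send only U+2764.
heart = re.compile(optional_selector_pattern("\u2764\uFE0F"))
print(bool(heart.fullmatch("\u2764\uFE0F")), bool(heart.fullmatch("\u2764")))
```

Note that this sketch is slightly more permissive than the behaviour described above, since it also accepts sequences in which only some of the selectors are missing.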
If you want to do stuff more complicated than simply detecting, extracting and counting emoji then you might find this Python package useful.
Feel free to email me about any of this stuff.