Changes from all commits (43 commits)
0615dea
Initial plan
Copilot Mar 22, 2026
2def9a2
fix(virgool): switch from status_code to message check type with erro…
Copilot Mar 22, 2026
4f397fe
Re-enable taplink.cc with browser User-Agent to bypass Cloudflare (#2…
Copilot Mar 22, 2026
9ac0a65
feat(workflow): fix update site data workflow err (#2312)
soxoj Mar 22, 2026
e81b50e
Update site data workflow fix: remove ambiguous main tag (#2313)
soxoj Mar 22, 2026
2c2d340
Updated site list and statistics (#2314)
github-actions[bot] Mar 22, 2026
01049b7
Fix Love.Mail.ru: update to numeric-only identifiers and new profile …
Copilot Mar 22, 2026
56d0c9f
Remove dead site xxxforum.org (#2310)
Copilot Mar 22, 2026
b1a211c
Disable forums.developer.nvidia.com (auth-gated user profiles) (#2305)
Copilot Mar 22, 2026
b960ace
Pin requests-toolbelt>=1.0.0 to fix urllib3 v2 incompatibility (#2316)
Copilot Mar 22, 2026
a2d4373
build(deps): bump reportlab from 4.4.5 to 4.4.10 (#2323)
dependabot[bot] Mar 23, 2026
3ba0759
build(deps-dev): bump coverage from 7.12.0 to 7.13.5 (#2321)
dependabot[bot] Mar 23, 2026
2c55501
build(deps-dev): bump pytest-cov from 7.0.0 to 7.1.0 (#2320)
dependabot[bot] Mar 23, 2026
b4482e0
build(deps): bump aiohttp-socks from 0.10.1 to 0.11.0 (#2319)
dependabot[bot] Mar 23, 2026
5930a30
Disable false-positive site probe: amateurvoyeurforum.com (#2332)
Copilot Mar 23, 2026
146bc04
Disable forums.stevehoffman.tv due to false positives (#2331)
Copilot Mar 23, 2026
9b35fc1
[WIP] Fix false-positive probe for vegalab site (#2336)
Copilot Mar 23, 2026
e3aada6
Fix RoyalCams site check using BongaCams white-label pattern (#2334)
Copilot Mar 23, 2026
005863c
Fix Setlist site check: switch to message checkType with proper marke…
Copilot Mar 23, 2026
00a9249
[WIP] Fix invalid link on forums.imore.com (#2337)
Copilot Mar 23, 2026
e0559e4
Updated site list and statistics (#2315)
github-actions[bot] Mar 23, 2026
479a614
Automated Sites List Update (#2339)
github-actions[bot] Mar 23, 2026
d3f13ac
Fix false-positive site probe: Re-enable Taplink with message checkTy…
Copilot Mar 23, 2026
b00ef1f
build(deps): bump aiodns from 3.5.0 to 4.0.0 (#2345)
dependabot[bot] Mar 24, 2026
2775181
build(deps-dev): bump mypy from 1.19.0 to 1.19.1 (#2347)
dependabot[bot] Mar 24, 2026
4c97025
Disable Librusec site check (false positive) (#2349)
Copilot Mar 24, 2026
eb541dc
Disable MirTesen site check (false positive) (#2350)
Copilot Mar 24, 2026
829bda8
build(deps): bump attrs from 25.4.0 to 26.1.0 (#2344)
dependabot[bot] Mar 24, 2026
2d94269
Automated Sites List Update (#2341)
github-actions[bot] Mar 24, 2026
79cea49
feat: add CTFtime and PentesterLab site support (#2318)
juliosuas Mar 24, 2026
28f35f9
Fix club.cnews.ru false positive: switch from status_code to message …
Copilot Mar 24, 2026
3e56c95
Fix SoundCloud false-positive: switch to message-based check (#2355)
Copilot Mar 24, 2026
f5786f1
build(deps): bump certifi from 2025.11.12 to 2026.2.25 (#2346)
dependabot[bot] Mar 24, 2026
2e430e5
feat: add tag blacklisting via `--exclude-tags` (#2352)
Copilot Mar 24, 2026
abd9aa5
Fix domain substring matching and NoneType crash in submit dialog (#2…
Copilot Mar 24, 2026
b145e7b
feat(core): add POST request support, new sites, migrate to Majestic …
soxoj Mar 24, 2026
5aae2ee
Fix update-site-data workflow race condition on branch push (#2366)
Copilot Mar 24, 2026
4d70f0f
feat(virgool): add POST support and use user-existence API to bypass …
Copilot Mar 24, 2026
7ba2fd3
refactor: move json import to module level and address review comments
Copilot Mar 24, 2026
64cca25
fix(virgool): use existing POST support from main to enable virgool.i…
Copilot Mar 24, 2026
ab01dfc
Merge remote-tracking branch 'origin/copilot/fix-broken-site-virgool'…
Copilot Mar 24, 2026
a214263
fix(virgool): enable virgool.io via POST user-existence API
Copilot Mar 24, 2026
b59179b
Rebase: squash branch onto main with single Virgool data.json change
Copilot Mar 24, 2026
17 changes: 15 additions & 2 deletions .github/workflows/update-site-data.yml
@@ -4,13 +4,18 @@ on:
push:
branches: [ main ]

concurrency:
group: update-sites-${{ github.ref }}
cancel-in-progress: true

jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v2.3.2
uses: actions/checkout@v4
with:
ref: main
fetch-depth: 0 # otherwise, there would be errors pushing refs to the destination repository.

- name: Install system dependencies
@@ -22,6 +27,9 @@ jobs:
pip3 install .
python3 ./utils/update_site_data.py --empty-only

- name: Remove ambiguous main tag
run: git tag -d main || true

- name: Check for meaningful changes
id: check
run: |
@@ -32,13 +40,18 @@
echo "has_changes=false" >> $GITHUB_OUTPUT
fi

- name: Delete existing PR branch
if: steps.check.outputs.has_changes == 'true'
run: git push origin --delete auto/update-sites-list || true

- name: Create Pull Request
if: steps.check.outputs.has_changes == 'true'
uses: peter-evans/create-pull-request@v5
uses: peter-evans/create-pull-request@v7
with:
token: ${{ secrets.GITHUB_TOKEN }}
commit-message: "Updated site list and statistics"
title: "Automated Sites List Update"
body: "Automated changes to sites.md based on new Alexa rankings/statistics."
branch: "auto/update-sites-list"
base: main
delete-branch: true
1 change: 1 addition & 0 deletions LLM/site-checks-guide.md
@@ -157,6 +157,7 @@ Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtu
- For **Kaggle**, additionally: **`headers`**, **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation.
- For **Weblate / Anubis anti-bot**: setting `headers` to a basic script User-Agent (e.g. `python-requests/2.25.1`) instead of the default browser UA completely bypassed the Anubis proof-of-work challenge (an HTTP 307 redirect), restoring the framework's native HTTP 404 for non-existent users.
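For reference, a minimal sketch of such an entry in `data.json` (all URLs and values here are illustrative placeholders, not Weblate's actual configuration):

```json
{
  "Weblate": {
    "urlMain": "https://hosted.weblate.org/",
    "url": "https://hosted.weblate.org/user/{username}/",
    "headers": {
      "User-Agent": "python-requests/2.25.1"
    },
    "checkType": "status_code"
  }
}
```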

### What required disabling checks

5 changes: 4 additions & 1 deletion LLM/site-checks-playbook.md
@@ -76,8 +76,11 @@ Practical observations from fixing top-ranked sites. Full details: section **7**
| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). Don't assume POST-only — Maigret can use GET `urlProbe` for GraphQL. |
| **Inspect Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"request_method": "POST"` and `"request_payload": {"username": "{username}"}` in `data.json` to query them! |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe` — `.format()` ignores percent-encoded chars. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept requests with browser UAs via an HTTP 307 redirect. Setting `"headers": {"User-Agent": "python-requests/2.25.1"}` skips the challenge and restores the site's normal detection logic. |
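The brace-escaping trick from the table can be verified with a quick sketch (the URL and GraphQL field names are made up for illustration):

```python
# A GraphQL query with %7B/%7D standing in for literal { and }.
# str.format() only interprets bare braces, so percent-encoded ones
# pass through untouched while {username} is substituted.
probe = "https://example.com/graphql?query=%7Buser(username:\"{username}\")%7Bid%7D%7D"
url = probe.format(username="alice")
print(url)
# -> https://example.com/graphql?query=%7Buser(username:"alice")%7Bid%7D%7D
```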

## 8. Documentation maintenance

14 changes: 10 additions & 4 deletions docs/source/command-line-options.rst
@@ -31,19 +31,25 @@ two-letter country codes (**not a language!**). E.g. photo, dating, sport; jp, u
Multiple tags can be associated with one site. **Warning**: tags markup is
not stable now. Read more :doc:`in the separate section <tags>`.

``--exclude-tags`` - Exclude sites with specific tags from the search
(blacklist). E.g. ``--exclude-tags porn,dating`` will skip all sites
tagged with ``porn`` or ``dating``. Can be combined with ``--tags`` to
include certain categories while excluding others. Read more
:doc:`in the separate section <tags>`.
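The combined include/exclude semantics can be sketched as set operations (a simplified model for illustration, not Maigret's actual implementation):

```python
def site_selected(site_tags, include_tags, exclude_tags):
    """Return True if a site passes both tag filters."""
    site_tags = set(site_tags)
    if exclude_tags and site_tags & set(exclude_tags):
        return False  # a blacklisted tag is present
    if include_tags and not site_tags & set(include_tags):
        return False  # no whitelisted tag is present
    return True

# --tags forum --exclude-tags ru
print(site_selected({"forum", "en"}, ["forum"], ["ru"]))  # True
print(site_selected({"forum", "ru"}, ["forum"], ["ru"]))  # False
```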

``-n``, ``--max-connections`` - Allowed number of concurrent connections
**(default: 100)**.

``-a``, ``--all-sites`` - Use all sites for scan **(default: top 500)**.

``--top-sites`` - Count of sites for scan ranked by Alexa Top
``--top-sites`` - Count of sites for scan ranked by Majestic Million
**(default: top 500)**.

**Mirrors:** After the top *N* sites by Alexa rank are chosen (respecting
**Mirrors:** After the top *N* sites by Majestic Million rank are chosen (respecting
``--tags``, ``--use-disabled-sites``, etc.), Maigret may add extra sites
whose database field ``source`` names a **parent platform** that itself falls
in the Alexa top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Alexa, a mirror such as ``memory.lol``
in the Majestic Million top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Majestic Million, a mirror such as ``memory.lol``
(with ``source: Twitter``) is included even though it has no rank and would
otherwise be cut off. The same applies to Instagram-related mirrors (e.g.
Picuki) when ``Instagram`` is in that parent top *N* by rank—even if the
10 changes: 9 additions & 1 deletion docs/source/development.rst
@@ -22,9 +22,15 @@ The supported methods (``checkType`` values in ``data.json``) are:
- ``status_code`` - checks that status code of the response is 2XX
- ``response_url`` - check if there is not redirect and the response is 2XX

.. note::
   Maigret treats certain anti-bot HTTP status codes (such as LinkedIn's ``HTTP 999``) as an ordinary "not found" signal rather than reporting a server error, which prevents false positives.

See the details of check mechanisms in the `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py#L339>`_ file.

**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Alexa top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
.. note::
   Maigret now uses the **Majestic Million** dataset for site popularity sorting instead of the discontinued Alexa Rank API. For backward compatibility with existing configurations and parsers, the ranking field in ``data.json`` and internal site models remains named ``alexaRank`` / ``alexa_rank``.

**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Majestic Million top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
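A simplified model of the mirror rule described above (the function and field names are assumptions for illustration, not Maigret's internals):

```python
def top_sites_with_mirrors(sites, n):
    """Pick the top-n ranked enabled sites, then pull in mirror sites whose
    parent platform ranks in the top-n when disabled sites are counted."""
    ranked_all = sorted((s for s in sites if s.get("rank")),
                        key=lambda s: s["rank"])
    parent_names = {s["name"] for s in ranked_all[:n]}  # includes disabled
    enabled = [s for s in sites if not s.get("disabled")]
    top = sorted((s for s in enabled if s.get("rank")),
                 key=lambda s: s["rank"])[:n]
    mirrors = [s for s in enabled
               if s.get("source") in parent_names and s not in top]
    return top + mirrors

sites = [
    {"name": "Twitter", "rank": 1, "disabled": True},
    {"name": "GitHub", "rank": 2},
    {"name": "memory.lol", "source": "Twitter"},  # mirror, no rank of its own
]
print([s["name"] for s in top_sites_with_mirrors(sites, 1)])
# -> ['GitHub', 'memory.lol']
```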

Testing
-------
@@ -114,6 +120,8 @@ There are few options for sites data.json helpful in various cases:
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - the HTTP method to use (e.g., ``POST``); if unset, Maigret picks GET or HEAD automatically
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``); useful for querying GraphQL and other JSON APIs
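For example, a POST-based site entry could look like this (a hypothetical entry with invented URLs and marker strings; verify the field names against the current ``data.json`` schema):

```json
{
  "ExampleAPI": {
    "urlMain": "https://example.com/",
    "url": "https://example.com/@{username}",
    "urlProbe": "https://example.com/api/check-username",
    "requestMethod": "POST",
    "requestPayload": {"username": "{username}"},
    "checkType": "message",
    "errorMsg": "{\"taken\": false}"
  }
}
```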

``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16 changes: 16 additions & 0 deletions docs/source/tags.rst
@@ -23,3 +23,19 @@ Usage
``--tags coding`` -- search on sites related to software development.

``--tags ucoz`` -- search on uCoz sites only (mostly CIS countries)

Blacklisting (excluding) tags
------------------------------
You can exclude sites with certain tags from the search using ``--exclude-tags``:

``--exclude-tags porn,dating`` -- skip all sites tagged with ``porn`` or ``dating``.

``--exclude-tags ru`` -- skip all Russian sites.

You can combine ``--tags`` and ``--exclude-tags`` to fine-tune your search:

``--tags forum --exclude-tags ru`` -- search on forum sites, but skip Russian ones.

In the web interface, the tag cloud supports three states per tag:
click once to **include** (green), click again to **exclude** (dark/strikethrough),
and click once more to return to **neutral** (red).
2 changes: 1 addition & 1 deletion docs/source/usage-examples.rst
@@ -13,7 +13,7 @@ Use Cases
---------


1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Alexa rank) from the Maigret DB.
1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Majestic Million rank) from the Maigret DB.

.. code-block:: console

62 changes: 50 additions & 12 deletions maigret/checking.py
@@ -61,30 +61,49 @@ def __init__(self, *args, **kwargs):
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None

def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
self.headers = headers
self.allow_redirects = allow_redirects
self.timeout = timeout
self.method = method
self.payload = payload
return None

async def close(self):
pass

async def _make_request(
self, session, url, headers, allow_redirects, timeout, method, logger
self, session, url, headers, allow_redirects, timeout, method, logger, payload=None
) -> Tuple[str, int, Optional[CheckError]]:
try:
request_method = session.get if method == 'get' else session.head
async with request_method(
url=url,
headers=headers,
allow_redirects=allow_redirects,
timeout=timeout,
) as response:
if method.lower() == 'get':
request_method = session.get
elif method.lower() == 'post':
request_method = session.post
elif method.lower() == 'head':
request_method = session.head
else:
request_method = session.get

kwargs = {
'url': url,
'headers': headers,
'allow_redirects': allow_redirects,
'timeout': timeout,
}
if payload and method.lower() == 'post':
if headers and headers.get('Content-Type') == 'application/x-www-form-urlencoded':
kwargs['data'] = payload
else:
kwargs['json'] = payload

async with request_method(**kwargs) as response:
status_code = response.status
response_content = await response.content.read()
charset = response.charset or "utf-8"
@@ -141,6 +160,7 @@ async def check(self) -> Tuple[str, int, Optional[CheckError]]:
self.timeout,
self.method,
self.logger,
self.payload,
)

if error and str(error) == "Invalid proxy response":
@@ -165,7 +185,7 @@ def __init__(self, *args, **kwargs):
self.logger = kwargs.get('logger', Mock())
self.resolver = aiodns.DNSResolver(loop=loop)

def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
return None

@@ -191,7 +211,7 @@ class CheckerMock:
def __init__(self, *args, **kwargs):
pass

def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
return None

async def check(self) -> Tuple[str, int, Optional[CheckError]]:
@@ -220,6 +240,11 @@ def detect_error_page(
if status_code == 403 and not ignore_403:
return CheckError("Access denied", "403 status code, use proxy/vpn")

elif status_code == 999:
# LinkedIn anti-bot / HTTP 999 workaround. It shouldn't trigger an infrastructure
# Server Error because it represents a valid "Not Found / Blocked" state for the username.
pass

elif status_code >= 500:
return CheckError("Server", f"{status_code} status code")

@@ -494,7 +519,9 @@ def make_site_result(
for k, v in site.get_params.items():
url_probe += f"&{k}={v}"

if site.check_type == "status_code" and site.request_head_only:
if site.request_method:
request_method = site.request_method.lower()
elif site.check_type == "status_code" and site.request_head_only:
# In most cases when we are detecting by status code,
# it is not necessary to get the entire body: we can
# detect fine with just the HEAD response.
@@ -505,6 +532,15 @@
# not respond properly unless we request the whole page.
request_method = 'get'

payload = None
if site.request_payload:
payload = {}
for k, v in site.request_payload.items():
if isinstance(v, str):
payload[k] = v.format(username=username)
else:
payload[k] = v

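The payload-templating loop above behaves like this standalone sketch: `{username}` is substituted only in string values, and non-string values pass through unchanged.

```python
# Mirror of the payload-substitution logic, as a dict comprehension.
template = {"username": "{username}", "limit": 10}
payload = {k: v.format(username="alice") if isinstance(v, str) else v
           for k, v in template.items()}
print(payload)
# -> {'username': 'alice', 'limit': 10}
```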
if site.check_type == "response_url":
# Site forwards request to a different URL if username not
# found. Disallow the redirect so we can capture the
@@ -521,6 +557,7 @@
headers=headers,
allow_redirects=allow_redirects,
timeout=options['timeout'],
payload=payload,
)

# Store future request object in the results object
@@ -577,6 +614,7 @@ async def check_site_for_username(
allow_redirects=checker.allow_redirects,
timeout=checker.timeout,
method=checker.method,
payload=getattr(checker, 'payload', None),
)
response = await checker.check()

12 changes: 12 additions & 0 deletions maigret/maigret.py
@@ -277,6 +277,12 @@ def setup_arguments_parser(settings: Settings):
filter_group.add_argument(
"--tags", dest="tags", default='', help="Specify tags of sites (see `--stats`)."
)
filter_group.add_argument(
"--exclude-tags",
dest="exclude_tags",
default='',
help="Specify tags to exclude from search (blacklist).",
)
filter_group.add_argument(
"--site",
action="append",
@@ -532,6 +538,11 @@ async def main():
if args.tags:
args.tags = list(set(str(args.tags).split(',')))

if args.exclude_tags:
args.exclude_tags = list(set(str(args.exclude_tags).split(',')))
else:
args.exclude_tags = []

db_file = args.db_file \
if (args.db_file.startswith("http://") or args.db_file.startswith("https://")) \
else path.join(path.dirname(path.realpath(__file__)), args.db_file)
@@ -553,6 +564,7 @@
get_top_sites_for_id = lambda x: db.ranked_sites_dict(
top=args.top_sites,
tags=args.tags,
excluded_tags=args.exclude_tags,
names=args.site_list,
disabled=args.use_disabled_sites,
id_type=x,