Changes from all commits (43 commits)
0615dea
Initial plan
Copilot Mar 22, 2026
2def9a2
fix(virgool): switch from status_code to message check type with erro…
Copilot Mar 22, 2026
4f397fe
Re-enable taplink.cc with browser User-Agent to bypass Cloudflare (#2…
Copilot Mar 22, 2026
9ac0a65
feat(workflow): fix update site data workflow err (#2312)
soxoj Mar 22, 2026
e81b50e
Update site data workflow fix: remove ambiguous main tag (#2313)
soxoj Mar 22, 2026
2c2d340
Updated site list and statistics (#2314)
github-actions[bot] Mar 22, 2026
01049b7
Fix Love.Mail.ru: update to numeric-only identifiers and new profile …
Copilot Mar 22, 2026
56d0c9f
Remove dead site xxxforum.org (#2310)
Copilot Mar 22, 2026
b1a211c
Disable forums.developer.nvidia.com (auth-gated user profiles) (#2305)
Copilot Mar 22, 2026
b960ace
Pin requests-toolbelt>=1.0.0 to fix urllib3 v2 incompatibility (#2316)
Copilot Mar 22, 2026
a2d4373
build(deps): bump reportlab from 4.4.5 to 4.4.10 (#2323)
dependabot[bot] Mar 23, 2026
3ba0759
build(deps-dev): bump coverage from 7.12.0 to 7.13.5 (#2321)
dependabot[bot] Mar 23, 2026
2c55501
build(deps-dev): bump pytest-cov from 7.0.0 to 7.1.0 (#2320)
dependabot[bot] Mar 23, 2026
b4482e0
build(deps): bump aiohttp-socks from 0.10.1 to 0.11.0 (#2319)
dependabot[bot] Mar 23, 2026
5930a30
Disable false-positive site probe: amateurvoyeurforum.com (#2332)
Copilot Mar 23, 2026
146bc04
Disable forums.stevehoffman.tv due to false positives (#2331)
Copilot Mar 23, 2026
9b35fc1
[WIP] Fix false-positive probe for vegalab site (#2336)
Copilot Mar 23, 2026
e3aada6
Fix RoyalCams site check using BongaCams white-label pattern (#2334)
Copilot Mar 23, 2026
005863c
Fix Setlist site check: switch to message checkType with proper marke…
Copilot Mar 23, 2026
00a9249
[WIP] Fix invalid link on forums.imore.com (#2337)
Copilot Mar 23, 2026
e0559e4
Updated site list and statistics (#2315)
github-actions[bot] Mar 23, 2026
479a614
Automated Sites List Update (#2339)
github-actions[bot] Mar 23, 2026
d3f13ac
Fix false-positive site probe: Re-enable Taplink with message checkTy…
Copilot Mar 23, 2026
b00ef1f
build(deps): bump aiodns from 3.5.0 to 4.0.0 (#2345)
dependabot[bot] Mar 24, 2026
2775181
build(deps-dev): bump mypy from 1.19.0 to 1.19.1 (#2347)
dependabot[bot] Mar 24, 2026
4c97025
Disable Librusec site check (false positive) (#2349)
Copilot Mar 24, 2026
eb541dc
Disable MirTesen site check (false positive) (#2350)
Copilot Mar 24, 2026
829bda8
build(deps): bump attrs from 25.4.0 to 26.1.0 (#2344)
dependabot[bot] Mar 24, 2026
2d94269
Automated Sites List Update (#2341)
github-actions[bot] Mar 24, 2026
79cea49
feat: add CTFtime and PentesterLab site support (#2318)
juliosuas Mar 24, 2026
28f35f9
Fix club.cnews.ru false positive: switch from status_code to message …
Copilot Mar 24, 2026
3e56c95
Fix SoundCloud false-positive: switch to message-based check (#2355)
Copilot Mar 24, 2026
f5786f1
build(deps): bump certifi from 2025.11.12 to 2026.2.25 (#2346)
dependabot[bot] Mar 24, 2026
2e430e5
feat: add tag blacklisting via `--exclude-tags` (#2352)
Copilot Mar 24, 2026
abd9aa5
Fix domain substring matching and NoneType crash in submit dialog (#2…
Copilot Mar 24, 2026
b145e7b
feat(core): add POST request support, new sites, migrate to Majestic …
soxoj Mar 24, 2026
5aae2ee
Fix update-site-data workflow race condition on branch push (#2366)
Copilot Mar 24, 2026
4d70f0f
feat(virgool): add POST support and use user-existence API to bypass …
Copilot Mar 24, 2026
7ba2fd3
refactor: move json import to module level and address review comments
Copilot Mar 24, 2026
64cca25
fix(virgool): use existing POST support from main to enable virgool.i…
Copilot Mar 24, 2026
ab01dfc
Merge remote-tracking branch 'origin/copilot/fix-broken-site-virgool'…
Copilot Mar 24, 2026
a214263
fix(virgool): enable virgool.io via POST user-existence API
Copilot Mar 24, 2026
b59179b
Rebase: squash branch onto main with single Virgool data.json change
Copilot Mar 24, 2026
17 changes: 15 additions & 2 deletions .github/workflows/update-site-data.yml
@@ -4,13 +4,18 @@ on:
push:
branches: [ main ]

concurrency:
group: update-sites-${{ github.ref }}
cancel-in-progress: true

jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v2.3.2
uses: actions/checkout@v4
with:
ref: main
fetch-depth: 0 # otherwise, there would be errors pushing refs to the destination repository.

- name: Install system dependencies
@@ -22,6 +27,9 @@ jobs:
pip3 install .
python3 ./utils/update_site_data.py --empty-only

- name: Remove ambiguous main tag
run: git tag -d main || true

- name: Check for meaningful changes
id: check
run: |
@@ -32,13 +40,18 @@
echo "has_changes=false" >> $GITHUB_OUTPUT
fi

- name: Delete existing PR branch
if: steps.check.outputs.has_changes == 'true'
run: git push origin --delete auto/update-sites-list || true

- name: Create Pull Request
if: steps.check.outputs.has_changes == 'true'
uses: peter-evans/create-pull-request@v5
uses: peter-evans/create-pull-request@v7
with:
token: ${{ secrets.GITHUB_TOKEN }}
commit-message: "Updated site list and statistics"
title: "Automated Sites List Update"
body: "Automated changes to sites.md based on new Alexa rankings/statistics."
branch: "auto/update-sites-list"
base: main
delete-branch: true
1 change: 1 addition & 0 deletions LLM/site-checks-guide.md
@@ -157,6 +157,7 @@ Summary from an earlier false-positive review for: OpenSea, Mercado Livre, Redtu
- For **Kaggle**, additionally: **`headers`**, **`errors`** for browser-check text.
- **Redtube** stayed valid on **`status_code`** with a stable **404** for non-existent users.
- **Picsart**: the web profile URL is a thin SPA shell; use the **JSON API** (`api.picsart.com/users/show/{username}.json`) in **`url`** with **`message`**-style markers (`"status":"success"` vs `user_not_found`), not the browser-only `/posts` vs `/not-found` navigation.
- For **Weblate / Anubis anti-bot**: setting `headers` to a basic script User-Agent (e.g. `python-requests/2.25.1`) instead of the default browser UA completely bypassed the Anubis proof-of-work challenge (an HTTP 307 redirect), restoring the framework's native HTTP 404 for non-existent users.
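For reference, a minimal sketch of such an entry in `data.json` (all URLs and values here are illustrative placeholders, not Weblate's actual configuration):

```json
{
  "Weblate": {
    "urlMain": "https://hosted.weblate.org/",
    "url": "https://hosted.weblate.org/user/{username}/",
    "headers": {
      "User-Agent": "python-requests/2.25.1"
    },
    "checkType": "status_code"
  }
}
```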

### What required disabling checks

5 changes: 4 additions & 1 deletion LLM/site-checks-playbook.md
@@ -76,8 +76,11 @@ Practical observations from fixing top-ranked sites. Full details: section **7**
| **Some sites always generate a page** | Pbase stubs "pbase Artist {name}" for any path; ffm.bio fuzzy-matches to the nearest real entry. No markers can help — `disabled: true`. |
| **TLS fingerprinting degrades over time** | Kaggle's custom `User-Agent` fix stopped working — aiohttp now gets 404 for both usernames. Accept `disabled: true` when no API exists. |
| **API endpoints bypass Cloudflare** | Fandom `api.php` and Substack `/api/v1/` returned clean JSON while main pages were blocked by Cloudflare. Always try API paths on the same domain. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). Don't assume POST-only — Maigret can use GET `urlProbe` for GraphQL. |
| **Inspect Network tab for POST APIs** | Many modern platforms (e.g., Discord) heavily protect HTML profiles but expose unauthenticated `POST` endpoints for username checks. Maigret supports this natively: define `"request_method": "POST"` and `"request_payload": {"username": "{username}"}` in `data.json` to query them! |
| **Strict JSON markers are bulletproof** | When probing APIs, use `checkType: "message"` with exact JSON substrings (like `"{\"taken\": false}"`). Unlike HTML layout checks, this approach is immune to UI redesigns, A/B testing, and language translations. |
| **GraphQL supports GET too** | hashnode GraphQL works via `GET ?query=...` (URL-encoded). You can use either native POST payloads or GET `urlProbe` for GraphQL. |
| **URL-encode braces for template safety** | GraphQL `{...}` conflicts with Maigret's `{username}`. Use `%7B`/`%7D` for literal braces in `urlProbe` — `.format()` ignores percent-encoded chars. |
| **Anti-bot bypass via simple UA** | "Anubis" anti-bot PoW screens (like on Weblate) intercept requests with browser UAs via an HTTP 307 redirect. Setting `"headers": {"User-Agent": "python-requests/2.25.1"}` skips the challenge and restores the site's normal detection logic. |
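The brace-escaping trick from the table can be verified with a quick sketch (the URL and GraphQL field names are made up for illustration):

```python
# A GraphQL query with %7B/%7D standing in for literal { and }.
# str.format() only interprets bare braces, so percent-encoded ones
# pass through untouched while {username} is substituted.
probe = "https://example.com/graphql?query=%7Buser(username:\"{username}\")%7Bid%7D%7D"
url = probe.format(username="alice")
print(url)
# -> https://example.com/graphql?query=%7Buser(username:"alice")%7Bid%7D%7D
```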

## 8. Documentation maintenance

14 changes: 10 additions & 4 deletions docs/source/command-line-options.rst
@@ -31,19 +31,25 @@ two-letter country codes (**not a language!**). E.g. photo, dating, sport; jp, u
Multiple tags can be associated with one site. **Warning**: tags markup is
not stable now. Read more :doc:`in the separate section <tags>`.

``--exclude-tags`` - Exclude sites with specific tags from the search
(blacklist). E.g. ``--exclude-tags porn,dating`` will skip all sites
tagged with ``porn`` or ``dating``. Can be combined with ``--tags`` to
include certain categories while excluding others. Read more
:doc:`in the separate section <tags>`.
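The combined include/exclude semantics can be sketched as set operations (a simplified model for illustration, not Maigret's actual implementation):

```python
def site_selected(site_tags, include_tags, exclude_tags):
    """Return True if a site passes both tag filters."""
    site_tags = set(site_tags)
    if exclude_tags and site_tags & set(exclude_tags):
        return False  # a blacklisted tag is present
    if include_tags and not site_tags & set(include_tags):
        return False  # no whitelisted tag is present
    return True

# --tags forum --exclude-tags ru
print(site_selected({"forum", "en"}, ["forum"], ["ru"]))  # True
print(site_selected({"forum", "ru"}, ["forum"], ["ru"]))  # False
```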

``-n``, ``--max-connections`` - Allowed number of concurrent connections
**(default: 100)**.

``-a``, ``--all-sites`` - Use all sites for scan **(default: top 500)**.

``--top-sites`` - Count of sites for scan ranked by Alexa Top
``--top-sites`` - Count of sites for scan ranked by Majestic Million
**(default: top 500)**.

**Mirrors:** After the top *N* sites by Alexa rank are chosen (respecting
**Mirrors:** After the top *N* sites by Majestic Million rank are chosen (respecting
``--tags``, ``--use-disabled-sites``, etc.), Maigret may add extra sites
whose database field ``source`` names a **parent platform** that itself falls
in the Alexa top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Alexa, a mirror such as ``memory.lol``
in the Majestic Million top *N* when ranking **including disabled** sites. For example,
if ``Twitter`` ranks in the first 500 by Majestic Million, a mirror such as ``memory.lol``
(with ``source: Twitter``) is included even though it has no rank and would
otherwise be cut off. The same applies to Instagram-related mirrors (e.g.
Picuki) when ``Instagram`` is in that parent top *N* by rank—even if the
10 changes: 9 additions & 1 deletion docs/source/development.rst
@@ -22,9 +22,15 @@ The supported methods (``checkType`` values in ``data.json``) are:
- ``status_code`` - checks that status code of the response is 2XX
- ``response_url`` - check if there is not redirect and the response is 2XX

.. note::
   Maigret treats certain anti-bot HTTP status codes (such as LinkedIn's ``HTTP 999``) as an ordinary "not found" signal rather than reporting a server error, which prevents false positives.

See the details of check mechanisms in the `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py#L339>`_ file.

**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Alexa top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
.. note::
   Maigret now uses the **Majestic Million** dataset for site popularity sorting instead of the discontinued Alexa Rank API. For backward compatibility with existing configurations and parsers, the ranking field in ``data.json`` and internal site models remains named ``alexaRank`` / ``alexa_rank``.

**Mirrors and ``--top-sites``:** When you limit scans with ``--top-sites N``, Maigret also includes *mirror* sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Majestic Million top *N* when disabled sites are considered for ranking. See the **Mirrors** paragraph under ``--top-sites`` in :doc:`command-line-options`.
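A simplified model of the mirror rule described above (the function and field names are assumptions for illustration, not Maigret's internals):

```python
def top_sites_with_mirrors(sites, n):
    """Pick the top-n ranked enabled sites, then pull in mirror sites whose
    parent platform ranks in the top-n when disabled sites are counted."""
    ranked_all = sorted((s for s in sites if s.get("rank")),
                        key=lambda s: s["rank"])
    parent_names = {s["name"] for s in ranked_all[:n]}  # includes disabled
    enabled = [s for s in sites if not s.get("disabled")]
    top = sorted((s for s in enabled if s.get("rank")),
                 key=lambda s: s["rank"])[:n]
    mirrors = [s for s in enabled
               if s.get("source") in parent_names and s not in top]
    return top + mirrors

sites = [
    {"name": "Twitter", "rank": 1, "disabled": True},
    {"name": "GitHub", "rank": 2},
    {"name": "memory.lol", "source": "Twitter"},  # mirror, no rank of its own
]
print([s["name"] for s in top_sites_with_mirrors(sites, 1)])
# -> ['GitHub', 'memory.lol']
```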

Testing
-------
@@ -114,6 +120,8 @@ There are few options for sites data.json helpful in various cases:
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - the HTTP method to use (e.g., ``POST``); if unset, Maigret picks GET or HEAD automatically
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``); useful for querying GraphQL and other JSON APIs
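For example, a POST-based site entry could look like this (a hypothetical entry with invented URLs and marker strings; verify the field names against the current ``data.json`` schema):

```json
{
  "ExampleAPI": {
    "urlMain": "https://example.com/",
    "url": "https://example.com/@{username}",
    "urlProbe": "https://example.com/api/check-username",
    "requestMethod": "POST",
    "requestPayload": {"username": "{username}"},
    "checkType": "message",
    "errorMsg": "{\"taken\": false}"
  }
}
```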

``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16 changes: 16 additions & 0 deletions docs/source/tags.rst
@@ -23,3 +23,19 @@ Usage
``--tags coding`` -- search on sites related to software development.

``--tags ucoz`` -- search on uCoz sites only (mostly CIS countries)

Blacklisting (excluding) tags
------------------------------
You can exclude sites with certain tags from the search using ``--exclude-tags``:

``--exclude-tags porn,dating`` -- skip all sites tagged with ``porn`` or ``dating``.

``--exclude-tags ru`` -- skip all Russian sites.

You can combine ``--tags`` and ``--exclude-tags`` to fine-tune your search:

``--tags forum --exclude-tags ru`` -- search on forum sites, but skip Russian ones.

In the web interface, the tag cloud supports three states per tag:
click once to **include** (green), click again to **exclude** (dark/strikethrough),
and click once more to return to **neutral** (red).
2 changes: 1 addition & 1 deletion docs/source/usage-examples.rst
@@ -13,7 +13,7 @@ Use Cases
---------


1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Alexa rank) from the Maigret DB.
1. Search for accounts with username ``machine42`` on top 500 sites (by default, according to Majestic Million rank) from the Maigret DB.

.. code-block:: console

62 changes: 50 additions & 12 deletions maigret/checking.py
@@ -61,30 +61,49 @@ def __init__(self, *args, **kwargs):
self.headers = None
self.allow_redirects = True
self.timeout = 0
self.method = 'get'
self.payload = None

def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
self.headers = headers
self.allow_redirects = allow_redirects
self.timeout = timeout
self.method = method
self.payload = payload
return None

async def close(self):
pass

async def _make_request(
self, session, url, headers, allow_redirects, timeout, method, logger
self, session, url, headers, allow_redirects, timeout, method, logger, payload=None
) -> Tuple[str, int, Optional[CheckError]]:
try:
request_method = session.get if method == 'get' else session.head
async with request_method(
url=url,
headers=headers,
allow_redirects=allow_redirects,
timeout=timeout,
) as response:
if method.lower() == 'get':
request_method = session.get
elif method.lower() == 'post':
request_method = session.post
elif method.lower() == 'head':
request_method = session.head
else:
request_method = session.get

kwargs = {
'url': url,
'headers': headers,
'allow_redirects': allow_redirects,
'timeout': timeout,
}
if payload and method.lower() == 'post':
if headers and headers.get('Content-Type') == 'application/x-www-form-urlencoded':
kwargs['data'] = payload
else:
kwargs['json'] = payload

async with request_method(**kwargs) as response:
status_code = response.status
response_content = await response.content.read()
charset = response.charset or "utf-8"
@@ -141,6 +160,7 @@ async def check(self) -> Tuple[str, int, Optional[CheckError]]:
self.timeout,
self.method,
self.logger,
self.payload,
)

if error and str(error) == "Invalid proxy response":
@@ -165,7 +185,7 @@ def __init__(self, *args, **kwargs):
self.logger = kwargs.get('logger', Mock())
self.resolver = aiodns.DNSResolver(loop=loop)

def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
self.url = url
return None

@@ -191,7 +211,7 @@ class CheckerMock:
def __init__(self, *args, **kwargs):
pass

def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get'):
def prepare(self, url, headers=None, allow_redirects=True, timeout=0, method='get', payload=None):
return None

async def check(self) -> Tuple[str, int, Optional[CheckError]]:
@@ -220,6 +240,11 @@ def detect_error_page(
if status_code == 403 and not ignore_403:
return CheckError("Access denied", "403 status code, use proxy/vpn")

elif status_code == 999:
# LinkedIn anti-bot / HTTP 999 workaround. It shouldn't trigger an infrastructure
# Server Error because it represents a valid "Not Found / Blocked" state for the username.
pass

elif status_code >= 500:
return CheckError("Server", f"{status_code} status code")

@@ -494,7 +519,9 @@ def make_site_result(
for k, v in site.get_params.items():
url_probe += f"&{k}={v}"

if site.check_type == "status_code" and site.request_head_only:
if site.request_method:
request_method = site.request_method.lower()
elif site.check_type == "status_code" and site.request_head_only:
# In most cases when we are detecting by status code,
# it is not necessary to get the entire body: we can
# detect fine with just the HEAD response.
@@ -505,6 +532,15 @@
# not respond properly unless we request the whole page.
request_method = 'get'

payload = None
if site.request_payload:
payload = {}
for k, v in site.request_payload.items():
if isinstance(v, str):
payload[k] = v.format(username=username)
else:
payload[k] = v

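The payload-templating loop above behaves like this standalone sketch: `{username}` is substituted only in string values, and non-string values pass through unchanged.

```python
# Mirror of the payload-substitution logic, as a dict comprehension.
template = {"username": "{username}", "limit": 10}
payload = {k: v.format(username="alice") if isinstance(v, str) else v
           for k, v in template.items()}
print(payload)
# -> {'username': 'alice', 'limit': 10}
```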
if site.check_type == "response_url":
# Site forwards request to a different URL if username not
# found. Disallow the redirect so we can capture the
@@ -521,6 +557,7 @@
headers=headers,
allow_redirects=allow_redirects,
timeout=options['timeout'],
payload=payload,
)

# Store future request object in the results object
@@ -577,6 +614,7 @@ async def check_site_for_username(
allow_redirects=checker.allow_redirects,
timeout=checker.timeout,
method=checker.method,
payload=getattr(checker, 'payload', None),
)
response = await checker.check()

12 changes: 12 additions & 0 deletions maigret/maigret.py
@@ -277,6 +277,12 @@ def setup_arguments_parser(settings: Settings):
filter_group.add_argument(
"--tags", dest="tags", default='', help="Specify tags of sites (see `--stats`)."
)
filter_group.add_argument(
"--exclude-tags",
dest="exclude_tags",
default='',
help="Specify tags to exclude from search (blacklist).",
)
filter_group.add_argument(
"--site",
action="append",
@@ -532,6 +538,11 @@ async def main():
if args.tags:
args.tags = list(set(str(args.tags).split(',')))

if args.exclude_tags:
args.exclude_tags = list(set(str(args.exclude_tags).split(',')))
else:
args.exclude_tags = []

db_file = args.db_file \
if (args.db_file.startswith("http://") or args.db_file.startswith("https://")) \
else path.join(path.dirname(path.realpath(__file__)), args.db_file)
@@ -553,6 +564,7 @@
get_top_sites_for_id = lambda x: db.ranked_sites_dict(
top=args.top_sites,
tags=args.tags,
excluded_tags=args.exclude_tags,
names=args.site_list,
disabled=args.use_disabled_sites,
id_type=x,