.. _development:
The human-readable list of supported sites is available in the `sites.md <https://github.com/soxoj/maigret/blob/main/sites.md>`_ file in the repository.
It is generated automatically from the main JSON file with the list of supported sites.

The machine-readable list of supported sites is available in the
`data.json <https://github.com/soxoj/maigret/blob/main/maigret/resources/data.json>`_ file in the ``resources`` directory.
The supported methods (``checkType`` values in ``data.json``) are:

- ``message`` - the most reliable method; checks that at least one string from ``presenceStrs`` is present and none of the strings from ``absenceStrs`` is present in the HTML response
- ``status_code`` - checks that the status code of the response is 2XX
- ``response_url`` - checks that there is no redirect and the response status is 2XX

.. note::
   Maigret treats specific anti-bot HTTP status codes (like LinkedIn's HTTP 999) as a standard "Not Found/Available" signal instead of reporting an infrastructure Server Error, which prevents false positives.
See the details of check mechanisms in the `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py#L339>`_ file.
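
As a rough illustration of the three ``checkType`` rules described above, the sketch below decides whether an account counts as found. It is not the code from ``checking.py``: the function ``is_claimed`` and its parameters are hypothetical, and treating an empty presence list as "no requirement" is an assumption made for the example.

```python
from typing import Sequence


def is_claimed(check_type: str, status: int, html: str = "",
               presence: Sequence[str] = (), absence: Sequence[str] = (),
               redirected: bool = False) -> bool:
    """Hypothetical helper that mirrors the three checkType descriptions."""
    ok_status = 200 <= status < 300
    if check_type == "message":
        # Assumption: an empty presence list means "no presence requirement".
        has_presence = not presence or any(s in html for s in presence)
        has_absence = any(s in html for s in absence)
        return has_presence and not has_absence
    if check_type == "status_code":
        return ok_status
    if check_type == "response_url":
        # Found only when the response is 2XX and no redirect happened.
        return ok_status and not redirected
    raise ValueError(f"unknown checkType: {check_type}")
```

For instance, ``is_claimed("message", 200, page, ["Profile of"], ["User not found"])`` is true only when ``page`` contains the first marker and not the second.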
.. note::
   Maigret now uses the Majestic Million dataset for site popularity sorting instead of the discontinued Alexa Rank API. For backward compatibility with existing configurations and parsers, the ranking field in ``data.json`` and internal site models remains named ``alexaRank`` and ``alexa_rank``.
**Mirrors and --top-sites:** when you limit scans with ``--top-sites N``, Maigret also includes mirror sites (entries whose ``source`` field points at a parent platform such as Twitter or Instagram) if that parent would appear in the Majestic Million top N when disabled sites are considered for ranking. See the Mirrors paragraph under ``--top-sites`` in :doc:`command-line-options`.
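
The mirror rule can be sketched as follows. This is a simplified illustration, not Maigret's actual selection code: the function ``select_top_sites`` is hypothetical and the sample entries and ranks are made up for the example.

```python
def select_top_sites(sites: dict, n: int) -> list:
    """Return enabled site names for a top-N scan, plus mirrors of top-N parents."""
    # Rank all sites that have a rank, disabled ones included.
    ranked_all = sorted(
        (name for name, site in sites.items() if "alexaRank" in site),
        key=lambda name: sites[name]["alexaRank"],
    )
    # The regular top-N selection only contains enabled sites.
    enabled_ranked = [name for name in ranked_all if not sites[name].get("disabled")]
    top_enabled = enabled_ranked[:n]
    # Parents are looked up in the ranking that still counts disabled sites.
    parents_in_top = set(ranked_all[:n])
    mirrors = [
        name for name, site in sites.items()
        if not site.get("disabled")
        and site.get("source") in parents_in_top
        and name not in top_enabled
    ]
    return top_enabled + mirrors


# Made-up sample data: a disabled parent and an enabled mirror of it.
sample = {
    "Twitter": {"alexaRank": 1, "disabled": True},
    "GitHub": {"alexaRank": 2},
    "Nitter": {"alexaRank": 5000, "source": "Twitter"},
    "SmallForum": {"alexaRank": 9000},
}
```

With this data, ``select_top_sites(sample, 1)`` keeps ``GitHub`` and also pulls in ``Nitter``, because its parent ranks in the top 1 once disabled sites are counted.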
It is recommended to use Python 3.10 for testing.
Install test requirements:

.. code-block:: console

   poetry install --with dev
Use the following commands to check Maigret:

.. code-block:: console

   make lint
   make format
   make test
   open htmlcov/index.html
   make speed
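Site names are the keys in ``data.json`` and appear in user-facing reports. Follow these rules:

- Use title case by default: ``Product Hunt``, ``Hacker News``.
- Use lowercase only when the brand itself is written that way: ``kofi``, ``note``, ``hi5``.
- Drop the domain suffix (``calendly.com`` → ``Calendly``), unless the domain is part of the recognized brand name: ``last.fm``, ``VC.ru``, ``Archive.org``.
- Keep acronyms in uppercase: ``VK``, ``CNET``, ``ICQ``, ``IFTTT``.
- Do not include a ``www.`` or ``https://`` prefix in the name.
- Keep spaces when the brand name has them: ``Star Citizen``, ``Google Maps``.
- For sites on user subdomains, use the pattern as the name: ``{username}.tilda.ws``.

When in doubt, check how the service refers to itself on its homepage.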
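
The prefix rule is easy to check mechanically. The helper below is a hypothetical sketch, not part of Maigret, and it covers only the ``www.`` and URL scheme rule; the other conventions need human judgment.

```python
def name_problems(name: str) -> list:
    """Return the prefix problems found in a site name (empty list if none)."""
    problems = []
    lowered = name.lower()
    if lowered.startswith(("http://", "https://")):
        problems.append("URL scheme prefix")
    if lowered.startswith("www.") or "://www." in lowered:
        problems.append("www. prefix")
    return problems
```

A name such as ``Product Hunt`` yields an empty list, while ``www.last.fm`` is reported with a ``www. prefix`` problem.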
If you want to work with the sites database, don't forget to activate the statistics update git hook with the following command: ``git config --local core.hooksPath .githooks/``.

Make your git commits from your maigret git repo folder; otherwise the hook will not find the statistics update script.
If you already know which site has a false-positive and want to fix it specifically, go to the next step.
Otherwise, simply run a search with a random username (e.g. ``laiuhi3h4gi3u4hgt``) and check the results.
Alternatively, you can use the `Telegram bot <https://t.me/osint_maigret_bot>`_.
Find the site in the `data.json <https://github.com/soxoj/maigret/blob/main/maigret/resources/data.json>`_ file.

If the ``checkType`` method is not ``message`` and you are going to fix the check, update it:

- put ``message`` in ``checkType``
- put in ``absenceStrs`` a keyword that is present in the HTML response for a non-existing account
- put in ``presenceStrs`` a keyword that is present in the HTML response for an existing account

If you have trouble determining the right keywords, you can use automatic detection by passing the account URL with the ``--submit`` option:

.. code-block:: console

   maigret --submit https://my.mail.ru/bk/alex
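
When picking keywords by hand, comparing the HTML of an existing and a non-existing profile helps. The sketch below is only an illustration of that comparison and says nothing about how ``--submit`` works internally; ``candidate_markers`` and the sample pages are hypothetical.

```python
import re


def _tokens(html: str, min_len: int) -> set:
    # Rough tokenization: runs of word characters and hyphens.
    return {t for t in re.findall(r"[\w-]+", html) if len(t) >= min_len}


def candidate_markers(claimed_html: str, unclaimed_html: str, min_len: int = 4):
    """Return (presence_candidates, absence_candidates) as sorted lists."""
    claimed = _tokens(claimed_html, min_len)
    unclaimed = _tokens(unclaimed_html, min_len)
    # Tokens unique to each page are candidates for the matching marker list.
    return sorted(claimed - unclaimed), sorted(unclaimed - claimed)


# Made-up pages for an existing and a non-existing account.
presence, absence = candidate_markers(
    '<div class="profile-header">alex</div>',
    '<div class="error-404">not found</div>',
)
```

Tokens that contain the username itself (``alex`` here) should be discarded by hand, since they will not match other accounts.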
To disable checking, set ``disabled`` to ``true`` or simply run:

.. code-block:: console

   maigret --self-check --site [email protected]
To debug the check method using the response HTML, you can run:

.. code-block:: console

   maigret soxoj --site [email protected] -d 2> response.txt
A few site options in ``data.json`` are helpful in various cases:

- ``engine`` - a predefined check for sites of a certain type (e.g. forums), see the ``engines`` section in the JSON file
- ``headers`` - a dictionary of additional headers to be sent to the site
- ``requestHeadOnly`` - set to ``true`` if it's enough to make a HEAD request to the site
- ``regexCheck`` - a regex to check if the username is valid, in case of frequent false-positives
- ``requestMethod`` - the HTTP method to use (e.g., ``POST``). By default, Maigret uses GET or HEAD.
- ``requestPayload`` - a dictionary with the JSON payload to send for POST requests (e.g., ``{"username": "{username}"}``), useful for querying GraphQL or modern JSON APIs.
- ``protection`` - a list of protection types detected on the site (see below).

``protection`` (site protection tracking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``protection`` field records what kind of anti-bot protection a site uses. Maigret reads this field and applies the matching bypass or detection logic where one exists.

Supported values:

- ``tls_fingerprint`` - the site fingerprints the TLS handshake (JA3/JA4) and blocks non-browser clients. Maigret automatically uses ``curl_cffi`` with Chrome browser emulation to bypass this. Requires the ``curl_cffi`` package (included as a dependency). Examples: Instagram, NPM, Codepen, Kickstarter, Letterboxd.
- ``ip_reputation`` - the site blocks requests from datacenter/cloud IPs regardless of headers or TLS. Cannot be bypassed automatically; run Maigret from a regular internet connection (not a datacenter) or use a proxy (``--proxy``). Examples: Reddit, Patreon, Figma.
- ``js_challenge`` - the site serves a JavaScript challenge page (e.g. "Just a moment...") that cannot be solved without a browser. Maigret detects challenge signatures and returns UNKNOWN instead of a false positive.

Example:

.. code-block:: json

   "Instagram": {
       "url": "https://www.instagram.com/{username}/",
       "checkType": "message",
       "presenseStrs": ["\"routePath\":\"\\/"],
       "absenceStrs": ["\"routePath\":null"],
       "protection": ["tls_fingerprint"]
   }
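
To make the three values concrete, here is a minimal sketch of how a ``protection`` list could be turned into a handling plan. The names ``plan_for``, ``looks_like_challenge`` and ``CHALLENGE_SIGNATURES`` are hypothetical and the single signature is the example quoted above; Maigret's real logic may differ.

```python
# Example signature taken from the description above; a real list would be longer.
CHALLENGE_SIGNATURES = ("Just a moment...",)


def plan_for(protection: list) -> dict:
    """Map protection values to a coarse handling plan (illustrative only)."""
    plan = {"client": "default", "proxy_recommended": False, "detect_challenge": False}
    if "tls_fingerprint" in protection:
        # Browser-like TLS handshake is needed, so switch the HTTP client.
        plan["client"] = "curl_cffi"
    if "ip_reputation" in protection:
        # Nothing to bypass automatically; the user has to change the network.
        plan["proxy_recommended"] = True
    if "js_challenge" in protection:
        # The body must be inspected so a challenge page is not read as a profile.
        plan["detect_challenge"] = True
    return plan


def looks_like_challenge(html: str) -> bool:
    """Return True when the body contains a known challenge signature."""
    return any(signature in html for signature in CHALLENGE_SIGNATURES)
```

A caller would report an unknown status, rather than a found account, whenever ``looks_like_challenge`` returns true for the response body.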
``urlProbe`` (optional profile probe URL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default Maigret performs the HTTP request to the same URL as ``url`` (the public profile link pattern).

If you set ``urlProbe`` in ``data.json``, Maigret fetches that URL for the presence check (API, GraphQL, JSON endpoint, etc.), while reports and ``url_user`` still use ``url``, the human-readable profile page users should open.

Placeholders: ``{username}``, ``{urlMain}``, ``{urlSubpath}`` (same as for ``url``). Example: GitHub uses ``url`` ``https://github.com/{username}`` and ``urlProbe`` ``https://api.github.com/users/{username}``; Picsart uses the web profile ``https://picsart.com/u/{username}`` and probes ``https://api.picsart.com/users/show/{username}.json``.

Implementation: ``make_site_result`` in `checking.py <https://github.com/soxoj/maigret/blob/main/maigret/checking.py>`_.
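
The split between the probed URL and the reported URL can be sketched like this. The function ``resolve_urls`` is a hypothetical illustration and not the code of ``make_site_result``; it assumes the placeholders are plain ``str.format`` fields.

```python
def resolve_urls(site: dict, username: str) -> tuple:
    """Return (probe_url, report_url) for a site entry and a username."""
    fields = {
        "username": username,
        "urlMain": site.get("urlMain", ""),
        "urlSubpath": site.get("urlSubpath", ""),
    }
    # The report always points at the human-readable profile page.
    report_url = site["url"].format(**fields)
    # The request goes to urlProbe when it is set, otherwise to url.
    probe_url = site.get("urlProbe", site["url"]).format(**fields)
    return probe_url, report_url


github = {
    "url": "https://github.com/{username}",
    "urlProbe": "https://api.github.com/users/{username}",
}
```

Calling ``resolve_urls(github, "soxoj")`` gives the API address for the check and the regular profile link for the report.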
.. note::
   The ``LLM/`` directory at the root of the repository contains detailed instructions for editing site checks (in Markdown format): checklist, full guide to ``checkType`` / ``data.json`` / ``urlProbe``, handling false positives, searching for public JSON APIs, and the proposal log for ``socid_extractor``.

Main files:

- `site-checks-playbook.md <https://github.com/soxoj/maigret/blob/main/LLM/site-checks-playbook.md>`_ - short checklist
- `site-checks-guide.md <https://github.com/soxoj/maigret/blob/main/LLM/site-checks-guide.md>`_ - detailed guide
- `socid_extractor_improvements.log <https://github.com/soxoj/maigret/blob/main/LLM/socid_extractor_improvements.log>`_ - template and entries for identity extractor improvements

These files should be kept up-to-date whenever changes are made to the check logic in the code or in ``data.json``.
.. _activation-mechanism:
The activation mechanism helps make requests to sites requiring additional authentication like cookies, JWT tokens, or custom headers.
Let's study the Vimeo site check record from the Maigret database:

.. code-block:: json

   "Vimeo": {
       "tags": [
           "us",
           "video"
       ],
       "headers": {
           "Authorization": "jwt eyJ0..."
       },
       "activation": {
           "url": "https://vimeo.com/_rv/viewer",
           "marks": [
               "Something strange occurred. Please get in touch with the app's creator."
           ],
           "method": "vimeo"
       },
       "urlProbe": "https://api.vimeo.com/users/{username}?fields=name...",
       "checkType": "status_code",
       "alexaRank": 148,
       "urlMain": "https://vimeo.com/",
       "url": "https://vimeo.com/{username}",
       "usernameClaimed": "blue",
       "usernameUnclaimed": "noonewouldeverusethis7"
   },
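The activation method is:

.. code-block:: python

   import requests


   def vimeo(site, logger, cookies={}):
       # Copy the headers and drop the expired token before asking for a new one.
       headers = dict(site.headers)
       if "Authorization" in headers:
           del headers["Authorization"]
       # The activation URL returns a JSON document with a fresh JWT.
       r = requests.get(site.activation["url"], headers=headers)
       jwt_token = r.json()["jwt"]
       # Store the new token so that subsequent checks use it.
       site.headers["Authorization"] = "jwt " + jwt_token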
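Here's how the activation process works when a JWT token becomes invalid:

1. A request is made to ``urlProbe`` with the invalid token.
2. The response contains a string listed in the ``activation``/``marks`` field.
3. The ``vimeo`` activation function is triggered.
4. The function requests a fresh JWT from the activation URL and stores it in the ``Authorization`` header for subsequent checks.

Examples of activation mechanism implementations are available in the `activation.py <https://github.com/soxoj/maigret/blob/main/maigret/activation.py>`_ file.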
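
The flow can be sketched without any network access. Everything below is a hypothetical illustration: ``check_with_activation``, ``fake_fetch`` and ``fake_activation`` are not Maigret functions, the site is a plain dictionary, and the immediate retry is a simplification of how a later check picks up the refreshed token.

```python
def check_with_activation(site, fetch, activations, max_attempts=2):
    """Fetch the probe; if an activation mark appears, refresh and try again."""
    for attempt in range(max_attempts):
        status, body = fetch(site)
        marks = site.get("activation", {}).get("marks", [])
        expired = any(mark in body for mark in marks)
        if not expired or attempt == max_attempts - 1:
            return status, body
        # A mark was found: run the activation function named in the record.
        activations[site["activation"]["method"]](site)


calls = []


def fake_fetch(site):
    # Stand-in for the HTTP request: only the fresh token is accepted.
    token = site["headers"]["Authorization"]
    calls.append(token)
    if token == "jwt fresh":
        return 200, '{"name": "blue"}'
    return 401, "Something strange occurred. Please get in touch with the app's creator."


def fake_activation(site):
    # Stand-in for an activation function such as the one shown for Vimeo.
    site["headers"]["Authorization"] = "jwt fresh"


demo_site = {
    "headers": {"Authorization": "jwt old"},
    "activation": {"marks": ["Something strange occurred"], "method": "fake"},
}
result = check_with_activation(demo_site, fake_fetch, {"fake": fake_activation})
```

The first request fails with the marked error text, the activation function replaces the token, and the second request succeeds.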
Collaborator rights are required; write to Soxoj to get them.

To publish a new version, first create a new branch in the repository with a bumped version number and an up-to-date changelog. After that, create a release, and a GitHub Action will automatically build a new PyPI package.

.. code-block:: console

   git checkout -b 0.4.0
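Add a section to ``CHANGELOG.md`` with the current date.

Fill the new section with auto-generated release notes:

- Click ``Choose a tag`` and enter ``v0.4.0`` (your version)
- Click ``Create new tag``
- Press ``+ Auto-generate release notes``
- Paste the generated text into ``CHANGELOG.md``
- Remove the ``## What's Changed`` line and the ``## New Contributors`` section if it exists

Then commit and push:

.. code-block:: console

   git add -p
   git commit -m 'Bump to YOUR VERSION'
   git push origin HEAD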
Merge pull request
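Create new release:

- Click ``Choose a tag``
- Enter the version in the format ``v0.4.0``
- Enter the same version in the ``Release title`` field
- Click ``Create new tag``
- Press ``+ Auto-generate release notes``

Documentation is auto-generated and auto-deployed from the ``docs`` directory.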
To manually update documentation:

1. Edit the ``.rst`` files in the ``docs/source`` directory.
2. Run ``pip install -r requirements.txt`` in the ``docs`` directory.
3. Run ``make singlehtml`` in the terminal in the ``docs`` directory.
4. Open ``build/singlehtml/index.html`` in your browser to see the result.

.. warning:: This roadmap requires updating to reflect the current project status and future plans.
.. figure:: https://i.imgur.com/kk8cFdR.png
   :target: https://i.imgur.com/kk8cFdR.png
   :align: center