optional-skills/security/oss-forensics/references/github-archive-guide.md
GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation — events cannot be deleted or modified after recording.
githubarchiveday.YYYYMMDD, month.YYYYMM, year.YYYY| Event Type | What It Records | Forensic Value |
|---|---|---|
PushEvent | Commits pushed to a branch | Force-push detection, commit timeline, author attribution |
PullRequestEvent | PR opened, closed, merged, reopened | Deleted PR recovery, review timeline |
IssuesEvent | Issue opened, closed, reopened, labeled | Deleted issue recovery, social engineering traces |
IssueCommentEvent | Comments on issues and PRs | Deleted comment recovery, communication patterns |
CreateEvent | Branch, tag, or repository creation | Suspicious branch creation, tag timing |
DeleteEvent | Branch or tag deletion | Evidence of cleanup after compromise |
MemberEvent | Collaborator added or removed | Permission changes, access escalation |
PublicEvent | Repository made public | Accidental exposure of private repos |
WatchEvent | User stars a repository | Actor reconnaissance patterns |
ForkEvent | Repository forked | Exfiltration of code before cleanup |
ReleaseEvent | Release published, edited, deleted | Malicious release injection, deleted release recovery |
WorkflowRunEvent | GitHub Actions workflow triggered | CI/CD abuse, unauthorized workflow runs |
SELECT
created_at,
type,
actor.login,
repo.name,
payload
FROM
`githubarchive.day.20240101` -- Adjust date
WHERE
repo.name = 'owner/repo'
AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
ORDER BY
created_at ASC
Force-pushes produce PushEvents where commits are overwritten. Key indicators:
payload.distinct_size = 0 with payload.size > 0 → commits were erasedpayload.before contains the SHA before the rewrite (recoverable)SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'PushEvent'
AND repo.name = 'owner/repo'
AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
ORDER BY
created_at ASC
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'DeleteEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'MemberEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
SELECT
created_at,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202401' AND '202403'
AND type = 'WorkflowRunEvent'
AND repo.name = 'owner/repo'
ORDER BY
created_at ASC
SELECT
type,
COUNT(*) AS event_count,
MIN(created_at) AS first_event,
MAX(created_at) AS last_event
FROM
`githubarchive.month.*`
WHERE
_TABLE_SUFFIX BETWEEN '202301' AND '202412'
AND actor.login = 'suspicious-username'
GROUP BY type
ORDER BY event_count DESC
--dry_run flag to bq query to see estimated bytes scanned before executing._TABLE_SUFFIX: Narrow the date range as much as possible. day.* tables are cheapest for narrow windows; month.* for broader sweeps.SELECT *. The payload column is large — only select specific JSON paths.LIMIT 1000 during exploration. Remove only for final exhaustive queries.type, repo.name, actor.login) before payload extraction.Cost estimation: A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with _TABLE_SUFFIX typically scans 1-10 GiB ($0.006-$0.06).
Option A: BigQuery CLI (if gcloud is installed)
bq query --use_legacy_sql=false --format=json "YOUR QUERY"
Option B: Python (via execute_code)
from google.cloud import bigquery
client = bigquery.Client()
query = "YOUR QUERY"
results = client.query(query).result()
for row in results:
print(dict(row))
Option C: No GCP credentials available If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment) — they cover most investigation needs without BigQuery.