Back to Hermes Agent

GitHub Archive Query Guide (BigQuery)

optional-skills/security/oss-forensics/references/github-archive-guide.md

2026.6.55.9 KB
Original Source

GitHub Archive Query Guide (BigQuery)

GitHub Archive records every public event on GitHub as immutable JSON records. This data is accessible via Google BigQuery and is the most reliable source for forensic investigation — events cannot be deleted or modified after recording.

Public Dataset

  • Project: githubarchive
  • Tables: day.YYYYMMDD, month.YYYYMM, year.YYYY
  • Cost: $6.25 per TiB scanned. Always run dry runs first.
  • Access: Requires a Google Cloud account with BigQuery enabled. Free tier includes 1 TiB/month of queries.

The 12 GitHub Event Types

Event TypeWhat It RecordsForensic Value
PushEventCommits pushed to a branchForce-push detection, commit timeline, author attribution
PullRequestEventPR opened, closed, merged, reopenedDeleted PR recovery, review timeline
IssuesEventIssue opened, closed, reopened, labeledDeleted issue recovery, social engineering traces
IssueCommentEventComments on issues and PRsDeleted comment recovery, communication patterns
CreateEventBranch, tag, or repository creationSuspicious branch creation, tag timing
DeleteEventBranch or tag deletionEvidence of cleanup after compromise
MemberEventCollaborator added or removedPermission changes, access escalation
PublicEventRepository made publicAccidental exposure of private repos
WatchEventUser stars a repositoryActor reconnaissance patterns
ForkEventRepository forkedExfiltration of code before cleanup
ReleaseEventRelease published, edited, deletedMalicious release injection, deleted release recovery
WorkflowRunEventGitHub Actions workflow triggeredCI/CD abuse, unauthorized workflow runs

Query Templates

Basic: All Events for a Repository

sql
SELECT
  created_at,
  type,
  actor.login,
  repo.name,
  payload
FROM
  `githubarchive.day.20240101`  -- Adjust date
WHERE
  repo.name = 'owner/repo'
  AND type IN ('PushEvent', 'DeleteEvent', 'MemberEvent')
ORDER BY
  created_at ASC

Force-Push Detection

Force-pushes produce PushEvents where commits are overwritten. Key indicators:

  • payload.distinct_size = 0 with payload.size > 0 → commits were erased
  • payload.before contains the SHA before the rewrite (recoverable)
sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.before') AS before_sha,
  JSON_EXTRACT_SCALAR(payload, '$.head') AS after_sha,
  JSON_EXTRACT_SCALAR(payload, '$.size') AS total_commits,
  JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS distinct_commits,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS branch_ref
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'PushEvent'
  AND repo.name = 'owner/repo'
  AND CAST(JSON_EXTRACT_SCALAR(payload, '$.distinct_size') AS INT64) = 0
ORDER BY
  created_at ASC

Deleted Branch/Tag Detection

sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.ref') AS deleted_ref,
  JSON_EXTRACT_SCALAR(payload, '$.ref_type') AS ref_type
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'DeleteEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC

Collaborator Permission Changes

sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.member.login') AS member
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'MemberEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC

CI/CD Workflow Activity

sql
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.name') AS workflow_name,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.conclusion') AS conclusion,
  JSON_EXTRACT_SCALAR(payload, '$.workflow_run.head_sha') AS head_sha
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202401' AND '202403'
  AND type = 'WorkflowRunEvent'
  AND repo.name = 'owner/repo'
ORDER BY
  created_at ASC

Actor Activity Profiling

sql
SELECT
  type,
  COUNT(*) AS event_count,
  MIN(created_at) AS first_event,
  MAX(created_at) AS last_event
FROM
  `githubarchive.month.*`
WHERE
  _TABLE_SUFFIX BETWEEN '202301' AND '202412'
  AND actor.login = 'suspicious-username'
GROUP BY type
ORDER BY event_count DESC

Cost Optimization (MANDATORY)

  1. Always dry run first: Add --dry_run flag to bq query to see estimated bytes scanned before executing.
  2. Use _TABLE_SUFFIX: Narrow the date range as much as possible. day.* tables are cheapest for narrow windows; month.* for broader sweeps.
  3. Select only needed columns: Avoid SELECT *. The payload column is large — only select specific JSON paths.
  4. Add LIMIT: Use LIMIT 1000 during exploration. Remove only for final exhaustive queries.
  5. Column filtering in WHERE: Filter on indexed columns (type, repo.name, actor.login) before payload extraction.

Cost estimation: A single month of GH Archive data is ~1-2 TiB uncompressed. Querying a specific repo + event type with _TABLE_SUFFIX typically scans 1-10 GiB ($0.006-$0.06).


Accessing via Hermes

Option A: BigQuery CLI (if gcloud is installed)

bash
bq query --use_legacy_sql=false --format=json "YOUR QUERY"

Option B: Python (via execute_code)

python
from google.cloud import bigquery
client = bigquery.Client()
query = "YOUR QUERY"
results = client.query(query).result()
for row in results:
    print(dict(row))

Option C: No GCP credentials available If BigQuery is unavailable, document this limitation in the report. Use the other 4 investigators (Git, GitHub API, Wayback Machine, IOC Enrichment) — they cover most investigation needs without BigQuery.