sql/validation/query_github_profiles.sql ·
GitHub Archive · 2021-01 – 2026-03
Scans every GitHub event from January 2021 through March 2026, keeps only events touching AI-relevant repositories matched by repo name via regex, then aggregates per user per calendar month into a weighted score.
load_scores.py.
| Column | Type | Description |
|---|---|---|
username | STRING | GitHub actor login |
period | DATE | First day of the calendar month |
score | FLOAT | Sum of event weight × repo multiplier |
total_events | INT | Raw event count for that user-month |
unique_repos | INT | Distinct repos touched |
event_variety | INT | Distinct event types used |
efficiency | FLOAT | Average weighted score per event (score / total_events) |
| Event type | Weight | Rationale |
|---|---|---|
ReleaseEvent | 30 | Shipping software publicly |
PullRequestReviewEvent | 25 | High-effort code review |
DeploymentEvent | 25 | Production deployment |
PullRequestEvent | 20 | Substantive contribution |
CreateEvent | 15 | Branch / tag / repo creation |
CheckRunEvent | 15 | CI automation |
PushEvent | 10 | Code commit |
WorkflowRunEvent | 10 | Workflow execution |
MemberEvent | 10 | Team management |
IssuesEvent | 5 | Issue opened/closed |
IssueCommentEvent | 4 | Discussion participation |
ForkEvent | 2 | Passive interest |
WatchEvent | 1 | Star / watch |
| all others | 1 | Baseline |
| Tier | Multiplier | Pattern examples |
|---|---|---|
| ELITE L5-L6 | 5.0× | grpo, vllm, sglang, triton-lang, sae-lens, unsloth, deepseek-r1, transformer-lens, cuda-mode |
| AGENTIC L4 | 3.5× | langgraph, pydantic-ai, mcp-server, model-context-protocol, fastmcp, crewai, dspy, dify, langflow |
| BUILDER L3 | 2.0× | vercel-ai, ai-sdk, langchain, llamaindex, huggingface, trl, axolotl, llama-factory, ragas, deepeval |
| VIBE/USER L1-L2 | 1.5× | cursor, aider, bolt-new, ollama, open-webui, lm-studio, lmstudio |
| BASELINE | 1.0× | everything else that passes the WHERE filter |
Any event whose repo.name matches at least one of these patterns is included. Two patterns are isolated for correct dot-escaping (llama\.cpp, llm\.c).
-- GitHub AI Talent - BigQuery Aggregated Monthly Scores
-- Source: githubarchive.month.* | Range: 202101–202603
SELECT
actor.login AS username,
DATE_TRUNC(DATE(created_at), MONTH) AS period,
-- score = SUM(event_weight × repo_multiplier)
SUM(
CASE
WHEN type = 'ReleaseEvent' THEN 30
WHEN type = 'PullRequestReviewEvent' THEN 25
WHEN type = 'DeploymentEvent' THEN 25
WHEN type = 'PullRequestEvent' THEN 20
WHEN type = 'CreateEvent' THEN 15
WHEN type = 'PushEvent' THEN 10
WHEN type = 'IssuesEvent' THEN 5
WHEN type = 'IssueCommentEvent' THEN 4
WHEN type = 'ForkEvent' THEN 2
WHEN type = 'WatchEvent' THEN 1
ELSE 1
END
*
CASE
WHEN REGEXP_CONTAINS(LOWER(repo.name), r'grpo|vllm|triton-lang|...') THEN 5.0 -- ELITE
WHEN REGEXP_CONTAINS(LOWER(repo.name), r'langgraph|crewai|...') THEN 3.5 -- AGENTIC
WHEN REGEXP_CONTAINS(LOWER(repo.name), r'langchain|llamaindex|...') THEN 2.0 -- BUILDER
WHEN REGEXP_CONTAINS(LOWER(repo.name), r'cursor|aider|ollama|...') THEN 1.5 -- VIBE/USER
ELSE 1.0
END
) AS score,
COUNT(*) AS total_events,
COUNT(DISTINCT repo.name) AS unique_repos,
COUNT(DISTINCT type) AS event_variety,
ROUND(score / total_events, 2) AS efficiency
FROM `githubarchive.month.*`
WHERE _TABLE_SUFFIX BETWEEN '202101' AND '202603'
AND (
REGEXP_CONTAINS(LOWER(repo.name), r'llama\.cpp')
OR REGEXP_CONTAINS(LOWER(repo.name), r'llm\.c')
OR REGEXP_CONTAINS(LOWER(repo.name), /* 12-group pattern */)
)
GROUP BY actor.login, DATE_TRUNC(DATE(created_at), MONTH)
ORDER BY username, period;