
Build a Claude-Powered PR Code Review Bot in 2026

An automated PR review bot using Claude Sonnet 4.5 — diff parsing, severity tagging, comment posting, and CI integration. With code and Claudexia setup.

Pull request reviews are the bottleneck of modern engineering. Senior engineers spend hours skimming diffs that are 90% boilerplate, miss the 10% that actually matters, and ship subtle bugs anyway. In 2026, there's no good reason to keep doing this by hand. Claude Sonnet 4.5 is the strongest publicly available coding model, and pairing it with the GitHub API gives you an in-house reviewer that runs on every PR, costs cents, and never gets tired.

This guide walks through building one — not a toy, a real bot you can ship to production today. We'll use Claude Sonnet 4.5 via Claudexia (https://api.claudexia.tech/v1), which gives you the same Anthropic API surface with cheaper rates and no rate-limit drama.

Why Claude for Code Review

Three reasons, in order of importance:

  1. Coding ability. Sonnet 4.5 sits at the top of SWE-bench Verified and reads diffs the way a senior reviewer does — it tracks types across files, notices missing error handling, and catches off-by-ones that linters miss. GPT-5 and Gemini 3 are close, but for review (which rewards conservatism and explanation), Claude wins on signal-to-noise.
  2. Structured output reliability. Claude follows JSON schemas under tool_use with near-100% compliance. That matters because your bot's output has to parse cleanly every time — one malformed JSON in a thousand runs and you're paging on-call.
  3. Long context + prompt caching. 200K tokens lets you stuff the whole repo's style guide, architectural notes, and recent commits into the system prompt. Prompt caching makes that effectively free after the first call. See our Claude API pricing breakdown for cache economics.

Compared with hosted services like CodeRabbit and Greptile: those are fine, but you're handing them your source code, paying $30–$100 per dev/month, and accepting whatever prompt they wrote. A self-hosted bot costs $0.05–$0.30 per PR, runs on your own prompts, and your code never leaves your CI.

Architecture

The full pipeline is six steps:

GitHub PR opened/updated
   ↓
GitHub Action triggers
   ↓
Fetch diff via GitHub API
   ↓
Send to Claude Sonnet 4.5 (via Claudexia)
   ↓
Parse JSON response
   ↓
Post review comments via GitHub API

That's it. No vector DB, no embeddings, no agent loop. The diff is small (usually <50K tokens), Claude reads it once, and emits structured comments. Anything more complex is over-engineering until you have data telling you otherwise.

The Prompt

The prompt is 80% of the work. Here's the template that's worked for us across Go, TypeScript, and Python codebases:

You are a senior staff engineer reviewing a pull request. Your job is to
catch bugs, security issues, and design mistakes — NOT to enforce style
(linters do that) and NOT to praise good code.

SCOPE:
- Only comment on lines changed in this diff.
- Do not comment on unchanged context lines.
- Do not request changes you cannot justify with a concrete failure mode.

SEVERITY:
- blocker: will cause production incident, data loss, security breach, or
  obvious correctness bug. Must be fixed before merge.
- warning: real issue but not blocking. Race conditions under unlikely
  load, missing error paths, unclear naming that will bite later.
- nit: minor suggestion. Use sparingly — at most 2 per PR.

OUTPUT:
Call the `post_review` tool with a JSON list of comments. Each comment
must have:
- file: path relative to repo root
- line: line number in the new file (after the change)
- severity: "blocker" | "warning" | "nit"
- comment: 1–3 sentences. Start with the problem, then the fix.
- justification: why this matters. If you can't write this, drop the comment.

If the PR looks good, return an empty list. Do not invent issues.

REPO CONTEXT:
{cached system block: style guide, architecture notes, recent decisions}

DIFF:
{the actual diff}

Two non-obvious things make this work:

The justification field is a forcing function against false positives. Claude (and every other model) will hallucinate concerns if you let it. Requiring a written justification cuts noise by ~60% in our internal eval — the model self-censors weak comments before emitting them.

tool_use instead of "respond with JSON". Defining post_review as a tool with a strict input schema means the API itself rejects malformed output. You never have to write a JSON repair function.
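Pinning the comment shape down as a TypeScript type makes both halves of the contract explicit. This is our own naming (the bot code later in the article leaves it untyped); the fields mirror the prompt's OUTPUT section:

```typescript
// The comment shape the prompt's OUTPUT section describes. Type and field
// names are ours; the article's bot code leaves this untyped.
type Severity = "blocker" | "warning" | "nit";

interface ReviewComment {
  file: string;          // path relative to repo root
  line: number;          // line number in the new file (after the change)
  severity: Severity;
  comment: string;       // 1–3 sentences: problem first, then the fix
  justification: string; // why it matters; drop the comment if you can't write it
}
```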

Tool Definition

const postReviewTool = {
  name: "post_review",
  description: "Post review comments on the PR. Pass an empty array if no issues found.",
  input_schema: {
    type: "object",
    properties: {
      comments: {
        type: "array",
        items: {
          type: "object",
          properties: {
            file: { type: "string" },
            line: { type: "integer" },
            severity: {
              type: "string",
              enum: ["blocker", "warning", "nit"]
            },
            comment: { type: "string" },
            justification: { type: "string" }
          },
          required: ["file", "line", "severity", "comment", "justification"]
        }
      }
    },
    required: ["comments"]
  }
};

Set tool_choice: { type: "tool", name: "post_review" } to force the model to call it. No prose responses, no leakage.

The Bot (TypeScript)

import Anthropic from "@anthropic-ai/sdk";
import { Octokit } from "@octokit/rest";

const claude = new Anthropic({
  apiKey: process.env.CLAUDEXIA_API_KEY,
  baseURL: "https://api.claudexia.tech/v1",
});

const gh = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function reviewPR(owner: string, repo: string, pull_number: number) {
  // 1. Fetch the diff
  const { data: files } = await gh.pulls.listFiles({
    owner, repo, pull_number, per_page: 100,
  });

  // 2. Chunk by file — parallelize for large PRs
  const reviews = await Promise.all(
    files
      .filter(f => f.patch && f.status !== "removed")
      .map(file => reviewFile(file))
  );

  // 3. Flatten + dedupe by (file, line, severity)
  const all = reviews.flat();
  const seen = new Set<string>();
  const unique = all.filter(c => {
    const k = `${c.file}:${c.line}:${c.severity}`;
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });

  // 4. Post as a single review
  await gh.pulls.createReview({
    owner, repo, pull_number,
    event: unique.some(c => c.severity === "blocker")
      ? "REQUEST_CHANGES"
      : "COMMENT",
    comments: unique.map(c => ({
      path: c.file,
      line: c.line,
      body: `**[${c.severity}]** ${c.comment}\n\n_${c.justification}_`,
    })),
  });
}

async function reviewFile(file: any) {
  const response = await claude.messages.create({
    model: "claude-sonnet-4-5", // Anthropic model IDs use hyphens, not a dot
    max_tokens: 4096,
    system: [
      {
        type: "text",
        text: REPO_CONTEXT,           // style guide, arch notes
        cache_control: { type: "ephemeral" },
      },
      {
        type: "text",
        text: REVIEW_INSTRUCTIONS,    // the prompt above
      },
    ],
    tools: [postReviewTool],
    tool_choice: { type: "tool", name: "post_review" },
    messages: [
      {
        role: "user",
        content: `File: ${file.filename}\n\nDiff:\n\`\`\`diff\n${file.patch}\n\`\`\``,
      },
    ],
  });

  const toolUse = response.content.find(b => b.type === "tool_use");
  return toolUse ? (toolUse.input as any).comments : [];
}
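One sharp edge when posting: GitHub's `createReview` rejects the entire review if any comment targets a line that isn't part of the diff, and models do occasionally emit such line numbers. A small validator (our addition, assuming GitHub's standard unified-diff `patch` format) computes which new-file lines are actually commentable:

```typescript
// Extract the set of new-file line numbers added by a unified-diff patch
// (the `patch` field GitHub returns per file). Hunk headers look like
// "@@ -a,b +c,d @@"; added lines are commentable, context lines only
// advance the new-file counter, removed lines don't touch it.
function changedLines(patch: string): Set<number> {
  const lines = new Set<number>();
  let newLine = 0;
  for (const row of patch.split("\n")) {
    const hunk = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/.exec(row);
    if (hunk) {
      newLine = parseInt(hunk[1], 10); // start of this hunk in the new file
      continue;
    }
    if (row.startsWith("+")) {
      lines.add(newLine); // added line: safe to comment on
      newLine++;
    } else if (!row.startsWith("-") && !row.startsWith("\\")) {
      newLine++; // context line: advances the counter, not commentable
    }
  }
  return lines;
}
```

Filtering the deduped comments through `changedLines(file.patch)` before calling `createReview` turns a hard 422 failure into a silently dropped hallucination.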

Note the cache_control on the repo context block. That's prompt caching — the first PR pays a one-time cache-write premium (25% over the base input rate) for those tokens, and every subsequent PR within the 5-minute cache window pays 10% of the base rate to read them. On a busy repo this drops cost by 70%+.
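To run this from CI, the script needs a small entrypoint that reads the env vars the GitHub Actions workflow below passes in. A hypothetical sketch for scripts/review.ts, with `reviewPR` stubbed so it stands alone:

```typescript
// Hypothetical entrypoint for scripts/review.ts, wiring in the env vars
// the GitHub Actions workflow provides. reviewPR is the function defined
// above (stubbed here so this sketch is self-contained).
async function reviewPR(owner: string, repo: string, pull_number: number) {
  /* see the full implementation above */
}

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}

async function main() {
  const pull_number = Number(requireEnv("PR_NUMBER"));
  if (!Number.isInteger(pull_number)) {
    throw new Error(`PR_NUMBER is not an integer: ${process.env.PR_NUMBER}`);
  }
  await reviewPR(requireEnv("REPO_OWNER"), requireEnv("REPO_NAME"), pull_number);
}

// GITHUB_ACTIONS is set to "true" inside any Actions runner.
if (process.env.GITHUB_ACTIONS) {
  main().catch(err => {
    console.error(err);
    process.exit(1); // fail the step so the PR shows a red check
  });
}
```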

Large Diff Handling

A 5,000-line PR will blow your token budget if you send it as one blob. The chunking strategy above (one Claude call per file) handles this naturally:

  • Per-file calls run in parallel — wall time stays at ~3–5 seconds even for 30-file PRs.
  • Each call gets focused context — Claude reviews auth.ts without package-lock.json in the way.
  • Dedupe across calls — sometimes Claude flags the same pattern in two files; the (file, line, severity) key strips duplicates.

For genuinely massive PRs (>100 files), add a pre-filter that skips lockfiles, generated code, and vendor/. Most "huge PRs" shrink to 10 reviewable files once you do.
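A sketch of that pre-filter, with an illustrative (not exhaustive) pattern list — tune it to your repo:

```typescript
// Skip files that burn tokens without producing useful review comments:
// lockfiles, vendored dependencies, build output, generated code.
// Pattern list is illustrative, not exhaustive.
const SKIP_PATTERNS: RegExp[] = [
  /package-lock\.json$/,
  /yarn\.lock$/,
  /pnpm-lock\.yaml$/,
  /go\.sum$/,
  /^vendor\//,
  /^dist\//,
  /\.min\.(js|css)$/,
  /\.(pb|generated)\.(go|ts|py)$/,
];

function isReviewable(filename: string): boolean {
  return !SKIP_PATTERNS.some(p => p.test(filename));
}
```

Apply it in the `files.filter(...)` step of `reviewPR`, alongside the existing `patch` and `status` checks.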

GitHub Actions Setup

name: Claude Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npx tsx scripts/review.ts
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          CLAUDEXIA_API_KEY: ${{ secrets.CLAUDEXIA_API_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO_OWNER: ${{ github.repository_owner }}
          REPO_NAME: ${{ github.event.repository.name }}

Drop your Claudexia key into repo secrets, commit the YAML, and you're live.

Cost Per PR

Real numbers from our own deployment over 200 PRs:

| PR size | Tokens in | Tokens out | Cost (cached) |
|---|---|---|---|
| Small (<10 files) | ~8K | ~500 | $0.04 |
| Medium (10–30) | ~25K | ~1.5K | $0.12 |
| Large (30–80) | ~70K | ~3K | $0.28 |

That's with Claude Sonnet 4.5 pricing on Claudexia and prompt caching active. A 50-engineer team doing 100 PRs/day spends roughly $300/month — less than one CodeRabbit Pro seat.
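As a sanity check on that monthly figure, here's the arithmetic with an assumed 60/30/10 small/medium/large PR mix (the mix and the 30-day month are our assumptions, not measured data):

```typescript
// Estimated monthly spend from the per-PR costs in the table above.
// The 60/30/10 size mix and 30-day month are assumptions.
function monthlyCost(prsPerDay: number): number {
  const avgCostPerPR = 0.6 * 0.04 + 0.3 * 0.12 + 0.1 * 0.28; // ≈ $0.088
  return prsPerDay * 30 * avgCostPerPR;
}

// monthlyCost(100) ≈ $264/month — in line with "roughly $300"
```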

Self-Hosted vs SaaS Reviewers

| Dimension | Your bot (Claude + Claudexia) | CodeRabbit / Greptile |
|---|---|---|
| Cost | $0.05–$0.30/PR | $30–$100/dev/month |
| Prompt control | Full | None |
| Data residency | Your CI | Their servers |
| Customization | Anything you can prompt | Their feature set |
| Setup time | One day | One hour |
| Maintenance | You own it | They own it |

The trade is real: SaaS is faster to set up and you don't maintain it. Self-hosted gives you control, cheaper unit cost at scale, and the ability to encode your team's actual taste.

Bottom Line

You can ship this in a day. The pieces — Claude Sonnet 4.5 via Claudexia, GitHub Actions, the Octokit SDK — are all stable, well-documented, and cheap. The only hard part is the prompt, and the template above is 90% of what you need.

Start simple: one prompt, one file at a time, blocker/warning/nit. Run it on your last 20 merged PRs and read the output. You'll iterate the prompt twice and have something better than most $50/month services by the end of the week.