Launch mock · for FAR.AI review

Rigorous, understandable, and viral in the same artifact.

That's the difference between a durable public reference and a flash in the pan. The scope below is built around it.

01 · Understandable

Land beyond the AI-safety bubble.

Audiences first. The launch only matters if it reaches them. Reporters who don't cover AI safety, policymakers without a technical background, the general public arriving from a single embed inside a Times article.

Failure mode · weak public uptake
The story stays inside AI-safety circles.
3 mitigations: visceral hero frame, Tier-1 exclusive, op-ed second wave

If the leaderboard reads as a research paper instead of a news event, only the AI-safety beat picks it up. General-audience reach disappears, and the regulatory pressure mechanism never fires.

  • Visceral hero-ranking frame: defense-in-depth made visual at first glance, not a leaderboard of percentages.
  • Tier-1 print exclusive 48 hours before launch (WIRED / MIT Tech Review / Axios / The Information as floor; NYT / Atlantic / Bloomberg as realistic ceiling given CBRNE/cyber + frontier-lab-naming).
  • Adam G. op-ed coordination through existing T11 scope, timed to the second wave.

Audience · Primary surfaces · What's tailored

A · Tier-1 print reporters (NYT · WSJ · WIRED · MIT TR · Atlantic)
  • Primary surfaces: 60-second pull · story angles · copy-ready stats · embed widgets · plain-English embargo · fact-check email
  • What's tailored: Deadline density. Ready-to-paste. Visuals as PNG + iframe.
B · General-audience media (CNN · NPR · podcasts · columnists)
  • Primary surfaces: Child-safety hook · defense-in-depth metaphor (locks, layers) · what-if sandbox · Adam G. op-ed · hero video
  • What's tailored: Zero jargon. Story-first. Locks-vs-one-lock metaphor lands without methodology context.
C · AI safety researchers (Academic · peer reviewer mindset)
  • Primary surfaces: Methodology page · methodology FAQ · limitations card · source-span citations · full-report PDF · prompts under credentialed access
  • What's tailored: Rigor-first. Operational definitions. Pre-empted peer-review questions.
D · Policymakers / regulators (Federal staffers · AI safety · child-safety)
  • Primary surfaces: Policymaker briefing CTA · CBRNE framing · child-safety framing · responsible-disclosure protocol
  • What's tailored: Regulatory-implication framing. Briefing-deck format. Staff-level technical depth.
E · Frontier lab safety/comms (Anthropic · OpenAI · xAI · Google DeepMind)
  • Primary surfaces: 14-day disclosure window · full technical report under embargo · remediation recommendations · engagement-state badge
  • What's tailored: Actionable. Opportunity-to-fix-before-public framing, not gotcha.
F · FAR.AI internal comms (Staff micro-audience · most launches forget this)
  • Primary surfaces: Internal staff briefing kit · talking-points doc · launch-day Slack pre-brief · share-friendly social cards for personal accounts
  • What's tailored: Org-pride angle. Pre-cleared answers to anticipated personal-network questions. Personal social cards distinct from official ones.
G · FAR.AI funders / donors (Mission proof-point)
  • Primary surfaces: Thesis card · post-launch impact view (citations, embeds, policymaker engagement) · durable-reference framing
  • What's tailored: ROI-on-mission narrative. Longer-term durability evidence.
H · General public (Social shares · embeds in articles · search)
  • Primary surfaces: Hero (defense-in-depth at first glance) · what-if sandbox · simplified explainer · social cards
  • What's tailored: Zero pre-existing knowledge required. Methodology link surfaced if curious, not pushed.

Same findings for every audience. Methodology, four labs ranked, seven risk domains, three defense layers, 14-day disclosure protocol, funding disclosure, spokespeople. All identical across the board. What changes per audience: framing, depth, and entry point. One set of findings. Eight ways in.
Message testing in action

We don't guess what lands. We test it before launch and audit what actually moved attention after. Two examples from our work for Northwell Health:

02 · Viral

One source. Many forms. A loop, not a launch.

The artifact has to travel and produce signal — citations, embeds, lab patches, policymaker briefings — the kind of feedback that comes back and shapes the next cycle.

Failure mode · results too flat to drive attention
Frontier labs cluster, the news angle disappears.
3 mitigations: per-attack-class breakdowns, defense-in-depth framing, divergence fallback

If the headline number suggests "all models look about the same," reporters won't bite. The structural finding has to survive the possibility that aggregate scores are close.

  • Per-attack-class breakdowns surface divergence even when averages are similar.
  • Defense-in-depth framing: Anthropic deploys layered defense, Google deploys none. That gap doesn't show up in any single overall score.
  • Frontier-divergence fallback ready if Gemini outperforms expectations, plus model-vs-model variance ready as the lede.

1 source · Technical summary
Defense-in-depth findings · Robustness data · Co-authored methodology
1 engine · Claim library + production stack
Source-span grounding · Multi-model orchestration · Reviewer surface (SSO) · Custom evals before ship
10+ artifacts · Launch outputs
Leaderboard site · Methodology page · Press kit + spokespeople · Hero video + infographics · Embed widgets + social cards · Op-ed + podcast talking points

See the engine on a single research paper · Paper example →

The same engine adapts to other source types too: video segments (interview cuts → editorial), podcast appearances (transcript → talking points + clip kits), research datasets (data → embedded explainer + press visuals), multi-language (single source → localized variants). These sit beyond the leaderboard launch but are available when FAR.AI's catalog grows in that direction.

The leaderboard, end to end.

Open the full mock below, then explore the surfaces inside. Each one is a concrete output of the engine.

Open the leaderboard · Launch →

A concrete mock of the AI Safety Leaderboard.

Built off the FAR.AI farai/ template: defense-in-depth framing across four frontier models × seven risk domains × three defensive layers. The full launch surface, not a landing page.

farai.makeitresonate.com/leaderboard

Each cycle compounds. The engine gets sharper, not just newer.

Most launches ship and walk away. Ours is built so the signal that comes back — what got cited, what got embedded, what fell flat — tunes the engine for the next batch. Variants we generate by default; experiments we bake in; underperformers we retire fast.

01 · Source
Research findings
FAR.AI technical work · Methodology · Robustness data
02 · Engine
Variants generated by default
Multi-format production · A/B-ready hooks · Source-span grounded · Reviewer in the loop
03 · Distribution
Right artifact, right door
Tier-1 embargo · Press kit + embeds · Audience-tailored ship
04 · Signal → tune
Measurement, then iterate
Citations + embeds + briefings · Per-variant performance · Retire underperformers · Double on what worked
Signal flows back to the engine. Next batch ships sharper hooks, tuned voice, retired losers. Cost per cited artifact drops; hit rate rises. Signal also flows back to FAR.AI — next research is calibrated by what landed.
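
A minimal sketch of the "signal → tune" step, under stated assumptions: the class `VariantSignal`, the function `triage`, the equal weighting of citations, embeds, and briefings, and the `retire_below` threshold are all hypothetical names and choices made for illustration, not the engine's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSignal:
    """Post-launch signal for one content variant (hypothetical structure)."""
    name: str
    citations: int
    embeds: int
    briefings: int

    @property
    def score(self) -> int:
        # Assumption: the three signals count equally. A real scoring
        # rule would be agreed with FAR.AI before measurement starts.
        return self.citations + self.embeds + self.briefings


def triage(variants, retire_below, double_top_n=1):
    """Sort variants into double-down, keep, and retire buckets."""
    ranked = sorted(variants, key=lambda v: v.score, reverse=True)
    survivors = [v for v in ranked if v.score >= retire_below]
    return {
        "double_down": [v.name for v in survivors[:double_top_n]],
        "keep": [v.name for v in survivors[double_top_n:]],
        "retire": [v.name for v in ranked if v.score < retire_below],
    }
```

The threshold is deliberately a parameter: where "underperformer" starts is an editorial call, and the sketch only makes that call explicit and repeatable from one batch to the next.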

A durable story arc, not a one-day hit.

The earned-media plan is built around four waves. Each new model release re-enters the cycle. The leaderboard becomes the reference, not the launch.

Wave 1 · Pre-launch
T −30 → T −1
  • Tier 1 print exclusive locked under embargo
  • T −14: formal lab disclosure with 14-day factual-review window
  • Government briefings, influencer mapping, asset finalization
Wave 2 · Launch day
Tuesday, July 14 · 9:00 AM ET
  • Embargo lifts. Print exclusive breaks.
  • Tier 2 simultaneous pitch (~25 named outlets)
  • Owned channels go live. Spokespeople on standby.
Wave 3 · Second wave
T +1 → T +30
  • Adam G. op-ed placed (post-coverage cycle)
  • Podcast circuit + influencer engagement
  • Vertical follow-ups: cyber, science, policy, international
Wave 4 · Living rhythm
T +30 → ongoing
  • Each new frontier-model release = new test cycle = new media hook
  • Quarterly state-of-defense-in-depth update
  • Embed and citation tracking compounds the reference
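
The T-offsets above reduce to simple date arithmetic. The sketch below is illustrative only: the function name `wave_schedule` and the dictionary keys are hypothetical, and the year 2026 is an assumption (the plan states only "Tuesday, July 14"; 2026 is one year in which July 14 falls on a Tuesday).

```python
from datetime import date, timedelta


def wave_schedule(launch: date) -> dict:
    """Turn the plan's T-offsets into calendar dates for a given launch day."""
    def t(offset_days: int) -> date:
        return launch + timedelta(days=offset_days)

    return {
        "wave_1_pre_launch": (t(-30), t(-1)),   # T -30 → T -1
        "lab_disclosure": t(-14),               # start of 14-day factual review
        "wave_2_launch_day": launch,            # embargo lifts
        "wave_3_second_wave": (t(1), t(30)),    # T +1 → T +30
        "wave_4_living_rhythm_starts": t(30),   # T +30 → ongoing
    }


# Assumption: 2026. Only the weekday, month, and day come from the plan.
schedule = wave_schedule(date(2026, 7, 14))
```

Keeping the launch date as the single input means a slipped launch moves every milestone together, including the lab disclosure date that the 14-day review window depends on.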
A living leaderboard

“The results reflect a point in time. When a model is released, we test it and capture how it performs at that moment. As new models come out, we test those as well. Over time, that creates a record of how systems evolve and whether safety is actually improving.”

Ed Yee · Head of Strategic Projects, FAR.AI

Not every finding is a launch. The engine has a fast-response track too.

The leaderboard’s waves cover the planned launch and its quarterly updates. Between cycles, FAR.AI’s researchers will keep finding things — new jailbreaks, new exploits, new red-team results. Adam Penenberg’s framing: those shouldn’t go silent or wait for the next big drop. A simple, repeatable flow turns each meaningful finding into a potential news event.

01 · Finding identified
Short internal summary: what it is, why it matters, who it affects.
02 · Company outreach
Optional but documented. Track response or non-response.
03 · Rapid comms draft
3–5 sentence plain-English summary, one takeaway, one quote (Adam G. or relevant researcher).
04 · Internal handoff
Research → FAR.AI comms → Thunder11 (PR) + Newsroom Studio (Adam P., long-form). Roles defined upfront, not negotiated each time.
05 · Distribution
Targeted reporters and newsletter writers. Framed as a discrete “finding” or “incident,” not a full report.
The point isn’t urgency for its own sake. Many findings won’t have a breaking-news clock. The shift is treating each meaningful finding as a potential news event, with a lightweight but defined path from discovery to disclosure. Most won’t go public — but when something should go public, the system is ready instead of scrambling.
Framing · Adam Penenberg
03 · Rigorous

Survive scrutiny on day one.

If the work doesn't hold up to frontier-lab pushback, the conversation pivots from what the findings show to whether FAR.AI is fair. That story doesn't recover.

Failure mode · methodology disputes dominate the story
Labs push back the moment the embargo lifts.
3 mitigations: co-authoring, source-span grounding, friendly red team

If a frontier lab can credibly contest the testing approach on day one, coverage follows the dispute over FAR.AI's fairness rather than the findings, and the launch doesn't get a second chance to reframe it.

  • Methodology page co-authored with a named FAR.AI team from draft one. Co-owned from the start, not reviewed at the end.
  • Source-span grounding on every claim. Every numeric statement traces to a verbatim research excerpt.
  • Pre-launch friendly red team: three outside reviewers (Anthropic / METR / Redwood placeholder) see visual + methodology under embargo before launch.
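
One way to picture the source-span grounding check is the sketch below. It is a simplified illustration, not the production check: the names `Claim`, `numbers_in`, and `is_grounded` are hypothetical, and the rule it applies (the span must appear verbatim in the source, and every number in the claim must also appear in the span) is an assumed reading of "every numeric statement traces to a verbatim research excerpt."

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str         # paraphrased statement as it appears in an artifact
    source_span: str  # excerpt the claim is supposed to trace back to


def _normalize(s: str) -> str:
    # Collapse whitespace so line wrapping in the source does not break matching.
    return " ".join(s.split())


def numbers_in(s: str) -> set:
    # Numeric tokens such as "14" or "3.5".
    return set(re.findall(r"\d+(?:\.\d+)?", s))


def is_grounded(claim: Claim, source_text: str) -> bool:
    """True if the span is verbatim in the source and covers the claim's numbers."""
    span = _normalize(claim.source_span)
    if not span or span not in _normalize(source_text):
        return False
    return numbers_in(claim.text) <= numbers_in(span)


# Illustrative source sentence, written for this example only.
source = "Labs receive a 14-day window\n to review findings before publication."
ok = Claim("Labs get 14 days to review findings.", "a 14-day window")
drifted = Claim("Labs get 30 days to review findings.", "a 14-day window")
```

Here `is_grounded(ok, source)` passes and `is_grounded(drifted, source)` fails, because the paraphrase introduced a number the span does not contain. A check like this catches numeric drift only; whether a paraphrase is fair to the source still needs the researcher in the loop.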
One workflow that supports rigor at scale · Try the demo →

Technical reviewer in the loop. Optional, scoped tooling.

A workflow layer: a researcher reviews each paraphrased claim against its source span once, and the approval propagates to every downstream artifact. Built to support methodology rigor without grinding the research team down by month four. Knowledge structure that scales beyond the leaderboard, and could empower the broader FAR.AI brand and comms team. Full walkthrough in the appendix below.

farai.makeitresonate.com/reviewer-tool

Foundation, launch, measurement.

Three months. Stage-gated against the tech summary. LOI now, full SOW within five business days of the tech summary landing. Pricing reflects the rates for existing T11 retainer clients.

Month 1 · Foundation

Visual language + methodology + reviewer-tool design.

  • Visual language sprint + website information architecture
  • Methodology co-authoring plan with named FAR.AI team
  • Reviewer tool design + Paper Pilot kickoff
  • T11 retainer handles core narrative + embargo + media list — we coordinate, not re-bill
Month 2 · Build → launch

Leaderboard + reviewer tool + launch artifacts.

  • Leaderboard website built off the farai/ template
  • Co-authored methodology page
  • Press kit, hero video, attack-class infographics, embed widgets
  • Reviewer tool deployed at FAR.AI URL with SSO per researcher
Month 3 · Post-launch + measurement

Performance, durable-reference signal, Phase 2 scoping.

  • Leaderboard analytics, embed-widget tracking, engagement
  • Durable-reference signals: citations, share velocity, adoption
  • Lessons-learned + Phase 2 scoping recommendation

Domain expertise plus model-driven automation.

Not just prompt engineering. Not just tech. Two layers, intentionally separated. The human layer decides what a good artifact is. The model layer scales that judgment across a research library.

Layer 01 · Human

Editorial judgment + calibrated relationships.

  • Adam Penenberg — investigative journalism judgment + Thunder11 reporter network.
  • Mike Schneider — narrative architecture + interactive surface.
  • Researcher interview → claim library: the source that models are allowed to pull from.
Layer 02 · Model

Multi-model orchestration + source-span grounding.

  • Multi-model: different models for narrative, paper parsing, structured outputs.
  • Source-span grounding on every claim. Subagents check fidelity + voice.
  • Captions at the source — ADA built in. Custom evals before ship.
Models scale judgment. They don't define it.

Five things we want your judgment on.

Anchored on the work in our scope: visual + interactive + content engine. PR sequencing, reporter pitch, and op-ed placement are Thunder11’s lane — we’re not asking you to litigate those here.

  1. Child safety prominence. The heatmap has a toggle: default treats child safety as one of seven domains, “child safety forward” pins it as the lead row. Which version is right for the public surface, and how forward should that finding be in the visual framing?
  2. Living-leaderboard framing. Ed’s “point in time, evolves with each model release” is the through-line for our content engine. Does that hold up when reporters and policymakers actually use it? Where does it strain?
  3. Audience layering. Four audience messages shipped in the press kit: general public, policy, technical, frontier labs. Where does the framing break, and which audience is hardest to reach with a single visual surface?
  4. Tone in the visual treatment. “Industry improvement, not embarrassment” is the North Star. Does our visual treatment of the gap (Anthropic at #1, Google at #4) read that way, or does it slide into mockery? What would you adjust?
  5. Highest-risk visual or framing move. What’s the surface, claim, or visual choice in our work most likely to backfire under reporter or lab pressure? What would you change before launch?
If we’ve got bandwidth: where does the content-engine concept need to flex to land with FAR.AI’s actual programming cadence (events, papers, ongoing research)?
Appendix · Optional reading

Technical review in the loop — the full version.

Brief plug appears in 03 · Rigorous. This appendix walks through how the workflow actually runs — what's reviewed where, how researcher time stays scoped, and why it could empower the broader FAR.AI brand and comms team beyond the leaderboard.

Try the reviewer tool · Open demo →

Time-boxed queues. Source-span always visible. One review propagates everywhere.

Sign in as Ed, Kellin, or Adam G. Walk through a queue of paraphrased claims paired with verbatim source spans. Approve, suggest edit inline, flag, or veto. Each action propagates to every downstream artifact. Audit trail timestamps every decision.

farai.makeitresonate.com/reviewer-tool
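
The "one review propagates everywhere" behavior can be sketched as a shared claim library that artifacts reference by ID. Everything below is a hypothetical illustration rather than the reviewer tool's real data model: the names `ClaimLibrary`, `review`, and `render`, the status strings, and the rule that only approved claims render are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Reviewer actions named in the walkthrough, mapped to assumed status strings.
ACTIONS = {
    "approve": "approved",
    "suggest_edit": "edit_suggested",
    "flag": "flagged",
    "veto": "vetoed",
}


@dataclass
class Claim:
    claim_id: str
    text: str
    source_span: str
    status: str = "pending"
    suggested_text: Optional[str] = None


@dataclass
class ClaimLibrary:
    claims: dict = field(default_factory=dict)       # claim_id -> Claim
    artifacts: dict = field(default_factory=dict)    # artifact -> [claim_id, ...]
    audit_trail: list = field(default_factory=list)  # one entry per decision

    def review(self, reviewer, claim_id, action, suggested_text=None):
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        claim = self.claims[claim_id]
        claim.status = ACTIONS[action]
        if action == "suggest_edit":
            claim.suggested_text = suggested_text
        # Timestamp every decision for the audit trail.
        self.audit_trail.append({
            "at": datetime.now(timezone.utc),
            "reviewer": reviewer,
            "claim_id": claim_id,
            "action": action,
        })

    def render(self, artifact):
        # Assumption: only approved claims reach a shipped artifact.
        return [
            self.claims[cid].text
            for cid in self.artifacts[artifact]
            if self.claims[cid].status == "approved"
        ]
```

Because artifacts hold claim IDs rather than copies of the text, a single approval or veto changes what every artifact renders, which is what keeps researcher review to one pass per claim.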
How review works in practice
About 2 hours of researcher review per major paper. Scoped upfront, not open-ended.
6 principles: long-form veto, methodology co-ownership, claim library, embed templates, time, revision scope
  • Long-form artifacts (interactive site, article, video): the lead researcher signs off at two checkpoints (outline before build, finished artifact before ship). ~30 minutes each. Outline-stage rejection kills the build. No sunk-cost arguments.
  • Methodology page + press-kit Q&A: co-authored with a named FAR.AI team from draft one. Co-owned, not reviewed. The structural answer to the methodology-disputes failure mode.
  • Social posts: claim library built with the researcher once. Every post pulls from that library. Researcher scans the week's set in a single Friday pass. No researcher ever sees the same phrasing twice.
  • Embed widgets: every numeric claim carries an inline methodology caveat. Reviewed at the template level, once, not per deployment.
  • Researcher time: ~2 hours per major paper across the full lifecycle, scoped into the engagement upfront. If that number's wrong for FAR.AI's team, we want to know before it's baked in.
  • Revision scope: two review rounds per artifact (outline + finished). Anything beyond is a change order with new timeline. The guardrail against open-ended comment threads that kill engagements at month four.