7 Conclusions from 250+ Decks. 7 Learnings: how expert feedback increases the real value of AI.
A pitch deck is rarely the whole story — but it is usually where the founder–investor transaction starts. What happens next is a chain of reads, pushes, corrections, and calibrations. This article unpacks that chain: what 262 decks in 60 days reveal, and how 183 expert feedback loops turn raw AI output into sharper founder–VC judgment — because expert feedback is what increases the real value of AI in diligence, not the model pass alone.
Most diligence still begins with slides. A founder frames the opportunity; a VC tests the claims. The gap between those two moves is where quality is won or lost — and where most AI tools stop at a single pass. We wanted to see what changes when you design for several expert feedback loops instead of one generic review: humans correcting specific layers, distilling what they know, and feeding that back into the next audit.
TechTruth runs scoped loops after each deck: VC scrutiny on AI depth and moat, founder pushback on verification, red/blue AI judges on battle chapters, operator calibration on scorecards, and annotation passes on reality gaps. Each loop closes with a lesson stored for the next company. The same tension you feel in a partner meeting — “is this wrapper or infrastructure?” — becomes structured input, not hallway opinion. That is how AI earns trust in this workflow: not by sounding confident, but by getting corrected by people who have seen the pattern before.
That setup is the lens for everything below. Seven conclusions describe patterns in the decks themselves: verdicts, sectors, intake rhythm, where conviction breaks. Seven learnings describe how expert feedback increases the real value of AI — how scrutiny compounds, how pushback sharpens the next audit, why the system gets better deal by deal without opaque retraining. Every chart is tied to live storage; explore the same numbers in our interactive 7+7 dashboard.
Read it as a benchmark report if you like — but we hope you read it as an invitation. Where would you push back on a verdict? Which loop would you add? If you have sat on either side of a deck review, bring that instinct to the charts and see whether the data matches your experience.
107 companies with full benchmark scorecards · 101 distilled lessons in the store · 73% of loops driven by VC scrutiny · 27% by founder pushback.
Last 60 days · live intake: 262 decks (avg 4.4/day) · 183 feedback loops
Tool improvement: 9% AI class resolved now (↑ vs early pipeline months)
Red vs blue team: 25 red team closes · 10 blue team closes
Feedback tension: 73% VC scrutiny · 27% founder pushback
Weekday vs weekend: 5.5 weekday avg · 1.8 weekend avg
Reality gaps: 54% wrapper signal · 44% founder claim gap
Classification + climate depth: 20% climate/energy mix in benchmarks · feedback loops compound per vertical
7 conclusions
What we see in the first 250+ decks
Patterns across verdicts, pillars, AI class, sectors, intake rhythm, channels, and conviction gaps.
01
Cautious is the default — and that is the point
53.3% of benchmark decks land in Cautious verdict territory. That is not a bearish market call — it marks teams that are directionally credible but still light on proof. The best investors use Cautious as a structured pause: which pillar, if fixed, would move this to Optimistic?
Optimistic and Strong verdicts exist — but they are the exception. Most decks sit in a grey zone where one more data point on moat or infrastructure would shift the meeting.
Verdict distribution (benchmark scorecards)
02
Founders win the narrative; technology still has to win the room
Founder Team averages 6.9/10 — the highest scorecard pillar. Technical Moat (5.5), AI Asset Depth (4.7), and Infrastructure (4.8) trail behind. Decks are polished. The diligence conversation has moved from ‘strong team’ to ‘show me what compounds.’
We see the same split in every sector: charismatic founders with slides that outrun the repo. The scorecard is designed to surface that gap explicitly — not to punish storytelling, but to locate risk.
Executive scorecard pillars (mean ×10)
03
Wrapper language is still everywhere (32.7% Pure Wrapper)
Nearly one in three decks classifies as Pure Wrapper — the largest AI-class bucket in the set. Many are viable businesses. The gap is not ambition; it is defensibility: workflow lock-in, proprietary data, or distribution — not ‘agent’ vocabulary on slide three.
‘Agentic’ replaced ‘AI-powered’ as the default adjective in 2025–2026. The classification loop exists precisely because narrative upgrades faster than architecture.
The same pattern shows up in AI security: across 267 decks in the benchmark set, 3 cite Deeploy-style guardian agents as proof — prompt-injection defence, guardrails, and related controls — often marketed ahead of a production trace or threat model we could verify in diligence.
AI class mix
04
Climate and energy punch above their weight
21 climate and energy decks make up 19.6% of the benchmark mix — disproportionate to general tech (44.9% in pure software). Grid, storage, and industrial AI attract heavier technical scrutiny; decks in this vertical face a higher proof bar by default.
44 learned lessons are already tagged energy or climate — the vertical where feedback loops closed fastest. Hardware-adjacent claims (grid, storage, sensing) trigger more battle-chapter overrides than pure SaaS.
Industry mix
05
Deal flow is accelerating — but it respects the calendar
262 decks in 60 days (4.4/day average). Weekdays: 5.5/day. Weekends: 1.8/day. Second-half volume is +54% vs the first half — pipeline is growing, not flat.
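As a quick check on the arithmetic, the daily average follows directly from the totals, and the +54% figure implies an approximate split between the two halves of the window. The sketch below assumes two equal 30-day halves; the per-half counts are derived here for illustration and are not reported figures.

```python
TOTAL_DECKS = 262
WINDOW_DAYS = 60
SECOND_HALF_GROWTH = 0.54  # second half is +54% vs the first half

avg_per_day = TOTAL_DECKS / WINDOW_DAYS

# If the first half is x and the second half is 1.54x, then x + 1.54x = 262.
first_half = TOTAL_DECKS / (2 + SECOND_HALF_GROWTH)
second_half = TOTAL_DECKS - first_half

print(f"average per day: {avg_per_day:.1f}")
print(f"implied first half: ~{first_half:.0f} decks")
print(f"implied second half: ~{second_half:.0f} decks")
```

Under that assumption the window splits into roughly 103 and 159 decks, which is consistent with a growing pipeline.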
VC partners batch-review Tuesday through Thursday. Founders upload Sunday night. The intake chart is not noise — it is how the market actually behaves, and the benchmark window reflects that.
60-day intake — weekdays vs weekends
06
Angel syndicates are the largest single channel (15.6%)
The largest intake channel in the 60-day window is angel syndicates and individual angels, at 15.6% of volume. Other channels, including founder-direct referrals, fill the rest. Diligence quality matters most when deal flow is diversified: the same red flags show up regardless of who forwarded the deck.
Channel mix shapes urgency, not truth. A founder-direct deck and a VC-forwarded deck fail for the same technical reasons — thin moat, missing data strategy, wrapper claims — even when the intro email sounds completely different.
Deck sources (60 days)
07
The recurring reality gap: moat appears thin or wrapper-like
Reality-gap tables flag where narrative outruns evidence. The most repeated theme: moat appears thin or wrapper-like. Mean WCCAA sits at 6.3/10 — not weak, but uneven. Proof is often present somewhere in the data room; it is rarely assembled as one auditable story in the deck.
That concludes the deck-side picture: strong stories, uneven proof, wrappers still common, energy decks held to a higher bar. What changes next is not the deck — it is whether the diligence engine learns from every override.
WCCAA score distribution
7 learnings
Expert feedback that increases the real value of AI
How the learning loop works
Tech diligence cannot be scored with a single thumbs-up. It needs expert text on specific parts of an audit — that is what turns AI from a draft into something you can act on. We built a simple loop — no model fine-tuning, no crowd labels — where each correction increases the real value of the next AI pass:
Correct — Someone with domain context fixes one section: scorecard, battle chapter, founder check, reality gap, or industry/AI tag.
Explain — They say what was wrong and why in plain language (critique + directive), not a 1–5 score.
Distil — TechTruth turns that into one anonymised sentence: no founder names, no company names, safe to reuse.
Reuse — The next deck in a similar sector, stage, or AI class pulls that rule into the audit prompt. The Intelligence Trace shows which rules fired.
Four kinds of expert input (VC, founder, red/blue AI judges, and operator) feed the same loop. Each learning below is one way that loop shows up in the data.
183 loop closes in 60 days (~70% of decks). The seven learnings are what that compounding looks like in the numbers — expert feedback turning AI output into reusable diligence rules, not theory.
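The four steps (correct, explain, distil, reuse) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not TechTruth's implementation: the `Lesson` and `LessonStore` names, the tag fields, the matching rule (a lesson applies when sector, stage, or AI class matches), the example company name, and the naive string redaction in `distil` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lesson:
    """One anonymised, one-sentence rule plus the tags used to retrieve it."""
    text: str
    section: str   # e.g. "scorecard", "battle", "reality_gap"
    sector: str
    stage: str
    ai_class: str


def distil(critique: str, directive: str, private_names: list[str]) -> str:
    """Hypothetical distil step: join critique + directive, then redact names.

    Real anonymisation needs far more care than string replacement;
    this only shows the shape of the step.
    """
    sentence = f"{critique.rstrip('.')}; {directive.rstrip('.')}."
    for name in private_names:
        sentence = sentence.replace(name, "[redacted]")
    return sentence


@dataclass
class LessonStore:
    lessons: list[Lesson] = field(default_factory=list)

    def add(self, lesson: Lesson) -> None:
        self.lessons.append(lesson)

    def retrieve(self, sector: str, stage: str, ai_class: str) -> list[Lesson]:
        """Return lessons sharing a sector, stage, or AI class with the deck."""
        return [
            lesson for lesson in self.lessons
            if lesson.sector == sector
            or lesson.stage == stage
            or lesson.ai_class == ai_class
        ]


def build_prompt(base_prompt: str, fired: list[Lesson]) -> str:
    """Reuse step: append the retrieved rules to the audit prompt as text."""
    if not fired:
        return base_prompt
    rules = "\n".join(f"- [{lesson.section}] {lesson.text}" for lesson in fired)
    return f"{base_prompt}\n\nLearned lessons:\n{rules}"


# Correct + explain: an expert fixes one section and says why.
store = LessonStore()
sentence = distil(
    critique="Moat score was generous for AcmeGrid",
    directive="require proprietary data evidence before scoring moat above 6",
    private_names=["AcmeGrid"],
)
store.add(Lesson(sentence, "scorecard", "energy", "pre-seed", "pure_wrapper"))

# Reuse: the next energy deck pulls the rule into its audit prompt.
fired = store.retrieve(sector="energy", stage="seed", ai_class="unresolved")
prompt = build_prompt("Audit this deck.", fired)
```

Because the rules live in the prompt as plain sentences, the list of `fired` lessons doubles as a record of which rules influenced a given audit.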
01
Five scoped loops beat one generic human review
TechTruth runs separate loops for scorecard calibration, founder verification, battle chapters, reality-gap annotations, and classification tagging. In the last 60 days they closed 183 times (~70% of 262 decks reviewed). Quality improves when an expert fixes one layer without rewriting the whole audit.
Methodology step: scoped loops mean each correction stays in its lane — classification fixes do not overwrite battle chapters.
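One way to picture a correction staying in its lane is an audit stored as independent sections, where a correction can replace only the section it is scoped to. The sketch below is illustrative: the section keys and the `apply_correction` helper are assumptions, not TechTruth's data model.

```python
from copy import deepcopy

# Hypothetical section keys, one per scoped loop named in the text.
SCOPES = {
    "scorecard",
    "founder_verification",
    "battle",
    "reality_gap",
    "classification",
}


def apply_correction(audit: dict, scope: str, corrected: dict) -> dict:
    """Return a new audit with only the scoped section replaced."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    updated = deepcopy(audit)  # leave the original audit untouched
    updated[scope] = corrected
    return updated


audit = {
    "classification": {"ai_class": "unresolved"},
    "battle": {"winner": "red", "summary": "moat looks thin"},
}
fixed = apply_correction(audit, "classification", {"ai_class": "pure_wrapper"})
```

Here the classification fix changes one section and the battle chapter comes through unmodified.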
Feedback loop closes by type
02
VC feedback makes the next deck stricter (73% of loops)
133 loop closes in 60 days reflect the investor lens — generous moat scores, thin AI depth, missing diligence questions. VCs and operating partners teach the system where automated decks look too optimistic.
After distil + reuse: the next comparable deck inherits stricter moat and wrapper checks — sourced from VC critique text, not a harsher default model.
Where VCs pushed hardest
03
Founder feedback makes the next deck fairer (27% of loops)
50 loop closes reflect founder pushback: false-positive verification flags, harsh battle outcomes when context was missing, executive summaries that misread the build. Founders are not the other team — they correct precision.
Same loop, opposite direction: founder explain → distil → reuse improves verification fairness on the next deck.
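The 73% / 27% split can be reproduced from the two loop-close counts quoted in this report (133 VC-driven, 50 founder-driven). A short check, using only those figures:

```python
vc_closes, founder_closes = 133, 50

total = vc_closes + founder_closes
vc_share = vc_closes / total
founder_share = founder_closes / total

print(f"total closes: {total}")
print(f"VC scrutiny: {vc_share:.0%}, founder pushback: {founder_share:.0%}")
```

The two counts sum to the 183 loop closes reported for the window.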
Where founders pushed back
04
AI judges stress-test claims before humans commit (35 battle closes)
Every audit runs Red Team (sceptical investor) and Blue Team (founder advocate). In 60 days: 25 red-side and 10 blue-side battle loop closes. Humans override either voice; overrides distil into battle lessons.
AI judges draft the adversarial chapter; human override → distil → reuse stops the same battle mistake repeating.
Red team vs blue team loop closes
05
Classification resolution is the clearest before/after
Resolved AI class share rose from 0% to 53% as classification loops closed. 9.4% of benchmark decks now carry a definitive AI class.
The loop’s output metric: more decks leave classification with a resolved AI class because expert tags became reusable rules.
Loops closed (bars) vs AI class resolved % (line)
06
Operator calibration: 49 scorecard closes in 60 days
Scorecard & WCCAA calibration accounts for 26.9% of loop volume (49 closes). Founder verification added 24 more. Why Commit calibration turns ‘this deck was wrong’ into rules for every Energy pre-seed wrapper or Series A deep-tech deck.
Operator calibration is the same four steps: each override becomes a scoring rule keyed to stage, sector, and AI class.
Operator calibration loop closes (60 days)
07
Distilled lessons reuse — 101 rules the next deck reads
101 anonymised lessons in the store; 10 classification loops and 44 energy/climate-tagged lessons — the vertical with the most overrides. Each commit compresses to one sentence, retrieved on the next comparable audit.
End state of the loop: 101 rules in the store; the chart below shows which voices produced them.
Distilled lessons by source type
Grounded in research — adapted by us
We read the alignment literature on expert feedback in fuzzy domains. We did not copy a paper’s training stack. Below: what we took from each source, and what we changed for tech due diligence.
We took: “Fuzzy” domains need rich expert language, not scalar crowd labels; small expert volume can compound if you run an online loop (fresh feedback → update supervision → next batch).
We changed: no proxy reward model training and no RL fine-tuning. Our path is distil → store → retrieve in-context on the next audit. Plus five scoped loops and four expert voices (VC, founder, red/blue judges, operator) — not in the paper.
We took: expert feedback should be auditable — which inputs matter, where labeler bias creeps in.
We changed: we do not run influence functions on a reward model. We use operational impact: lesson reuse counts, VC vs founder split, Intelligence Trace per deck. In June 2026 we added lesson–audit attribution so reuse is measured, not inferred.
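Lesson reuse counts of this kind could be tallied from per-deck trace records. The sketch below assumes a trace is simply the list of lesson IDs that fired during each audit; the record shape, deck IDs, and lesson IDs are hypothetical.

```python
from collections import Counter

# Hypothetical traces: deck ID -> lesson IDs that fired during its audit.
traces = {
    "deck-001": ["L-12", "L-40"],
    "deck-002": ["L-12"],
    "deck-003": [],
}

# Hypothetical set of all lesson IDs currently in the store.
all_lessons = {"L-12", "L-40", "L-77"}

# How often each lesson was reused across audits.
reuse_counts = Counter(
    lesson_id for fired in traces.values() for lesson_id in fired
)

# Lessons that exist in the store but have not influenced any audit yet.
never_fired = all_lessons - set(reuse_counts)
```

Counting from recorded traces is what makes reuse measured instead of inferred: a lesson that never appears in any trace is visible as such.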
Related work · not our stack
Rafailov et al., Direct Preference Optimization (DPO)
We took: the idea that human preferences can steer models — useful context for readers from an RLHF background.
We changed: DPO bakes preferences into weights. TechTruth bakes them into prompt context (learned lessons). Simpler to operate; fits a diligence product where rules must stay inspectable.
Our own additions (not from any single paper): red vs blue battle chapters before human commit; WCCAA calibration by stage and AI class; energy/climate lesson density; anonymised one-sentence distillation; benchmark reporting tied to 262 decks and 183 loops in this window.
What we do not claim: reproducing academic training results, running RLHF/DPO in production, or full influence-function attribution. We claim a practical loop that matches how VCs and founders actually review decks — and numbers that move when it works.
7 conclusions from the decks. 7 learnings on expert feedback that increases the real value of AI.
Stay tuned. Next time we go deeper on how to run this way of working, with expert feedback loops and live data ingestion, safely and responsibly: what we anonymise before a lesson enters the store, how founder and VC input is handled, where the retention boundaries sit, and what “inspectable AI” means when rules are reused across decks. Same loop, same compounding quality, with guardrails you can explain to an LP, a founder, or your own compliance team.