Product
Parth Shah
Key Insights from Goodfin x NYU Stern: AI Passing CFA Level III and What it Means for Finance
We’re proud to share that our research on advanced financial reasoning was recognized with the Best Oral Presentation Award at FinLLM @ IJCAI 2025 and featured by CNBC, Financial Times, Entrepreneur, Yahoo Finance, and WealthManagement, among others.
This work marks a major step forward in understanding how LLMs handle complex financial tasks beyond multiple choice, where nuance, judgment, and strategy are critical.
Essays Are the New Frontier:
Multiple-choice questions no longer differentiate top AI models, with leading models clustering tightly around 71-75% accuracy. The real differences emerge in complex essay tasks, where models must construct nuanced reasoning rather than select from options.
We observed this firsthand: while MCQ scores bunched narrowly (~71-75%), essay performance spread materially—o4-mini achieved 79.1% and Gemini 2.5 Pro reached 75.9%, approaching human expert levels (79.9% and 83.2% respectively). This divergence reveals where true reasoning capability lives.
Why it matters for Goodfin: MCQs test recognition; essays test articulation. For our AI Concierge, this reinforces that competitive advantage isn't in retrieving correct answers - it's in explaining why an answer holds under specific financial regulations and client circumstances. The Concierge must articulate reasoning paths that withstand professional scrutiny.

Performance Has a Price:
Advanced prompting with chain-of-thought self-consistency (CoT-SC) boosts accuracy by +7.8 points but comes at a steep ~11x increase in cost and latency, highlighting the critical trade-off between accuracy and efficiency (a sketch of the technique appears at the end of this section).
Reasoning models clearly outperform non-reasoning ones on both MCQs and essays: 73.1% vs. 69.4% MCQ accuracy, and 75.9% vs. 62.5% essay accuracy (human-graded). But these gains require significantly more processing time: ~59s vs. 18s for MCQs, and ~103s vs. 32s for essays.
Why it matters: This highlights the trade-off between accuracy and speed - reasoning models deliver more reliable answers on high-stakes financial reasoning tasks, but at the expense of latency. In practice, this suggests deploying them selectively where precision outweighs speed.
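For readers who want to see what CoT-SC looks like in practice, here is a minimal sketch: sample several independent chain-of-thought completions and majority-vote their final answers. It assumes an OpenAI-style chat-completion client and a hypothetical "ANSWER:" output format; the model name and prompts are illustrative, not the exact setup used in the paper. The roughly linear cost of each extra sample is where the ~11x overhead comes from.

```python
import collections
import openai  # any OpenAI-compatible chat-completion client works

client = openai.OpenAI()

def ask_with_cot(question: str, temperature: float = 0.8) -> str:
    """Sample one chain-of-thought completion that ends with 'ANSWER: <final answer>'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name, not the paper's exact setup
        temperature=temperature,
        messages=[
            {"role": "system",
             "content": "Reason step by step, then end with a line 'ANSWER: <final answer>'."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

def extract_answer(text: str) -> str:
    """Pull the final answer token out of a reasoning chain."""
    for line in reversed(text.splitlines()):
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip()
    return ""

def cot_self_consistency(question: str, n_samples: int = 10) -> str:
    """CoT-SC: sample several independent reasoning chains and majority-vote
    their final answers. Each extra sample adds roughly linear cost and latency."""
    votes = collections.Counter(
        extract_answer(ask_with_cot(question)) for _ in range(n_samples)
    )
    answer, _ = votes.most_common(1)[0]
    return answer
```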

Human Validation is Crucial:
Our certified human grader scored essays +5.6 points higher than our "LLM-as-a-judge," underscoring that human expertise remains essential for calibrating automated evaluation.
To ensure rigor, we validated model performance using a blind human CFA Level III grader alongside an LLM-as-judge framework. Every essay response (989 in total) was double-scored, enabling a direct correlation analysis. The results showed that while both graders agreed exactly on ~70% of responses, the human grader awarded essays +5.6 points higher on average, reflecting greater leniency and nuance compared to the stricter rubric adherence of the LLM judge.
Why it matters: This dual-track validation highlights that relying solely on LLM judges risks underestimating performance on complex, judgment-heavy tasks - and underscores the importance of calibrating AI evaluation pipelines against human expert standards.
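The calibration check itself is straightforward to reproduce once you have paired scores. The sketch below (field names are illustrative, not our internal schema) computes the two statistics reported above: the exact-agreement rate and the average human-minus-LLM score gap.

```python
from dataclasses import dataclass

@dataclass
class GradedEssay:
    human_score: float  # score from the blind human grader
    llm_score: float    # score from the LLM-as-judge rubric

def agreement_report(essays: list[GradedEssay]) -> dict[str, float]:
    """Summarize how closely the LLM judge tracks the human grader:
    exact-agreement rate and the mean human-minus-LLM score gap."""
    n = len(essays)
    exact = sum(1 for e in essays if e.human_score == e.llm_score)
    gap = sum(e.human_score - e.llm_score for e in essays) / n
    return {
        "n_essays": n,
        "exact_agreement_rate": exact / n,   # ~0.70 in our study
        "mean_human_minus_llm": gap,         # ~+5.6 points in our study
    }
```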

The practical takeaway isn't to "always use the biggest model," but to implement a tiered deployment strategy: routing routine queries to faster, cheaper models while reserving state-of-the-art reasoning models for high-stakes, judgment-heavy tasks.
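As an illustration only, a tiered router can start as a simple policy function; the model names and the keyword heuristic below are placeholders for whatever classifier or learned router a production system would use.

```python
def needs_structured_reasoning(query: str) -> bool:
    """Crude keyword heuristic standing in for a learned router or classifier."""
    keywords = ("portfolio construction", "tax", "estate", "compliance")
    return any(k in query.lower() for k in keywords)

def route_query(query: str, is_high_stakes: bool) -> str:
    """Tiered routing: cheap/fast model for routine queries, a reasoning model
    only where precision outweighs latency and cost."""
    if is_high_stakes or needs_structured_reasoning(query):
        return "reasoning-model"  # slower, pricier, more reliable
    return "fast-model"           # low latency, low cost
```

In practice the heuristic would be replaced by a confidence- or topic-based classifier, with the reasoning tier reserved for queries where an error is costlier than a few extra seconds of latency.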
We built cfabenchmark.com around this principle: an evaluation framework that embodies this tiered pipeline, built for the next generation of intelligent systems in finance. We invite researchers, practitioners, and institutions to explore and contribute.