Product
Parth Shah

We’re proud to share that our research on advanced financial reasoning was recognized with the Best Oral Presentation Award at FinLLM @ IJCAI 2025 and featured by CNBC, Financial Times, Entrepreneur, Yahoo Finance, and WealthManagement, among others.
This work marks a major step forward in understanding how LLMs handle complex financial tasks beyond multiple-choice formats, where nuance, judgment, and strategy are critical.
Essays Are the New Frontier:
Top AI models are now commoditized on multiple-choice questions, with leading models clustering tightly around 71-75% accuracy. The real differences emerge in complex essay tasks, where models must construct nuanced reasoning rather than select from options.
We observed this firsthand: while MCQ scores bunched narrowly (~71-75%), essay performance spread materially—o4-mini achieved 79.1% and Gemini 2.5 Pro reached 75.9%, approaching human expert levels (79.9% and 83.2% respectively). This divergence reveals where true reasoning capability lives.
Why it matters for Goodfin: MCQs test recognition; essays test articulation. For our AI Concierge, this reinforces that competitive advantage isn't in retrieving correct answers - it's in explaining why an answer holds under specific financial regulations and client circumstances. The Concierge must articulate reasoning paths that withstand professional scrutiny.

Performance Has a Price:
Advanced prompting (CoT-SC) boosts accuracy by +7.8 points but comes at a steep ~11x cost and latency increase, highlighting the critical trade-off between accuracy and efficiency.
Reasoning models clearly outperform non-reasoning ones on both MCQs and essays: 73.1% vs. 69.4% MCQ accuracy, and 75.9% vs. 62.5% essay accuracy (human-graded). But these gains require significantly more processing time: ~59s vs. ~18s for MCQs, and ~103s vs. ~32s for essays.
Why it matters: This highlights the trade-off between accuracy and speed - reasoning models deliver more reliable answers on high-stakes financial reasoning tasks, but at the expense of latency. In practice, this suggests deploying them selectively where precision outweighs speed.
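For readers curious about the mechanics, here is a minimal sketch of why CoT-SC carries that cost multiplier: it samples many independent reasoning chains and majority-votes the answers, so tokens and latency scale roughly linearly with the sample count. The `ask_model` stub, sample count, and prompt wording are illustrative assumptions, not our production pipeline:

```python
from collections import Counter

N_SAMPLES = 11  # cost and latency grow roughly linearly with the sample count

def ask_model(prompt: str, temperature: float) -> str:
    """Placeholder for a single LLM call returning the completion text.
    Stands in for whichever provider client a real pipeline would use."""
    raise NotImplementedError

def cot_self_consistency(question: str, n: int = N_SAMPLES) -> str:
    """Chain-of-thought with self-consistency (CoT-SC): sample n independent
    reasoning chains at nonzero temperature, extract each final answer, and
    return the majority vote. Accuracy tends to rise with n, but token cost
    and wall-clock time rise ~n-fold, which is the trade-off described above."""
    prompt = f"{question}\n\nThink step by step, then finish with 'Answer: <choice>'."
    answers = [
        ask_model(prompt, temperature=0.7).rsplit("Answer:", 1)[-1].strip()
        for _ in range(n)
    ]
    return Counter(answers).most_common(1)[0][0]
```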

Human Validation is Crucial:
Our certified human graders scored essays +5.6 points higher than our "LLM-as-a-judge," underscoring that human expertise remains essential for calibrating automated evaluation.
To ensure rigor, we validated model performance using a blind human CFA Level III grader alongside an LLM-as-judge framework. Each essay response (989 in total) was double-scored, enabling a direct correlation analysis. The results showed that while the two graders agreed exactly on ~70% of responses, the human grader awarded scores +5.6 points higher on average, reflecting more leniency and nuance compared to the stricter rubric adherence of the LLM judge.
Why it matters: This dual-track validation highlights that relying solely on LLM judges risks underestimating performance on complex, judgment-heavy tasks - and underscores the importance of calibrating AI evaluation pipelines against human expert standards.
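As a rough illustration of the double-scoring analysis, a helper like the following computes the two statistics above from paired scores. The function name and output keys are ours, and it assumes both graders score each essay on the same scale:

```python
def compare_graders(human: list[float], llm: list[float]) -> dict[str, float]:
    """Compare paired scores for the same essays (989 double-scored responses
    in our study): report the exact-agreement rate and the mean
    human-minus-LLM score gap (~70% and +5.6 points above)."""
    assert len(human) == len(llm) and human, "scores must be paired and non-empty"
    n = len(human)
    exact_agreement = sum(h == l for h, l in zip(human, llm)) / n
    mean_gap = sum(h - l for h, l in zip(human, llm)) / n
    return {"exact_agreement": exact_agreement, "mean_human_minus_llm": mean_gap}
```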

The practical takeaway isn't to "always use the biggest model," but to implement a tiered deployment strategy: routing routine queries to faster, cheaper models while reserving state-of-the-art reasoning models for high-stakes, judgment-heavy tasks, as sketched below.
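A minimal sketch of what such a router could look like; the model stubs and the keyword trigger are purely illustrative stand-ins (a production system would use a real stakes classifier, not a keyword list):

```python
def ask_fast_model(query: str) -> str:
    """Placeholder for a low-latency, low-cost model (routine queries)."""
    raise NotImplementedError

def ask_reasoning_model(query: str) -> str:
    """Placeholder for a slower, more accurate reasoning model."""
    raise NotImplementedError

# Toy stand-in for a stakes classifier; illustrative only.
HIGH_STAKES_KEYWORDS = {"regulation", "compliance", "tax", "estate", "portfolio"}

def route_query(query: str) -> str:
    """Tiered deployment: send routine queries to the fast tier and reserve
    the expensive reasoning tier for judgment-heavy, high-stakes requests."""
    high_stakes = any(word in query.lower() for word in HIGH_STAKES_KEYWORDS)
    return ask_reasoning_model(query) if high_stakes else ask_fast_model(query)
```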
We built cfabenchmark.com to put this principle into practice: a framework that embodies this evaluation pipeline, built for the next generation of intelligent systems in finance. We invite researchers, practitioners, and institutions to explore and contribute.



