Ayushi Bhujade

Building Reliable AI Systems for High-Stakes Financial Judgment
Large language models have made remarkable progress in recent years. However, most real-world financial failures do not stem from a lack of knowledge — they stem from a lack of structure, grounding, and evaluability.
The CFA Level III exam is a representative example of this challenge. It requires candidates to:
Reason under uncertainty;
Apply judgment rather than recall;
Explain decisions clearly and defensibly; and
Meet strict grading and formatting expectations.
These requirements expose the limits of single-model, single-pass LLM deployments.
At Goodfin, we approached the CFA exam not as a benchmark to optimize for, but as a systems problem: how do you design an AI system that reasons with domain context, evaluates itself against professional standards, and produces outputs that are reliable under expert grading?
Research Foundations: From Benchmarks to System Design
This work builds on prior research conducted in collaboration with New York University, where we benchmarked 23 state-of-the-art large language models on mock CFA Level III exams across prompting strategies, grading approaches, and cost–latency tradeoffs (Ref).
Two core insights from this research directly informed our system architecture:
Task-specialized model selection: Different CFA tasks reward different strengths — constrained precision for MCQs, structured synthesis for essays — motivating a role-based, multi-model design rather than a single-model approach.
Evaluation-first orchestration: High-stakes reasoning requires rubric-aligned evaluation to be treated as a first-class system component, not a post-hoc check.
These insights shaped a production-oriented architecture that assigns explicit roles to models, grounds reasoning in CFA-specific knowledge, and incorporates evaluation into the execution loop.
From Single Models to Agentic and Deterministic Retrieval-First Systems
Most LLM applications rely on a single model prompted to generate an answer in one pass. While effective for low-risk tasks, this approach struggles when:
Reasoning paths matter as much as final answers
Outputs must be defensible and traceable
Errors must be explainable under professional review
Goodfin’s CFA framework adopts an agentic approach composed of multiple specialized models coordinated through a deterministic orchestration layer. The system explicitly separates:
Question understanding and classification
Context assembly through retrieval
Reasoning and answer generation
Evaluation and selection
Reliability emerges not from larger models, but from controlled interaction between grounded components.
Determinism is a design choice, not a constraint. Before any answer generation occurs, each CFA question is classified by format (essay vs. multiple choice) and routed through a predefined reasoning pathway. This orchestration governs:
Which retrieval pipeline is invoked
Which model is selected
How reasoning is structured
How outputs are selected or rejected
In high-stakes systems, deterministic orchestration is a prerequisite for repeatability, debuggability, and predictable behavior across runs.
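The routing step above can be sketched in a few lines. This is an illustrative reconstruction, not Goodfin's implementation: the `Route` fields, the route table, and the naive format classifier are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Route:
    retrieval_pipeline: str   # which RAG pipeline is invoked
    model: str                # which model is selected
    output_constraint: str    # how outputs are selected or rejected

# Predefined pathways: the same question format always yields the same route,
# which is what makes runs repeatable and debuggable.
ROUTES = {
    "essay": Route("essay_rag", "reasoning-optimized", "structured_prose"),
    "mcq": Route("mcq_rag", "general-purpose", "single_letter"),
}

def classify_format(question: str) -> str:
    """Naive format classifier: MCQs carry lettered options (A. / B. / C.)."""
    markers = ("A.", "B.", "C.")
    return "mcq" if all(m in question for m in markers) else "essay"

def route(question: str) -> Route:
    return ROUTES[classify_format(question)]
```

Because classification happens before any model call, a question can never drift between pipelines across runs.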
Context Assembly: Retrieval-Augmented Generation as a First-Class Stage
All CFA questions pass through a retrieval-augmented context assembly stage before any reasoning begins. Rather than relying solely on model priors, the system uses Retrieval-Augmented Generation (RAG) to surface relevant CFA concepts, frameworks, and grading expectations that inform downstream reasoning.
The vignette and question are combined into a structured search query, and the retrieved context is injected directly into the reasoning prompts. This grounding step ensures that both essay and multiple-choice answers reflect CFA-specific expectations while avoiding unnecessary prompt inflation.
By treating RAG as a dedicated system stage — separate from reasoning and generation — Goodfin ensures that context grounding is consistent, auditable, and reusable across question types.
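A minimal sketch of this stage, assuming a toy keyword-overlap retriever in place of a production vector store; the corpus entries, tokenizer, and scoring are placeholders chosen for illustration.

```python
import re

# Stand-in corpus; in production this would be a CFA-specific knowledge index.
CORPUS = [
    "Duration matching immunizes a liability against small parallel rate shifts.",
    "The Sharpe ratio divides excess return by total risk (standard deviation).",
    "Behavioral biases such as loss aversion distort rebalancing decisions.",
]

def build_query(vignette: str, question: str) -> str:
    # The vignette and question are combined into one structured search query.
    return f"{vignette.strip()} {question.strip()}"

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9%]+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank chunks by term overlap with the query; keep the top k.
    terms = tokenize(query)
    scored = sorted(CORPUS, key=lambda c: -len(terms & tokenize(c)))
    return scored[:k]

def assemble_prompt(vignette: str, question: str) -> str:
    # Retrieved context is injected directly into the reasoning prompt.
    context = "\n".join(retrieve(build_query(vignette, question)))
    return f"Context:\n{context}\n\nVignette:\n{vignette}\n\nQuestion:\n{question}"
```

Keeping retrieval behind its own function boundary is what makes the grounding step auditable and reusable across both question types.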
Essay Questions: RAG-Enhanced Chain-of-Thought Reasoning
CFA Level III essay questions demand structured explanation, synthesis, and defensible judgment. To address this, Goodfin employs a RAG-enhanced reasoning pipeline optimized for consistency and professional grading standards.
Execution Flow
RAG-Based Context Assembly: Relevant CFA knowledge is retrieved and provided to the reasoning model.
Direct Reasoning Generation: A reasoning-optimized model (o4-mini) generates essay answers using explicit chain-of-thought instructions.
Self-Consistency Sampling: Multiple independent reasoning samples are generated to reduce single-pass failure modes.
Deterministic Selection: The first successful, valid response is selected, ensuring predictable latency and cost.
The reasoning prompts explicitly instruct the model to focus on the exact question asked and to use retrieved context only when it is relevant to the answer.
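The sampling and selection steps can be sketched as follows. The model call is stubbed out with canned drafts, and "valid" is reduced to a minimum-length check; in a real system the stub would wrap the o4-mini API and validation would be richer.

```python
def generate_sample(prompt: str, seed: int) -> str:
    # Placeholder for a chain-of-thought model call; deterministic per seed.
    drafts = {
        0: "",  # simulated empty failure from one sample
        1: "Duration matching aligns asset and liability durations, so small "
           "parallel rate shifts change both sides equally.",
    }
    return drafts.get(seed, "Fallback reasoning sample.")

def is_valid(answer: str, min_len: int = 40) -> bool:
    # Simplified validity check standing in for structural validation.
    return len(answer.strip()) >= min_len

def answer_essay(prompt: str, n_samples: int = 3) -> str:
    # Deterministic first-valid selection: independent samples are tried in
    # order and the first one passing validation is returned, which bounds
    # latency and cost per question.
    for seed in range(n_samples):
        sample = generate_sample(prompt, seed)
        if is_valid(sample):
            return sample
    raise RuntimeError("no valid essay sample produced")
```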
Multiple-Choice Questions: Precision Through Constraint and RAG Grounding
Multiple-choice questions present a different challenge: avoiding over-explanation and ambiguity. Goodfin treats MCQs as a constrained decision problem, explicitly grounded through Retrieval-Augmented Generation (RAG).
Execution Flow
RAG-Based Context Retrieval: Relevant CFA knowledge is retrieved prior to reasoning to ground decision-making in domain-specific expectations.
Robust Option Parsing: Options are extracted using a layered parsing strategy with multiple fallbacks and strict validation.
RAG-Enhanced Reasoning: A general-purpose model (Gemini 2.5 Pro) performs chain-of-thought reasoning with self-consistency sampling, conditioned on retrieved CFA context.
Deterministic Output Enforcement: The system enforces single-letter answer selection with strict formatting guarantees.
By combining RAG grounding with strict output constraints and input validation, the system avoids ambiguity, formatting drift, and hedging — key failure modes in professional grading contexts.
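The two constraint mechanisms above, layered option parsing and single-letter enforcement, can be illustrated like this. The regex patterns and validation rules are assumptions for the sketch, not the production parser.

```python
import re

def parse_options(text: str) -> dict[str, str]:
    # Layered parsing: try the primary "A. text" style, then fall back to
    # "A) text"; strict validation requires all three options to be present.
    for pattern in (r"([A-C])\.\s*([^\n]+)", r"([A-C])\)\s*([^\n]+)"):
        opts = dict(re.findall(pattern, text))
        if set(opts) == {"A", "B", "C"}:
            return opts
    raise ValueError("could not parse options")

def enforce_single_letter(raw_model_output: str) -> str:
    # Accept only a lone A/B/C (optionally with a trailing period); verbose
    # or hedged outputs are rejected outright rather than repaired.
    m = re.fullmatch(r"\s*([A-C])\.?\s*", raw_model_output)
    if not m:
        raise ValueError("output violates single-letter constraint")
    return m.group(1)
```

Rejecting malformed outputs instead of patching them keeps the failure mode visible: a resampled answer is auditable, a silently repaired one is not.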
Evaluation Aligned With Professional Standards
Evaluation is embedded directly into the system design rather than applied after generation.
MCQs are graded against ground-truth answers
Essays are graded using CFA-style rubrics on a 0–4 scale
Structural coverage, content alignment, and rubric adherence are measured
This closed-loop evaluation framework enables continuous iteration and transforms reasoning quality into a measurable system property.
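A minimal sketch of the two grading paths, assuming rubric points can be checked as required phrases; the coverage-to-score mapping onto the 0-4 scale is invented for illustration.

```python
def grade_mcq(predicted: str, ground_truth: str) -> int:
    # MCQs are graded directly against ground-truth answers.
    return 1 if predicted == ground_truth else 0

def grade_essay(answer: str, rubric_points: list[str]) -> float:
    # Rubric-aligned grading: measure structural coverage of required points,
    # then map coverage onto the 0-4 essay scale.
    covered = sum(1 for point in rubric_points if point.lower() in answer.lower())
    return round(4 * covered / len(rubric_points), 2)
```

Because both graders are plain functions over system outputs, every generation run yields a score that can be tracked across iterations of the pipeline.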
Results and Broader Implications
Under exam-aligned evaluation, the Goodfin CFA framework achieves:
91.95% rubric-based essay performance
86.67% MCQ accuracy
An overall score exceeding the CFA Level III passing threshold of 65%
More importantly, this system demonstrates a broader principle: reliable financial AI is built through orchestration, retrieval, constraint, and evaluation — not by scaling a single model.
While the CFA exam serves as a proving ground, the architecture generalizes to any domain where judgment, explanation, and accountability matter.