Ayushi Bhujade
How Goodfin Orchestrates Multi-Model AI for CFA-Level Reasoning
Building Reliable AI Systems for High-Stakes Financial Judgment
Large language models have made remarkable progress in recent years. However, most real-world financial problems do not fail due to lack of knowledge — they fail due to lack of structure, consistency, and evaluability.
The CFA Level III exam is a representative example of this challenge. It requires candidates to:
Reason under uncertainty;
Apply judgment rather than recall;
Explain decisions clearly and defensibly; and
Meet strict grading and formatting expectations.
These requirements expose the limits of single-model, single-pass LLM deployments.
At Goodfin, we approached the CFA exam not as a benchmark to optimize for, but as a systems problem: How do you design an AI that reasons, evaluates itself, and produces answers that are reliable under professional grading standards?
From Single Models to Agentic Systems
Most LLM-based applications rely on a single model prompted to generate an answer in one pass. While effective for many tasks, this approach struggles when reasoning paths matter as much as final answers, when confidence and structure are graded, and when errors must be explainable and traceable.
Goodfin’s CFA framework adopts a different approach. Rather than scaling a single model, we built an agentic system composed of multiple specialized models, coordinated through a deterministic orchestration layer. This system explicitly separates question understanding, reasoning strategy selection, answer generation, evaluation and grading, and final answer selection. This separation is intentional. Reliability emerges not from larger models, but from controlled interaction between components.
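To make that separation concrete, here is a minimal sketch of how an orchestration layer of this kind might compose its stages. The names and signatures are illustrative assumptions, not Goodfin's actual interfaces:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    score: float  # graded against exam-style expectations


def orchestrate(
    question: str,
    classify: Callable[[str], str],                                # question understanding
    select_strategy: Callable[[str], Callable[[str], List[str]]],  # reasoning strategy selection
    grade: Callable[[str, str], float],                            # evaluation and grading
) -> str:
    q_type = classify(question)
    generate = select_strategy(q_type)                             # answer generation strategy
    candidates = [Candidate(t, grade(t, q_type)) for t in generate(question)]
    return max(candidates, key=lambda c: c.score).text             # final answer selection
```

Each stage is an explicit, swappable component, which is what makes the interaction between them controllable in the first place.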
Deterministic Orchestration as a Reliability Primitive
Before generating any text, the system first classifies the type of CFA question: for example, whether it is multiple-choice or essay-based, and whether it requires calculation, synthesis, or structured selection. Based on this classification, it then routes the question to the appropriate reasoning and answering strategy, ensuring that each question is handled in the most effective way.
This analysis is deterministic and explainable, dictating both the reasoning approach and the model selection that follows. In high-stakes systems, determinism at the orchestration layer is not a limitation — it is a prerequisite. It ensures repeatability, debuggability, and predictable behavior across runs.
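As an illustration, deterministic routing can be as simple as rule-based classification over the question format. The labels and rules below are assumptions made for the sketch, not Goodfin's production logic:

```python
import re
from typing import List, Optional

ROUTES = {
    "multiple_choice": "zero_shot_single_letter",
    "calculation": "stepwise_reasoning",
    "essay": "self_consistent_multi_model",
}


def classify_question(question: str, choices: Optional[List[str]] = None) -> str:
    """Rule-based and repeatable: the same input always takes the same route."""
    if choices:                                      # answer options supplied -> constrained decision task
        return "multiple_choice"
    if re.search(r"\b(calculate|compute)\b", question, re.IGNORECASE):
        return "calculation"
    return "essay"                                   # default: structured explanation


def route(question: str, choices: Optional[List[str]] = None) -> str:
    return ROUTES[classify_question(question, choices)]
```

Because no model is involved at this layer, two runs over the same question can never disagree about how it should be handled.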
Essay Questions: Self-Consistent, Multi-Model Reasoning
CFA essay questions demand structured explanation rather than short answers. To address this, Goodfin employs a multi-stage reasoning pipeline:
Baseline Generation: A general-purpose LLM (Gemini 2.5 Pro) produces an initial essay response based on the vignette and question.
Self-Consistent Reasoning: This response, along with embedded few-shot examples, is passed to a reasoning-optimized model (o4-mini). The system performs multiple independent reasoning passes, each generating a refined answer.
Confidence-Weighted Selection: The system evaluates these candidates and selects the response that demonstrates the strongest alignment with CFA-style grading expectations.
By separating generation, reasoning, and evaluation, the system avoids single-pass failure modes and improves consistency across complex prompts. Under rubric-based grading, this approach achieves 84.56% essay accuracy.
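A compact sketch of that pipeline is shown below. The model calls are abstracted as callables standing in for Gemini 2.5 Pro and o4-mini, and grade_alignment stands in for the rubric-alignment scorer; the prompt wording and pass count are assumptions, not the production configuration:

```python
from typing import Callable, List, Tuple


def answer_essay(
    vignette: str,
    question: str,
    baseline_model: Callable[[str], str],     # stand-in for the general-purpose LLM
    reasoning_model: Callable[[str], str],    # stand-in for the reasoning-optimized model
    grade_alignment: Callable[[str], float],  # stand-in for the rubric-alignment scorer
    passes: int = 5,
) -> str:
    # 1. Baseline generation from the general-purpose model
    baseline = baseline_model(f"{vignette}\n\n{question}")

    # 2. Self-consistent reasoning: multiple independent refinement passes
    prompt = f"{vignette}\n\n{question}\n\nDraft answer:\n{baseline}\n\nRefine this answer."
    candidates: List[str] = [reasoning_model(prompt) for _ in range(passes)]

    # 3. Confidence-weighted selection against CFA-style grading expectations
    scored: List[Tuple[str, float]] = [(c, grade_alignment(c)) for c in candidates]
    best, _ = max(scored, key=lambda pair: pair[1])
    return best
```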
Multiple-Choice Questions: Precision Through Constraint
Multiple-choice questions present a different challenge: over-explanation. Goodfin treats MCQs as a constrained decision task, ensuring that the model produces precise, unambiguous answers.
The system enforces zero-shot answering without intermediate reasoning, strict output constraints (single answer letter only), deterministic parsing, and sampling tuned for decisiveness. By limiting what the model can produce, the system avoids ambiguity, hedging, and formatting drift. This constraint-first approach delivers 78.33% MCQ accuracy and demonstrates the reliability of the system under professional evaluation standards.
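The sketch below shows what a constraint-first MCQ wrapper might look like, with the model call abstracted as a callable and the single-letter contract enforced by deterministic parsing; the prompt text and regular expression are illustrative assumptions:

```python
import re
from typing import Callable, Dict

ANSWER_PATTERN = re.compile(r"\b([ABC])\b")  # CFA item-set questions offer three options


def answer_mcq(question: str, choices: Dict[str, str], ask_model: Callable[[str], str]) -> str:
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in choices.items())
        + "\nRespond with a single letter (A, B, or C) and nothing else."
    )
    raw = ask_model(prompt)                               # zero-shot: no intermediate reasoning requested
    match = ANSWER_PATTERN.search(raw.strip().upper())
    if match is None:
        raise ValueError(f"Unparseable answer: {raw!r}")  # formatting drift is rejected, not guessed at
    return match.group(1)                                 # deterministic, single-letter output
```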
Evaluation Aligned With Professional Standards
Evaluation is a first-class component of the system, not an afterthought. Goodfin mirrors the CFA exam in its scoring framework: MCQs are graded against ground truth, essays are graded using CFA-style rubrics (0–4 scale), and both are weighted equally.
Essay evaluation combines structural content coverage, lexical similarity to reference answers, and independent rubric-based grading. This closed-loop evaluation enables continuous measurement, iteration, and improvement, transforming reasoning into a measurable system property.
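A minimal scoring sketch, assuming the weighting described above (MCQs scored against ground truth, essays normalized from the 0–4 rubric, both weighted equally), might look like this:

```python
from typing import List


def mcq_score(predictions: List[str], answer_key: List[str]) -> float:
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return 100.0 * correct / len(answer_key)


def essay_score(rubric_points: List[float]) -> float:
    # Each essay is graded on a 0-4 rubric; normalize to a percentage
    return 100.0 * sum(rubric_points) / (4 * len(rubric_points))


def overall_score(mcq: float, essay: float) -> float:
    return 0.5 * mcq + 0.5 * essay  # equal weighting, mirroring the exam structure
```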
Results and Implications
Under exam-aligned evaluation, the Goodfin CFA Agentic Framework achieves:
78.33% MCQ accuracy
84.56% rubric-based essay scores
An overall score above the CFA Level III passing threshold of 65%.
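Assuming the equal weighting described above, this works out to roughly 0.5 × 78.33% + 0.5 × 84.56% ≈ 81.4%, well clear of that threshold.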
More importantly, this framework demonstrates a broader principle: reliable AI systems are built through orchestration, constraint, and evaluation — not by scaling a single model. While the CFA exam serves as a proving ground, the architecture generalizes to any domain where judgment, explanation, and accountability matter.