January 9, 2025

Building an AI Financial Analysis System for Copilot Money

AI personal-finance MCP chrome-extension

Is it cheaper for two people to go to the movies at Red Cinemas or Golden Ticket?

Copilot Money (which I strongly recommend for personal financial management) can tell you exactly what you spent at each theater, down to the penny. What it can’t do is group those transactions by visit, include the Fandango ticket purchases and parking fees that belong to the same outing, and compare the all-in cost per person across venues. Its budget-vs.-actuals feature only works for the current month, and any custom analysis means exporting to a spreadsheet.

Over the holiday break, I decided to build something better: a natural-language financial reporting system using Claude Code. Ask it a question in plain English. Get a real answer.

Three Layers, One Rule

The system has three parts:

API Integration Layer: I reverse-engineered the Copilot Money GraphQL API to get programmatic access to transaction data
Agent Layer: An MCP server embedded in a Chrome extension that translates natural language queries into structured API calls
Analysis Layer: Domain-specific tools that perform all calculations and aggregations

high-level-uml-diagram

The rule: the AI orchestrates, the tools compute, and the LLM never does math. If you ask Claude to add up twelve transaction amounts, it will get it wrong often enough to make the results useless for financial analysis. Every arithmetic operation runs in deterministic code. The AI decides what to calculate. The code does the calculating.

The Movie Theater Problem

The hard part wasn’t wiring up APIs. It was encoding financial domain knowledge into something an AI could use.

initial-prompt

A traditional system would search for exact merchant name matches. But “movies” isn’t a merchant. It’s a concept. Going to the movies might mean a Fandango ticket purchase, a parking fee at a garage near the theater, and a concession stand transaction. Three different merchants, one outing. I built the agent to take any query and research alternative keywords. When you ask about “movies,” it searches for cinema, theater, ticket, and related terms automatically.
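As a rough illustration of concept-to-keyword expansion, here is a minimal TypeScript sketch. It is not the system's actual code: the `RELATED_TERMS` table, the `Transaction` shape, and both function names are hypothetical, and a fixed lookup table stands in for the research step the agent performs on its own.

```typescript
// Hypothetical transaction shape; the real Copilot Money fields may differ.
interface Transaction {
  merchant: string;
  amountCents: number; // integer cents
  date: string; // ISO date, e.g. "2024-11-05"
}

// Hypothetical synonym table. In the system described above, the agent
// researches alternative keywords instead of reading a fixed map.
const RELATED_TERMS: Record<string, string[]> = {
  movies: ["cinema", "theater", "ticket"],
};

// Expand a concept into the lowercase search terms to try.
function expandQuery(concept: string): string[] {
  const key = concept.trim().toLowerCase();
  return [key, ...(RELATED_TERMS[key] ?? [])];
}

// Keep transactions whose merchant name contains any expanded term.
function searchByConcept(concept: string, txns: Transaction[]): Transaction[] {
  const terms = expandQuery(concept);
  return txns.filter((t) => {
    const name = t.merchant.toLowerCase();
    return terms.some((term) => name.includes(term));
  });
}
```

Substring matching like this is deliberately crude. It would find "Red Cinemas" through the term "cinema" but would never find a Fandango charge, which is why related merchants still need their own handling.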

first-response

red-cinema-transactions

Notice the formatting. That’s not accidental. Clean, readable output required careful prompt engineering. The agent found Red Cinemas right away. But Golden Ticket is missing. Its name is unusual enough that standard fuzzy matching didn’t catch it.

So what happens when the AI misses something?

You tell it.

second-prompt

second-result

Now look at what’s happening. The system is grouping related transactions: multiple purchases during a single movie visit are automatically associated together. This wasn’t emergent AI behavior. I specifically designed the getMultiMerchantTransactions tool for this exact pattern, because it applies to any entertainment expense. Concerts. Sporting events. Any activity where tickets, parking, and concessions are separate charges but conceptually one outing.
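The grouping idea can be sketched in a few lines of TypeScript. This is an illustration under a loud assumption, not the real getMultiMerchantTransactions logic: it treats every charge on the same calendar day as one visit, which would miss tickets bought in advance. The `Transaction` and `Visit` shapes and the `groupIntoVisits` name are hypothetical.

```typescript
// Hypothetical transaction shape; the real Copilot Money fields may differ.
interface Transaction {
  merchant: string;
  amountCents: number; // integer cents, so sums stay exact
  date: string; // ISO date, e.g. "2024-11-05"
}

interface Visit {
  date: string;
  merchants: string[];
  totalCents: number;
  transactions: Transaction[];
}

// Group already-filtered transactions into visits, one visit per calendar day.
// Assumption: same-day charges belong to the same outing.
function groupIntoVisits(txns: Transaction[]): Visit[] {
  const byDate = new Map<string, Transaction[]>();
  for (const t of txns) {
    const bucket = byDate.get(t.date) ?? [];
    bucket.push(t);
    byDate.set(t.date, bucket);
  }
  return Array.from(byDate.entries())
    .sort(([a], [b]) => a.localeCompare(b)) // ISO dates sort chronologically
    .map(([date, items]) => ({
      date,
      merchants: Array.from(new Set(items.map((t) => t.merchant))),
      totalCents: items.reduce((sum, t) => sum + t.amountCents, 0),
      transactions: items,
    }));
}
```

The summing happens inside the tool, in code, so the model only ever sees a finished total per visit.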

Look at the November 5th visit. It shows two different merchant names, “Golden Ticket Cinegreensboro” and “Golden Ticket Cine,” for the same place. The fuzzy logic that handles search queries also has to handle vendor name matching.
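One simple way to reconcile names like these is to normalize them and check whether one is a prefix of the other. The TypeScript sketch below shows that idea only. The prefix rule, the `minLength` threshold, and both function names are assumptions for illustration, not the matching logic the system actually uses.

```typescript
// Lowercase, replace punctuation with spaces, and collapse whitespace.
function normalizeMerchant(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9 ]+/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Treat two names as the same vendor when one normalized name is a prefix
// of the other. minLength (an arbitrary choice) guards against very short
// names matching everything that happens to start with them.
function sameVendor(a: string, b: string, minLength = 8): boolean {
  const x = normalizeMerchant(a);
  const y = normalizeMerchant(b);
  const [shorter, longer] = x.length <= y.length ? [x, y] : [y, x];
  return shorter.length >= minLength && longer.startsWith(shorter);
}
```

With this rule, `sameVendor("Golden Ticket Cinegreensboro", "Golden Ticket Cine")` is true, while an unrelated pair such as "Golden Ticket Cine" and "Red Cinemas" is false.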

grouped-transactions

When I mentioned Fandango, the agent understood that even though Fandango isn’t a theater, those transactions belong to the movie visit cost. It also maintained context from the earlier prompts, so it knew which theaters we were already discussing. That doesn’t happen automatically. Context management had to be carefully programmed.

The result:

third-prompt

third-result

movie-cost-bar-chart

The AI groups four raw bank transactions from three different merchants into a single Golden Ticket outing and compares it to a single Red Cinemas charge.

Red Cinemas vs. Golden Ticket, all-in cost per person, with every related transaction attributed correctly. A question that would have taken 30 minutes of spreadsheet work, answered in three conversational turns.

Eleven Tools and One Calculator

I designed 11 specialized MCP tools, each doing one thing, plus a calculate tool that handles all the math:

Tool Name	Purpose	Design Rationale
getCategoryBudgetVsActual	Compare budgeted amounts to actual spending	Extends Copilot’s built-in feature to support historical analysis
getAccountBalances	Retrieve balances with filtering	Foundation for net worth calculations
getNetWorth	Calculate and track net worth over time	Aggregates across accounts with temporal analysis
getMonthlySpending	Spending patterns by category and month	Enables trend detection and comparative analysis
getTransactions	Advanced transaction filtering	Core search primitive for all queries
getMultiMerchantTransactions	Multi-merchant searches with cost attribution	Handles the “movie theater” problem: groups related transactions from different merchants
discoverMerchantsFromTransactions	Find merchants by business type	Enables exploratory analysis: “Where did I buy groceries last year?”
getUpcomingBills	Recurring payment tracking	Cash flow planning support
getTags	Transaction tag enumeration	Enables tag-based analysis
getCategoryGroups	Category hierarchy discovery	Supports budget rollup queries
getBudgetUtilizationTrend	Multi-month budget performance	Long-term budget adherence tracking
calculate	All mathematical operations	The AI never touches arithmetic

The calculate tool is the one that matters most. By mandating that all math happens in deterministic code, I eliminated an entire class of AI hallucination. The agent can reason about what to calculate, but it cannot perform the calculation itself.
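To make the division of labor concrete, here is a minimal TypeScript sketch of what a deterministic calculator of this kind might look like. The request shape, the three operation names, and the choice of integer cents are all assumptions for illustration; the real calculate tool's interface is not shown in this post.

```typescript
// Hypothetical operation set; the real tool may support more.
type Operation = "sum" | "average" | "perPerson";

interface CalculateRequest {
  operation: Operation;
  amountsCents: number[]; // integer cents, to avoid floating-point drift
  people?: number; // used only by "perPerson"
}

// Deterministic arithmetic: the model chooses the operation and inputs,
// and this function produces the number. Results are in cents, rounded
// to the nearest cent where division is involved.
function calculate(req: CalculateRequest): number {
  if (!req.amountsCents.every((a) => Number.isInteger(a))) {
    throw new Error("amounts must be integer cents");
  }
  const total = req.amountsCents.reduce((sum, a) => sum + a, 0);
  switch (req.operation) {
    case "sum":
      return total;
    case "average":
      if (req.amountsCents.length === 0) {
        throw new Error("cannot average an empty list");
      }
      return Math.round(total / req.amountsCents.length);
    case "perPerson": {
      const people = req.people ?? 0;
      if (!Number.isInteger(people) || people <= 0) {
        throw new Error("people must be a positive integer");
      }
      return Math.round(total / people);
    }
    default:
      throw new Error("unknown operation");
  }
}
```

For example, `calculate({ operation: "perPerson", amountsCents: [2000, 500, 1250, 300], people: 2 })` returns 2025, that is, $20.25 per person. Invalid input throws instead of producing a plausible-looking number, which is the behavior you want from a financial tool.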

Why Copilot Money Hasn’t Built This

If this is so useful, why doesn’t Copilot offer it natively?

Because it’s genuinely hard. Production systems need comprehensive error handling, edge case management, security considerations, and user interface polish. This prototype took a few days because I own the problem space and built exactly what I need. A commercial product would require orders of magnitude more effort. My prototype handles movie budgets. What if it were handling taxes?

The pattern itself (agent orchestration + specialized tools + deterministic processing) scales. I’ve since used the same architecture for test automation analysis in a load testing application, and it holds up in far more complex domains. But every domain brings its own version of the movie theater problem, and those are the parts the AI can’t solve for you.