Is it cheaper for two people to go to the movies at Red Cinemas or Golden Ticket?
Copilot Money (which I strongly recommend for personal financial management) can tell you exactly what you spent at each theater, down to the penny. What it can’t do is group those transactions by visit, include the Fandango ticket purchases and parking fees that belong to the same outing, and compare the all-in cost per person across venues. Its budget-vs.-actuals view only works for the current month. Custom analysis means exporting to a spreadsheet.
Over the holiday break, I decided to build something better: a natural-language financial reporting system using Claude Code. Ask it a question in plain English. Get a real answer.
Three Layers, One Rule
The system has three parts:
- API Integration Layer: I reverse-engineered the Copilot Money GraphQL API to get programmatic access to transaction data
- Agent Layer: An MCP server embedded in a Chrome extension that translates natural language queries into structured API calls
- Analysis Layer: Domain-specific tools that perform all calculations and aggregations

The rule: the AI orchestrates, the tools compute, and the LLM never does math. If you ask Claude to add up twelve transaction amounts, it will get it wrong often enough to make the results useless for financial analysis. Every arithmetic operation runs in deterministic code. The AI decides what to calculate. The code does the calculating.
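A minimal sketch of that rule in TypeScript, under stated assumptions: the `ToolCall` shape, the `dispatch` function, and the `amountsInCents` argument are hypothetical names chosen for illustration, not the actual implementation. The model only picks a tool name and arguments; the addition itself runs in ordinary code.

```typescript
// Hypothetical shape of a tool call chosen by the model: a tool name plus JSON arguments.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Type guard: true only for finite whole numbers (amounts are kept in integer cents).
function isIntegerCents(value: unknown): value is number {
  return typeof value === "number" && isFinite(value) && Math.floor(value) === value;
}

// Registry of deterministic handlers. Only a summing "calculate" is sketched here;
// its argument format (amountsInCents) is an assumption.
const tools: Record<string, ToolHandler> = {
  calculate: (args) => {
    const amounts = args["amountsInCents"];
    if (!Array.isArray(amounts) || !amounts.every(isIntegerCents)) {
      throw new Error("calculate expects an array of integer cents");
    }
    let sum = 0;
    for (const cents of amounts) sum += cents;
    return sum;
  },
};

// Route a model-issued call to its handler; unknown tool names are rejected.
function dispatch(call: ToolCall): unknown {
  const handler = Object.prototype.hasOwnProperty.call(tools, call.name)
    ? tools[call.name]
    : undefined;
  if (handler === undefined) {
    throw new Error(`Unknown tool: ${call.name}`);
  }
  return handler(call.args);
}

// The model decides WHAT to add up; the handler does the adding.
const total = dispatch({ name: "calculate", args: { amountsInCents: [1250, 899, 400] } });
```

Working in integer cents rather than floating-point dollars is a design choice of this sketch: it keeps every sum exact.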
The Movie Theater Problem
The hard part wasn’t wiring up APIs. It was encoding financial domain knowledge into something an AI could use.

A traditional system would search for exact merchant name matches. But “movies” isn’t a merchant. It’s a concept. Going to the movies might mean a Fandango ticket purchase, a parking fee at a garage near the theater, and a concession stand transaction. Three different merchants, one outing. I built the agent to take any query and research alternative keywords. When you ask about “movies,” it searches for cinema, theater, ticket, and related terms automatically.
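The keyword expansion can be sketched roughly as follows. This is an illustration under assumptions, not the agent's actual logic: in the system described here the agent researches related terms itself, so the fixed `relatedTerms` table, the `expandQuery` and `searchByConcept` functions, and the sample merchants are all hypothetical stand-ins.

```typescript
interface Transaction {
  merchant: string;
  amountInCents: number;
}

// Hypothetical concept-to-keyword table standing in for the agent's own research step.
const relatedTerms: Record<string, string[]> = {
  movies: ["cinema", "theater", "ticket"],
};

// Return the concept itself plus any related keywords known for it.
function expandQuery(concept: string): string[] {
  const key = concept.toLowerCase();
  const extra = Object.prototype.hasOwnProperty.call(relatedTerms, key)
    ? relatedTerms[key]
    : undefined;
  return extra ? [key].concat(extra) : [key];
}

// Keep transactions whose merchant name contains any expanded keyword.
function searchByConcept(concept: string, transactions: Transaction[]): Transaction[] {
  const keywords = expandQuery(concept);
  return transactions.filter((t) => {
    const merchant = t.merchant.toLowerCase();
    return keywords.some((keyword) => merchant.indexOf(keyword) !== -1);
  });
}

// Invented sample data, for illustration only.
const sample: Transaction[] = [
  { merchant: "Red Cinemas", amountInCents: 2400 },
  { merchant: "Downtown Theater Parking", amountInCents: 800 },
  { merchant: "Corner Grocery", amountInCents: 5200 },
];

// "movies" matches no merchant name directly, but the expanded terms do.
const hits = searchByConcept("movies", sample).map((t) => t.merchant);
```

A plain substring match like this is deliberately naive; it shows only why a concept query needs more than one search term.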


Notice the formatting. That’s not accidental. Clean, readable output required careful prompt engineering. The agent found Red Cinemas right away. But Golden Ticket is missing. Its name is unusual enough that standard fuzzy matching didn’t catch it.
So what happens when the AI misses something?
You tell it.


Now look at what’s happening. The system is grouping related transactions: multiple purchases during a single movie visit are automatically associated together. This wasn’t emergent AI behavior. I specifically designed the getMultiMerchantTransactions tool for this exact pattern, because it applies to any entertainment expense. Concerts. Sporting events. Any activity where tickets, parking, and concessions are separate charges but conceptually one outing.
Look at the November 5th visit. Two different merchant names: “Golden Ticket Cinegreensboro” and “Golden Ticket Cine.” The fuzzy logic that handles search queries also has to handle vendor name matching. Same place.
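One simple way to treat those two names as the same vendor is a normalized-prefix check, sketched below. This heuristic is an assumption made for illustration and is not the fuzzy logic the system actually uses; `normalizeMerchant`, `isSameVendor`, `groupVendorNames`, and the `minPrefixLength` threshold are hypothetical.

```typescript
// Lowercase the name and drop everything except letters and digits.
function normalizeMerchant(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Heuristic (an assumption): two names refer to the same vendor when one
// normalized name is a prefix of the other and that prefix is long enough
// to be meaningful.
function isSameVendor(a: string, b: string, minPrefixLength = 8): boolean {
  const x = normalizeMerchant(a);
  const y = normalizeMerchant(b);
  const shorter = x.length <= y.length ? x : y;
  const longer = x.length <= y.length ? y : x;
  return shorter.length >= minPrefixLength && longer.indexOf(shorter) === 0;
}

// Collect raw merchant names into groups, one group per vendor.
function groupVendorNames(names: string[]): string[][] {
  const groups: string[][] = [];
  for (const name of names) {
    let placed = false;
    for (const group of groups) {
      if (group.some((member) => isSameVendor(member, name))) {
        group.push(name);
        placed = true;
        break;
      }
    }
    if (!placed) groups.push([name]);
  }
  return groups;
}

const vendorGroups = groupVendorNames([
  "Golden Ticket Cinegreensboro",
  "Golden Ticket Cine",
  "Red Cinemas",
]);
// vendorGroups: [["Golden Ticket Cinegreensboro", "Golden Ticket Cine"], ["Red Cinemas"]]
```

A prefix rule handles truncated statement names like these, but it would not catch abbreviations or reordered words, which is where real fuzzy matching earns its keep.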

When I mentioned Fandango, the agent understood that even though Fandango isn’t a theater, those transactions count toward the cost of the movie visit. It also maintained context from the earlier prompts, so it knew which theaters we were already discussing. That doesn’t happen automatically. Context management had to be carefully programmed.
The result:



Red Cinemas vs. Golden Ticket, all-in cost per person, with every related transaction attributed correctly. A question that would have taken 30 minutes of spreadsheet work, answered in three conversational turns.
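The comparison itself reduces to a small amount of deterministic code once each charge has been attributed to a venue. The sketch below assumes that attribution has already happened upstream; the `AttributedCharge` and `VenueSummary` shapes and the `summarizeByVenue` function are hypothetical, and every date and amount is invented for illustration.

```typescript
interface AttributedCharge {
  date: string;          // date of the outing, ISO format
  venue: string;         // theater the charge was attributed to
  merchant: string;      // merchant name as it appears on the statement
  amountInCents: number;
}

interface VenueSummary {
  venue: string;
  visits: number;
  totalInCents: number;
  perPersonPerVisitInCents: number;
}

// Total every charge per venue, count distinct outing dates as visits,
// and divide by visits and party size, rounded to the nearest cent.
function summarizeByVenue(charges: AttributedCharge[], partySize: number): VenueSummary[] {
  if (partySize < 1 || Math.floor(partySize) !== partySize) {
    throw new Error("partySize must be a positive integer");
  }
  const running: { venue: string; dates: string[]; totalInCents: number }[] = [];
  for (const charge of charges) {
    let entry = running.filter((r) => r.venue === charge.venue)[0];
    if (!entry) {
      entry = { venue: charge.venue, dates: [], totalInCents: 0 };
      running.push(entry);
    }
    entry.totalInCents += charge.amountInCents;
    if (entry.dates.indexOf(charge.date) === -1) entry.dates.push(charge.date);
  }
  return running.map((r) => ({
    venue: r.venue,
    visits: r.dates.length,
    totalInCents: r.totalInCents,
    perPersonPerVisitInCents: Math.round(r.totalInCents / (r.dates.length * partySize)),
  }));
}

// Invented sample data: tickets, parking, and concessions attributed to two venues.
const charges: AttributedCharge[] = [
  { date: "2025-01-04", venue: "Red Cinemas", merchant: "Red Cinemas", amountInCents: 3000 },
  { date: "2025-01-04", venue: "Red Cinemas", merchant: "City Parking Garage", amountInCents: 800 },
  { date: "2025-01-18", venue: "Red Cinemas", merchant: "Red Cinemas", amountInCents: 2800 },
  { date: "2025-01-11", venue: "Golden Ticket", merchant: "Fandango", amountInCents: 2400 },
  { date: "2025-01-11", venue: "Golden Ticket", merchant: "Golden Ticket Cinegreensboro", amountInCents: 1200 },
  { date: "2025-01-11", venue: "Golden Ticket", merchant: "Golden Ticket Cine", amountInCents: 600 },
];

const summary = summarizeByVenue(charges, 2);
// Red Cinemas: 2 visits, 6600 cents total, 1650 cents per person per visit.
// Golden Ticket: 1 visit, 4200 cents total, 2100 cents per person per visit.
```

Treating each distinct date at a venue as one visit is a simplification; two showings on the same day would need a finer grouping key.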
Eleven Tools and One Calculator
I designed 11 specialized MCP tools, each doing one thing, plus a calculate tool that handles all the math:
| Tool Name | Purpose | Design Rationale |
|---|---|---|
| getCategoryBudgetVsActual | Compare budgeted amounts to actual spending | Extends Copilot’s built-in feature to support historical analysis |
| getAccountBalances | Retrieve balances with filtering | Foundation for net worth calculations |
| getNetWorth | Calculate and track net worth over time | Aggregates across accounts with temporal analysis |
| getMonthlySpending | Spending patterns by category and month | Enables trend detection and comparative analysis |
| getTransactions | Advanced transaction filtering | Core search primitive for all queries |
| getMultiMerchantTransactions | Multi-merchant searches with cost attribution | Handles the “movie theater” problem: groups related transactions from different merchants |
| discoverMerchantsFromTransactions | Find merchants by business type | Enables exploratory analysis: “Where did I buy groceries last year?” |
| getUpcomingBills | Recurring payment tracking | Cash flow planning support |
| getTags | Transaction tag enumeration | Enables tag-based analysis |
| getCategoryGroups | Category hierarchy discovery | Supports budget rollup queries |
| getBudgetUtilizationTrend | Multi-month budget performance | Long-term budget adherence tracking |
| calculate | All mathematical operations | The AI never touches arithmetic |
The calculate tool is the one that matters most. By mandating that all math happens in deterministic code, I eliminated an entire class of AI hallucination. The agent can reason about what to calculate, but it cannot perform the calculation itself.
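What a deterministic calculate tool might look like, as a hedged sketch: the request shape, the two operations shown, and the choice to work in integer cents are all assumptions for illustration, since the post does not describe the tool's actual interface.

```typescript
// Hypothetical request format: the model fills this in, the code evaluates it.
type CalculateRequest =
  | { op: "sum"; amountsInCents: number[] }
  | { op: "divide"; amountInCents: number; divisor: number };

// Reject anything that is not a finite whole number of cents.
function requireIntegerCents(value: number): number {
  if (!isFinite(value) || Math.floor(value) !== value) {
    throw new Error(`Expected integer cents, got ${value}`);
  }
  return value;
}

function calculate(request: CalculateRequest): number {
  switch (request.op) {
    case "sum": {
      let total = 0;
      for (const cents of request.amountsInCents) total += requireIntegerCents(cents);
      return total;
    }
    case "divide": {
      if (request.divisor === 0 || !isFinite(request.divisor)) {
        throw new Error("divisor must be a finite, nonzero number");
      }
      // Round to the nearest cent so the result stays an integer.
      return Math.round(requireIntegerCents(request.amountInCents) / request.divisor);
    }
  }
}

// Render integer cents as a dollar string for display.
function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const absolute = Math.abs(cents);
  const dollars = Math.floor(absolute / 100);
  const remainder = absolute % 100;
  return `${sign}$${dollars}.${remainder < 10 ? "0" : ""}${remainder}`;
}

// Example: total an outing, then split it between two people.
const outingTotal = calculate({ op: "sum", amountsInCents: [1250, 899, 400] });
const perPerson = calculate({ op: "divide", amountInCents: outingTotal, divisor: 2 });
const display = formatCents(perPerson); // "$12.75"
```

Validating inputs and throwing on bad values, rather than returning a best guess, keeps the tool from quietly producing a wrong number the agent would then report with confidence.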
Why Copilot Money Hasn’t Built This
If this is so useful, why doesn’t Copilot offer it natively?
Because it’s genuinely hard. Production systems need comprehensive error handling, edge case management, security considerations, and user interface polish. This prototype took a few days because I own the problem space and built exactly what I need. A commercial product would require orders of magnitude more effort. My prototype handles movie budgets. What if it was handling taxes?
The pattern itself (agent orchestration + specialized tools + deterministic processing) scales. I’ve since used the same architecture for test automation analysis in a load testing application, and it holds up in far more complex domains. But every domain brings its own version of the movie theater problem, and those are the parts the AI can’t solve for you.