I wanted to find an efficient algorithm for finding missing or problematic transactions with YNAB Debt Sync. The naïve O(n²) approach of comparing all transactions from both budgets would obviously not scale. This post explains how I achieved an average time complexity of O(n·log(n)).
Thinking about the problem, it occurred to me that I could use the mirrored nature of double-entry bookkeeping to my advantage by sorting each budget's transactions by their amount. The trick is to sort the second budget's transactions in opposite order to the first budget's transactions and, for each pair of transactions, compare the first budget's transaction amount with the inverse of the second budget's transaction amount. For the best-case scenario where the budgets tally up, this results in linear time complexity.
You can see an example in the next figure showing two budgets, called "This budget" and "Other budget", whose transactions are sorted in this mirrored fashion. This is the best-case scenario: both budgets have the same number of transactions, and each transaction in one budget has its opposite transaction in the other budget.
Using a common sort algorithm, like quicksort, the complexity for the algorithm's best-case scenario is O(2(n·log(n)) + n), or O(n·log(n)).
The best-case scenario isn't of much interest, though, as the use-case for the algorithm is precisely when the budgets don't tally up. If one (or more) transaction(s) is missing from one of the budgets, the previous comparisons will fail at some point, due to the transaction amounts in the incorrect budget "skipping" past the expected transaction amount. You can see this in the next figure, where "Other budget" is missing the transaction of amount -9,80.
The first comparison is successful: This budget's first transaction of amount 10 is mirrored by Other budget's first transaction of amount -10. The comparison fails for the second pair though: This budget's second transaction of amount 9,80 is not mirrored by Other budget's second transaction of amount -6,27. Looking at the figure, it's easy to see that Other budget is the one missing a transaction. Specifically, the mirror transaction to This budget's 9,80 transaction. Programmatically deciding which budget is missing a transaction is a bit more complicated, and is documented in YnabBudgetComparer's get_missing_transactions method. The cases depend on the combination of the transaction amount signs and on the absolute value comparison of the amounts.
In this case, because This budget's transaction is greater than zero and Other budget's transaction is less than zero, coupled with This budget's transaction's absolute value being greater than Other budget's transaction's absolute value, we know that Other budget is missing a transaction to mirror the transaction of amount 9,80 from This budget.
Information such as Payee or Memo (in the case of YNAB transactions) will help the user to understand the discrepancy between the budgets, but This budget may have multiple transactions of amount 9,80, each with different information. The next step is to find the best candidate transaction.
To do so, first the subset of transactions of amount 9,80 from This budget is extracted and sorted by date. The same is done for transactions of amount -9,80 from Other budget. If there are m transactions of this amount, this is O(m·log(m)) operations. It is common that, for n total transactions, n ≫ m, as there are relatively few transactions that share the exact same amount.
A transaction subset comparison is triggered each time there is a mismatch in a transaction pair's amounts so, for p skips, the time complexity of this operation is O(pm·log(m)). The value of p depends entirely on how thorough the users have been at recording transactions. In my experience, there is roughly one skip for every ten transactions, so we can also consider n ≫ p, giving n ~ pm. The loosely average time complexity is then O(n·log(n) + n·log(n)), or O(n·log(n)).
Once the two subsets have been created, the relative complement of Other budget transactions in This budget transactions is then calculated, i.e. which transactions exist in This budget but not in Other budget. This is a linear operation of, at worst, O(n-1), where n is the amount of transactions from This budget. It is n-1 because if Other budget has no transactions of amount -9,80, the whole set of This budget's transactions of amount 9,80 is returned, and is a constant-time O(1) operation. The order of magnitude of this operation doesn't affect the previous time complexity of O(n·log(n)).
For the budgets in this example, the subset from This budget contains one transaction of amount 9,80 with date 2015-08-16. The subset from Other budget contains zero mirror transactions of amount -9,80, as you can see in the following figure. The transaction from This budget is returned as Other budget's missing transaction, required for both budgets to tally up.
For extreme cases the algorithm could degenerate to a worst-case scenario of O(mn·log(n)) complexity, for example if none of the transactions were mirrored: each budget containing only its own side of transactions.