So I hate doing manual entries in my notebook or Excel sheet (or even the app that I semi-built for expense tracking). These days, 99% of my transactions are digital and happen from my daily-use account, which is with ICICI Bank. So if I could somehow get my transaction history, I could just auto-add stuff to my tracker and edit/sort it into categories later. But of course, nothing is as easy as it seems.
Banks are very stingy with this kind of stuff; no public APIs as far as I could find. So I had to find a workaround: I get SMS alerts for every transaction, so I can just parse those! But these alerts come from different numbers and in different formats, so first I needed to build a database of these messages. Since the default SMS app didn't allow export, I thought I would set up SMS forwarding, store messages for a week, and then get started.
To implement SMS forwarding, I used Automate, a brilliant app by LlamaLab that lets you automate a bunch of things with no code (or very low code if you are a power user). I set up a flow to read all my SMS and, if one contained the word ICICI, send the message in an HTTP request to my server, where I set up a simple FastAPI app to store incoming messages in a SQLite db. To protect against people spamming my db, I just added a password key in the body of the request which must match a constant on the server. Since the request is over HTTPS, the body is encrypted and I don't need to worry about securing my password.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import sqlite3
import os
from datetime import datetime

app = FastAPI()

# Constant password
MESSAGE_PASSWORD = os.getenv("MESSAGE_PASSWORD")
if MESSAGE_PASSWORD is None:
    raise Exception("MESSAGE_PASSWORD environment variable not set.")

class Message(BaseModel):
    """Model to store the incoming message."""

    message: str
    password: str

def init_db():
    """Initialize the database."""
    conn = sqlite3.connect("messages.db")
    cursor = conn.cursor()
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY,
            message TEXT NOT NULL,
            timestamp DATETIME NOT NULL
        )
        """
    )
    conn.commit()
    conn.close()

init_db()

@app.get("/")
def hello_there():
    return "Hello there!"

@app.post("/submit-message/")
def submit_message(message_data: Message):
    if message_data.password != MESSAGE_PASSWORD:
        raise HTTPException(status_code=401, detail="Unauthorized")
    conn = sqlite3.connect("messages.db")
    cursor = conn.cursor()
    cursor.execute(
        """
        INSERT INTO messages (message, timestamp)
        VALUES (?, ?)
        """,
        (message_data.message, datetime.now()),
    )
    conn.commit()
    conn.close()
    return {"detail": "Message stored successfully."}
As you can see, it is pretty basic. Since I am the only one using it (for now), I don't need to do anything fancy as long as it gets the job done.
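For reference, the Automate flow's HTTP Request block effectively does something like this. Here is a minimal sketch using requests; the URL is a placeholder, and reading MESSAGE_PASSWORD from the environment on the client side is just an assumption for the example.
import os
import requests

# This mirrors what the Automate flow sends for a matching SMS:
# the raw message body plus the shared password, as a JSON body over HTTPS.
payload = {
    "message": "ICICI Bank Acct XX424 debited for Rs 4242.42 on 02-Apr-23; Person Name credited.",
    "password": os.environ["MESSAGE_PASSWORD"],
}
# https://my-server.example.com is a placeholder for wherever the FastAPI app is hosted.
response = requests.post("https://my-server.example.com/submit-message/", json=payload)
print(response.status_code, response.json())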
Now the impatient me couldn't wait a week to start experimenting, so I switched to an OSS SMS app from F-Droid, which let me export my messages as JSON. Jackpot, I now had historical data to work with.
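For context, the export is just a list of message objects. The only field the code below actually relies on is body; the other keys here are assumptions about what a typical export includes, so treat them as illustrative.
# Rough shape of the exported JSON, shown as a Python literal.
# Only "body" is used later; "address" and "date" are assumed field names.
sms_data = [
    {
        "address": "AX-ICICIB",  # sender ID (assumed)
        "date": 1677253800000,   # epoch millis (assumed)
        "body": "USD 42.42 spent on ICICI Bank Card XX4242 on 24-Feb-23 at GAMESSSSS. Avl Lmt: INR 4,24,424.24.",
    },
]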
I had heard good things about GPT-4's data analysis, so I thought let's try it there. Despite repeated attempts, it could not load my JSON file, so I had to paste a few samples in by hand. But given the variety of templates the bank uses, I couldn't find and send every one of them to GPT.
So I had a big-brain moment: what if I grouped text messages based on similarity? I can already filter ICICI Bank messages by just searching for ICICI in the string. Now I can just group these together and see how many templates I end up with. My first thought was k-means clustering with the elbow method, but that meant I would have to do feature extraction from the text, word2vec, etc., which is too much pain for this easy-peasy job. So I landed on fuzzywuzzy, which uses Levenshtein distance between two strings to measure similarity. To reduce the impact of variable amounts, I removed numbers from the strings and then set fuzzywuzzy to work; it gave me 100 groups from 500 relevant messages. My filter on similarity was too strict; once I loosened it enough, I ended up with 6 groups (which probably could have been 5, but eh).
import json
import re
import os
from fuzzywuzzy import fuzz

file_path = "icic_messages.json"
with open(file_path, "r") as file:
    sms_data = json.load(file)

# Filter messages related to ICICI Bank transactions
icici_bank_messages = [sms["body"] for sms in sms_data if "ICICI" in sms["body"]]
print(len(icici_bank_messages))

debit_messages = [
    msg
    for msg in icici_bank_messages
    if ("debit" in msg or "spent" in msg or "successfully processed the payment" in msg)
    and "requested money" not in msg
    and "to be debited" not in msg
    and "will be debited" not in msg
]
credit_message = [
    msg for msg in icici_bank_messages if "credit" in msg and "debit" not in msg
]

# Function to preprocess messages
def preprocess_message(message):
    # Remove all punctuation and numbers
    message = re.sub(r"[^\w\s]", "", message)  # Removes punctuation
    message = re.sub(r"\d+", "", message)  # Removes numbers
    return message.lower()

# Preprocess all messages
processed_messages = [preprocess_message(message) for message in debit_messages]

# Fuzzy matching
similarity_threshold = 80  # Adjust as needed
grouped_messages = []
assigned = set()
for i, message1 in enumerate(processed_messages):
    if i in assigned:
        continue
    current_group = [i]
    assigned.add(i)
    for j, message2 in enumerate(processed_messages):
        if j not in assigned and i != j:
            similarity_score = fuzz.token_set_ratio(message1, message2)
            if similarity_score > similarity_threshold:
                current_group.append(j)
                assigned.add(j)
    grouped_messages.append(current_group)

# Write each group to its own file, creating the output directory if it doesn't exist
os.makedirs("output", exist_ok=True)
for group, message_indices in enumerate(grouped_messages):
    with open(f"output/{group}.json", "w") as file:
        json.dump([debit_messages[i] for i in message_indices], file, indent=4)
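To get a feel for the threshold, here is the kind of spot check that helped. The messages are made up along the lines of the real templates, the scores are illustrative rather than recorded output, and strip_numbers just mirrors the preprocess_message logic above.
from fuzzywuzzy import fuzz
import re

def strip_numbers(text):
    # Same idea as preprocess_message above: drop punctuation and digits, lowercase.
    return re.sub(r"\d+", "", re.sub(r"[^\w\s]", "", text)).lower()

same_template_a = strip_numbers("ICICI Bank Acct XX424 debited for Rs 4242.42 on 02-Apr-23; Person A credited. UPI:424242424242.")
same_template_b = strip_numbers("ICICI Bank Acct XX424 debited for Rs 100.00 on 05-Apr-23; Person B credited. UPI:111111111111.")
other_template = strip_numbers("USD 42.42 spent on ICICI Bank Card XX4242 on 24-Feb-23 at GAMESSSSS. Avl Lmt: INR 4,24,424.24.")

# Same template, different payee and amount: scores near the top of the scale.
print(fuzz.token_set_ratio(same_template_a, same_template_b))
# Different template: scores noticeably lower, which is what the 80 cutoff separates on.
print(fuzz.token_set_ratio(same_template_a, other_template))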
I skimmed through the groups, captured a couple of unique samples from each, and went to work on regextester with GPT by my side. At this point, I had ditched OpenAI premium for Kagi, since it is a search engine as well and the only GPT-4 extension I used was DALL-E, which wasn't frequent enough to justify paying $24 (thanks to $4 in taxes). Kagi had pretty much all I wanted, and with its code assistant (which makes GPT-4 calls under the hood), I devised regexes. It took me around 4 hours, but now I can regex everything out of all the SMSes, i.e. I can get the amount, the currency, the date, and who I was paying. UPI, bank transfers, standing instructions, credit cards: all of these were captured now.
import json
import re

file_path = "icic_messages.json"
with open(file_path, "r") as file:
    sms_data = json.load(file)

icici_bank_messages = [sms["body"] for sms in sms_data if "ICICI" in sms["body"]]
debit_messages = [
    msg
    for msg in icici_bank_messages
    if ("debit" in msg or "spent" in msg or "successfully processed the payment" in msg)
    and "requested money" not in msg
    and "to be debited" not in msg
    and "will be debited" not in msg
]

def parse_currency_and_amount(message) -> tuple[str | None, str | None]:
    matches = re.findall(
        r"([A-Z]{3}|Rs\.*)\s*(\d{1,}(?:\,?\d{2,3})*(?:\.\d+)?)", message
    )
    if len(matches) == 0:
        return None, None
    if len(matches[0]) != 2:
        raise ValueError(f"Unable to parse string: `{message}`, got match: `{matches}`")
    return matches[0][0], matches[0][1]

def parse_date_and_creditor(message) -> tuple[str | None, str | None]:
    matches = re.findall(
        r"(\d{2}-\w{3}-\d{2})[;.\s]?\s?(?:(?:for|at|towards)\s)?(.*?)(?:(?:\scredited\.)|\.\s|\sfor|Av)",
        message,
    )
    # if you found a match, return it and gtfo
    if len(matches) != 0:
        if len(matches[0]) != 2:
            # weird partial matches
            raise ValueError(
                f"Unable to parse string: `{message}`, got match: `{matches}`"
            )
        return matches[0][0], matches[0][1]
    # try matching the format of "successfully processed the payment" messages
    matches = re.findall(r"for (.*), as.*on (\d{2}\/\d{2}\/\d{4}) for", message)
    if len(matches) == 0:
        # no matches :')
        return None, None
    elif len(matches[0]) != 2:
        # weird partial matches
        raise ValueError(f"Unable to parse string: `{message}`, got match: `{matches}`")
    # reverse return here because in this case date comes first and then creditor
    return matches[0][1], matches[0][0]

for msg in debit_messages:
    curr, amount = parse_currency_and_amount(msg)
    date_str, creditor = parse_date_and_creditor(msg)
    print(msg)
    print(curr, amount, date_str, creditor, "\n")
With this code, I get results like:
USD 42.42 spent on ICICI Bank Card XX4242 on 24-Feb-23 at GAMESSSSS. Avl Lmt: INR 4,24,424.24. To dispute,call 42424242/SMS BLOCK 4242 to 4242424242
USD 42.42 24-Feb-23 GAMESSSSS
ICICI Bank Acct XX424 debited for Rs 4242.42 on 02-Apr-23; Person Name credited. UPI:424242424242. Call 42424242 for dispute. SMS BLOCK 424 to 4242424242.
Rs 4242.42 02-Apr-23 Person Name
Dear Customer, we have successfully processed the payment of INR 424.24 for Google Workspace, as per the Standing Instruction ABCDEFGHIJ, on 04/02/4242 for your ICICI Bank Debit Card 4242. To manage your Standing Instructions, visit www.icicibank.com >Cards >Debit Cards>Standing Instructions and Recurring Charges>Manage Standing Instructions.
INR 424.24 04/02/4242 Google Workspace
What I will do from here, I do not know.