Wow Karen, that was a huge chunk of work for one person. Very impressive! And I love the depth of your article, there's a lot to learn from.
Thank you! 🙏 🤗
Really enjoyed reading this! I love how you documented both the wins and the challenges—especially the Heroku slug and migration issues. It’s a great reminder that production environments are a whole different beast than local testing. Also, the pivot to OpenAI embeddings is a smart move—simple, cost-effective, and scalable. Your approach shows how experimentation, flexibility, and learning from mistakes are key to building systems that actually work.
So glad you enjoyed it! Yeah, it was quite the surprise when I realized the sentence transformers would just not work in production. 🤣
Haha, I can totally relate! Production always has a way of humbling even the simplest plans. Glad you figured out a solution that works!
Can’t wait to test it out! It’s great to know that API costs can be reasonable. I tend to avoid them because I’m worried about getting an unexpected bill, but this seems very affordable. A lot easier to pay a few bucks to utilize someone else’s hard work than to try to build it from scratch.
I was really surprised by just how cheap OpenAI's Embeddings API actually was. Maybe that's why they're losing so much money? 😆
Embeddings are cheap to create for someone like OpenAI, who already builds embeddings to power everything they do. These newer embeddings are a bit more intelligent than the simpler word embeddings first developed about ten years ago. Those simpler embeddings could be built on your desktop from Wikipedia (or any other) text, but they have fixed dimensions (usually a few hundred). The newer generation of embeddings works at the sentence level and is also trained so that you can take just the first N of its ~2k dimensions if you want to keep your database smaller, at the cost of slightly degraded performance.
BTW, vector embedding "databases" are just collections of vectors (think one array per "document"), perhaps optimized to enable "nearest neighbor" searches. In that sense, they are not that different from the traditional search indexes used for relevance ranking, which are also optimized for nearest-neighbor searches.
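The dimension-truncation and nearest-neighbor ideas above can be sketched in a few lines. This is a toy illustration with random vectors standing in for real embeddings (OpenAI's text-embedding-3 models do expose a `dimensions` parameter for exactly this kind of shortening, but all the numbers and document names here are made up):

```python
import numpy as np

# Toy "vector database": one fixed-length vector per document.
# Real embeddings would come from an API call; random vectors stand in here.
rng = np.random.default_rng(42)
docs = ["doc-a", "doc-b", "doc-c"]
full_dim = 1536
vectors = rng.normal(size=(len(docs), full_dim))

def truncate_and_normalize(vecs, n_dims):
    """Keep only the first n_dims of each embedding and re-normalize,
    trading a little accuracy for a smaller index."""
    small = vecs[:, :n_dims]
    return small / np.linalg.norm(small, axis=1, keepdims=True)

def nearest(query_vec, index):
    """Brute-force nearest neighbor by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q          # cosine similarity against every row
    return int(np.argmax(scores))

index = truncate_and_normalize(vectors, 256)
query = vectors[1][:256]        # pretend the query embeds exactly like doc-b
best = nearest(query, index)
print(docs[best])               # → doc-b
```

A production vector database adds approximate-nearest-neighbor indexing on top of this so the search doesn't have to scan every row, but the core operation is the same dot product.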
Thank you for the helpful context! 🙏
This is awesome, Karen. What surprised me most was the cost of embedding over 2,100 articles: 36 cents 😮 Thanks for sharing so much detail and your learnings around RAG.
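For anyone curious, the arithmetic behind that figure is easy to check. Only the $0.36 total and the 2,100-article count come from the post; the extrapolation assumes cost scales linearly with volume:

```python
# Back-of-envelope check on the embedding cost mentioned in the post.
total_cost_usd = 0.36
articles = 2100

per_article = total_cost_usd / articles
print(f"${per_article:.6f} per article")          # ~$0.000171

# Extrapolating linearly: even 100k articles stays in pocket-change range.
print(f"${per_article * 100_000:.2f} per 100k articles")   # ~$17.14
```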
Thank you very much!
Very interesting
Truly impressive!
Thank you! 🤗
This was an incredible amount of work, Karen, truly impressive!
What inspires me most is your dedication to exploring all kinds of possibilities and openly sharing what you learned along the way. First you surprised me with Celery, then again with vector DB, you keep raising the bar!
I also deeply resonate with your two points:
• Open-source tools are amazing, but not always production-friendly
• Sometimes, migrating is more painful than building from scratch
Awesome job, Karen! Can’t wait to roll out your feature next week — in a new format too 😉
💯 on the migration and open source tools! This experience will definitely affect how I approach future projects!
And looking forward to next week’s feature! ❤️
Thank you for this great post! I truly loved reading this and learning more about vector databases.
You’re very welcome!
So many takeaways and learnings from this article Karen! Especially because I am currently learning more about vector databases myself.
Happy to help test StackDigest; it's been really helpful in managing my newsletter subscriptions!
So glad you enjoyed it, and I’m curious about how you’ll be using your vector DB. 😁
And thank you so much for testing SD! 🙏 I’m always glad to hear feedback…there’s definitely room for improvement!
Dear Karen, I added 4 newsletters in Spanish to the digest and tried the semantic search, but it is not working.
I love the scoring feature.
Great point about language support!
Right now, StackDigest runs on two databases: a relational database that stores user profiles and newsletter collections, and a vector database for semantic search. The relational database is updated every time you add a newsletter to your library. The vector database, which currently contains only English-language sources, is updated every few days to fetch new articles, retire old ones, and add user-nominated newsletters.
This means the Spanish-language newsletters you uploaded were added to your personal collection, so they will appear in all digests except for the semantic digest.
Apologies for the confusion! I’m thinking I need to add an FAQ and some help copy to the app!
Yes, an FAQ would be good. I can help, if you want (no technical issues). Robin Good recommended your tool.
In case you're interested, the system is a bit strange when the language isn't English. Although the newsletters are in Spanish, the summaries appear in English. But if I translate the summary into Spanish and put it at the beginning of the post, the next digest shows that Spanish translation as the summary of the article. Curious.
Your initiative is very good. Have you thought about when you'll charge for its use?
Quick update…I’ve heard from more people who are interested in Spanish-language digests, so I’m adding it as a feature. I’d love to get your feedback, and I’ll tag you when it’s online…I’m not really fluent enough to know if the translated digests make sense!
I haven’t really thought about monetizing it yet…I want to see how my production environment fares with a few hundred users and seriously clean up the mobile experience before I consider that!
Ok, thanks.
You’ve brilliantly turned a technical headache into a cracking story with heart, humour, and hands-on learning. The 36-cent embedding cost stat is wild - according to OpenAI pricing, it's bang on. Love that you didn’t get romantic about your first approach and pivoted fast when it made sense.
Thank you! 🙏 Yes, whenever I'm getting really frustrated with a technical issue, I try to pause and ask myself (and Claude), "Is there a better way?"
Great article, Karen!
Did you consider any chunking strategy when creating embeddings for the Substack articles?
I worked on a project in my last 9-to-5 job to feed PDF files into a vector database for semantic search, and due to the large variety of file sizes, I had to build chunking to improve search accuracy.
That's a great thought! 💡
I haven't implemented chunking yet, but I suspect that creating multiple 500-word chunks per article would improve accuracy quite a bit, since many articles are 1,500 words or more.
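The fixed-size chunking idea is straightforward to prototype. Here is a minimal sketch; the 500-word window and 50-word overlap are illustrative choices, not anything StackDigest actually uses, and real pipelines often chunk on sentence or token boundaries instead:

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks with a small overlap so that
    sentences straddling a boundary still appear intact in one chunk."""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1,500-word article becomes a handful of overlapping ~500-word chunks,
# each of which would get its own embedding in the vector database.
article = " ".join(f"w{i}" for i in range(1500))
print(len(chunk_words(article)))   # → 4
```

At query time, the search would match against chunk embeddings and then map hits back to their parent articles.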
I love all of these updates! One week ago, I logged in to StackDigest and saw the discovery feature and it gave me an error. Was about to DM you to ask about that but I didn't have to do that it seems haha. I love it. Thank you so much for all you do! <3
You're welcome! 🤗
But definitely let me know if you see any strange behavior or functionality that's not quite right. There's plenty of room for improvement!
Great insight. I recently used semantic search in one of my hackathon projects to match the current season's ideas against all of the previous seasons' ideas and flag possible duplicates, and it works like a charm.
Interesting application! It really is so much more accurate than keyword matching.
Yeah, semantic search using vector embeddings matches keywords by their semantic meaning (as you explained in the post). It is far better than normal keyword search.
Hi, I am an Arab.