I used machine learning to make sense of 39,000 Substack articles
Using semantic search and clustering algorithms to understand what's trending on Substack right now
Disclosure: This article was written by me, a human, with an assist from Claude because it’s Friday. Also, the beginning of this article really sounds like an ad…but I need to explain the product I’m building, so you know why I have 39,000 Substack newsletter articles just lying around.
Disclaimer: This article focuses only on the data story and not on the politics of newsletters and themes surfaced by the analysis. I’m pretty sure my database could be used to build a map of red vs purple vs blue Substack…but I’m not sure if that’s such a good idea.
Update: The StackDigest project has ended. Thanks again to everyone who tested the tool and made building it such a wonderful experience!
Let’s begin with a little backstory.
Over the past few months, I’ve been building StackDigest, a newsletter discovery and digest tool that helps readers quickly surface high-quality articles from their subscriptions and discover new Substack content through semantic search. Unlike traditional keyword search, semantic search understands the meaning behind your queries, delivering more targeted results. It also surfaces recent content and content from smaller creators, if it’s a good match, whereas Google is strongly biased towards the most established publications.
To power this tool, I’ve assembled a database of approximately 2,500 Substack newsletters and 39,000 articles, which is refreshed every several days. As I watched this collection grow, it occurred to me that this dataset could tell us a lot about what’s happening on Substack right now. Like many Substack publishers, I’m curious about what others are writing, what resonates with readers, and what drives engagement.
So I decided to run some analyses on the database and add a few basic analytics features to StackDigest. This post covers some initial findings based on several database queries and results from my tool.
About the database and its limitations
As noted above, the database was initially developed to support semantic search functionality. It includes many newsletters that appear in Substack’s Bestseller and Rising lists across various categories. Here’s the breakdown:
Newsletter distribution by category (Top 15):
AI & Machine Learning: 505 newsletters
General Interest: 262 newsletters
Personal Development: 209 newsletters
Entrepreneurship: 138 newsletters
Business & Strategy: 135 newsletters
Literature & Writing: 101 newsletters
Product Management: 100 newsletters
Education & Learning: 83 newsletters
Marketing & Growth: 66 newsletters
Creator Economy: 64 newsletters
Technology: 62 newsletters
Innovation: 54 newsletters
Arts & Culture: 52 newsletters
Philosophy: 51 newsletters
Mental Health: 51 newsletters
Total: 2,510+ newsletters across 58 categories
Article timeline and volume
The database covers roughly the past three months of Substack publishing history (going back to July), with a handful of older articles. New articles are added every several days, and articles older than three months are gradually being archived to manage storage requirements.
Recent activity (last 30 days):
Average: ~360 articles/day
Peak: 493 articles (September 16)
Total last 30 days: ~10,800 articles
Paywall status
Free articles: 32,165 (81.8%)
Paywalled articles: 7,143 (18.2%)
Total: 39,308 articles
The majority of content in the database is freely accessible, though this analysis doesn’t filter by paywall status.
How the database is structured
The newsletters and articles are stored in a vector database that uses embeddings to enable semantic search. But what the heck does that mean?
When an article is added to the database, it goes through a transformation process:
Raw HTML content (typically 5,000+ characters): The Substack API returns the full HTML markup, including image tags, divs, links, and other code.
HTML cleaning: The content extractor strips away all HTML tags, removes scripts and styling, and extracts just the text content. This typically reduces the character count by 70-80%.
Structured for embedding: The cleaned text is formatted with metadata:
Title: [article title] | Subtitle: [subtitle] | Content: [cleaned text] | Author: [name] | Newsletter: [name] | Engagement: [stats]
Creating the embedding: This structured text (around 1,000-2,000 characters for shorter articles) is sent to OpenAI’s embedding API, which converts it into a high-dimensional vector that captures the semantic meaning of the content.
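The cleaning-and-formatting steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not StackDigest's actual code: the `clean_html` and `build_embedding_text` helpers are names I've made up, and the snippet stops short of the real OpenAI API call.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content while skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def clean_html(raw_html: str) -> str:
    """Strip tags, scripts, and styling; keep only the text content."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return " ".join(" ".join(parser.parts).split())

def build_embedding_text(title, subtitle, content, author, newsletter, engagement):
    """Format the cleaned text with metadata, matching the template above."""
    return (f"Title: {title} | Subtitle: {subtitle} | Content: {content} | "
            f"Author: {author} | Newsletter: {newsletter} | Engagement: {engagement}")

raw = "<div><h1>Hello</h1><script>track()</script><p>Some <b>body</b> text.</p></div>"
cleaned = clean_html(raw)
text = build_embedding_text("Hello", "A post", cleaned, "Jane", "Example Letter", "12 likes")
```

In practice the structured `text` string would then be sent to an embedding endpoint (here, OpenAI's), which returns the vector stored in the database.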
These embeddings allow the search algorithm to find articles based on conceptual similarity rather than just keyword matching, so searching for “dealing with anxiety” might surface articles about “managing stress” or “mental health practices” even if they don’t use those exact words.
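Conceptual similarity here is typically measured with cosine similarity between embedding vectors. A minimal sketch with tiny toy vectors (real embeddings have on the order of 1,500 dimensions, and the values below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for articles.
query    = [0.9, 0.1, 0.0, 0.2]   # "dealing with anxiety"
match    = [0.8, 0.2, 0.1, 0.3]   # "managing stress"
offtopic = [0.0, 0.1, 0.9, 0.0]   # unrelated article

sim_match = cosine_similarity(query, match)
sim_off   = cosine_similarity(query, offtopic)
```

Because it compares the direction of the vectors rather than their magnitude, cosine similarity ranks the "managing stress" article far above the unrelated one even though neither shares words with the query.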
Right now, I’m generating one embedding per article, and so far, that’s worked well for semantic search. But if I extend the analytics feature, I may move towards embedding articles in “chunks” for more granular analysis.
What this database doesn’t offer
This is not a real-time database, nor is it comprehensive. It’s designed to capture a snapshot of active, popular newsletters across diverse categories. It covers only written articles; podcasts and videos aren’t yet tagged. The data skews toward newsletters with existing audiences, so emerging or niche newsletters may be underrepresented.
And technology categories are likely overrepresented, because I’m a bit of a tech nerd.
Understanding engagement
For this analysis, I defined engagement as reactions (likes) plus comments. This is the standard metric used throughout the analytics.
Here’s how articles break down by engagement level:
0 engagement: 8,431 articles (21.4%)
1-10 engagement: 14,770 articles (37.6%)
11-50 engagement: 10,144 articles (25.8%)
51-100 engagement: 2,423 articles (6.2%)
101-200 engagement: 1,470 articles (3.7%)
200+ engagement: 2,070 articles (5.3%)
Key insight: 59% of articles receive 10 or fewer engagements, while only 15% receive 50+ engagements. This suggests that even among relatively established newsletters, high engagement is the exception rather than the rule.
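The bucket shares above follow directly from the counts. A quick sanity check in Python, using the figures from the table:

```python
# Engagement-bucket counts from the database (39,308 articles total).
buckets = {
    "0": 8_431,
    "1-10": 14_770,
    "11-50": 10_144,
    "51-100": 2_423,
    "101-200": 1_470,
    "200+": 2_070,
}
total = sum(buckets.values())
share = {k: round(100 * v / total, 1) for k, v in buckets.items()}

low  = share["0"] + share["1-10"]                          # articles with <= 10 engagements
high = share["51-100"] + share["101-200"] + share["200+"]  # articles with 50+ engagements
```

The buckets sum to 39,308 articles, with `low` coming out to 59.0% and `high` to about 15.2%.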
Does length matter? Word count vs. engagement
One of the first questions I wanted to explore was, “Do longer articles perform better?” The data reveals some interesting patterns:
The sweet spot
When looking at articles across all categories over a 30-day period, there appears to be an engagement sweet spot around 1,500-2,000 words. Articles in this range tend to generate solid engagement without overwhelming readers.
The celebrity effect
However, the data also shows numerous outliers, which often include very short articles (sometimes just a few words) with exceptionally high engagement. Many of these are from high-profile newsletters where the author’s existing audience drives engagement regardless of article length. This is the “celebrity effect”: when you have an established following, even brief posts can generate thousands of reactions.
This skew makes it challenging to draw definitive conclusions about optimal length without filtering for newsletter size and audience, which would require additional data enrichment.
Category differences
Looking at specific categories reveals different patterns:
AI & Machine Learning: Technical content in this category shows a wide distribution. Some highly technical, longer pieces (5,000-10,000 words) perform well, but there’s also strong engagement with shorter, more accessible explainers in the 1,500-2,000 word range.
Politics & Policy: Political commentary tends to generate high engagement even with moderate length (1,000-2,000 words). The highest-engagement articles often tackle current events with strong perspectives that drive discussion.
Mental Health: Articles in this category cluster around 1,000-1,500 words, suggesting readers prefer focused, digestible advice over lengthy explorations.
Reactions vs. comments: What’s driving discussion?
An interesting secondary analysis looked at the relationship between reactions (likes) and comments. Most articles receive far more reactions than comments, but certain pieces spark disproportionate discussion.
One standout example is a “Sunday caption contest” from Robert Reich’s newsletter that generated 832 reactions but nearly 3,000 comments, showing how interactive formats can drive exceptional engagement through participation rather than passive consumption.
Discovering trends with Theme Radar
One of the most powerful features I added to StackDigest is the “Theme Radar,” an automated system that uses K-means clustering to identify trending topics across thousands of newsletter articles.
Here’s how it works:
Finding similar articles: First, the algorithm groups articles with similar embeddings together, forming themes.
Measuring similarity: Next, the system uses cosine similarity to measure relationships between conceptually related articles. Unlike simple distance measurements, cosine similarity measures the angle between embedding vectors, which is ideal for understanding whether two pieces of text are about similar topics, even when they use different words.
Why themes change slightly between runs
If you generate themes multiple times on the same date range, you might notice that they shift a bit. This is normal and expected due to random initialization.
K-means clustering starts by randomly selecting initial positions for cluster centroids (the center points of each theme group). The algorithm then iteratively refines these positions, moving them toward the center of their nearest articles. Because the starting positions are random, the algorithm can converge to different local optima, slightly different but equally valid groupings.
Think of it like organizing books. Imagine sorting books into piles by topic. Depending on which books you pick as initial anchor points, you might group “technology and society” with “AI ethics” in one pass, but separate them in another. Both groupings make sense because the boundaries between topics are naturally fuzzy.
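The random-initialization behavior described above can be sketched with a deliberately tiny, pure-Python 1-D K-means. This is illustrative only: the real Theme Radar clusters high-dimensional embeddings, almost certainly via a library implementation rather than code like this.

```python
import random

def kmeans(points, k, seed, iters=20):
    """Minimal 1-D K-means: random initial centroids, then iterative refinement."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization drives run-to-run variation
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated "topic" groups plus points in the fuzzy middle.
points = [1.0, 1.1, 1.2, 4.9, 5.0, 5.1, 2.9, 3.0]
run_a = kmeans(points, k=2, seed=0)
run_b = kmeans(points, k=2, seed=1)
```

Both runs recover the same two broad groups, but points in the fuzzy middle (2.9, 3.0) can land on either side depending on where the initial centroids happened to fall, which is exactly why theme boundaries shift slightly between Theme Radar runs.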
What stays consistent across runs:
Major themes (politics, AI, personal growth) always emerge
Articles with very high cosine similarity (0.9+) almost always cluster together
The relative size and engagement levels of themes remain stable
Temporal patterns (like event-driven spikes) are consistent across runs
What might change:
Exact theme labels and descriptions
Articles near the boundaries between related clusters
Article counts within themes (typically varying by 5-15%)
The specific centroid positions in embedding space
Top 20 themes (Past 30 days)
Here are the most active themes detected across Substack as of this afternoon, ranked by a combination of article count and engagement:
AI Ethics and Impact (494 articles, 27,070 reactions, 6,516 comments)
Political Unrest and Resistance (396 articles, 655,918 reactions, 86,567 comments)
Personal Growth and Self-Reflection (324 articles, 30,064 reactions, 4,258 comments)
Substack Community Growth Strategies (320 articles, 16,502 reactions, 4,265 comments)
Reflections on Time and Life (272 articles, 25,570 reactions, 6,163 comments)
Changing Landscape of Work and Money (260 articles, 20,446 reactions, 2,554 comments)
Cultural Shifts in Society (233 articles, 44,325 reactions, 6,675 comments)
Political Turmoil and Violence (211 articles, 196,910 reactions, 29,551 comments)
Freedom of Speech and Democracy Advocacy (181 articles, 73,413 reactions, 18,629 comments)
Personal Updates and Reflections (171 articles, 33,422 reactions, 4,803 comments)
Exploring Hate and Controversy (137 articles, 16,513 reactions, 3,408 comments)
Pop Culture Reflections (133 articles, 24,383 reactions, 6,965 comments)
Controversy Surrounding Jimmy Kimmel (119 articles, 223,930 reactions, 26,983 comments)
Political Commentary and Criticism (114 articles, 72,979 reactions, 5,493 comments)
Artistic Expression and Reflection (109 articles, 15,436 reactions, 3,938 comments)
Optimizing Mental Health and Productivity (106 articles, 9,168 reactions, 1,569 comments)
Exploring Philosophical Ideologies (80 articles, 7,407 reactions, 1,069 comments)
Royal Family Drama (67 articles, 9,387 reactions, 3,251 comments)
Debunking Medical Misinformation (64 articles, 34,430 reactions, 7,617 comments)
Global Political Unrest (45 articles)
What the themes reveal
Politics dominates engagement: While “AI Ethics and Impact” has the most articles (494), political themes generate far more engagement. “Political Unrest and Resistance” accumulated over 655,000 reactions despite having fewer articles. The “Controversy Surrounding Jimmy Kimmel” theme, with just 119 articles, generated an astounding 223,930 reactions, suggesting a few viral pieces can define a theme’s impact.
Self-improvement is evergreen: “Personal Growth and Self-Reflection” consistently produces content (324 articles), showing that introspective, practical advice remains a Substack staple.
Meta-content thrives: “Substack Community Growth Strategies” ranks #4, revealing that Substack writers are actively learning from each other and sharing insights about building audiences on the platform.
Medical misinformation matters: Despite ranking lower in article count (64 articles), “Debunking Medical Misinformation” generates substantial engagement (34,430 reactions), indicating readers value authoritative health information.
Visualizing theme evolution over time
The similarity score scatter plots reveal fascinating patterns about how themes evolve:
Sustained vs. event-driven themes
Looking at the temporal distribution of articles within themes, we can distinguish between two types:
Sustained themes show consistent article publication throughout the 30-day period with relatively even similarity scores. Examples include:
Personal Growth and Self-Reflection: Articles spread evenly across the entire month with similarity scores ranging from 0.3 to 0.95, showing this is an ongoing conversation rather than a response to specific events.
Optimizing Mental Health and Productivity: Similar pattern of consistent publishing throughout September, with articles maintaining moderate to high similarity scores (0.4-0.95), indicating writers consistently return to core concepts.
AI Ethics and Impact: Despite being the most article-heavy theme, it shows remarkably consistent publishing cadence with articles distributed throughout the period, suggesting AI discourse is sustained rather than reactive.
Event-driven themes show concentrated bursts of highly similar articles around specific dates:
Political Turmoil and Violence: Shows a clear spike of very high similarity scores (0.8-0.95) concentrated between September 10-16, suggesting multiple writers responded to the same political event during this period. After mid-September, similarity scores drop and spread out, indicating the conversation moved on.
Debunking Medical Misinformation: Shows interesting clustering patterns when comparing different time windows. The 14-day view reveals tighter, more focused conversations (September 20-October 2), while the 30 and 60-day views show the same high-similarity articles but with more temporal spread, suggesting periodic waves of medical misinformation that prompt coordinated responses.
Freedom of Speech and Democracy Advocacy: Demonstrates increasing similarity scores toward late September and early October, indicating this theme gained momentum and coherence as writers converged on similar arguments.
The power of controversy
Exploring Hate and Controversy presents a particularly interesting pattern: it maintains high similarity scores (0.7-0.95) throughout the entire period, with dense clustering around mid-September. This suggests that while controversial topics generate sustained interest, certain events or discussions create moments where the conversation becomes especially focused.
Community-building content
Substack Community Growth Strategies shows one of the most distributed patterns, with similarity scores ranging from 0.3 to 0.95 spread evenly across the month. This indicates that while writers frequently discuss growth tactics, each approaches the topic differently—there’s less convergence on specific advice compared to news-driven themes.
What time windows reveal about trends
Comparing the same theme across different time periods offers insights into trend longevity:
Debunking Medical Misinformation across three time windows (14, 30, and 60 days) shows virtually identical clustering patterns, just with different temporal scales. This suggests the theme consists of a core set of recurring medical topics that writers address repeatedly.
Optimizing Mental Health and Productivity shows less density in the 14-day window compared to 30 days, indicating this theme benefits from looking at longer time periods to see the full conversation. Mental health discussions appear to be more cyclical, with weekly or bi-weekly patterns rather than daily responses.
Why some themes cluster tighter than others
The clustering density (how tightly articles group together in time) reveals how writers respond to themes:
Tight clusters with high similarity: Multiple writers responding to the same external event (news, controversy, viral moment)
Loose clusters with high similarity: Multiple writers independently arriving at similar topics or arguments
Even distribution: Ongoing themes that don’t require external triggers
Low similarity scores across time: Broad thematic category containing diverse perspectives
Top performers
Looking at the highest-engagement articles over a two-week period reveals the power of established platforms:
“Jimmy Kimmel is gonna sue Disney for a billion dollars...” (Gavin Newsom 2028): 18,250 reactions, 727 comments
“The sleeping giant is awakening” (Robert Reich): 12,702 reactions, 1,886 comments
“September 20, 2025” (Letters from an American): 12,777 reactions, 831 comments
These top-performing articles share common traits: timely topics, strong perspectives on current events, and pre-existing large, engaged audiences.
What this tells us about Substack
After analyzing 39,000 articles, several patterns emerge:
Most articles receive modest engagement: Most articles get fewer than 50 engagements, reminding us that building an audience takes time.
Length matters, but not as much as you might think: While there’s a sweet spot around 1,500-2,000 words, audience and topic matter more than word count.
Interactive content drives comments: Posts that invite participation (like caption contests) can generate disproportionate discussion.
Political content drives extreme engagement: While AI has the most articles, political themes generate 10-20x more engagement per article, suggesting politics remains the most engagement-dense category on Substack.
Trends cluster around events: Machine learning reveals how writers collectively respond to external events, with high-similarity articles clustering around specific dates when news breaks.
Some conversations are sustained, others are reactive: Themes like personal growth and AI ethics show consistent publishing patterns, while political themes spike around specific events then dissipate.
Controversy concentrates attention: The Jimmy Kimmel controversy generated more engagement than themes with 4x as many articles, showing that divisive topics punch above their weight.
Meta-content thrives: Substack writers actively write about Substack itself, creating a self-reinforcing community learning loop.
Try it yourself
I’ve made these analytics available on StackDigest so others can explore the data…and so I can get some feedback on this feature. You can filter by category, time period, and engagement levels to discover patterns relevant to your niche. The Theme Radar feature lets you explore any of the 20 identified themes, view their similarity score distributions over time, and see which specific articles are driving each conversation.
The database continues to grow and update every few days, so the insights will evolve as new content is published and new trends emerge. I’m particularly interested in tracking how themes evolve week over week, which topics maintain momentum, which fade, and which new conversations emerge.
StackDigest is a work in progress and free to use. Please DM me with questions, feedback, suggestions, and bug reports.