I collected abstracts from 11,303 recent research papers on AI/ML
Then I used them to build a trend-spotting prototype
It was just last week that I shut down StackDigest, my newsletter discovery and digest tool, and I really missed being able to semantically search for content and create reading lists on niche topics. Instead, it was back to (sigh) Googling and Substack’s limited interface, both of which I used to start researching a potential story on how jailbreaking techniques have evolved alongside AI.
One of my Google search results included an article with links to a few interesting papers…and I noticed that they lived on something called arXiv. I’d heard of arXiv before (mostly from researchers sharing preprints on Twitter), and even tried it years ago, but I’d found it difficult to search.
And then it occurred to me. What if I used arXiv, or something like it, as a data source? And then applied semantic search and machine learning to more easily surface valuable papers and identify research trends?
Wait, what’s arXiv?
For those of you who haven’t spent much time in academic circles or hunting for research papers, arXiv is a free, open-access repository for scientific papers. It was created back in 1991 by physicist Paul Ginsparg as a way for researchers to share preprints—papers that haven’t yet gone through peer review—quickly and openly.
Today, it hosts over 2.4 million papers across physics, mathematics, computer science, and more. The AI/ML research community has basically adopted it as their primary distribution channel. When a team at Google DeepMind or Meta AI or some university lab makes a breakthrough, they typically post it to arXiv before (or alongside) submitting to formal conferences.
There are similar platforms, too: Semantic Scholar (built by the Allen Institute for AI), PubMed (biomedical research), bioRxiv (biology), and SSRN (social sciences). Each serves a different research domain, but they all share the same mission: making cutting-edge research accessible to everyone.
Even better, arXiv has a completely open, well-documented API. You don’t even need an API key. It doesn’t require payment even for commercial tools (though they recommend becoming an affiliate). It’s low-budget researcher heaven.
Validating the problem (for me)
Was it that easy, I wondered? Had I actually discovered a new problem to solve with the machine learning techniques I’d picked up while building StackDigest? I went back to the arXiv search interface to see if it had evolved since the last time I wrestled with it.
After playing with it for about half an hour, I concluded that it was just as challenging to work with as I remembered. It required stringing individual keywords together. The results were sometimes missing context. And I realized that finding trends would be incredibly difficult and require cross-referencing multiple searches.
For example, if I wanted to understand what’s happening in “AI for healthcare,” I’d have to:
Search for healthcare-related keywords
Filter by date ranges
Read through dozens of abstracts
Mentally cluster similar work together
Try to identify which themes were gaining traction vs. fading
Hope I didn’t miss important papers because they used different terminology
Plus, the metadata describing individual papers was deeply technical and hard to navigate for a non-scientist.
Yes, I decided, this problem was real.
Validating the problem (for others)
My next question was, “Does anyone else think this is really a problem?”
I thought about it and made some guesses. I was pretty sure that other writers (especially in the tech space) want to know about trends in AI/ML and other research-intensive domains like healthcare or climate science.
Researchers might want to see what’s trending in their space without manually combing through hundreds of papers. And, depending on the domain, even investors, consultants, and corporate R&D leaders might want this kind of information.
Of course, I don’t have a lot of familiarity with most of these audiences (except for writers! Hello, friends! 👋). So how could I test interest in this idea, to see if there are people out there who WANT a new, easier way to look for common themes and trends in cutting-edge research?
I decided there was only one way to find out. Build a quick, narrowly focused prototype and share its outputs widely across my Substack and LinkedIn networks to see what people think.
Selecting a research domain to test
While there are many different open data sources for scientific papers in different domains—PubMed for medicine, bioRxiv for biology, SSRN for social sciences, etc.—I decided to go with arXiv for AI/ML, simply because I’ve met a lot of people who write in this space (hello She Writes AI community! 👋).
And since AI is a trending topic on Substack, getting people to read about common themes across AI research (with links to interesting papers) should be easier than, say, getting them excited about trends in computational fluid dynamics. (No offense to the CFD folks.)
Choosing the data source
I went with arXiv for a few reasons:
A wealth of AI/ML papers (200+ new papers per day in the categories I track)
An open API that doesn’t require authentication
No payment required (even for commercial use)
Rich abstracts that allow meaningful analysis without needing full PDFs
Because the abstracts are content-rich (typically 150-300 words summarizing the entire paper), I believed I could conduct meaningful analysis by combining them with the article metadata. This would keep things simple and avoid potential copyright issues.
What about legal and compliance issues?
I read arXiv’s Terms of Use carefully. The key points include:
Bulk access is allowed as long as you respect rate limits (I use 3 seconds between requests)
Content can be redistributed under Creative Commons licenses (most papers use CC BY)
No authentication required, but they recommend becoming an arXiv affiliate for large projects
Attribution required when sharing content
The bottom line: I had nothing to worry about from a legal perspective. arXiv was designed for this kind of open access and reuse.
Scoping it out
I decided to create only one feature for my prototype: a research theme clustering and analysis module. Here’s what I needed:
Data pipeline
Collect about a month of arXiv metadata on AI/ML topics to test with
Store papers, abstracts, authors, and categories in a database
Generate embeddings for semantic search
Analysis engine
Use semantic search to find papers relevant to user queries
Cluster similar papers together using cosine similarity (see the sketch after this list)
Analyze each cluster with AI to identify themes
Generate a main report + deep-dive reports for each theme
Create visualizations showing relationships between themes
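To make the clustering idea concrete, here’s a simplified in-memory sketch (the real pipeline does this against pgvector in the database, and the names and threshold below are illustrative, not my exact settings):

```python
# Simplified greedy clustering by cosine similarity (illustrative sketch;
# the production version runs against pgvector in PostgreSQL).
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_papers(embeddings: list[np.ndarray], threshold: float = 0.75) -> list[list[int]]:
    """Each paper joins the most similar cluster above the threshold;
    otherwise it starts a new cluster."""
    clusters: list[list[int]] = []    # lists of paper indices
    centroids: list[np.ndarray] = []  # running mean embedding per cluster
    for i, emb in enumerate(embeddings):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine_sim(emb, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(emb.copy())
        else:
            clusters[best].append(i)
            n = len(clusters[best])
            centroids[best] = centroids[best] + (emb - centroids[best]) / n  # update running mean
    return clusters
```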
User workflows
The user flow is pretty straightforward:
User enters a search query (e.g., “Applications of LLMs for healthcare”)
Backend enriches the query using Claude Haiku. Note: This was an important addition after early testing. I found that simple queries like “healthcare AI” would return too many generic papers because the embeddings weighted all terms equally. So now, Haiku analyzes the query and expands domain-specific terminology while de-weighting generic terms like “AI” or “LLM”.
User chooses a “Standard” report option or a less technical “General audience” option.
Semantic search runs using OpenAI’s embedding model. The enriched query gets embedded, then we find papers with high cosine similarity (>0.40 threshold—I increased this from 0.35 after testing showed it improved precision)
Results get clustered by calculating cosine similarity between paper embeddings using pgvector, which groups similar papers together based on their semantic proximity
Claude Haiku analyzes each cluster, reading titles and abstracts to identify the theme, key findings, important papers, and technical approaches
Reports are generated in markdown with two writing style options: Standard (for researchers) or General Audience (simplified language with explanations)
Interactive scatter plot shows relationships between themes, with each cluster plotted in 2D space using UMAP dimensionality reduction
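For that last step, the high-dimensional embeddings get projected down to two dimensions for plotting. A rough sketch, assuming the umap-learn package (the parameters are illustrative):

```python
# Rough sketch of the 2D projection behind the theme scatter plot.
# Assumes the umap-learn package; parameters are illustrative.
import numpy as np
import umap

def project_to_2d(embeddings: np.ndarray) -> np.ndarray:
    """Reduce (n_papers, 1536) embeddings to (n_papers, 2) for plotting."""
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    return reducer.fit_transform(embeddings)

# Usage: coords = project_to_2d(embedding_matrix); the two columns become
# x/y positions on the plot, colored by each paper's cluster.
```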
Co-developing with Claude Code
I decided to build this with Claude Code, Anthropic’s CLI tool for AI-assisted development. After building StackDigest with heavy AI assistance, I knew this was the fastest and easiest way for me to work—but I also knew I needed to set some ground rules.
I created a .claude/CLAUDE.md file in my project with specific instructions about:
Memory-efficient code (no loading 11,000 papers into memory at once)
Using APIs rather than self-hosted solutions for machine learning operations like semantic search and embedding creation
Avoiding massive database queries (use select_related, prefetch_related, pagination)
Efficient API handling (batch OpenAI embedding calls, handle rate limits properly)
Running in a virtual environment (keep dependencies isolated)
Clean UI with minimal emojis (I wanted a professional, monochrome aesthetic)
And…so far, so good! Claude Code has been careful to reference these reminders, although it did try several times to add open-source machine learning models with huge disk space requirements to my tech stack.
The tech stack
Before unleashing Claude, I also sketched out a preliminary tech stack.
The backend
For the prototype, I decided to use Django + PostgreSQL because I know them well and had just used them for StackDigest. Could I switch to FastAPI or Flask later? Sure. But for rapid prototyping, I kept it simple and familiar.
I also installed pgvector, a PostgreSQL extension that adds vector similarity search directly in the database. This means I can run semantic searches with a single SQL query instead of pulling all papers into memory and calculating cosine similarity in Python. The performance difference is enormous: sub-second searches even with 11,000+ papers.
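With the pgvector Django integration, that single-query search looks roughly like this (the model and field names are simplified stand-ins for my actual schema):

```python
# Sketch of pgvector-backed semantic search in Django
# (model and field names are simplified stand-ins).
from django.db import models
from pgvector.django import VectorField, CosineDistance

class Paper(models.Model):
    title = models.CharField(max_length=500)
    abstract = models.TextField()
    embedding = VectorField(dimensions=1536)  # OpenAI embedding

def semantic_search(query_embedding, limit=200, threshold=0.40):
    # CosineDistance = 1 - cosine similarity, so a similarity threshold
    # of 0.40 means keeping papers with distance <= 0.60.
    return (
        Paper.objects
        .annotate(distance=CosineDistance("embedding", query_embedding))
        .filter(distance__lte=1 - threshold)
        .order_by("distance")[:limit]
    )
```

Postgres (with the pgvector extension) handles the distance calculation and ordering, so only the top results ever leave the database.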
For deployment, I’m still deciding between Heroku (where StackDigest lived) and newer options like Railway or Vercel. From what I’ve seen in forums, Django users have mixed feelings about this:
Heroku is expensive but familiar, with excellent Postgres support
Railway is cheaper and has better developer experience, but newer/less proven
Vercel is great for Next.js but awkward for Django (needs separate database hosting)
For now, the prototype runs locally, and I’ll probably choose Heroku or Railway for the initial beta phase. I’ll figure out longer-term production hosting after I validate there’s actual interest.
The frontend
I decided to use Tailwind CLI to avoid the CSS framework conflicts I’d dealt with on StackDigest (Bootstrap + custom CSS = conflict nightmare).
One thing worth noting: I’m using Tailwind CLI, not the CDN version. The CDN is fine for prototyping, but it includes all of Tailwind’s utility classes, which results in a massive CSS file (~3MB). The CLI version scans your HTML templates and only includes the classes you actually use, resulting in a tiny production bundle (~10KB in my case).
This is different from Bootstrap, where using the CDN in production is perfectly reasonable because the library is much smaller and you’re likely using most of it anyway.
Creating the framework
I set up all the data models, database schema, some basic Tailwind templates, and the backend logic. I recycled the magic link authentication system from StackDigest (why build the same login flow twice?).
But it was lifeless without data. Time to actually collect some papers.
Building the test dataset
It took about half a day to create the arXiv ingestion scripts and generate embeddings for about a month of data.
The process:
Fetch papers from arXiv API using date range and category filters
Parse XML responses (arXiv returns Atom 1.0 format)
Store in PostgreSQL with JSONField for authors and categories
Generate embeddings by sending abstracts to OpenAI in batches
Store embeddings in pgvector column for semantic search
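If you want to try something similar, here’s a stripped-down sketch of the fetch-and-embed loop (using feedparser for the Atom responses and the OpenAI Python SDK; the category, batch size, and model name are examples rather than my exact settings):

```python
# Stripped-down sketch of the arXiv ingestion loop (category, batch size,
# and embedding model are examples, not my exact production settings).
import time
import feedparser
from openai import OpenAI

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_arxiv_page(category="cs.LG", start=0, max_results=100):
    url = (
        f"{ARXIV_API}?search_query=cat:{category}"
        f"&start={start}&max_results={max_results}"
        f"&sortBy=submittedDate&sortOrder=descending"
    )
    feed = feedparser.parse(url)  # arXiv responses are Atom 1.0
    papers = [
        {
            "arxiv_id": entry.id.split("/abs/")[-1],
            "title": entry.title,
            "abstract": entry.summary,
            "published": entry.published,
        }
        for entry in feed.entries
    ]
    time.sleep(3)  # respect arXiv's rate-limit guidance
    return papers

def embed_abstracts(papers, batch_size=100):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    for i in range(0, len(papers), batch_size):
        batch = papers[i : i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[p["abstract"] for p in batch],
        )
        for paper, item in zip(batch, response.data):
            paper["embedding"] = item.embedding  # 1536-dimension vector
    return papers
```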
I also set up a daily pipeline for incremental updates, so when I deploy to production, new papers will automatically be ingested and processed overnight.
One fun technical challenge: arXiv’s API pagination is unreliable for large result sets (>5,000 papers). To work around this, I asked Claude to automatically chunk date ranges into 7-day segments. Instead of requesting 11,000 papers at once, I request 5 weekly chunks of ~2,000 papers each. So far, this has given me reliable pagination with zero empty responses.
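The chunking itself is simple: split the overall window into week-long slices and page through each one separately. A sketch (the dates below are placeholders):

```python
# Sketch of splitting a date range into 7-day chunks so each arXiv
# request stays well below the size where pagination gets flaky.
from datetime import date, timedelta

def weekly_chunks(start: date, end: date, days: int = 7):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=days - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)

# Five weekly windows instead of one giant request (placeholder dates)
for chunk_start, chunk_end in weekly_chunks(date(2025, 10, 1), date(2025, 11, 4)):
    print(chunk_start, "to", chunk_end)
```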
Overview
11,303 papers indexed (Oct-Nov 2025)
All papers have semantic search embeddings (1536 dimensions)
Research areas (Top categories)
Machine Learning (cs.LG): 45.7%
Artificial Intelligence (cs.AI): 45.1%
Computer Vision (cs.CV): 27.2%
Natural Language Processing (cs.CL): 23.8%
Yes, categories overlap—papers can be tagged with multiple categories
Model/architecture mentions (% of papers)
Transformer-based: 8.2% (926 papers)
Diffusion models: 7.4% (832 papers)
GPT variants: 4.5% (503 papers)
LLaMA/Llama: 2.5% (283 papers)
BERT: 1.6% (183 papers)
Gemini: 1.5% (164 papers)
Claude: 0.9% (100 papers)
Hot research topics
Evaluation & benchmarking: 30.3% (most common theme!)
Efficiency & compression: 22.3%
Reasoning & planning: 17.9%
Fine-tuning & adaptation: 11.3%
Safety & alignment: 10.3%
Application domains
Text/language: 39.8%
Image/vision: 18.7%
Multimodal: 7.9%
Healthcare/medical: 7.6%
Code generation: 3.3%
One interesting finding was that 30% of papers focus on evaluation and benchmarking. This makes sense, as the field is moving so fast that researchers likely need standardized ways to compare approaches. But it also suggests we’re in a “measurement era” where understanding what we’ve built is as important as building new things.
Early testing (and a critical bug fix)
I ran some test queries to see how the tool performed. The clustering looked great—papers were being grouped into coherent themes. But when I tested “Applications of LLMs for healthcare,” something weird happened.
The tool returned 621 papers. But when I read through the themes, about 70% were generic LLM evaluation papers that mentioned healthcare only as an example domain. They weren’t actually about healthcare applications.
The problem was that semantic search was treating “LLM” and “healthcare” as equally weighted terms. Papers that mentioned healthcare once in a generic context scored almost as high as papers entirely focused on medical AI.
The fix: Query enhancement
I added an intelligent query enhancement step using Claude Haiku. Before generating the embedding, Haiku analyzes the query and:
Detects query type (domain-specific, model-specific, or general)
Expands relevant terminology for domain/model queries
De-weights generic terms like “AI,” “LLM,” “model”
For “Applications of LLMs for healthcare,” it now expands to emphasize clinical terminology: “Healthcare applications clinical medicine patient care medical diagnosis treatment EHR analysis clinical decision support medical AI health informatics”
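In code, the enhancement is just one small Haiku call before embedding. A simplified sketch (the prompt wording and model name are illustrative stand-ins, not my exact production values):

```python
# Simplified sketch of the query-enhancement step (prompt wording and
# model name are illustrative stand-ins).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def enhance_query(user_query: str) -> str:
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this research search query for embedding-based search. "
                "Expand domain-specific terminology and drop generic terms like "
                "'AI', 'LLM', or 'model'. Return only the rewritten query.\n\n"
                f"Query: {user_query}"
            ),
        }],
    )
    return message.content[0].text

# e.g. enhance_query("Applications of LLMs for healthcare")
# -> "Healthcare applications clinical medicine patient care medical diagnosis ..."
```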
I also increased the similarity threshold from 0.35 to 0.40 (stricter matching).
This dramatically improved results. The healthcare query now returns themes actually focused on medical applications, not just papers that mention “healthcare” once in passing.
See it in action
After a few days of development and testing, everything came together. The prototype looks clean. The reporting is rich and, at least for me, valuable. I even tentatively named it Future Scan.
Goals for next week
But building the prototype is only the beginning. The next step is to share it—and its outputs—with other people and get some honest feedback.
And to make that happen, I need to:
Deploy to production
Recruit beta testers (this tool is less general-interest than StackDigest, so I’m expecting a smaller group)
Set up a pilot newsletter to share interesting findings from automated analyses
If this sounds interesting to you, DM me or reply to this post. I’m looking for ~10 beta users to test the tool and give feedback before I decide whether to build it into something bigger.
Looking ahead
If there’s real interest, here’s what I’m thinking about:
Expanding to other domains
This concept could work with PubMed (medical research), bioRxiv (biology), or SSRN (social sciences). The infrastructure is the same; I would just swap the data source.
Adding basic semantic search
Right now, the tool does trend analysis with clustering. But I could also add simple semantic search without clustering for a more directed search experience. Think: “Find papers about X” without waiting for full analysis.
Temporal trend tracking
With more historical data, I could track how research themes evolve over time. What was hot 6 months ago? What’s emerging right now? What’s declining?
Cross-domain analysis
What if I ingested papers from multiple domains and looked for cross-pollination? Are techniques from computer vision making their way into biology? Are healthcare researchers borrowing from robotics?
But first, I need to see if anyone actually wants this.
Stay tuned. And please DM me if you’d like to beta test. 🎉