Nostr's distributed architecture is a feature, not a bug. No central server harvesting your data. No single point of failure. No corporate gatekeeper deciding what you can post or who can see it. Events flow across hundreds of independent relays, replicated unevenly based on where users publish and subscribe.
But this design creates a surprising challenge: How do you get network-wide analytics when there's no central server?
Each relay sees only a fraction of the network. Events spread unevenly—some relays hold millions of events, others just thousands, and the overlap between them varies wildly. Questions that any centralized platform could answer instantly become genuinely hard. How many people use Nostr? What content is popular? How healthy is the network overall?
Take a simple question: How many daily active users does Nostr have? On a traditional platform, this is trivial—count the sessions. On Nostr, we can't see people browsing. No login event, no session tracking. We can only count users who publish something. So "daily publishing users" on the dashboard will always look lower than centralized services report. I'm measuring people creating, not lurking.
When I talk to people about this, the reaction is the same: everyone wants data, but nobody has done the work to get a complete picture. Most assume it's too big a problem—one you'll never quite solve. But the Pareto principle applies. We'll never get a perfect view of the network, but a very good picture is within reach.
Before Pensieve, most people relied on stats.nostr.band. But that project isn't open source, so there's no visibility into how those numbers are calculated. Over time, I found myself trusting those metrics less and less. I wanted something transparent.
Introducing Pensieve and Nostr Stats
I built an open-source solution: Pensieve, an archive-first Nostr indexer that aggregates data across relays, and Nostr Stats, a public dashboard that visualizes network health.
Pensieve isn't just another indexer. It solves the aggregation problem: pulling events from across the network, deduplicating them, validating them, and storing them both for long-term archival and for analytics. The stats dashboard sits on top, but it's just one way of making the data accessible.
How Pensieve Works
I took an archive-first approach. Rather than piping everything directly into an analytics database, Pensieve first stores canonical events in compressed archive files using a format called notepack. Notepack, created by @jb55, is a fast, highly compressible binary format for Nostr events. Why the extra step? Durability and portability. These archives can be backed up, shared with researchers who need historical data, or used to bootstrap a new instance. They're the source of truth.
Only validated, deduplicated events make it into the archive. Events get checked for valid signatures and proper formatting before they're stored—garbage doesn't get in.
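In code, that gate boils down to a validate-then-deduplicate filter. The sketch below is illustrative only: the stripped-down `Event` struct and the `is_valid` placeholder stand in for Pensieve's real types, and real validation means recomputing the event ID and verifying the Schnorr signature with a proper Nostr library.

```rust
use std::collections::HashSet;

/// Stripped-down stand-in for a Nostr event; the real pipeline uses a full event type.
struct Event {
    id: String,
    pubkey: String,
    sig: String,
}

/// Hypothetical placeholder: real validation recomputes the event ID and
/// verifies the Schnorr signature against the pubkey.
fn is_valid(event: &Event) -> bool {
    !event.id.is_empty() && !event.pubkey.is_empty() && !event.sig.is_empty()
}

/// Only events that pass validation and haven't been seen before reach the archive writer.
fn filter_for_archive(incoming: Vec<Event>, seen: &mut HashSet<String>) -> Vec<Event> {
    incoming
        .into_iter()
        .filter(is_valid)
        // `insert` returns false for IDs we've already archived, dropping duplicates.
        .filter(|e| seen.insert(e.id.clone()))
        .collect()
}
```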
Pensieve pulls data from multiple sources:
- Live relay streaming with intelligent relay selection, maintaining persistent connections to relays and receiving events as they're published
- Historical backfills from JSONL and protobuf archives—I've already processed over a terabyte and a half of historical data, including the full archives from Primal's relay
- Automatic relay discovery via NIP-65, finding new relays as users update their relay lists (see the sketch after this list)
- Negentropy syncing with the Damus relay for efficient event reconciliation
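The NIP-65 discovery step is worth a small illustration: relay lists are just kind 10002 events whose `r` tags carry relay URLs, so new relays fall out of ordinary event processing. The `Event` shape below is a simplified stand-in, not Pensieve's actual type.

```rust
/// Simplified stand-in for a parsed Nostr event: its kind plus its tags.
struct Event {
    kind: u32,
    tags: Vec<Vec<String>>,
}

/// Extract relay URLs from a NIP-65 relay list event (kind 10002).
/// Each "r" tag looks like ["r", "wss://relay.example.com"], with an
/// optional third element marking the relay as "read" or "write".
fn relays_from_nip65(event: &Event) -> Vec<String> {
    if event.kind != 10002 {
        return Vec::new();
    }
    event
        .tags
        .iter()
        .filter(|tag| tag.first().map(String::as_str) == Some("r"))
        .filter_map(|tag| tag.get(1).cloned())
        .collect()
}
```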
The last source on that list, negentropy syncing, deserves a fuller explanation.
Imagine you have a million events and a relay has a million events. How do you figure out which ones you're each missing? The naive approach: request everything and deduplicate locally. But that's expensive—you're transferring vast amounts of data you already have.
Negentropy, created by @Doug Hoyte, solves this elegantly. It's a set-reconciliation protocol based on range-based comparison. Both sides compute short fingerprints (hashes) of the event IDs within a given range. If the fingerprints match, those ranges are identical—no need to check individual events. If they differ, the range gets subdivided and compared again, recursively narrowing down to just the events that are actually different.
The result: Pensieve asks for only the IDs it's missing. No redundant transfers. And when a relay connection drops, negentropy lets Pensieve catch up efficiently on everything that was published while it was disconnected—no need to re-fetch the entire history.
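To make the recursion concrete, here is a toy version of the idea. It is not the real negentropy wire protocol, which works over timestamp-ordered event IDs with compact fingerprints; it just shows how matching fingerprints let both sides skip whole ranges, assuming each side keeps its IDs sorted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy fingerprint: a hash over a sorted range of IDs (the real protocol
/// uses compact incremental fingerprints, not a general-purpose hasher).
fn fingerprint(ids: &[u64]) -> u64 {
    let mut hasher = DefaultHasher::new();
    ids.hash(&mut hasher);
    hasher.finish()
}

/// Return the IDs the remote side has that we're missing, subdividing
/// only the ranges whose fingerprints disagree.
fn missing_ids(local: &[u64], remote: &[u64]) -> Vec<u64> {
    if fingerprint(local) == fingerprint(remote) {
        return Vec::new(); // identical ranges: nothing to request
    }
    if remote.len() <= 4 {
        // Small range: compare the individual IDs directly.
        return remote
            .iter()
            .copied()
            .filter(|id| !local.contains(id))
            .collect();
    }
    // Split both sides around the remote midpoint and recurse on each half.
    let pivot = remote[remote.len() / 2];
    let (local_lo, local_hi): (Vec<u64>, Vec<u64>) =
        local.iter().copied().partition(|&id| id < pivot);
    let (remote_lo, remote_hi): (Vec<u64>, Vec<u64>) =
        remote.iter().copied().partition(|&id| id < pivot);
    let mut missing = missing_ids(&local_lo, &remote_lo);
    missing.extend(missing_ids(&local_hi, &remote_hi));
    missing
}

fn main() {
    let local = vec![1, 2, 3, 5, 8, 13];
    let remote = vec![1, 2, 3, 4, 5, 8, 13, 21];
    assert_eq!(missing_ids(&local, &remote), vec![4, 21]);
}
```

When the two sets are nearly identical, almost every fingerprint matches early, so only the handful of differing events ever crosses the wire.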
I also built a relay quality scoring system. The logic is simple: not all relays are equally valuable for data collection. Some relays surface events I've never seen before. Others mostly send duplicates of what I already have. Some stay online reliably; others drop connections constantly.
The scoring system tracks all of this—novel event rates, connection success, uptime—and dynamically prioritizes which relays to focus on. Resources flow toward relays that consistently deliver unique data. It's naive right now, but even simple heuristics help when you're trying to efficiently cover a network of hundreds of relays.
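For a flavor of what that looks like, here is a minimal scoring sketch. The counters and weights are illustrative, not the heuristics Pensieve actually ships with.

```rust
/// Per-relay counters accumulated as Pensieve pulls events.
#[derive(Default)]
struct RelayStats {
    events_received: u64,
    novel_events: u64, // events we hadn't already seen from any other source
    connect_attempts: u64,
    connect_failures: u64,
}

impl RelayStats {
    /// Share of this relay's events that were new to us.
    fn novelty_rate(&self) -> f64 {
        if self.events_received == 0 {
            return 0.0;
        }
        self.novel_events as f64 / self.events_received as f64
    }

    /// Share of connection attempts that succeeded.
    fn reliability(&self) -> f64 {
        if self.connect_attempts == 0 {
            return 0.0;
        }
        1.0 - self.connect_failures as f64 / self.connect_attempts as f64
    }

    /// Illustrative score: mostly novelty, partly reliability. Relays are
    /// polled in descending score order, so resources flow toward the ones
    /// that keep delivering unique data.
    fn score(&self) -> f64 {
        0.7 * self.novelty_rate() + 0.3 * self.reliability()
    }
}
```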
ClickHouse is the secret weapon here. It's a column-oriented database built for real-time analytics on massive datasets. Traditional row-based databases struggle when you ask questions like "how many unique users published each day for the last year?"—they have to scan entire rows even when you only need two columns. ClickHouse stores data by column, so aggregation queries that touch billions of events come back in milliseconds. It's open source, battle-tested at companies like Cloudflare and Uber, and designed exactly for the kind of time-series analytics dashboards need. When you're slicing and dicing millions of Nostr events by kind, by timestamp, by author—ClickHouse doesn't break a sweat.
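As a concrete example, the "unique publishing users per day" question reduces to a single aggregation. The table and column names below (and the assumption that created_at is a DateTime column) are illustrative rather than Pensieve's actual schema, but the shape of the query is exactly what ClickHouse is built for.

```rust
// Sketch of a dashboard-style aggregation: daily unique publishing users
// over the past year. Table and column names (events, created_at, pubkey)
// are assumed for illustration, not Pensieve's actual schema.
const DAILY_PUBLISHING_USERS_SQL: &str = r#"
    SELECT
        toDate(created_at) AS day,
        uniqExact(pubkey)  AS publishing_users
    FROM events
    WHERE created_at >= now() - INTERVAL 1 YEAR
    GROUP BY day
    ORDER BY day
"#;
```

Because only the created_at and pubkey columns are read, the query never touches content or tags, which is why it stays fast even at billions of rows.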
I built it in Rust for a reason. When you're processing terabytes of messy, inconsistent data—files with invalid formats, malformed tags, events that claim to be from January 1970—you need a system that's both fast and reliable. Rust gives you both. Its memory safety and type system mean that once the code compiles, you've got a pretty good sense it's going to be stable. Edge cases get handled. The program doesn't crash at 3am on a Sunday.
What You Can See Today
The dashboard is live at stats.andotherstuff.org. Here's what you can explore:
- Publishing users tracked over time (excluding ephemeral keys)
- New user growth and retention patterns
- Event kind distribution: see what types of content dominate the network
- Zaps and Lightning activity: real money flowing through Nostr
- Relay distribution from NIP-65 data, showing which relays users prefer
- Time-series charts for all metrics, letting you spot trends
One insight that surprised me: the sheer number of users with dead relays still listed in their relay preferences. If you haven't checked your relay list lately, now might be a good time.
The data tells the story of a nascent network. New user numbers are noisy—there are plenty of ephemeral keys and profiles that never get filled out, which might be bots or might be people who tried Nostr once and bounced. There's work ahead to prove Nostr's value to a broader audience.
But that's exactly the point. Seeing real numbers—honest, transparent, calculated from verifiable data—is what the community needs to make good decisions about where to build next. No more guessing. No more trusting opaque metrics from closed systems.
What's Next and Getting Involved
Pensieve is currently live-streaming roughly one million events per day into the system. The historical data goes back to what I call "Nostr genesis"—the creation of @fiatjaf's original design document—and I've already backfilled archives from Primal.
I'm looking for more historical data. If you're a relay operator with archives, I'd love to backfill from your data. The easiest format: JSONL files (one event per line), compressed with standard compression (gzip, zstd, etc.). Send me the files or share a link, and I'll handle the rest.
On the roadmap: deeper event kind breakdowns, more relay analytics, and eventually a BI tool that would let researchers run custom queries against the full dataset. I'd also like to see multiple Pensieve instances running, tuned for different geographical regions or relay types, with negentropy syncing between them to ensure comprehensive coverage.
The code is fully open source:
- Pensieve: github.com/andotherstuff/pensieve
- Nostr Stats: github.com/andotherstuff/nostr-stats
For developers who want to contribute: the relay discovery and quality scoring system needs work. The current heuristics are naive, and there's room to make them smarter. I'd also love to expand our use of negentropy syncing to more relays.
Built as part of And Other Stuff, an open-source collective focused on public infrastructure for Nostr. Having a large analytics data source doesn't just serve researchers—it opens the door to services like aggregation APIs that let client developers get fast counts on followers, likes, and replies. That's the kind of shared infrastructure this network needs.
Check out the live dashboard at stats.andotherstuff.org, explore the code, and if you've got historical data to share—reach out.