New Open Source Stream Summarizing Java Library

Here at Clearspring we like to count interesting things. How many users have visited a site? How many unique URLs are shared every hour? It turns out that counting is a non-trivial problem when there are several billion things to count. And counting is only the first step. What about the frequency of the things you are counting? Maintaining a complete multiset — with billions of elements indexed by multiplicity — for each dimension of interest is rarely practical. So, we’ve developed a set of utilities to help make counting to a billion easy.

Today we are pleased to release those utilites under an open-source license. Stream Lib is a Java library for summarizing streams of data. Included are classes for estimating:

  • Cardinality (aka “counting things”): Instead of storing an entire set of elements it is possible to instead construct a compact binary object that will provide a tunable estimate to how many distinct elements have been seen. This reduces memory requirements by orders of magnitude. There is a significant body of academic literature on approaches to this problem. We have tried to provide useful implementations of those ideas.
  • Set membership: Bloom filters provide a space efficient way to test for set membership. They have the useful property that their are no false negatives, only false positives within whatever bounds you specify. The wikipedia page has a good section on the interesting variants that have been developed over the recent years. We have adapted Apache Cassandra’s well tested implementation for standalone use.
  • Top-k elements: While counting the number of distinct elements with a cardinality estimator is cool, sometimes (well often) you also want to know something about those elements (such as the most frequent ones). We have some early work in this area (a stochastic topper).

There is a Readme to get your started with the code. We hope that others find this as useful as we have. Feedback, comments, patches, bugs, and forks are all welcome.

Think this is cool? Apply for a job to join the team!

  • UserNetSite

    I will try it.

  • @UserNetSite, Anxious to hear what you think.

  • Wow, you guys are pure genius. With your blog, training seminars and networking experience, I had the single most productive 4 days in my professional career. Multiple market business networking is the wave of the future. I can think of no better way for ANY start up to get the word up and get smarter in the process. I’m blown away!

  • Pingback: New Open Source Stream Summarizing Java Library | Sum of the Web()

  • I took a look at the README on GitHub and realized that I’m in way over my head when trying to use this tool to analyze and count website visitor data and actions across a number of our sites.

    I think I’ll take a look at the cardinality papers and if I can’t get anything else out of it, I’ll probably outsource the project. Thanks for putting this together though!