big data

Thanks for the memories

A while back, there was a leak of LinkedIn password data in the form of a list of unsalted SHA-1 hashes. A few sites had password check tools up such that you could provide your password (or a hash of it) and it’d tell you whether it was found in the leaked data.

These sites were all really slow, and would sometimes report database errors. I found it curious that anyone would even consider a database for a fixed-size single record lookup of a small amount of immutable data.

I downloaded the data set and played with it during a meeting. In about half an hour, I had a small server that could load the data set into memory in a couple of seconds and serve responses from memory stupidly fast, with perfect horizontal scalability.

Capacity planning

I think in the LinkedIn case, there were 6.5 million hashes leaked. SHA-1 hashes are 20 bytes.

That’s 6.5 million × 20 bytes, or about 130MB of data.

I have tabs in Chrome right now that are using more memory than this, yet people deployed multi-tier infrastructures to answer simple presence queries. They’re fragile, complex, and slow.

What database do I need?

This brings me to my motivation for writing this. People have gone a little bit overboard with thinking up big solutions to small problems. I’m not trying to pick on any particular user, but I did have an example that helps make the point pretty clearly.

I picked up a Stack Overflow question yesterday about checking for hash presence that was an almost identical problem. Note that the question is tagged bigdata. In this case, it was “a few million” SHA-256 hashes.

Initially, the user attempted to use both MySQL and Couchbase to solve the problem. Apparently Couchbase used too much memory (presumably storing hex-encoded keys) and MySQL was too slow, so he tried sharding the table by first nibble, but it was still too slow, so he asked for help.

Two of the answers were reasonably sensible suggestions for MySQL: schema changes and configuration parameters that would help with efficiency.

Another suggested some combination of HBase, Redis, and Cassandra. That’s just… overkill. This is a super small-scale problem.

The spec said “several million” of these hashes. I wrote a small test with 50 million hashes. 50e6 × 32 bytes is about a gig and a half of RAM. It’d be unusual to find a computer that couldn’t spare 1.5GB of RAM for such processing. You have to get up to about a billion hashes before it starts to get a little harder.
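
A minimal, standard-library-only sketch of that kind of test looks something like this (just the shape of it, not necessarily the exact code): pack the hashes into one flat byte slice, sort it, and binary-search it.

package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
	"sort"
)

const hashSize = 32 // SHA-256

// slab holds fixed-size hashes packed back to back in one flat byte
// slice, so 50 million of them occupy exactly 50e6 * 32 bytes.
type slab []byte

func (s slab) Len() int           { return len(s) / hashSize }
func (s slab) at(i int) []byte    { return s[i*hashSize : (i+1)*hashSize] }
func (s slab) Less(i, j int) bool { return bytes.Compare(s.at(i), s.at(j)) < 0 }
func (s slab) Swap(i, j int) {
	var tmp [hashSize]byte
	copy(tmp[:], s.at(i))
	copy(s.at(i), s.at(j))
	copy(s.at(j), tmp[:])
}

// contains binary-searches the sorted slab for h.
func (s slab) contains(h []byte) bool {
	i := sort.Search(s.Len(), func(i int) bool {
		return bytes.Compare(s.at(i), h) >= 0
	})
	return i < s.Len() && bytes.Equal(s.at(i), h)
}

func main() {
	const n = 50e6 // fifty million hashes, about a gig and a half

	s := make(slab, n*hashSize)
	if _, err := rand.Read(s); err != nil { // random stand-ins for real hashes
		panic(err)
	}
	sort.Sort(s)

	// One hash we know is present, and one (all zeros) that almost
	// certainly isn't.
	fmt.Println(s.contains(s.at(1234)), s.contains(make([]byte, hashSize)))
}

The whole data set is a single allocation, so resident memory really is just 50e6 × 32 bytes plus change.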

“But that won’t scale!” you say? An EC2 instance that can hold about 8 billion such hashes in memory (8e9 × 32 bytes is roughly 256GB of RAM) costs about $3.50 per hour. By the time you get to that level, you can think about something better anyway.

But I don’t want to write a lot of code!

I pointed to the code I’d written in that meeting as an example to get started. It contains both the text-to-binary format converter and a web server that loads the resulting binary file into memory and returns an HTTP status indicating presence. That was a distracted half hour of work.
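
The converter half is tiny, too. Here’s a rough sketch of that kind of tool (assuming the binary format is nothing fancier than the raw hashes concatenated back to back, which is all the server below needs): it reads one hex-encoded hash per line on stdin and writes the raw bytes to stdout.

package main

import (
	"bufio"
	"encoding/hex"
	"log"
	"os"
)

const hashSize = 32 // 32 bytes for SHA-256; 20 for SHA-1

func main() {
	in := bufio.NewScanner(os.Stdin)
	out := bufio.NewWriter(os.Stdout)

	for in.Scan() {
		// Each input line is one hex-encoded hash.
		b, err := hex.DecodeString(in.Text())
		if err != nil || len(b) != hashSize {
			log.Fatalf("bad input line: %q", in.Text())
		}
		// The output is just the raw hashes, concatenated.
		if _, err := out.Write(b); err != nil {
			log.Fatalf("write error: %v", err)
		}
	}
	if err := in.Err(); err != nil {
		log.Fatalf("read error: %v", err)
	}
	if err := out.Flush(); err != nil {
		log.Fatalf("flush error: %v", err)
	}
}

Something like ./convert < hashes.txt > hashes.bin produces the file the server expects.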

However, I realized that I’d since written go-hashset. That makes this type of problem much easier.

Below is a complete server using go-hashset and net/http that will return HTTP 204 on a hit and HTTP 410 on a miss given a GET request to /[sha-256].

package main

import (
	"encoding/hex"
	"log"
	"net/http"
	"os"

	"github.com/dustin/go-hashset"
)

const (
	hashSize   = 32
	listenAddr = ":8080"
)

// loadFile reads a file of concatenated binary hashes into an in-memory Hashset.
func loadFile(fn string) *hashset.Hashset {
	f, err := os.Open(fn)
	if err != nil {
		log.Fatalf("Error opening hash file: %v", err)
	}
	defer f.Close()
	hs, err := hashset.Load(hashSize, f)
	if err != nil {
		log.Fatalf("Error loading hashes: %v", err)
	}

	return hs
}

func main() {
	hs := loadFile("hashes.bin")

	http.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		// The request path is the hex-encoded hash, e.g. /<sha-256 in hex>.
		b, _ := hex.DecodeString(req.URL.Path[1:])
		if len(b) == hashSize && hs.Contains(b) {
			w.WriteHeader(204)
		} else {
			w.WriteHeader(410)
		}
	})

	log.Printf("Listening on %v", listenAddr)
	log.Fatal(http.ListenAndServe(listenAddr, nil))
}

Half the code is loading the file and the other half is specific to this HTTP API. It’s easy enough to imagine another protocol if this doesn’t work for you.
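
As one sketch of an alternative, here’s a line-oriented TCP version built on the same loading code and the hashset Contains call: a client writes one hex-encoded hash per line and gets back “found” or “missing”. The port and the reply strings are arbitrary choices for this example.

package main

import (
	"bufio"
	"encoding/hex"
	"io"
	"log"
	"net"
	"os"

	"github.com/dustin/go-hashset"
)

const (
	hashSize   = 32
	listenAddr = ":2021" // arbitrary port for this sketch
)

// Same loading step as the HTTP server above.
func loadFile(fn string) *hashset.Hashset {
	f, err := os.Open(fn)
	if err != nil {
		log.Fatalf("Error opening hash file: %v", err)
	}
	defer f.Close()
	hs, err := hashset.Load(hashSize, f)
	if err != nil {
		log.Fatalf("Error loading hashes: %v", err)
	}
	return hs
}

func serve(conn net.Conn, hs *hashset.Hashset) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// One hex-encoded hash per line.
		b, err := hex.DecodeString(scanner.Text())
		resp := "missing\n"
		if err == nil && len(b) == hashSize && hs.Contains(b) {
			resp = "found\n"
		}
		if _, err := io.WriteString(conn, resp); err != nil {
			return
		}
	}
}

func main() {
	hs := loadFile("hashes.bin")

	l, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Listening on %v", l.Addr())
	for {
		conn, err := l.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go serve(conn, hs)
	}
}

A quick way to poke at it is something like echo <hex hash> | nc localhost 2021.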

Conclusion

“big data” isn’t all that clearly defined, but as a rule of thumb, if your entire data set fits comfortably in memory on a single machine, you’re definitely not working with big data.

One might even argue that if you can fit the data into a single computer, it’s not worth calling it big data, though big data processing tools can still be useful at smaller scales.

In the meantime, enjoy the smaller data in life. It’s fun and easy.

