Dustin Sallings2023-10-17T07:32:50+00:00http://dustin.github.com/Dustin Sallingsdustin@spy.netGoProFS - All My Video2023-10-16T00:00:00+00:00http://dustin.github.com/2023/10/16/goprofs<h1 id="goprofs">GoProFS</h1>
<p>So, I end up shooting a lot of video. As I mentioned way back in my
<a href="/2020/04/29/gopro-plus.html">GoPro Plus</a> post, I mostly shoot with a GoPro and use the
GoPro+ service.</p>
<p>I’d been backing up all of my cloud stuff to a USB hard drive on a
Raspberry pi. That allows me to keep all of my Davinci Resolve
projects around and reference media without having to do anything
weird.</p>
<p>However, I ran into a problem.</p>
<div>
<img src="/images/goprostor.png" alt="storage dashboard" title="GoPro Storage Dashboard" width="100%" />
</div>
<p>Looks like I just crossed 18.5TB of raw media. The 20TB hard drive I’ve
got seems to provide a bit over 17TB of usable storage.</p>
<h2 id="obvious-solutions">Obvious Solution(s)</h2>
<p>Now, there are a few obvious things one might do in this situation.
I’ll briefly describe why I didn’t do any of these things.</p>
<h3 id="create-a-raid0-thing">Create a Raid0 Thing</h3>
<p>I <em>could</em>, but if any disk fails, I lose all the things. Also, my
Raspberry Pi doesn’t have enough USB (or memory for that matter).</p>
<h3 id="create-a-proper-raidz-pool">Create a Proper RAIDZ Pool</h3>
<p>In addition to the USB/Memory issues, this is now getting rather
expensive. I’m already paying GoPro to store my media, creating an
expensive local solution isn’t desirable.</p>
<p>I do <em>have</em> a TrueNAS box with raidz2 (and am using it for some
overflow), but that’s expensive storage. This is <em>mostly</em> a backup.</p>
<h3 id="use-davinci-resolve-media-management">Use Davinci Resolve Media Management</h3>
<p>I could ask Davinci Resolve to pull out only the parts of media that I
actually use and give me a much smaller set of media to work with.
That works for some projects, but I sometimes go back and find things
that I missed when I was first reviewing footage.</p>
<p>But in general, I want everything easily available in source form.</p>
<h2 id="ok-so-what-is-goprofs">OK, So What is GoProFS?</h2>
<p>My initial thought was to build a FUSE-based filesystem that provides
access to data stored in the GoPro cloud through a single magic mechanism.
I built that and it’s pretty great. I wrote the bulk of it in <code class="language-plaintext highlighter-rouge">go</code>
because I didn’t have a Haskell FUSE library that worked on OS X and
wanted to get a prototype up and running instead of solving that
library deficiency first, while still being able to use the bulk of my
Haskell code via the web interface.</p>
<p>It does more than just provide cloud access, but let’s just start with
that.</p>
<h3 id="cloud-file-access">Cloud File Access</h3>
<p><code class="language-plaintext highlighter-rouge">goprofs</code> will create a local directory that will give you a tree like
this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ls -l $GOPROFS/2014/07
total 0
dr-xr-xr-x 1 root wheel 0 Jul 1 2014 m0ao4NWMQBWvn
dr-xr-xr-x 1 root wheel 0 Jul 1 2014 pJ0WlypDKXJX1
</code></pre></div></div>
<p>where each directory contains the source media you uploaded to the
service.</p>
<p>The contents are fetched dynamically in chunks. For example, if you
open an mp4 file with QuickTime Player, you’ll find that it reads a
bit of the beginning of the file and then a bit from the end of the
file, then a few other parts near the end, then the beginning again,
then more near the end for a while, etc…</p>
<p>If you’re opening a 10GB file, it’d be rather unfortunate to have to
wait to pull the entire content down just to see the first frame, so
the underlying mechanism slices the file into 8MB (arbitrary number)
chunks and maps each byte range requested to a series of chunks,
blocking reads until all of those chunks are satisfied. The chunks
themselves are stored in a sparse file for the duration of the
reader’s session, pulling any new blocks the user may request.</p>
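<p>To make the chunk bookkeeping concrete, here’s a sketch in Go of how a
requested byte range can be mapped to chunk indices. The names and exact
math are mine for illustration, not necessarily what <code class="language-plaintext highlighter-rouge">goprofs</code> does internally:</p>

```go
package main

import "fmt"

const chunkSize = 8 << 20 // 8MB, matching the arbitrary chunk size above

// chunksFor maps a requested byte range to the chunk indices that must
// be present before the read can be satisfied.
func chunksFor(off, length int64) []int64 {
	if length <= 0 {
		return nil
	}
	first := off / chunkSize
	last := (off + length - 1) / chunkSize
	chunks := make([]int64, 0, last-first+1)
	for c := first; c <= last; c++ {
		chunks = append(chunks, c)
	}
	return chunks
}

func main() {
	// A 10MB read starting 4MB into the file touches chunks 0 and 1.
	fmt.Println(chunksFor(4<<20, 10<<20))
}
```

<p>A read blocks until every chunk in that list is present in the sparse
file, after which the bytes can be copied out of it directly.</p>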
<p>When the application using the file is done with it, we check to see
if we’ve found all the blocks. If we have, then we rename the file to
a permanent “this is all the parts” name. Otherwise, we write down
which blocks we had and reload them for the next file read.</p>
<p>This means you only need the parts of the video you are actually
using.</p>
<h3 id="but-i-already-downloaded-everything">But I Already Downloaded Everything</h3>
<p>Since I’ve been using my <code class="language-plaintext highlighter-rouge">backuplocal</code> command, I already had a bunch
of the footage locally (until I ran out of disk), so if I’ve already
got a file, I should be able to just use that instead of bothering the
GoPro Plus service for a signed URL to download it again. <code class="language-plaintext highlighter-rouge">goprofs</code>
has a <code class="language-plaintext highlighter-rouge">-source</code> flag allowing it to check for a local file (in the
same file layout as the backups) before going to the internet. Better
yet, you can supply <code class="language-plaintext highlighter-rouge">-source</code> multiple times to have it look in
multiple places.</p>
<h3 id="why-is-my-data-in-multiple-places">Why Is My Data in Multiple Places?</h3>
<p>It turns out, it was quite easy to get the <code class="language-plaintext highlighter-rouge">backuplocal</code> command to,
itself, look in multiple places to see what it has, and then write to
one of those. So I can just add another storage location and do
something kind of like striping where the backup itself can be over
multiple distinct filesystems.</p>
<p>This is similar to the raid0 solution discussed above, but without any
direct filesystem support or even needing the sources to be on the
same host. In my case, I’ve got my old big USB disk on my Raspberry
Pi and a bit of overflow on my raidz2 volume until I can get another
cheap disk.</p>
<h3 id="tldr-where-does-it-look">tl;dr: Where Does it Look?</h3>
<p>For a given artifact, it checks:</p>
<ol>
<li>Each <code class="language-plaintext highlighter-rouge">-source</code> location</li>
<li>Its download cache location</li>
<li>The cloud, fetching individual chunks as they’re read.</li>
</ol>
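<p>That order amounts to a first-hit-wins chain. An illustrative Go sketch
(the names here are mine, not from <code class="language-plaintext highlighter-rouge">goprofs</code>):</p>

```go
package main

import "fmt"

// lookup is anything that can resolve an artifact name to a local path.
type lookup func(name string) (string, bool)

// resolve tries each location in order and returns the first hit,
// mirroring the source -> cache -> cloud ordering described above.
func resolve(name string, locs ...lookup) (string, bool) {
	for _, l := range locs {
		if p, ok := l(name); ok {
			return p, true
		}
	}
	return "", false
}

// mapLookup adapts a simple map into a lookup, standing in for a
// directory scan or a cloud request in this sketch.
func mapLookup(m map[string]string) lookup {
	return func(name string) (string, bool) {
		p, ok := m[name]
		return p, ok
	}
}

func main() {
	src := mapLookup(map[string]string{"a.mp4": "/mnt/usb/a.mp4"})
	cache := mapLookup(map[string]string{"b.mp4": "/cache/b.mp4"})
	fmt.Println(resolve("b.mp4", src, cache))
}
```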
<h3 id="sounds-like-it-does-a-lot">Sounds Like it Does a Lot!</h3>
<p>It does! But there’s one more feature: Proxies</p>
<p>Davinci Resolve will automatically link a proxy for <code class="language-plaintext highlighter-rouge">F.MP4</code> if it
finds a file named <code class="language-plaintext highlighter-rouge">Proxy/F.MP4</code> (or <code class="language-plaintext highlighter-rouge">Proxy/F.mov</code> or whatever).
While most of the filesystem is read-only (since it’s showing you your
own prerecorded content with no ability to break stuff), we do want to
be able to create these <code class="language-plaintext highlighter-rouge">Proxy</code> directories. So we do that with a
read/write loopback for the <code class="language-plaintext highlighter-rouge">Proxy</code> directory of any medium overlaying
a different disk location. This is pretty important when you’re
dealing with media of the sizes we’re talking about.</p>
<h2 id="sounds-awesome--hows-it-work">Sounds Awesome. How’s it Work?</h2>
<p>So far, it works as I’d expect, <em>but</em> Blackmagic Proxy Generator seems
to hang sometimes while generating proxies. As far as I can tell,
it’s not my bug, but I can’t tell what it’s doing because OS X is
pretty hostile to system introspection these days. I may just try
different software for this.</p>
<p>It’s in the goprofs tree of my <a href="https://github.com/dustin/gopro">gopro project</a> on github.
Note that it does require my Haskell GoPro web services to be running
as an endpoint. I could make it talk directly to the GoPro Plus
service if I wanted to, but that’s a lot of work to reproduce what
I’ve already been running for years and doing so in a language that’s
harder to work in.</p>
Monads are Tedious in Go2021-06-23T00:00:00+00:00http://dustin.github.com/2021/06/23/monads-are-tedious-in-go<h1 id="monads-are-tedious-in-go">Monads are Tedious in Go</h1>
<div>
<img src="/images/gonad.png" alt="mqtt" title="Monads in Go" class="floatleft" width="200" height="272" />
</div>
<p>Most of my personal projects are written in Haskell these days.
I’ve heard people say “Haskell is hard” or whatever for a long time,
but the reason I write most of my projects in Haskell isn’t because
I’m smart and want to do the most impressive smart person thing
possible, but because I’m dumb and want better tools to help me
understand things more easily and avoid the kind of bugs that dumb
people like me write a lot.</p>
<p>On any given work day, I review at least one piece of
<a href="https://golang.org">go</a> code. Go kind of has a similar thing in
theory about not having “clever” features that might confuse people.
Some of this is really nice, but some of it is tedious. I’m going to
get into the latter a bit here.</p>
<p>Much of the code I end up reviewing contains many reimplementations of
<code class="language-plaintext highlighter-rouge"><$></code> or <code class="language-plaintext highlighter-rouge">>>=</code>, sometimes buggy. These are scary looking things to
someone who doesn’t write any Haskell, but they’re so fundamental to
what most code is doing that you just absorb them quickly.</p>
<p>This post isn’t meant to be a tutorial on Haskell operators, but <code class="language-plaintext highlighter-rouge"><$></code>
is also spelled <code class="language-plaintext highlighter-rouge">fmap</code> and basically means “apply this function inside
that thing.” e.g., you might apply a function to each element of a
list, or to the value inside an optional (think nullable pointer).
<code class="language-plaintext highlighter-rouge">>>=</code> is the monadic “bind” operator and basically is used to combine
monadic actions.</p>
<p>At this point, you’re either confused by the jargon or angry at how
inaccurate my descriptions are. I intend to make things clearer as I
write. I’m actually intending to write more about go, so let’s get
going.</p>
<h2 id="error-handling-in-go">Error Handling in Go</h2>
<p>I actually rather like how error handling works in go. Mostly. I’ve
worked in lots of languages with exceptions and I’ve disliked most
exception handling I’ve encountered (including Haskell). You either
end up with a completely opaque error path that you are unable to
reason about (e.g., any line of code may fail with any exception) or
you end up with an incomplete list of exceptions you might have to
care about, but generally can’t do anything about at a particular site
(e.g., Java checked exceptions, which are always an incomplete list).</p>
<p>In go, a function that might fail will return an error. This is
really nice. You can see what might fail and decide what to do about
it. In most cases, you just pass it up, but you don’t have the
situation where you’ve forgotten to handle a particular exception type
and your program crashes instead of just failing gently.</p>
<p>There are two downsides, though:</p>
<ol>
<li>You have to add the dreaded <code class="language-plaintext highlighter-rouge">if err != nil { return err }</code> code
everywhere.</li>
<li>You have to check errors and ignore values on error by
<em>convention</em>.</li>
</ol>
<p>The first point is mildly annoying. It seems unnecessary, and there
have been <a href="https://github.com/golang/go/issues/32437">exhaustive discussions</a> around how to
improve this particular case. Looking at it with my Haskell glasses
on, it seems really weird to even consider writing a special-case
built-in just to cover what is a super generic concern. All this for
what is just a special use case of <code class="language-plaintext highlighter-rouge">>>=</code>.</p>
<p>The second point comes up a lot in code review. You’re not supposed to
use values if you also got an error. You’re not supposed to <em>return</em>
a useful value if you wish to return an error. This is a bit beyond
the scope of what I wanted to discuss here, but it’s something you
have to consider every single time you return a <code class="language-plaintext highlighter-rouge">(T,error)</code> and every
time you receive one. The approach I use below to demonstrate what
this code <em>could</em> be happens not to have this problem either, so
I wanted to bring it up.</p>
<h2 id="idiomatic-go-is-the-either-monad-sort-of">Idiomatic Go is the Either Monad (sort of)</h2>
<p>In go, if you want to return a value of type <code class="language-plaintext highlighter-rouge">T</code> or an error, you
generally return <code class="language-plaintext highlighter-rouge">(T, error)</code> and expect the user to only use one of
those values. I’m going to contrast this to Haskell’s <code class="language-plaintext highlighter-rouge">Either</code> type
which is <em>almost</em> the same. <code class="language-plaintext highlighter-rouge">Either a b</code> can give you either <code class="language-plaintext highlighter-rouge">Left a</code>
or <code class="language-plaintext highlighter-rouge">Right b</code> (<code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are types here). The primary difference is
the second point above… you <em>can’t</em> get both values. You either
get an error or a value.</p>
<p>As a Monad, <code class="language-plaintext highlighter-rouge">Either a</code> will effectively short-circuit any failure and
continue forward with any value.</p>
<p>Let’s look at an example where we have a function that takes two
numbers as strings, adds them together, and returns the value as a
string:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">readInt</span><span class="p">(</span><span class="n">s</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">int</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">strconv</span><span class="o">.</span><span class="n">Atoi</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="kt">string</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ai</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">readInt</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">err</span>
<span class="p">}</span>
<span class="n">bi</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">readInt</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="k">return</span> <span class="s">""</span><span class="p">,</span> <span class="n">err</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">strconv</span><span class="o">.</span><span class="n">Itoa</span><span class="p">(</span><span class="n">ai</span> <span class="o">+</span> <span class="n">bi</span><span class="p">),</span> <span class="no">nil</span>
<span class="p">}</span>
</code></pre></div></div>
<p>(I included <code class="language-plaintext highlighter-rouge">readInt</code> just so the types are visible)</p>
<p>This is pretty straightforward, idiomatic go. The rough equivalent in
Haskell would look something like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">readInt</span> <span class="o">::</span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">String</span> <span class="kt">Int</span>
<span class="n">readInt</span> <span class="o">=</span> <span class="n">readEither</span>
<span class="n">add</span> <span class="o">::</span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">String</span> <span class="kt">String</span>
<span class="n">add</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">readInt</span> <span class="n">a</span> <span class="kr">of</span>
<span class="kt">Left</span> <span class="n">x</span> <span class="o">-></span> <span class="kt">Left</span> <span class="n">x</span>
<span class="kt">Right</span> <span class="n">ai</span> <span class="o">-></span> <span class="kr">case</span> <span class="n">readInt</span> <span class="n">b</span> <span class="kr">of</span>
<span class="kt">Left</span> <span class="n">x</span> <span class="o">-></span> <span class="kt">Left</span> <span class="n">x</span>
<span class="kt">Right</span> <span class="n">bi</span> <span class="o">-></span> <span class="kt">Right</span> <span class="p">(</span><span class="n">show</span> <span class="p">(</span><span class="n">ai</span> <span class="o">+</span> <span class="n">bi</span><span class="p">))</span>
</code></pre></div></div>
<p>That’s kind of worse in that it seems to march off to the right. What
if we wanted to add three numbers!?</p>
<p>But <code class="language-plaintext highlighter-rouge">Either a</code> is a monad, so we can use <code class="language-plaintext highlighter-rouge">>>=</code> to get things done.
This is the equivalent function without <code class="language-plaintext highlighter-rouge">case</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">addM</span> <span class="o">::</span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">String</span> <span class="kt">String</span>
<span class="n">addM</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span>
<span class="n">readInt</span> <span class="n">a</span> <span class="o">>>=</span> <span class="nf">\</span><span class="n">ai</span> <span class="o">-></span>
<span class="n">readInt</span> <span class="n">b</span> <span class="o">>>=</span> <span class="nf">\</span><span class="n">bi</span> <span class="o">-></span>
<span class="n">pure</span> <span class="o">.</span> <span class="n">show</span> <span class="o">$</span> <span class="n">ai</span> <span class="o">+</span> <span class="n">bi</span>
</code></pre></div></div>
<p>This does the same thing – we’ll either get <code class="language-plaintext highlighter-rouge">ai</code> as an int, or it’ll
short-circuit the rest of the function and return the error we got
from trying to parse the value. i.e., it does all the <code class="language-plaintext highlighter-rouge">if err != nil
{ return err }</code> bits for you.</p>
<p>Much like in the initial version, <code class="language-plaintext highlighter-rouge">ai</code> is the <code class="language-plaintext highlighter-rouge">Int</code> form of the
<code class="language-plaintext highlighter-rouge">String</code> <code class="language-plaintext highlighter-rouge">a</code>, as <code class="language-plaintext highlighter-rouge">bi</code> is for <code class="language-plaintext highlighter-rouge">b</code>. <code class="language-plaintext highlighter-rouge">ai</code> and <code class="language-plaintext highlighter-rouge">bi</code> are arguments to
lambda functions that do stuff with those <code class="language-plaintext highlighter-rouge">Int</code> values. It should be
clear here that there’s no possible way to get <code class="language-plaintext highlighter-rouge">ai</code> if <code class="language-plaintext highlighter-rouge">readInt</code>
returned a <code class="language-plaintext highlighter-rouge">Left</code> (error), so the only thing the code <em>can</em> do is
return that error and not push the value into the next lambda.
There’s not a way to get this wrong.</p>
<p>In the wild, you’d probably be more likely to see this written in the
<code class="language-plaintext highlighter-rouge">do</code> syntax, which is just syntactic sugar for the above:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">addM'</span> <span class="o">::</span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">String</span> <span class="kt">String</span>
<span class="n">addM'</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">ai</span> <span class="o"><-</span> <span class="n">readInt</span> <span class="n">a</span>
<span class="n">bi</span> <span class="o"><-</span> <span class="n">readInt</span> <span class="n">b</span>
<span class="n">pure</span> <span class="o">.</span> <span class="n">show</span> <span class="o">$</span> <span class="n">ai</span> <span class="o">+</span> <span class="n">bi</span>
</code></pre></div></div>
<p>Note that this isn’t a Haskell language feature that knows how to do
fancy stuff with <code class="language-plaintext highlighter-rouge">Either a</code> – that’s just how the library is
defined. You can make your own monad that works differently (as long
as it’s <a href="https://wiki.haskell.org/Monad_laws">lawful</a>). The
definition for Either is just this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Monad</span> <span class="p">(</span><span class="kt">Either</span> <span class="n">e</span><span class="p">)</span> <span class="kr">where</span>
<span class="kt">Left</span> <span class="n">l</span> <span class="o">>>=</span> <span class="kr">_</span> <span class="o">=</span> <span class="kt">Left</span> <span class="n">l</span>
<span class="kt">Right</span> <span class="n">r</span> <span class="o">>>=</span> <span class="n">k</span> <span class="o">=</span> <span class="n">k</span> <span class="n">r</span>
</code></pre></div></div>
<p>i.e., if we get a <code class="language-plaintext highlighter-rouge">Left l</code>, we ignore our second param (the function)
and return the <code class="language-plaintext highlighter-rouge">Left l</code>. If we get a <code class="language-plaintext highlighter-rouge">Right r</code>, we pass <code class="language-plaintext highlighter-rouge">r</code> to that
function (named <code class="language-plaintext highlighter-rouge">k</code> here).</p>
<p>Of course, I wouldn’t write it that way either, since monads are also
applicative functors. My brain automatically rewrites that using
<a href="https://hackage.haskell.org/package/base-4.15.0.0/docs/Control-Applicative.html#v:liftA2">liftA2</a>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">addA</span> <span class="o">::</span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">String</span> <span class="kt">String</span>
<span class="n">addA</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">show</span> <span class="o"><$></span> <span class="n">liftA2</span> <span class="p">(</span><span class="o">+</span><span class="p">)</span> <span class="p">(</span><span class="n">readInt</span> <span class="n">a</span><span class="p">)</span> <span class="p">(</span><span class="n">readInt</span> <span class="n">b</span><span class="p">)</span>
</code></pre></div></div>
<p>Again, same error handling, etc…</p>
<h2 id="monadic-go">Monadic Go</h2>
<p>Now, imagine we had similar monadic functionality in go. We’d
write something like:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">func</span> <span class="n">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="kt">string</span><span class="p">)</span> <span class="p">(</span><span class="n">ErrorOr</span><span class="p">[</span><span class="n">String</span><span class="p">])</span> <span class="p">{</span>
<span class="n">ai</span> <span class="err">⩴</span> <span class="n">readInt</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">bi</span> <span class="err">⩴</span> <span class="n">readInt</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">strconv</span><span class="o">.</span><span class="n">Itoa</span><span class="p">(</span><span class="n">ai</span> <span class="o">+</span> <span class="n">bi</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is similar to the <a href="https://github.com/golang/go/issues/32437">try</a> specification mentioned earlier, but
with the dream of having an arbitrary binding mechanism that lets
specific types (in this case, <code class="language-plaintext highlighter-rouge">ErrorOr[T]</code>) decide what it means to
either move values forward or fail.</p>
<h2 id="tedium">Tedium</h2>
<p>I wasn’t thinking about this because I want to sell people on Haskell.
It’s more about how I see the same stuff written in go (and other
languages) every day and have to be super vigilant to make sure nobody
is introducing bugs in their error handling. I catch bugs in code
like this regularly. These types of bugs couldn’t even be expressed in
code that compiles if people just embraced the monads they were
reinventing and worked at a higher level.</p>
<p>The terminology is probably confusing to folks from strange lands, but
much like the olde Gang of Four patterns, you see the same stuff a lot
and name the patterns. Except folks also make libraries that just
<em>do</em> the patterns, so we don’t all have to reinvent things so
frequently.</p>
SubTree2021-04-20T00:00:00+00:00http://dustin.github.com/2021/04/20/subtree<div>
<img src="/images/mqtt.png" alt="mqtt" title="MQTT" class="floatright" width="200" height="172" />
</div>
<h1 id="subtree---an-mqtt-subscription-tree">SubTree - An MQTT Subscription Tree</h1>
<p>I do a lot of IoT junk and use <a href="https://mqtt.org">mqtt</a> quite a bit. Somehow I
ended up writing my own <a href="https://github.com/dustin/mqttd">mqtt broker</a> in Haskell along the way.
There was one core part of the server that I thought would be really
hard, but ended up being one of my favorite pieces of code, though
it’s only about 50 lines long. I wanted to describe why.</p>
<p>First, a bit of background.</p>
<h2 id="mqtt">MQTT</h2>
<p>MQTT is basically a pubsub service for IoT-like things. Little
devices connect to an MQTT broker and publish messages so that other
devices connected to the same broker can pick them up via
subscriptions. It’s a bit complicated and there are a lot of details,
but I’ll try to simplify to the relevant points.</p>
<h3 id="publishing">Publishing</h3>
<p>Messages are published to topics that are slash-separated strings,
e.g. <code class="language-plaintext highlighter-rouge">a/b/c</code>. The broker’s job is to decide who wants to receive
messages published to <code class="language-plaintext highlighter-rouge">a/b/c</code> and delivers a copy of the message to
that subscriber.</p>
<h3 id="subscribing">Subscribing</h3>
<p>Multiple clients can subscribe to topics either specifically, e.g.,
<code class="language-plaintext highlighter-rouge">a/b/c</code> or by a couple different wildcards. <code class="language-plaintext highlighter-rouge">+</code> can replace any
single path element, e.g., <code class="language-plaintext highlighter-rouge">+/b/c</code> or <code class="language-plaintext highlighter-rouge">a/+/c</code> or <code class="language-plaintext highlighter-rouge">+/+/+</code> or whatever.
<code class="language-plaintext highlighter-rouge">#</code> can appear as the last element in a path and matches anything
below the current path. Also: topics whose first character is <code class="language-plaintext highlighter-rouge">$</code> are
not automatically matched by a toplevel <code class="language-plaintext highlighter-rouge">#</code>. i.e., <code class="language-plaintext highlighter-rouge">$x/blah</code> will not
be matched by <code class="language-plaintext highlighter-rouge">#</code> but it will be matched by <code class="language-plaintext highlighter-rouge">$x/#</code>.</p>
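<p>Those rules can be sketched in a few lines of Go. This is an
illustrative simplification, not mqttd’s implementation, and the real
spec has more edge cases than this:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether an MQTT topic filter matches a topic:
// + matches exactly one level, # matches everything below this point,
// and wildcards at the top level don't match $-prefixed topics.
func matches(filter, topic string) bool {
	fs := strings.Split(filter, "/")
	ts := strings.Split(topic, "/")
	if strings.HasPrefix(ts[0], "$") && (fs[0] == "#" || fs[0] == "+") {
		return false
	}
	for i, f := range fs {
		if f == "#" {
			return true
		}
		if i >= len(ts) || (f != "+" && f != ts[i]) {
			return false
		}
	}
	return len(fs) == len(ts)
}

func main() {
	fmt.Println(matches("a/+/c", "a/b/c"))  // true
	fmt.Println(matches("#", "$x/blah"))    // false
	fmt.Println(matches("$x/#", "$x/blah")) // true
}
```

<p>Even this toy version hints at why a flat Map won’t do: a single
published topic can match many distinct filters.</p>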
<p>The specific rules aren’t too important, but it’s complicated enough
that you can’t just use a simple Map to locate subscribers and there
may be multiple subscribers for every published topic.</p>
<h3 id="other-fancy-things">Other Fancy Things</h3>
<p>There’s also a concept of “shared subscriptions” that allows multiple
subscribers to round robin messages and we need to be able to deal
with unsubscribing or timing out clients and forgetting subscriptions,
so we need to handle a couple other cases specially. These are all
doable without excessive consideration of the data structure.</p>
<h2 id="interesting-type-classes">Interesting Type Classes</h2>
<p>Setting aside the specific goals for a bit, let’s look into a few
type classes that we’d like our <code class="language-plaintext highlighter-rouge">SubTree</code> type to satisfy to make
things easier to think about.</p>
<h3 id="semigroup">Semigroup</h3>
<p>A <a href="https://typeclasses.com/semigroup">Semigroup</a> at a high level just means you have an
associative binary operation that is used to “combine” two things.
This is a bit hand wavy, but if you think about it as set union or
list concatenation, you’re on the right track for most of the purposes
here.</p>
<p>I wanted my <code class="language-plaintext highlighter-rouge">SubTree</code> to be a <code class="language-plaintext highlighter-rouge">semigroup</code> such that <code class="language-plaintext highlighter-rouge">a <> b</code> does
whatever you might think of as a natural combination of two
<code class="language-plaintext highlighter-rouge">SubTree</code>s. <code class="language-plaintext highlighter-rouge">SubTree</code> itself is <code class="language-plaintext highlighter-rouge">* -> *</code>, so it’s parameterized on a
type. A <code class="language-plaintext highlighter-rouge">Semigroup</code> of <code class="language-plaintext highlighter-rouge">SubTree a</code> is only meaningful when <code class="language-plaintext highlighter-rouge">a</code>
itself is a <code class="language-plaintext highlighter-rouge">Semigroup</code>. i.e., if you have a <code class="language-plaintext highlighter-rouge">SubTree [Int]</code> then
combining two <code class="language-plaintext highlighter-rouge">SubTree</code> values would produce a new <code class="language-plaintext highlighter-rouge">SubTree</code> with all
of the same subscribers for all of the topics from both values.</p>
<h3 id="monoid">Monoid</h3>
<p>A <a href="https://typeclasses.com/monoid">Monoid</a> is basically just a <code class="language-plaintext highlighter-rouge">Semigroup</code> with an “identity”
value. The “identity” value can be combined on either end via the
semigroup <code class="language-plaintext highlighter-rouge"><></code> operator and you’ll end up with the same value. e.g.,
<code class="language-plaintext highlighter-rouge">"a" <> "" == "a"</code> and <code class="language-plaintext highlighter-rouge">"" <> "a" == "a"</code>.</p>
<p>Similarly to <code class="language-plaintext highlighter-rouge">Semigroup</code>, when I have a <code class="language-plaintext highlighter-rouge">SubTree a</code> and <code class="language-plaintext highlighter-rouge">a</code> is a
<code class="language-plaintext highlighter-rouge">Monoid</code>, I wanted <code class="language-plaintext highlighter-rouge">SubTree a</code> to also be a <code class="language-plaintext highlighter-rouge">Monoid</code>, making <code class="language-plaintext highlighter-rouge">mempty</code>
do the right thing for constructing an empty <code class="language-plaintext highlighter-rouge">SubTree</code>.</p>
<p>This doesn’t look like much so far, but it gets more powerful as we
go.</p>
<h3 id="functor">Functor</h3>
<p>Since my <code class="language-plaintext highlighter-rouge">SubTree</code> is parameterized and conceptually a container, it
makes a lot of sense to have it be a <a href="https://typeclasses.com/functortown">Functor</a>.</p>
<p>A <code class="language-plaintext highlighter-rouge">Functor</code> gives you a way to take a <code class="language-plaintext highlighter-rouge">SubTree a</code> and convert it to a
<code class="language-plaintext highlighter-rouge">SubTree b</code> if you have a function <code class="language-plaintext highlighter-rouge">(a -> b)</code>. One way to think of
this is to imagine your function <code class="language-plaintext highlighter-rouge">f :: a -> b</code> being placed in front
of every <code class="language-plaintext highlighter-rouge">a</code> in the <code class="language-plaintext highlighter-rouge">SubTree a</code> which naturally makes every <code class="language-plaintext highlighter-rouge">a</code> into a
<code class="language-plaintext highlighter-rouge">b</code>.</p>
<p>e.g., if I have a <code class="language-plaintext highlighter-rouge">SubTree [String]</code> where I map subscription filters
to a list of strings and I want to have a <code class="language-plaintext highlighter-rouge">SubTree Int</code> where I map
filters to the number of subscribers, then that’s just <code class="language-plaintext highlighter-rouge">fmap length</code>
and we’re done.</p>
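<p>A miniature version of that example, with a plain <code class="language-plaintext highlighter-rouge">Map</code> of subscriber lists standing in for a <code class="language-plaintext highlighter-rouge">SubTree [String]</code>:</p>

```haskell
import qualified Data.Map.Strict as Map

-- A Map of subscriber lists, standing in for a SubTree [String].
subscribers :: Map.Map String [String]
subscribers = Map.fromList [("a/b/c", ["alice", "bob"]), ("d/+", ["carol"])]

-- Counting subscribers per filter is a shape-preserving fmap:
-- same keys, transformed values.
counts :: Map.Map String Int
counts = fmap length subscribers
```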
<p>The laws require that the “shape” of the structure not change while
performing such a transformation. Mapping a function over a list
gives you a new list with the same number of elements. Similarly,
mapping a function over a <code class="language-plaintext highlighter-rouge">SubTree</code> gives you the same subscription
structure, changing only the values.</p>
<h3 id="foldable">Foldable</h3>
<p><a href="https://wiki.haskell.org/Foldable_and_Traversable">Foldable</a> is an abstraction for doing stuff to all of the
elements of a container. It gives you such great bits as <code class="language-plaintext highlighter-rouge">foldr</code> and
<code class="language-plaintext highlighter-rouge">fold</code> and <code class="language-plaintext highlighter-rouge">toList</code>. Where <code class="language-plaintext highlighter-rouge">Functor</code> gives you a shape-preserving
mechanism to operate across a value, <code class="language-plaintext highlighter-rouge">Foldable</code> provides catamorphisms
allowing you to reduce a value to a structure of a different shape.</p>
<p>For example, if you want to know how many subscribers are found within
a <code class="language-plaintext highlighter-rouge">SubTree [String]</code> named <code class="language-plaintext highlighter-rouge">t</code> you can write something like <code class="language-plaintext highlighter-rouge">foldMap (Sum
. length) t</code> and your <code class="language-plaintext highlighter-rouge">Foldable</code> implementation and the <code class="language-plaintext highlighter-rouge">Sum</code> <code class="language-plaintext highlighter-rouge">Monoid</code>
do the rest of the work for you.</p>
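<p>Using the same stand-in <code class="language-plaintext highlighter-rouge">Map</code> as before, that whole count is one <code class="language-plaintext highlighter-rouge">foldMap</code> away:</p>

```haskell
import Data.Monoid (Sum(..))
import qualified Data.Map.Strict as Map

subscribers :: Map.Map String [String]
subscribers = Map.fromList [("a/b/c", ["alice", "bob"]), ("d/+", ["carol"])]

-- foldMap maps every element into a Monoid and combines the results;
-- Sum combines Ints by addition.
totalSubs :: Int
totalSubs = getSum (foldMap (Sum . length) subscribers)
```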
<h3 id="traversable">Traversable</h3>
<p><a href="https://wiki.haskell.org/Foldable_and_Traversable">Traversable</a> is a bit of a fancier <code class="language-plaintext highlighter-rouge">Functor</code> that allows
for effects and a few other things that aren’t necessarily interesting
for this discussion, but when you’re building a container, it’s good
to have around.</p>
<div>
<img src="/images/subtree-ex.png" alt="mqtt" title="MQTT" class="floatright" width="150" height="258" />
</div>
<h2 id="finally-the-subtree-type">Finally, the SubTree Type</h2>
<p>So now that we’ve covered all of the desirable properties, the
long-dreaded <code class="language-plaintext highlighter-rouge">SubTree</code> ended up being mostly just this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">SubTree</span> <span class="p">{</span>
<span class="n">subs</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="n">a</span>
<span class="p">,</span> <span class="n">children</span> <span class="o">::</span> <span class="kt">Map</span> <span class="kt">Filter</span> <span class="p">(</span><span class="kt">SubTree</span> <span class="n">a</span><span class="p">)</span>
<span class="p">}</span> <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Eq</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">,</span> <span class="kt">Foldable</span><span class="p">,</span> <span class="kt">Traversable</span><span class="p">)</span>
</code></pre></div></div>
<p>It’s a tree that may have subscribers at any given level, and it may
have children below it. I think I hand-wrote <code class="language-plaintext highlighter-rouge">Functor</code> et al. before
just trusting the compiler to do the right thing (there should be only
one valid implementation, anyway). So at this point, we’re almost
done with most normal storage bits.</p>
<p><code class="language-plaintext highlighter-rouge">Semigroup</code> and <code class="language-plaintext highlighter-rouge">Monoid</code> require a bit more more work, so let’s
implement those really quickly before we move on:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Semigroup</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Semigroup</span> <span class="p">(</span><span class="kt">SubTree</span> <span class="n">a</span><span class="p">)</span> <span class="kr">where</span>
<span class="n">a</span> <span class="o"><></span> <span class="n">b</span> <span class="o">=</span> <span class="kt">SubTree</span> <span class="p">(</span><span class="n">subs</span> <span class="n">a</span> <span class="o"><></span> <span class="n">subs</span> <span class="n">b</span><span class="p">)</span> <span class="p">(</span><span class="kt">Map</span><span class="o">.</span><span class="n">unionWith</span> <span class="p">(</span><span class="o"><></span><span class="p">)</span> <span class="p">(</span><span class="n">children</span> <span class="n">a</span><span class="p">)</span> <span class="p">(</span><span class="n">children</span> <span class="n">b</span><span class="p">))</span>
<span class="n">empty</span> <span class="o">::</span> <span class="kt">SubTree</span> <span class="n">a</span>
<span class="n">empty</span> <span class="o">=</span> <span class="kt">SubTree</span> <span class="kt">Nothing</span> <span class="n">mempty</span>
<span class="kr">instance</span> <span class="kt">Monoid</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Monoid</span> <span class="p">(</span><span class="kt">SubTree</span> <span class="n">a</span><span class="p">)</span> <span class="kr">where</span>
<span class="n">mempty</span> <span class="o">=</span> <span class="n">empty</span>
</code></pre></div></div>
<p>(note that <code class="language-plaintext highlighter-rouge">empty</code> exists so we can have non-monoidal empty
<code class="language-plaintext highlighter-rouge">SubTree</code>s)</p>
<h3 id="modification">Modification</h3>
<p>In order to add, change, or remove subscriptions in a <code class="language-plaintext highlighter-rouge">SubTree</code>, we
introduce the <code class="language-plaintext highlighter-rouge">modify</code> function. It’s the most general mechanism for
performing any modifications, so it gets a pretty generic name. It
looks like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modify</span> <span class="o">::</span> <span class="kt">Filter</span> <span class="o">-></span> <span class="p">(</span><span class="kt">Maybe</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Maybe</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span>
</code></pre></div></div>
<p>i.e., for a given <code class="language-plaintext highlighter-rouge">Filter</code> and a function that takes a maybe-existing
value and returns a new maybe-existing value, we can do our thing.</p>
<p>The actual implementation leverages <code class="language-plaintext highlighter-rouge">Data.Map</code>’s <code class="language-plaintext highlighter-rouge">alter</code> function
which does most of the work here, but the actual implementation is
just a couple of lines:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modify</span> <span class="o">::</span> <span class="kt">Filter</span> <span class="o">-></span> <span class="p">(</span><span class="kt">Maybe</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Maybe</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span>
<span class="n">modify</span> <span class="n">top</span> <span class="n">f</span> <span class="o">=</span> <span class="n">go</span> <span class="p">(</span><span class="n">splitOn</span> <span class="s">"/"</span> <span class="n">top</span><span class="p">)</span>
<span class="kr">where</span>
<span class="n">go</span> <span class="kt">[]</span> <span class="n">n</span><span class="o">@</span><span class="kt">SubTree</span><span class="p">{</span><span class="o">..</span><span class="p">}</span> <span class="o">=</span> <span class="n">n</span><span class="p">{</span><span class="n">subs</span><span class="o">=</span><span class="n">f</span> <span class="n">subs</span><span class="p">}</span>
<span class="n">go</span> <span class="p">(</span><span class="n">x</span><span class="o">:</span><span class="n">xs</span><span class="p">)</span> <span class="n">n</span><span class="o">@</span><span class="kt">SubTree</span><span class="p">{</span><span class="o">..</span><span class="p">}</span> <span class="o">=</span> <span class="n">n</span><span class="p">{</span><span class="n">children</span><span class="o">=</span><span class="kt">Map</span><span class="o">.</span><span class="n">alter</span> <span class="p">(</span><span class="n">fmap</span> <span class="p">(</span><span class="n">go</span> <span class="n">xs</span><span class="p">)</span> <span class="o">.</span> <span class="n">maybe</span> <span class="p">(</span><span class="kt">Just</span> <span class="n">empty</span><span class="p">)</span> <span class="kt">Just</span><span class="p">)</span> <span class="n">x</span> <span class="n">children</span><span class="p">}</span>
</code></pre></div></div>
<p>We start by splitting the filter topic on <code class="language-plaintext highlighter-rouge">/</code> so we have the segments
and then we walk the tree. If the remaining topic is <code class="language-plaintext highlighter-rouge">[]</code> then we’ve
arrived at the topic we’re looking for and we just run the
transformation function and we’re done. Otherwise, we walk the tree
using <code class="language-plaintext highlighter-rouge">alter</code> which will create any necessary subtrees as we go.</p>
<p>Note that this would be <em>slightly</em> simpler if we required <code class="language-plaintext highlighter-rouge">a</code> to be
monoidal, but keeping the constraints minimal makes it more general,
so we did the broadest thing here.</p>
<p>It’s a little awkward to use, though, so we also have <code class="language-plaintext highlighter-rouge">addWith</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">addWith</span> <span class="o">::</span> <span class="kt">Monoid</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Filter</span> <span class="o">-></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span>
<span class="n">addWith</span> <span class="n">top</span> <span class="n">f</span> <span class="n">i</span> <span class="o">=</span> <span class="n">modify</span> <span class="n">top</span> <span class="p">(</span><span class="n">fmap</span> <span class="p">(</span><span class="n">f</span> <span class="n">i</span><span class="p">)</span> <span class="o">.</span> <span class="n">maybe</span> <span class="p">(</span><span class="kt">Just</span> <span class="n">mempty</span><span class="p">)</span> <span class="kt">Just</span><span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">addWith</code> assumes <code class="language-plaintext highlighter-rouge">a</code> is monoidal and gives us a far simpler
transformation by just allowing us to add a specific value with a
collision function to deal with existing cases.</p>
<p>e.g., the simplest case, <code class="language-plaintext highlighter-rouge">add</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">add</span> <span class="o">::</span> <span class="kt">Monoid</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Filter</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span>
<span class="n">add</span> <span class="n">top</span> <span class="o">=</span> <span class="n">addWith</span> <span class="n">top</span> <span class="p">(</span><span class="o"><></span><span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">add</code> does the thing you’d expect when adding a new value. e.g., if
you have a <code class="language-plaintext highlighter-rouge">SubTree [Int]</code> that has subscribers at <code class="language-plaintext highlighter-rouge">a/b/c</code> of <code class="language-plaintext highlighter-rouge">[1,2]</code>
and you add <code class="language-plaintext highlighter-rouge">[3]</code> at that path, you’ll have <code class="language-plaintext highlighter-rouge">[1,2,3]</code>.</p>
<p>This is also how we build <code class="language-plaintext highlighter-rouge">fromList</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fromList</span> <span class="o">::</span> <span class="kt">Monoid</span> <span class="n">a</span> <span class="o">=></span> <span class="p">[(</span><span class="kt">Filter</span><span class="p">,</span> <span class="n">a</span><span class="p">)]</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span>
<span class="n">fromList</span> <span class="o">=</span> <span class="n">foldr</span> <span class="p">(</span><span class="n">uncurry</span> <span class="n">add</span><span class="p">)</span> <span class="n">mempty</span>
</code></pre></div></div>
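<p>Putting the pieces so far together into one runnable sketch (assumptions: <code class="language-plaintext highlighter-rouge">Filter</code> is plain <code class="language-plaintext highlighter-rouge">String</code>, and <code class="language-plaintext highlighter-rouge">segments</code> is a hand-rolled stand-in for <code class="language-plaintext highlighter-rouge">splitOn "/"</code>, which in the post comes from the <code class="language-plaintext highlighter-rouge">split</code> package):</p>

```haskell
{-# LANGUAGE RecordWildCards #-}
import qualified Data.Map.Strict as Map

type Filter = String  -- simplified stand-in for the real Filter type

data SubTree a = SubTree
  { subs     :: Maybe a
  , children :: Map.Map Filter (SubTree a)
  } deriving (Show, Eq)

instance Semigroup a => Semigroup (SubTree a) where
  a <> b = SubTree (subs a <> subs b) (Map.unionWith (<>) (children a) (children b))

instance Monoid a => Monoid (SubTree a) where
  mempty = SubTree Nothing mempty

empty :: SubTree a
empty = SubTree Nothing Map.empty

-- Hand-rolled stand-in for splitOn "/" from the split package.
segments :: Filter -> [Filter]
segments = foldr f [[]]
  where f '/' acc    = [] : acc
        f c (s:rest) = (c : s) : rest
        f _ []       = [[]]

modify :: Filter -> (Maybe a -> Maybe a) -> SubTree a -> SubTree a
modify top f = go (segments top)
  where
    go [] n@SubTree{..}     = n{subs = f subs}
    go (x:xs) n@SubTree{..} =
      n{children = Map.alter (fmap (go xs) . maybe (Just empty) Just) x children}

addWith :: Monoid a => Filter -> (a -> a -> a) -> a -> SubTree a -> SubTree a
addWith top f i = modify top (fmap (f i) . maybe (Just mempty) Just)

add :: Monoid a => Filter -> a -> SubTree a -> SubTree a
add top = addWith top (<>)

fromList :: Monoid a => [(Filter, a)] -> SubTree a
fromList = foldr (uncurry add) mempty
```

<p>With this, <code class="language-plaintext highlighter-rouge">add "a/b/c" [3] (fromList [("a/b/c", [1,2])])</code> ends up with all three subscribers under <code class="language-plaintext highlighter-rouge">a/b/c</code>, as described above.</p>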
<h3 id="searching">Searching</h3>
<p>And now, the entire reason this thing exists: finding subscribers.</p>
<p>The most general function we have for this is <code class="language-plaintext highlighter-rouge">findMap</code> which has a
fairly simple signature:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">findMap</span> <span class="o">::</span> <span class="kt">Monoid</span> <span class="n">m</span> <span class="o">=></span> <span class="kt">Topic</span> <span class="o">-></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">m</span><span class="p">)</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">-></span> <span class="n">m</span>
</code></pre></div></div>
<p>I’m going to omit the code here since it’s about 8 lines long
because of all the weird expansion rules, but the signature really
tells us everything we need to know. It looks a lot like <code class="language-plaintext highlighter-rouge">foldMap</code>.
Given a topic and a function that converts whatever <code class="language-plaintext highlighter-rouge">a</code> is found for
that topic to a monoidal value, you get a monoidal value.</p>
<p>e.g., if the <code class="language-plaintext highlighter-rouge">a</code> is <em>already</em> a monoid, you get this function:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">find</span> <span class="o">::</span> <span class="kt">Monoid</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Topic</span> <span class="o">-></span> <span class="kt">SubTree</span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">find</span> <span class="n">top</span> <span class="o">=</span> <span class="n">findMap</span> <span class="n">top</span> <span class="n">id</span>
</code></pre></div></div>
<p>So when I store my subscriptions as a list, <code class="language-plaintext highlighter-rouge">find "some/topic" st</code>
gives me a list of all the things subscribed as <code class="language-plaintext highlighter-rouge">some/topic</code> or
<code class="language-plaintext highlighter-rouge">+/topic</code> or <code class="language-plaintext highlighter-rouge">topic/+</code> or <code class="language-plaintext highlighter-rouge">topic/#</code> or <code class="language-plaintext highlighter-rouge">#</code>.</p>
<h2 id="in-practice">In Practice</h2>
<p>In my original implementation of my mqtt broker, I implemented
subscriptions as a dumb list. It seemed like it was going to be a
hard problem, so I punted until I could do something better. Every
message that was published had to look at every subscription for every
client and see which ones matched before redistributing stuff. For my
little server at home, I have around 250 subscriptions at any point in
time and get about one message per second on average. That’s on the
verge of gross.</p>
<p>But it turned out to be very easy to implement something efficient
that worked quite well. I just have a <a href="https://hackage.haskell.org/package/stm-2.5.0.0/docs/Control-Concurrent-STM-TVar.html">TVar</a> that holds a
<code class="language-plaintext highlighter-rouge">SubTree (Map SessionID SubOptions)</code> and just use <code class="language-plaintext highlighter-rouge">stm</code> to do the
reads and writes. The <em>semantics</em> are still quite complicated:
there are private and shared subscriptions, there’s the session
vs. client separation, and messages published by one client must be
delivered concurrently to any relevant subscribers while new clients
are concurrently modifying the subscription <code class="language-plaintext highlighter-rouge">SubTree</code>.</p>
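<p>The shape of that STM code, with a flat <code class="language-plaintext highlighter-rouge">Map</code> standing in for the <code class="language-plaintext highlighter-rouge">SubTree</code> and made-up <code class="language-plaintext highlighter-rouge">SessionID</code>/<code class="language-plaintext highlighter-rouge">SubOptions</code> stand-in types, is roughly:</p>

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

-- Simplified stand-ins for the broker's types.
type SessionID  = String
type SubOptions = Int   -- the real SubOptions is a record of MQTT flags
type Filter     = String

-- A flat Map stands in for the SubTree here; the shape of the
-- STM code is the same either way.
type Subs = Map.Map Filter (Map.Map SessionID SubOptions)

subscribe :: TVar Subs -> Filter -> SessionID -> SubOptions -> STM ()
subscribe tv f sid opts =
  modifyTVar' tv (Map.insertWith Map.union f (Map.singleton sid opts))

unsubscribe :: TVar Subs -> Filter -> SessionID -> STM ()
unsubscribe tv f sid = modifyTVar' tv (Map.adjust (Map.delete sid) f)
```

<p>Each <code class="language-plaintext highlighter-rouge">atomically</code> block composes these into a transaction, so concurrent subscribes, unsubscribes, and reads never see a half-modified structure.</p>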
<p>Being able to express data structures like this and test them
thoroughly against well-established laws spares me from whole classes
of bugs I’d write in most other languages.</p>
<p>I replaced mosquitto with my own <a href="https://github.com/dustin/mqttd">mqttd</a> project a year or so
ago due to a couple strange bugs I’d encounter occasionally and some
missing features of MQTT v5 I wanted to use. I’m around 1,400 lines
in with a solid broker I’m relying on.</p>
<p>But <a href="https://github.com/dustin/mqttd/blob/master/src/MQTTD/SubTree.hs">SubTree.hs</a> is one of my favorite pieces of code.</p>
<p>This is the story of a project I started (almost exactly) two months
ago. Not much of it is about the code itself (links at the bottom),
but you can read what I did and why and stuff. I got to write a bunch
of stuff in my favorite language and learn a new language along the
way. I also got a little bit of control over my media.</p>
<h1 id="gopro-plus">GoPro Plus</h1>
<div>
<img src="/images/beach-small.jpg" alt="beach" title="sunset at the beach" class="floatleft" width="400" height="300" />
</div>
<p>I got a new GoPro Hero 8 for a trip I took recently. There were lots
of exciting things to look at. The picture to the left was taken on
January 20 at a resort in Siquijor.</p>
<p>The resort had quite good connectivity, and since I bought a new
GoPro, I figured I’d try out the GoPro Plus 30 day trial thing they
were offering. I could plug in my camera and my video and pictures
and stuff would magically be stored safely in The Cloud™.</p>
<h2 id="gopros">GoPros</h2>
<p>GoPro Plus ends up being something like $5/mo (or I think $50/year if
you do it annually). For this, you get unlimited storage and up to
two camera replacements per year. It’s a pretty good deal to not have
to worry about asset tracking, etc… And the camera replacement is a
good incentive to actually try to use the thing.</p>
<p>The mobile app lets you browse around in the cloud-stored data and
find things of interest to assemble into videos locally. It does a
pretty good job of thinking up the edits and stuff for you.</p>
<p>Also, this isn’t my first GoPro, but you can upload video through
their web UI. I uploaded a bunch of my old clips so I could store all
my footage in one cloudy place.</p>
<p>Except, there were a few things that annoyed me about this.</p>
<h2 id="gocons">GoCons</h2>
<p>There were a few things that annoyed me about GoPro Plus early on (in
no particular order):</p>
<ol>
<li>The web site made it quite difficult to navigate anything but your
most recent media.</li>
<li>While you could upload media and the camera could upload time lapse
photos as a single medium, the web UI didn’t let you do this.</li>
<li>Sometimes, things would just not upload and I couldn’t figure out
why.</li>
<li>Downloading in bulk (e.g., a day’s worth of stuff) is nearly
impossible.</li>
<li>Sharing is painful and broken.</li>
</ol>
<p>As I became more of a power user, I found weirder, more obscure bugs
(e.g., items stuck on their end in a particular state, or incorrectly
recognized media).</p>
<p>But perhaps the biggest problem of all: I wanted to be able to make
sure I could tell what media I’d uploaded and retrieve it all at
will. I don’t know if this service will last forever, but I do know I
would like my media to, so if they decide to shut down or something, I
want to know I can get all my media out quickly and easily.</p>
<p>This is a fairly big flaw, as GoPro Plus doesn’t have an API. Well,
officially…</p>
<h2 id="getting-my-data">Getting My Data</h2>
<div>
<img src="/images/dive-small.jpg" alt="rope swing flop" title="flawless half backflip from a rope swing" class="floatright" width="250" height="384" />
</div>
<p>I finally decided to dive in on February 26. I found a little bit of
stuff online where people had managed to get a listing from the media
service, but these were all quite incomplete and didn’t meet any of my
goals.</p>
<p>My goto language these days is <a href="https://www.haskell.org/">Haskell</a> as
it’s consistently the easiest language I work in so I started
exploring the API endpoint in ghci with
<a href="https://hackage.haskell.org/package/wreq">wreq</a> and
<a href="http://hackage.haskell.org/package/aeson">aeson</a>. As it turns out,
their web app is just a javascript interface to the same API the
mobile devices use, so it was relatively easy to just watch what it
does and do the same thing.</p>
<p>I spent a bit of time over the weekend just making essential bits.
The first thing I figured I should do was capture all metadata from
the search results into a local sqlite database, along with the
thumbnails I could get from their image servers. I built my <code class="language-plaintext highlighter-rouge">gopro</code>
commandline tool to manage authentication tokens (and refreshes) and
then built a little web service that could give me my data back.</p>
<p>The GoPro Plus media-browser web site has endless scrolling of a few
tens of items at a time. If I want to find the oldest of the 2,229
items I currently have stored, it might take me ten minutes just to
scroll to the bottom (go to bottom of page, wait for load, repeat).
Just getting to footage from the above trip is an excursion in
itself.</p>
<p>Initially, I built an interface using
<a href="http://crossfilter.github.io/crossfilter/">crossfilter</a> to let me do
multi-dimensional filtering of my data quickly and easily, but I don’t
much like working in JavaScript, so I decided to learn
<a href="https://elm-lang.org/">elm</a> as part of this project. My new UI in elm
was much more pleasant to work with, but there’s no crossfilter and
javascript FFI is awkward even when you’re not trying to do something
that fancy. But all I really needed was a collection of filter
functions to apply in succession, so writing my own functional
equivalent of crossfilter in elm ended up being basically something
like this:</p>
<div class="language-elm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">List</span><span class="o">.</span><span class="n">filter</span> <span class="p">(</span><span class="o">\</span><span class="n">m</span> <span class="o">-></span> <span class="kt">List</span><span class="o">.</span><span class="n">all</span> <span class="p">(</span><span class="o">\</span><span class="n">f</span> <span class="o">-></span> <span class="n">f</span> <span class="n">model</span> <span class="n">m</span><span class="p">)</span> <span class="n">allFilters</span><span class="p">)</span> <span class="n">media</span>
</code></pre></div></div>
<p>Given that I know when all of the media was captured, from what
device, and what type of media it was, I could easily build myself a
date-based filter, a camera filter, and a type filter, and suddenly
could find anything ~instantly.</p>
<h2 id="need-more-data">Need More Data</h2>
<p>However, the GoPro Hero 8 captures some really rich telemetry. They
open-sourced a <a href="https://github.com/gopro/gpmf-parser">parser</a> for this
which is super cool, but I ran into two problems with it:</p>
<ol>
<li>Their code seems very difficult to use (see the hundreds of lines
of their basic
<a href="https://github.com/gopro/gpmf-parser/tree/master/demo">demo</a>).</li>
<li>The demo didn’t even display very relevant data for most of my
media.</li>
</ol>
<p>That was a little unfortunate. I want access to all of the metadata,
but after playing with their C code for a bit, I finally realized
that’s quite an uphill battle.</p>
<p>So I finally just read their spec and wrote a <a href="https://github.com/dustin/gpmf">gpmf
parser</a> from scratch. My core GPMF
parser is under 100 lines of code and gave me complete data from every
sample I ran into. The low-level data is a bit too low-level, so I
added another ~176 SLOC to cover higher level concepts like GPS
location and speed, face detection, etc…</p>
<p>Now I just needed to extract all the metadata from my ~700GB library.</p>
<p>However, you don’t need the entire content for this. It turns out
that as part of processing your video uploads, GoPro Plus produces a
few different variants of your content. This typically includes a
remarkably small one called <code class="language-plaintext highlighter-rouge">mp4_low</code> which usually carries the full
metadata. There are lots of qualifiers in there because things aren’t
<em>entirely</em> consistent, but it’s good enough:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ms</span> <span class="o"><-</span> <span class="n">asum</span> <span class="p">[</span>
<span class="n">fv</span> <span class="s">"mp4_low"</span> <span class="p">(</span><span class="n">fn</span> <span class="s">"low"</span><span class="p">),</span>
<span class="n">fv</span> <span class="s">"high_res_proxy_mp4"</span> <span class="p">(</span><span class="n">fn</span> <span class="s">"high"</span><span class="p">),</span>
<span class="n">fv</span> <span class="s">"source"</span> <span class="p">(</span><span class="n">fn</span> <span class="s">"src"</span><span class="p">),</span>
<span class="n">pure</span> <span class="kt">Nothing</span><span class="p">]</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">fv</code> here is a function that locally caches a stream from GoPro by
name and then attempts to extract a GPMF stream from it. Often,
<code class="language-plaintext highlighter-rouge">mp4_low</code> has it and we’re done quickly and easily. If not, try the
higher res variant used for streaming in the browser, then the source,
then give up and declare there’s no GPMF (e.g., my old Hero 4 footage
doesn’t have any).</p>
<p>I added geographical areas to my database, a web handler to retrieve
them, summarized location in my metadata and added a bit of
point-in-box Elm code. At this point, “Show me all of my media taken
in the Philippines in the year 2020” is two clicks, roughly instant,
and I see <code class="language-plaintext highlighter-rouge">Showing 426 (49.88 GB) out of 2,229 items (946.81 GB)</code>
along with all the nicely grouped thumbnails in my web UI.</p>
<h2 id="syncing">Syncing</h2>
<div>
<img src="/images/cutting.jpg" alt="syncing" title="cutting wood with amos" class="floatright" width="350" height="263" />
</div>
<p>Syncing is relatively straightforward. I run my commandline tool
periodically. It asks for paginated media ordered by upload date
(descending) and keeps asking for pages until it sees a media ID it
already has. Subtract the existing data and I can pull down
whatever’s leftover.</p>
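<p>That loop is roughly this shape (<code class="language-plaintext highlighter-rouge">fetchPage</code> is a placeholder for the actual paginated API call; items arrive newest-first):</p>

```haskell
import qualified Data.Set as Set

-- Sketch of the pagination loop: fetch pages (newest first) until we
-- encounter an ID we already know, then return only the new items.
syncNew :: Monad m => (Int -> m [String]) -> Set.Set String -> m [String]
syncNew fetchPage known = go 1
  where
    go page = do
      items <- fetchPage page
      case items of
        [] -> pure []                      -- ran out of pages
        _ | any (`Set.member` known) items ->
              -- Descending order means everything after the first
              -- known ID is known too.
              pure (takeWhile (`Set.notMember` known) items)
          | otherwise -> (items <>) <$> go (page + 1)
```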
<p>I keep metadata, thumbnails, and downloadable variant metadata (minus
URLs as they expire) in my media table. So grab these things for each
missing record in concurrent batches (so as not to do everything at once)
and commit them to my local DB.</p>
<p>I have this <code class="language-plaintext highlighter-rouge">retrieve</code> function that, given a medium ID, returns the
list of downloadable artifacts. The URLs themselves are signed s3
URLs that expire after a bit, so they’re only really interesting when
you’re actually downloading an artifact. There are a few different
types of artifacts, some of which have one URL and some have
multiple. The resulting JSON includes <code class="language-plaintext highlighter-rouge">url</code> as a key for when there’s
one URL, and <code class="language-plaintext highlighter-rouge">urls</code> when there’s more than one. This sort of makes
sense, but it causes a lack of uniformity. Each one <em>also</em> has a
<code class="language-plaintext highlighter-rouge">head</code>/<code class="language-plaintext highlighter-rouge">heads</code> variation of the URL for issuing a <code class="language-plaintext highlighter-rouge">HEAD</code> request
instead of a <code class="language-plaintext highlighter-rouge">GET</code> request.</p>
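<p>One way to paper over the <code class="language-plaintext highlighter-rouge">url</code>/<code class="language-plaintext highlighter-rouge">urls</code> split downstream is a
helper that always yields a list. A toy sketch (the real code works on the
parsed aeson JSON; this type is just a stand-in):</p>

```haskell
-- The API's one-vs-many shape for artifact URLs, as a toy type
-- ("url" for one, "urls" for several in the real JSON).
data URLField = One String | Many [String]

-- Normalize to a uniform list so downstream code has one case.
allURLs :: URLField -> [String]
allURLs (One u)   = [u]
allURLs (Many us) = us

main :: IO ()
main = do
  print (allURLs (One "https://example/a"))
  print (allURLs (Many ["https://example/b", "https://example/c"]))
```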
<p>This detail sounds boring, but there’s a fun
aeson-lens tip hiding in it. If you have an arbitrarily complex <a href="https://gist.github.com/dustin/081e6fae77a540aed5279f8962cadf55">chunk of
JSON</a>
without much uniformity, and want to remove a couple of keys from
every object anywhere within the structure, you could write
something like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fetchSansURLs</span> <span class="o">::</span> <span class="p">(</span><span class="kt">HasGoProAuth</span> <span class="n">m</span><span class="p">,</span> <span class="kt">MonadIO</span> <span class="n">m</span><span class="p">)</span> <span class="o">=></span> <span class="kt">MediumID</span> <span class="o">-></span> <span class="n">m</span> <span class="p">(</span><span class="kt">Maybe</span> <span class="kt">Value</span><span class="p">)</span>
<span class="n">fetchSansURLs</span> <span class="o">=</span> <span class="n">fmap</span> <span class="p">(</span><span class="n">_Just</span> <span class="o">.</span> <span class="n">deep</span> <span class="n">values</span> <span class="o">.</span> <span class="n">_Object</span> <span class="o">%~</span> <span class="n">sans</span> <span class="s">"url"</span> <span class="o">.</span> <span class="n">sans</span> <span class="s">"head"</span><span class="p">)</span> <span class="o">.</span> <span class="n">retrieve</span>
</code></pre></div></div>
<p>(the actual version excludes <code class="language-plaintext highlighter-rouge">urls</code> and <code class="language-plaintext highlighter-rouge">heads</code>, but that’s not an
important detail and this line’s long enough).</p>
<h3 id="web-syncing">Web Syncing</h3>
<p>One seemingly unnecessary annoyance is that I had to reload my web
page every time I did a commandline sync. I made this slightly better
by just making a data sync (soft reload) button, but I thought it
might be better to sync from the web UI itself.</p>
<p>The biggest problem with this was knowing what’s going on while a sync
is happening. I’m using an Elm library called
<a href="https://package.elm-lang.org/packages/pablen/toasty">toasty</a> to put
little ephemeral popups on the screen. I’m also using <code class="language-plaintext highlighter-rouge">MonadLogger</code>
in the commandline tool. Combining these seemed like a good idea.</p>
<p>Initially, I was asking the UI to poll the server to look for items in
the database that should be displayed on the UI. This was gross for
lots of reasons. I decided this might be a good time to learn a bit
about websockets.</p>
<p>I’d “used” websockets once in the past when I added websocket support
for my <a href="http://hackage.haskell.org/package/net-mqtt">mqtt library</a>,
but this was mostly theoretical and all client-side. I figured
server-side websockets in Haskell would be a bit of a challenge and
client-side in Elm would be easy. My expectations were totally
backwards.</p>
<p>I wrote a custom stderr logging function and a logging function that
writes to an <a href="https://hackage.haskell.org/package/stm/docs/Control-Concurrent-STM-TChan.html">STM broadcast
TChan</a>.
The latter is what I ended up using for websockets. I initially would
do something fancy during the sync to redirect the log using a local
logger mutation, but I figured I might as well just always log to the
websocket channel in case any browsers are attached. It’s slightly
less code.</p>
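<p>What makes this work is the broadcast-channel semantics: each attached
browser gets its own <code class="language-plaintext highlighter-rouge">dupTChan</code> copy and only sees messages written
after it attached. A minimal sketch of just those semantics (not the actual
server code):</p>

```haskell
import Control.Concurrent.STM

main :: IO ()
main = do
  ch <- newBroadcastTChanIO          -- write-only end held by the logger
  early <- atomically (dupTChan ch)  -- a browser that attached early
  atomically (writeTChan ch "hello")
  late <- atomically (dupTChan ch)   -- a browser that attached later
  atomically (writeTChan ch "world")
  a <- atomically (readTChan early)  -- "hello"
  b <- atomically (readTChan early)  -- "world"
  c <- atomically (readTChan late)   -- "world" only; it missed "hello"
  print (a, b, c)
```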
<p>My logger looks like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">notificationLogger</span> <span class="o">::</span> <span class="kt">TChan</span> <span class="kt">Notification</span>
<span class="o">-></span> <span class="p">(</span><span class="kt">Loc</span> <span class="o">-></span> <span class="kt">LogSource</span> <span class="o">-></span> <span class="kt">LogLevel</span> <span class="o">-></span> <span class="kt">LogStr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="nb">()</span><span class="p">)</span>
<span class="n">notificationLogger</span> <span class="n">ch</span> <span class="kr">_</span> <span class="kr">_</span> <span class="n">lvl</span> <span class="n">str</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">lvl</span> <span class="kr">of</span>
<span class="kt">LevelDebug</span> <span class="o">-></span> <span class="n">pure</span> <span class="nb">()</span>
<span class="kt">LevelInfo</span> <span class="o">-></span> <span class="n">note</span> <span class="kt">NotificationInfo</span>
<span class="kr">_</span> <span class="o">-></span> <span class="n">note</span> <span class="kt">NotificationError</span>
<span class="kr">where</span> <span class="n">note</span> <span class="n">t</span> <span class="o">=</span> <span class="n">atomically</span> <span class="o">$</span> <span class="n">writeTChan</span> <span class="n">ch</span> <span class="p">(</span><span class="kt">Notification</span> <span class="n">t</span> <span class="s">"GoPro"</span> <span class="n">lstr</span><span class="p">)</span>
<span class="n">lstr</span> <span class="o">=</span> <span class="kt">TE</span><span class="o">.</span><span class="n">decodeUtf8</span> <span class="o">$</span> <span class="n">fromLogStr</span> <span class="n">str</span>
</code></pre></div></div>
<p>…and that’s hooked up to the web like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wsapp</span> <span class="o">::</span> <span class="kt">Env</span> <span class="o">-></span> <span class="kt">WS</span><span class="o">.</span><span class="kt">ServerApp</span>
<span class="n">wsapp</span> <span class="kt">Env</span><span class="p">{</span><span class="n">noteChan</span><span class="p">}</span> <span class="n">pending</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">ch</span> <span class="o"><-</span> <span class="n">atomically</span> <span class="o">$</span> <span class="n">dupTChan</span> <span class="n">noteChan</span>
<span class="n">conn</span> <span class="o"><-</span> <span class="kt">WS</span><span class="o">.</span><span class="n">acceptRequest</span> <span class="n">pending</span>
<span class="kt">WS</span><span class="o">.</span><span class="n">withPingThread</span> <span class="n">conn</span> <span class="mi">30</span> <span class="p">(</span><span class="n">pure</span> <span class="nb">()</span><span class="p">)</span> <span class="o">$</span>
<span class="n">forever</span> <span class="p">(</span><span class="kt">WS</span><span class="o">.</span><span class="n">sendTextData</span> <span class="n">conn</span> <span class="o">.</span> <span class="kt">J</span><span class="o">.</span><span class="n">encode</span> <span class="o">=<<</span> <span class="p">(</span><span class="n">atomically</span> <span class="o">.</span> <span class="n">readTChan</span><span class="p">)</span> <span class="n">ch</span><span class="p">)</span>
</code></pre></div></div>
<p>At this point, my entire web-based sync on the server-side looks like
this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">post</span> <span class="s">"/api/sync"</span> <span class="kr">do</span>
<span class="kr">_</span> <span class="o"><-</span> <span class="n">lift</span> <span class="o">.</span> <span class="n">async</span> <span class="o">$</span> <span class="kr">do</span>
<span class="n">runFullSync</span>
<span class="n">sendNotification</span> <span class="p">(</span><span class="kt">Notification</span> <span class="kt">NotificationReload</span> <span class="s">""</span> <span class="s">""</span><span class="p">)</span>
<span class="n">status</span> <span class="n">noContent204</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">NotificationReload</code> that gets sent is a special message that
isn’t displayed by the browser, but just indicates that it should
reload the media catalog.</p>
<p>So, now that that’s done, let’s do the Elm side. I won’t dump all
that code out here (it was a lot). The short story is that Elm
doesn’t have native websockets support and wiring it up was a bit of a
pain. It’s not too bad after that, though. <code class="language-plaintext highlighter-rouge">Notification</code> messages
come in and are dispatched easily enough:</p>
<div class="language-elm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">doNotifications</span> <span class="p">:</span> <span class="kt">Model</span> <span class="o">-></span> <span class="kt">List</span> <span class="kt">Notification</span> <span class="o">-></span> <span class="p">(</span> <span class="kt">Model</span><span class="o">,</span> <span class="kt">Cmd</span> <span class="kt">Msg</span> <span class="p">)</span>
<span class="n">doNotifications</span> <span class="n">model</span> <span class="o">=</span>
<span class="k">let</span> <span class="n">reload_</span> <span class="p">(</span><span class="n">m</span><span class="o">,</span> <span class="n">a</span><span class="p">)</span> <span class="o">=</span>
<span class="p">(</span><span class="n">m</span><span class="o">,</span> <span class="kt">Cmd</span><span class="o">.</span><span class="n">batch</span> <span class="p">[</span><span class="n">a</span><span class="o">,</span> <span class="n">reload</span><span class="p">])</span>
<span class="n">doNote</span> <span class="n">n</span> <span class="n">m</span> <span class="o">=</span>
<span class="k">case</span> <span class="n">n</span><span class="o">.</span><span class="n">typ</span> <span class="k">of</span>
<span class="kt">NotificationInfo</span> <span class="o">-></span> <span class="n">toastSuccess</span> <span class="n">n</span><span class="o">.</span><span class="n">title</span> <span class="n">n</span><span class="o">.</span><span class="n">msg</span> <span class="n">m</span>
<span class="kt">NotificationError</span> <span class="o">-></span> <span class="n">toastError</span> <span class="n">n</span><span class="o">.</span><span class="n">title</span> <span class="n">n</span><span class="o">.</span><span class="n">msg</span> <span class="n">m</span>
<span class="kt">NotificationReload</span> <span class="o">-></span> <span class="n">reload_</span> <span class="n">m</span>
<span class="kt">NotificationUnknown</span> <span class="o">-></span> <span class="n">toastError</span> <span class="s">"</span><span class="s2">Unhandled Notification"</span> <span class="p">(</span><span class="n">n</span><span class="o">.</span><span class="n">msg</span><span class="p">)</span> <span class="n">m</span>
<span class="k">in</span>
<span class="kt">List</span><span class="o">.</span><span class="n">foldr</span> <span class="n">doNote</span> <span class="p">(</span><span class="n">model</span><span class="o">,</span> <span class="kt">Cmd</span><span class="o">.</span><span class="n">none</span><span class="p">)</span>
</code></pre></div></div>
<p><br /></p>
<h2 id="uploading">Uploading</h2>
<div>
<img src="https://raw.githubusercontent.com/dustin/gopro-plus/master/upload.png" alt="uploading" title="upload process" class="floatleft" width="185" height="210" />
</div>
<p>The upload process is
<a href="https://github.com/dustin/gopro-plus/wiki/Upload">complicated</a>. I’ve
not documented the whole thing in English, but my
<a href="http://hackage.haskell.org/package/gopro-plus/docs/GoPro-Plus-Upload.html">GoPro.Plus.Upload</a>
module does all of the things I’ve figured out how to do.</p>
<p>It’s quite fascinating. A single medium upload can consist of multiple
files (e.g., a timelapse photo series, or a video that was so large it
broke into multiple files on its own). Additionally, they’re
leveraging S3 chunked uploads such that when you define an upload and
specify your chunk size (I’m using 6MB which is what they use in the
web app), they give you a collection of pre-signed URLs to do the
upload. The order in which you process these doesn’t matter.</p>
<p>In the end, this means you can get massive parallelism for most
uploads. Some of my uploads have consisted of hundreds of parts. I
don’t have great connectivity where I am currently, but I can rsync my
media to a location with good connectivity without any care for how
long it takes. Once it arrives, I can use my commandline tool to
upload with all the cores and network I can throw at the problem and
then tie it up in the end.</p>
<p>I end up doing a lot of fun parallelism in this project, and it often
wants reining in. I get a lot of use out of this function:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mapConcurrentlyLimited</span> <span class="o">::</span> <span class="p">(</span><span class="kt">MonadMask</span> <span class="n">m</span><span class="p">,</span> <span class="kt">MonadUnliftIO</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Traversable</span> <span class="n">f</span><span class="p">)</span>
<span class="o">=></span> <span class="kt">Int</span>
<span class="o">-></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span><span class="p">)</span>
<span class="o">-></span> <span class="n">f</span> <span class="n">a</span>
<span class="o">-></span> <span class="n">m</span> <span class="p">(</span><span class="n">f</span> <span class="n">b</span><span class="p">)</span>
<span class="n">mapConcurrentlyLimited</span> <span class="n">n</span> <span class="n">f</span> <span class="n">l</span> <span class="o">=</span> <span class="n">liftIO</span> <span class="p">(</span><span class="n">newQSem</span> <span class="n">n</span><span class="p">)</span> <span class="o">>>=</span> <span class="nf">\</span><span class="n">q</span> <span class="o">-></span> <span class="n">mapConcurrently</span> <span class="p">(</span><span class="n">b</span> <span class="n">q</span><span class="p">)</span> <span class="n">l</span>
<span class="kr">where</span> <span class="n">b</span> <span class="n">q</span> <span class="n">x</span> <span class="o">=</span> <span class="n">bracket_</span> <span class="p">(</span><span class="n">liftIO</span> <span class="p">(</span><span class="n">waitQSem</span> <span class="n">q</span><span class="p">))</span> <span class="p">(</span><span class="n">liftIO</span> <span class="p">(</span><span class="n">signalQSem</span> <span class="n">q</span><span class="p">))</span> <span class="p">(</span><span class="n">f</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
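<p>For comparison, roughly the same shape can be built from base alone with
<code class="language-plaintext highlighter-rouge">forkIO</code>, a <code class="language-plaintext highlighter-rouge">QSem</code>, and <code class="language-plaintext highlighter-rouge">MVar</code>s. A simplified IO-only sketch
that, unlike the real version, doesn’t propagate exceptions from workers:</p>

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- At most n actions run at once; results come back in input order.
-- Caveat: a worker that throws leaves its MVar empty, deadlocking takeMVar.
mapConcurrentlyLimitedIO :: Int -> (a -> IO b) -> [a] -> IO [b]
mapConcurrentlyLimitedIO n f xs = do
  sem <- newQSem n
  let spawn x = do
        v <- newEmptyMVar
        _ <- forkIO (bracket_ (waitQSem sem) (signalQSem sem) (f x) >>= putMVar v)
        pure v
  mapM spawn xs >>= mapM takeMVar

main :: IO ()
main = mapConcurrentlyLimitedIO 3 (pure . (* 2)) [1 .. 10 :: Int] >>= print
```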
<p>My commandline tool has two different upload commands because in one
case you want to just blast a bunch of photos and clips up and in the
other case, you want to sequence a series of parts into a single item
of a specific type. This was a huge win for me because before I
figured out how to do this, I had a couple of folders of hundreds of photos
each that were GoPro time lapses that I couldn’t upload to GoPro using
any of their tools.</p>
<p>Actually, it’s got a third upload command because I was able to
recover a failed upload, but that was a super complicated process and
their API doesn’t <em>quite</em> let you do this in a usefully automatable
way.</p>
<h2 id="updating">Up…dating?</h2>
<p>One issue I found in my bulk uploads is that I had a large number of
items for which GoPro couldn’t recognize the camera. This was
especially weird because <em>I</em> could. More specifically, in many cases,
my own GPMF parser correctly identified the camera used, but GoPro
didn’t know what it was.</p>
<p>I had found a URL from which I could <code class="language-plaintext highlighter-rouge">GET</code> the details of a medium by
its ID, and had observed along the way that the web app would mutate that
state with a simple <code class="language-plaintext highlighter-rouge">PUT</code>. Could I do the same?</p>
<p>I wrote a <code class="language-plaintext highlighter-rouge">fixup</code> command that lets me do bulk edits on the <em>GoPro</em>
side using SQL queries. e.g., consider the following query:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="n">m</span><span class="p">.</span><span class="n">media_id</span><span class="p">,</span> <span class="k">g</span><span class="p">.</span><span class="n">camera_model</span>
<span class="k">from</span> <span class="n">media</span> <span class="n">m</span> <span class="k">join</span> <span class="n">meta</span> <span class="k">g</span> <span class="k">on</span> <span class="p">(</span><span class="n">m</span><span class="p">.</span><span class="n">media_id</span> <span class="o">=</span> <span class="k">g</span><span class="p">.</span><span class="n">media_id</span><span class="p">)</span>
<span class="k">where</span> <span class="n">m</span><span class="p">.</span><span class="n">camera_model</span> <span class="k">is</span> <span class="k">null</span>
<span class="k">and</span> <span class="k">g</span><span class="p">.</span><span class="n">camera_model</span> <span class="k">is</span> <span class="k">not</span> <span class="k">null</span>
</code></pre></div></div>
<p>This query emits one row per record where I know the camera type from
my metadata extraction, but GoPro does not. The
<a href="https://github.com/dustin/gopro/blob/master/src/GoPro/Commands/Fixup.hs">fixup</a>
code will run that query, look up each item based on the returned
column <code class="language-plaintext highlighter-rouge">media_id</code> and update the JSON fields corresponding to the
column names returned from the SQL query (e.g., <code class="language-plaintext highlighter-rouge">camera_model</code> above)
and row values/types. I fixed up something like 800 bad items with a
single, simple command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gopro fixup "select m.media_id, g.camera_model ..."
</code></pre></div></div>
<p>I also used this for individual edits since I could craft literal rows
for edits easily enough and didn’t otherwise need to make a special
tool. Not all fields are editable, though.</p>
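<p>The heart of the fixup is “treat each returned column after
<code class="language-plaintext highlighter-rouge">media_id</code> as a field to overwrite.” Modeling a medium’s mutable
fields as a Map makes the update step a one-liner (toy types and a made-up
camera model; the real code patches the JSON and <code class="language-plaintext highlighter-rouge">PUT</code>s it back):</p>

```haskell
import qualified Data.Map.Strict as M

-- One SQL result row, minus media_id: (column name, new value).
type Row = [(String, String)]

-- Overwrite each named field on the medium with the row's value.
applyFixup :: M.Map String String -> Row -> M.Map String String
applyFixup = foldl (\m (k, v) -> M.insert k v m)

main :: IO ()
main = do
  let medium = M.fromList [("media_id", "abc123"), ("camera_model", "")]
      fixed  = applyFixup medium [("camera_model", "HERO8 Black")]
  print (M.lookup "camera_model" fixed)  -- Just "HERO8 Black"
```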
<h2 id="extraction">Extraction</h2>
<div>
<img src="/images/retrieve.jpg" alt="retrieving" title="extracting all the things" class="floatright" width="285" height="225" />
</div>
<p>But what if I want to move everything out of GoPro Plus <em>right now</em>?</p>
<p>I have a <code class="language-plaintext highlighter-rouge">gopro backup</code> command that does that. It uses local
credentials and my previously mentioned <code class="language-plaintext highlighter-rouge">retrieve</code> command to request
all the (signed) media URLs from GoPro Plus and then pushes them into
an <a href="https://aws.amazon.com/sqs/">SQS</a> queue which delivers to an <a href="https://aws.amazon.com/lambda/">AWS
Lambda</a> function whose job is to copy
the data into my own S3 bucket. Since GoPro itself stores all of its
data in S3, I figured moving the data from their bucket to mine using
AWS technologies made sense.</p>
<p>I did write <a href="https://github.com/dustin/gopro/blob/master/lambda/download-to-s3.js">some
javascript</a>
for the lambda function, but it’s small and boring.</p>
<p>In my current, low-bandwidth, unreliable network state, I can move
~1TB of media out of GoPro cloud into bits I control with a small
command that takes a couple minutes to run. Most of the time on my
end is just making a couple thousand calls (though I’m using
<code class="language-plaintext highlighter-rouge">mapConcurrentlyLimited</code> here, of course). Then it’s just however
fast AWS Lambda wants to feed my function.</p>
<p>There’s really no reason I couldn’t grow this into my own GoPro
Plus replacement from here. I’ve got browsing, media persistence, etc… I just
need to do a bit of scaling, but I’m already using ffmpeg…</p>
<h2 id="resources--current-status">Resources / Current Status</h2>
<p>Relevant github repos:</p>
<ul>
<li><a href="http://hackage.haskell.org/package/gopro-plus">GoPro Plus client library</a></li>
<li><a href="https://github.com/dustin/gpmf">GPMF Parser</a></li>
<li><a href="https://github.com/dustin/gopro">GoPro Plus CLI/Web UI</a></li>
</ul>
<p>The GoPro Plus client API is fairly complete. There are parts I
understand, but haven’t really used because I haven’t figured out what
I want to do with them. For example, I couldn’t figure out how to retrieve
the contents of a GoPro Plus collection (for sharing), so they’re not
super useful to me if they’re write-only. Also, device lists and
stuff aren’t that interesting, but this code is pretty easy to work
on. I’ve considered folding <code class="language-plaintext highlighter-rouge">fixup</code>-style editing into this library because
it’d be relatively easy.</p>
<p>The GPMF parser’s low-level bits seemed to work with every sample I
could find, but I do think there are some theoretical things I didn’t
implement. I can’t find anything that uses them, though. The <code class="language-plaintext highlighter-rouge">DEVC</code>
stuff for higher-level abstraction has plenty of room for expansion in
areas that aren’t that interesting to my current project.</p>
<p>The GoPro CLI and Web UI are good enough for me right now, which isn’t
saying that much since I’m not much of a web designer.</p>
<p><img src="/images/gopro-ui.png" alt="gopro ui" /></p>
<h2 id="in-summary">In Summary</h2>
<p>With a bit of reverse engineering, learning a new programming
language, delving into video telemetry parsing, and poring
through a silly number of third-party libraries to do all the things I
want (e.g., exif, websockets, amazonka, etc…), I made a cloud
service personally useful.</p>
<p>Hopefully someone else finds this work useful, too, but if I’m the
only GoPro Plus user out there, then at least I’ve got awesome tools. :)</p>
Playing with LEDs2015-06-28T00:00:00+00:00http://dustin.github.com/2015/06/28/ws2182<h1 id="playing-with-ws2812-leds">Playing with WS2812 LEDs</h1>
<div>
<img src="/images/led.jpg" alt="led" title="a WS2812 multicolor LED" class="floatright" width="300" height="307" />
</div>
<p>I’ve been playing with WS2812 addressable LEDs for a few days now.
They’re kind of neat. You can read a lot about these in
<a href="https://learn.adafruit.com/adafruit-neopixel-uberguide/overview">Adafruit’s Neopixel Überguide</a>, but there are a wide variety of
these things available and lots of interesting applications.</p>
<p>I ordered a handful from Adafruit last time I needed some stuff from
there and finally got around to playing with them.</p>
<p>Of course, the first thing I did was wire them up and try to play with
existing sketches. That was neat, but I wanted to figure out how hard
it would be to deploy them into something else.</p>
<p>I came across this really great <a href="https://github.com/TauLabs/TauLabs/wiki/PicoC-Project:-RGB-LEDs">PicoC project for Tau Labs</a>
that embedded an attiny85 into a servo wire and used it to provide a
higher-level abstraction for interacting with the LEDs. I built one of
these real quick (though I changed the wire protocol slightly to allow
me to address one pixel at a time).</p>
<h2 id="fridays-projects">Friday’s Projects</h2>
<p><img src="https://github.com/dustin/logic-ws2812/raw/master/docs/ws2812.png" alt="logic screenshot" title="Logic Plugin for WS2812" width="640" class="centered" /></p>
<p>Josh.com had a <a href="http://wp.josh.com/2014/05/13/ws2812-neopixels-are-not-so-finicky-once-you-get-to-know-them/">nice informative post</a> explaining a lot of
details about other ways to play with them, which got me very
interested in sniffing stuff out and seeing what’s going on should I try
out a few of these alternative libraries.</p>
<p>This inspired me to write Friday’s first project of the evening – a
<a href="https://github.com/dustin/logic-ws2812">WS2812 Saleae Logic plugin</a>. As seen in the screenshot above,
this detects which lights should be which color at exactly what time
offset as things go across the wire, both interactively above, and in
an export allowing further processing.</p>
<p>So the obvious second project for Friday night was to write a playback
tool for these recordings. The nice side-effect is that I was able to
see what a particular program would look like on a longer strand of
LEDs without having a long strand. I could just tell an Arduino I had
N LEDs and record the signal it sent out to the phantom LEDs (without
even having to plug a single one in).</p>
<p>e.g., I only have 5, but my simulator let me see what 13
would look like on strandtest by running the actual program on my
actual Arduino. The result looks like this:</p>
<div id="wssim"></div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js" charset="utf-8"></script>
<script src="/static/ws2812.js" charset="utf-8"></script>
<p>Initially, I was running the default configuration of 60 LEDs. I
certainly don’t have that many, but it makes a <a href="http://bl.ocks.org/dustin/e7207542c69ecbb53ca9">nicer demo</a>.</p>
<h2 id="saturday">Saturday</h2>
<p>Saturday, I had dance recitals and other stuff, so I didn’t get much
done. My biggest accomplishment was building a bit of code I could
run on my attiny85 to allow PWM-based light mode selection. I coded
up a couple of modes and had a simple table to select the mode
based on the minimum PWM pulse width (in μs), like this:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">struct</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">minVal</span><span class="p">;</span>
<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">mode</span><span class="p">)();</span>
<span class="p">}</span> <span class="n">modes</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
<span class="p">{</span><span class="mi">1500</span><span class="p">,</span> <span class="n">flash</span><span class="p">},</span>
<span class="p">{</span><span class="mi">900</span><span class="p">,</span> <span class="n">pulse</span><span class="p">},</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span> <span class="n">emergency</span><span class="p">},</span>
<span class="p">};</span></code></pre></figure>
<p>This is a dumber version of the Tau Labs thing above, but it means it
can work with dumber things, which is nice.</p>
<h2 id="sunday">Sunday</h2>
<p><img src="/images/livermore.jpg" alt="livermore" title="Livermore" class="centered" /></p>
<p>Sunday, I ended up out driving most of the day. It was nice out. Not
a lot of time again, but I did get to miniaturize the bits from
yesterday.</p>
<p>I had to hack together a 500Ω resistor out of two 1kΩ 0603 resistors I
soldered together to attach the output signal wire. You can see the
progress of this as it went from prototype on a breadboard, through
the tiny soldering to the ready-for-consumer product (almost):</p>
<p><img src="/images/ws2812-wire.jpg" alt="ws2812 wire" title="WS2812 Wire" class="centered" /></p>
<p>In the end, I have my mode selection on a wire, demoed below. Note
“emergency” mode is when there’s no PWM signal at all, as in the
beginning of this video (flash red). Then I switched to the servo
tester to try varying the PWM level to see it hit different modes. As
shown in the snippet above, “pulse” is on the lower end and “flash” is
on the upper end. The names might not be useful, but you can see the
difference in the demo.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/FuRvBpahmBI" frameborder="0" allowfullscreen="1"></iframe>
<p>I haven’t fully figured out what I want to do with the lights yet, but
I’m having fun.</p>
Opensky2015-06-10T00:00:00+00:00http://dustin.github.com/2015/06/10/opensky<h1 id="opensky">Opensky</h1>
<div>
<img src="/images/openskyrx.jpg" alt="opensky" title="an opensky rx" class="floatright" width="300" height="300" />
</div>
<p>I’ve been doing a lot of hardware stuff recently. I didn’t know much
about hardware when I started and I like learning, so this is a lot of
fun for me. Let me introduce you to a project I’ve been doing some
work on lately called Opensky.</p>
<p>Opensky is an open source hardware and software RC receiver that
speaks the Frsky D8 protocol, so it interoperates with
<a href="http://www.frsky-rc.com/product/pro.php?pro_id=137">Frsky Taranis</a>. To the right is an Opensky receiver I
built and use on one of my smaller flying things.</p>
<p>My contributions to Opensky so far have been small. I documented the
<a href="https://github.com/Brotronics/OpenSky">hardware requirements (BOM)</a> in case someone else wanted
to build one and added a few firmware features. I made it easier to
bind to a transmitter after flashing the firmware and added RSSI PPM
injection since I like the way that works in <a href="http://openlrsng.org/">OpenLRSng</a>.</p>
<p>Failsafe is critical for a remote control system, and Opensky’s
failsafe works reasonably well, but can be a bit tricky to set up
correctly. The existing failsafe detects a loss of radio link and
sets all channels to 1000μs and continues these pulses. The problem
is that I’m getting about 995μs pulses at minimum, so the failsafe
value is actually higher than my lowest transmitted value. The flight
controller can’t know that the connection is down without some odd
tweaking, so the failsafe ends up not being very safe in this case.</p>
<p>Last week’s project was writing a couple <a href="https://www.saleae.com/">Saleae Logic</a>
analyzer plugins for helping me analyze this sort of thing. I didn’t
actually have an application at the time, but I figured it’d be fun.
As it turns out, my <a href="https://github.com/dustin/logic-cppm">CPPM</a> analyzer was helpful in figuring
out how to make a better failsafe last night.</p>
<p>First, I took a look at what the <a href="http://www.frsky-rc.com/product/pro.php?pro_id=24">d4r-ii</a> “no pulses” failsafe
looked like on the wire. It’s not exactly what I expected, but
shortly after turning off my transmitter, the signal line went high
and stayed there:</p>
<div>
<img src="/images/d4r-failsafe.png" width="100%" class="centered" alt="d4r failsafe" title="D4R Failsafe" />
</div>
<p>Getting the Opensky to do that was a <a href="https://github.com/dustin/OpenSkyRX/commit/7715d6eb60944557a1661f55530ae8639f6b8ae0">small bit of code</a>,
but I’d kind of screwed myself over on the programming aspect.</p>
<p>The Opensky RX is based on Arduino in the sense that it uses an
atmega328p microcontroller and the Arduino toolkit for building
the firmware images. The receiver itself has a UART exposed and can
be programmed directly from the Arduino software if you include the
boot loader.</p>
<div>
<img src="/images/programming-opensky.jpg" alt="programming-opensky" title="my awesome ICSP setup" class="floatleft" width="300" height="222" />
</div>
<p>But of course, I didn’t do that when I made mine. So I had to rig up
an ICSP setup by soldering a bunch of tiny bits of wire and
metal, some chip clips, and a bunch of other crap that ended up
looking like the thing to the left.</p>
<p>You can’t really see all the detail in the mess, but the one wire mast
that has to be dropped approximately into the center of the board is
non-trivial. But it worked. As a favor to my future self, I went
ahead and added the arduino bootloader.</p>
<p>Adding the arduino bootloader provides a couple major advantages.
First, I can program the thing without all that wiring difficulty.
But also, I can actually interact with the thing’s serial port for
debugging and stuff. I figured I’d give this a go.</p>
<div>
<img src="/images/opensky-devkit.jpg" alt="programming-opensky v2" title="Opensky Dev Kit V2" class="floatright" width="300" height="222" />
</div>
<p>To the right, you can see the current version of my opensky dev kit.
I installed a single right angle pin to get to DTR off the Opensky RX
itself, and then just used a six pin female connector to plug it in.
I added male connectors to access all the individual pins, but ran a
common ground across the board.</p>
<p>The board also has a 3v3 regulator to run the RX off a battery as well
as another six pin female connector (and breakout male pins) that
allow me to plug a CP2102 (USB/UART adaptor) breakout board in. Of
course, the serial port and DTR are already wired up. You just need
one jumper wire to specify the desired power source (battery or the
CP2102’s 3v3 output).</p>
<p>Hopefully this allows me to do some more dev on this thing since there
are more things I’d like to do.</p>
<p>In the meantime, here are a couple of pictures of how I’m using one of
these in real life:</p>
<p>I attached the Opensky RX directly to an <a href="http://www.multirotorsuperstore.com/flight-controllers/afromini-32.html">afromini</a>. The RX
itself requires 3V3 (any more and you kill the radio), so I removed
the input voltage pin I had and ran a bit of 30AWG silicone wire to the
output of the LDO that provides power to the main processor on the
flight controller. This worked out pretty well:</p>
<div>
<img src="/images/opensky-afromini.png" width="100%" class="centered" alt="opensky on afromini" title="Opensky on Afromini" />
</div>
<p>The above is the control system for my 190mm quad which has been quite
fun to fly. The RSSI over PPM lets me verify I’m not getting a
weakened control signal (on OSD) as I fly around.</p>
<div>
<img src="/images/sparrow.png" width="100%" class="centered" alt="my sparrow" title="Completed Sparrow (with opensky)" />
</div>
<p>If you want to make an Opensky RX yourself, check out the
hardware page in the <a href="https://github.com/dustin/OpenSkyRX">OpenSkyRX repository</a> for the list of parts to get started along
with a couple boards from <a href="https://oshpark.com/shared_projects/p9XDsm1u">oshpark</a>. There’s more
documentation to write, but I found enough to get stuff working quite
well.</p>
<p>This stuff keeps me busy. I feel like I have no idea what I’m doing,
but in the end, the things I make fly (and help others make things
fly). That’s very rewarding.</p>
Taranis and the Nano QX2014-12-25T00:00:00+00:00http://dustin.github.com/2014/12/25/taranis-and-the-nano-qx<h1 id="whats-this">What’s This?</h1>
<div>
<img src="/images/qav400.jpg" alt="qav400" title="My QAV400 in a test flight" class="floatright" width="406" height="295" />
</div>
<p>So, this isn’t my typical programming post, but I wanted to write
about stuff I learn in my hobby time. I fly things.</p>
<p>Building and flying is fun, but I have a specific topic that might
be interesting.</p>
<p>To the right is a picture I took during an early test flight of a quad
I built. Building and flying is pretty fun, but you need a lot of
space, big batteries, good weather, etc…</p>
<p>I want to write about some of my “development” work in the meantime,
though. It’s Christmas today. The weather’s not too bad, but it’s a
bit windy. Since it’s winter, we also don’t have as much light, so I
want something I can fly indoors.</p>
<div>
<img src="/images/fpvnanoqx.jpg" alt="fpv nano qx" title="The FPV Blade Nano QX is great for indoor training" class="floatleft" width="210" height="158" />
</div>
<p>So I practice a lot indoors on my <a href="http://www.bladehelis.com/Products/Default.aspx?ProdID=BLH7280">Blade Nano QX</a> for both
<abbr title="line of sight">LoS</abbr> and
<abbr title="first person view">FPV</abbr> flying. I slam it into
walls and people and what-not with minimal harm.</p>
<p>If you pick one of these up with one of the
<abbr title="ready to fly">RTF</abbr> kits, you get a <em>terrible</em>
radio controller. It’s OK to learn some basics, but won’t get you
very far (literally, my tiny apartment exceeds the range of this
thing).</p>
<h1 id="better-control">Better Control</h1>
<div>
<img src="/images/taranis.png" alt="taranis+" title="This is my radio, there are many like it, but this one is mine" class="floatright" width="240" height="250" />
</div>
<p>My weapon of choice is the <a href="http://www.frsky-rc.com/product/pro.php?pro_id=137">Taranis+</a>. It’s basically a
weird computer whose interface is a bunch of switches and knobs and
joysticks that you can use to control things over RF.</p>
<p>Note that the Nano QX requires <a href="https://www.spektrumrc.com/Technology/DSM2.aspx">DSM2</a> or <a href="https://www.spektrumrc.com/Technology/DSMX.aspx">DSMX</a> for
control and the Taranis won’t do that natively, but I got an
<a href="http://www.hobbyking.com/hobbyking/store/__51704__OrangeRX_DSMX_DSM2_Compatible_2_4Ghz_Transmitter_Module_JR_Turnigy_compatible_US_Warehouse_.html?strSearch=orangerx">OrangeRx</a> module that speaks the right protocols and
plugged it in as an external radio.</p>
<p>In order to get basic flight control of the Nano QX, you plug in the
module, and set up the following:</p>
<ul>
<li>Thrust on Channel 1</li>
<li>Aileron on Channel 2</li>
<li>Elevator on Channel 3</li>
<li>Rudder on Channel 4</li>
<li>Flight Mode on Channel 6</li>
</ul>
<p>Note that flight mode is a toggle, so I mix it in from a momentary
switch (<code class="language-plaintext highlighter-rouge">SH</code>) on the Taranis. This is a bit unfortunate, because it
means the radio has no way to tell what flight mode you’re in – you
have to just look at the stupid lights to see what it’s doing.</p>
<p>Also note that aileron and rudder are reversed, so in the “servos”
config, you’ll need to mark them as inverted. e.g.:</p>
<div>
<img src="/images/taranis-qx-servo.png" width="424" height="128" class="centered" alt="inversed servos" title="Note inverted servos" />
</div>
<p>At this point, you should have basic flight operations.</p>
<h1 id="advanced-mixing">Advanced Mixing</h1>
<p>But that’s all background. The main thing I wanted to write about is
how I use <a href="http://www.open-tx.org/">OpenTX</a> to actually do interesting things with this
model.</p>
<h2 id="taming-acro-mode">Taming Acro Mode</h2>
<p>First, the acro mode on the Nano has been described as “fidgety.” In
that mode, rather than auto-leveling when you let off the sticks, the
pitch and roll are basically held such that aileron and elevator
control change the rate at which it rotates in the specified
direction. Tiny, tiny adjustments will just about flip the thing.
This wasn’t fun for me, so the first thing I did was make a curve I
could apply to pitch and roll to give me subtle controls in the
middle, but still allow me to do flips and stuff.</p>
<div>
<img src="/images/taranis-qx-curvy.png" width="424" height="128" class="centered" alt="control curve" title="Real controls have curves" />
</div>
<p>To apply this, I made a flight mode on the Taranis controlled by <code class="language-plaintext highlighter-rouge">SA</code>.
I only change ail and ele control in mixes. For example, existing ele
control would be set for only flight mode 0. The new mode (2 at this
point) gets a new mix that’s only for flight mode 2 and works pretty
much like the default, but applies the curve defined above. Repeat
for ail.</p>
<p>Now you jump into the right flight mode on the taranis and flip <code class="language-plaintext highlighter-rouge">SH</code>
until the light turns red and bam – it’s flyable.</p>
<h2 id="auto-banking-via-super-advanced-mixing">Auto Banking (via Super Advanced Mixing)</h2>
<p>That was fun and practical, but I wanted to see if I could automatically
bank the craft while flying when I try to turn only using yaw.</p>
<p><strong>Goal</strong>: In isolation, yaw, pitch, and roll should all work normally.
But when pitched forward, yaw should proportionally also apply roll to
bank the aircraft.</p>
<p><strong>Secondary goal</strong>: I have no idea how much to do this, so I want to
make the amount configurable on the fly via one of the sliders (I used
<code class="language-plaintext highlighter-rouge">LS</code>).</p>
<p>This was non-obvious enough to make me want to post about it.</p>
<h3 id="flight-mode-setup">Flight mode Setup</h3>
<p>We create <code class="language-plaintext highlighter-rouge">FM1</code> for this new flight mode. For the initial goal, we’re
not adjusting anything, so nothing special just yet.</p>
<h3 id="limit-rudder-optional">Limit Rudder (optional)</h3>
<p>For <code class="language-plaintext highlighter-rouge">FM1</code>, I modify my <code class="language-plaintext highlighter-rouge">CH4</code> (Rud) mix to a weight of 50%. This is
optional, but since my goal is to fly around and do circles and stuff,
I thought slowing down the pirouettes was helpful.</p>
<p>If you do this, this should override the existing rudder control.</p>
<h3 id="make-the-curve">Make the Curve</h3>
<p>We need a curve that takes us from zero in the middle to 100 at either
side. A smoothed curve is good here. I use the following:
<div>
<img src="/images/taranis-qx-arcurve.png" width="424" height="128" class="centered" alt="autobank control curve" title="curving the roll" />
</div>
<p>This curve is used to grab an absolute distance from zero in elevator
control, as we want to bank proportionally to the pitch regardless of
the direction we’re going.</p>
<h3 id="configure-aileron-channel">Configure Aileron Channel</h3>
<p>So here’s the kind of tricky part. For the FM1 aileron channel
mixer, we want three inputs.</p>
<ol>
<li>Rudder at 50%</li>
<li>Elevator at 50% with the above curve applied <em>multiplied</em> in</li>
<li>Aileron input <em>added</em> in</li>
</ol>
<p>This allows us to mix the rudder and elevator to compute an aileron
while still giving us relatively normal aileron control, though it’ll
be a bit exaggerated at speed if you try to bank manually and/or
difficult to flatten.</p>
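<p>The three-input mix is easier to see written out as ordinary code. The
sketch below is just my reading of the mixer logic in Go, not anything
OpenTX executes; the function name, normalized inputs, and the use of a
plain absolute value in place of the smoothed curve are all mine:</p>

```go
package main

import (
	"fmt"
	"math"
)

// autoBankAileron mixes rudder and elevator into the aileron channel:
// rudder input is scaled by a weight (the 50%, or GV1 later) and by
// how far the elevator is deflected in either direction, then the
// normal aileron input is added on top. Inputs are in [-1, 1].
func autoBankAileron(ail, ele, rud, weight float64) float64 {
	// math.Abs(ele) stands in for the smoothed absolute-value curve.
	banked := rud * weight * math.Abs(ele)
	out := banked + ail
	return math.Max(-1, math.Min(1, out)) // clamp like servo limits
}

func main() {
	// Pitched forward with full right yaw and no stick roll:
	// the craft banks into the turn on its own.
	fmt.Println(autoBankAileron(0, -0.8, 1, 0.5))
}
```

<p>With the sticks centered except for yaw, the craft pirouettes flat;
add forward pitch and the same yaw input starts rolling it into the
turn, which is the banking behavior described above.</p>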
<h3 id="dynamically-changing-bank-amount">Dynamically Changing Bank Amount</h3>
<p>The 50% up there was an arbitrary number. I don’t actually know what
a good number is, and it sucks to reconfigure everything when you want
to experiment. That’s why we have knobs and sliders and stuff.</p>
<p>We can adjust weights with a global variable, so we just need a way to
adjust the global variable. Firstly, have the global <code class="language-plaintext highlighter-rouge">GV1</code> variable
owned by the flight mode, as shown here, defaulting to 0.</p>
<div>
<img src="/images/taranis-qx-gvar.png" width="424" height="128" class="centered" alt="global variables" title="setting the global vars" />
</div>
<p>Because we want to adjust the weight from 0 to 100 (unless you want
counter-banking), we need a curve to adjust the slider input linearly
over that amount.</p>
<p>So basically this curve:</p>
<div>
<img src="/images/taranis-qx-adjcurve.png" width="424" height="128" class="centered" alt="adjustment curve" title="linear 0-100 curve for adjustment" />
</div>
<p>Applied to a new input with <code class="language-plaintext highlighter-rouge">LS</code> as a source (I used input slot 10
here just to separate from the radio channels):</p>
<div>
<img src="/images/taranis-qx-adjinput.png" width="424" height="128" class="centered" alt="adjustment input" title="adjustment input" />
</div>
<p>Then it’s a simple matter of having a special function set the value
of <code class="language-plaintext highlighter-rouge">GV1</code> to the cooked output of this input (<code class="language-plaintext highlighter-rouge">IAAdj</code>).</p>
<p>Then just swap out the <code class="language-plaintext highlighter-rouge">50</code> weight above with the variable <code class="language-plaintext highlighter-rouge">GV1</code> and
you’re done!</p>
<p>Rudder example:</p>
<div>
<img src="/images/taranis-qx-adjrud.png" width="424" height="128" class="centered" alt="adjusted rudder" title="adjusted rudder" />
</div>
<p>Repeat the same for elevator and enjoy the magic.</p>
<h1 id="but-im-lazy">But I’m Lazy!</h1>
<p>You can look over the OpenTX <a href="/static/nanoqx-printed.html">printout</a> of the
model, or just <a href="https://raw.githubusercontent.com/dustin/taranis/master/nanoqx.eepe">download the model</a> directly to play with
it.</p>
Pools in Go2014-04-25T00:00:00+00:00http://dustin.github.com/2014/04/25/chan-pool<div>
<img src="/images/pool.jpg" alt="poolin' gophers" title="Gophers in a Pool" class="center" width="560" height="374" />
</div>
<h1 id="pooling-in-go">Pooling in Go</h1>
<p>A few people in irc have asked how to build an object pool. I’ve had
a need for this in the past and came up with a design I rather like,
so I’ve walked people through the code a few times.</p>
<p>Since I’m at <a href="http://www.gophercon.com/">Gophercon</a> and I had a similar discussion
today, I figured I’d write up a detailed description of how my code
works, how it progressed and why it is the way it is now.</p>
<h2 id="channels-as-a-pool">Channels as a Pool</h2>
<p>You can think of a buffered channel in go as a thread-safe queue. It
also has this nice property of blocking if it’s too full or too
empty.</p>
<p>Channels practically do all the work for you.</p>
<p>For example, the core of the first version of the pool I wrote for
<a href="https://github.com/couchbaselabs/go-couchbase">go-couchbase</a> follows. <code class="language-plaintext highlighter-rouge">cp.connections</code> is a buffered
channel of <code class="language-plaintext highlighter-rouge">*memcached.Client</code>. <code class="language-plaintext highlighter-rouge">cp.mkConn</code> will build a new
connection whenever one is needed.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="p">(</span><span class="n">cp</span> <span class="o">*</span><span class="n">connectionPool</span><span class="p">)</span> <span class="n">Get</span><span class="p">()</span> <span class="p">(</span><span class="o">*</span><span class="n">memcached</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">select</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">rv</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">cp</span><span class="o">.</span><span class="n">connections</span><span class="o">:</span>
<span class="c">// Existing connection</span>
<span class="k">return</span> <span class="n">rv</span><span class="p">,</span> <span class="no">nil</span>
<span class="k">case</span> <span class="o"><-</span><span class="n">time</span><span class="o">.</span><span class="n">After</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Millisecond</span><span class="p">)</span><span class="o">:</span>
<span class="c">// No existing connection, let's make one</span>
<span class="k">return</span> <span class="n">cp</span><span class="o">.</span><span class="n">mkConn</span><span class="p">(</span><span class="n">cp</span><span class="o">.</span><span class="n">host</span><span class="p">,</span> <span class="n">cp</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">func</span> <span class="p">(</span><span class="n">cp</span> <span class="o">*</span><span class="n">connectionPool</span><span class="p">)</span> <span class="n">Return</span><span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">memcached</span><span class="o">.</span><span class="n">Client</span><span class="p">)</span> <span class="p">{</span>
<span class="k">select</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">cp</span><span class="o">.</span><span class="n">connections</span> <span class="o"><-</span> <span class="n">c</span><span class="o">:</span>
<span class="k">default</span><span class="o">:</span>
<span class="c">// Overflow connection</span>
<span class="n">c</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>The first time you try to get a connection from this pool, the channel
is empty, so it hits the delay path, opens a connection, and then
returns the newly created connection. While that’s in use, a second
would also do the same thing.</p>
<p>Returning a connection is simply a non-blocking send back into the
channel. The reason it’s non-blocking is subtle, but should hopefully
be obvious – if there’s space in the channel, we keep the
connection. If there’s no space, it was overflow and more than we
want to keep in the pool, so we immediately close it.</p>
<h2 id="problems">Problems?</h2>
<p>This served me well for quite a while, but one of my users had an
application in which hundreds of goroutines simultaneously wanted to
perform a quick DB operation. All of them timed out instantly, all of
them opened new connections and then all but the pool size (~5?)
closed them because they were overflow.</p>
<p>So how can we limit the total number of connections? I dreaded fixing
this problem for a long time. I was thinking about having a counter
for the number of connections outstanding, which would need a lock,
and then a condition variable to notify waiters whenever a new slot
became available, and then a loop. But mixing that
with the existing channel op would be painfully tedious.</p>
<p>Turns out, you just need another channel to control connections. This
brings us to go’s most amazing feature: <code class="language-plaintext highlighter-rouge">select</code>. This issue quickly
degraded from something I was trying to push back to the application
author to work around to basically the following code:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="p">(</span><span class="n">cp</span> <span class="o">*</span><span class="n">connectionPool</span><span class="p">)</span> <span class="n">Get</span><span class="p">()</span> <span class="p">(</span><span class="n">rv</span> <span class="o">*</span><span class="n">memcached</span><span class="o">.</span><span class="n">Client</span><span class="p">,</span> <span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Try to grab an available connection within 1ms</span>
<span class="k">select</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">rv</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">cp</span><span class="o">.</span><span class="n">connections</span><span class="o">:</span>
<span class="k">return</span> <span class="n">rv</span><span class="p">,</span> <span class="no">nil</span>
<span class="k">case</span> <span class="o"><-</span><span class="n">time</span><span class="o">.</span><span class="n">After</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Millisecond</span><span class="p">)</span><span class="o">:</span>
<span class="c">// No connection came around in time, let's see</span>
<span class="c">// whether we can get one or build a new one first.</span>
<span class="k">select</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">rv</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">cp</span><span class="o">.</span><span class="n">connections</span><span class="o">:</span>
<span class="k">return</span> <span class="n">rv</span><span class="p">,</span> <span class="no">nil</span>
<span class="k">case</span> <span class="n">cp</span><span class="o">.</span><span class="n">createsem</span> <span class="o"><-</span> <span class="no">true</span><span class="o">:</span>
<span class="c">// Room to make a connection</span>
<span class="n">rv</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">cp</span><span class="o">.</span><span class="n">mkConn</span><span class="p">(</span><span class="n">cp</span><span class="o">.</span><span class="n">host</span><span class="p">,</span> <span class="n">cp</span><span class="o">.</span><span class="n">auth</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="c">// On error, release our create hold</span>
<span class="o"><-</span><span class="n">cp</span><span class="o">.</span><span class="n">createsem</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">rv</span><span class="p">,</span> <span class="n">err</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">func</span> <span class="p">(</span><span class="n">cp</span> <span class="o">*</span><span class="n">connectionPool</span><span class="p">)</span> <span class="n">Return</span><span class="p">(</span><span class="n">c</span> <span class="o">*</span><span class="n">memcached</span><span class="o">.</span><span class="n">Client</span><span class="p">)</span> <span class="p">{</span>
<span class="k">select</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">cp</span><span class="o">.</span><span class="n">connections</span> <span class="o"><-</span> <span class="n">c</span><span class="o">:</span>
<span class="k">default</span><span class="o">:</span>
<span class="c">// Overflow connection.</span>
<span class="o"><-</span><span class="n">cp</span><span class="o">.</span><span class="n">createsem</span>
<span class="n">c</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Notice the new channel <code class="language-plaintext highlighter-rouge">cp.createsem</code> which is the semaphore we use
for opening connections. The buffer size controls how many total
connections we can possibly have outstanding. Each time we establish
a new connection, we place an item in that channel. When we close a
connection, we remove it again.</p>
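<p>The semaphore half of this pattern is worth isolating. Here’s a
minimal sketch (the <code class="language-plaintext highlighter-rouge">limiter</code> type and its names are mine, not from the
production code) of a buffered channel used purely as a counting
semaphore:</p>

```go
package main

import "fmt"

// limiter bounds how many connections may exist at once by using a
// buffered channel as a counting semaphore: a successful send
// acquires a slot, a receive releases one.
type limiter struct {
	sem chan struct{}
}

func newLimiter(max int) *limiter {
	return &limiter{sem: make(chan struct{}, max)}
}

// tryAcquire reports whether we may open another connection.
func (l *limiter) tryAcquire() bool {
	select {
	case l.sem <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot after a connection is closed.
func (l *limiter) release() { <-l.sem }

func main() {
	l := newLimiter(2)
	fmt.Println(l.tryAcquire(), l.tryAcquire(), l.tryAcquire())
	l.release()
	fmt.Println(l.tryAcquire())
}
```

<p>The channel’s buffer size is the whole policy: no lock, no counter,
no condition variable.</p>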
<p><code class="language-plaintext highlighter-rouge">Return</code> is roughly the same, save the overflow semaphore management.</p>
<p>The <code class="language-plaintext highlighter-rouge">Get</code> method is trickier. There’s a <code class="language-plaintext highlighter-rouge">select</code> block nested within
another <code class="language-plaintext highlighter-rouge">select</code> block. The outer one only attempts to wait for an
existing connection. If one becomes available relatively soon (one
millisecond), we just use it. It’s generally better to use an
existing connection than open a new one.</p>
<p>If the pool wait times out, we go into the inner <code class="language-plaintext highlighter-rouge">select</code>. It should
be noted that this is similar, but is different in an important way.
We still wait for an available connection, but we also wait for the
ability to create a new connection. Waiting for this connection
semaphore in the outer <code class="language-plaintext highlighter-rouge">select</code> block would cause us to open
connections prematurely.</p>
<p>Rolling them into a single <code class="language-plaintext highlighter-rouge">select</code> is perfectly valid if the objects
and their creation are cheap, but when they’re not, it’s easy enough
to prefer reuse over creation as I’ve done here.</p>
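<p>For cheap objects, the collapsed single-<code class="language-plaintext highlighter-rouge">select</code> version would look
roughly like this. This is a toy sketch with stand-in types, not the
go-couchbase code; note that reuse and creation now compete on equal
footing:</p>

```go
package main

import "fmt"

// conn stands in for a real connection object.
type conn struct{}

type pool struct {
	conns     chan *conn // idle connections
	createsem chan bool  // semaphore bounding total connections
}

// get collapses reuse and creation into one select, so when both
// cases are ready the runtime picks one at random. That's fine when
// making a new object costs almost nothing.
func (p *pool) get() *conn {
	select {
	case c := <-p.conns:
		return c
	case p.createsem <- true:
		return &conn{}
	}
}

func main() {
	p := &pool{conns: make(chan *conn, 2), createsem: make(chan bool, 2)}
	c := p.get() // pool is empty, so the create path wins
	fmt.Println(c != nil)
}
```

<p>The nested-<code class="language-plaintext highlighter-rouge">select</code> version in the real pool exists precisely to avoid
that random choice when connections are expensive.</p>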
<h2 id="final-version">Final Version</h2>
<p>The above shows an approximate evolution of my pool. Check out the
<a href="https://github.com/couchbaselabs/go-couchbase/blob/master/conn_pool.go">production code</a> if you want to see the final hardened
version. There are a few differences between the above code and
what’s in production:</p>
<ol>
<li>nil receivers are supported (will close a connection if there’s no pool)</li>
<li>returning nil is supported (does nothing)</li>
<li>there are health checks on the connection objects</li>
<li>some tracing exists to help understand which path was used to
establish connections.</li>
<li>connection pool shutdown is supported via closing and clearing the
connection channel</li>
<li>there’s a short-circuit connection pool that avoids allocating a
timer</li>
<li>there’s always an overall timeout on acquiring a connection</li>
</ol>
<p>Otherwise, the theory of operation is as described above.</p>
<p>Also, for a good time, notice that I’ve got 100% <a href="https://github.com/couchbaselabs/go-couchbase/blob/master/conn_pool_test.go">test coverage</a>
on this pool. That’s really quite nice considering how much crap it
has to deal with.</p>
<h2 id="bonus--connectionpoolclose">Bonus: (*connectionPool).Close()</h2>
<p>This is only mildly related, but I wanted to describe the connection
pool shutdown mechanism. It’s slightly clever and super dense, but
I’m keeping it.</p>
<p>The purpose of this method is to close all connections available in
the pool and signal to anything waiting for the pool that it’s closed
and should error immediately (these are all the <code class="language-plaintext highlighter-rouge">errPoolClosed</code> paths
you can see in <a href="https://github.com/couchbaselabs/go-couchbase/blob/master/conn_pool.go">the code</a>).</p>
<p>This is the code in its entirety:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span> <span class="p">(</span><span class="n">cp</span> <span class="o">*</span><span class="n">connectionPool</span><span class="p">)</span> <span class="n">Close</span><span class="p">()</span> <span class="p">(</span><span class="n">err</span> <span class="kt">error</span><span class="p">)</span> <span class="p">{</span>
<span class="k">defer</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span> <span class="n">err</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="nb">recover</span><span class="p">()</span><span class="o">.</span><span class="p">(</span><span class="kt">error</span><span class="p">)</span> <span class="p">}()</span>
<span class="nb">close</span><span class="p">(</span><span class="n">cp</span><span class="o">.</span><span class="n">connections</span><span class="p">)</span>
<span class="k">for</span> <span class="n">c</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">cp</span><span class="o">.</span><span class="n">connections</span> <span class="p">{</span>
<span class="n">c</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
<span class="p">}</span>
<span class="k">return</span>
<span class="p">}</span></code></pre></figure>
<p>So, ignore the <code class="language-plaintext highlighter-rouge">defer</code> for a second. We close the connections channel
and then iterate the channel closing each connection within it. Then
we return. The return is naked, so it’s just going to return the
current value of <code class="language-plaintext highlighter-rouge">err</code>, which is <code class="language-plaintext highlighter-rouge">nil</code> (could also <code class="language-plaintext highlighter-rouge">return nil</code> –
makes no difference here).</p>
<p>Pretty straightforward.</p>
<p>OK, now stop ignoring the <code class="language-plaintext highlighter-rouge">defer</code>. Why is that there? Well, if you
call <code class="language-plaintext highlighter-rouge">Close</code> twice for some reason, the channel close will panic.
This <code class="language-plaintext highlighter-rouge">recover</code> will catch that panic and convert it to an error and
store it in <code class="language-plaintext highlighter-rouge">err</code>, the method’s return value. The first call
returns nil and the second returns an error. Admittedly, it’s not an
awesome error that is obvious what you’ve done wrong, but it doesn’t
panic.</p>
<p>But what happens in the first call case? In the first call, <code class="language-plaintext highlighter-rouge">recover</code>
returns something that isn’t an error (<code class="language-plaintext highlighter-rouge">nil</code>). We do a two-value type
assertion which returns a nil value and false. Then we assign this
new <code class="language-plaintext highlighter-rouge">nil</code> to our <code class="language-plaintext highlighter-rouge">err</code> return variable. The method returns nil, but
for a different reason.</p>
<p>This method actually just returns whatever the first result of a
two-result type assertion on the <code class="language-plaintext highlighter-rouge">recover</code> is.</p>
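<p>The double-close behavior is easy to demonstrate in isolation.
<code class="language-plaintext highlighter-rouge">closeOnce</code> below is a minimal stand-in for the method (the name is mine);
the trick works because the runtime’s “close of closed channel” panic
value implements <code class="language-plaintext highlighter-rouge">error</code>:</p>

```go
package main

import "fmt"

// closeOnce converts a double-close panic into an error: recover()
// yields the runtime's error value on the second call, and nil
// (asserted to a nil error) on the first.
func closeOnce(ch chan struct{}) (err error) {
	defer func() { err, _ = recover().(error) }()
	close(ch)
	return
}

func main() {
	ch := make(chan struct{})
	fmt.Println(closeOnce(ch)) // first close succeeds: <nil>
	fmt.Println(closeOnce(ch)) // second close: close of closed channel
}
```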
Big Data2014-02-04T00:00:00+00:00http://dustin.github.com/2014/02/04/bigdata<div>
<img src="/images/bigdata.png" alt="big data" title="I like big data and I cannot lie." class="center" width="560" height="363" />
</div>
<h1 id="thanks-for-the-memories">Thanks for the memories</h1>
<p>A while back, there was a leak of a LinkedIn password data in the form
of a list of unsalted SHA-1 hashes. A few sites had password check
tools up such that you could provide your password or a hash of it and
it’d tell you whether it was found in the leaked information.</p>
<p>These sites were all really slow, and would sometimes report database
errors. I found it curious that anyone would even consider a database
for a fixed-size single record lookup of a small amount of immutable
data.</p>
<p>I downloaded the data set and played with it during a meeting. In
about a half hour, I had <a href="https://gist.github.com/dustin/2885182">a small server</a> that could load the data
set into memory in a couple seconds and serve responses from memory
stupidly fast and with perfect horizontal scalability.</p>
<h1 id="capacity-planning">Capacity planning</h1>
<p>I think in the LinkedIn case, there were 6.5 million hashes leaked.
SHA-1 hashes are 20 bytes.</p>
<p>That’s 130MB of data.</p>
<p>I have tabs in Chrome right now that are using more memory than this,
yet people deployed multi-tier infrastructure to answer simple
presence queries. They’re fragile, complex and slow.</p>
<h1 id="what-database-do-i-need">What database do I need?</h1>
<p>This brings me to my motivation for writing this. People have gone a
little bit overboard with thinking up big solutions to small problems.
I’m not trying to pick on any particular user, but I did have an
example that helps make the point pretty clearly.</p>
<p>I picked up a Stack Overflow question yesterday about
<a href="http://stackoverflow.com/questions/21479025/">checking for hash presence</a> that was almost an identical problem.
Note that the question is tagged <code class="language-plaintext highlighter-rouge">bigdata</code>. In this case, it was “a
few million” SHA-256es.</p>
<p>Initially, the user attempted to use both MySQL and Couchbase to solve
the problem. Apparently Couchbase used too much memory (presumably
using hex encoded keys) and MySQL was too slow, so he tried sharding
the table by first nibble, but it still was too slow, so he asked for
help.</p>
<p>Two of the answers were (reasonably sensible) suggestions for MySQL.
Some schema suggestions and configuration parameters that will help
with efficiency.</p>
<p>Another was suggesting some combination of hbase, redis, and
cassandra. That’s just… overkill. This is a super small scale
problem.</p>
<p>The spec said “several million” of these hashes. I wrote a small test
with 50 million hashes. 50e6 * 32 is about a gig and a half of RAM.
It’d be unusual to find a computer that couldn’t spare 1.5GB of RAM
for such processing. You have to get up to about a billion hashes
before it starts to get a little harder.</p>
<p>“But that won’t scale!” you say? An EC2 instance that can hold about
8 billion such hashes in memory costs about $3.50 per hour. By the
time you get to that level, you can think about something better
anyway.</p>
<h1 id="but-i-dont-want-to-write-a-lot-of-code">But I don’t want to write a lot of code!</h1>
<p>I pointed to <a href="https://gist.github.com/dustin/2885182">the code</a> I’d written in that meeting as an example
to get started. It contains both the text -> binary format convert
thingy as well as a web server that loads that file into memory and
returns an HTTP status that indicates presence. That was a distracted
half hour of work.</p>
<p>However, I realized that I’d since written <a href="https://github.com/dustin/go-hashset">go-hashset</a>. That
makes this type of problem <em>much</em> easier.</p>
<p>Below is a complete server using <code class="language-plaintext highlighter-rouge">go-hashset</code> and <code class="language-plaintext highlighter-rouge">net/http</code> that will
return <code class="language-plaintext highlighter-rouge">HTTP 204</code> on a hit and <code class="language-plaintext highlighter-rouge">HTTP 410</code> on a miss given a GET
request to <code class="language-plaintext highlighter-rouge">/[sha-256]</code>.</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"encoding/hex"</span>
<span class="s">"log"</span>
<span class="s">"net/http"</span>
<span class="s">"os"</span>
<span class="s">"github.com/dustin/go-hashset"</span>
<span class="p">)</span>
<span class="k">const</span> <span class="p">(</span>
<span class="n">hashSize</span> <span class="o">=</span> <span class="m">32</span>
<span class="n">listenAddr</span> <span class="o">=</span> <span class="s">":8080"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">loadFile</span><span class="p">(</span><span class="n">fn</span> <span class="kt">string</span><span class="p">)</span> <span class="o">*</span><span class="n">hashset</span><span class="o">.</span><span class="n">Hashset</span> <span class="p">{</span>
<span class="n">f</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">os</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatalf</span><span class="p">(</span><span class="s">"Error opening hash file: %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">defer</span> <span class="n">f</span><span class="o">.</span><span class="n">Close</span><span class="p">()</span>
<span class="n">hs</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">hashset</span><span class="o">.</span><span class="n">Load</span><span class="p">(</span><span class="n">hashSize</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatalf</span><span class="p">(</span><span class="s">"Error loading hashes: %v"</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">hs</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">hs</span> <span class="o">:=</span> <span class="n">loadFile</span><span class="p">(</span><span class="s">"hashes.bin"</span><span class="p">)</span>
<span class="n">http</span><span class="o">.</span><span class="n">HandleFunc</span><span class="p">(</span><span class="s">"/"</span><span class="p">,</span> <span class="k">func</span><span class="p">(</span><span class="n">w</span> <span class="n">http</span><span class="o">.</span><span class="n">ResponseWriter</span><span class="p">,</span> <span class="n">req</span> <span class="o">*</span><span class="n">http</span><span class="o">.</span><span class="n">Request</span><span class="p">)</span> <span class="p">{</span>
<span class="n">b</span><span class="p">,</span> <span class="n">_</span> <span class="o">:=</span> <span class="n">hex</span><span class="o">.</span><span class="n">DecodeString</span><span class="p">(</span><span class="n">req</span><span class="o">.</span><span class="n">URL</span><span class="o">.</span><span class="n">Path</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="p">])</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">==</span> <span class="n">hashSize</span> <span class="o">&&</span> <span class="n">hs</span><span class="o">.</span><span class="n">Contains</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="n">w</span><span class="o">.</span><span class="n">WriteHeader</span><span class="p">(</span><span class="m">204</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">w</span><span class="o">.</span><span class="n">WriteHeader</span><span class="p">(</span><span class="m">410</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">})</span>
<span class="n">log</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"Listening on %v"</span><span class="p">,</span> <span class="n">listenAddr</span><span class="p">)</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">http</span><span class="o">.</span><span class="n">ListenAndServe</span><span class="p">(</span><span class="n">listenAddr</span><span class="p">,</span> <span class="no">nil</span><span class="p">))</span>
<span class="p">}</span></code></pre></figure>
<p>Half the code is loading the file and the other half is specific to
this HTTP API. It’s easy enough to imagine another protocol if this
doesn’t work for you.</p>
<h1 id="conclusion">Conclusion</h1>
<p>“big data” isn’t all that clearly defined, but as a rule of thumb,
here are indicators that you’re definitely not working with big data:</p>
<ul>
<li>If it fits comfortably in your phone, it’s not big data</li>
<li>If it fits comfortably in your computer’s RAM, it’s not big data</li>
<li>If it fits on your laptop’s SSD, it’s not big data</li>
<li>If it fits on a single hard drive, it’s not big data</li>
</ul>
<p>One might even argue that if you can fit the data into a single
computer, it’s not worth calling it big data, though big-data
processing tools can be beneficial even at smaller scales.</p>
<p>In the meantime, enjoy the smaller data in life. It’s fun and easy.</p>
2013 Contributions2013-12-31T00:00:00+00:00http://dustin.github.com/2013/12/31/2013<p>2013 was a reasonably productive year. The days I didn’t commit any
open source were, of course, a Tuesday and a Thursday. Boo.</p>
<div>
<img src="/images/2013.png" width="587" height="230" alt="2013" title="2013" />
</div>
<p>But I’ll focus on the positive. I really learned a lot and hopefully
made at least a few things that helped people. Based on bug reports
and pull requests, at least a couple people got some use out of my
time.</p>
<p>One thing I definitely did too little was blog. I find interesting
things and I write them into code and bring it out when it comes up in
conversation. I’ll never get good at it if I don’t start practicing
more.</p>
<p>Today, I wrote
<a href="https://gist.github.com/dustin/8202528">a little program</a> to dig
through all the github public events of 2013 (about 82 million) I
store in <a href="http://dustin.sallings.org/2012/09/27/cbfs.html">cbfs</a> and figure out which ones were of my doing. That
yielded 2,483 push events, which I reviewed in a google spreadsheet to
find 2,110 of them that were actually my fault.</p>
<p>Notice that github shows 2,229 contributions this year. This is
because I managed to open source some of the stuff I was working on.
yay. I’ll get into more of that later. In the meantime, let’s do a
quick breakdown of a few interesting things that happened each month.</p>
<h1 id="january">January</h1>
<p>In January, there were <em>two</em> days I didn’t do any code. But a lot I
did.</p>
<div>
<a href="/images/2013-cbugg.png"><img src="/images/2013-cbugg.png" width="250" height="164" alt="cbugg" title="cbugg" class="floatright" /></a>
</div>
<p>I started writing <a href="https://github.com/couchbaselabs/cbugg">cbugg</a> as the perfect bug tracker for me. There
were a couple of goals here. One of them was to not have to touch
Jira since everything I did there just consumed entirely too much
time. We wanted to create workflows that streamlined things as much
as possible.</p>
<p>Perhaps more importantly, we wanted to exercise
<a href="http://www.couchbase.com/">Couchbase Server</a> (and <a href="https://github.com/couchbaselabs/go-couchbase">go-couchbase</a>) and get a better
understanding of what it was like to throw an app together really
quick with this. <a href="http://crate.im/">Aaron Miller</a> saw the initial version and
made a new UI for it in <a href="http://angularjs.org/">AngularJS</a> that was really great.</p>
<p><a href="https://github.com/mschoch">Marty Schoch</a> came along and did some really awesome search
integration. Overall, it’s been pretty great.</p>
<p>I did some work on <a href="http://dustin.sallings.org/2012/09/27/cbfs.html">cbfs</a> since I keep increasingly more critical
things there (e.g. <a href="https://www.youtube.com/watch?v=V0QpTnKaNE8">attachments in cbfs</a>, builds, etc…)</p>
<p>I did lots of work on <a href="https://github.com/dustin/gitmirror">gitmirror</a>. gitmirror is great for keeping
local copies of everything you push up to github, updated all the
time. It’s also a good way to have a single integration point of all
of your repositories. It’s got a tool, <a href="https://github.com/dustin/gitmirror/tree/master/setuphooks">setuphooks</a>, that allows
you to use patterns to configure any type of hook event on a repo, or
across all repos you own or all repos in an organization to which you
have appropriate access. It feeds any change anyone does to any repo
at Couchbase into cbugg, for example.</p>
<p>I still don’t have a complete <a href="http://en.wikipedia.org/wiki/Automatic_Packet_Reporting_System">APRS</a> stack that both receives and
transmits, but I did get more work done on <a href="https://github.com/dustin/go-aprs">go-aprs</a> that uses
<a href="https://github.com/dustin/go-nma">go-nma</a> to cause my phone and tablet to go off if anyone mentions
my callsign on the radio.</p>
<p>I started <a href="https://github.com/dustin/sallingshome">sallingshome</a> to manage some of the goings on around my
house. Primarily, it just has the chore management of the house.
Kids go there to see what tasks are available, they do the things, the
things get marked done and unavailable for their respective repeat
periods and I have a lot of things to pay for. It was my first
from-scratch Angular project which <em>really</em> helped my understanding,
but it’s also used a lot around my house. If you think it’ll help
you, let me know and I’ll try to document it enough to let someone
else run it.</p>
<h1 id="february">February</h1>
<p>February saw a lot more cbugg work. It was addictive, so we kept
going.</p>
<div>
<img src="/images/2013-cbgb.png" alt="cbgb" title="cbgb" class="floatleft" />
</div>
<p>But we also figured out what we were doing with <a href="http://cbgb.io/">cbgb</a>.
<a href="https://github.com/steveyen">Steve Yen</a> and I made a pretty useful working clone of
Couchbase Server in pure go. In particular, I did a lot of work to
scale it up to large numbers of buckets. That was some fun work that
led to me extracting <a href="https://github.com/dustin/go-broadcast">go-broadcast</a> and making a heavily
multiplexing implementation of it I used when I had hundreds of
thousands of things that could be wanting ticks and what-not.</p>
<p>These projects took most of my time, but I also did a little work on
<a href="https://github.com/dustin/frames">frames</a>, <a href="https://github.com/couchbaselabs/go-couchbase">go-couchbase</a>, <a href="https://github.com/dustin/seriesly">seriesly</a>, <a href="https://github.com/dustin/location">location</a>,
and a few other things.</p>
<h1 id="march">March</h1>
<p>March was tons more <a href="http://cbgb.io/">cbgb</a>. We had a mission of supporting any
application that could run against Couchbase, but in a single binary
download. We were evaluating this by running <a href="http://dustin.sallings.org/2012/09/27/cbfs.html">cbfs</a>, <a href="https://github.com/couchbaselabs/cbugg">cbugg</a>,
and whatever other apps we could find against it, fixing bugs as we
found them.</p>
<p>I fixed a few bugs in a <a href="https://github.com/igm/sockjs-go/">sockjs</a> library we were using in cbugg.
Marty had originally written some magical websocket code, but we run a
lot of our web services through an Apache frontend, which eats
websockets, so he chose sockjs as a fallback. It was decent, but had
a few broken channel idioms and would hang and/or panic if things
didn’t go perfectly. Those got contributed back upstream.</p>
<p><a href="https://github.com/dustin/go-jsonpointer">go-jsonpointer</a> is built on <a href="https://github.com/dustin/gojson">my fork</a> of go’s JSON
package. In addition to updating the fork, I ended up plaguing Marty
with a few bugs in edge cases I wasn’t handling well. It’s got a
pretty good test suite, though, so maintaining it isn’t too bad.</p>
<p>I have a tool called <a href="https://github.com/couchbaselabs/pktreplay">pktreplay</a> I use for taking memcached packet
captures and playing them back against a machine. We’ve used this to
reproduce bugs from customer production situations that were difficult
to simulate from descriptions. I did a little bit of maintenance on
this for another customer engagement.</p>
<h1 id="april">April</h1>
<p>I worked on cbgb a lot more in April. Just more stuff.</p>
<div>
<img src="/images/2013-latency.png" width="250" height="188" alt="latency" title="passive latency" class="floatright" />
</div>
<p>I also took a trip to visit a customer who was having some
unpredictable latency in production. I built a tool I called
<a href="https://github.com/couchbaselabs/pktlatency">pktlatency</a> that passively measures latency from packet traces and
then dumps them into data I could process with R. After juggling it
around a bit, I created the plot you see to the right showing
distributions of slow responses by server node. It seems obvious in
retrospect, but one machine was consistently the source of all the
issues. Got the customer’s netops involved and just hung around in
Baltimore with <a href="http://trondn.blogspot.com/">Trond</a>.</p>
<p>I also had been working on a robot that played some games for me
online at the time. This involved automatically moving transactions
around in bitcoind, so I worked a lot on
<a href="https://github.com/GeertJohan/go.bitcoin">a go bitcoin interface</a>. I eventually released my game
bot code since all the games went away (boo). It’s no longer useful,
but there were some interesting bits of code in there that someone
will find useful someday.</p>
<p>There are a few packages called goquery, but I found
<a href="https://github.com/opesun/goquery">this one</a> and fixed up and implemented some of the parts
that I needed and got that back upstream.</p>
<p>I wrote a really neat backup mechanism for cbfs that backs up into
itself. I should really do a post on this. On catastrophic failure
of Couchbase, I can restore terabytes of data in minutes. I’ve made
use of this a few times (though generally because I just decided to
destroy my database).</p>
<h1 id="may">May</h1>
<p>In May I rewrote one of my photo album incarnations on AngularJS.
That was exciting.</p>
<p>I also did a tiny bit of work on <a href="https://github.com/couchbase/sync_gateway">sync_gateway</a>. My go infection
spread to co-workers who built some pretty awesome software from idea
to deployment in go.</p>
<p>cbfs’ backup tool was pretty great, but older backups could reference
objects that no longer existed since GC was only rooted by current
file references. I wanted to make sure that everything that existed
in backups was always available, so I decided to consider objects in
backups gc roots as well. I wrote <a href="https://github.com/dustin/go-hashset">go-hashset</a> as an efficient way
to maintain and operate on sets of hashes used as object references.
At the time, I had somewhere around 200k distinct live objects, but I
was building for billions and it was fun.</p>
<h1 id="june">June</h1>
<p>I mostly fixed bugs in June. I worked on <a href="https://github.com/steveyen">Steve’s</a> slabber
some, a lot of go-couchbase, finished off my hashset, and a few other
things. 168 commits across 22 projects as far as the github public
feed saw.</p>
<p>One thing that was fun in June, though, was the way I capture the
public feed data. I wrote a tool that syncs it up with a chunk of
cbfs so I’ve always got recent, replicated copies of the data
locally. That was very handy when I wanted to research this blog
post. :)</p>
<h1 id="july">July</h1>
<div>
<img src="/images/cbfsperf.png" alt="cbfs perf" title="cbfs perf" class="floatleft" />
</div>
<p>In July, I actually blogged a little. I had a post about
<a href="/2013/07/04/siginfo.html">using SIGINFO</a> to ask for interactive process information
The Unix Way (unless you’re on Linux).</p>
<p>That helped me understand download problems I was running into which
led to my writing <a href="/2013/07/17/saturate.html">go-saturate</a>.</p>
<p>Now my downloads can kill networks again.</p>
<p>This month, I also started working on <a href="http://www.couchbasecloud.com/">Couchbase Cloud</a>. It’s
a self-service frontend to a usable sync_gateway you can use as a sync
point for <a href="https://github.com/couchbase/mobile">Couchbase Mobile</a>. It’s AngularJS, go, and
cbgb. It was one of the biggest things I worked on in July, but only
accounted for about 12% of my commits.</p>
<h1 id="august">August</h1>
<p>Although I worked on 30 projects in August, most were relatively
small. I extracted the logger used in sync_gateway into a project
called <a href="https://github.com/couchbaselabs/clog">clog</a>. Not because the world needed another logger, but
because it helped meet Marty’s requirements for something more easily
than the rest of them.</p>
<p>I built <a href="https://github.com/dustin/papertrails">papertrails</a> to roll up my logs that get dumped into S3
from <a href="http://papertrailapp.com/">papertrail</a> monthly. It’s small, but really quite useful (and
I’m about to run it for all of December’s logs). That’s slightly
exciting for me. These replicate through <a href="http://www.bittorrent.com/sync">btsync</a> into cbfs and all
that fun stuff.</p>
<p>I also did some work on <a href="https://github.com/dustin/go-coap">go-coap</a> as I began using it in Couchbase
Cloud as a cheap and lossy means of reporting some DB events.
<a href="http://tools.ietf.org/html/draft-ietf-core-coap-18">CoAP</a> is like a really lightweight HTTP over UDP. I POST DB events
such as opening and closing of DBs with their sizes and stuff to the
management system. If it’s busy or dropped, it’s not an issue. Most
importantly it’s not polling and the code delivering the events isn’t
burdened with lots of file descriptors, a slow or down server, etc…</p>
<h1 id="september">September</h1>
<p>September is when I started having actual users on Couchbase Cloud, so
we got the necessary features in for it to run on its own: logging,
monitoring, and supervision.</p>
<p>This means I worked on supporting tools such as <a href="https://github.com/dustin/logexec">logexec</a> which
makes it super easy to send arbitrary programs’ output to syslog
(i.e. papertrail), as well as on go-couchbase, cbgb, and go-coap.</p>
<p>I also designed a new circuitboard for my <a href="/2012/09/16/wash.html">washer project</a>, though
when I got around to hardware procurement, I think I found something
even better. It’s in front of me, not assembled.</p>
<h1 id="october">October</h1>
<div>
<a href="http://cbvis.west.spy.net/static/vbuckets.html?cluster=http%3A//mango.hq.couchbase.com%3A8091/"><img src="/images/2013-vbmap.png" width="250" height="150" alt="vbmap" title="vbmap" class="floatright" /></a>
</div>
<p>My birthday is in October. But later in the month, I got to go visit
a customer in Montréal. They had a decent amount of data and traffic,
so I got to pull out an old tool I’d built for watching our clusters
rebalance.</p>
<p>The link to the right will take you to a live visualization of a
fairly boring cluster I’ve got. Just trust me, it’s super-exciting
when it’s not stable. I’ve got record and playback tools to let me
see what this looked like after the fact and with variable time.</p>
<p>I started using <a href="http://docker.io/">docker</a> a lot more and creating tools like
<a href="https://github.com/dustin/confsed">confsed</a> to help dynamically rewrite JSON APIs that try to
magically discover their addresses to proper external addresses that
aren’t known until instance run time.</p>
<p>I wrote <a href="https://github.com/dustin/bindir/blob/master/go-manifest">go-manifest</a> (last month, really) and <a href="https://github.com/dustin/bindir/blob/master/go-set-versions">go-set-versions</a>
as an example of how trivial it is to manage go packages if you try.
I wrote the first one while eating lunch at my desk just to show how
to easily determine the revisions of all of your dependencies as your
project is built. The latter just to reverse it since I had a couple
people not believing the process could be reversed… somehow.</p>
<p>But I don’t use such things myself. I like progress, so I wish
everyone to run the latest everything. The fear, of course, is that
things will break and you won’t know about them for long periods of
time. For this, I wrote <a href="https://github.com/dustin/gadzooks">gadzooks</a>. I have an instance of this
running on GAE that sees every public change that goes through github,
as well as acting like a github hook receiver (which I populate with
<a href="https://github.com/dustin/gitmirror/tree/master/setuphooks">setuphooks</a>). You configure sets of dependencies and a
build to trigger to <a href="http://drone.io/">drone.io</a> or your favorite CI system and any
time anything that <em>might</em> affect your build changes, your build gets
triggered.</p>
<h1 id="november">November</h1>
<div>
<a href="//raw.github.com/dustin/randbo/master/randbo.png"><img src="//raw.github.com/dustin/randbo/master/randbo.png" width="250" height="200" alt="randbo" title="randbo" class="floatleft" /></a>
</div>
<p><a href="https://github.com/dustin/randbo">Randbo</a> was one of those projects that just had to be written
because of the name. Someone wanted a way to grab arbitrary random
<code class="language-plaintext highlighter-rouge">[]byte</code>s in go, which, to me, means you want an <code class="language-plaintext highlighter-rouge">io.Reader</code>. I’d
written something like it before, so I threw it together. I’m sure I
spent more time on the image.</p>
<p>I published my first CRAN package: <a href="http://cran.r-project.org/web/packages/humanFormat/index.html">humanFormat</a>. This is similar
to my <a href="https://github.com/dustin/go-format">go-format</a> package, but for R. I was doing some R plots and
got tired of pasting in the same format functions and changing them
slightly depending on my data. I must say, it’s a lot more difficult
to get a package available to people in R than in go. They really vet
it. You have to pass all the tests (of course it failed on Windows
the first time, because Windows can’t spell μs properly), document all
the things, etc…</p>
<p>I pretty much rewrote <a href="https://github.com/dustin/go-couch">go-couch</a> tests from scratch. I wanted to
get
<img src="https://coveralls.io/repos/dustin/go-couch/badge.png?branch=master" alt="coverage" />
and not have to hit an actual CouchDB every time.</p>
<p><a href="http://coveralls.io/">coveralls</a> is a great tool which I started using a lot more. This
led to major work on <a href="https://github.com/mattn/goveralls">goveralls</a> including adding support for go
1.2’s built-in cover tool, offline executions of coverage and several
fixes for issues I ran into.</p>
<p>I added go template support to <a href="http://docker.io/">docker</a> inspect so you can
more easily script things on your docker hosts.</p>
<p>My <a href="https://github.com/dustin/gosh">gosh</a> server I use for doing builds and deployments
from webhooks in such a way that doesn’t excessively use resources
seemed like it deserved a proper repo instead of the gist I’d been
keeping it in.</p>
<div>
<img src="http://west-spy.appspot.com/house/" width="272" height="193" alt="house" title="house" class="floatright" />
</div>
<p>And look to your right.</p>
<p><a href="https://github.com/dustin/westspy">This</a> is actually rendering and serving from GAE in a neat
batch processing thing that gets data from my house and processes it
on demand.</p>
<p>That was a fun challenge for which I learned pull queues and generally
learned how to be really lazy with resources on GAE.</p>
<h1 id="december">December</h1>
<div>
<a href="http://ph.couchbase.net/"><img src="/images/2013-ph.png" width="250" height="200" alt="phone home" title="watch phone home events" class="floatleft" /></a>
</div>
<p>I finally got to open source the <a href="https://github.com/couchbaselabs/statstore">stat collector</a> we use to collect
anonymous statistics from field units at Couchbase. It was mostly
closed because I had some passwords and stuff in it. Opening it is
really great because I abstractly talked about it when trying to help
people out on Google App Engine apps, but couldn’t show them real code
I’d written because I did dumb stuff like store passwords and junk in it.</p>
<p>Click the image to see people (who opted in for anonymous stat
collection) using Couchbase right now.</p>
<p><a href="https://github.com/dustin/yellow">Yellow</a> is a tiny go library that helps you raise
awareness of bits of your code that are executing more slowly than you
expect in production. It’s another thing I’ve pulled out of a few
applications and wanted to get some reuse out of.</p>
<div>
<a href="http://camlistore.org/"><img src="/images/2013-tardis.png" width="250" height="244" alt="time travel" title="time travel" class="floatright" /></a>
</div>
<p>In the last half of December, I started getting involved in
<a href="http://camlistore.org/">camlistore</a> which is really quite fun. I <a href="https://plus.google.com/105229686595945792364/posts/449d7ohG1aH">wrote a little</a>
about my adventures in extended attributes over on google plus, but
suffice it to say I’ve been well over my head for a while. I went
from not knowing anything about FUSE or camlistore to wanting to
implement a means of looking at any point in the history of the
filesystem by timestamp.</p>
<p>And then I thought it’d be a good idea to make a Mac OS X GUI for all
the things camlistore.</p>
<p>I do plan on working on it a lot more, though. It’s a really great project.</p>
Wherefore go-saturate2013-07-17T00:00:00+00:00http://dustin.github.com/2013/07/17/saturate<h1 id="wherefore-go-saturate">Wherefore go-saturate</h1>
<div>
<img src="/images/saturated.png" alt="saturated" title="saturated" class="floatleft" />
</div>
<p>In my <a href="../04/siginfo.html">previous blog post</a>, I wrote about a bottleneck I ran
into that caused my application to pause when I felt it should’ve been
working hard.</p>
<p>My workers were all stuck waiting for data from one source while other
sources of the same data were available.</p>
<p>This is how I fixed it.</p>
<h1 id="problems-illustrated">Problems Illustrated</h1>
<div>
<img src="https://raw.github.com/dustin/go-talks/master/channels/hot.png" alt="hotness" title="the new hotness" class="floatright" />
</div>
<p>First, it’s good to see exactly what the problem looked like.</p>
<p>There are four workers and they all need a mix of data of type <code class="language-plaintext highlighter-rouge">a</code> or
<code class="language-plaintext highlighter-rouge">b</code>. Both <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> data sets are replicated so any given record
can come from one of two of these servers.</p>
<p>If there are equal tasks for <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> and servers are all generally
available, but one of the servers is slower than the others, then you
will invariably end up with all of the workers stuck retrieving data
from <code class="language-plaintext highlighter-rouge">a rep 1</code> even though <code class="language-plaintext highlighter-rouge">a rep 2</code> contains the same data.</p>
<p>But why is it blocked on that node? Because that node is slow.</p>
<p>When randomly selecting workers and one worker takes considerably
longer to complete its work, you will inevitably be blocked waiting
for slower workers to complete their tasks.</p>
<p>If this isn’t intuitive to you, think about it this way:</p>
<div>
<img src="/images/slowrand.png" alt="random slowness" title="random slowness" class="floatleft" />
</div>
<p>Imagine just two servers containing the same information from which
you randomly choose for any given request. Let’s say one server is 8x
slower than the other.</p>
<p>Assume a reasonable random number generator such that there’s a 50%
chance that any given request will be issued against the fast one, and
a 50% chance it will be issued against the slow one.</p>
<p>Now imagine you’ve got two clients that are wanting to grab a bunch of
data from the two servers.</p>
<p>The scenario illustrated above will occur frequently:</p>
<ol>
<li>Client 1 hits the fast server.</li>
<li>Client 2 hits the fast server.</li>
<li>Client 1 hits the slow server - is blocked for a while.</li>
<li>Client 2 hits the fast server again.</li>
<li>Client 2 hits the slow server - all workers are blocked while the
fast server is idle.</li>
</ol>
<p>Boo.</p>
<h1 id="what-does-go-saturate-do-about-this">What Does go-saturate Do About This?</h1>
<p>If you can express your work in the form of <code class="language-plaintext highlighter-rouge">(task, []resources)</code>
pairs, you can use <a href="//github.com/dustin/go-saturate">go-saturate</a> to help resolve this
type of problem with a double-fanout as illustrated and discussed
below.</p>
<p>Internally, the resources are each represented by a channel that has
one or more goroutines servicing it (user-specified). A resource is
“available” when a worker is idle and just waiting for new work on a
channel. If the resource is slow, the worker spends more time working
and less time accepting new work (see the red in the sequence diagram
above).</p>
<div>
<img src="https://raw.github.com/dustin/go-talks/master/channels/twotier.png" alt="double fanout" title="double fanout" class="floatright" />
</div>
<p>The diagram to the right represents a producer feeding into two tiers
of workers. The first tier workers, e.g. <code class="language-plaintext highlighter-rouge">worker 1</code> are higher level
(e.g. copy a file from the internet to local disk) while the second
tier workers, e.g. <code class="language-plaintext highlighter-rouge">a rep 1</code> are lower-level (e.g. read this file from
this location).</p>
<p>It’s important to understand that the first-tier workers are in the
same process as the second-tier workers; the distinction is that the
second-tier workers operate on a resource identified by the first-tier
workers, and the duration of that work isn’t necessarily
predictable.</p>
<p>In the illustrated case, workers 1 and 3 are concurrently performing
tasks that need data from <code class="language-plaintext highlighter-rouge">a</code>.</p>
<p>Let’s break down the specific example shown in that diagram.</p>
<ol>
<li>The producer submits two tasks that both need information from <code class="language-plaintext highlighter-rouge">a</code>.</li>
<li>Workers 1 and 3 pick up this work.</li>
<li>They both identify <code class="language-plaintext highlighter-rouge">a rep 1</code> and <code class="language-plaintext highlighter-rouge">a rep 2</code> workers as being
possible sources of <code class="language-plaintext highlighter-rouge">a</code>.</li>
<li>The <code class="language-plaintext highlighter-rouge">select</code> from worker 1 chooses <code class="language-plaintext highlighter-rouge">a rep 2</code>.</li>
<li>The <code class="language-plaintext highlighter-rouge">select</code> from worker 3 chooses <code class="language-plaintext highlighter-rouge">a rep 1</code>.</li>
</ol>
<p>Note that steps 4 and 5 can occur at approximately the same time, but
the <code class="language-plaintext highlighter-rouge">a rep 1</code> worker can <em>only</em> pick up one job and the first-tier
worker (e.g. <code class="language-plaintext highlighter-rouge">worker 1</code>) can <em>only</em> have one worker pick up its task.</p>
<p>By asking for either replica and taking whichever is available,
they’re able to do their work without blocking on each other. For
example, if <code class="language-plaintext highlighter-rouge">a rep 1</code> is slow, tasks requiring <code class="language-plaintext highlighter-rouge">a</code> will continue to
get their <code class="language-plaintext highlighter-rouge">a</code>s from <code class="language-plaintext highlighter-rouge">a rep 2</code> as fast as that replica can keep up.</p>
<p>In a sense, the resources are self-selecting.</p>
<div>
<img src="/images/cbfsperf.png" alt="cbfs perf" title="cbfs perf" class="floatright" />
</div>
<p>Here is a concrete example that illustrates go-saturate in the real
world. It’s being used in the <a href="http://labs.couchbase.com/cbfs/">cbfs</a> client code. (cbfs is a
distributed blob store for Couchbase Server).</p>
<p>As you can see on the right (a cbfs scenario where one server has
100Mbps ethernet and the others have gig-e), slow resources are still
used, but when a slow resource is being slow and a fast resource is
available, we won’t choose the slow one.</p>
<p>In long-running tasks, this is great – if I can keep all the fast
nodes <em>and</em> the slow nodes busy, then I get a little more out of the
network as a whole.</p>
<h1 id="other-resources">Other Resources</h1>
<p>You can read the <a href="http://godoc.org/github.com/dustin/go-saturate">go-saturate docs</a> to get an overview of the API.</p>
<p>Also, I gave a presentation at work earlier this week on this topic
which goes a lot further into basic go channel semantics,
interactively covering everything from how to send and receive on a
channel to how to send a message to exactly one of a dynamically built
set of recipients.</p>
<p>That presentation is captured in the following 20 minute video:</p>
<iframe width="560" height="315" src="//www.youtube.com/embed/QDO5YOrKSiQ" frameborder="0" allowfullscreen="1">
</iframe>
<p>Both the <a href="http://talks.godoc.org/github.com/dustin/go-talks/channels.slide">slides</a> and the <a href="http://talks.godoc.org/github.com/dustin/go-talks/channels.article">long-form</a> contents
of the material used in that presentation are available.</p>
Need the INFO2013-07-04T00:00:00+00:00http://dustin.github.com/2013/07/04/siginfo<div>
<img src="/images/notwork.png" alt="notwork" title="What are you doing?" class="floatright" />
</div>
<h1 id="why-isnt-my-program-working-harder">Why Isn’t my Program Working Harder?</h1>
<p>In my efforts to saturate my network with <a href="/2012/09/27/cbfs.html">cbfs</a>, I kept
noticing lulls in my graphs – both the <a href="/2012/09/09/seriesly.html">seriesly</a> and the
<a href="http://bjango.com/mac/istatmenus/">OSX widget</a> I use showing me what my computer’s doing. I
decided to figure out what the client is doing when it’s supposed to
be breaking my networks.</p>
<p>The cbfsclient download tool does three passes against the cluster to
figure out what to do:</p>
<ol>
<li>The first translates the path you requested to a tree of filenames
and object IDs (typically sha1 hashes).</li>
<li>The second finds current locations of all of those objects.</li>
<li>The third just starts flipping through all of those objects and
asking one of the origin servers at random to stream it down.</li>
</ol>
<p>As all requests are against origin servers, it should almost always
just be a straight up <code class="language-plaintext highlighter-rouge">sendfile</code>.</p>
<p>So how do I find out what’s going on? <code class="language-plaintext highlighter-rouge">netstat</code> wasn’t really useful –
it just told me that my client didn’t have anything to do, but had a
few connections open.</p>
<p>The thing I really want to know is exactly what HTTP requests are
currently in flight and how long they’ve been in flight. However, I
only want to know this when I observe behavior to be weird.</p>
<h2 id="enter-siginfo">Enter SIGINFO</h2>
<p><code class="language-plaintext highlighter-rouge">SIGINFO</code> is awesome. On BSD systems, <code class="language-plaintext highlighter-rouge">^T</code> sends <code class="language-plaintext highlighter-rouge">SIGINFO</code> to the
process currently attached to the terminal. A few programs
(e.g. <code class="language-plaintext highlighter-rouge">dd</code>) have built-in <code class="language-plaintext highlighter-rouge">SIGINFO</code> handlers that give you useful
information on long-lived processes.</p>
<p><code class="language-plaintext highlighter-rouge">^T</code> doesn’t work on Linux. I don’t know why and explaining that is
beyond the scope of this post, but I’m not developing on Linux, so
back to the lecture at hand.</p>
<p>Signals in UNIX are essentially messages delivered to the process, but
the UNIX APIs for signal handling involve registering a function to be
called when the signal is available for processing. This is
unfortunate because most things you’d be tempted to do in a signal
handler are <a href="https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers">unsafe</a>.</p>
<p>In go, signals are delivered to a channel. A goroutine reading from
that channel can safely do anything any other goroutine can do.</p>
<p>The most simple example of <code class="language-plaintext highlighter-rouge">SIGINFO</code>, at least on OS X, is as follows:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"log"</span>
<span class="s">"os"</span>
<span class="s">"os/signal"</span>
<span class="s">"syscall"</span>
<span class="p">)</span>
<span class="c">// SIGINFO isn't part of the stdlib, but it's 29 on most systems</span>
<span class="k">const</span> <span class="n">SIGINFO</span> <span class="o">=</span> <span class="n">syscall</span><span class="o">.</span><span class="n">Signal</span><span class="p">(</span><span class="m">29</span><span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">ch</span> <span class="o">:=</span> <span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="n">os</span><span class="o">.</span><span class="n">Signal</span><span class="p">,</span> <span class="m">1</span><span class="p">)</span>
<span class="n">signal</span><span class="o">.</span><span class="n">Notify</span><span class="p">(</span><span class="n">ch</span><span class="p">,</span> <span class="n">SIGINFO</span><span class="p">)</span>
<span class="k">go</span> <span class="k">func</span><span class="p">()</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">_</span> <span class="o">=</span> <span class="k">range</span> <span class="n">ch</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"You pressed ^T"</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}()</span>
<span class="k">select</span> <span class="p">{}</span>
<span class="p">}</span></code></pre></figure>
<p>You can read more on go signal handling either in <a href="http://golang.org/pkg/os/signal/">the docs</a>
or <a href="https://code.google.com/p/go-wiki/wiki/SignalHandling">wiki</a>.</p>
<h2 id="now-do-something-useful">Now Do Something Useful</h2>
<p>I spent a few minutes replacing the <a href="http://golang.org/pkg/net/http/#RoundTripper">http RoundTripper</a> for
the default client for my program with one that would keep track of
the beginning of every HTTP request through the <code class="language-plaintext highlighter-rouge">Close()</code> of the
response body.</p>
<p>Then I ran my program again. Once I saw another lull, I pressed <code class="language-plaintext highlighter-rouge">^T</code>
and, <em>ah ha</em>!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>load: 1.45 cmd: cbfsclient 14765 running 47.14u 145.92s
In-flight HTTP requests:
servicing "http://slownode:8484/.cbfs/blob/[oid]" for 1m34.00s
servicing "http://slownode:8484/.cbfs/blob/[oid]" for 1m31.67s
servicing "http://slownode:8484/.cbfs/blob/[oid]" for 1m32.08s
servicing "http://slownode:8484/.cbfs/blob/[oid]" for 1m33.67s
servicing "http://slownode:8484/.cbfs/blob/[oid]" for 1m34.06s
servicing "http://slownode:8484/.cbfs/blob/[oid]" for 47.26s
</code></pre></div></div>
<p>This program was only allowing 6 concurrent requests and they were all
stuck doing requests against the same slow node. It’s so obvious when
you can see it.</p>
<h2 id="code">Code?</h2>
<p>I threw this together in about fifteen minutes just to debug this current
situation, but I’ve got it down to <code class="language-plaintext highlighter-rouge">initHttpMagic()</code> in <code class="language-plaintext highlighter-rouge">cbfsclient</code>’s
<a href="https://github.com/couchbaselabs/cbfs/blob/master/tools/cbfsclient/httpmagic.go">httpmagic.go</a>.</p>
<p>It’s not well documented at this point because, well, I spent ~15
minutes on it to solve my problem. The basic theory is pretty
straightforward, though:</p>
<ol>
<li>Every time we start a request, record the URL and timestamp.</li>
<li>Every time the response body is closed, forget about that URL.</li>
<li>When <code class="language-plaintext highlighter-rouge">^T</code> is pressed, dump out the current map.</li>
</ol>
<p>(Do note that I don’t ever have two requests to the same URL, so I’m
not worried about losing that information.)</p>
<p>Good luck, out there.</p>
CBFS DNS Service2012-10-05T00:00:00+00:00http://dustin.github.com/2012/10/05/cbfsdns<div>
<img src="/images/stupid.png" alt="stupid" title="Warning: Stupid" class="floatright" />
</div>
<h1 id="cbfs-dns-service">CBFS DNS Service</h1>
<p><strong>Warning</strong>: This is kind of a silly idea and not necessarily a
recommendation for how you should do things.</p>
<p>Also, this is not a replacement for <a href="http://www.dns-sd.org/">DNS-SD</a> or <a href="http://www.multicastdns.org/">mDNS</a> or
any such things. But it’s a fun toy I got working in a couple of
hours, so I’m playing with it.</p>
<h2 id="dns-for-humans">DNS for Humans</h2>
<p>If you’re looking at this web page, you’ve probably interacted in some
way with the <a href="http://en.wikipedia.org/wiki/Domain_Name_System">domain name system</a>. It’s pretty convenient as a
human to ask for <code class="language-plaintext highlighter-rouge">dustin.github.com</code> and not think about what that
means.</p>
<p>There are tons of descriptions of this service out there, how it
works, etc… I’m not going to get into that as much as a small bit
on relevant parts and how they’re generally used.</p>
<h3 id="a-records">A Records</h3>
<p>Many of the DNS queries that are tossed about are for <code class="language-plaintext highlighter-rouge">A</code>, or address
records. e.g., I ask my browser for <code class="language-plaintext highlighter-rouge">dustin.github.com</code> which does a
magical DNS dance around looking for an <code class="language-plaintext highlighter-rouge">A</code> record for
<code class="language-plaintext highlighter-rouge">dustin.github.com.</code> and get the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dustin.github.com. 41033 IN A 204.232.175.78
</code></pre></div></div>
<p>From that point on, the browser has an IP address it can talk to.
Typically, this is put in place by the human who owns the IP address
<code class="language-plaintext highlighter-rouge">204.232.175.78</code>. This is most likely a wildcard for <code class="language-plaintext highlighter-rouge">*.github.com.</code>,
but most importantly, the management direction is <em>usually</em> “thing I
want to provide” -> “resource on which I can provide it.”</p>
<p>i.e. you want to provide pages, you configure your machines for it
and point the service at the machines.</p>
<h3 id="ns-records">NS Records</h3>
<p>Part of the above requires a lookup of an <code class="language-plaintext highlighter-rouge">NS</code> record to find out
where to even ask for the <code class="language-plaintext highlighter-rouge">A</code> record. The <code class="language-plaintext highlighter-rouge">NS</code> record is the way that
a domain in DNS can delegate responsibility to another system. In
this case, someone who administers <code class="language-plaintext highlighter-rouge">com.</code> delegates to someone who
administers <code class="language-plaintext highlighter-rouge">github.com.</code> (TLDs are a little more complicated, but this
is roughly the idea)</p>
<p>Most of that magic happens in the DNS server, but basically, assuming
I know who serves <code class="language-plaintext highlighter-rouge">com.</code>, I ask who serves <code class="language-plaintext highlighter-rouge">github.com.</code> to ask where
<code class="language-plaintext highlighter-rouge">dustin.github.com.</code> is. Such an <code class="language-plaintext highlighter-rouge">NS</code> query returns the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; ANSWER SECTION:
github.com. 81150 IN NS ns4.p16.dynect.net.
github.com. 81150 IN NS ns3.p16.dynect.net.
github.com. 81150 IN NS ns1.p16.dynect.net.
github.com. 81150 IN NS ns2.p16.dynect.net.
;; ADDITIONAL SECTION:
ns4.p16.dynect.net. 81150 IN A 204.13.251.16
ns3.p16.dynect.net. 81150 IN A 208.78.71.16
ns2.p16.dynect.net. 81150 IN A 204.13.250.16
ns1.p16.dynect.net. 81150 IN A 208.78.70.16
</code></pre></div></div>
<p>This lists both the names of the nameservers that I asked about and
was kind enough to also send along their IP addresses so I don’t have
to make another trip to figure out where they are.</p>
<h3 id="srv-records">SRV Records</h3>
<p><code class="language-plaintext highlighter-rouge">SRV</code> records are kind of neat. They tell the address(es) of something,
but also on which doors to knock. They also provide concepts of
“weight” and “priority.”</p>
<p>Jumping really quick into an example, let’s say you want to IM
someone. You’re logged into gmail and you want to talk to
<code class="language-plaintext highlighter-rouge">example@jabber.org</code>. Well, the first thing google’s going to want to
know is how to connect to this service. Specifically, the XMPP
service over TCP. To find that out, it issues an <code class="language-plaintext highlighter-rouge">SRV</code> query against
<code class="language-plaintext highlighter-rouge">_xmpp-server._tcp.jabber.org.</code> and gets this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_xmpp-server._tcp.jabber.org. 900 IN SRV 31 31 5269 fallback.jabber.org.
_xmpp-server._tcp.jabber.org. 900 IN SRV 30 30 5269 hermes.jabber.org.
_xmpp-server._tcp.jabber.org. 900 IN SRV 30 30 5269 hermes6.jabber.org.
[...and in the extras section]
hermes.jabber.org. 900 IN A 208.68.163.221
hermes6.jabber.org. 900 IN A 208.68.163.221
hermes6.jabber.org. 900 IN AAAA 2605:da00:5222:5269::2:1
fallback.jabber.org. 900 IN A 208.68.163.218
</code></pre></div></div>
<p>There’s an equally low priority for <code class="language-plaintext highlighter-rouge">hermes</code> and <code class="language-plaintext highlighter-rouge">hermes6</code>, so google
will try one of those first. <code class="language-plaintext highlighter-rouge">hermes6</code> has two IP addresses, so it
may try all three of those addresses before trying <code class="language-plaintext highlighter-rouge">fallback.</code></p>
<p>These lookups are done magically not only by servers communicating in
XMPP, but also clients that want to talk to XMPP. Someone publishes
the connection details and we’re all good to go.</p>
<h2 id="now-you-fully-understand-global-dns">Now You Fully Understand Global DNS</h2>
<p>OK, if you came here not knowing much about DNS, you still don’t, but
that’s OK. My main point is that often when people who think about
DNS think about DNS, they are thinking about what they want to publish
and how things are going to find them.</p>
<p>The exceptions here are in <a href="http://www.dns-sd.org/">DNS-SD</a> and, to a degree,
<a href="http://www.multicastdns.org/">mDNS</a>. You’ve probably interacted with both of these when you
ask your computer to find a printer or someone tells you to look at
something on his laptop (in my case, that’s <code class="language-plaintext highlighter-rouge">dustinnmb.local.</code>).
These magical discovery protocols are pretty awesome for ad-hoc
services, and with properly administered DNS-SD, even globally
advertised services.</p>
<p>But I came here to talk about something I’m doing that’s just a little
bit different. Probably not different enough to justify the hour or
two I spent today trying to make it work, but interesting for me.</p>
<h2 id="dns-for-self-organizing-services">DNS for Self-Organizing Services</h2>
<p><a href="http://dustin.github.com/2012/09/27/cbfs.html">cbfs</a> is a storage service that, if we get it all right, blurs
the line between administered and magic. The servers need a bit of
configuration to know where to coordinate, but after that, clients can
pretty much pick any one of them to work with.</p>
<p>At home, I have a couple of nodes that are going to remain
“permanent,” but intend to have a few others coming and going as I
experiment with things.</p>
<p>The thing that’s a little difficult is figuring out which node I
should talk to when things go wrong. And if I want to use a service
name (as opposed to just always hitting the same host I know is
running the service), what do I point it to? And what do I do when
that host goes down? And even when everything’s mostly stable, what’s
the best machine to talk to do the thing I want to do <em>right now</em>?</p>
<p>Because of these questions, I had the absolutely ridiculous idea to
make cbfs its own DNS server.</p>
<p>*sigh*</p>
<p>It’s useful, though. cbfs is actively monitoring the cluster, knows
what nodes are in it, out of it, when nodes start to die, it can
respond instantly, etc… If I plug in a node, I want clients to find
it instantly, and I use my web browser and curl as clients a lot, so
I’d like it to work there, too.</p>
<p>For this, I did two things:</p>
<h3 id="srv-records-1">SRV Records</h3>
<p>Firstly, you can make an <code class="language-plaintext highlighter-rouge">SRV</code> request as a “smart” client for
<code class="language-plaintext highlighter-rouge">_cbfs._tcp.[domain].</code> to get the current list of nodes <em>and</em> a
recommendation for which node to talk to at that point in time.</p>
<p>Here’s an example from my network at home (I’m abbreviating the
query’s name to <code class="language-plaintext highlighter-rouge">$q</code> just to keep the lines short enough to read):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; ANSWER SECTION:
$q 5 IN SRV 2 0 8484 dustinnmb.cbfs.west.spy.net.
$q 5 IN SRV 3 5 8484 bigdell.cbfs.west.spy.net.
$q 5 IN SRV 0 1 8484 z.cbfs.west.spy.net.
$q 5 IN SRV 1 1 8484 menudo.cbfs.west.spy.net.
;; AUTHORITY SECTION:
cbfs.west.spy.net. 3600 IN NS ns.west.spy.net.
;; ADDITIONAL SECTION:
z.cbfs.west.spy.net. 60 IN A 192.168.1.38
menudo.cbfs.west.spy.net. 60 IN A 192.168.1.97
dustinnmb.cbfs.west.spy.net. 60 IN A 192.168.1.113
bigdell.cbfs.west.spy.net. 60 IN A 192.168.1.135
ns.west.spy.net. 3600 IN A 192.168.1.40
</code></pre></div></div>
<p>This calls for a bit of explanation.</p>
<p>Firstly, <code class="language-plaintext highlighter-rouge">ns.west.spy.net.</code> is my primary name server at home. It’s
an off-the-shelf <a href="https://www.isc.org/software/bind">bind</a> instance running on OpenBSD (at least,
some strange embedded OS I based off of OpenBSD at some point in the
past). This is an administration point where I go and enter RRs for
services I want to offer. It serves both my internal and external
<code class="language-plaintext highlighter-rouge">west.spy.net.</code> domains (which are different).</p>
<p>Internally, I want to provide service for <code class="language-plaintext highlighter-rouge">cbfs.west.spy.net.</code>, but I
want it to be magical and dynamic. I also don’t want to run cbfs as
root, so I have DNS bound to a different port. No problem at all, I
just forward to a couple of known cbfs servers with the cbfs DNS
service running using the following config:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>zone "cbfs.west.spy.net." {
type forward;
forwarders {
192.168.1.38 port 8453;
192.168.1.97 port 8453;
};
};
</code></pre></div></div>
<p>This should be obvious, but it basically just causes DNS queries for
that zone to proxy through and hit my cbfs server.</p>
<p>The rest of the stuff from above is all dynamic and coming out of
cbfs’ internal state. You’ll notice the priorities in the answer
section are different (they’re sequential, though the records aren’t
returned in order). These priorities are the same priorities cbfs uses
for data distribution internally. They’re approximately (but
intentionally not exactly) prioritized by heartbeat recency.</p>
<p>Currently, the weight is unused as the priorities order the node usage
absolutely, but since I just hacked this thing together, I’m likely to
do something different after I play with it a bit.</p>
<p>One thing to note is that those are not “hostnames” in the
conventional sense, but just the things I passed to the <code class="language-plaintext highlighter-rouge">-nodeID</code>
parameter to cbfs. cbfs itself creates the hostname glue and does all
that magic.</p>
<h3 id="a-records-1">A Records</h3>
<p>“But wait!”, you say, “I thought you wanted this to work with your
browser and curl. They won’t do these SRV lookups!”</p>
<p>Correct, so cbfs also responds to <code class="language-plaintext highlighter-rouge">A</code> or <code class="language-plaintext highlighter-rouge">ANY</code> queries by returning a
handful of <code class="language-plaintext highlighter-rouge">A</code> records to make these other clients happy. Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>;; QUESTION SECTION:
;cbfs.west.spy.net. IN A
;; ANSWER SECTION:
cbfs.west.spy.net. 5 IN A 192.168.1.38
cbfs.west.spy.net. 5 IN A 192.168.1.135
cbfs.west.spy.net. 5 IN A 192.168.1.97
</code></pre></div></div>
<p>In this case, I just asked for “the address” of <code class="language-plaintext highlighter-rouge">cbfs.west.spy.net.</code>
and it gave me three, any one of which is expected to be happy to
answer any query I might throw at it.</p>
<p>The keen-eyed reader notes this is fewer than the four listed above.
I arbitrarily decided to kill a node while writing one of the
paragraphs above. Things kept going fine.</p>
<h2 id="in-conclusion">In Conclusion</h2>
<p>I probably could (and will) accomplish much of the same with mDNS and
I don’t think providing name resolution services with DNS is
particularly novel, but this was a fun hack I did for my birthday and
I hope it inspires someone to do something better.</p>
CBFS2012-09-27T00:00:00+00:00http://dustin.github.com/2012/09/27/cbfs<h1 id="cbfs---a-couchbase-large-object-store">cbfs - a couchbase large object store</h1>
<div>
<img src="/images/toys.jpg" alt="where do i get all these wonderful toys?" title="Where Do I Get all these Wonderful Toys?" class="floatright" />
</div>
<p>Thirteen days ago, <a href="http://hexeditreality.com/">Marty Schoch</a> and I created an empty
directory called <code class="language-plaintext highlighter-rouge">cbfs</code> and started typing some <a href="http://golang.org/">go</a> code into
it. The idea was to create an answer to the frequent question, “how
do I store large items in <a href="http://www.couchbase.com/">Couchbase</a>?”</p>
<p>I think we’re both pretty pleased with the results and would like to
share what we’ve made a bit more broadly.</p>
<p><a href="http://github.com/couchbaselabs/cbfs">cbfs</a> is essentially a read/write HTTP server with a minimal
<a href="https://github.com/couchbaselabs/cbfs/wiki/Protocol">RESTful API</a> for getting a bit more meta information out of
it and serving apps on couchbase. We had a few goals and were able to
demonstrate almost all of them within a week of birth. Things like:</p>
<ul>
<li>No single point of failure.</li>
<li>No single point of contention.</li>
<li>No limits to the size or volume of data that can be served.</li>
<li>No invalid states to reconcile.</li>
<li>Adding and removing nodes and other management tasks should be
as easy as possible.</li>
</ul>
<p>Also, we wanted to be able to bring back something along the spirit of
<a href="http://couchapp.org/">couchapps</a> (but much easier!).</p>
<p>We haven’t written up enough on it yet, but I did do a demonstration
to our team today in a google hangout. You can see that quick intro
in the first half of the following video.</p>
<iframe width="560" height="315" src="http://www.youtube.com/embed/YLTXdrvYITA" frameborder="0" allowfullscreen="1">
</iframe>
<p>The <a href="http://labs.couchbase.com/cbfs/">slides</a> are available as well. They are meant to be
served from a cbfs instance doing the demo, but I made the demos also
work when offline.</p>
<p>Some neat stuff has been built around this since we started.</p>
<p>Trond Norbye wrote a <a href="http://trondn.blogspot.com/2012/09/yanfs-yet-another-network-file-system.html">FUSE interface</a> so we could mount it
locally. I’m synchronizing our <a href="http://db.tt/0bYIeqqB">Dropbox</a> stuff into cbfs and
watched Trond browse around in a terminal and interact with the files
as if they were local.</p>
<p>Marty’s written a pretty awesome <a href="https://github.com/couchbaselabs/cbfs-admin">admin console</a>.
I’ve done some demos of other stuff built on the API as well
(including the built-in cluster monitoring console and the commandline
tool).</p>
<p><img src="/images/cbfs-admin-600.png" alt="cbfs admin" class="centered" /></p>
<p>The <a href="https://github.com/couchbaselabs/cbfs/wiki">wiki page</a> was mostly written in the first four days, but
describes what the idea was pretty well. It’s a lot more complete
now, though the <a href="https://github.com/couchbaselabs/cbfs/issues">issues list</a> shows where we want to add more
polish.</p>
<div>
<img src="/images/cbfs-small-monitor.png" alt="monitor view" class="floatleft" />
</div>
<p>I’m running a cluster at home and in the office and serving content
and apps out of it, so it’s definitely self-hosting and stuff.</p>
<p>Come join us, help finish tasks, think of new ones, find ways you
think it could be better…</p>
<p>But, as a reminder, this isn’t two weeks old yet, so if you want
something that’s actually been in production with PBs of data, look at
<a href="http://code.google.com/p/mogilefs/">mogilefs</a>. We wrote this because we wanted it to exist and
wanted to have some answers for specific questions we’d been asked.</p>
Airing My Dirty Laundry2012-09-16T00:00:00+00:00http://dustin.github.com/2012/09/16/wash<h1 id="on-laundry">On Laundry</h1>
<p>Of the various housework I have to do, laundry is actually not that
bad. I have this great machine I bought a long time ago that does
most of the work. I just have to remember to put things in it and
we’re good. I have another machine that takes the clean things and
transforms them into dry things. It’s a magical experience, tainted
only by the short attention span that leads me to forget to transfer the
laundry from the cleaning machine to the drying machine.</p>
<p>I set out to solve this problem using my only skills, but I have a
problem. How do I program an old washing machine?</p>
<h1 id="feeling-it">Feeling It</h1>
<div>
<img src="/images/vibsense.png" alt="vibration sensor" title="Vibration Sensor" class="floatright" />
</div>
<p>My first idea was to use a <a href="https://www.sparkfun.com/products/9196">vibration sensor</a> attached to the
laundry devices to detect when they’re doing something.</p>
<p>This is more complicated than it sounds because the vibration can be a
bit subtle during some of the cycles and getting reliable signal out
of these sensors at the levels I needed seemed just a little too
indirect.</p>
<h1 id="power-measurement">Power Measurement</h1>
<p>“I know, I’ll just measure the power it’s using.”</p>
<div>
<img src="/images/ct.png" alt="current transformer" title="Current Transformer" class="floatleft" />
</div>
<p>I have awesome tools like the “<a href="http://www.p3international.com/products/special/p4400/p4400-ce.html">Kill A Watt</a>” that do much
of what I need. I just need to get the data out. Adafruit has
<a href="http://www.ladyada.net/make/tweetawatt/">Tweet-a-Watt</a>, which is a great concept, but really
expensive, and still not doing <em>exactly</em> what I want. So I started
looking into building my own.</p>
<p>I looked around for the most basic thing I could find and found this
cheap <a href="http://www.amazon.com/013-030-Output-Non-invasive-Current-Transformer/dp/B005CTWE8A/">current transformer</a> on Amazon. Of course, I had no clue
whatsoever how to use a current transformer. I read a lot about them
and there were lots of documents that made me feel like I was going to
tear a hole in space and time if I didn’t properly cross the pins.
This device is like the anti-Gozer.</p>
<p>As an uneducated software hacker, it took me a lot of trial and error
to get the right circuit designed for this thing. At the time, I did
a lot of the work on my iPad, using <a href="http://icircuitapp.com/">iCircuit</a> which I
strongly recommend. I used that to design (and test) the following
circuit:</p>
<p><img src="/images/ct-circuit.png" alt="circuit" title="Current Transformer Circuit" class="centered" /></p>
<p>I can’t stress how useful iCircuit was. My notes are filled with
readings from scopes where I had attempted something and didn’t quite
get the right result. The cases that mattered most were sudden
spikes, limits and endurance. Some of the circuits didn’t drain well,
so they’d just kind of float up after a while.</p>
<p>I don’t use my iPad much since I got my Nexus 7. iCircuit was one of
the tools I hoped I could replace. There were rumors of an Android
version coming, but I couldn’t find anything concrete. I did,
however, find <a href="https://play.google.com/store/apps/details?id=com.everycircuit">EveryCircuit</a>. This is a pretty great
piece of software. It’s missing a couple of features I hope will make it in,
but it also does a few things much better than iCircuit. It’s a grand
age for making things.</p>
<div>
<img src="/images/ct-scope.png" alt="" title="Current Transformer Scoped" class="floatleft" />
</div>
<p>Once I got a circuit that was good in theory, it was time to
breadboard it and try it with some real load.</p>
<p>As I mentioned above, I found that jolting the CT with a sudden load
would cause it to bounce way off the charts in early testing. This is
sort of the electronic equivalent of a stack overflow, except instead
of crashing my program, it burns down my house.</p>
<p>I had incentive to get this right.</p>
<p>There was a lot of simulated testing, then a lot of breadboard testing
and then I wanted something I could actually deploy in my garage. I
went over to <a href="http://www.halted.com/">halted</a> and found some decent prototyping boards
and ended up with something that was a bit more rigid than the
breadboard.</p>
<p><img src="/images/ct-final.jpg" alt="final build" title="Current Transformer Circuit Build" class="centered" /></p>
<p>I don’t have a picture of the actual final product, which is
unfortunate. I went through a few different debugging strategies.
First, I would look at things just through the console. That’s very
inconvenient as the device is hooked up in my garage. The radio
signal wasn’t always great, and I’d affect it by getting close to it
(insert obvious Heisenbug joke). I added a light so I wouldn’t have
to get too close and hook up a computer and stuff. Then I wanted to
know more than one thing, so I’d have the light blink at different
rates.</p>
<p>In the end, I hooked up a 2x16 character LCD so I could just print out
whatever the sensor and radio states were. That was <em>really</em> helpful,
but I damaged it in the final installation so that it’s basically
useless now. It was good enough to get it going, though.</p>
<h1 id="building-a-reader">Building a Reader</h1>
<div>
<img src="/images/rfm12b.jpg" alt="rfm12b" title="RFM 12b" class="floatleft" />
</div>
<p>At this point, I have hardware that converts the magnetic field
induced by the appliance’s power draw into 0-n volts. Originally I was
aiming for 5V, but after looking at wireless options, I decided to
give <a href="http://shop.moderndevice.com/products/jeenode-kit">JeeNode</a> a shot. They run at 3.3V but are otherwise
pretty much <a href="http://www.arduino.cc/">Arduino</a> compatible, while featuring a small, low-power
and, most interestingly, cheap radio - the <a href="http://shop.moderndevice.com/products/rfm12b-radio">rfm12b</a>.</p>
<p>Getting things up and running was pretty easy. One problem I found in
using these radios is that they’re pretty low-level. Robust protocols
don’t come for free.</p>
<p>For this, my needs were pretty simple. The project evolved just a
bit, but essentially I send a packet out with an 8-bit sequence ID and
I keep sending the same sequence ID until the other end responds. I
really only want to know if the device has been on since the last time
I looked, so I send the most recent reading off of the sensor and the
highest value I’ve read since I received an ACK. Every transmission
requests an ACK, but there’s often tons of interference (in both
directions) so each side has to transmit and be heard by the other
before the needle is moved.</p>
<p>Internally, the sensor updates every second. It transmits at least
once every 10s and once on every change since the last ACKd value.</p>
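<p>Sketched in go (the language used elsewhere in this project), the sensor-side bookkeeping described above might look like the following. This is a hypothetical model of the retry logic only; the actual firmware is Arduino code running on the JeeNode, and all names here are made up.</p>

```go
package main

import "fmt"

// sensorState models the protocol described above: keep repeating the
// same sequence ID until it's ACKed, and carry both the latest reading
// and the highest reading seen since the last ACK.
type sensorState struct {
	seq      uint8 // sequence ID, reused until the reader ACKs it
	latest   int   // most recent sensor reading
	maxSince int   // highest reading since the last ACK
}

// read records a new sample (once per second on the real device).
func (s *sensorState) read(v int) {
	s.latest = v
	if v > s.maxSince {
		s.maxSince = v
	}
}

// packet is what gets transmitted: the same seq repeats until ACKed.
func (s *sensorState) packet() (uint8, int, int) {
	return s.seq, s.latest, s.maxSince
}

// onAck handles an ACK from the reader; only then does the window reset.
func (s *sensorState) onAck(acked uint8) {
	if acked == s.seq {
		s.seq++
		s.maxSince = s.latest
	}
}

func main() {
	s := &sensorState{}
	s.read(7)
	s.read(3) // latest drops, but maxSince keeps the peak until an ACK
	seq, latest, max := s.packet()
	fmt.Println(seq, latest, max) // 0 3 7
	s.onAck(seq)
}
```

<p>The point of keeping the peak alongside the latest value is that a lost packet can’t hide a spike: the peak only resets once an ACK makes the round trip.</p>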
<h1 id="making-it-useful">Making it Useful</h1>
<div>
<img src="/images/jeelink.jpg" alt="jeelink" title="JeeLink" class="floatright" />
</div>
<p>On the reader side, I have a really simple read-only protocol that
converts the stuff in the air to RS232 through a USB interface using a
simple <a href="http://golang.org/">go</a> program that does a few things:</p>
<ol>
<li>Serves the readings up over HTTP.</li>
<li>Lights up stuff in my living room telling me the state of the
laundry.</li>
<li>Sends out alerts with <a href="https://www.notifymyandroid.com/">NotifyMyAndroid</a> so my phone and Nexus
7 start beeping when things change.</li>
</ol>
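<p>A minimal sketch of the first item, serving the latest reading over HTTP. The <code>reading</code> type, field names, and URL path are all hypothetical; the real program also reads the serial port and drives the lights and notifications.</p>

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

// reading is a hypothetical stand-in for the state parsed off the radio.
type reading struct {
	Latest, MaxSinceAck int
	On                  bool
}

var (
	mu      sync.Mutex
	current reading
)

// update would be called by the serial-reading loop (not shown here).
func update(r reading) {
	mu.Lock()
	current = r
	mu.Unlock()
}

// stateHandler serves the most recent reading as JSON.
func stateHandler(w http.ResponseWriter, _ *http.Request) {
	mu.Lock()
	defer mu.Unlock()
	json.NewEncoder(w).Encode(current)
}

func main() {
	http.HandleFunc("/state", stateHandler)
	// In the real program, this would block serving requests:
	// http.ListenAndServe(":8080", nil)
}
```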
<p>I had to start by creating an <a href="https://github.com/dustin/rs232.go">RS232</a> interface for go. I’ve
been able to use this for a couple of projects now (hopefully I can
write about one of the others, because it’s pretty awesome).</p>
<p>The <a href="https://github.com/dustin/nma.go">NotifyMyAndroid interface for go</a> alone has kept me from
having to re-wash laundry that sat in the washer too long.</p>
<p><a href="https://github.com/dustin/washer">The source</a> to the sensor, reader firmware and go parts are
all available, though I can’t guarantee they’re 100% ready to deploy
for anyone who isn’t me. If anyone wants to try something similar,
I’ll gladly help, though.</p>
Seriesly Internals Tour · 2012-09-13 · http://dustin.github.com/2012/09/13/inside-seriesly<h1 id="seriesly-internals-tour">Seriesly Internals Tour</h1>
<p>My previous <a href="/2012/09/09/seriesly.html">blog post</a> introduced a document-oriented time
series database I wrote. It’s introductory material that somewhat
describes why the software exists and a bit about how to use it.</p>
<p>Here I describe how it works. As mentioned before, the bulk of the
software was in a useful state in two weeks. I attribute much of this
success to <a href="http://golang.org/">go</a>. <a href="https://github.com/dustin/seriesly">The code</a> itself contains the details of
my knowledge, but I’ll highlight some of the parts that helped the
most in the coming paragraphs.</p>
<p>The process described in the following paragraphs <em>generally</em> occurs
in under a millisecond in my installations. On a sufficiently large,
cold query, it could theoretically take several minutes.</p>
<h1 id="logical-overview">Logical Overview</h1>
<div>
<img src="/images/seriesly-logical.png" alt="logical" title="Logical flow of a query" class="floatright" />
</div>
<p>The query model is as simple as I could make it while still covering
all of my current use cases.</p>
<p>The diagram to the right represents a logical overview of a query
flowing through the system.</p>
<h2 id="understanding-the-query">Understanding the Query</h2>
<p>First, the HTTP query is parsed and put into a struct that represents
the information that’s needed. Here we validate and parse the <code class="language-plaintext highlighter-rouge">from</code>
and <code class="language-plaintext highlighter-rouge">to</code> timestamps, the grouping, the reducers, etc…</p>
<p>Also, we place a <code class="language-plaintext highlighter-rouge">before</code> timestamp that is the latest possible time
for us to do any work on this query (queries that would run longer
than the server’s configured max query time are aborted even if
partially processed).</p>
<p>Most importantly, there’s some output data in the struct. The
following example should aid in understanding. Start with an input
query:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/testdb/_query?from=1347469924&to=2013&group=3600000&ptr=/a&reducer=min
</code></pre></div></div>
<p>That will map to a <code class="language-plaintext highlighter-rouge">queryIn</code> struct that looks like this:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="n">queryIn</span><span class="p">{</span>
<span class="n">dbname</span><span class="o">:</span> <span class="s">"testdb"</span><span class="p">,</span>
<span class="n">from</span><span class="o">:</span> <span class="n">time</span><span class="o">.</span><span class="n">Date</span><span class="p">(</span><span class="m">2012</span><span class="p">,</span> <span class="m">9</span><span class="p">,</span> <span class="m">12</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">12</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">UTC</span><span class="p">),</span>
<span class="n">to</span><span class="o">:</span> <span class="n">time</span><span class="o">.</span><span class="n">Date</span><span class="p">(</span><span class="m">2013</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">1</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">time</span><span class="o">.</span><span class="n">UTC</span><span class="p">),</span>
<span class="n">group</span><span class="o">:</span> <span class="m">3600000</span><span class="p">,</span>
<span class="n">ptrs</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"/a"</span><span class="p">},</span>
<span class="n">reds</span><span class="o">:</span> <span class="p">[]</span><span class="kt">string</span><span class="p">{</span><span class="s">"min"</span><span class="p">},</span>
<span class="c">// Control, logging, timeout, etc...</span>
<span class="n">start</span><span class="o">:</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">(),</span>
<span class="n">before</span><span class="o">:</span> <span class="n">time</span><span class="o">.</span><span class="n">Now</span><span class="p">()</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="n">queryTimeout</span><span class="p">),</span>
<span class="n">started</span><span class="o">:</span> <span class="m">0</span><span class="p">,</span>
<span class="c">// Output channels</span>
<span class="n">out</span><span class="o">:</span> <span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="o">*</span><span class="n">processOut</span><span class="p">),</span>
<span class="n">cherr</span><span class="o">:</span> <span class="nb">make</span><span class="p">(</span><span class="k">chan</span> <span class="kt">error</span><span class="p">),</span>
<span class="p">}</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">started</code> int is a counter that’s incremented (atomically) each
time a grouping is located. All of the final responses are fed into
the <code class="language-plaintext highlighter-rouge">out</code> channel. Once the query worker completes walking the tree,
it will send its return value (generally <code class="language-plaintext highlighter-rouge">nil</code>) into <code class="language-plaintext highlighter-rouge">cherr</code>,
informing the HTTP handler that all groupings have been found and
<code class="language-plaintext highlighter-rouge">started</code> will no longer increase.</p>
<h2 id="locating-the-relevant-documents">Locating the Relevant Documents</h2>
<p>The query processor is asynchronously walking the docs to
locate the ones that apply to the query. Once the query is submitted,
the HTTP handler no longer pays any attention to the query directly.</p>
<p>But the actual query processing is where we start needing to tighten
resource controls. In theory, we don’t need more queries digging
through storage than we have the resources to process. In practice
(it’s all configurable), oversubscribing doesn’t seem to make anything
faster, but leads to a lot of memory bloat as more things are stuck in
progress.</p>
<p>Without delving too deeply into <a href="http://en.wikipedia.org/wiki/Queueing_theory">queueing theory</a>, let’s just
imagine you’re at a well-organized bank where you’ve got a single line
and four identical tellers.</p>
<div style="clear: both; margin: 5px;">
<img src="/images/cashier.png" alt="cashier" class="floatleft" />
<img src="/images/cashier.png" alt="cashier" class="floatleft" />
<img src="/images/cashier.png" alt="cashier" class="floatleft" />
<img src="/images/cashier.png" alt="cashier" class="floatleft" />
</div>
<p>The number of workers is fixed throughout the lifetime of a process
and each has one simple job:</p>
<blockquote>
<p>to transform a query to a collection of grouped document IDs to be
handled by a doc processor on behalf of a client</p>
</blockquote>
<p>That’s it. One of the workers takes the request off the queue, starts
digging through documents between the <code class="language-plaintext highlighter-rouge">from</code> and <code class="language-plaintext highlighter-rouge">to</code> range of the
query emitting a batch when it crosses every boundary defined by
<code class="language-plaintext highlighter-rouge">group</code> and passes it down to a doc processor.</p>
<p>It’s important to note that both the queues <em>into</em> and <em>out of</em> the
query processor are <strong>blocking</strong>, at least, eventually. By default,
an HTTP client will block on submitting its query until a worker is
available and a query worker will block on submitting a batch to doc
workers until a doc worker is available. Both of these things are
configurable, but in practice, letting work build up between these
areas increases memory overhead without getting more work done sooner.</p>
<h2 id="processing-documents">Processing Documents</h2>
<div>
<img src="/images/seriesly-queryfrag.png" alt="query fragment" class="floatright" />
</div>
<p>The document processors are another group of identical workers with
another fairly simple job:</p>
<blockquote>
<p>to read a collection of documents, extract relevant information from
them, and reduce the extracted information to an output value</p>
</blockquote>
<p>The figure to the right shows the role it performs on a portion of
the example document from <a href="/2012/09/09/seriesly.html">the intro blog post</a>.</p>
<p>Before it begins pulling documents, it creates one goroutine for each
reduction that was requested for the result. I’ll describe this
pattern more below, but at this point it’s important to understand
that we’re going to be streaming data through the reducers and not
accumulating and batch processing results.</p>
<p>The most expensive things it does are also least interesting, so I’m
going to be a bit hand-wavy. It fetches the document from disk (which
is pretty much <code class="language-plaintext highlighter-rouge">pread</code>) and parses it as a <code class="language-plaintext highlighter-rouge">json</code> document.</p>
<p>Ignoring all the gory details of coercion, error handling, etc… that
gives us a loop that looks like this:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="c">// Build the slice of reducers</span>
<span class="n">reducers</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">:=</span> <span class="n">makeReducers</span><span class="p">(</span><span class="n">reds</span><span class="p">)</span>
<span class="c">// Pull each document from storage</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">data</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">docs</span> <span class="p">{</span>
<span class="n">jsondoc</span> <span class="o">:=</span> <span class="n">parseJson</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">jsonpointer</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">ptrs</span> <span class="p">{</span>
<span class="n">reducers</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="n">extractValue</span><span class="p">(</span><span class="n">jsondoc</span><span class="p">,</span> <span class="n">jsonpointer</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c">// Close reducer input channels so they complete</span>
<span class="n">closeAll</span><span class="p">(</span><span class="n">chans</span><span class="p">)</span>
<span class="c">// Read the output of the channels and send it all back</span>
<span class="n">results</span> <span class="o"><-</span> <span class="n">recvItemFromAllChannels</span><span class="p">(</span><span class="n">chans</span><span class="p">)</span></code></pre></figure>
<p>Due to the way this is streamed, we never require more than one
document in memory per document worker. A typical reducer for gauge
values doesn’t need more than its return value resident.</p>
<p>Closing the reducer input channels causes them to emit their results
on an output channel. Harvesting all these results for a “row” joins
all the goroutines and allows us to emit the value back to the HTTP
worker so it can transmit the results (described in further detail
below).</p>
<p><em>“But wait,”</em> you say, <em>“what are these goroutines that are being
joined?”</em></p>
<p><strong>Every column in every row is a goroutine</strong>. e.g. a 2,000 row result
with 100 pointers and reducers applied requires <em>at least</em> 200,000
short-lived goroutines to be started to model this concurrency.</p>
<p>Now, I imagine many of you are wondering why I’d make reducers be
functions that are run in separate goroutines that keep their state on
the stack and then consume a channel of input. Why wouldn’t I just
make them structs where their state is held as a field and then just
have a method called on the input to modify the state?</p>
<p>Because that’s hard. Consider the <code class="language-plaintext highlighter-rouge">c_min</code> reducer to understand why:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">func</span><span class="p">(</span><span class="n">input</span> <span class="k">chan</span> <span class="n">ptrval</span><span class="p">)</span> <span class="k">interface</span><span class="p">{}</span> <span class="p">{</span>
<span class="n">rv</span> <span class="o">:=</span> <span class="n">math</span><span class="o">.</span><span class="n">NaN</span><span class="p">()</span>
<span class="k">for</span> <span class="n">v</span> <span class="o">:=</span> <span class="k">range</span> <span class="n">convertTofloat64Rate</span><span class="p">(</span><span class="n">input</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="n">v</span> <span class="o"><</span> <span class="n">rv</span> <span class="o">||</span> <span class="n">math</span><span class="o">.</span><span class="n">IsNaN</span><span class="p">(</span><span class="n">rv</span><span class="p">)</span> <span class="p">{</span>
<span class="n">rv</span> <span class="o">=</span> <span class="n">v</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">rv</span>
<span class="p">}</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">c_min</code> reducer computes the minimum growth rate (per second) of a
stream of pointer values on the given channel. The trick is that in
order to know the rate at a given instance, I need to know both the
<em>next</em> value and the timestamp of the next value so I can divide the
growth rate over the time delta.</p>
<p>It’s all obvious once you consider how <code class="language-plaintext highlighter-rouge">convertTofloat64Rate</code> must
work. This is a function that takes a stream of <code class="language-plaintext highlighter-rouge">ptrval</code>s and emits a
stream of <code class="language-plaintext highlighter-rouge">float64</code>s that are flattened rates over that range. It
does all of the hard work around finding the first suitable value
(skipping <code class="language-plaintext highlighter-rouge">nil</code> values where the pointer didn’t point to valid data)
and computing the delta from each previous value to emit. By
definition, this always consumes more values than it emits.</p>
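<p>A hedged sketch of what such a converter could look like as a goroutine, with hypothetical names and a simplified sample type (the real one deals with <code>ptrval</code>s, skipping <code>nil</code> values, and type coercion):</p>

```go
package main

import "fmt"

// sample is a simplified stand-in for a ptrval: a timestamp (seconds)
// and the extracted numeric value.
type sample struct {
	ts  float64
	val float64
}

// ratesOf consumes (timestamp, value) pairs and emits per-second growth
// rates between consecutive samples. It always emits one fewer value
// than it consumes, since the first sample only establishes a baseline.
func ratesOf(in <-chan sample) <-chan float64 {
	out := make(chan float64)
	go func() {
		defer close(out)
		prev, ok := <-in // first suitable value: nothing to emit yet
		if !ok {
			return
		}
		for s := range in {
			// delta value over delta time = per-second rate
			out <- (s.val - prev.val) / (s.ts - prev.ts)
			prev = s
		}
	}()
	return out
}

func main() {
	in := make(chan sample, 2)
	in <- sample{0, 10}
	in <- sample{2, 30} // +20 over 2s
	close(in)
	for r := range ratesOf(in) {
		fmt.Println(r) // 10
	}
}
```

<p>Composed with a consumer like <code>c_min</code>, this is exactly the shell-pipeline shape described below, just with channels in place of pipes.</p>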
<p>Perhaps my brain is twisted by decades of living in unix shells, but
I do the same thing processing data in the shell:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% grep field fromfile | ./computerate | ./min
</code></pre></div></div>
<p>Except goroutines are way cheaper than threads (which are way cheaper
than processes), so the way that’s natural to me also performs quite
well. This is certainly <em>possible</em> to do with OO-style state
management, but I don’t want to write that code.</p>
<p>As a bonus, <code class="language-plaintext highlighter-rouge">convertTofloat64Rate</code> (and its older brother
<code class="language-plaintext highlighter-rouge">convertTofloat64</code>) creates a goroutine that consumes the input
channel in order to emit over its output channel. That means that if
all 200,000 of the results were numeric, it’d require 400,000
goroutines.</p>
<h2 id="transmitting-results">Transmitting Results</h2>
<p><em>Meanwhile, back in the HTTP handler…</em></p>
<p>After it submitted the request, the HTTP handler started a <code class="language-plaintext highlighter-rouge">select</code>
loop across the two channels built at query time. We’ll call them the
<code class="language-plaintext highlighter-rouge">error</code> channel and <code class="language-plaintext highlighter-rouge">results</code> channel.</p>
<p>When the error channel has received its completion message and there
has been one grouping output for every grouping found by the walker
(the counter mentioned above), the query processing is complete.</p>
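<p>That completion condition can be sketched as a small select loop. The types and the <code>started()</code> accessor here are hypothetical stand-ins for the atomic counter and the channels described above:</p>

```go
package main

import "fmt"

// collect models the HTTP handler's loop: keep reading results until the
// walker has reported completion on the error channel AND one output has
// arrived for every grouping it started.
func collect(results <-chan int, errs <-chan error, started func() int) ([]int, error) {
	var out []int
	var done bool
	var err error
	for !done || len(out) < started() {
		select {
		case r := <-results:
			out = append(out, r) // the real server streams each one immediately
		case e := <-errs:
			done, err = true, e // walker finished; started() is now final
		}
	}
	return out, err
}

func main() {
	results := make(chan int, 2)
	errs := make(chan error, 1)
	results <- 1
	results <- 2
	errs <- nil
	out, _ := collect(results, errs, func() int { return 2 })
	fmt.Println(len(out)) // 2
}
```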
<p>The HTTP layer currently emits docs as soon as they come out of the
results channel, which brings two great benefits:</p>
<p><strong>The first result latency is as low as possible.</strong> Once a result is
known, we immediately transmit it. Early versions batched up the
results and would get similar throughput, but I decided it was
unnecessary.</p>
<p><strong>Memory usage is as low as possible.</strong> Since I’m not batching up the
results (and more importantly, not serializing them in a single
encoder call), memory usage is down to the minimum required to pass
the necessary information around and encode it for the wire.</p>
<p>However, these benefits come with two consequences:</p>
<p><strong>Groupings are not returned to the HTTP requester in any defined
order.</strong> In fact, if you run the same query twice, you will get the
same results, but with different time windows appearing in different
locations in the result. This is especially strange if you stream
compressed data from the server and see it coming back as different
sizes with the same canonical document representation.</p>
<p><strong>There’s no clean way to report an error in an HTTP stream.</strong> After I
send HTTP <code class="language-plaintext highlighter-rouge">200</code> and start streaming data, it’s too late to realize
there’s a problem and call it a <code class="language-plaintext highlighter-rouge">500</code>. The best I can do is hang up
and leave you with an invalid document.</p>
<p>I struggled with this one a bit, but not having to buffer into memory,
reducing latency, etc… while not having a great answer for large
broken queries seemed like a good trade-off. Most ways I could figure
to send someone an error in-stream in the rare case where one happens
would complicate the user experience. With this approach, even
completely cold queries return results pretty much immediately.</p>
<h1 id="about-storage">About Storage</h1>
<div>
<img src="/images/seriesly-io.png" alt="query fragment" class="floatright" />
</div>
<p>Seriesly uses <a href="https://github.com/couchbase/couchstore">couchstore</a> for the backend. Depending on
how intimately you know <a href="http://couchdb.apache.org/">CouchDB</a>, you can think of it as a C
implementation of the core storage of CouchDB. Except, of course, I’m
using my <a href="https://github.com/dustin/go-couchstore">go bindings</a>.</p>
<p>Couchstore provides an on-disk append-only b-tree. This gives me a
durable format I can write to and read from at the same time. Neither
readers nor writers pay any attention to each other whatsoever. Each
query or doc worker opens a database for read access whenever it’s
necessary and just starts reading. If a write is happening at the
time, the reader just seeks backwards a bit in the file until it finds
a valid previous header and carries on.</p>
<p>Writes are batched by time and/or size. That is, a write targeting a
database doesn’t immediately hit the filesystem. This is mostly
beneficial when loading lots of data in bulk. Obviously both of these
parameters are configurable.</p>
<p>If most of the writes are small, you’ll want to compact the database
periodically. Compaction never blocks reads and almost never blocks
writes. There’s a buffered channel between the HTTP handler for
storing a new document and the actual DB writes. If you exceed this
buffer size, the put will block. Otherwise, writes are flushed,
compaction occurs, the DB is atomically replaced and then the
accumulated writes are completed.</p>
<h1 id="now-with-memcached">Now with Memcached</h1>
<p>The above all assumes no cache. Throwing memcached in the mix makes
things a lot more interesting. Let’s take the first diagram and fit a
cache into it:</p>
<div>
<img src="/images/seriesly-flow.png" alt="query fragment" class="center" />
</div>
<p>Now, conceptually, the cache just interposes between the query workers
and the document workers. The code is very close to that, in fact.
If you don’t have memcached enabled, the channel that the query worker
places its requests into is the document worker input itself. When
you have memcached enabled, the query worker output goes to the cache
workers (and cache misses go to the document workers).</p>
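<p>That interposition trick is easy to express with channels. A hypothetical sketch, where the batch type and lookup function are simplified stand-ins for the real structures:</p>

```go
package main

import "fmt"

// batch is a simplified stand-in for a group of document IDs.
type batch struct{ key string }

// wire returns the channel the query workers should send into. Without a
// cache it is the doc workers' input channel itself; with a cache, it is
// a cache worker that forwards only the misses.
func wire(cacheEnabled bool, docIn chan<- batch, lookup func(string) bool) chan<- batch {
	if !cacheEnabled {
		return docIn // the very same channel: queries go straight to doc workers
	}
	cacheIn := make(chan batch)
	go func() {
		for b := range cacheIn {
			if !lookup(b.key) { // miss: fall through to the doc workers
				docIn <- b
			}
			// on a hit, the cached result would go back to the HTTP handler
		}
	}()
	return cacheIn
}

func main() {
	docIn := make(chan batch, 1)
	q := wire(true, docIn, func(k string) bool { return k == "cached" })
	q <- batch{"cached"}   // absorbed by the cache worker
	q <- batch{"uncached"} // passed through on a miss
	fmt.Println((<-docIn).key) // uncached
}
```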
<div>
<img src="/images/seriesly-memcached.png" alt="query fragment" class="floatright" />
</div>
<p>What’s unusual about the cache usage here is that I’m not using any
patterns I’ve used elsewhere with memcached. It’s a little similar to
how the internals of <a href="http://code.google.com/p/spymemcached/">spymemcached</a> were implemented, but I did
some interesting binary-protocol-only stuff.</p>
<p>The diagram to the right shows the approximate anatomy of the
memcached workers and their interaction with the rest of the system.
The orange box delineates the conceptual memcached worker from the
rest of the system. Note that a memcached worker is <em>two</em> goroutines
sharing a connection to memcached. One <em>only</em> reads and one <em>only</em>
writes. I labeled the one that writes as <code class="language-plaintext highlighter-rouge">main</code> as it’s also our
interface to the cache.</p>
<p>Unlike most uses of memcached, we don’t do any “multi” type operations
like a “quiet” <code class="language-plaintext highlighter-rouge">get</code> at all. Instead, any time we need to send data
to memcached, we construct a packet and lob it over. The reader is
reading the output of that and after parsing the result into a packet
and determining that the <code class="language-plaintext highlighter-rouge">main</code> goroutine might be interested in it
(basically, this means it’s a successful <code class="language-plaintext highlighter-rouge">get</code> request), it sends it
back. These two messages are stitched together using the memcached
binary protocol <code class="language-plaintext highlighter-rouge">opaque</code> field – a 32-bit number that is application
specific and designed to enable this type of thing.</p>
<p>Pseudocode of our memcached main looks like the following:</p>
<figure class="highlight"><pre><code class="language-go" data-lang="go"><span class="k">for</span> <span class="p">{</span>
<span class="k">select</span> <span class="p">{</span>
<span class="k">case</span> <span class="n">req</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">cacheRequestChannel</span><span class="o">:</span>
<span class="c">// Generate a unique identifier for this request</span>
<span class="n">opaque</span> <span class="o">:=</span> <span class="n">nextOpaque</span><span class="p">()</span>
<span class="n">opaqueMap</span><span class="p">[</span><span class="n">opaque</span><span class="p">]</span> <span class="o">=</span> <span class="n">req</span>
<span class="n">mc</span><span class="o">.</span><span class="n">transmitGetRequest</span><span class="p">(</span><span class="n">req</span><span class="o">.</span><span class="n">key</span><span class="p">,</span> <span class="n">opaque</span><span class="p">)</span>
<span class="k">case</span> <span class="n">res</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">cacheResponseChannel</span><span class="o">:</span>
<span class="c">// Map the response back to the request by identifier</span>
<span class="n">in</span> <span class="o">:=</span> <span class="n">opaqueMap</span><span class="p">[</span><span class="n">res</span><span class="o">.</span><span class="n">Opaque</span><span class="p">]</span>
<span class="nb">delete</span><span class="p">(</span><span class="n">opaqueMap</span><span class="p">,</span> <span class="n">res</span><span class="o">.</span><span class="n">Opaque</span><span class="p">)</span>
<span class="k">if</span> <span class="n">in</span><span class="o">.</span><span class="n">err</span> <span class="p">{</span>
<span class="n">docProcessor</span> <span class="o"><-</span> <span class="n">in</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">in</span><span class="o">.</span><span class="n">response</span> <span class="o"><-</span> <span class="n">res</span>
<span class="p">}</span>
<span class="k">case</span> <span class="n">toSet</span> <span class="o">:=</span> <span class="o"><-</span><span class="n">setRequestChannel</span><span class="o">:</span>
<span class="n">mc</span><span class="o">.</span><span class="n">transmitQuietSetRequest</span><span class="p">(</span><span class="n">toSet</span><span class="o">.</span><span class="n">key</span><span class="p">,</span> <span class="n">toSet</span><span class="o">.</span><span class="n">value</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>Where a typical “multiget” type strategy would require you to batch up
a bunch of requests, send them all, get all the responses and infer
the misses, we process the results with minimal state – only keeping
up with what we’ve got in-flight. If a response is a hit, we send it
back up to the http handler. If it’s a miss (or there was any other
error), we send it to the doc worker. Done.</p>
<p>Of course, the actual loop is a bit more complicated as it deals with
fetch cancellations, connection failures which result in dumping all
in-flight cache requests directly into the document worker queues
while asynchronously starting a timed reconnection loop and a few
other things. The above is pretty much the golden path, though.</p>
<p>At the layer where the cache is installed, there are a lot of benefits I
don’t get, but when the site was pretty popular on Hacker News, I was
seeing queries like this flying by:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Completed query processing in 69.528ms, 9,394 keys, 1,920 chunks
</code></pre></div></div>
<p>That means we pulled 9,394 document IDs from the on-disk index,
chopped them up into 1,920 groups, computed hash keys for them and
then passed through the above <code class="language-plaintext highlighter-rouge">select</code> loop 3,840 times (once each for
1,920 requests and again for the responses) in 69.528 milliseconds.
Under load. That’s an absolute ceiling of roughly 36µs per cache round
trip (69.528ms across 1,920 requests), again ignoring all other queries
running at the same time.</p>
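<p>The arithmetic behind that ceiling is simple enough to spell out (the figures are taken from the log line above):</p>

```go
package main

import "fmt"

// perRequestMicros spreads a total elapsed time (in milliseconds) across
// n round trips and reports the per-trip share in microseconds.
func perRequestMicros(totalMs float64, n int) float64 {
	return totalMs * 1000 / float64(n)
}

func main() {
	// 69.528ms of query processing over 1,920 chunked cache requests.
	fmt.Printf("ceiling: %.1fµs per cache round trip\n", perRequestMicros(69.528, 1920))
}
```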
Seriesly - Document Oriented Time Series DB2012-09-09T00:00:00+00:00http://dustin.github.com/2012/09/09/seriesly<h1 id="why-so-seriesly">Why so, Seriesly?</h1>
<div>
<img src="/images/seriesly-compare.png" alt="comparison" title="Comparing Time Series Data" class="floatleft" />
</div>
<p>So, I started writing a document-oriented <a href="http://en.wikipedia.org/wiki/Time_series_database">time-series database</a>
in <a href="http://golang.org/">go</a> two weeks ago. It’s really nothing groundbreaking, but
it’s been quite fun.</p>
<p>My purpose is to have a system that allows me to store arbitrary
performance data captured as it’s seen in the wild, and then later
come up with ways to look at it.</p>
<p>Check out this <a href="https://gist.github.com/3683423">real life example</a> to get a feel for the kinds of
captures we’re working with. This sample represents a single point in
time; there may be many of these occurring at any frequency (well,
nothing more frequent than once per nanosecond). From this, we can do
arbitrary queries and report groupings at millisecond granularity.</p>
<h1 id="usage">Usage</h1>
<p>The <a href="https://github.com/dustin/seriesly/wiki">seriesly wiki</a> describes in detail all that can be done
with it, but the general strategy is the following:</p>
<ol>
<li>Capture lots of data.</li>
<li>Query the data.</li>
</ol>
<p>An important goal is to make it this simple in practice.</p>
<h2 id="capturing-data">Capturing Data</h2>
<p>Captured data is submitted to a database and recorded with a
timestamp. The timestamp may be generated by the system, or may be
supplied at request time (which is useful for backfilling data).</p>
<h2 id="querying-data">Querying Data</h2>
<p>Querying captured data is about taking data from a specific time
range, grouping it into specific time chunks (e.g. 5 minutes’ worth of
data at a time), selecting keys from the data (using
<a href="http://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-03">json-pointer</a>) to query on and performing reductions
over the values selected using those keys.</p>
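<p>Those three steps compose naturally. Here is a toy version of the pipeline, simplified so the documents are pre-flattened (a json-pointer becomes a plain map key) and the only reducer shown is <code class="language-plaintext highlighter-rouge">avg</code>:</p>

```go
package main

import "fmt"

// capture is a parsed document plus its timestamp in ms since the epoch.
type capture struct {
	ts  int64
	doc map[string]float64 // flattened: json-pointer -> numeric value
}

// avgByWindow groups captures into windowMs-sized chunks, selects the
// value at pointer from each, and reduces each chunk with avg.
func avgByWindow(caps []capture, windowMs int64, pointer string) map[int64]float64 {
	sums, counts := map[int64]float64{}, map[int64]int{}
	for _, c := range caps {
		k := c.ts - c.ts%windowMs // start of this capture's chunk
		sums[k] += c.doc[pointer]
		counts[k]++
	}
	out := map[int64]float64{}
	for k, s := range sums {
		out[k] = s / float64(counts[k])
	}
	return out
}

func main() {
	caps := []capture{
		{0, map[string]float64{"/c/e": 2}},
		{100, map[string]float64{"/c/e": 4}},
		{300000, map[string]float64{"/c/e": 10}},
	}
	// Two 5-minute chunks: avg(2,4)=3 in the first, 10 alone in the second.
	res := avgByWindow(caps, 300000, "/c/e")
	fmt.Println(res[0], res[300000])
}
```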
<p>The following diagram shows an example query and how the fields are
selected and values are returned (conceptually). For this query, I’ve
asked for the <code class="language-plaintext highlighter-rouge">count</code> of <code class="language-plaintext highlighter-rouge">/a</code>, the <code class="language-plaintext highlighter-rouge">avg</code> of <code class="language-plaintext highlighter-rouge">/c/e</code> and the <code class="language-plaintext highlighter-rouge">min</code> of <code class="language-plaintext highlighter-rouge">/b/2</code>
grouped in 5 minute windows (<code class="language-plaintext highlighter-rouge">300</code> seconds).</p>
<p><img src="//github.com/dustin/seriesly/wiki/serieslyquery.png" alt="query" /></p>
<p>(The key in the result does not represent the actual key that will be
emitted. I used a human-readable time representation for illustration
purposes only. Had this been an actual query, the timestamps would’ve
all been absolute and emitted as the number of milliseconds since UNIX
epoch.)</p>
<h2 id="web-stuff">Web Stuff</h2>
<p>I use <a href="http://square.github.com/cubism/">cubism</a> for a lot of my time series fun. You can see
an example of this backed by seriesly at <a href="http://bleu.west.spy.net/~dustin/seriesly/">my thermometers page</a>
at home. Seriesly is designed after building query interfaces to
cubism for other systems. There’s no impedance mismatch between what
cubism wants to know and the query language provided by seriesly.</p>
<p>Other than the normal cubism configuration and styling, the code to
fetch data for display is pretty straightforward:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// First, point it to your data</span>
<span class="kd">var</span> <span class="nx">baseUrl</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">http://my.seriesly.host/</span><span class="dl">"</span><span class="p">,</span>
<span class="nx">dbname</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">mydb</span><span class="dl">"</span><span class="p">,</span>
<span class="nx">pointer</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">/some/json/pointer</span><span class="dl">"</span><span class="p">,</span>
<span class="nx">reducer</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">avg</span><span class="dl">"</span><span class="p">,</span>
<span class="nx">lbl</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">Label for my Metric</span><span class="dl">"</span><span class="p">;</span>
<span class="c1">// Then get a seriesly metric source:</span>
<span class="kd">var</span> <span class="nx">sr</span> <span class="o">=</span> <span class="nx">context</span><span class="p">.</span><span class="nx">seriesly</span><span class="p">(</span><span class="nx">baseUrl</span><span class="p">);</span>
<span class="c1">// And then your specific metric you want to plot:</span>
<span class="kd">var</span> <span class="nx">myMetric</span> <span class="o">=</span> <span class="nx">sr</span><span class="p">.</span><span class="nx">metric</span><span class="p">(</span><span class="nx">dbname</span><span class="p">,</span> <span class="nx">pointer</span><span class="p">,</span> <span class="nx">reducer</span><span class="p">,</span> <span class="nx">lbl</span><span class="p">);</span>
<span class="c1">// Then plot it like you would any other metric:</span>
<span class="nx">d3</span><span class="p">.</span><span class="nx">select</span><span class="p">(</span><span class="nx">here</span><span class="p">).</span><span class="nx">selectAll</span><span class="p">(</span><span class="dl">"</span><span class="s2">.horizon</span><span class="dl">"</span><span class="p">)</span>
<span class="p">.</span><span class="nx">data</span><span class="p">(</span><span class="nx">things</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">x</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="nx">myMetric</span><span class="p">;</span> <span class="p">}))</span>
<span class="p">.</span><span class="nx">enter</span><span class="p">().</span><span class="nx">insert</span><span class="p">(</span><span class="dl">"</span><span class="s2">div</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">.bottom</span><span class="dl">"</span><span class="p">)</span>
<span class="p">.</span><span class="nx">attr</span><span class="p">(</span><span class="dl">"</span><span class="s2">class</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">horizon</span><span class="dl">"</span><span class="p">)</span>
<span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="nx">context</span><span class="p">.</span><span class="nx">horizon</span><span class="p">());</span></code></pre></figure>
<p>A comparison of a metric against itself is just like you’d expect:</p>
<figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="c1">// Get a metric and compare it to the same data a week ago.</span>
<span class="kd">var</span> <span class="nx">primary</span> <span class="o">=</span> <span class="nx">sr</span><span class="p">.</span><span class="nx">metric</span><span class="p">(</span><span class="nx">dbname</span><span class="p">,</span> <span class="nx">pointer</span><span class="p">,</span> <span class="nx">reducer</span><span class="p">),</span>
<span class="nx">secondary</span> <span class="o">=</span> <span class="nx">primary</span><span class="p">.</span><span class="nx">shift</span><span class="p">(</span><span class="o">-</span><span class="mi">7</span> <span class="o">*</span> <span class="mi">24</span> <span class="o">*</span> <span class="mi">60</span> <span class="o">*</span> <span class="mi">60</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">);</span>
<span class="c1">// Then do the compare thing.</span>
<span class="nx">d3</span><span class="p">.</span><span class="nx">select</span><span class="p">(</span><span class="nx">here</span><span class="p">).</span><span class="nx">selectAll</span><span class="p">(</span><span class="dl">"</span><span class="s2">.comparison</span><span class="dl">"</span><span class="p">)</span>
<span class="p">.</span><span class="nx">data</span><span class="p">([[</span><span class="nx">primary</span><span class="p">,</span> <span class="nx">secondary</span><span class="p">]])</span>
<span class="p">.</span><span class="nx">enter</span><span class="p">().</span><span class="nx">append</span><span class="p">(</span><span class="dl">"</span><span class="s2">div</span><span class="dl">"</span><span class="p">)</span>
<span class="p">.</span><span class="nx">attr</span><span class="p">(</span><span class="dl">"</span><span class="s2">class</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">comparison</span><span class="dl">"</span><span class="p">)</span>
<span class="p">.</span><span class="nx">call</span><span class="p">(</span><span class="nx">comparison</span><span class="p">);</span></code></pre></figure>
<p>For complete examples, check out my <a href="http://bleu.west.spy.net/~dustin/seriesly/">temperature</a> plots and
view source. <a href="http://bleu.west.spy.net/~dustin/seriesly/series.js">series.js</a> has all the details.</p>
<h1 id="using-it">Using It</h1>
<p>Within a few days of starting the project, I had around fifteen
million data points loaded and querying. Today marks the first
fortnight birthday of the system and I’ve moved from the “make it work
correctly” phase to the “make it fast” phase.</p>
<p>I think I’ve been doing pretty well. A query like this took over two
minutes when I started; now it looks like this:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Completed query processing in 5.7s, 1,894,473 keys, 1,396 chunks
</code></pre></div></div>
<p>I’ve still got a couple things I want to do that I haven’t yet, but
for the most part, I’m just looking for more experience with it.</p>
<h1 id="further-reading">Further Reading…</h1>
<p>I hope to write more about some of the implementation, since
<a href="http://golang.org/">go</a> made it possible to do some things that are quite difficult
in other languages. In particular, modeling the query execution
concurrency and parallelism made it a lot easier to take the working
software and make it run faster and efficiently use all the resources
I could throw at it.</p>
<p>In the meantime, check out <a href="https://github.com/dustin/seriesly">the seriesly project page</a> to get
the software, file bugs (including new ideas), contribute changes and
all that.</p>
Incremental Mapreduce for Analytics with R2012-04-16T00:00:00+00:00http://dustin.github.com/2012/04/16/couchr<h1 id="incremental-mapreduce-for-analytics-with-r">Incremental Mapreduce for Analytics with R</h1>
<p>I’ve been wanting to describe some of my work with using R to help me
understand data I’m collecting in Couchbase Server<sup>†</sup>
because I find it quite interesting, useful and easy. However, it’s
been difficult for me to figure out a good starting point because I
don’t know who the audience would be. That is, finding the right set
of assumptions to get going has been quite hard.</p>
<p>Last week, however, I spoke to a really awesome guy in a media company
who had a specific question: “How can my analysts report on all the
wonderful data I’m storing in <a href="http://www.couchbase.com/">Couchbase</a>?” I dug deeper.
Who are these analysts? What tools do they use?</p>
<p style="font-size: smaller"><sup>†</sup> the incremental Map
Reduce Views are identical to Apache CouchDB views, so everything will
also work with CouchDB</p>
<h2 id="my-audience">My Audience</h2>
<p>Turns out, the analysts are pretty close to what I would imagine. They
often use some kind of data warehousing tools from Oracle that do all
kinds of great magic, and then fall over really hard if you drift
outside of bounds they’re comfortable with. This sounded like
something I could ignore. But then he said something that gave me a
pretty solid foothold. While they’re not programmers, they do use
<a href="http://www.r-project.org/">R</a> as part of their data analysis.</p>
<p>Because this question was asked by a Couchbase user who wanted to know
how to get his data out, I’m going to assume anyone reading this knows
R a bit better than Couchbase.</p>
<h2 id="about-views">About Views</h2>
<p>There are a lot of things you can read if you want to understand the
couch view concept. The <a href="http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views.html">view chapter</a> of the
<em>Couchbase Server Manual</em> covers the concept pretty well. If you want
to know everything you can know, then dig through that, but for most
of my uses, it really comes down to three things:</p>
<ol>
<li>Extract the useful information.</li>
<li>Sort it, putting like things together.</li>
<li>Do some basic aggregation.</li>
</ol>
<p>That’s how I take a lot of data and turn it into useful information
most of the time. Hopefully the examples that follow will help you do
the same.</p>
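<p>A toy model of those three steps, using composite <code class="language-plaintext highlighter-rouge">[category, day]</code> keys: extraction emits a key per document, sorting puts like keys together, and aggregation counts each run. Truncating the key before counting is also roughly what <code class="language-plaintext highlighter-rouge">group_level</code> does. This illustrates the concept only, not couch’s implementation:</p>

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type doc struct{ category, day string }

// countByKey runs the extract/sort/aggregate pipeline: emit a composite
// key per document, truncate it to the requested group level, sort so
// like keys are adjacent (couch reduces over a sorted index; a map
// suffices for this sketch), then count each key.
func countByKey(docs []doc, groupLevel int) map[string]int {
	keys := make([]string, 0, len(docs))
	for _, d := range docs {
		parts := []string{d.category, d.day}[:groupLevel] // truncate the key
		keys = append(keys, strings.Join(parts, "/"))
	}
	sort.Strings(keys)
	counts := map[string]int{}
	for _, k := range keys {
		counts[k]++
	}
	return counts
}

func main() {
	docs := []doc{
		{"Prostitution", "Tuesday"},
		{"Assault", "Tuesday"},
		{"Assault", "Friday"},
	}
	fmt.Println(countByKey(docs, 1)) // per category
	fmt.Println(countByKey(docs, 2)) // per category and day
}
```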
<h2 id="the-data">The Data</h2>
<p>The hardest part of any data grokking tutorial is that it’s never
about your data. This simultaneously makes it less interesting to the
reader and often makes it a bit harder to apply to your own problems.</p>
<p>Unfortunately, the most interesting data I regularly extract for
reporting is somewhat sensitive, so I can’t share the things that I’ve
got the most use out of, but I’m hoping this will help lead you to
something interesting.</p>
<p>The data I’ve chosen to work with is the SFPD Reported incidents set
from the <a href="https://data.sfgov.org/">SF Data</a> web site. It’s pretty much
everything that the SFPD has reported since 2003.</p>
<p>These documents are pretty regular and flat. Your data may be more
complicated, but the techniques are the same. Let’s begin by looking
at an example document from the SFPD data set:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="p">{</span>
<span class="dl">"</span><span class="s2">category</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Prostitution</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">incident_id</span><span class="dl">"</span><span class="p">:</span> <span class="mi">90096348</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">district</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Tenderloin</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">timestamp</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">2009-01-27T04:03:00</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">lon</span><span class="dl">"</span><span class="p">:</span> <span class="o">-</span><span class="mf">122.416261836834</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">lat</span><span class="dl">"</span><span class="p">:</span> <span class="mf">37.7853750846376</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">location</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Ofarrell St / Hyde St</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">time</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">04:03</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">date</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">2009-01-27</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">resolution</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Arrest, Cited</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">day</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Tuesday</span><span class="dl">"</span><span class="p">,</span>
<span class="dl">"</span><span class="s2">desc</span><span class="dl">"</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Solicits To Visit House Of Prostitution</span><span class="dl">"</span>
<span class="p">}</span></code></pre></figure>
<p>I think I can understand what all these things are, so let’s get to
work.</p>
<h2 id="getting-data-into-r">Getting Data into R</h2>
<p>There are a few packages I’ll be using here, so let’s make sure we get
those into your R before we go:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">install.packages</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s1">'rjson'</span><span class="p">,</span><span class="w"> </span><span class="s1">'ggplot2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'reshape'</span><span class="p">,</span><span class="w"> </span><span class="s1">'RColorBrewer'</span><span class="p">),</span><span class="w">
</span><span class="n">dependencies</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="n">rjson</span><span class="p">)</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="n">reshape</span><span class="p">)</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span></code></pre></figure>
<p>As R likes “square” data, I tend to have the output of my views be
very regular, which also means I can have very simple functions for
taking a view and pulling it back out. For this purpose, I have some
basic common setup in my R scripts that looks like this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Pointer to your couchbase view base. This is where you find your</span><span class="w">
</span><span class="c1"># own data</span><span class="w">
</span><span class="n">urlBase</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'http://couchbase.example.com/sfpd'</span><span class="w">
</span><span class="c1"># This is your basic GET request -> parsed JSON.</span><span class="w">
</span><span class="n">getData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">subpath</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">fromJSON</span><span class="p">(</span><span class="n">file</span><span class="o">=</span><span class="n">paste</span><span class="p">(</span><span class="n">urlBase</span><span class="p">,</span><span class="w"> </span><span class="n">subpath</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s1">''</span><span class="p">))</span><span class="o">$</span><span class="n">rows</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># And this flattens it into a data frame, optionally naming the</span><span class="w">
</span><span class="c1"># columns.</span><span class="w">
</span><span class="n">getFlatData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">sub</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="o">=</span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">plyr</span><span class="o">::</span><span class="n">ldply</span><span class="p">(</span><span class="n">getData</span><span class="p">(</span><span class="n">sub</span><span class="p">),</span><span class="w"> </span><span class="n">unlist</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="nf">is.null</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">n</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">b</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Also, I'm going to be working with days of week, so I need these:</span><span class="w">
</span><span class="n">dow</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Sunday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Monday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Tuesday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Wednesday'</span><span class="p">,</span><span class="w">
</span><span class="s1">'Thursday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Friday'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Saturday'</span><span class="p">)</span><span class="w">
</span><span class="n">shortdow</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Sun'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Mon'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Tue'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Wed'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Thu'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Fri'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Sat'</span><span class="p">)</span></code></pre></figure>
<h2 id="overall-crime-report-count">Overall Crime Report Count</h2>
<p>As with most data sets, I don’t actually even know where to start, so
first let’s just see what kinds of crimes we’ve got. I’m interested
in total counts and counts by day of week. The nice thing is that
with couch views, I can build a single view that will tell me either.
Let’s look at the view source:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="kd">function</span><span class="p">(</span><span class="nx">doc</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">emit</span><span class="p">([</span><span class="nx">doc</span><span class="p">.</span><span class="nx">category</span><span class="p">,</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">day</span><span class="p">],</span> <span class="kc">null</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Looks really simple, but combined with the <code class="language-plaintext highlighter-rouge">_count</code> built-in reducer,
this can do a lot of neat things when grouping. With <code class="language-plaintext highlighter-rouge">group_level=1</code>,
we get crime count by category. Let’s plot that and see what’s
popular. Assuming we saved that in a design document called
<code class="language-plaintext highlighter-rouge">categories</code> with the view name of <code class="language-plaintext highlighter-rouge">byday</code>, here’s what you tell R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Get a dataframe containing the categories and their respective counts</span><span class="w">
</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getFlatData</span><span class="p">(</span><span class="s1">'_design/categories/_view/byday?group_level=1'</span><span class="p">,</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s1">'cat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'count'</span><span class="p">))</span><span class="w">
</span><span class="c1"># The columns come back as strings and require fixes to make them useful</span><span class="w">
</span><span class="n">cat</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">cat</span><span class="o">$</span><span class="n">cat</span><span class="p">)</span><span class="w">
</span><span class="n">cat</span><span class="o">$</span><span class="n">count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">cat</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="c1"># Also, I found sorting it by count made it easier to understand</span><span class="w">
</span><span class="n">cat</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">cat</span><span class="o">$</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">cat</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="c1"># Now plot it.</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s1">'#333399'</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="o">=</span><span class="s1">'identity'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_alpha</span><span class="p">(</span><span class="n">to</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0.9</span><span class="p">),</span><span class="w"> </span><span class="n">legend</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s2">"comma"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">opts</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">'Total Crime Reports'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span></code></pre></figure>
<p>Then R will give you this:</p>
<p><img src="/images/r/sfpd_cats.png" alt="all cats" /></p>
<h2 id="by-day-of-week">By Day of Week</h2>
<p>I found this to be somewhat interesting, so I wanted to know what the
distribution was by day of week. I can use the same view above with
<code class="language-plaintext highlighter-rouge">group_level=2</code>, but since the rates are tremendously different, I had
R compute the relative variance across the data frame for each
category by day of week and then plotted that. Here’s the R code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Grab the same data, but separated by day of week.</span><span class="w">
</span><span class="n">cat_byday</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getFlatData</span><span class="p">(</span><span class="s1">'_design/categories/_view/byday?group_level=2'</span><span class="p">,</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s1">'cat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'day'</span><span class="p">,</span><span class="w"> </span><span class="s1">'count'</span><span class="p">))</span><span class="w">
</span><span class="c1"># I'm doing similar fixup to the above, but with another ordering and</span><span class="w">
</span><span class="c1"># a couple views of day of week (much for playing around)</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">)</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">)</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat_by_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">cat_byday</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">day</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">day</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="o">=</span><span class="n">dow</span><span class="p">)</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">shortdow</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">day</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="o">=</span><span class="n">dow</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="o">=</span><span class="n">shortdow</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute the percentage of each category by its day of week</span><span class="w">
</span><span class="n">a1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">aggregate</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">),</span><span class="w"> </span><span class="n">sum</span><span class="p">)</span><span class="w">
</span><span class="n">a2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">aggregate</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="nf">list</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">day</span><span class="p">,</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">),</span><span class="w">
</span><span class="n">sum</span><span class="p">)</span><span class="w">
</span><span class="n">total_per_day</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">a1</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">7</span><span class="p">,</span><span class="nf">length</span><span class="p">(</span><span class="n">a1</span><span class="o">$</span><span class="n">x</span><span class="p">)))</span><span class="w">
</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">perc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">a2</span><span class="o">$</span><span class="n">x</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">total_per_day</span><span class="w">
</span><span class="c1"># Let's see what this looks like</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">cat_byday</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">shortdow</span><span class="p">,</span><span class="w"> </span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="n">perc</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="n">perc</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s1">'#333399'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_alpha</span><span class="p">(</span><span class="n">legend</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_size</span><span class="p">(</span><span class="n">legend</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_discrete</span><span class="p">(</span><span class="n">limits</span><span class="o">=</span><span class="n">rev</span><span class="p">(</span><span class="n">levels</span><span class="p">(</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span></code></pre></figure>
<p>That looks like a lot of setup, but most of it is just type conversion
and reordering, and we’ll reuse some of it below. At this point, we’ve
got something to look at:</p>
<p><img src="/images/r/sfpd_all_by_day.png" alt="all cats by day" /></p>
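<p>For readers who don’t speak R, the percentage step is the only subtle
part of that setup. Here’s a hypothetical plain-JavaScript mirror of the
two <code class="language-plaintext highlighter-rouge">aggregate()</code> calls (the helper name is mine, not code I actually ran):</p>

```javascript
// Mirror of the R percentage step: given flat {cat, day, count} rows,
// divide each count by its category's total across all days.
function addPercentages(rows) {
  // Total incidents per category (like the first aggregate() call).
  var totals = {};
  rows.forEach(function(r) {
    totals[r.cat] = (totals[r.cat] || 0) + r.count;
  });
  // Each row's share of its category's weekly total.
  return rows.map(function(r) {
    return {cat: r.cat, day: r.day, perc: r.count / totals[r.cat]};
  });
}

var sample = [
  {cat: 'ASSAULT', day: 'Monday', count: 30},
  {cat: 'ASSAULT', day: 'Friday', count: 70}
];
console.log(addPercentages(sample));  // perc: 0.3 and 0.7
```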
<h2 id="focused-subset">Focused Subset</h2>
<p>There’s a lot of data here, and since it’s all relative, it’s hard to
compare things directly. I wanted to focus on a couple of areas and
figure out what kinds of correlations existed. Since I already had the
data loaded, I just grabbed a subset of what was already requested and
facet plotted it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Pick a few categories of interest</span><span class="w">
</span><span class="n">interesting</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Drug/narcotic'</span><span class="p">,</span><span class="w">
</span><span class="s1">'Prostitution'</span><span class="p">,</span><span class="w">
</span><span class="s1">'Drunkenness'</span><span class="p">,</span><span class="w">
</span><span class="s1">'Disorderly Conduct'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Extract just this subset and refactor the categories</span><span class="w">
</span><span class="n">sex_and_drugs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cat_byday</span><span class="p">[</span><span class="n">cat_byday</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">interesting</span><span class="p">,]</span><span class="w">
</span><span class="n">sex_and_drugs</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nf">as.character</span><span class="p">(</span><span class="n">sex_and_drugs</span><span class="o">$</span><span class="n">cat</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">sex_and_drugs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">shortdow</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="o">=</span><span class="s1">'free_y'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s1">'#333399'</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="o">=</span><span class="s1">'identity'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="s2">"comma"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">opts</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">'Select Crime Reports by Day'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span></code></pre></figure>
<p>And that should show me a lot more detail on these individual
categories.</p>
<p><img src="/images/r/sfpd_sex_and_drugs.png" alt="sex and drugs" /></p>
<p>Personally, I found the lack of correlation between alcohol related
incidents and others quite interesting. Alcohol seems to be the
anti-drug. Maybe prostitutes don’t like drunks. Who knows…</p>
<h2 id="over-time">Over Time</h2>
<p>At this point, I realized the data goes back to 2003 and I hadn’t
even considered the trend over time. I didn’t explore this very deeply,
but I wanted a quick feel for whether things are getting better or
worse. Here’s a view that will tell us incident rates by year and
category:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="kd">function</span><span class="p">(</span><span class="nx">doc</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">ymd</span> <span class="o">=</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">date</span><span class="p">.</span><span class="nx">split</span><span class="p">(</span><span class="dl">'</span><span class="s1">-</span><span class="dl">'</span><span class="p">);</span>
<span class="nx">emit</span><span class="p">([</span><span class="nb">parseInt</span><span class="p">(</span><span class="nx">ymd</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">10</span><span class="p">),</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">category</span><span class="p">],</span> <span class="kc">null</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
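<p>Combined with the built-in <code class="language-plaintext highlighter-rouge">_count</code> reduce at <code class="language-plaintext highlighter-rouge">group_level=1</code>, that map gives incident counts per year. A toy in-memory sketch of the same aggregation (plain JavaScript, not actual CouchDB machinery):</p>

```javascript
// Toy in-memory version of the map above plus the _count reduce at
// group_level=1: incident counts keyed by year.
function countByYear(docs) {
  var counts = {};
  docs.forEach(function(doc) {
    // Same parse as the map function: year is the first date component.
    var year = parseInt(doc.date.split('-')[0], 10);
    counts[year] = (counts[year] || 0) + 1;
  });
  return counts;
}

var docs = [
  {date: '2003-01-15', category: 'ASSAULT'},
  {date: '2003-06-02', category: 'VANDALISM'},
  {date: '2004-03-09', category: 'ASSAULT'}
];
console.log(countByYear(docs));  // { '2003': 2, '2004': 1 }
```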
<p>As in all these examples, I combine this with the <code class="language-plaintext highlighter-rouge">_count</code> built-in
reduce. Let’s just chart up the yearly rates with the following R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">byyear</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getFlatData</span><span class="p">(</span><span class="s1">'_design/categories/_view/by_year?group_level=1'</span><span class="p">,</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s1">'year'</span><span class="p">,</span><span class="w"> </span><span class="s1">'count'</span><span class="p">))</span><span class="w">
</span><span class="n">byyear</span><span class="o">$</span><span class="n">year</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">byyear</span><span class="o">$</span><span class="n">year</span><span class="p">)</span><span class="w">
</span><span class="c1"># There's not enough 2012 here, so let's ignore that for this chart.</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">byyear</span><span class="p">[</span><span class="n">byyear</span><span class="o">$</span><span class="n">year</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2012</span><span class="p">,],</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">year</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">stat_smooth</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s1">'#333399'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">formatter</span><span class="o">=</span><span class="n">comma</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">opts</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s2">"Total Incident Reports by Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span></code></pre></figure>
<p>What’s this tell us?</p>
<p><img src="/images/r/sfpd_byyear.png" alt="by year" /></p>
<p>Looks like things are getting better (or police are getting lazier).
I could dig into this a bit more to find out whether it’s true for all
categories, but I’m not that interested, so let’s look at something
else.</p>
<h2 id="crime-by-area">Crime by Area</h2>
<p>I was interested in knowing whether certain crimes were more popular
in some areas than others. I’m using the doc’s <code class="language-plaintext highlighter-rouge">district</code> property
for this (rather than the built-in coordinates) and thought it might
be a good use case for a heatmap.</p>
<p>One thing I noticed is that some reports don’t have a district
associated with them. I chose to ignore those for this report, but
you could quite easily substitute a custom value if you wanted to
consider them specifically. Let’s begin with the following view
code:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="kd">function</span><span class="p">(</span><span class="nx">doc</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">doc</span><span class="p">.</span><span class="nx">district</span> <span class="o">!=</span> <span class="kc">null</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">emit</span><span class="p">([</span><span class="nx">doc</span><span class="p">.</span><span class="nx">category</span><span class="p">,</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">district</span><span class="p">],</span> <span class="kc">null</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
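<p>Had I wanted to keep the district-less reports, a small variant of that map (hypothetical; the sentinel name is mine) could bucket them under a placeholder value instead of skipping them. Stubbing <code class="language-plaintext highlighter-rouge">emit</code> makes it easy to check outside CouchDB:</p>

```javascript
// Variant of the map above that keeps district-less reports by
// emitting a sentinel second key element instead of skipping the doc.
function mapWithSentinel(doc) {
  emit([doc.category, doc.district != null ? doc.district : 'UNKNOWN'], null);
}

// Stub CouchDB's emit() so we can exercise the map locally.
var emitted = [];
function emit(key, value) { emitted.push(key); }

mapWithSentinel({category: 'ASSAULT', district: null});
mapWithSentinel({category: 'ARSON', district: 'PARK'});
console.log(emitted);  // [ [ 'ASSAULT', 'UNKNOWN' ], [ 'ARSON', 'PARK' ] ]
```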
<p>Of course, we’ll use the <code class="language-plaintext highlighter-rouge">_count</code> built-in again. One thing I should
note is that while I originally plotted <em>all</em> the data, I later
decided I wasn’t interested in any area that had fewer than 1,000
crimes reported. Since this threshold applies to the <em>output</em> of the
reduce, I had to apply it in R; there’s no way to request that from a
couch view (the views are materialized, and the map function didn’t
filter before the reduce was applied). Ideally, we’d support this in
the actual view request, but in the meantime, we can extract it easily
in post:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">by_region</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getFlatData</span><span class="p">(</span><span class="s1">'_design/region/_view/by_cat_region?group_level=2'</span><span class="p">,</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s1">'cat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'region'</span><span class="p">,</span><span class="w"> </span><span class="s1">'count'</span><span class="p">))</span><span class="w">
</span><span class="n">by_region</span><span class="o">$</span><span class="n">count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">by_region</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="n">by_region</span><span class="o">$</span><span class="n">region</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">by_region</span><span class="o">$</span><span class="n">region</span><span class="p">)</span><span class="w">
</span><span class="c1"># Ignore anything that doesn't have at least 1,000 incidents</span><span class="w">
</span><span class="n">pop_regions</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">by_region</span><span class="p">[</span><span class="n">by_region</span><span class="o">$</span><span class="n">count</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">1000</span><span class="p">,]</span><span class="w">
</span><span class="n">pop_regions</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nf">as.character</span><span class="p">(</span><span class="n">pop_regions</span><span class="o">$</span><span class="n">cat</span><span class="p">))</span><span class="w">
</span><span class="c1"># And have the hottest crimes float to the top</span><span class="w">
</span><span class="n">pop_regions</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">pop_regions</span><span class="o">$</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">pop_regions</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">pop_regions</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">region</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="o">=</span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_tile</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_continuous</span><span class="p">(</span><span class="s1">'Incidents'</span><span class="p">,</span><span class="w">
</span><span class="n">formatter</span><span class="o">=</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"%dk"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">1000</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_alpha_continuous</span><span class="p">(</span><span class="n">legend</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0.7</span><span class="p">,</span><span class="w"> </span><span class="m">1.0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">opts</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">'Crime Types by District'</span><span class="p">,</span><span class="w">
</span><span class="n">axis.text.x</span><span class="o">=</span><span class="n">theme_text</span><span class="p">(</span><span class="n">angle</span><span class="o">=</span><span class="m">-90</span><span class="p">),</span><span class="w">
</span><span class="n">legend.position</span><span class="o">=</span><span class="s1">'right'</span><span class="p">)</span></code></pre></figure>
<p>That gives us the following heatmap:</p>
<p><img src="/images/r/sfpd_regions.png" alt="regions" /></p>
<p>The blank areas didn’t have 1,000 incidents of the specified type of
crime in the indicated area since 2003. The lighter blue areas have
had some incidents, and the bright red areas have the most. Looks like
I want to avoid the southern district.</p>
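<p>As a footnote to the server-side filtering I wished for above: CouchDB can get partway there today with a <code class="language-plaintext highlighter-rouge">_list</code> function, which post-processes view rows on the server. This is just a sketch I didn’t actually deploy, exercised here with a stubbed-out list API:</p>

```javascript
// Sketch of a CouchDB _list function that drops reduce rows below a
// threshold on the server side (hypothetical; not in my design docs).
function popularOnly(head, req) {
  start({headers: {'Content-Type': 'application/json'}});
  var row, first = true;
  send('[');
  while ((row = getRow())) {
    if (row.value >= 1000) {
      send((first ? '' : ',') + JSON.stringify(row));
      first = false;
    }
  }
  send(']');
}

// Stub the CouchDB list API so the function runs locally.
var rows = [{key: ['ASSAULT', 'SOUTHERN'], value: 5000},
            {key: ['ARSON', 'PARK'], value: 12}];
var out = '';
function start(resp) {}
function getRow() { return rows.shift(); }
function send(s) { out += s; }

popularOnly(null, null);
console.log(out);  // [{"key":["ASSAULT","SOUTHERN"],"value":5000}]
```

The downside of a list function is that the filtering happens per request rather than being materialized, which is why doing it once in R after the fetch was good enough here.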
<h2 id="how-many-does-the-da-refuse">How Many Does the DA Refuse?</h2>
<p>As an example of pulling a server-side aggregate on part of the data,
I found the “District Attorney Refuses To Prosecute” resolution type
particularly interesting, so I wanted to know how often this happens.
Again, we start with a simple view:</p>
<figure class="highlight"><pre><code class="language-js" data-lang="js"><span class="kd">function</span><span class="p">(</span><span class="nx">doc</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">emit</span><span class="p">([</span><span class="nx">doc</span><span class="p">.</span><span class="nx">resolution</span><span class="p">,</span> <span class="nx">doc</span><span class="p">.</span><span class="nx">category</span><span class="p">],</span> <span class="kc">null</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure>
<p>Then we do our normal <code class="language-plaintext highlighter-rouge">_count</code> thing. The difference here is
that when I make the request, I use the <code class="language-plaintext highlighter-rouge">start_key</code> and
<code class="language-plaintext highlighter-rouge">end_key</code> parameters to find only things that were resolved in this
way. I happen to know that, alphabetically, the next resolution after
“District Attorney Refuses To Prosecute” is “Exceptional Clearance”,
so I can ask for everything from keys starting with “Di” up to keys
starting with “Dj”. Since I’m emitting array keys, the range really
applies to the first element of each array. The R code then looks
like this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">by_resolution</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getFlatData</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s1">'_design/resolution/_view/by_res_cat'</span><span class="p">,</span><span class="w">
</span><span class="s1">'?group_level=2&start_key=["Di"]'</span><span class="p">,</span><span class="w">
</span><span class="s1">'&end_key=["Dj"]'</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s2">""</span><span class="p">),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s1">'resolution'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'count'</span><span class="p">))</span><span class="w">
</span><span class="n">by_resolution</span><span class="o">$</span><span class="n">count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">by_resolution</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="n">by_resolution</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">by_resolution</span><span class="o">$</span><span class="n">cat</span><span class="p">)</span><span class="w">
</span><span class="n">by_resolution</span><span class="o">$</span><span class="n">cat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reorder</span><span class="p">(</span><span class="n">by_resolution</span><span class="o">$</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">by_resolution</span><span class="o">$</span><span class="n">count</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">by_resolution</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">cat</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="n">count</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_alpha</span><span class="p">(</span><span class="n">to</span><span class="o">=</span><span class="nf">c</span><span class="p">(</span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="m">0.9</span><span class="p">),</span><span class="w"> </span><span class="n">legend</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_flip</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_bar</span><span class="p">(</span><span class="n">fill</span><span class="o">=</span><span class="s1">'#333399'</span><span class="p">,</span><span class="w"> </span><span class="n">stat</span><span class="o">=</span><span class="s1">'identity'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">opts</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="s1">'Crimes the DA Refused to Prosecute'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span></code></pre></figure>
<p>R then gives us the following:</p>
<p><img src="/images/r/sfpd_da_refused.png" alt="regions" /></p>
<p>Do note that these are <em>absolute</em> numbers. Don’t call up SF and
complain that they care less about assault than about vandalism;
there are simply more vandalism cases. I’ll leave evaluating
resolution types by category, and deciding what to think about them,
as an exercise for the reader.</p>
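<p>One practical note on that <code class="language-plaintext highlighter-rouge">start_key</code>/<code class="language-plaintext highlighter-rouge">end_key</code> request: the key range is JSON and needs URL encoding. A hypothetical helper (mine, not from any library used here) that builds the same query:</p>

```javascript
// Hypothetical helper that builds a CouchDB view URL with
// JSON-encoded query parameters, matching the hand-built string above.
function viewUrl(path, params) {
  var qs = Object.keys(params).map(function(k) {
    return k + '=' + encodeURIComponent(JSON.stringify(params[k]));
  }).join('&');
  return path + '?' + qs;
}

var url = viewUrl('_design/resolution/_view/by_res_cat',
                  {group_level: 2, start_key: ['Di'], end_key: ['Dj']});
console.log(url);
// _design/resolution/_view/by_res_cat?group_level=2&start_key=%5B%22Di%22%5D&end_key=%5B%22Dj%22%5D
```

CouchDB also accepts the <code class="language-plaintext highlighter-rouge">startkey</code>/<code class="language-plaintext highlighter-rouge">endkey</code> spellings for the same parameters.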
<h2 id="in-conclusion">In Conclusion</h2>
<p>I could obviously keep going with this for days, but just wanted to
help people understand my process. In most places I use this, the
patterns are similar. Data sets may grow very large, but the
aggregations remain small. Incremental processing of the views means
my sorted and aggregated answers continue to arrive quickly and
processing remains cheap.</p>
What I've Been Up To (2011-09-20)2011-09-20T00:00:00+00:00http://dustin.github.com/2011/09/20/catching-up<h1 id="what-ive-been-up-to---a-virtual-blog-digest">What I’ve been Up To - A Virtual Blog Digest</h1>
<p>I’ve been busy doing lots and lots of things. Much of it is
work-related, but I figured I’d quickly dump out a semi-structured
list of things I’ve been doing since my last post (over three
months!).</p>
<p>Many of these things deserve their own blog posts, but since I’ve
clearly not been doing that, this digest will at least provide
pointers for those interested.</p>
<h2 id="mccouch">mccouch</h2>
<p>My primary job is to build a better Couchbase server.
<a href="https://github.com/couchbase/mccouch">mccouch</a> is a key part of that by providing a memcached
binary protocol interface to CouchDB. Bypassing HTTP and using
memcached multi-set semantics gets us really good throughput for
putting data on disk.</p>
<h2 id="house-temperature-readers">house temperature readers</h2>
<div>
<img src="http://bleu.west.spy.net/house/" alt="house" title="I live here." class="floatright" />
</div>
<p>My thermometers live on a <a href="https://github.com/dustin/ibutton">1-wire bus</a> for which I wrote a stack
and collector that sends multicast data around the LAN. I pick that
data up with various things, like the app that generates the image to
your right.</p>
<p>I wanted to get that data stored into CouchDB, so I wrote a quick
<a href="https://gist.github.com/1028791">multicast temperature -> couchdb</a> thing in
<a href="http://coffeescript.org/">coffeescript</a> for <a href="http://nodejs.org/">node.js</a>.</p>
<p>There were a couple of aspects of that I didn’t like (I think the
primary thing was the VSS on the Linux box I had it deployed on), so I
ended up <a href="https://gist.github.com/1088300">rewriting it in go</a>. That’s been serving me pretty
well.</p>
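<p>The actual coffeescript and go versions are linked above; as a rough
illustration of the idea (listen for multicast temperature readings and
write each one to CouchDB as a document), a minimal Python sketch might
look like the following. The multicast group, port, database URL, and
reading format are all assumptions, not the original code.</p>

```python
import json
import socket
import struct
import time
from urllib.request import Request, urlopen

# Hypothetical multicast group/port and CouchDB database URL.
MCAST_GROUP, MCAST_PORT = "224.0.0.99", 6789
COUCH_URL = "http://localhost:5984/temperatures"

def reading_to_doc(line, now=None):
    """Turn a hypothetical 'sensor_id temperature' reading into a doc."""
    sensor, temp = line.split()
    return {"sensor": sensor, "reading": float(temp),
            "ts": now if now is not None else time.time()}

def store(doc):
    """POST one JSON document to CouchDB."""
    req = Request(COUCH_URL, data=json.dumps(doc).encode(),
                  headers={"Content-Type": "application/json"})
    urlopen(req)

def listen():
    """Join the multicast group and store every reading that arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP),
                       socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, _ = sock.recvfrom(1024)
        store(reading_to_doc(data.decode()))

if __name__ == "__main__":
    listen()
```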
<h2 id="couchdb-duplicator">couchdb duplicator</h2>
<p>In early testing of replicator DB of CouchDB (i.e. playing with it
around the house), I wrote <a href="https://gist.github.com/1033557">a quick tool</a> to have one
CouchDB be a full mirror of another for all existing databases. I’ve
got a few big databases. That was fun.</p>
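<p>The gist linked above is the real tool; the core trick is small enough
to sketch here: list every database on the source, then ask the target
to replicate each one via CouchDB’s <code>_replicate</code> endpoint.
The hostnames below are placeholders.</p>

```python
import json
from urllib.request import Request, urlopen

# Hypothetical source and target CouchDB instances.
SOURCE = "http://couch-a:5984"
TARGET = "http://couch-b:5984"

def replication_doc(source_base, target_base, db):
    """Build the body for CouchDB's _replicate endpoint for one db."""
    return {"source": f"{source_base}/{db}",
            "target": f"{target_base}/{db}",
            "create_target": True}

def mirror_all():
    """Mirror every database on SOURCE over to TARGET."""
    with urlopen(SOURCE + "/_all_dbs") as resp:
        dbs = json.load(resp)
    for db in dbs:
        body = json.dumps(replication_doc(SOURCE, TARGET, db)).encode()
        req = Request(TARGET + "/_replicate", data=body,
                      headers={"Content-Type": "application/json"})
        urlopen(req)

if __name__ == "__main__":
    mirror_all()
```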
<h2 id="complex-data-driven-couchdb-partial-replication">complex data-driven couchdb partial replication</h2>
<p>I thought it might be fun to replicate my temperature database (~11M
docs at the time) to another database filtered down to only the
documents that met certain criteria specified by the replication
request. <a href="https://gist.github.com/1034925">That worked</a> via a replication document
specification, but backfilling wasn’t as fast as I’d like. I’ll need
to do more work here.</p>
<h2 id="location-project">location project</h2>
<div>
<img src="/images/locations-small.png" alt="locations" title="Where I go." class="floatright" />
</div>
<p>I created my <a href="https://github.com/dustin/location">location project</a> as a repository of location
data from <a href="http://www.google.com/latitude/">google latitude</a> and <a href="http://www.tripit.com/">tripit</a>. It
automatically populates itself with data from these sources and will
let me explore my many travels interactively thanks to
<a href="https://github.com/couchbase/geocouch">geocouch</a> (my first time using it) and google maps.</p>
<p>As you can see, my many travels are typically to Tahoe and home.</p>
<h2 id="code-review">code review</h2>
<div>
<img src="/images/gerrit-small.png" alt="gerrit" title="Our code." class="floatright" />
</div>
<p>I did some work here and there on our <a href="http://dustinphoto.iriscouch.com/gerrit/_design/app/index.html">gerrit dashboard</a> thing
that shows what we’ve been working on.</p>
<p>One somewhat notable thing was switching from gravatar to
<a href="https://www.libravatar.org/">libravatar</a> for the avatars. Most people probably didn’t
notice, but libravatar is pretty cool, and it’s got pass-through
compatibility.</p>
<p>I’ve got one of these for <a href="http://dustinphoto.iriscouch.com/android/_design/app/index.html">android</a> as well, though
kernel.org’s outage has made it pretty stale.</p>
<h2 id="photo-app-work">photo app work</h2>
<p>I did a bunch of work on my <a href="https://github.com/dustin/photo-couch">photo couchapp</a>. I use this to
store and replicate all of my photos. It’s probably not exciting to
most people (though everyone’s free to use it as a great way to benefit
from the cloud without data loss or availability risk).</p>
<p>I made a bulk edit screen and built myself a proxy for accessing the
full-size image I have mirrored onto S3 via a signing proxy redirector
thing to my house. This is a service I wrote in (I think) python. I
originally tried to write it in node.js, but I found a bug in the S3
API I was trying to use and just wanted it to work.</p>
<h2 id="readme">readme</h2>
<p>I wrote <a href="https://github.com/dustin/readme">this web app</a> for our support guys to track
discussions of interest from around the web. Instead of them having
to have duplicate RSS reader config, lots of mail setups and stuff,
they just have one screen they can look at and share and track state
of all the things they need to respond to.</p>
<p>(and sorry, the app doesn’t have a README; if anyone’s interested, I
can start documenting it)</p>
<h2 id="reddit-pics">reddit pics</h2>
<p>Another tiny app I wrote just for me, my <a href="https://github.com/dustin/rpics">reddit pics app</a>
presents me with a stream of images from reddit and lets me mark the
ones I want to keep, while deleting the rest.</p>
<p>(well, deleting in this case means removing from my local DB, which is
a replica of an upstream master DB that keeps All The Things)</p>
<h2 id="gomemcached-update">gomemcached update</h2>
<p>Someone filed a bug report against my <a href="https://github.com/dustin/gomemcached">go memcached
server</a> that involved little more than an update for a
newer compiler, but that was fun to play with again.</p>
<h2 id="gotap">gotap</h2>
<p>I don’t even remember why, but I wrote an implementation of the
memcached tap protocol <a href="https://github.com/dustin/gotap">in go</a>. This is the protocol upon
which we build replication, failover, ETL, etc… in membase.</p>
<h2 id="html5-page-idle-stuff">html5 page idle stuff</h2>
<p>HTML5 added a feature that allows you to detect when the browser
renders your page hidden (most commonly by switching tabs). This is
really awesome for things like <a href="http://dustinphoto.iriscouch.com/gerrit/_design/app/index.html">my code review</a> and
<a href="https://github.com/dustin/rpics">reddit</a> apps since I can disable the realtime data stream to
pages that aren’t even being looked at (and enable it when someone
looks again!)</p>
<h2 id="web-de-auth-proxy">web de-auth proxy</h2>
<p>I did some private-use map visualizations using google maps (which
<a href="http://vmx.cx/">Volker Mische</a> ported to the jquery openmap API).
Unfortunately, they also used private databases and running a web
screen saver or full-time web browser that requires auth can be a
minor annoyance.</p>
<p>That sounded like a great job for <a href="http://nodejs.org/">node.js</a>, so I wrote a
quick <a href="https://gist.github.com/bf62443ce52ae3e8604f">deauthing proxy in node.js</a> using the built-in http
server, client, and a pipe. It was a bit more complicated than I
initially assumed it would be, but that was basically it.</p>
<p>Unfortunately, it was very unreliable. Not sure if my bug, or
node.js, but after a few minutes, it’d stop answering any HTTP
requests, or even acknowledging that it received one. That kind of
sucked.</p>
<p>So I wrote <a href="https://gist.github.com/c510c603dabfdc13ce53">a new one in go</a>. It’s the same number of lines
(once I added commandline parsing and stuff), but has worked
flawlessly since I deployed it even with a bunch of concurrent clients
grabbing lots of data and long-polling and what-not.</p>
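<p>The node.js and go versions are in the gists linked above. The basic
shape of such a de-authing proxy (accept unauthenticated requests, attach
credentials, forward upstream, relay the response) can be sketched in
Python as below; the upstream URL, port, and credentials are placeholders,
and this handles only GETs.</p>

```python
import base64
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

# Hypothetical upstream and credentials for a private database.
UPSTREAM = "http://localhost:5984"
USER, PASSWORD = "reader", "secret"

def basic_auth(user, password):
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return "Basic " + token

class DeauthProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request upstream with credentials attached,
        # so the client (e.g. a screen saver) never needs to log in.
        req = Request(UPSTREAM + self.path,
                      headers={"Authorization": basic_auth(USER, PASSWORD)})
        with urlopen(req) as resp:
            body = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type", "text/plain"))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8080), DeauthProxy).serve_forever()
```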
<h2 id="fprof-visualization">fprof visualization</h2>
<div>
<img src="/images/fprof-dot-small.png" alt="fprof visualization" title="There's more where this came from." class="floatright" />
</div>
<p>Part of our ongoing effort to make everything performant involves
understanding everything. I did a little bit of stuff with a <a href="https://gist.github.com/1091684">tiny
module</a> or <a href="https://gist.github.com/1096713">two</a> showing me the current state
of things, but that really didn’t help understanding things moving
forward.</p>
<p>My first pass at this was to understand the data produced by <a href="http://www.erlang.org/doc/man/fprof.html">Erlang’s
fprof</a> and organize it into a giant document we can pan and
trace and see where things are unnecessarily hot and why. I wrote <a href="https://gist.github.com/0cd51b5a97b0569bc250">my
fprof dot</a> thing to do that work. It’s not quite
productionalized, but it’s worked well for me.</p>
<h2 id="web-pipe">web pipe</h2>
<p>Turns out, graphviz from homebrew is broken on Lion (one of the
relatively few issues I’ve run into), and I wanted to be able to look
at stuff I was building, so I made a little <a href="https://gist.github.com/1205139">web pipe</a> tool
that let me have processes that run on remote machines take data from
stdin and pass it back to stdout on my local machine using curl and
node.js. That worked pretty well.</p>
<h2 id="erlang-dtrace">erlang dtrace</h2>
<p>This definitely deserves a post by itself, but in the meantime, I’ve
been working to add proper dtrace support to erlang (once and for
all!)</p>
<p>I mentioned my first attempt with fprof above. There was a second
attempt at digging out enough information where I used the low-level
tracing facilities of erlang itself to really understand where
everything was happening. It will really tell you a lot, but you have
to write a lot of tools and really just end up doing it yourself.</p>
<p>So then came <a href="http://en.wikipedia.org/wiki/DTrace">dtrace</a>.</p>
<p>Anyone who knows dtrace will understand this, and anyone who knows me
will have heard it all (all!) before, but dtrace answers all the
questions. Life, the universe, everything.</p>
<p>A co-worker asked me how long it takes for a message sent from one
erlang process to be picked up in another process. Without as much as
picking up an editor, I blasted out <a href="https://gist.github.com/6f73df27f67a7d123e63">this commandline</a> and
was able to say it’s around 8µs ± a few nanoseconds for clock skew
across cores and processes that are scheduled independently on other
threads regardless of my message having been sent.</p>
<p>My <a href="https://github.com/dustin/otp/wiki/DTrace">DTrace wiki page</a> is a starting point for getting the
stuff going and knowing where all the probes are (basically
function/bif/nif entry/return, process scheduler events, gc events,
allocation events, message sending, hibernate, probably more).</p>
<p>It makes it really easy to test things that are just invisible to
erlang itself – like knowing which erlang functions cause heap growth
that directly causes the OS itself to actually <code class="language-plaintext highlighter-rouge">mmap</code>/<code class="language-plaintext highlighter-rouge">sbrk</code> in more
memory since it all correlates.</p>
<p>You, too, can know it all.</p>
<h2 id="lua-again">lua (again)</h2>
<p>Way back in October of last year I started experimenting with
server-side scripting in membase. I spent a bit of time a couple of
weeks ago and actually got a pretty rich set of APIs and stuff built
and was able to demo complex server-side manipulation and batch
processing with a dynamically extensible client.</p>
<p>There were lots of fun challenges inside of this (e.g. multiple
threads operating on lua concurrently while being able to define
global functions available to all future sessions while blocking
accidental global variable definition). I’m hoping to get it into
some products as a way to enhance testing and more quickly extend
client functionality (though ideally users won’t even know about it).</p>
<p>Glue code can be a lot of fun. I’ve really got a lot more to say
about this one, too.</p>
<h2 id="couchconf">couchconf</h2>
<p>I did a CouchConf talk on jQuery in San Francisco. There are a <a href="http://www.couchbase.com/news-and-events/events">lot
more coming</a>, so I might be doing more of this on
different topics very soon.</p>
<p>For this one, I did a bunch of prep work, thought about what I was
going to do, etc… Then I threw it all away and finalized my
presentation with <a href="http://prezi.com/eix3nsat9kt8/couchbase-jquery-and-you/">prezi</a>.</p>
<p>I wanted to demo some of the stuff I was talking about, so I thought
of something new and literally <a href="https://github.com/dustin/couch-bitcoin">wrote the app</a> on the ride
up to the show (and kept enhancing it until my talk).</p>
<p>The app itself isn’t that exciting – it shows realtime trades of
bitcoin across all exchanges (data populated by <a href="https://gist.github.com/1093582">a little go
app</a> that I threw together to replace a python one I didn’t
like).</p>
<h2 id="git-tree-hash-based-positive-test-result-memoization">git tree hash based positive test result memoization</h2>
<p>I can’t remember what I was doing, but it involved running a bunch of
tests with <a href="http://dustin.github.com/2010/03/28/git-test-sequence.html">git test sequence</a> and they weren’t fast enough.
Someone had given me some memoization code a while back, but it was
commit based and used refs and I think it lacked a bit of perfection.</p>
<p>To speed up my testing on an evolving tree, I <a href="https://github.com/dustin/bindir/compare/8026fd8435...b036ad02c5">updated
it</a> to record successful test results by the test issued
and the tree hash into the object store directly. This means there
are dangling objects that will eventually be cleaned up, but this is
going to happen around the time that you don’t care about those test
results anyway, so it’s perfect.</p>
<p>Everyone: abuse the git object store to make your lives better.</p>
<h2 id="leveldb---first-impressions">leveldb - first impressions</h2>
<p>Somewhere along the way, I wrote a backend for membase that stores
data in <a href="http://code.google.com/p/leveldb/">leveldb</a>. It’s a lot of fun to work with, and as
you can see in <a href="http://prezi.com/yg1igorplxii/leveldb-first-impressions/">my results</a>, it’s up to 10x faster than
our highly tuned SQLite on inserts on SSD (less on EC2, but that’s
OK). It’s certainly more consistent.</p>
<p>It’s not <em>perfect</em>, though. It’s not really any faster on fetches or
commits (sometimes a lot slower) and I’ve got <a href="http://code.google.com/p/leveldb/issues/detail?id=34">one crashing
bug in leveldb</a>, that I need to get out of the way before
I can even show it to anyone.</p>
<h2 id="the-programming-challenge">the programming challenge</h2>
<p>I got to review some code from a candidate a few times for a fairly
simple “let me see your handwriting” kind of coding question we pass
around.</p>
<p>The basic problem involves reading a file with a bunch (~1M) of usernames
into a structure in memory and providing a function to tell you
whether a given user is in it.</p>
<p>Candidates invariably want to build something like a trie, but often
fall into this trap where they believe pointers don’t take up space
(even arrays of pointers, apparently). I wanted to give it a go, so I
did a plain C implementation that ended up being fewer lines of code
than what we’d got from our candidates, used about 10% of the memory
and ran many times faster.</p>
<p>I sucked some other people’s time into this as well. If you’d like to
offer a suggestion, <a href="https://gist.github.com/1189242">here’s a tool</a> that generates 100M
usernames for you to try out. (I finally got to use a bloom filter,
yay! And no, the answer doesn’t involve a bloom filter unless you
want to do a time/space tradeoff thing.)</p>
<p>If you want to try it, note that my program is under 100 lines of plain C
(no external libraries) and uses 1.26GB of RAM on my macbook for
100,000,000 usernames. It loads and indexes them in 84 seconds and
spends another 31 seconds verifying it can find each of the
100,000,000 users within that list (+ 4 more that are known to not be
there). Please do better than that.</p>
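<p>Without giving away the answer, the spirit of the pointer-free
approach (contiguous storage plus binary search, rather than a trie full
of pointer arrays) can be shown in a few lines of Python. This is my own
illustration, not the C solution.</p>

```python
import bisect

class UserSet:
    """Memory-lean membership test: one sorted list, binary search.
    No per-node pointers, just contiguous sorted storage."""
    def __init__(self, names):
        self.names = sorted(names)

    def __contains__(self, name):
        i = bisect.bisect_left(self.names, name)
        return i < len(self.names) and self.names[i] == name

# Tiny usage example with made-up names.
users = UserSet(["alice", "bob", "carol"])
assert "bob" in users
assert "mallory" not in users
```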
<h2 id="program-like-a-pirate">program like a pirate</h2>
<p>I’ve written a couple thousand lines of <a href="http://www.r-project.org/">R</a> in the last month or
so.</p>
<p>Here are some examples taken completely out of context. I also owe a
blog post on how incredibly easy it is to just throw all your data
into CouchDB and then look at it with R.</p>
<p><img src="/images/r/events.png" alt="events" /> <img src="/images/r/ram.png" alt="ram" />
<img src="/images/r/rand.png" alt="rand" /> <img src="/images/r/time.png" alt="time" />
<img src="/images/r/world.png" alt="world" /></p>
<h2 id="python-heatmap">python heatmap</h2>
<p>Some of my visualization work wasn’t just couchdb -> R to pdf or png
or something, but I wanted to do some slightly different stuff.</p>
<p>As part of my upcoming <a href="http://py.codeconf.com/">pycodeconf</a> talk on
<a href="/2010/10/27/breakdancer.html">breakdancer</a>, I wanted to come up with some
visualization on what it looked like for a cluster of tests to fail
out of a 130k test suite in a way that might be consumable by a
human. For this, I grabbed a <a href="https://github.com/chucknthem/heatmap">python heatmap</a>
implementation and hacked on it a bit to do what I wanted.</p>
<p>Well, that didn’t work for me all that well, but it did get me
interested in doing an interactive heatmap animation of geographical
density data changing over time, so I ended up doing that instead and
updating the library a little bit to do it. That was more
successful.</p>
<h2 id="couchdb-wikipedia">couchdb wikipedia</h2>
<p>Over the weekend, I loaded all of wikipedia (2011-09-01) into a
couchdb and added a geo index over all of the features I could find
that reference a “place.” It’s pretty fun to look around and find
articles by bounding box. It’s also pretty decently fast to see the
r-tree go with a bit over 300,000 points across the world.</p>
<p>Hopefully, I’ll end up making an offline wikipedia at some point, but
I’ve got a lot of projects lined up ahead of this.</p>
<h1 id="but-mostly">…but mostly</h1>
<p>It’s been customer and product work. I’ve got a variety of really
interesting problems at customer sites (e.g. a 100 microsecond SLA for
one, high volume realtime data analysis across many dimensions for
another).</p>
<p>We’re continuing to define and build <a href="http://www.unqlspec.org/">UnQL</a> for everybody and
attending and hosting lots of talks all over the place. I hope to see
you at some of them.</p>
<p>(and if you program in C, C++, objective C, java, go, erlang,
javascript, R, ruby, and/or python and want to help use and build some
awesome technologies, I could use some help)</p>
New Operations in Membase2011-06-07T00:00:00+00:00http://dustin.github.com/2011/06/07/new-ops<h1 id="new-operations-in-membase">New Operations in Membase</h1>
<p>We built a couple of new protocol operations for people building
applications. The general goal of adding an operation is to keep it
orthogonal to other commands while enhancing the functionality in a
way that lets you do things that couldn’t be done before, or at least
were common and difficult to do efficiently.</p>
<p>Here is a description of the new commands and an idea of how they
might be used.</p>
<div>
<img class="floatright" src="/images/synchronize.png" alt="synchronize!" />
</div>
<h2 id="sync">Sync</h2>
<p>The first new concept we introduced is a <code class="language-plaintext highlighter-rouge">sync</code> command for providing
a barrier where you wait for an application’s data to change state in
specific ways such as having an item change from a known value or
achieve a specified level of durability.</p>
<p>Quick background on how this works in membase (for which we
implemented <code class="language-plaintext highlighter-rouge">sync</code> to begin with): Membase’s engine has what is
effectively an air-gap between the network interface and the disk.
Operations are almost all processed from and to RAM and then
asynchronously replicated and persisted. Incoming items are available
for request immediately upon return from your mutation command
(i.e. the next request for a given key will return the item that was
just set), but replication and persistence will be happening soon.</p>
<p>The membase <code class="language-plaintext highlighter-rouge">sync</code> command is somewhat analogous to <a href="http://linux.die.net/man/2/fsync">fsync</a> or
perhaps <a href="http://linux.die.net/man/2/msync">msync</a> in that you can first freely lob items at
membase and verify that it’s accepted them at the lowest level of
availability. When you have stored a set of critical items, you can
then issue a <code class="language-plaintext highlighter-rouge">sync</code> command with the set of your critical keys and
required durability level and the server will block until this level
is achieved (or something happens that prevents us from doing so).</p>
<p>There were discussions about different semantics (such as a
fully-sync’d mode or a specific <code class="language-plaintext highlighter-rouge">set+sync</code> type command). While a
single <code class="language-plaintext highlighter-rouge">set+sync</code> command would be one fewer round trip than doing a
separate <code class="language-plaintext highlighter-rouge">set</code> and <code class="language-plaintext highlighter-rouge">sync</code>, it makes little difference in practice
since the typical effect of a <code class="language-plaintext highlighter-rouge">sync</code> command is a delay. A combined
command, however, would make it very difficult to do any sort of
practical batching or pipelining; with separate commands, one can sync
after every command, after a large batch, or on select items from
within a large batch.</p>
<h3 id="what-can-you-sync-on">What can you Sync On?</h3>
<p>The specification permits a given set of keys to be monitored for one
of the following state changes:</p>
<ol>
<li>Wait for Replication</li>
<li>Wait for Persistence</li>
<li>Wait for Replication <em>and</em> Persistence</li>
<li>Wait for Replication <em>or</em> Persistence</li>
<li>Wait for Mutation</li>
</ol>
<p>There’s also space for a lightly discussed “any vs. all” flag for the
keys where you can hand the server a set of keys and be informed as
soon as any one of them changes to the desired state instead of
waiting for all of them.</p>
<h3 id="example">Example</h3>
<p>Given a giant sack of items, with a mix of important items (want
stored) and really important items (must guarantee are stored before
returning), let’s do the right thing.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">store_stuff</span><span class="p">(</span><span class="n">items</span><span class="p">):</span>
<span class="s">"""Store a collection of items.
Items will be stored asynchronously, then important items
will be synchronized on before returning."""</span>
<span class="n">important</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span>
<span class="n">mc</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">i</span><span class="p">.</span><span class="n">key</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">flags</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="k">if</span> <span class="n">i</span><span class="p">.</span><span class="n">important</span><span class="p">:</span>
<span class="n">important</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="n">mc</span><span class="p">.</span><span class="n">sync_replication_or_persistence</span><span class="p">(</span><span class="n">important</span><span class="p">)</span></code></pre></figure>
<p>(note that a python client supports these features, though not <em>exactly</em>
with this API; this should give you the basic idea)</p>
<p>Similarly, one can rate limit inserts such that items don’t go in
faster than they can be written to disk.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">store_stuff_slowly</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">sync_every</span><span class="o">=</span><span class="mi">1000</span><span class="p">):</span>
<span class="s">"""Store a collection of items without building a large
replication backlog."""</span>
<span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">mc</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="n">i</span><span class="p">.</span><span class="n">key</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">exp</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">flags</span><span class="p">,</span> <span class="n">i</span><span class="p">.</span><span class="n">value</span><span class="p">)</span>
<span class="k">if</span> <span class="p">(</span><span class="n">n</span> <span class="o">%</span> <span class="n">sync_every</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">mc</span><span class="p">.</span><span class="n">sync_replication</span><span class="p">(</span><span class="n">i</span><span class="p">)</span></code></pre></figure>
<p>Every <code class="language-plaintext highlighter-rouge">sync_every</code> item (default 1000) waits for synchronization to
catch up. Setting <code class="language-plaintext highlighter-rouge">sync_every</code> to one would cause us to fully
synchronize every item.</p>
<h2 id="touch">Touch</h2>
<p>We have heard from quite a few project owners that they’d like the
ability to have items with a sliding window of expiration. For
example, instead of having an item expire five minutes after its last
mutation (which is how you specify an object’s time-to-live today),
we’d like it to expire after five minutes of inactivity.</p>
<p>If you’re familiar with LRU caches (such as memcached), you should
note that this is semantically quite different from LRU. With an LRU,
we effectively don’t care about old data. The use cases for <code class="language-plaintext highlighter-rouge">touch</code>
require us to actively disable access to inactive data on a
user-defined schedule.</p>
<p>The <code class="language-plaintext highlighter-rouge">touch</code> command can be used to adjust expiration on an existing
key without touching the value. It uses the same type of expiration
definition all mutation commands use, but doesn’t actually touch the
data.</p>
<p>Similar to <code class="language-plaintext highlighter-rouge">touch</code> we added a <code class="language-plaintext highlighter-rouge">gat</code> (<code class="language-plaintext highlighter-rouge">get-and-touch</code>) command that
returns the data and adjusts the expiration at the same time. For
most use cases, <code class="language-plaintext highlighter-rouge">gat</code> is probably more appropriate than <code class="language-plaintext highlighter-rouge">touch</code>, but
it really depends on how you build your application.</p>
<h3 id="example-usage">Example Usage</h3>
<p>Usage of <code class="language-plaintext highlighter-rouge">touch</code> and <code class="language-plaintext highlighter-rouge">gat</code> are pretty straightforward. A really common
pattern might be storing session data where we want “idle” data to be
removed quickly, but active data to stick around as long as it’s
active.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_session</span><span class="p">(</span><span class="n">session_id</span><span class="p">,</span> <span class="n">max_session_age</span><span class="o">=</span><span class="mi">300</span><span class="p">):</span>
<span class="s">"""Get a valid session object for the given session ID.
Sessions will only live for five minutes.
Unauthenticated will be thrown if the session
can not be loaded."""</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">gat</span><span class="p">(</span><span class="n">session_id</span><span class="p">,</span> <span class="n">max_session_age</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">s</span><span class="p">:</span>
<span class="n">throw</span> <span class="n">Unauthenticated</span><span class="p">()</span>
<span class="k">return</span> <span class="n">s</span></code></pre></figure>
<p>This example showed a simple session loader that keeps the session
alive and signals missing sessions to another part of the application
stack that can deal with logins and stuff.</p>
<h1 id="availability">Availability</h1>
<p>We’ve been using this stuff, but we haven’t yet achieved universal
availability.</p>
<h2 id="servers">Servers</h2>
<p><a href="http://www.couchbase.org/products/membase/1-7-beta">Membase 1.7</a> provides this full <code class="language-plaintext highlighter-rouge">touch</code> and <code class="language-plaintext highlighter-rouge">gat</code>
functionality and partial <code class="language-plaintext highlighter-rouge">sync</code> functionality.</p>
<p>For <code class="language-plaintext highlighter-rouge">sync</code>, only waiting for replication is supported, and only a
single replica. The protocol allows for the tracking of up to 16
replicas, but membase as a cluster uses transitive replication so it’s
not possible to track when the second replica is complete from the
primary host (much less the sixteenth!).</p>
<p>Similarly, we’ve written most of the code for syncing on persistence,
but before our 2.0 storage strategies, we think it could be more
harmful than useful in most applications. Even with our 2.0
strategies, it’s likely that it’s not as appropriate as replication
tracking for all but the absolutely most important data.</p>
<p>Memcached 1.6(ish) has support for <code class="language-plaintext highlighter-rouge">touch</code> and <code class="language-plaintext highlighter-rouge">gat</code> in the default
engine (which also ships in Membase).</p>
<h2 id="clients">Clients</h2>
<p>In addition to <code class="language-plaintext highlighter-rouge">mc_bin_client.py</code> (which is a sort of
reference/playground client that ships with membase and we write many
of the tools with), we’ve got support in two clients so far, but we’re
considering the feature “evolving” as we’re trying to find the best
way to do it. Feedback is far more than welcome!</p>
<h3 id="java">Java</h3>
<p>spymemcached 2.7 has support for <code class="language-plaintext highlighter-rouge">touch</code>, <code class="language-plaintext highlighter-rouge">gat</code>, and <code class="language-plaintext highlighter-rouge">sync</code>.</p>
<h3 id="c-sharp">C Sharp</h3>
<p>The enyim C# client for memcached has support for <code class="language-plaintext highlighter-rouge">touch</code>, <code class="language-plaintext highlighter-rouge">gat</code>, and
<code class="language-plaintext highlighter-rouge">sync</code> in a release that should hit the shelves quite soon.</p>
Couchbase OSX2011-04-04T00:00:00+00:00http://dustin.github.com/2011/04/04/mac-couchbase<div>
<img src="/images/membase-server-osx.png" alt="membase/osx" title="The Status Menu" class="floatright" />
</div>
<p>About a year ago, <a href="http://trondn.blogspot.com/">Trond</a> asked me to build him a GUI tool for
running membase on his Mac. I finally got around to it and we liked
it enough that we’re making it <a href="http://www.couchbase.com/press-releases/Membase-for-Mac-OS-X">available</a> to everyone.</p>
<p>I’ve been on a cocoa development kick lately after doing some work
<a href="http://twitter.com/janl">Jan’s</a> CouchDBX for our <a href="http://www.couchbase.com/products-and-services/couchbase-server">Couchbase Server 1.1</a>
release. It was really quite awesome and easy to get people going. I
recommended it to all of my friends who were interested in CouchDB,
but it was not something I ever ran myself.</p>
<p>After releasing 1.1, I started thinking about what would really make
it better and put a bunch of time into something I’d run on my own
(both for development and my home production instance). The biggest
features I wanted for myself were the following:</p>
<ul>
<li>Minimal UI in the app (much prefer the status bar only)</li>
<li>Autorestart on failure (I like to kill my daemons randomly)</li>
<li>Easy start at login</li>
</ul>
<p>The minimal UI includes everything you need and nothing you don’t.</p>
<p>Firstly, it no longer brings its own web browser. You have one you
like, that’s the one I want you to use.</p>
<p>Also, I run on a headless server with minimal resources. I kill my
CouchDB randomly when it gets big, or whenever else I feel like doing
it. It’s a crash-only design, so why not let that happen and just
restart? If you have a persistent failure (i.e. it can’t run for at
least ten seconds), the server will pop up an alert box letting you
know, stop automatically restarting and wait for you to tell it to
retry or just give up.</p>
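<p>The restart policy itself is simple enough to sketch. This is not the app’s actual code (that’s Cocoa); it’s just the idea expressed in Python, using the ten-second threshold described above:</p>

```python
import subprocess
import time

MIN_HEALTHY_SECONDS = 10  # anything shorter counts as a persistent failure


def supervise(argv):
    """Restart a crash-only daemon until it starts failing persistently."""
    while True:
        started = time.time()
        rc = subprocess.call(argv)  # blocks until the process exits
        if time.time() - started < MIN_HEALTHY_SECONDS:
            # Persistent failure: stop auto-restarting and hand the
            # decision back (the real app pops an alert box here).
            return rc
        # It ran long enough to count as healthy; just restart it.
```

<p>Anything that exits after running a while is treated as a crash and restarted; anything that dies right away gets surfaced instead of looping forever.</p>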
<p>Similarly, whenever my machine finishes booting, I want it running my
server. Instead of hand-crafting a launchd config like I normally do,
I just check a box. Done!</p>
<p>“But wait,” you say, “what does this have to do with membase?” After
establishing the process monitoring framework, the desired
interaction, and the build system (really the worst part), it was
pretty obvious how to get membase running the same way. The build
process isn’t totally straightforward: lots of weird library work
required me to learn all about <a href="http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man1/install_name_tool.1.html">install_name_tool</a> and the magic
incantations for automatically discovering development-time
dependencies and packaging them up. And once you’ve built it, you
either have to write a launchd plist or sit in a terminal with the
thing running, managing its logs and all that kind of stuff. That’s
tedious.</p>
<p>In the end, there are two free packages ready for you that should work
exactly as you’d expect software to work on your Mac.</p>
<p>Get <a href="http://www.couchbase.com/products-and-services/couchbase-server">Couchbase Server</a> and <a href="http://www.couchbase.com/products-and-services/membase-server">Membase Server</a> and
instantly make all of your friends envious of how easily you can set
up scalable databases on your Mac.</p>
Using Dropbox as a Work Queue2011-02-27T00:00:00+00:00http://dustin.github.com/2011/02/27/dropbox-queue<div>
<a href="http://dustinphoto.couchone.com/photo-public/_design/app/index.html">
<img src="http://dustinphoto.couchone.com/photo-public/ebcf8d8b7e6c1a9680e2b108d2b0e1de/thumb.jpg" alt="photo" title="My dorkboard" class="floatright" />
</a>
</div>
<p>I’ve been writing myself a web-based photo album for over a decade
now. It’s gone through many different technologies over the years as
I used it to learn new ways of doing stuff.</p>
<p>I’m finally shedding the years of java from it to do something that’s
not only more lightweight, but also distributed. The purpose of this
post isn’t to describe the photo album, but a little bit of background
will help to understand the point of what’s going on.</p>
<p>Most recently, my photo album is implemented as a <a href="http://couchapp.org/">couchapp</a>
which has been huge fun.</p>
<h2 id="my-photo-album">My Photo Album</h2>
<div>
<img src="https://github.com/dustin/photo-couch/wiki/photo-ng.png" alt="architecture" title="how the photo album works" class="floatleft" />
</div>
<p>The <a href="http://github.com/dustin/photo-couch">current version</a> has a basic <a href="http://couchdb.apache.org/">CouchDB</a> backend
that lets me set up replication as shown in this image to the left
here. Photos can be added or edited on any server (with the exception
of the <a href="http://dustinphoto.couchone.com/photo-public/_design/app/index.html">public instance</a>, which is a replication target only)
and these adds and edits will eventually make their way to all of the
other servers.</p>
<p>Previously, pictures were added to the photo album by means of a
<a href="https://github.com/dustin/photoupload">standalone OS X app</a> that batched them into my local DB,
which I then replicated out to others.</p>
<p>However, I haven’t written such a thing for Android, so getting photos
I take on my phone while I’m out stored safely has been a bit more
difficult. This is where Dropbox comes in.</p>
<h2 id="dropbox-as-a-networked-spool">Dropbox As a Networked Spool</h2>
<div>
<img src="/images/photo-dropbox-rep.png" alt="submitting through dropbox" class="floatright" title="From my phone to you." />
</div>
<p>The Android app for Dropbox is really well done. Apps that can share
files get a share button that will place the content directly into
Dropbox folders.</p>
<p>Dropbox itself delivers the files into their locations
<a href="http://forums.dropbox.com/topic.php?id=21246">atomically</a> – that is, if a file doesn’t sync up to
dropbox in its entirety, it won’t be delivered, and a file will not
show up on any client machine until it’s properly fully down and can
be moved in atomically (via <code class="language-plaintext highlighter-rouge">rename(2)</code>).</p>
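<p>The same write-then-rename trick is easy to reproduce in your own code. A minimal sketch in Python (the function name is mine, not part of anything above):</p>

```python
import os
import tempfile


def publish_atomically(data, destination):
    """Make data appear at destination without any reader ever
    seeing a partial file."""
    dirname = os.path.dirname(destination) or '.'
    # Write everything to a temp file on the same filesystem first...
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # ...then publish it in one atomic step via rename(2).
        os.rename(tmp, destination)
    except BaseException:
        os.unlink(tmp)
        raise
```

<p>Consumers that only ever look at the destination path get the same guarantee Dropbox provides: complete files or nothing.</p>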
<p>This seems like something worth taking advantage of, and the most
obvious way for me was to create a <a href="http://en.wikipedia.org/wiki/Spooling">spool</a> within Dropbox.</p>
<h3 id="the-queue-processor">The Queue Processor</h3>
<p>I wrote a simple uploader that grabs as much info as it can from the
photos that appear via dropbox, uses <a href="http://www.pythonware.com/products/pil/">PIL</a> to do some scaling and
stuff and then sticks it in the local DB.</p>
<p>The recipe for safely and reliably processing items from a
<a href="http://en.wikipedia.org/wiki/Spooling">spool</a> is relatively simple and well-known. It is the
way print and mail processing systems have worked for decades. The
wikipedia overview does a better job of explaining the concept, but
I’ll go over the code I wrote for mine since I find it easier to
understand concepts with code than with English.</p>
<p>The actual queue processor based on the uploader is easy to understand
(almost the entire thing is shown below), but does have to take care
of failure modes, new things coming in while it’s processing, etc…
For this, I end up with three directories:</p>
<ol>
<li>The incoming directory</li>
<li>The work directory</li>
<li>The complete directory</li>
</ol>
<p>The processor atomically moves an item from the incoming directory
(inside Dropbox) to the work directory. It works on the item from
that location, then atomically moves it to the done directory once
it’s done.</p>
<p>Here’s what my spool processor looks like in python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">os</span><span class="p">,</span> <span class="nn">sys</span><span class="p">,</span> <span class="nn">traceback</span>
<span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">basename</span><span class="p">,</span> <span class="n">processing</span><span class="p">,</span> <span class="n">done</span><span class="p">):</span>
<span class="n">workfile</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">processing</span><span class="p">,</span> <span class="n">basename</span><span class="p">)</span>
<span class="n">donefile</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">done</span><span class="p">,</span> <span class="n">basename</span><span class="p">)</span>
<span class="c1"># Move from the incoming directory to the work directory
</span> <span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">basename</span><span class="p">,</span> <span class="n">workfile</span><span class="p">)</span>
<span class="c1"># Actual work happens here.
</span> <span class="n">doSomethingImportantWith</span><span class="p">(</span><span class="n">workfile</span><span class="p">)</span>
<span class="c1"># Move from the work directory to the complete directory
</span> <span class="n">os</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">workfile</span><span class="p">,</span> <span class="n">donefile</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">incoming</span><span class="p">,</span> <span class="n">processing</span><span class="p">,</span> <span class="n">done</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
<span class="n">os</span><span class="p">.</span><span class="n">chdir</span><span class="p">(</span><span class="n">incoming</span><span class="p">)</span>
<span class="k">for</span> <span class="n">basename</span> <span class="ow">in</span> <span class="n">os</span><span class="p">.</span><span class="n">listdir</span><span class="p">(</span><span class="s">'.'</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">process</span><span class="p">(</span><span class="n">basename</span><span class="p">,</span> <span class="n">processing</span><span class="p">,</span> <span class="n">done</span><span class="p">)</span>
<span class="k">except</span><span class="p">:</span>
<span class="n">traceback</span><span class="p">.</span><span class="n">print_exc</span><span class="p">()</span></code></pre></figure>
<p>I’m catching all exceptions at the toplevel loop and just logging
them. If anything goes wrong at any stage, the file will stay
wherever it was when it broke (usually the work directory).</p>
<p>Note that it <em>could</em> delete it when it’s done instead of just moving
it to a <code class="language-plaintext highlighter-rouge">done</code> directory, but I don’t want to automatically delete
stuff.</p>
<h3 id="invoking-the-processor">Invoking the Processor</h3>
<p>My script runs on Mac OS X, so I wrote a quick <a href="http://en.wikipedia.org/wiki/Launchd">launchd</a>
plist to read a sync dir and operate like a mail spool. Basically, if
nothing’s coming in, nothing happens.</p>
<p>My monitor plist looks something like the following:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="cp">&lt;!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd"&gt;</span>
<span class="nt"><plist</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">></span>
<span class="nt"><dict></span>
<span class="nt"><key></span>Label<span class="nt"></key></span>
<span class="nt"><string></span>net.spy.photoupload<span class="nt"></string></span>
<span class="nt"><key></span>ProgramArguments<span class="nt"></key></span>
<span class="nt"><array></span>
<span class="nt"><string></span>/path/to/script<span class="nt"></string></span>
<span class="nt"><string></span>/Users/me/Dropbox/incoming-photos<span class="nt"></string></span>
<span class="nt"><string></span>/Users/me/spool/work<span class="nt"></string></span>
<span class="nt"><string></span>/Users/me/spool/done<span class="nt"></string></span>
<span class="nt"></array></span>
<span class="nt"><key></span>WorkingDirectory<span class="nt"></key></span>
<span class="nt"><string></span>/<span class="nt"></string></span>
<span class="nt"><key></span>QueueDirectories<span class="nt"></key></span>
<span class="nt"><array></span>
<span class="nt"><string></span>/Users/me/Dropbox/incoming-photos<span class="nt"></string></span>
<span class="nt"></array></span>
<span class="nt"></dict></span>
<span class="nt"></plist></span></code></pre></figure>
<p>Place that in <code class="language-plaintext highlighter-rouge">~/Library/LaunchAgents</code> and load it with
<code class="language-plaintext highlighter-rouge">launchctl load ~/Library/LaunchAgents/net.spy.photoupload.plist</code> and
we’re up and monitoring (note that it’ll automatically load on boot).</p>
<p>If you wanted to run something similar on a system other than Mac OS
X, you could easily write a queue manager that monitors the directory
using <a href="http://en.wikipedia.org/wiki/Kqueue">kqueue</a> or <a href="http://en.wikipedia.org/wiki/Inotify">inotify</a> or even just a cron job
poking around looking for new stuff.</p>
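<p>The cron-job variant is only a few lines. A hedged sketch (<code class="language-plaintext highlighter-rouge">scan_once</code> is my name for it; <code class="language-plaintext highlighter-rouge">handle</code> would be something like the <code class="language-plaintext highlighter-rouge">process</code> function above):</p>

```python
import os


def scan_once(incoming, handle):
    """One pass over the spool directory; run it from cron, or wrap
    it in a sleep loop if you want something more daemon-ish."""
    names = sorted(os.listdir(incoming))
    for name in names:
        # handle() would be something like the process() function above.
        handle(os.path.join(incoming, name))
    return names
```

<p>Since <code class="language-plaintext highlighter-rouge">process</code> moves items out of the incoming directory as it handles them, running this again on an empty directory is a harmless no-op.</p>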
Maintaining a Set in Memcached2011-02-17T00:00:00+00:00http://dustin.github.com/2011/02/17/memcached-set<h1 id="maintaining-a-set-in-memcached">Maintaining a Set in Memcached</h1>
<div>
<img src="/images/simple.png" alt="simple" title="I found this in some old military archives." class="floatright" />
</div>
<p>This is something that comes up every once in a while. I usually
describe a means of doing it that I think makes sense, but I don’t
think I’ve ever described it <em>quite</em> well enough. People tend to
think it’s complicated or slow or things like that. I’m going to try
to solve that problem here.</p>
<h2 id="constraints">Constraints</h2>
<p>In order to be useful for enough applications, we’re going to work
under the following assumptions:</p>
<ul>
<li>must minimize round trips to the servers</li>
<li>O(1) add (for both current size and new items coming in)</li>
<li>O(1) remove (for both current size and items being removed)</li>
<li>O(1) fetch</li>
<li>lock and wait free</li>
<li>easy to use</li>
<li>easy to understand</li>
<li>no required explicit maintenance</li>
</ul>
<p>And, of course, it has to be web scale!</p>
<h2 id="ingredients">Ingredients</h2>
<p>The concept is simple and makes use of three memcached operations with
atomicity guarantees.</p>
<p>An index is created with <code class="language-plaintext highlighter-rouge">add</code>. This should be pretty obvious.</p>
<p>Whenever we need to add or remove items, we use <code class="language-plaintext highlighter-rouge">append</code>. For this to
work, we need to encode the items in such a way as to have them
represent either positive or negative items. I created a simple
sample encoding of <code class="language-plaintext highlighter-rouge">+key</code> to represent the addition of <code class="language-plaintext highlighter-rouge">key</code> to the
set and <code class="language-plaintext highlighter-rouge">-key</code> to represent the removal of <code class="language-plaintext highlighter-rouge">key</code> from the set. I then
use spaces to separate multiple items. Example: <code class="language-plaintext highlighter-rouge">+a +b +c -b</code>
represents <code class="language-plaintext highlighter-rouge">{a, c}</code>. The sequence is, of course, important.</p>
<p>A set that has members coming and going frequently enough may need to
be compacted. For that, we reencode the set and use <code class="language-plaintext highlighter-rouge">cas</code> to ensure
we can add it back without stepping on another client.</p>
<h2 id="walk-me-through-it">Walk Me Through It</h2>
<p>I’m using python for this example. Ideally this gets implemented in
your client and everything’s good to go.</p>
<p>First, we need encoders and decoders. This is actually the hard
part, and even that is really trivial when it comes down to it.</p>
<h3 id="the-encoder">The Encoder</h3>
<p>We start with the most basic representation of data within our sets.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">encodeSet</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="n">op</span><span class="o">=</span><span class="s">'+'</span><span class="p">):</span>
<span class="s">"""Encode a set of keys to modify the set.
>>> encodeSet(['a', 'b', 'c'])
'+a +b +c '
"""</span>
<span class="k">return</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">op</span> <span class="o">+</span> <span class="n">k</span> <span class="o">+</span> <span class="s">' '</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">keys</span><span class="p">)</span></code></pre></figure>
<p>This is more documentation than code, but it’s pretty clear. If you
want a set of JPEGs instead, you could create a simple binary encoding
with a length and a body instead of having it be whitespace separated.</p>
<h3 id="modifying-a-set">Modifying a Set</h3>
<p>Modification is append-only with the only difference between adding
and removing being an encoding op. This is useful because we can
write the same code for both cases.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">modify</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">,</span> <span class="n">op</span><span class="p">,</span> <span class="n">keys</span><span class="p">):</span>
<span class="n">encoded</span> <span class="o">=</span> <span class="n">encodeSet</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="n">op</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">mc</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">indexName</span><span class="p">,</span> <span class="n">encoded</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">KeyError</span><span class="p">:</span>
<span class="c1"># If we can't append, and we're adding to the set,
</span> <span class="c1"># we are trying to create the index, so do that.
</span> <span class="k">if</span> <span class="n">op</span> <span class="o">==</span> <span class="s">'+'</span><span class="p">:</span>
<span class="n">mc</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">indexName</span><span class="p">,</span> <span class="n">encoded</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">,</span> <span class="o">*</span><span class="n">keys</span><span class="p">):</span>
<span class="s">"""Add the given keys to the given set."""</span>
<span class="n">modify</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">,</span> <span class="s">'+'</span><span class="p">,</span> <span class="n">keys</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">remove</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">,</span> <span class="o">*</span><span class="n">keys</span><span class="p">):</span>
<span class="s">"""Remove the given keys from the given set."""</span>
<span class="n">modify</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">,</span> <span class="s">'-'</span><span class="p">,</span> <span class="n">keys</span><span class="p">)</span></code></pre></figure>
<p>I allow a side-effect of <code class="language-plaintext highlighter-rouge">add</code> to create the index if it doesn’t exist.</p>
<p>In an actual application, there’s a non-zero chance that the <code class="language-plaintext highlighter-rouge">append</code>
would fail because the item is missing and the immediately subsequent
<code class="language-plaintext highlighter-rouge">add</code> would fail due to a race condition. I didn’t write the code to
cover that here, but it’s pretty simple. If it matters to you, just
loop the entire modify method as long as both fail. You’d have to be
trying to get it to fail more than once.</p>
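<p>If that race does matter to you, the loop might look something like the following sketch. It assumes a client that raises <code class="language-plaintext highlighter-rouge">KeyError</code> for both a failed <code class="language-plaintext highlighter-rouge">append</code> and a failed <code class="language-plaintext highlighter-rouge">add</code>, which is only true of some clients:</p>

```python
def modify_with_retry(mc, indexName, op, keys, attempts=5):
    """Like modify() above, but loops until either the append or the
    add sticks, covering the append-miss/add-race window."""
    # Same encoding as encodeSet() above, inlined to keep this
    # snippet self-contained.
    encoded = ''.join(op + k + ' ' for k in keys)
    for _ in range(attempts):
        try:
            mc.append(indexName, encoded)
            return
        except KeyError:
            pass  # no index yet
        if op != '+':
            return  # removing from a set that doesn't exist is a no-op
        try:
            mc.add(indexName, encoded)
            return
        except KeyError:
            pass  # lost the creation race; go around and append again
    raise RuntimeError('could not modify ' + indexName)
```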
<h3 id="the-decoder">The Decoder</h3>
<p>In order to use the data, we’re going to need to decode it, so let’s
put together a quick decoder that can reverse what the above encoder
does (including the appends for add and remove).</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">decodeSet</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="s">"""Decode an item from the cache into a set impl.
Returns a dirtiness indicator (compaction hint) and the set
>>> decodeSet('+a +b +c -b -x')
(2, set(['a', 'c']))
"""</span>
<span class="n">keys</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="n">dirtiness</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">split</span><span class="p">():</span>
<span class="n">op</span><span class="p">,</span> <span class="n">key</span> <span class="o">=</span> <span class="n">k</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">k</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
<span class="k">if</span> <span class="n">op</span> <span class="o">==</span> <span class="s">'+'</span><span class="p">:</span>
<span class="n">keys</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">op</span> <span class="o">==</span> <span class="s">'-'</span><span class="p">:</span>
<span class="n">keys</span><span class="p">.</span><span class="n">discard</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="n">dirtiness</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">dirtiness</span><span class="p">,</span> <span class="n">keys</span></code></pre></figure>
<p>This is the most complicated part.</p>
<h3 id="retrieving-the-items">Retrieving the Items</h3>
<p>Now that we can encode, set, and modify our data, retrieval should be
quite trivial. A basic pass would look like this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">items</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">):</span>
<span class="s">"""Retrieve the current values from the set."""</span>
<span class="n">flags</span><span class="p">,</span> <span class="n">cas</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">indexName</span><span class="p">)</span>
<span class="n">dirtiness</span><span class="p">,</span> <span class="n">keys</span> <span class="o">=</span> <span class="n">decodeSet</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="k">return</span> <span class="n">keys</span></code></pre></figure>
<p>That’s pretty much it. However, this is a pretty good time to do
compaction. <code class="language-plaintext highlighter-rouge">dirtiness</code> above measures how many removal tokens are in
the set. If there are too many, we want to kill them.</p>
<p>Imagine a <code class="language-plaintext highlighter-rouge">DIRTINESS_THRESHOLD</code> setting that decides when we want
to do autocompaction. If we have more dirtiness than this, we
compact upon retrieval (making a single get into a single get and a
single <code class="language-plaintext highlighter-rouge">CAS</code>).</p>
<p>For this use case, we don’t actually care whether the <code class="language-plaintext highlighter-rouge">CAS</code> succeeds
most of the time, so we just fire and forget. It’s safe (i.e. won’t
destroy any data), but not guaranteed to work.</p>
<p>So here’s a modified <code class="language-plaintext highlighter-rouge">items</code> function conditionally compacting:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">items</span><span class="p">(</span><span class="n">mc</span><span class="p">,</span> <span class="n">indexName</span><span class="p">,</span> <span class="n">forceCompaction</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
<span class="s">"""Retrieve the current values from the set.
This may trigger a compaction if you ask it to or the encoding is
too dirty."""</span>
<span class="n">flags</span><span class="p">,</span> <span class="n">casid</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">mc</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">indexName</span><span class="p">)</span>
<span class="n">dirtiness</span><span class="p">,</span> <span class="n">keys</span> <span class="o">=</span> <span class="n">decodeSet</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="k">if</span> <span class="n">forceCompaction</span> <span class="ow">or</span> <span class="n">dirtiness</span> <span class="o">></span> <span class="n">DIRTINESS_THRESHOLD</span><span class="p">:</span>
<span class="n">compacted</span> <span class="o">=</span> <span class="n">encodeSet</span><span class="p">(</span><span class="n">keys</span><span class="p">)</span>
<span class="n">mc</span><span class="p">.</span><span class="n">cas</span><span class="p">(</span><span class="n">indexName</span><span class="p">,</span> <span class="n">casid</span><span class="p">,</span> <span class="n">compacted</span><span class="p">)</span>
<span class="k">return</span> <span class="n">keys</span></code></pre></figure>
<p>And we’re done.</p>
<h2 id="in-summary">In Summary</h2>
<p><strong>Worst case add to set</strong>: 2 round trips (when the set doesn’t exist and
needs to be created, but we don’t have to know that).</p>
<p><strong>Normal add to set</strong>: 1 round trip regardless of the number of
members being added. (You don’t even need to retrieve the current
value to correctly add or remove items in bulk, much less transfer it
all back).</p>
<p><strong>Worst case retrieval</strong>: 2 round trips (when compaction is a
side-effect).</p>
<p><strong>Normal retrieval</strong>: 1 round trip (just fetch the one key).</p>
<h3 id="caveats">Caveats</h3>
<p>While the number of sets is roughly unlimited, there’s a practical
size of a single set with this implementation. It’d be trivial to
“shard” the set across multiple keys (thus across multiple servers)
if one needed very large sets (more than, say, 4,000 250-byte items).</p>
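<p>The sharding itself would be a one-liner on top of the existing code: hash each member to a fixed sub-key and apply the add or remove there. A sketch (the helper name and shard count are mine):</p>

```python
import zlib

N_SHARDS = 16  # pick based on how big your sets get


def shard_key(indexName, key):
    """Map a member to one of N_SHARDS sub-keys. Adds and removes go
    to the member's shard; a full read fetches every shard key
    (ideally in a single multi-get) and unions the decoded sets."""
    shard = zlib.crc32(key.encode('utf-8')) % N_SHARDS
    return '%s/%d' % (indexName, shard)
```

<p>Because every operation on a given member always lands on the same sub-key, the atomicity story is unchanged; the set is just spread over more keys (and thus more servers).</p>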
<p>Since compaction is done on read in this implementation, a case where
you’re modifying very heavily but reading rarely might not be a
perfect fit for this code. In that case, I’d start compacting on random
writes (making worst case add/remove take about three hops where it
would’ve otherwise been one).</p>
Presenting - La Brea2010-12-03T00:00:00+00:00http://dustin.github.com/2010/12/03/labrea<h1 id="representing-la-brea">(re)Presenting La Brea</h1>
<div>
<img src="/images/labrea-bubble.jpg" alt="La Brea" class="floatright" />
</div>
<p>I realize it’s a little lame to have two posts in a row on the same
topic. I’ve done a lot of work here and the first one wasn’t very
well understood, so let’s just pretend like it never happened.</p>
<p>I thought perhaps I could make up for it slightly by using a medium I’m
not particularly accustomed to – a video slideshow thingy.</p>
<p>In my previous post, I described La Brea as a great development tool
for when your computer is too fast. There I demonstrated a use case
where I injected some calls and made stuff go slowly. I hand-waved
at some “future” directions for the project and thought people would
get it.</p>
<p>I got a lot of feedback, mostly telling me I should buy more, older
computers and carry them around, or that I can’t get a meaningful test
with a fault injection framework. This is clearly a failure on my
part to communicate my vision.</p>
<p>But we’ve already established that didn’t happen, so allow me to
introduce you to <a href="https://github.com/dustin/labrea">La Brea</a>.</p>
<iframe src="http://player.vimeo.com/video/17460485?title=0&byline=0&portrait=0&color=FF7700" width="640" height="480" frameborder="0"><p>:( frame no work</p></iframe>
<p>Feedback welcome.</p>
La Brea - Because Your Computer is Too Fast2010-11-12T00:00:00+00:00http://dustin.github.com/2010/11/12/labrea<h1 id="la-brea---because-your-computer-is-too-fast">La Brea - Because Your Computer is Too Fast</h1>
<div>
<img src="/images/labrea.jpg" alt="La Brea" class="floatright" />
</div>
<p>I’ve often thought that developers have machines that are entirely too
fast. In my case, I’ve got a relatively recent dual core machine with
an SSD. It’s awesome.</p>
<p>Except I find that some people running my software are running on
machines that are doing disk IO with considerably less capable disks.</p>
<p>In <a href="http://www.membase.org/">membase</a>, disk performance differences can be noticeable.
When I’m testing with an SSD and a customer is running with a 7200rpm
disk (it happens), I can’t see the kinds of situations they run into
from my development machine.</p>
<h2 id="how-do-i-slow-down">How Do I Slow Down?</h2>
<p>There are several options out there.</p>
<p>I played with <a href="http://cpulimit.sourceforge.net/">cpulimit</a> for a bit, but it was too coarse
and really did awful things on my mac since it is basically strobing
the process with <code class="language-plaintext highlighter-rouge">SIGSTOP</code> and <code class="language-plaintext highlighter-rouge">SIGCONT</code> on an interval.</p>
<p>I experimented with a <a href="http://linux.die.net/man/2/ptrace">ptrace</a>-based solution to allow me to
more granularly slow things down, but it didn’t help, and it turns
out that <a href="http://uninformed.org/index.cgi?v=4&a=3&p=14">ptrace is kind of broken on OS X</a> anyway, so it’s
neither a portable thing to do nor really all that useful.</p>
<p>So I wrote a <a href="https://github.com/dustin/labrea">library interposer</a>.</p>
<p>It was pretty cool to see it do stuff, but I didn’t want to tell
people to recompile the whole thing every time they wanted to change a
delay or something. I decided to toss in <a href="http://lua.org/">lua</a>.</p>
<h2 id="example">Example</h2>
<p>For example, what if you wanted a seek to take a full second 1% of the
time (and for the fun of it, log that it did). You can write the
following and feed it to <code class="language-plaintext highlighter-rouge">labrea</code>:</p>
<figure class="highlight"><pre><code class="language-lua" data-lang="lua"><span class="k">function</span> <span class="nf">before_lseek</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">whence</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">math.random</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span> <span class="o">==</span> <span class="mi">13</span> <span class="k">then</span>
<span class="nb">io.write</span><span class="p">(</span><span class="nb">string.format</span><span class="p">(</span><span class="s2">"Slowing a seek on fd=%d to %d (%d)\n"</span><span class="p">,</span>
<span class="n">fd</span><span class="p">,</span> <span class="n">offset</span><span class="p">,</span> <span class="n">whence</span><span class="p">))</span>
<span class="n">usleep</span><span class="p">(</span><span class="mi">1000000</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></figure>
<p>Now I can remember what it was like to have a rotating disk in my
laptop again.</p>
<h2 id="direction">Direction</h2>
<p>Right now, my immediate need is solved, but it’s pretty easy to add
functionality, so I’m thinking about making it a full-on fault
injection framework. I looked at <a href="http://blitiri.com.ar/p/libfiu/">fiu</a> briefly along my path and
found that it was pretty interesting, but didn’t work on MacOS and was
still a bit too invasive for where I wanted to be (which includes
doing random stuff to third-party apps).</p>
<p>In addition to the <code class="language-plaintext highlighter-rouge">before_lseek</code> as above, I would imagine an
<code class="language-plaintext highlighter-rouge">after_lseek</code> and perhaps even an <code class="language-plaintext highlighter-rouge">around_lseek</code> allowing for full
<a href="http://en.wikipedia.org/wiki/Aspect-oriented_programming">AOP</a> on your deployed C programs.</p>
<p>But for now, it just slows stuff down.</p>
<p>(<a href="https://github.com/dustin/labrea">source here</a>)</p>
How to Test Everything2010-10-27T00:00:00+00:00http://dustin.github.com/2010/10/27/breakdancer<h1 id="how-to-test-everything">How to Test Everything</h1>
<p>I recently had a <a href="http://www.membase.org/">membase</a> user point out a sequence of
operations that led to an undesirable state. I’ve got a lot of really
good engine tests I’ve written, but not <em>this</em> case:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>add with timeout -> wait for timeout -> add with timeout
</code></pre></div></div>
<p>The bug is pretty straightforward – expiry is lazy and it turns out
I’m not checking for expiry in this case. It was pretty easy to write
this test, but immediately made me think about what <em>other</em> cases
weren’t being run.</p>
<p>Now, I know there are countless tools out there to aid in testing.
I’ve written another one. I probably spent an hour or so writing a
framework to write and run all of the tests I needed. The difference
between what I’m describing here and, for example, <a href="http://www.haskell.org/haskellwiki/Introduction_to_QuickCheck">quick
check</a> is that I want something very simple to express
actions that expect their environment to be in a particular state and
will leave the environment in another state. Then I want to hit every
possible arrangement of these actions to ensure they don’t interfere
with each other in unexpected ways.</p>
<div>
<img src="/images/permutations.png" alt="Permutations" class="floatright" />
</div>
<p>This blows up very quickly – specifically the number of tests
generated for a test sequence of <code class="language-plaintext highlighter-rouge">n</code> actions from <code class="language-plaintext highlighter-rouge">a</code> possible actions
is approximately <code>a<sup>n</sup></code>.</p>
<p>Consider three defined actions permuted into sequences of two. That
blows out to nine possibilities as shown in the diagram on the right.</p>
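<p>The enumeration itself is easy to sketch. This is not BreakDancer’s
actual code, just the underlying combinatorics using Python’s standard
library:</p>

```python
# Enumerate every length-n sequence drawn from a possible actions.
# With a = 3 actions and n = 2, that's 3**2 == 9 sequences, matching
# the nine boxes in the diagram.
from itertools import product

actions = ['add', 'set', 'del']
sequences = list(product(actions, repeat=2))

print(len(sequences))   # 9
print(sequences[0])     # ('add', 'add')
```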
<p>The actions in the diagram are defined with memcached semantics on a
single key, so <code class="language-plaintext highlighter-rouge">add</code> has a prerequisite that the item <em>must not</em> exist
and <code class="language-plaintext highlighter-rouge">del</code> has a prerequisite that the item <em>must</em> exist.</p>
<p>The generated test expects success at each white box, failure at each
red box, and tracks the expected state mutations to build assertions.</p>
<div>
<img src="/images/breakdancer-exponentiality.png" alt="BreakDancer Growth" class="floatleft" />
</div>
<p>My first test… um, test ran with <code class="language-plaintext highlighter-rouge">11</code> actions in sequences of <code class="language-plaintext highlighter-rouge">4</code>
actions. I have more actions to go, but <code class="language-plaintext highlighter-rouge">4</code> is a pretty good length,
so the chart at the left is going to demonstrate my growth rate.</p>
<p>The awesome part is that it pointed out the original bug quite easily
and another couple of bugs with limited effort.</p>
<h2 id="how-do-i-use-this">How Do I Use This?</h2>
<div>
<img src="/images/action-life.png" alt="action lifecycle" class="floatright" />
</div>
<p>The API is so far pretty simple and composable. There are basically
five classes (three are shown in the image on the right).</p>
<h3 id="condition">Condition</h3>
<p>A <code class="language-plaintext highlighter-rouge">Condition</code> is a simple callable that is used for preconditions and
postconditions. A given class doesn’t care which one it’s used for,
and in many cases will be used for both.</p>
<p>For example, consider my implementation of <code class="language-plaintext highlighter-rouge">DoesNotExist</code>:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">DoesNotExist</span><span class="p">(</span><span class="n">Condition</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
<span class="k">return</span> <span class="n">TESTKEY</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">state</span></code></pre></figure>
<p><br /></p>
<h3 id="effect">Effect</h3>
<p>An <code class="language-plaintext highlighter-rouge">Effect</code> changes our view of the state (and depending on the
driver, may actually cause something in the world to change with it).
For example, the <code class="language-plaintext highlighter-rouge">StoreEffect</code> works as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">StoreEffect</span><span class="p">(</span><span class="n">Effect</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">state</span><span class="p">):</span>
<span class="n">state</span><span class="p">[</span><span class="n">TESTKEY</span><span class="p">]</span> <span class="o">=</span> <span class="s">'0'</span></code></pre></figure>
<p><br /></p>
<h3 id="action">Action</h3>
<p>An <code class="language-plaintext highlighter-rouge">Action</code> brings together an <code class="language-plaintext highlighter-rouge">Effect</code> and one or more <code class="language-plaintext highlighter-rouge">Condition</code>
classes as pre and post conditions. For example, we’ll look at two
actions, an <code class="language-plaintext highlighter-rouge">Add</code> action and a <code class="language-plaintext highlighter-rouge">Set</code> action:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Set</span><span class="p">(</span><span class="n">Action</span><span class="p">):</span>
<span class="n">effect</span> <span class="o">=</span> <span class="n">StoreEffect</span><span class="p">()</span>
<span class="n">postconditions</span> <span class="o">=</span> <span class="p">[</span><span class="n">Exists</span><span class="p">()]</span>
<span class="k">class</span> <span class="nc">Add</span><span class="p">(</span><span class="n">Action</span><span class="p">):</span>
<span class="n">preconditions</span> <span class="o">=</span> <span class="p">[</span><span class="n">DoesNotExist</span><span class="p">()]</span>
<span class="n">effect</span> <span class="o">=</span> <span class="n">StoreEffect</span><span class="p">()</span>
<span class="n">postconditions</span> <span class="o">=</span> <span class="p">[</span><span class="n">Exists</span><span class="p">()]</span></code></pre></figure>
<p>The interesting part of this is that <code class="language-plaintext highlighter-rouge">Set</code> and <code class="language-plaintext highlighter-rouge">Add</code> have different
semantics, but are expressed as different compositions of the same
<code class="language-plaintext highlighter-rouge">Condition</code>s and <code class="language-plaintext highlighter-rouge">Effect</code>s.</p>
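<p>To make the composition concrete, here is a minimal self-contained
sketch of the pattern. The class names mirror the ones above, but this
is a stand-in, not BreakDancer’s actual implementation:</p>

```python
# Stand-ins mirroring the Condition/Effect/Action pattern described above.
TESTKEY = 'testkey'

class Exists:
    def __call__(self, state):
        return TESTKEY in state

class DoesNotExist:
    def __call__(self, state):
        return TESTKEY not in state

class StoreEffect:
    def __call__(self, state):
        state[TESTKEY] = '0'

class Action:
    preconditions = []   # Conditions that must hold for success
    postconditions = []  # Conditions that must hold afterward
    effect = None

class Set(Action):
    effect = StoreEffect()
    postconditions = [Exists()]

class Add(Action):
    preconditions = [DoesNotExist()]
    effect = StoreEffect()
    postconditions = [Exists()]

def run(action, state):
    """Apply an action; return whether it was expected to succeed."""
    ok = all(p(state) for p in action.preconditions)
    if ok:
        action.effect(state)
        assert all(p(state) for p in action.postconditions)
    return ok

state = {}
assert run(Add, state)       # empty state: add succeeds
assert not run(Add, state)   # key exists now: add must fail
assert run(Set, state)       # set succeeds regardless
```

<p>The generator’s job is then just to walk every sequence of actions,
tracking the simulated state to decide which steps must succeed and
which must fail.</p>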
<h3 id="driver">Driver</h3>
<p><code class="language-plaintext highlighter-rouge">Driver</code> is kind of a larger part (seven defined methods!). It does
enough that I can do anything from generate a C test suite for
memcached engines all the way to actually executing tests across a
remote protocol.</p>
<p>I won’t describe the entire thing here since it’s documented in the
<a href="http://github.com/dustin/BreakDancer">source</a>. I will, however, close the loop by showing you
some example code that it generated that demonstrated the error we
failed to find in the first place:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">static</span> <span class="k">enum</span> <span class="n">test_result</span> <span class="nf">test_add_add_delay_add</span><span class="p">(</span><span class="n">ENGINE_HANDLE</span> <span class="o">*</span><span class="n">h</span><span class="p">,</span>
<span class="n">ENGINE_HANDLE_V1</span> <span class="o">*</span><span class="n">h1</span><span class="p">)</span> <span class="p">{</span>
<span class="n">add</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">h1</span><span class="p">);</span>
<span class="n">assertHasNoError</span><span class="p">();</span> <span class="c1">// value is "0"</span>
<span class="n">add</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">h1</span><span class="p">);</span>
<span class="n">assertHasError</span><span class="p">();</span> <span class="c1">// value is "0"</span>
<span class="n">delay</span><span class="p">(</span><span class="n">expiry</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="n">assertHasNoError</span><span class="p">();</span> <span class="c1">// value is not defined</span>
<span class="n">add</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">h1</span><span class="p">);</span>
<span class="n">assertHasNoError</span><span class="p">();</span> <span class="c1">// value is "0"</span>
<span class="n">checkValue</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">h1</span><span class="p">,</span> <span class="s">"0"</span><span class="p">);</span>
<span class="k">return</span> <span class="n">SUCCESS</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>That demonstrates how much information you know at each step of the
way. From there, we can do all kinds of stuff with our stubs (delay
above is implemented with the memcached testapp “time travel” feature,
for example).</p>
<p>From here, it’s less exciting. We provide constraints, it writes
tests, and we end up with one less area where users can encounter
something we haven’t seen before.</p>
Memcached Security2010-08-08T00:00:00+00:00http://dustin.github.com/2010/08/08/memcached-security<div>
<img src="/images/joe-psa.jpg" alt="Stay in School" title="Put reflectors on your bike or be runover" class="floatright" />
</div>
<p>Memcached security is a hot topic since the sensepost guys released
<a href="http://www.sensepost.com/blog/4873.html">go-derper</a> at blackhat.</p>
<p>The presentation was pretty good and informative, but it seems like
the hype around it has left a bunch of people confused. Although much
of this was covered in the presentation, it needs to be restated as
much as possible.</p>
<h2 id="first-and-always-firewall">First and Always, Firewall</h2>
<p>This is really part of the sysadmin placement test and has nothing to
do with memcached in particular, but I’m going to go ahead and mention
it anyway.</p>
<p>You always start by firewalling <em>everything</em> and then allowing only
stuff you need to pass through to the places you need it to pass
through.</p>
<p>I won’t teach you how to use your firewall, but start with the setting
that disables all connectivity to your box.</p>
<p>If you’re running a web server, allow connections to port <code class="language-plaintext highlighter-rouge">443</code>. If
you also want non-ssl connections, allow port <code class="language-plaintext highlighter-rouge">80</code>. If that’s the
only service you’re providing, then your firewalling is now complete!</p>
<p>I’d like to note that <a href="http://aws.amazon.com/ec2/">Amazon EC2</a> does this <em>by default</em>, yet
enough firewalls are misconfigured that they felt the need to send out
a form mail to many of their users to let them know that they “have at
least one security group that allows the whole internet to have access
to the port most commonly used by memcached (11211)”.</p>
<h2 id="check-your-bindings">Check Your Bindings</h2>
<p>If your application only runs on one server (with the app and
memcached on the same box), you can bind it to localhost by adding</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-l 127.0.0.1
</code></pre></div></div>
<p>to the memcached flags. Now even though you’ve firewalled access to
memcached, you have to be <em>on</em> the machine to even contact the cache
when someone breaks your firewall settings.</p>
<h2 id="if-you-need-it-use-sasl">If You Need It, Use SASL</h2>
<p>The latest versions of memcached support <a href="http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer">SASL</a> authentication.</p>
<p>Although you’ve already firewalled your memcached services off, you
can require clients to perform strong authentication before using the
service.</p>
<p>You can read more about setting this up in
<a href="http://code.google.com/p/memcached/wiki/SASLHowto">the SASL howto page of the wiki</a>.</p>
<h2 id="please-please-do-not-run-as-root">Please, <em>Please</em> Do Not Run as Root</h2>
<p>memcached <em>does not</em> want to run as root. It tries hard to prevent
this. Yet many people have a “workaround” that allows memcached to
start as root (which I will not repeat) just for the sake of making
their infrastructure less secure.</p>
<p>If someone somehow bypasses the firewall you have set up preventing
access to memcached and somehow manages to find a security hole in
memcached allowing code execution, do you <em>really</em> want to just hand
over root access?</p>
<p>There are no such known issues, but we don’t audit the code to ensure
it’s safe to run as root. That’s OK, though, because no responsible
sysadmin would ever run a service as root without very strong
justification, and probably a lot of work in creating a jailed
environment.</p>
<h2 id="check-your-firewall-settings">Check Your Firewall Settings</h2>
<p>Look, I’m not doubting that you know how to set up your firewall, but
just bear with me.</p>
<div>
<a href="http://nerduo.com/thebattle/"><img src="/images/thebattle.png" alt="Knowing" title="Knowing is Half the Battle" class="floatleft" /></a>
</div>
<p>Grab <a href="http://nmap.org/">nmap</a> or similar. Run a full port scan across your box –
one from a trusted system, one from a semi-trusted system, and one
from a completely untrusted system.</p>
<p>If there’s any response for any service you cannot justify running,
you now know about it and can fix it.</p>
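<p>A toy version of that check can be scripted. This is a loopback-only
illustration of the idea, not a substitute for running nmap from
trusted, semi-trusted, and untrusted hosts:</p>

```python
# Probe a handful of TCP ports and report which ones answer.
import socket

def open_ports(host, ports, timeout=0.25):
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means the connect worked
                found.append(port)
    return found

# Every port this prints is a service you need to be able to justify.
print(open_ports('127.0.0.1', [22, 80, 443, 11211]))
```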
<p>That’s not just memcached – that’s gearman, beanstalkd, snmpd, a mail
server, a DNS server, LDAP server, etc…</p>
<p>For any service you <em>do</em> have running and publicly available, make
sure you <em>completely</em> understand the security implications of running
this service.</p>
<p>Do not be embarrassed to ask if you don’t understand everything. It’s
a lot better than being an example in a presentation at the next black
hat because you’re running a service you didn’t intend to and you
leaked important information.</p>
Scaling Memcached with vBuckets2010-06-29T00:00:00+00:00http://dustin.github.com/2010/06/29/memcached-vbuckets<p>For years, people have used memcached to scale large sites.
Originally, there was a simple modulo selection hash algorithm that
was used. It still is used quite a bit actually and it’s quite easy
to understand (although it’s regularly shown that some people don’t
truly understand it when applied to their full system). The algorithm
is basically this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">servers</span> <span class="o">=</span> <span class="p">[</span><span class="s">'server1:11211'</span><span class="p">,</span> <span class="s">'server2:11211'</span><span class="p">,</span> <span class="s">'server3:11211'</span><span class="p">]</span>
<span class="n">server_for_key</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">=</span> <span class="n">servers</span><span class="p">[</span><span class="nb">hash</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">%</span> <span class="n">servers</span><span class="p">.</span><span class="n">length</span><span class="p">]</span></code></pre></figure>
<p>That is, given a hash algorithm, you hash the key and map it to a
position in the server list and contact that server for that key.
This is <em>really</em> easy to understand, but leads to a few problems.</p>
<ol>
<li>Having some servers have greater capacity than others.</li>
<li>Having cache misses skyrocket when a server dies.</li>
<li>Brittle/confusing configuration (broken things can appear to work)</li>
</ol>
<p>Ignoring weighting (which can basically be “solved” by adding the same
server multiple times to the list), the largest problem you’ve got is
what to do when a server dies, or you want to add a new one, or you
even want to <em>replace</em> one.</p>
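<p>That churn is easy to see in a toy simulation (hypothetical keys,
with CRC32 standing in for whatever hash the client actually uses):</p>

```python
# Problem 2 in numbers: the fraction of keys that map to a different
# server when a fourth server joins a modulo-hashed list of three.
from zlib import crc32

def server_for(key, n_servers):
    return crc32(key.encode()) % n_servers

keys = ['key:%d' % i for i in range(10000)]
moved = sum(server_for(k, 3) != server_for(k, 4) for k in keys)
print('%.1f%% of keys moved' % (100.0 * moved / len(keys)))
```

<p>With a well-distributed hash, a key stays put only when its hash is
congruent modulo both 3 and 4, so roughly three quarters of the keys
move, and every moved key is a cache miss.</p>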
<p>In 2007, <a href="http://www.metabrew.com/">Richard Jones</a> and crew over at <a href="http://www.last.fm/">last.fm</a>
created a new way to solve some of these problems called
<a href="https://web.archive.org/web/20120404050051/http://www.audioscrobbler.net/development/ketama/">ketama</a>. This was a library and method for “consistent
hashing” – that is, a way to greatly lower the probability of hashing
to a server that does not have the data you seek when the server list
changes.</p>
<p>It’s an awesome system, but I’m not here to write about it, so I won’t
get into the details. It still has a flaw that makes it unsuitable
for projects like <a href="http://www.membase.org/">membase</a>: it’s only probabilistically
more likely to get you to the server with your data. Looking at it
another way, it’s almost guaranteed to get you to the wrong server
sometimes, just less frequently than the modulus method described
above.</p>
<h1 id="a-new-hope">A New Hope</h1>
<p>In early 2006, Anatoly Vorobey introduced <a href="http://github.com/memcached/memcached/commit/7a308025661a49a5e19f98d2c5b8df04d96b4642">some
code</a> to create something he referred to as
“managed buckets.” This code lived there until late 2008. <a href="http://github.com/memcached/memcached/commit/04319dddabaa06d15407ab6f793b160d3b1c5edb">It was
removed</a> because it was never quite complete, not
understood at all, and we had created a newer protocol that made it
easier to build such things.</p>
<p>We’ve been bringing that back, and I’m going to tell you why it exists
and why you want it.</p>
<p>First, a quick summary of what we wanted to accomplish:</p>
<ol>
<li>Never service a request on the wrong server.</li>
<li>Allow scaling up <em>and</em> down at will.</li>
<li>Servers refuse commands that they should not service, <em>but</em></li>
<li>Servers still do not know about each other.</li>
<li>We can hand data sets from one server to another atomically, <em>but</em></li>
<li>There are no temporal constraints.</li>
<li>Consistency is guaranteed.</li>
<li>Absolutely no network overhead is introduced in the normal case.</li>
</ol>
<p>To expand a bit on the last point relative to other solutions we
looked at, there are no proxies, location services, server-to-server
knowledge, or any other magic things that require overhead. A vbucket
aware request requires no more network operations to find the data
than it does to perform the operation on the data (it’s not even a
single byte larger).</p>
<p>There are other more minor goals such as “you should be able to add
servers while under peak load,” but those just sort of fall out for
free.</p>
<h1 id="introducing--the-vbucket">Introducing: The VBucket</h1>
<p>A vbucket is conceptually a computed subset of all possible keys.</p>
<div>
<img src="/images/vbucket/vbucket.png" alt="vbucket visualized" class="floatright" />
</div>
<p>If you’ve ever implemented a hash table, you can think of it as a
virtual hash table bucket that is the first level of hashing for all
node lookups. Instead of mapping keys directly to servers, we map
vbuckets to servers statically and have a consistent key →
vbucket computation.</p>
<p>The number of vbuckets in a cluster remains constant regardless of
server topology. This means that key <code class="language-plaintext highlighter-rouge">x</code> always maps to the same
vbucket given the same hash.</p>
<p>Client configurations have to grow a bit for this concept. Instead of
being a plain sequence of servers, the config now also has the
explicit vbucket to server mapping.</p>
<p>In practice, we model the configuration as a server sequence, hash
function, and vbucket map. Given three servers and six vbuckets (a
very small number for illustration), an example of how this works in
relation to the modulus code above would be as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">servers</span> <span class="o">=</span> <span class="p">[</span><span class="s">'server1:11211'</span><span class="p">,</span> <span class="s">'server2:11211'</span><span class="p">,</span> <span class="s">'server3:11211'</span><span class="p">]</span>
<span class="n">vbuckets</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">server_for_key</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">=</span> <span class="n">servers</span><span class="p">[</span><span class="n">vbuckets</span><span class="p">[</span><span class="nb">hash</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">%</span> <span class="n">vbuckets</span><span class="p">.</span><span class="n">length</span><span class="p">]]</span></code></pre></figure>
<p>It should be obvious from reading that code how the introduction of
vbuckets provides tremendous power and flexibility, but I’ll go on in
case it’s not.</p>
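<p>To spell it out with the same toy configuration: growing the server
list only edits the vbucket map, while the key → vbucket
computation is untouched (CRC32 again stands in for the agreed-on
hash):</p>

```python
# Hand vbucket 5 from server3 to a new server4.  Only keys in that one
# vbucket change servers; every key -> vbucket mapping stays fixed.
from zlib import crc32

def vbucket_for(key, n_vbuckets=6):
    return crc32(key.encode()) % n_vbuckets

old_servers = ['server1:11211', 'server2:11211', 'server3:11211']
old_map = [0, 0, 1, 1, 2, 2]

new_servers = old_servers + ['server4:11211']
new_map = [0, 0, 1, 1, 2, 3]

keys = ['key:%d' % i for i in range(10000)]
moved = sum(
    old_servers[old_map[vbucket_for(k)]] != new_servers[new_map[vbucket_for(k)]]
    for k in keys)
print('%.1f%% of keys moved' % (100.0 * moved / len(keys)))
```

<p>Only about a sixth of the keys (one vbucket of six) change servers,
versus roughly three quarters under plain modulo hashing, and the keys
that do move can be handed over deliberately instead of simply becoming
misses.</p>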
<h2 id="terminology">Terminology</h2>
<p>Before we get into too many details, let’s look at the terminology
that’s going to be used here.</p>
<dl>
<dt>Cluster</dt>
<dd>A collection of collaborating servers.</dd>
<dt>Server</dt>
<dd>An individual machine within a cluster.</dd>
<dt>vbucket</dt>
<dd>A subset of all possible keys.</dd>
</dl>
<p>Also, any given vbucket will be in one of the following states on any
given server:</p>
<div>
<img src="/images/vbucket/states.png" alt="VBucket States" class="floatright" />
</div>
<dl>
<dt>Active</dt>
<dd>This server is servicing all requests for this vbucket.</dd>
<dt>Dead</dt>
<dd>This server is not in any way responsible for this vbucket</dd>
<dt>Replica</dt>
<dd>
No client requests are handled for this vbucket, but it can
receive replication commands.
</dd>
<dt>Pending</dt>
<dd>This server will block all requests for this vbucket.</dd>
</dl>
<h1 id="client-operations">Client Operations</h1>
<p>Each request must include the vbucket id as computed by the hashing
algorithm. We made use of the reserved fields in the <a href="http://code.google.com/p/memcached/wiki/MemcacheBinaryProtocol">binary
protocol</a> allowing for up to 65,536 vbuckets to be created
(which is really quite a lot).</p>
<p>Since all that’s needed to consistently choose the right vbucket is
for clients to agree on the hashing algorithm and number of vbuckets,
it’s significantly harder to misconfigure a server such that you’re
communicating with the wrong server for a given vbucket.</p>
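<p>For the curious, the vbucket id rides in the fixed 24-byte request
header of the binary protocol. The sketch below packs such a header in
Python, with field offsets taken from the protocol spec; it’s an
illustration of the wire format, not a client library:</p>

```python
import struct

def request_header(opcode, key, vbucket, extras=b'', value=b''):
    """Pack a memcached binary-protocol request header (24 bytes)."""
    body_len = len(extras) + len(key) + len(value)
    return struct.pack('>BBHBBHIIQ',
                       0x80,         # magic: request
                       opcode,       # e.g. 0x00 = GET
                       len(key),     # key length
                       len(extras),  # extras length
                       0x00,         # data type
                       vbucket,      # vbucket id: the 16 reserved bits
                       body_len,     # total body length
                       0,            # opaque (echoed back by the server)
                       0)            # CAS

hdr = request_header(0x00, b'somekey', vbucket=42)
assert len(hdr) == 24
# The vbucket id sits at byte offset 6, so a vbucket-aware request is
# not a single byte larger than one without it.
assert struct.unpack_from('>H', hdr, 6)[0] == 42
```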
<p>Additionally, with <a href="http://github.com/northscale/libvbucket">libvbucket</a> we’ve made distributing
configurations, agreeing on mapping algorithms, and reacting to
misconfigurations a problem that doesn’t have to be solved repeatedly.
Work is under way to get ports of libvbucket to Java and .NET, and in
the meantime <a href="https://github.com/couchbase/moxi">moxi</a> will perform all of the translations for you
if you have non-persistent clients or can’t wait for your favorite
client to catch up.</p>
<h2 id="one-active-server">One Active Server</h2>
<div>
<img src="/images/vbucket/oneserver.png" alt="One server, six vbuckets" class="floatright" />
</div>
<p>While deployments typically have 1,024 or 4,096 vbuckets, we’re going
to continue with this model with six because it’s a lot easier to
think about and draw pictures of.</p>
<p>In the image to the right, there is one server running with six active
buckets. All requests with all possible vbuckets go to this server,
and it answers for all of them.</p>
<h2 id="one-active-server-one-new-server">One Active Server, One New Server</h2>
<div>
<img src="/images/vbucket/one-quiesc.png" alt="One active server, one quiescent server" class="floatright" />
</div>
<p>Now let us add a new server. Here’s the first bit of magic: Adding a
server does not destabilize the tree (as seen on the right).</p>
<p>Adding a server to the cluster, and even pushing it out in the
configuration to all of the clients, does not imply it will be used
immediately. Mapping is a separate concept, and all vbuckets are
still exclusively mapped to the old server.</p>
<div>
<img src="/images/vbucket/transfer.gif" alt="Transferring vbuckets from one server to two" style="clear: right; padding-top: 10px;" class="floatleft" />
</div>
<p>In order to make this server useful, we will transfer vbuckets from
one server to another. To effect a transfer, you select a set of the
vbuckets that you want the new server to own and set them all to the
pending state on the receiving server. Then we begin pulling the data
out and placing it in the new server.</p>
<p>By performing the steps in this exact order, we are able to guarantee no
more than one server is active for any given vbucket at any given
point in time <em>without</em> any regard to actual chronology. That is, you
can have hours of clock skew and vbucket transfers taking several
minutes and never fail to be consistent. It’s also guaranteed that
clients will never receive <em>incorrect</em> answers.</p>
<div>
<a href="/images/vbucket/flow.png" title="Sequence diagram of a vbucket transfer."><img src="/images/vbucket/flow-small.png" alt="flow and what-not" class="floatright" /></a>
</div>
<ol>
<li>The vbucket on the new server is placed in a pending state.</li>
<li>A vbucket extract <a href="https://web.archive.org/web/20100803004713/http://blog.northscale.com/northscale-blog/2010/03/want-to-know-what-your-memcached-servers-are-doing-tap-them.html">tap</a> stream is started.</li>
<li>The vbucket tap stream atomically sets the state to dead when the
queue is in a sufficient drain state.</li>
<li>The new server only transitions from pending to active after it’s
received confirmation that the old server is no longer servicing
requests.</li>
</ol>
<p>Since subsections are being transferred independently, you no longer
have to limit yourself to thinking of a server moving at a time, but a
tiny fraction of a server moving at a time. This allows you to start
slowly migrating traffic from busy servers <em>at peak</em> to less busy
servers with minimal impact (with 4,096 vbuckets over 10 servers each
with 10M keys, you’d be moving about 20k keys at a time with a vbucket
transfer as you bring up your eleventh server).</p>
<p>You may notice that there is a time period where a vbucket has <em>no</em>
active server at all. This occurs at the very end of the transfer
mechanism and causes blocking to occur. In general, it should be rare
to observe a client actually blocked in the wild. This only happens
when a client gets an error from the old server indicating it’s done
prepping the transfer and can get to the new server before the new
server receives the last item. Then the new server only blocks the
client until that item is delivered and the vbucket can transition
from <code class="language-plaintext highlighter-rouge">pending</code> to <code class="language-plaintext highlighter-rouge">active</code> state.</p>
<p>Although the vbucket in the old server automatically goes into the
<code class="language-plaintext highlighter-rouge">dead</code> state when it gets far enough along, it <em>does not</em> delete data
automatically. That is explicitly done <em>after</em> confirmation that the
new node has gone <code class="language-plaintext highlighter-rouge">active</code>. If the destination node fails at any
point before we set it <code class="language-plaintext highlighter-rouge">active</code>, we can just abort the transfer and
leave the old server <code class="language-plaintext highlighter-rouge">active</code> (or set it back to <code class="language-plaintext highlighter-rouge">active</code> if we were
far enough along).</p>
<h1 id="whats-this-about-replica-state">What’s This About Replica State?</h1>
<p>HA comes up a lot, so we made sure to cover it. A <code class="language-plaintext highlighter-rouge">replica</code> vbucket
is similar to a <code class="language-plaintext highlighter-rouge">dead</code> vbucket from a normal client’s
perspective: all requests are refused, but replication commands are
allowed. It is also similar to the <code class="language-plaintext highlighter-rouge">pending</code> state in
that records are stored, but differs in that clients do not block.</p>
<div>
<img src="/images/vbucket/replica1.png" alt="One replica with three servers" class="floatright" />
</div>
<p>Consider the image to the right where we have three servers, six
vbuckets, and a single replica per vbucket.</p>
<p>Like the masters, each replica is also statically mapped, so they can
be moved around at any time.</p>
<p>In this example, we replicate the vbucket to the “next” server in the
list. i.e. an <code class="language-plaintext highlighter-rouge">active</code> vbucket on <code class="language-plaintext highlighter-rouge">S1</code> replicates to a <code class="language-plaintext highlighter-rouge">replica</code>
bucket on <code class="language-plaintext highlighter-rouge">S2</code> – same for <code class="language-plaintext highlighter-rouge">S2</code> → <code class="language-plaintext highlighter-rouge">S3</code> and
<code class="language-plaintext highlighter-rouge">S3</code> → <code class="language-plaintext highlighter-rouge">S1</code>.</p>
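<p>That placement rule is just another static map. A quick sketch of
computing it for the three-server, six-vbucket example:</p>

```python
# Each active vbucket on servers[i] gets its replica on servers[i + 1],
# wrapping around at the end of the list.
servers = ['S1', 'S2', 'S3']
active_map = [0, 0, 1, 1, 2, 2]   # vbucket -> owning server index

replica_map = [(i + 1) % len(servers) for i in active_map]

for vb, (a, r) in enumerate(zip(active_map, replica_map)):
    print('vbucket %d: active on %s, replica on %s'
          % (vb, servers[a], servers[r]))
```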
<h2 id="multiple-replicas">Multiple Replicas</h2>
<p>We also enable strategies to have more than one copy of your data
available on nodes.</p>
<p>The diagram below shows two strategies for three servers to have one
active and two replicas of each bucket.</p>
<h2 id="1n-replication">1:n Replication</h2>
<p>The first strategy (<code class="language-plaintext highlighter-rouge">1:n</code>) refers to a master servicing multiple
slaves concurrently. The concept here is familiar to anyone who’s
dealt with data storage software that allows for multiple replicas.</p>
<div>
<img src="/images/vbucket/replica-many.png" alt="Strategies for many-child replication" style="text-align: center; margin-left: auto; margin-right: auto" />
</div>
<h2 id="chained-replication">Chained Replication</h2>
<p>The second strategy (<code class="language-plaintext highlighter-rouge">chained</code>) refers to a single master servicing
only a single slave, but having that slave have a further downstream
slave of its own. This offers the advantage of having a single stream
of mutation events coming out of a server, while still maintaining two
copies of all records. This has the disadvantage of compounding
replication latency as you traverse the chain.</p>
<p>Of course, with more than two additional copies, you could mix them
such that you do a single stream out of the master and then have the
second link of the chain V out a <code class="language-plaintext highlighter-rouge">1:n</code> stream to two further servers.</p>
<p>It’s all in how you map things.</p>
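<p>To make the mapping concrete, here is a hypothetical sketch of the two
strategies for one active copy plus two replicas across three servers
(function and server names are mine, not from any real configuration):</p>

```python
servers = ["S1", "S2", "S3"]

def one_to_n(master):
    # 1:n - the master streams mutations directly to every replica.
    i = servers.index(master)
    return {master: [servers[(i + 1) % 3], servers[(i + 2) % 3]]}

def chained(master):
    # chained - the master streams to a single slave, which streams
    # onward to its own downstream slave.
    i = servers.index(master)
    first = servers[(i + 1) % 3]
    second = servers[(i + 2) % 3]
    return {master: [first], first: [second]}
```

Both maps yield three copies of each record; they differ only in how many outbound streams the master carries and how much latency accumulates before the last copy is written.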
<h1 id="acknowledgments">Acknowledgments</h1>
<p>Thanks to <a href="https://dormando.me">Dormando</a> for helping decipher the original
“managed bucket” code, intent, and workflows, and <a href="http://github.com/jayesh">Jayesh
Jose</a> and the other Zynga folks for independently discovering
it and working through a lot of use cases.</p>
What We're Doing in Memcached2010-04-09T00:00:00+00:00http://dustin.github.com/2010/04/09/memcached-status<div>
<img src="/images/being-transparent.png" alt="transparency" title="So what if I like Banksy?" class="floatright" />
</div>
<p>We’ve been steadily hacking on <a href="http://memcached.org/">memcached</a>.</p>
<p>We think it’s going very well, but we do want to make sure everybody
who cares has the opportunity to see what’s going on behind the
proverbial curtain.</p>
<p>The basic theme is to build a platform that allows a company to solve
its scaling problems without preventing you from solving your own.</p>
<h2 id="extensibility">Extensibility</h2>
<p>The biggest thing we’ve been working on is getting the storage engine
interface really solid. <a href="http://trondn.blogspot.com/">Trond</a> has been thinking about this
for <a href="http://blogs.sun.com/trond/entry/memcached_and_customized_storage_engines">two years</a> and did an excellent
<a href="http://blogs.sun.com/trond/entry/presentation_at_the_mysql_users">presentation</a> on an application of it at last year’s
MySQL User Conference.</p>
<p>Since then, we’ve applied it and adapted it to handle a few real-world
scenarios and have been pretty happy with the results.</p>
<p>We’re looking forward to fewer forks of memcached to solve one-off
problems and instead making it easier to bake a solution to your
problem directly into the standard code base.</p>
<p>Our hope is that this will lead to a variety of open source solutions
to common problems. For example, <a href="http://www.northscale.com/">NorthScale</a> released the
<a href="http://github.com/northscale/bucket_engine">bucket engine</a> that allows a single memcached instance to
support multiple logical engines.</p>
<h2 id="portability">Portability</h2>
<p><a href="http://patg.net/">Patrick</a> has done a ridiculous amount of work to get us to
the point where we can officially support Windows in a maintainable
way (i.e. does as little damage to the rest of the codebase as
possible).</p>
<p>This is another area where forks have existed to solve one-off
problems, but have been unable to track bug fixes and new features.</p>
<h2 id="documentation">Documentation</h2>
<p><a href="http://consoleninja.net/">dormando</a> has essentially been doing an informed rewrite of
the documentation to make it more approachable, more comprehensive,
and just generally more better.</p>
<p>The <a href="http://memcached.org/">new site</a> was the first part of this, and has been
pretty awesome, but the <a href="http://code.google.com/p/memcached/wiki/NewStart">wiki</a> reorg is almost done and I’m
pretty excited about that.</p>
<h2 id="releases">Releases</h2>
<p>The 1.4.5 release just shipped. We have plans for a 1.4.6 maintenance
release to clear up a bit more of the problems people have seen in the
field (mostly targeting people who run operating systems that won’t
update their libraries more than once a decade).</p>
<h2 id="come-join-us">Come Join Us</h2>
<p>If you’re somewhere around Santa Clara, come join us at the <a href="http://en.oreilly.com/mysql2010/">MySQL
User Conference</a>. We’ve got a lot of stuff we’ll be finishing
up and are able to answer any questions you might have about storing
data, perhaps even some about retrieving it.</p>
<p>We’re having a <a href="http://en.oreilly.com/mysql2010/public/schedule/detail/14627">bof</a> that you are personally invited to and I
believe I may be joining <a href="http://blog.northscale.com/northscale-blog/author/matt-ingenthron">Matt</a> for a talk about some of the
work we’ve been doing.</p>
Why I Don't Use Maven2010-04-01T00:00:00+00:00http://dustin.github.com/2010/04/01/why-not-maven<h1 id="why-i-dont-use-maven-for-my-java-projects">Why I don’t Use Maven for my Java Projects</h1>
<p>(and what you can do about it)</p>
<div>
<img src="/images/maven.png" alt="maven" class="floatright" />
</div>
<p>I used to really like <a href="http://maven.apache.org/">maven</a>. A long time ago. Around version 1.x.
It had lots of great features I liked in a build system and required
very little work to do just about anything. Then there was maven 2.</p>
<p>I often try to find an image to characterize my blog posts well. In
this case, I just took the first image I saw on the maven site itself
and felt that it pretty accurately summed my experience with maven.</p>
<p>It’s a guy sitting on a table with his back to his computer looking
out a window contemplating the jump.</p>
<p>“Will it hurt?” he asks himself. “Will it hurt more than writing one
more XML element describing how my <code class="language-plaintext highlighter-rouge">.java</code> files get turned into
<code class="language-plaintext highlighter-rouge">.class</code> files?”</p>
<p>OK, so perhaps it’s not as bad as the imagery on their site makes it out
to be, but I do have actual real reasons I’m not using it.</p>
<h2 id="why-not">Why Not?</h2>
<p>I’m going to assume you know the virtues of maven. I get complaints
from users of <a href="http://code.google.com/p/spymemcached/">my memcached client</a> for not directly
supporting their build tool of choice, so I’m just going to focus on
the parts that keep me away.</p>
<h3 id="build-de-automation">Build De-automation</h3>
<p>The absolute number one reason I’ve not put any effort into converting
my build into maven is because it would <em>increase</em> my work. Not just
to do the conversion, but in an ongoing way.</p>
<div>
<img src="/images/manual.png" class="floatleft" alt="manual labor" />
</div>
<p>I’ve asked many of the people who have wanted me to run my builds with
maven if they’d do the conversion for me. I had two small requests
that I didn’t think were unreasonable: It shouldn’t increase my work
and it shouldn’t reduce the features in my software.</p>
<p>Nobody has delivered this to me, and I can’t see a way to do it myself.</p>
<p>Apparently, The Maven Way is to edit <code class="language-plaintext highlighter-rouge">pom.xml</code> for every release to
put the version number into that file (then, of course, commit it to my
SCM), then build, and then do my normal tagging.</p>
<p>I have to write the version number out twice.</p>
<p>In contrast, here’s the line to figure out the version of the software
I’m producing from the <a href="http://github.com/dustin/java-memcached-client/blob/master/buildfile">buildfile</a> I use when
I build my project using <a href="http://buildr.apache.org/">apache buildr</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VERSION_NUMBER = `git describe`.strip
</code></pre></div></div>
<p>That’s how every version I’ve released has worked since I switched to
git. Before I used git with buildr, I used hg with buildr. Before I
used hg with buildr, I used hg with maven 1. Before hg and maven 1, I
used gnu arch and maven 1. Never in the history of this project have
I had to modify the build system when I change version numbers. I
think that’s an important feature.</p>
<h3 id="my-build-system-is-not-my-scm">My Build System is Not My SCM</h3>
<p>There’s a workaround for the above – you use the <a href="http://maven.apache.org/scm/plugins/index.html">maven SCM plugin</a>!</p>
<p>Except, it’s as backwards as a guy sitting on a desk facing away from
his computer.</p>
<div>
<img src="http://chart.apis.google.com/chart?cht=p&chs=250x140&chd=s:jSGBB&chl=Trond|Dustin|Sean|Patrick|Steve" class="floatright" alt="contributors" />
</div>
<p>The SCM plugin makes maven a user interface to my SCM. I cannot tell
you how much I don’t want another interface to my SCM. I can,
however, tell you that I carefully make my tags and I have tools that
do neat stuff with my carefully created tags like generate a nice
useful <a href="http://dustin.github.com/java-memcached-client/changelog.html">changelog</a>.</p>
<p>I don’t use my build tool to write my code. I certainly don’t use it
to commit my code or show me a diff.</p>
<p>What I <em>do</em> want is for it to appropriately interact with my SCM to
get the information it needs to do a build.</p>
<h3 id="missing-build-automation">Missing Build Automation</h3>
<p>One of the features they added to maven 2 is that you can’t script.
This you-can’t-script feature is great because, <a href="http://maven.apache.org/maven1.html#m1-maven-xml">according to the
site</a>, they believe it’s <em>better</em> to write a formal plugin
in java: not only is it less work for them (once), but you’ll
probably find someone else has had to do the same thing you’re doing,
and you’ll want to share it!</p>
<p>One of my projects has a rather large <code class="language-plaintext highlighter-rouge">maven.xml</code> file because it
builds a compiler code generator that it uses to generate code under
the current environment. It’s <em>very</em> specific to that build and the
code that goes with it. I’m not writing and hosting a full-fledged
plugin in java to do what I used to do with a little jelly script (and
ant before that and make before that).</p>
<h3 id="missing-build-plugins">Missing Build Plugins</h3>
<p>This is a half-answer, but I’m missing features I want. One could
write them, but nobody has. This, again, was a jelly script back in
the day, and is now a buildr plugin, but for any given java library I
write, you have the ability to do this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dustinnmb:/tmp 794% java -jar x.jar
spy.jar on Fri Nov 13 10:47:00 PST 2009
Build platform: java 1.6.0_15 from Apple Inc. on Mac OS X version 10.6.2
Tree version: 2.5rc1
(add -c to see the recent changelog)
</code></pre></div></div>
<p>That “Tree version:” listed there is straight out of the SCM. If you
add the <code class="language-plaintext highlighter-rouge">-c</code> option, you get what is effectively my git log. You can
take a file in isolation and know which bug fixes you have and all
kinds of other junk.</p>
<p>It’s not even clear to me how one would go about doing this in maven.</p>
<h3 id="real-benefits-have-unrelated-dependencies">Real Benefits Have Unrelated Dependencies</h3>
<div>
<img src="/images/unrelated.png" class="floatleft" alt="unrelated picture" />
</div>
<p>Most of the time, when people ask me for maven support it’s not
because of how I build my software. It’s not because they’re having
trouble building my software (though that does come up).</p>
<p>Most of the time, they want to download it from the internet.</p>
<p>It was trivial to host a maven 1 repo; maven 2 repos are a bit harder
because you have to generate a descriptor, a few other magic files,
and a deeper filesystem hierarchy for each build, and then tell users
how to configure your repo.</p>
<p>It’s far better to just stick them in the main, centralized
repositories, but you pretty much have to use maven itself to do
that.</p>
<p>It is of course <em>possible</em> to create artifacts using another tool
(such as <a href="http://buildr.apache.org/">apache buildr</a>) and get them uploaded into one of
the well-known maven repositories, but I don’t see any support for
this coming out of the maven community.</p>
<p>I just can’t express the absurdity of such a thing.</p>
<p>This is like someone telling you that you can’t distribute software in
ubuntu unless you use <code class="language-plaintext highlighter-rouge">bzr</code> as your revision control software. It
just doesn’t make sense.</p>
<h2 id="think-about-the-users">Think About The Users</h2>
<p>I don’t have anything against maven as a concept per se, but it’s not
right for me. Yay freedom of choice.</p>
<p>I would of course like my software to be easy to consume for people
who do choose maven. I’ve put effort into making this so, and that
effort has been slowly thwarted over time.</p>
<p>I intend to continue my efforts to make this work as easily as
possible for everyone, but if you find my software being built through
maven, it’s because of some hard work of others.</p>
git test-sequence -- Push Working Changes2010-03-28T00:00:00+00:00http://dustin.github.com/2010/03/28/git-test-sequence<h1 id="git-test-sequence-push-working-changes">git test-sequence: Push Working Changes</h1>
<div>
<img src="/images/mr-clean-small.png" alt="Herr Clean" class="floatright" />
</div>
<p>I interactively rebase my changes before I push them for two primary
reasons:</p>
<ol>
<li>The changes I push must concisely represent the changes I intended
to make.</li>
<li>Each change should be tested (i.e. don’t break the build)</li>
</ol>
<p>It’s easy enough to rewrite stuff interactively and squash and
what-not to end up with a readable history, but when you actually want
to try things out, it can be a pain.</p>
<p>Assume the upstream tree is in a working state. You want to push two
changes. The first one adds a feature and the second one uses it.
The second one also accidentally fixes a bug in the first one. That
appears OK because the tree builds before and after your push.</p>
<p>“But I didn’t break the build,” you say. “What’s the problem?” You
broke <em>a</em> build. Next time someone runs <code class="language-plaintext highlighter-rouge">git bisect</code>, he might
hit it. It will make him sad. Let us not make people sad. If we can
verify every change works, we will forever live harmonious lives.</p>
<p>I wrote a tool for git called <a href="http://github.com/dustin/bindir/blob/master/git-test-sequence">test-sequence</a> a long time ago
to support this very thing. When it is easy to verify a tree is
clean, we will at least try.</p>
<p>I’m writing about it today because it’s come up twice in conversation
this week. Once with a co-worker who was trying to prep a rather
large branch for review, and again in a <a href="http://stackoverflow.com/questions/2530015/any-tool-to-make-git-build-every-commit-to-a-branch-in-a-seperate-repository">stack overflow question</a>
where someone was eerily asking for exactly what I had written.</p>
<p>The concept is similar to an automated <code class="language-plaintext highlighter-rouge">git bisect</code> except it’s
linear. It will test <em>every</em> change between two points in the DAG.
It’ll even walk each side of a merge and test <em>those</em> changes
individually.</p>
<p>The <a href="http://stackoverflow.com/questions/2530015/any-tool-to-make-git-build-every-commit-to-a-branch-in-a-seperate-repository">stack overflow question</a> goes into a lot of details of the
why, so I’ll just talk about the how:</p>
<h2 id="using-git-test-sequence">Using git test-sequence</h2>
<p>First, put the <a href="http://github.com/dustin/bindir/blob/master/git-test-sequence">git-test-sequence</a> script somewhere in your
path.</p>
<p>Now, think about the stuff you want to test, how you want to test it,
etc…</p>
<p>The example I give in the script itself looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git test-sequence origin/master.. 'make clean && make test'
</code></pre></div></div>
<p>Since I’m using normal range operators, that should be pretty readable
to any git user. In this case, run <code class="language-plaintext highlighter-rouge">make clean && make test</code> for
every local commit that I’ve made since I last pushed to
<code class="language-plaintext highlighter-rouge">origin/master</code>.</p>
<p>You can go the other way, too:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git test-sequence ..origin/master 'make clean && make test'
</code></pre></div></div>
<p>…will there be any incoming changes that will break my build?</p>
<p>When combined with <a href="http://buildbot.net/">buildbot</a>, you get ludicrous power.
I’ve got a buildbot install with 21 builders currently. I’ve got 26
commits on a branch I’m moving forward. The following command will
test all 26 changes against each of the 21 builders (i.e. 546 builds
will be started):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git test-sequence origin/master.. 'buildbot try'
</code></pre></div></div>
<p>And no, the code didn’t work on all of the builders. Most of them in
fact. My screen was covered in growl alerts from
<a href="http://code.google.com/p/buildwatch/">buildwatch</a> letting me know that we broke something.</p>
<p>Before this is published, every build will work on every platform, and
it will be trivial to verify.</p>
conc Lists in Erlang2010-03-04T00:00:00+00:00http://dustin.github.com/2010/03/04/erlang-conc<h1 id="conc-lists-in-erlang---or-making-the-kessel-run-in-12-parsecs">conc Lists in Erlang - Or Making the Kessel Run in 12 Parsecs</h1>
<div>
<img src="/images/kesselrun.png" alt="Making the Kessel Run" class="floatright" />
</div>
<p>I saw this wonderful Guy Steele talk titled <a href="http://vimeo.com/6624203">Organizing Functional
Code for Parallel Execution</a>.</p>
<p>The talk describes how what we have learned to get us this far may be
preventing us from proceeding in the new world of many cores.</p>
<p>I don’t want to ruin the whole talk for you, you should watch it and
learn great things, but I found one thing quite inspirational. In his
description of behaviors of conc lists (which I’ll go into more
below), he described the time complexity of doing a mapreduce of his
conc list as logarithmic.</p>
<p>This blew my mind. mapreduce <code class="language-plaintext highlighter-rouge">n</code> items in <code class="language-plaintext highlighter-rouge">O(log n)</code> time. I was
fascinated. I could read the code, but I had to touch it, so I built
<a href="http://github.com/dustin/erl-conc">an implementation</a> of conc lists in erlang to try it out.</p>
<h2 id="quick-intro-to-cons-lists">Quick Intro to cons Lists</h2>
<p>conc lists are described in contrast to cons lists, so I’ll do a quick
review on those first.</p>
<p>A cons list is basically a pair. In lisp, the first element is called
a <code class="language-plaintext highlighter-rouge">car</code> and the second element is called a <code class="language-plaintext highlighter-rouge">cdr</code>, but we’re talking
about erlang here where you typically just pattern match stuff, so
I’ll just write the lists out in erlang notation and rip them apart
with pattern matching.</p>
<p>The empty list is represented as <code class="language-plaintext highlighter-rouge">[]</code>. It’s boring, but note that
it’s at the end of every list. So if you see a list that looks like
it has one element: <code class="language-plaintext highlighter-rouge">[a]</code>, the cons cell is really the pair of <code class="language-plaintext highlighter-rouge">a</code>
and <code class="language-plaintext highlighter-rouge">[]</code>.</p>
<p>In erlang, cons looks like this: <code class="language-plaintext highlighter-rouge">[SomeElement | RemainingList]</code>. You
can cons the single element above as <code class="language-plaintext highlighter-rouge">[a | []]</code>. A three element list
can be expressed as <code class="language-plaintext highlighter-rouge">[a, b, c]</code> but you can also think of the
implementation as <code class="language-plaintext highlighter-rouge">[a | [b | [c | []]]]</code>.</p>
<p>This is significant when it comes to implementing code to process cons
lists, because this is how you can model traversal operations such as
<code class="language-plaintext highlighter-rouge">map</code>, <code class="language-plaintext highlighter-rouge">foldl</code>, <code class="language-plaintext highlighter-rouge">length</code>, etc… in your head. Each splits the list
into the first element and the rest of the elements, performs an
operation on the first element and then recurses over the rest.</p>
<p>Length, for example, may be implemented as follows:</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="nb">length</span><span class="p">([])</span> <span class="o">-></span> <span class="mi">0</span><span class="p">;</span>                 <span class="c">% Empty list is terminal.</span>
<span class="nb">length</span><span class="p">([</span><span class="nv">Hd</span><span class="p">|</span><span class="nv">Tl</span><span class="p">])</span> <span class="o">-></span> <span class="mi">1</span> <span class="o">+</span> <span class="nb">length</span><span class="p">(</span><span class="nv">Tl</span><span class="p">).</span> <span class="c">% Recurse on the tail.</span></code></pre></figure>
<p>Now for the good parts…</p>
<h2 id="quick-intro-to-conc-lists">Quick Intro to conc Lists</h2>
<p>conc lists point in two directions instead of just one. They’re
really more like trees. Of course, there’s nothing stopping an
implementation from having further dimensions, but that’s an exercise
for the one doing the profiling.</p>
<p>Unlike a cons where you really only have two things (an element or the
empty list), in a conc, you have three:</p>
<div>
<img src="/images/conc-legend.png" alt="legend of the conc" class="centered" />
</div>
<p><code class="language-plaintext highlighter-rouge">null</code> is analogous to its cons empty cell. It is the end of the road
and contains no data and no pointers elsewhere.</p>
<p>A <code class="language-plaintext highlighter-rouge">singleton</code> contains exactly one element and nothing else.</p>
<p>A <code class="language-plaintext highlighter-rouge">concatenation</code> contains a left and a right pointer which each may
be to another concatenation, a singleton, or null.</p>
<p>Unlike a typical binary tree structure, these have no natural
ordering, so at this point, we consider every operation to be <code class="language-plaintext highlighter-rouge">O(n)</code>.
In a moment, we will begin violating this, but in the meantime, let’s
draw some pictures.</p>
<div>
<img src="/images/conc-unbalanced.png" alt="unbalanced conc list" class="floatright" />
</div>
<p>There are many ways to represent the same conc list. The original
presentation went over a bunch, but I’ll just illustrate two here.</p>
<p>We’ll use <code class="language-plaintext highlighter-rouge">[23, 47, 18, 11]</code> for the purpose of example, because he
did and he’s awesome.</p>
<p>To the right, we see what an unbalanced conc list representation of
our list might look like.</p>
<p>If you are familiar with trees, you can see that an in-order traversal
will visit each element in the same order as our cons representation
above.</p>
<div>
<img src="/images/conc-balanced.png" alt="balanced conc list" class="floatleft" />
</div>
<p>To our left, we see the same conc list in a well-balanced form.</p>
<p>And this is where things begin to get interesting.</p>
<p><br style="clear: both" /></p>
<h2 id="conc-list-processing">conc List Processing</h2>
<p>conc lists have two fundamental operations: <code class="language-plaintext highlighter-rouge">append</code> and <code class="language-plaintext highlighter-rouge">mapreduce</code>.</p>
<p><code class="language-plaintext highlighter-rouge">append(A, B)</code> can be thought of as building a concatenation of A
and B (though we do something a bit more optimal in the implementation to
avoid unnecessary concatenation cells).</p>
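<p>That optimization is simple enough to sketch. Here is a hypothetical
Python rendering using tagged tuples for the three cell kinds (the names
are mine, not the Erlang module’s API):</p>

```python
NULL = ("null",)

def singleton(x):
    return ("singleton", x)

def append(a, b):
    # Avoid a needless concatenation cell when either side is empty.
    if a == NULL:
        return b
    if b == NULL:
        return a
    return ("concatenation", a, b)
```

Short-circuiting the empty cases keeps the tree from filling up with concatenation nodes that contribute nothing to traversal.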
<p><code class="language-plaintext highlighter-rouge">mapreduce</code> isn’t quite the same thing as a <code class="language-plaintext highlighter-rouge">lists:mapfoldl</code>, so don’t
let that confuse you. You can imagine a simple implementation of a
cons <code class="language-plaintext highlighter-rouge">mapreduce</code> looking like this:</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="nf">mapreduce</span><span class="p">(</span><span class="nv">MapFun</span><span class="p">,</span> <span class="nv">ReduceFun</span><span class="p">,</span> <span class="nv">Initial</span><span class="p">,</span> <span class="nv">List</span><span class="p">)</span> <span class="o">-></span>
<span class="nn">lists</span><span class="p">:</span><span class="nf">foldl</span><span class="p">(</span><span class="nv">ReduceFun</span><span class="p">,</span> <span class="nv">Initial</span><span class="p">,</span>
<span class="nn">lists</span><span class="p">:</span><span class="nf">map</span><span class="p">(</span><span class="nv">MapFun</span><span class="p">,</span> <span class="nv">List</span><span class="p">)).</span></code></pre></figure>
<p>Of course, we don’t do two passes through the list. In fact,
assuming we can keep our conc lists balanced, we get to perform
linear-complexity operations in logarithmic time.</p>
<p>First, let us consider a naive implementation of a conc <code class="language-plaintext highlighter-rouge">mapreduce</code>:</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="nf">mapreduce</span><span class="p">(_</span><span class="nv">Map</span><span class="p">,</span> <span class="p">_</span><span class="nv">Reduce</span><span class="p">,</span> <span class="nv">Id</span><span class="p">,</span> <span class="n">null</span><span class="p">)</span> <span class="o">-></span>
<span class="nv">Id</span><span class="p">;</span>
<span class="nf">mapreduce</span><span class="p">(</span><span class="nv">Map</span><span class="p">,</span> <span class="p">_</span><span class="nv">Reduce</span><span class="p">,</span> <span class="p">_</span><span class="nv">Id</span><span class="p">,</span> <span class="p">{</span><span class="n">singleton</span><span class="p">,</span> <span class="nv">I</span><span class="p">})</span> <span class="o">-></span>
<span class="nv">Map</span><span class="p">(</span><span class="nv">I</span><span class="p">);</span>
<span class="nf">mapreduce</span><span class="p">(</span><span class="nv">Map</span><span class="p">,</span> <span class="nv">Reduce</span><span class="p">,</span> <span class="nv">Id</span><span class="p">,</span> <span class="p">{</span><span class="n">concatenation</span><span class="p">,</span> <span class="nv">A</span><span class="p">,</span> <span class="nv">B</span><span class="p">})</span> <span class="o">-></span>
<span class="nv">Reduce</span><span class="p">(</span><span class="nf">mapreduce</span><span class="p">(</span><span class="nv">Map</span><span class="p">,</span> <span class="nv">Reduce</span><span class="p">,</span> <span class="nv">Id</span><span class="p">,</span> <span class="nv">A</span><span class="p">),</span>
              <span class="nf">mapreduce</span><span class="p">(</span><span class="nv">Map</span><span class="p">,</span> <span class="nv">Reduce</span><span class="p">,</span> <span class="nv">Id</span><span class="p">,</span> <span class="nv">B</span><span class="p">)).</span></code></pre></figure>
<p>The <code class="language-plaintext highlighter-rouge">length</code> operation, for example, just maps each element to 1 and
then reduces the sum of all mapped numbers. This includes the sum of
the sums as it walks up the tree. Consider the code for <code class="language-plaintext highlighter-rouge">length</code>:</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="nb">length</span><span class="p">(</span><span class="nv">L</span><span class="p">)</span> <span class="o">-></span>
<span class="nf">mapreduce</span><span class="p">(</span><span class="k">fun</span> <span class="p">(_</span><span class="nv">X</span><span class="p">)</span> <span class="o">-></span> <span class="mi">1</span> <span class="k">end</span><span class="p">,</span>
<span class="k">fun</span><span class="p">(</span><span class="nv">X</span><span class="p">,</span><span class="nv">Y</span><span class="p">)</span> <span class="o">-></span> <span class="nv">X</span><span class="o">+</span><span class="nv">Y</span> <span class="k">end</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="nv">L</span><span class="p">).</span></code></pre></figure>
<p>If you consider the balanced list from above, the process of
computing the length is to mapreduce the left path, then the right
and then reduce the results of each, or <code class="language-plaintext highlighter-rouge">(1 + 1) + (1 + 1)</code>.</p>
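<p>As a cross-check, the same recursion can be sketched in Python (a
hypothetical translation of the Erlang above, using tagged tuples for the
three cell shapes; the names are mine):</p>

```python
NULL = ("null",)

def mapreduce(mapf, reducef, identity, conc):
    tag = conc[0]
    if tag == "null":
        return identity
    if tag == "singleton":
        return mapf(conc[1])
    _, left, right = conc  # a concatenation cell
    return reducef(mapreduce(mapf, reducef, identity, left),
                   mapreduce(mapf, reducef, identity, right))

def length(conc):
    # Map every element to 1, reduce by summation.
    return mapreduce(lambda _: 1, lambda x, y: x + y, 0, conc)

# The balanced [23, 47, 18, 11] tree reduces as (1 + 1) + (1 + 1).
tree = ("concatenation",
        ("concatenation", ("singleton", 23), ("singleton", 47)),
        ("concatenation", ("singleton", 18), ("singleton", 11)))
```

Note that the reduce function here must be associative, since the shape of the tree, not the code, decides the grouping of the reductions.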
<h2 id="going-parallel">Going Parallel</h2>
<div>
<img src="/images/conc-parallel.png" alt="parallel conc processing" class="floatright" />
</div>
<p>The secret is that a balanced conc list may be split in half in
constant time.</p>
<p>In the <code class="language-plaintext highlighter-rouge">mapreduce</code> implementation, each concatenation is a point where
we can parallelize the computations of the mapreduce to the left and
the mapreduce to the right and then reduce them together as the
computation completes.</p>
<p>So in a practical application, <code class="language-plaintext highlighter-rouge">length</code> will <em>concurrently</em> compute
the left list’s <code class="language-plaintext highlighter-rouge">(1 + 1)</code> and the right list’s <code class="language-plaintext highlighter-rouge">(1 + 1)</code>, and then
reduce them with the final summation of <code class="language-plaintext highlighter-rouge">(2 + 2)</code> after they finish.</p>
<p>On a small scale, this looks boring, but consider the following
example:</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="nv">C</span> <span class="o">=</span> <span class="nn">conc</span><span class="p">:</span><span class="nf">from_list</span><span class="p">(</span><span class="nn">lists</span><span class="p">:</span><span class="nf">seq</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)).</span>
<span class="nn">timer</span><span class="p">:</span><span class="nf">tc</span><span class="p">(</span><span class="n">conc</span><span class="p">,</span> <span class="n">mapreduce</span><span class="p">,</span>
<span class="p">[</span><span class="k">fun</span><span class="p">(_)</span> <span class="o">-></span> <span class="nn">timer</span><span class="p">:</span><span class="nf">sleep</span><span class="p">(</span><span class="mi">1000</span><span class="p">),</span> <span class="mi">1000</span> <span class="k">end</span><span class="p">,</span>
<span class="k">fun</span><span class="p">(</span><span class="nv">X</span><span class="p">,</span> <span class="nv">Acc</span><span class="p">)</span> <span class="o">-></span> <span class="nv">X</span> <span class="o">+</span> <span class="nv">Acc</span> <span class="k">end</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="nv">C</span><span class="p">]).</span></code></pre></figure>
<p>Here, we take a 1,000 node conc list and perform a <code class="language-plaintext highlighter-rouge">mapreduce</code> on it
where the mapping function artificially takes 1 second (1000ms) to
complete, and then returns that same 1000 as the map result. The
reducer then adds all of those milliseconds together.</p>
<p>I’m using the <code class="language-plaintext highlighter-rouge">timer:tc</code> function here to show the magic of
auto-parallelization in the conc mapreduce. That unfortunately makes
the invocation look a bit funny, but anyone who’s programmed in erlang
has done plenty of similar <code class="language-plaintext highlighter-rouge">M,F,A</code> invocations, so I hope it’s not
terribly confusing.</p>
<p>The result of that call is <code class="language-plaintext highlighter-rouge">{4013058,1000000}</code> – that is,
4,013,058µs to perform 1,000,000ms worth of work.</p>
<p>The units are kind of confusing – complain to the erlang guys about
that. Convert them both to seconds and you’ll see that I did 1,000
seconds worth of work in 4 seconds (and you feel that when you do it
interactively).</p>
<h2 id="i-want-to-play">I Want to Play</h2>
<p>I’ve got a <a href="http://github.com/dustin/erl-conc">repo</a> up on github with my code. The mapreduce
parallelism seems about right (though it’s not depth-limited), but
rebalancing is suboptimal.</p>
<p>I look forward to your contributions to make this implementation
better, but thanks go to <a href="http://research.sun.com/people/mybio.php?uid=25706">Guy Steele</a> for the linked list of the
future.</p>
Running Processes2010-02-28T00:00:00+00:00http://dustin.github.com/2010/02/28/running-processes<div>
<img src="/images/startmeup.jpg" alt="If you start me up I'll never stop - Rolling Stones" title="I run this bitch, and I'm a keep running - Birdman and Lil' Wayne" class="floatright" />
</div>
<p>I keep suffering a lack of decent process supervision mechanisms
provided by operating systems I use.</p>
<p>Most systems and software seem to have this idea that a “start up
script” is the right approach.</p>
<p>This is horribly wrong as is demonstrated by <a href="http://god.rubyforge.org/">lots</a> and
<a href="http://mmonit.com/monit/">lots</a> and <a href="http://www.yolinux.com/HOWTO/Process-Monitor-HOWTO.html">lots</a> of tools that exist to do that on
top of so many really awful “ps via cron” kind of things people hack
together or process counts via <a href="http://www.nagios.org/">nagios</a> or whatever.</p>
<p>This post describes alternatives I’ve found to help relieve my
suffering, but has ended up exceedingly long. I’ll provide quick
links to specific sections:</p>
<h1 id="toc">TOC</h1>
<ul>
<li><a href="#howwrong">How Wrong Can You Get?</a></li>
<li><a href="#whatswrong">What's Wrong with monit/god/nagios/etc...</a></li>
<li><a href="#rightway">So What's the Right Way?</a>
<ul>
<li><a href="#osx">OSX</a></li>
<li><a href="#solaris">Solaris</a></li>
<li><a href="#ubuntu">Ubuntu</a></li>
<li><a href="#freebsd">FreeBSD</a></li>
<li><a href="#genericlinux">Generic Linux</a></li>
<li><a href="#thirdparty">Third Party Tools</a>
<ul>
<li><a href="#daemontools">daemontools</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<p><a name="howwrong"> </a></p>
<h2 id="how-wrong-can-you-get">How Wrong Can You Get?</h2>
<p>The worst one of these I’ve personally seen was a group of contractors
who pieced a system together via <a href="http://www.tibco.com/">tibco</a>. There was a computer
that had a process that needed to be monitored. There was another
computer that ran a monitor thing that would send out messages over
tibco that would be processed by an agent on the first machine that
would run a really long shell script that would do some <code class="language-plaintext highlighter-rouge">ps</code>, <code class="language-plaintext highlighter-rouge">grep</code>,
<code class="language-plaintext highlighter-rouge">awk</code> and other stuff to search for a process and then maybe restart
it before replying back across the tibco bus.</p>
<p>It did this about once a minute and used something like 60% of the CPU
to verify the machine was still running. They demonstrated this to me
and were all ready for me to tell them how amazed I was with their Rube
Goldberg sysadmin skills, but my response didn’t leave them with the
good feeling they sought:</p>
<p>“Wow. You reinvented <a href="http://en.wikipedia.org/wiki/Init">init</a> <em>and</em> cron, but managed to make
them both less reliable and consume more CPU than I could’ve
imagined.”</p>
<p><a name="whatswrong"> </a></p>
<h2 id="whats-wrong-with-monitgodnagiosetc">What’s Wrong With monit/god/nagios/etc…</h2>
<p>Sometimes, these tools can be used for good, but often, they’re
variations on the same theme as above. You’re polling your process
list, or a pid file, or whatever to try to see if another process is
running and then restarting it if it fails.</p>
<p>Many people seem to grab these tools with the goal of rerunning their
start script sometime after they’ve noticed the process is not
running. This is the mentality I’m hoping to correct.</p>
<p>And this brings me to my point…</p>
<p><a name="rightway"> </a></p>
<h2 id="so-whats-the-right-way">So What’s the Right Way?</h2>
<p>Don’t <em>start</em> programs, <em>run</em> programs.</p>
<p><a href="http://en.wikipedia.org/wiki/Init">init</a> already does this. It does it <em>very</em> efficiently, with
no CPU overhead in the general case, no latency in the exceptional
case, no custom scripts to write, and with absolute reliability (that
is, it won’t forget to run your command, and it won’t crash without
taking the entire operating system with it).</p>
<p>Unfortunately, init has historically not been very easy to use for
your own processes. I’ll break it down a bit here to help you keep
your stuff running.</p>
<p><strong>Note</strong>: In every case, it’s assumed that you have a program that
wants to run that does <em>not</em> daemonize on its own. Self-daemonizing
programs start you down the path to hell. You can’t use any sane
keepalive techniques so you have to resort to polling process lists or
checking the pid or something. Even managing that pidfile gets hard
when you combine it with things that change their own uid for safety
(because you should never run anything as root).</p>
<p><a name="osx"> </a></p>
<h3 id="mac-os-x">Mac OS X</h3>
<p>Mac OS X has <a href="http://developer.apple.com/macosx/launchd.html">launchd</a> which combines init, cron, inetd, and
a few other things rolled into one.</p>
<p>Launchd provides tremendous amounts of granularity over control of
processes and extends that to users themselves to run applications
consistently on events.</p>
<p>The following example shows how you can have a program that runs
consistently while you’re logged in. Place this in
<code class="language-plaintext highlighter-rouge">~/Library/LaunchAgents/com.example.someprogram.plist</code> and launchd
should do its magic to keep it going.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="cp"><!DOCTYPE plist PUBLIC
"-//Apple Computer//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd"></span>
<span class="nt"><plist</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">></span>
<span class="nt"><dict></span>
<span class="nt"><key></span>Label<span class="nt"></key></span>
<span class="nt"><string></span>com.example.someprogram<span class="nt"></string></span>
<span class="nt"><key></span>KeepAlive<span class="nt"></key></span>
<span class="nt"><true/></span>
<span class="nt"><key></span>RunAtLoad<span class="nt"></key></span>
<span class="nt"><true/></span>
<span class="nt"><key></span>ProgramArguments<span class="nt"></key></span>
<span class="nt"><array></span>
<span class="nt"><string></span>/path/to/someprogram<span class="nt"></string></span>
<span class="nt"></array></span>
<span class="nt"></dict></span>
<span class="nt"></plist></span></code></pre></figure>
<p>For disconnected use, you can build a similar plist and place it in
<code class="language-plaintext highlighter-rouge">/Library/LaunchDaemons/</code>, add a <code class="language-plaintext highlighter-rouge">UserName</code> key and have the
system maintain a program on behalf of another user for you quite
easily.</p>
<p>Much, much more can be done with launchd (such as run programs on
filesystem and network events). It’s open source and a horrible shame
it isn’t everywhere, as it really does provide a tremendous benefit.</p>
<p><a name="solaris"> </a></p>
<h3 id="solaris">Solaris</h3>
<p>Solaris has <a href="http://hub.opensolaris.org/bin/view/Community+Group+smf/WebHome">smf</a> which is similar to launchd, but provides even
more control over lifecycle, permissions, dependencies, fault
management and many other things.</p>
<p>If you’re using Solaris, you probably understand it already, otherwise
I’d recommend reading <a href="http://hub.opensolaris.org/bin/view/Community+Group+smf/WebHome">the smf docs</a> to understand how to use it
to manage your system.</p>
<p><a name="ubuntu"> </a></p>
<h3 id="ubuntu">Ubuntu</h3>
<p>Recent versions of ubuntu ship with <a href="http://upstart.ubuntu.com/">upstart</a>, which is the
least modern of the modern system facilities, but at least does allow
you to specify that you want to <em>run</em> an application instead of just
<em>start</em> it.</p>
<p>Here’s an example upstart script I wrote for a twisted app that I
needed to keep running on an ubuntu box:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh">description <span class="s2">"useful description"</span>
author <span class="s2">"Dustin Sallings <dustin@spy.net>"</span>
start on runlevel 2
start on runlevel 3
stop on runlevel 0
stop on runlevel 1
stop on runlevel 4
stop on runlevel 5
stop on runlevel 6
chdir /path/to/project/directory
<span class="nb">exec</span> /usr/bin/twistd <span class="nt">--uid</span><span class="o">=</span>daemonuser <span class="nt">--syslog</span> <span class="nt">-ny</span> project.tac
respawn</code></pre></figure>
<p>Note that this relies on <code class="language-plaintext highlighter-rouge">twistd</code> to change the uid since it’s more
straightforward than using <code class="language-plaintext highlighter-rouge">su</code> or <code class="language-plaintext highlighter-rouge">sudo</code> to change the userid before
invoking the start script.</p>
<p><a name="freebsd"> </a></p>
<h3 id="freebsd">FreeBSD</h3>
<p>FreeBSD, and most BSDs for that matter, have an init that will
supervise processes defined in <code class="language-plaintext highlighter-rouge">/etc/ttys</code>. This is about as
primitive as it can get, but it works fine.</p>
<p>For example, if you wanted to run <code class="language-plaintext highlighter-rouge">sshd</code> on a FreeBSD box and make
sure that it can never die, you could add the following to
<code class="language-plaintext highlighter-rouge">/etc/ttys</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sshd "/usr/local/etc/sshd_tty" unknown on
</code></pre></div></div>
<p>The script, <code class="language-plaintext highlighter-rouge">/usr/local/etc/sshd_tty</code> exists primarily to eat the
implicit argument init passes to the program it runs. For this, I
used the following script:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/sh</span>
<span class="nb">exec</span> /usr/sbin/sshd <span class="nt">-D</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">/etc/ttys</code> basically exists for <a href="http://en.wikipedia.org/wiki/Getty_(Unix)">getty</a> type services, but
it’s suitable for any other process that needs to be supervised.</p>
<p>After any modification to <code class="language-plaintext highlighter-rouge">/etc/ttys</code>, you must run <code class="language-plaintext highlighter-rouge">init q</code> for your
changes to take effect.</p>
<p><a name="genericlinux"> </a></p>
<h3 id="generic-linux">Generic Linux</h3>
<p>On any vanilla Linux system (that is, systems that don’t use a modern
init as well as other systems with similar init mechanisms), you can
do something similar to the above with <code class="language-plaintext highlighter-rouge">/etc/inittab</code>, although it has
no implicit argument, so you can directly invoke sshd.</p>
<p>For example, the following works on my RedHat 5.4 box:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sshd:2345:respawn:/usr/sbin/sshd -D
</code></pre></div></div>
<p>After adding this entry (and, of course, making sure you don’t already
have an sshd server running), you can run <code class="language-plaintext highlighter-rouge">/sbin/telinit q</code> to get it
to reload.</p>
<p><a name="thirdparty"> </a></p>
<h3 id="third-party-tools">Third Party Tools</h3>
<p>It’s possible to bring in a third-party program to supplement an init
that’s less awesome than <a href="http://developer.apple.com/macosx/launchd.html">launchd</a> and <a href="http://hub.opensolaris.org/bin/view/Community+Group+smf/WebHome">smf</a>. I
personally have limited experience doing this, but there’s one in
particular that I’ve been using and found to do a good job.</p>
<p><a name="daemontools"> </a></p>
<h4 id="daemontools">daemontools</h4>
<p><a href="http://cr.yp.to/daemontools.html">daemontools</a> is <a href="http://cr.yp.to/djb.html">djb</a> weird-ware that serves as an
excellent UNIX process supervision framework.</p>
<p>It’s a bit weird out of the box, though. It really wishes a file
structure existed that is just not natural on any UNIX system, but
some ports overcome that a bit.</p>
<p>However, it’s very composable and has a lot of small tools that each
do one thing really well.</p>
<p>I’ve got an OpenBSD machine that’s customized to the point where I
wrote my own <code class="language-plaintext highlighter-rouge">/etc/rc</code> from scratch. This machine runs my DNS and
DHCP services as well as a few other things around the house.</p>
<p>I used to have a problem where it’d lose power or something and one of
the processes wouldn’t come up cleanly – very often DHCP, which made
things rather difficult for me. I thought this would be a great place
to try daemontools for the first time.</p>
<p>It has so far made things much easier to deal with.</p>
<p>One service I have running, for example, is sample_devices from my
<a href="http://bleu.west.spy.net/~dustin/projects/ibutton/">ibutton</a> suite (thermometers → multicast, basically).
This needs to run as a user who can access <code class="language-plaintext highlighter-rouge">/dev/tty01</code> (I don’t run
stuff as root unless absolutely required) and has a bit of init.</p>
<p>For this to work, I just have to create a <code class="language-plaintext highlighter-rouge">run</code> file in
<code class="language-plaintext highlighter-rouge">/services/sample_devices</code> which looks like this:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/sh</span>
<span class="nb">mkdir</span> /tmp/sample
<span class="nb">chown </span>uucp /tmp/sample
<span class="nb">exec </span>setuidgid uucp /usr/local/sbin/sample_devices <span class="se">\</span>
<span class="nt">-b</span> /dev/tty01 <span class="nt">-c</span> /tmp/sample <span class="se">\</span>
<span class="nt">-m</span> 225.0.0.37 <span class="nt">-p</span> 6789 <span class="nt">-t</span> 64 <span class="nt">-s</span> 2121</code></pre></figure>
<p>Note that the <code class="language-plaintext highlighter-rouge">exec</code> is important: daemontools has a lot of control
utilities that need to know the pid of the actual running process, not
of a shell that started it. It’s good practice anyway.</p>
<p>The one thing I don’t like about daemontools is that the service
directories contain the startup scripts, control sockets, as well as
other lock states. It’d serve me better if my service definitions
lived somewhere in <code class="language-plaintext highlighter-rouge">/etc/</code> and the runtime control lived somewhere in
<code class="language-plaintext highlighter-rouge">/var/run</code>, but I’m pretty happy with the results.</p>
Countdown Project2009-12-31T00:00:00+00:00http://dustin.github.com/2009/12/31/countdown<p>OK. Last blog post of the year.</p>
<p>My TV is broken and my daughter said she wanted to celebrate the New
Year by watching some kind of ball dropping thing, so I had this sort
of “last minute” idea for something and searched around the house for
junk I could build it out of.</p>
<iframe width="420" height="315" src="//www.youtube.com/embed/Qv_ruj4lhXo" frameborder="0" allowfullscreen="1">
</iframe>
<p>A few LEDs, a servo, an <a href="http://www.moderndevice.com/products/bbb-kit">rbb</a> (<a href="http://www.arduino.cc/">arduino</a> clone) and,
um, whatever scrap cardboard I could find and I was good to go.</p>
<p>If something looks ghetto, that’s a feature. I strongly encourage
people like myself who have perfectionist tendencies to just try to
make doing things poorly a goal now and then. It helps.</p>
<p>I didn’t really consider this much of a sustainable project, so I
really kind of hacked it together. You can see the code that runs it
on <a href="http://gist.github.com/267014">github</a>.</p>
<div>
<img alt="hot mess" class="floatleft" src="/images/countdown-back-small.jpg" />
</div>
<p>The python script initializes communication, synchronizes (which has a
corresponding visual effect across the LEDs) and then calculates the
time until the nearest target time (I’ve been rounding to the nearest
15s, hour, whatever).</p>
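<p>The rounding itself is simple. Something along these lines (a minimal sketch of the idea, not the actual gist code) gives the delay to the next boundary:</p>

```python
def seconds_to_next(now, interval):
    """Seconds from `now` until the next multiple of `interval`
    (the next 15s boundary, the next hour, and so on)."""
    remainder = now % interval
    return interval - remainder if remainder else 0

# Four seconds before a 15-second boundary leaves a 4-second countdown.
print(seconds_to_next(1262303996, 15))  # 4
```
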
<p>One second before the countdown sequence starts, it pauses whatever
music I have playing and then turns the volume up.</p>
<p>After the command is sent, it waits a second and starts the music.
This causes the actual music to start just about in time for the clock
to strike midnight.</p>
<p>Then it blinks erratically.</p>
<p>It’s crap, but as it says in the comment in the python script, worse
is better.</p>
<h2 id="update-2010">Update (2010)</h2>
<div>
<img alt="arted" class="floatright" src="/images/countdown-arted-small.jpg" />
</div>
<p>I made a few changes to it before we got to use it (actually, I was
playing with it until about 15 minutes before midnight :/ ). Figured
I’d update this post with the details.</p>
<p>Summary of changes since the video:</p>
<ol>
<li>It got a new face (thanks, kids).</li>
<li>I manually specified the angles instead of computing them to feel
more organic (and matched what the kids did).</li>
<li>Control side pushed sync commands through every fifteen seconds
during the last ten minutes.</li>
<li>I added some countdown vocals on the computer side to start us
going at <code class="language-plaintext highlighter-rouge">t - 10s</code>.</li>
</ol>
<p>It was a great success. It’s now dismantled.</p>
Hello World in Go -- A Memcached Server2009-11-12T00:00:00+00:00http://dustin.github.com/2009/11/12/gomemcached<h1 id="hello-world-in-go--a-memcached-server">Hello World in Go – A Memcached Server</h1>
<div>
<img src="/images/gomemcached.png" alt="go memcached" class="floatright" />
</div>
<p>I sat down last night to learn <a href="http://golang.org/">go</a>. I’ve been a fan of
concurrency-oriented languages since I wrote my <a href="http://github.com/dustin/environ">first erlang
program</a> in early 2004 (which I still use). I’ve been
itching to give <a href="http://golang.org/">go</a> a go since the announcement.</p>
<p>The first thing I thought of that could naturally be modeled as a
concurrent program was a <a href="http://memcached.org/">memcached</a> server. That has sort
of become my “hello, world” as I have implemented the <a href="http://github.com/dustin/memcached-test">first python
server</a> for the memcached binary protocol (using asyncore), the
<a href="http://memcached.org/">main server</a> (which <a href="http://blogs.sun.com/trond/">Trond</a> made beautiful), an
<a href="/2009/10/11/ememcached.html">erlang server</a>, and a <a href="http://github.com/dustin/twisted-memcached">twisted server</a>.</p>
<p>My <a href="http://github.com/dustin/gomemcached">go server</a> has a lot in common with
<a href="/2009/10/11/ememcached.html">ememcached</a> since both languages are concurrency
oriented. There is one process/goroutine for managing the actual data
storage, one for accepting TCP connections, and one for each connected
user.</p>
<p>The TCP listener just accepts and spins up new processes or goroutines
to handle the IO on the connections and then they wait for data. Once
they’ve read a single request, they dispatch to the storage mechanism
which will sequentially process each operation, regardless of input
concurrency. Very simple, very easy to understand.</p>
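<p>That dispatch pattern isn’t go-specific. Here’s a rough Python equivalent (my own toy <code>StorageServer</code>, not the go code): any number of client threads submit operations through a queue, and one storage thread applies them strictly one at a time:</p>

```python
import queue
import threading

class StorageServer:
    """One thread owns the dict; every operation is funneled through a
    queue and applied sequentially, no matter how many clients call in."""

    def __init__(self):
        self._ops = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        data = {}  # only ever touched by this thread
        while True:
            op, args, reply = self._ops.get()
            if op == "set":
                data[args[0]] = args[1]
                reply.put(True)
            elif op == "get":
                reply.put(data.get(args[0]))

    def _call(self, op, *args):
        reply = queue.Queue(maxsize=1)
        self._ops.put((op, args, reply))
        return reply.get()

    def set(self, key, value):
        return self._call("set", key, value)

    def get(self, key):
        return self._call("get", key)

server = StorageServer()
server.set("greeting", "hello")
print(server.get("greeting"))  # hello
```

<p>Because the storage loop never runs two operations at once, no locking around the dict is needed, which is the same property the single storage goroutine buys in the go server.</p>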
<p>I mostly liked what I saw. The code feels a little C-like, but in
practice is fairly intuitive and pleasant to work with. There are
many <a href="http://scienceblogs.com/goodmath/2009/11/googles_new_language_go.php">reviews</a> of the language and I’m not attempting to write
one, but I will say that I got hung up on the lack of exceptions which
required me to do <a href="http://github.com/dustin/gomemcached/blob/master/mc_conn_handler.go#L72">weird</a> error handling in sequential code.</p>
<p>The concurrency primitives remind me a bit of <a href="http://en.wikipedia.org/wiki/Alef_%28programming_language%29">alef</a> which I
remember from early <a href="http://plan9.bell-labs.com/plan9/">plan9</a> distributions, though I never
actually got a chance to work with it before it got killed off. The
one part of alef I really liked lived on in go, which further
motivated me.</p>
ZFS for MacOS X2009-10-23T00:00:00+00:00http://dustin.github.com/2009/10/23/mac-zfs<h1 id="zfs-for-macos-x">ZFS for MacOS X</h1>
<div>
<img src="/images/zfs.png" alt="ZFS ROX" class="floatright" />
</div>
<p>So I’m a pretty heavy <a href="http://en.wikipedia.org/wiki/ZFS">ZFS</a> user. I’ve got a FreeBSD box with a ZFS
root that all my important stuff is on, and a mac mini with ZFS that
all my movies are on.</p>
<p>Around Snow Leopard, Apple seemed to have not only dropped support for
it, but the download from <a href="http://macosforge.org/">macosforge</a> seemed to not work,
either.</p>
<p>I’ve been hoping someone would come around and fix this, but nothing’s
happened.</p>
<p>Then today, I read this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The ZFS project has been discontinued. The mailing list and
repository will also be removed shortly.
</code></pre></div></div>
<p>That made me very sad, so I decided to do something about it.</p>
<h2 id="github-project">Github Project</h2>
<p>I set up a <a href="http://github.com/dustin/mac-zfs">mac-zfs</a> project on github and started hacking. It
didn’t take long and I was able to get to my zpool that has all of my
music. (For the record, Bad Religion was my ZFS test case.)</p>
<h2 id="downloads">Downloads</h2>
<div>
<a href="http://cloud.github.com/downloads/dustin/mac-zfs/ZFS-119-SnowLeopard.pkg">
<img src="/images/pkg.png" alt="Install Me" class="floatright" />
</a>
</div>
<p>I wanted to make it a bit easier to get going, so I created an
<a href="http://cloud.github.com/downloads/dustin/mac-zfs/ZFS-119-SnowLeopard.pkg">installer</a>.</p>
<p>You still need to kind of know what you’re doing to make use of it.
But I’m hoping we can work together to make it easy for everybody.</p>
<h2 id="update">Update:</h2>
<p>We’ve got a <a href="http://code.google.com/p/maczfs/">google code</a> page up for bug tracking and
what-not and a <a href="http://groups.google.com/group/zfs-macos">mailing list</a> for general discussion.</p>
<p>Come join us.</p>
Memcached Report Card2009-10-22T00:00:00+00:00http://dustin.github.com/2009/10/22/memcached-reportcard<h1 id="memcached---report-card-for-an-open-source-project">Memcached - Report Card for an Open Source Project</h1>
<p>While approaching our <a href="http://code.google.com/p/memcached/wiki/ReleaseNotes140">1.4</a> release, the memcached project slipped into
a state where we weren’t pushing out new code very frequently.</p>
<p>Some members of the community demanded releases occur more frequently,
so <a href="http://consoleninja.net/">dormando</a> published a proposal for a release cycle for
memcached that would help drive us into the direction of universal
happiness.</p>
<p>I had a feeling that his plan and efforts to keep us on it were quite
effective, but the purpose of this post is to sort of take an
objective retrospective look and see how the reality of the progress
of the project holds up to my perception.</p>
<h2 id="release-scheduling">Release Scheduling</h2>
<p><img src="/images/memcached-velocity.png" alt="Release Velocity" /></p>
<p>This chart pretty much speaks for itself. It’s clear to see that
we very quickly approached the stated goal of 30 days between
releases.</p>
<p>I should note that <a href="http://code.google.com/p/memcached/wiki/ReleaseNotes141">1.4.1</a> was released later than expected
because a user hit a real live failure scenario in a production system
that took a while to isolate. Once we did, we wrote some tests to
ensure that the entire class of bug described can’t happen again.</p>
<p>These releases are bigger than they appear, but very well tested on
our <a href="http://code.google.com/p/memcached/wiki/BuildFarm">build farm</a> and in production environments. To see
what goes into each release, check the release notes for
<a href="http://code.google.com/p/memcached/wiki/ReleaseNotes140">1.4.0</a>, <a href="http://code.google.com/p/memcached/wiki/ReleaseNotes141">1.4.1</a>, and <a href="http://code.google.com/p/memcached/wiki/ReleaseNotes142">1.4.2</a>.</p>
<h2 id="bug-status">Bug Status</h2>
<p>We’ve been taking bugs very seriously.</p>
<p>Crash bugs are resolved <em>very</em> quickly, but bugs in general are
resolved within our release cycle.</p>
<p>Please let us know via <a href="http://groups.google.com/group/memcached">our mailing list</a> or our <a href="http://code.google.com/p/memcached/issues/list">bug
tracker</a> if you’re having issues or can’t figure something
out. We fix things very quickly, but <strong>we can’t fix what we don’t hear
about</strong>. Remember that goes for the <a href="http://code.google.com/p/memcached/wiki/Start">wiki docs</a>, too.</p>
<p>The following chart shows the average age (time between when they were
filed and when they were closed) of bugs by month in memcached. The
error bars are showing the minimum and maximum age for any given
month. Issues marked as invalid or wontfix, or that otherwise turned
out not to be actual defects or missing features in the software, were
excluded.</p>
<p>Four anomalies were identified and described briefly in the chart.
These are all considered trivial issues that were not detrimental to
the state of the server, but they do show that we’re sometimes not too
quick in cases like these. (more details on <a href="http://code.google.com/p/memcached/issues/detail?id=9">A</a>, <a href="http://code.google.com/p/memcached/issues/detail?id=42">B</a>,
<a href="http://code.google.com/p/memcached/issues/detail?id=53">C</a>, and <a href="http://code.google.com/p/memcached/issues/detail?id=59">D</a>).</p>
<p><img src="/images/memcached-issue-age.png" alt="Bug Age" /></p>
<p>I also found it useful to consider what I called the “bug load.” That
is, the number of bugs going in and out of the project by month.</p>
<p>The following chart is showing bugs activity in the memcached
project.</p>
<p>Positive numbers are bugs being reported to the project. Negative
numbers are bugs that we closed (bugs removed from the project). The
line indicates the net number of bugs at the end of the given month.</p>
<p><img src="/images/memcached-bug-load.png" alt="Bug Load" /></p>
<h2 id="so-whats-the-grade">So What’s the Grade?</h2>
<p>I’d say we’ve got a solid B. We have room for improvement and I’m
confident we’re making it.</p>
<p>If you disagree (and can articulate why specifically) or otherwise
have feedback, a warm, active community awaits.</p>
<p>Contact us on irc <a href="irc://irc.freenode.net/memcached">(freenode.net #memcached)</a>, email suggestions
or questions to our <a href="http://groups.google.com/group/memcached">list</a>, or just go <a href="http://code.google.com/p/memcached/issues/list">file a bug</a>.</p>
EMemcached2009-10-11T00:00:00+00:00http://dustin.github.com/2009/10/11/ememcached<h1 id="why-an-erlang-memcached-implementation">Why an Erlang Memcached Implementation?</h1>
<div>
<img src="/images/a-o-k.png" alt="Memcached! Do you speak it?" class="floatright" />
</div>
<p>I wrote an <a href="http://github.com/dustin/ememcached">erlang implementation</a> of a memcached server
speaking the binary protocol over the weekend. You’re all probably
wondering why.</p>
<p>It was half a learning exercise, and half a way to plug a
<a href="http://blogs.sun.com/trond/entry/memcapable">memcapable</a> interface into more applications. In its
current form, I consider it a framework to take existing erlang
backend stores (e.g. dict, ets, dets, mnesia, <a href="http://riak.basho.com/">riak</a>, etc…) and
put a memcached interface on them. On its own, it’s a little boring.</p>
<p>The first implementation of the binary protocol was in my
<a href="http://github.com/dustin/memcached-test">memcached-test</a> project (which I still actively use
as a reference implementation when playing around with features and
testing clients and things). It’s a simple asyncore based server in
python (with a sample synchronous client).</p>
<p>A bit later, came the actual <a href="http://code.google.com/p/memcached/">production C</a> server version
we all know and love (which also had my <a href="http://code.google.com/p/spymemcached/">java client</a> to
talk to).</p>
<p>I wrote <a href="http://github.com/dustin/twisted-memcached">twisted memcached</a> and built an S3-backed server
one weekend.</p>
<p>That leads us back to the present <a href="http://github.com/dustin/ememcached">erlang-based server</a>.</p>
<h2 id="but-wait-theres-more">But Wait, There’s More!</h2>
<p>If you have a basic understanding of erlang, you may find this
implementation to be the best documentation of the protocol in
existence.</p>
<p>In the main loop of a connection, all I’m doing is calling
<code class="language-plaintext highlighter-rouge">process_message</code> in a loop with the connected socket, a reference to
the storage server (an erlang <code class="language-plaintext highlighter-rouge">gen_server</code> implementation), and the
result of a call to <code class="language-plaintext highlighter-rouge">gen_tcp:recv(Socket, 24)</code>. That last call will
either return <code class="language-plaintext highlighter-rouge">{ok, SomeData}</code> or <code class="language-plaintext highlighter-rouge">{error, SomeReason}</code>.</p>
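<p>For comparison, the same fixed 24-byte request header can be unpacked in Python with <code>struct</code> (the field layout matches the erlang pattern below; treat this as an illustrative sketch rather than production code):</p>

```python
import struct

REQ_MAGIC = 0x80  # first byte of every binary-protocol request

# >BBHBBHIIQ: magic, opcode, key length, extras length, data type,
# reserved, total body length, opaque, cas: 24 bytes in all.
HEADER = ">BBHBBHIIQ"

def parse_header(header):
    (magic, opcode, key_len, extra_len, data_type, reserved,
     body_len, opaque, cas) = struct.unpack(HEADER, header)
    if magic != REQ_MAGIC:
        raise ValueError("not a binary-protocol request")
    return {"opcode": opcode, "key_len": key_len, "extra_len": extra_len,
            "body_len": body_len, "opaque": opaque, "cas": cas}

# A GET (opcode 0x00) for a 3-byte key:
hdr = struct.pack(HEADER, 0x80, 0x00, 3, 0, 0, 0, 3, 0xCAFE, 0)
print(parse_header(hdr)["key_len"])  # 3
```
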
<h3 id="the-fun-part">The Fun Part</h3>
<p>The only definition of <code class="language-plaintext highlighter-rouge">process_message/3</code> I have is shown below. The
<em>only</em> valid way to call this is when the third argument is the <code class="language-plaintext highlighter-rouge">{ok,
Data}</code> tuple where <code class="language-plaintext highlighter-rouge">Data</code> in this case is a binary pattern. Some of
the values are filled in (which means they <em>must</em> match for this
function body to be invoked), and some are bindings which will receive
the value.</p>
<p>In the code below (extracted from <a href="http://github.com/dustin/ememcached/blob/master/src/mc_connection.erl">mc_connection.erl</a>),
you’ll see that the first 8 bits must exactly be the defined
<code class="language-plaintext highlighter-rouge">REQ_MAGIC</code> constant, and then the next 8 bits are stored in <code class="language-plaintext highlighter-rouge">OpCode</code>,
and so on.</p>
<p>Any attempt to process a message not in this form will result in the
connection process crashing (the effect of which is your client
being disconnected).</p>
<p>So you can see how erlang easily lets us rip the bits we need out
of the header for dispatch. The next thing is to ask for the
remaining data (extra headers, key, and body) before dispatching to
the storage server process.</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="nf">process_message</span><span class="p">(</span><span class="nv">Socket</span><span class="p">,</span> <span class="nv">StorageServer</span><span class="p">,</span>
<span class="p">{</span><span class="n">ok</span><span class="p">,</span> <span class="o"><<?</span><span class="nv">REQ_MAGIC</span><span class="p">:</span><span class="mi">8</span><span class="p">,</span> <span class="nv">OpCode</span><span class="p">:</span><span class="mi">8</span><span class="p">,</span> <span class="nv">KeyLen</span><span class="p">:</span><span class="mi">16</span><span class="p">,</span>
<span class="nv">ExtraLen</span><span class="p">:</span><span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">8</span><span class="p">,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">16</span><span class="p">,</span>
<span class="nv">BodyLen</span><span class="p">:</span><span class="mi">32</span><span class="p">,</span>
<span class="nv">Opaque</span><span class="p">:</span><span class="mi">32</span><span class="p">,</span>
<span class="nv">CAS</span><span class="p">:</span><span class="mi">64</span><span class="o">>></span><span class="p">})</span> <span class="o">-></span>
<span class="c">% After the header is extras, key, and then body
</span> <span class="nv">Extra</span> <span class="o">=</span> <span class="nf">read_data</span><span class="p">(</span><span class="nv">Socket</span><span class="p">,</span> <span class="nv">ExtraLen</span><span class="p">),</span>
<span class="nv">Key</span> <span class="o">=</span> <span class="nf">read_data</span><span class="p">(</span><span class="nv">Socket</span><span class="p">,</span> <span class="nv">KeyLen</span><span class="p">),</span>
<span class="c">% Note that the length of the body from the header includes
</span> <span class="c">% the lengths of the key and extras.
</span> <span class="nv">Body</span> <span class="o">=</span> <span class="nf">read_data</span><span class="p">(</span><span class="nv">Socket</span><span class="p">,</span> <span class="nv">BodyLen</span> <span class="o">-</span> <span class="p">(</span><span class="nv">KeyLen</span> <span class="o">+</span> <span class="nv">ExtraLen</span><span class="p">)),</span>
<span class="c">% Dispatch the read data to a gen_server process.
</span> <span class="nv">Res</span> <span class="o">=</span> <span class="nn">gen_server</span><span class="p">:</span><span class="nf">call</span><span class="p">(</span><span class="nv">StorageServer</span><span class="p">,</span>
<span class="p">{</span><span class="nv">OpCode</span><span class="p">,</span> <span class="nv">Extra</span><span class="p">,</span> <span class="nv">Key</span><span class="p">,</span> <span class="nv">Body</span><span class="p">,</span> <span class="nv">CAS</span><span class="p">}),</span>
<span class="nf">respond</span><span class="p">(</span><span class="nv">Socket</span><span class="p">,</span> <span class="nv">OpCode</span><span class="p">,</span> <span class="nv">Opaque</span><span class="p">,</span> <span class="nv">Res</span><span class="p">).</span></code></pre></figure>
<h3 id="the-storage-server">The Storage Server</h3>
<p>A storage server process is a <code class="language-plaintext highlighter-rouge">gen_server</code> implementation whose
<code class="language-plaintext highlighter-rouge">handle_call</code> implementations take a tuple of <code class="language-plaintext highlighter-rouge">{OpCode,
ExtraHeader, Key, Value, CAS}</code> and return an <code class="language-plaintext highlighter-rouge">mc_response</code> record.</p>
<p>For an example storage server, consider the two <code class="language-plaintext highlighter-rouge">flush</code> implementations
in my <a href="http://github.com/dustin/ememcached/blob/master/src/mc_handler_hashtable.erl">hashtable store</a> (noting that flush has one 32-bit
integer in its extra header, no key, and no value):</p>
<figure class="highlight"><pre><code class="language-erlang" data-lang="erlang"><span class="c">% Immediate flush. Ignore the current state and make a new one.
</span><span class="nf">handle_call</span><span class="p">({</span><span class="o">?</span><span class="nv">FLUSH</span><span class="p">,</span> <span class="o"><<</span><span class="mi">0</span><span class="p">:</span><span class="mi">32</span><span class="o">>></span><span class="p">,</span> <span class="o"><<>></span><span class="p">,</span> <span class="o"><<>></span><span class="p">,</span> <span class="p">_</span><span class="nv">CAS</span><span class="p">},</span>
<span class="p">_</span><span class="nv">From</span><span class="p">,</span> <span class="p">_</span><span class="nv">State</span><span class="p">)</span> <span class="o">-></span>
<span class="p">{</span><span class="n">reply</span><span class="p">,</span> <span class="nl">#mc_response</span><span class="p">{},</span> <span class="nl">#mc_state</span><span class="p">{}};</span>
<span class="c">% Delayed flush. Keep the current state and schedule a
% flush to occur (via handle_info) in the future.
</span><span class="nf">handle_call</span><span class="p">({</span><span class="o">?</span><span class="nv">FLUSH</span><span class="p">,</span> <span class="o"><<</span><span class="nv">Delay</span><span class="p">:</span><span class="mi">32</span><span class="o">>></span><span class="p">,</span> <span class="o"><<>></span><span class="p">,</span> <span class="o"><<>></span><span class="p">,</span> <span class="p">_</span><span class="nv">CAS</span><span class="p">},</span>
<span class="p">_</span><span class="nv">From</span><span class="p">,</span> <span class="nv">State</span><span class="p">)</span> <span class="o">-></span>
<span class="nn">erlang</span><span class="p">:</span><span class="nb">send_after</span><span class="p">(</span><span class="nv">Delay</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">,</span> <span class="nf">self</span><span class="p">(),</span> <span class="n">flush</span><span class="p">),</span>
<span class="p">{</span><span class="n">reply</span><span class="p">,</span> <span class="nl">#mc_response</span><span class="p">{},</span> <span class="nv">State</span><span class="p">};</span>
<span class="err">%</span> <span class="nv">More</span> <span class="n">stuff</span> <span class="nv">Follows</span></code></pre></figure>
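<p>For readers who don’t speak erlang, the two clauses above can be
mirrored in Python (a hypothetical sketch, not part of ememcached): an
immediate flush throws the state away, while a delayed flush keeps the
current state and schedules the real flush for later, much as
<code class="language-plaintext highlighter-rouge">erlang:send_after</code> does.</p>

```python
import threading

def handle_flush(state: dict, delay_seconds: int) -> dict:
    """Illustrative mirror of the two flush clauses; names are mine."""
    if delay_seconds == 0:
        return {}  # immediate flush: replace the state with a fresh one
    # Delayed flush: keep the current state and clear it later.
    timer = threading.Timer(delay_seconds, state.clear)
    timer.daemon = True
    timer.start()
    return state
```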
<p>That’s pretty much it. Even if nobody uses this code, it’s useful to
me as a protocol reference since it’s easier to read than even the
<a href="http://cloud.github.com/downloads/memcached/memcached/protocol-binary.txt">binary specification</a>.</p>
spymemcached Optimizations2009-09-23T00:00:00+00:00http://dustin.github.com/2009/09/23/spymemcached-optimizations<p>I got around to pushing out a new RC of <a href="http://code.google.com/p/spymemcached/">spymemcached</a>
today. It’s been a while, but I’m glad I got around to it.</p>
<p>The <a href="http://groups.google.com/group/spymemcached/browse_thread/thread/9d93e5658e813c29">announcement</a> has the release notes (also in the
tag), but there is a particular optimization I’ve been thinking about
for a while, and I’d like to go over it below.</p>
<p>But first, I’ll frame it with a bit of memcached protocol fundamentals.</p>
<h2 id="introduction-to-quiet-operations">Introduction to Quiet Operations</h2>
<div>
<img class="floatleft" alt="Ask, don't tell." src="/images/memcached-sparse-get.png" />
</div>
<p>Back when we were initially designing the binary protocol, we were
considering how we’d handle the multi-gets. We went through several
proposals until we realized that the actual essence of the multi-get model
was really just a class of operation that allowed us to infer some of
the responses.</p>
<p>The above diagram shows a simple case of a multi-get. We ask for the
values behind four keys. The server sends us responses for two of
those keys and then says it’s done. In a client, we can safely assume
that it just didn’t have the other two. It doesn’t need to actually
tell us that.</p>
<p>So in the binary protocol model, we just made a type of command that
didn’t respond with “uninteresting responses.” It’s easy to see how
in a <code class="language-plaintext highlighter-rouge">get</code> operation, the <code class="language-plaintext highlighter-rouge">not found</code> response is uninteresting as we
can infer it.</p>
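<p>Client-side, the inference is nearly a one-liner. The sketch below
(hypothetical, not from any particular client) fills in <code class="language-plaintext highlighter-rouge">None</code> for
every requested key the server stayed silent about:</p>

```python
def resolve_quiet_multiget(requested_keys, received):
    """received maps key -> value for only the keys the server answered.
    Every other requested key is inferred to be a miss (None)."""
    return {key: received.get(key) for key in requested_keys}

# Four keys asked for; the server answered two and said it was done.
hits = resolve_quiet_multiget(["a", "b", "c", "d"], {"a": 1, "c": 3})
# "b" and "d" drew no response, so they are inferred misses.
```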
<h2 id="other-quiet-operations">Other Quiet Operations</h2>
<p>Shortly before the actual 1.4.0 release of memcached, we defined
semantics for all “quiet” operations in a way that allowed us to
maximize efficiency without compromising correctness.</p>
<p>The <a href="http://code.google.com/p/memcached/wiki/MemcacheBinaryProtocol">binary protocol definition</a> goes through these in
tremendous detail, but those familiar with the Unix philosophy will
probably find such things intuitive.</p>
<p>For example, in Unix, the <code class="language-plaintext highlighter-rouge">rm</code> command does not print out any output
on success. If it completes and didn’t say otherwise, you can assume
it was successful.</p>
<p>Similarly, a quiet <code class="language-plaintext highlighter-rouge">delete</code> operation doesn’t need to tell us that it
successfully deleted something. That’s its job. We want to know when
it fails to do it.</p>
<h2 id="optimizing-with-a-quiet-set">Optimizing with a Quiet Set</h2>
<div>
<img class="floatright" alt="Look. Faster!" src="/images/multiset-perf.png" />
</div>
<p>The optimization I was interested in was making a multi-set type
operation that worked similarly to the multi-get functionality. After
struggling with what such an API might look like, I finally decided
that the right thing to do is not change the API at all.</p>
<p>Instead, I do something similar to <a href="http://code.google.com/p/spymemcached/wiki/Optimizations">multiget escalation</a>
– an optimization that’s been part of <a href="http://code.google.com/p/spymemcached/">spymemcached</a> for
a long time now. If many threads are pushing sets in (or even a
single thread, since the typical use case of set is async), the
packetization of these commands escalates a sequence of similar
commands into a single sparse operation working on all of the items
together.</p>
<p>While YMMV, my cache loader test ran consistently twice as fast.</p>
<p>Previously, one million requests would require the client to process
one million responses. Now, one million requests (assuming none fail)
will require the client to process one tiny response.</p>
<p>If any <em>do</em> fail, the respective callers will, of course, be notified.
The ones that don’t fail receive synthetic callbacks as the client
infers their success.</p>
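<p>The bookkeeping behind those synthetic callbacks is easy to sketch
(hypothetical names, not spymemcached’s actual API): once the batch
terminates, callers whose keys drew an error get the real response,
and everyone else gets a synthesized success:</p>

```python
def complete_quiet_batch(pending, errors):
    """pending maps key -> callback for each quiet set in the batch;
    errors maps key -> error message for the few the server reported.
    Silence implies success, so the rest get synthetic callbacks."""
    results = {}
    for key, callback in pending.items():
        status = errors.get(key, "stored")
        callback(status)
        results[key] = status
    return results
```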
<h2 id="what-you-need-to-do">What You Need to Do</h2>
<p>If you’re using <a href="http://code.google.com/p/spymemcached/">spymemcached</a>, upgrade and you get the
optimizations.</p>
<p>If you’re a client author, see how much better things are for your
users as you make broader use of quiet operations of the binary
protocol.</p>
Tornado on Twisted2009-09-12T00:00:00+00:00http://dustin.github.com/2009/09/12/tornado<div>
<a href="http://www.nataliedee.com/index.php?date=050906"><img class="floatright" alt="Twisted Tornados" src="/images/snake-tornado.png" /></a>
</div>
<p>So what’s this about <a href="http://www.tornadoweb.org/">tornado</a>?</p>
<p>The <a href="http://friendfeed.com/">friendfeed</a> guys created an awesome web site with
what was obviously (from the outside) quite awesome technology. A
couple days ago, they released <a href="http://bret.appspot.com/entry/tornado-web-server">the technology</a>
behind the site.</p>
<h2 id="the-problems">The Problems</h2>
<p>Tornado really is two different things:</p>
<ol>
<li>A great framework for building web sites.</li>
<li>A low-level networking toolkit.</li>
</ol>
<p>Most of us who use <a href="http://twistedmatrix.com/">twisted</a> were quite surprised to find out
that rather than using twisted’s awesome networking core, they
reimplemented a bunch of it. Moreover, they kind of had really vague
negative things to say about twisted. It’s not clear what problem
they had with it, but as stated the logic kind of fails:</p>
<ol>
<li>Twisted doesn’t have a good web framework.</li>
<li>No major web sites run on twisted.</li>
<li>So… we’re going to build something completely from scratch.</li>
</ol>
<p>Now, I certainly build stuff from scratch unnecessarily all the time,
and I also don’t want to seem unthankful for the free gift brought to
us from our friendfeed friends, but in its current state, it’s a
technological island.</p>
<p>On one hand, you have low-level missing pieces. From the bottom,
you’ve got multiplexing implementations. twisted has been around a
while, so it supports native multiplexors including those for select,
poll, epoll, kqueue, CoreFoundation, wxPython, win32, gtk, glib, qt,
and probably more. While some of them are less interesting than
others, that’s a lot of catching up to do. One user has already begun
adding <a href="http://github.com/rphillips/tornado/commit/41aa49a8dcfe4f5fa91dfe1da9e05797d3397d25">kqueue support</a>, but it’s more of an example of
the kinds of things that you don’t get for free.</p>
<p>At the high end, there is an incredible selection of protocol
implementations from twisted you just can’t use. If you’re building a
web site on an asynchronous framework, you don’t ever want to block to
get further data. So whatever your networking framework, you need it
to have a sane way of asynchronously communicating with all of your
dependencies.</p>
<p>They wrote an http client, though there are things I do with the
twisted http client that I don’t see a way to do with the tornado
client, such as <a href="http://github.com/dustin/twitty-twister/blob/master/lib/twitter.py#L331">parsing infinite xml streams</a> from http
responses.</p>
<h2 id="the-solution">The Solution</h2>
<p>So instead of just generally being frustrated, I thought I’d see what
it’d take to rip out the stuff that was reimplemented and use just
core twisted.</p>
<p>The original tornado code doesn’t have much in the way of tests, so
it’s currently in the “appears to work” state, but this is what got me
there:</p>
<p><img src="/images/tornado-diffstat.png" alt="Diffstat" /></p>
<p>That is, with an offset of -1,297 lines of code, it can be observed to
work for the cases I tried, although I’m sure there are still lines
that need to be deleted before everything works.</p>
<p>The good news is that this means that you get all the richness of
twisted and the good parts of tornado combined.</p>
<p>Everybody wins.</p>
<p>So go grab <a href="http://github.com/dustin/tornado">the code</a> and see if it works in your app, or
send me some patches for parts you were able to explore a bit more
deeply.</p>
Buildbot and Git Repositories2009-09-06T00:00:00+00:00http://dustin.github.com/2009/09/06/buildbot-git<h1 id="buildbot-and-git-repositories">Buildbot and Git Repositories</h1>
<div>
<img class="floatright" src="/images/RefuseToApologize.png" alt="I refuse to apologize." />
</div>
<p>In a recent conversation with the GitHub guys, I was talking about how
my <a href="http://buildbot.net/">buildbot</a> setup was hitting GitHub and how a recent
filesystem glitch of theirs caused my screen to turn red with
<a href="http://growl.info/">growl</a> alerts from <a href="http://code.google.com/p/buildwatch/">buildwatch</a>.</p>
<p>The response was a tongue-in-cheek “I refuse to
apologize.”</p>
<p>The thing is, that response is absolutely the right one. This is
distributed revision control. Why did I have a screen full of growl
alerts because of a failure of a filesystem completely unrelated to
what I was doing?</p>
<p>I was relying on GitHub to be highly and quickly available to my
seventeen (and growing in number) buildbot slaves for this project.</p>
<p><em>Most</em> of the time, there’s nothing for them to actually fetch from
centralized revision control – quite simply, they were asking for
information they had cryptographically verifiable assurance
they already had.</p>
<p>Today I made a <a href="http://github.com/dustin/buildbot/commit/fabad2476cebc077d58c9293ce389d465648b019">small change</a> to buildbot that prevents the
slaves from ever talking to any network service to pick out a
reference version in our most common use cases, thus realigning myself
with the thing that initially sold me on GitHub’s service: It enhances
collaboration without causing me to be dependent on the service.</p>
<p>For this failure, I am thankful. This new code will always be faster
and more reliable for the common case even when GitHub works
absolutely flawlessly.</p>
<h2 id="see-also">See Also</h2>
<p><a href="http://ozmm.org/posts/when_github_goes_down.html">When GitHub Goes Down</a>.</p>
Memcached 1.4.02009-07-16T00:00:00+00:00http://dustin.github.com/2009/07/16/memcached-1.4<h1 id="memcached-140">Memcached 1.4.0</h1>
<div>
<img class="floatright" src="/images/cream-med.png" alt="c.r.e.a.m" />
</div>
<p>I’m a bit late to the blogging party here, but we finally released
memcached 1.4.0. Check out <a href="http://code.google.com/p/memcached/wiki/ReleaseNotes140">the release notes</a> for more
details.</p>
<p>The release notes cover quite a lot of the interesting stuff, but they
don’t properly reflect the time and effort that went into making this
all happen.</p>
<p>There are a lot of bug fixes, as one might expect after some time. A
lot of testing has shown that performance is better pretty much all
around, but very few people have ever seen memcached be a performance
bottleneck in their applications, so that’s not too exciting.</p>
<p>The biggest part of this release, however, is something I’ve been
working on for about two years: <a href="http://cloud.github.com/downloads/memcached/memcached/protocol-binary.txt">the binary protocol</a>.</p>
<h2 id="the-binary-protocol">The Binary Protocol</h2>
<p>So what’s the big deal about the binary protocol?</p>
<p>The most obvious thing for protocol implementors is that it’s now
<em>really</em> easy to parse the protocol. After reading a fixed-size
header, a low-level packet processor can figure out where to dispatch
the input and split it into all of its major components (key, value,
opaque, cas, extras, etc…).</p>
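<p>To make that concrete, here’s a small Python sketch of splitting the
fixed 24-byte request header (field layout per the binary protocol
spec; the function name is mine):</p>

```python
import struct

REQ_MAGIC = 0x80  # request packets start with this magic byte

def parse_request_header(header: bytes):
    """Unpack the fixed 24-byte binary-protocol request header."""
    (magic, opcode, key_len, extra_len, data_type, reserved,
     body_len, opaque, cas) = struct.unpack(">BBHBBHIIQ", header)
    if magic != REQ_MAGIC:
        raise ValueError("bad magic: 0x%02x" % magic)
    # The body length covers extras + key + value together.
    value_len = body_len - key_len - extra_len
    return opcode, key_len, extra_len, value_len, opaque, cas
```

<p>After this split, the processor knows exactly how many more bytes to
read and where each component begins.</p>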
<p>That’s great for the (small) number of developers who write servers
and clients, but what about random people out there who just want to
use memcached? Semantic enhancements in the new protocol allow us to
build some really cool stuff.</p>
<p>The first example of such a thing is Trond’s
<a href="http://blogs.sun.com/trond/date/20090625">replication</a> functionality for <a href="http://launchpad.net/libmemcached">libmemcached</a>.
We now have a clean fire-and-mostly-forget protocol semantics that
allows for improvements like efficient client-side replication. It
also makes it safe to make bulk-loaders (even with CAS).</p>
<h2 id="go-try-it">Go Try It</h2>
<p>We’ve run tons of tests, others have run tests, there’ve been various
deployments large and small, but if you’re running something older,
it’s your turn.</p>
<p>We work hard to make sure that the development versions work on all
platforms we can find anyone to care about. Each change is built and
tested on all supported platforms before the change is accepted into
our master branch.</p>
<p>Do note that 1.4.0 has some build issues on OpenBSD, but someone
graciously donated a builder to our buildbot farm so they’re all
cleared up for 1.4.1 (which is planned for later this month).</p>
<p>In the meantime, there are several ways to pick it up:</p>
<h3 id="packages">Packages</h3>
<p>Package systems are slow to pick up… anything it seems. If your
system’s package manager is shipping memcached 1.4.0, please let me
know.</p>
<p>In the meantime…</p>
<h3 id="use-the-source">Use the Source</h3>
<p>You can download <a href="http://memcached.googlecode.com/files/memcached-1.4.0.tar.gz">the source distribution</a> from the google code
<a href="http://code.google.com/p/memcached/downloads/list">download site</a>.</p>
<p>Building and installing is quite simple:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./configure
sudo make install
</code></pre></div></div>
<p>Then just run <code class="language-plaintext highlighter-rouge">/usr/local/bin/memcached</code> whichever way suits your
fancy. Personally, I like upstart on Linux, launchd on OS X, smf on
Solaris, etc…</p>
<h3 id="deploying-on-amazon-ec2aws">Deploying on Amazon EC2/AWS?</h3>
<p>I’ve put together some Amazon AMIs that are production ready and
ridiculously simple to get going.</p>
<p>Each AMI allocates all but 512MB of RAM on the system to memcached and
just starts up happy and running. These images are based on Ubuntu
9.04 and have an upstart config for the actual daemon execution so if
we somehow have some kind of crashing bug, they’ll automatically and
instantly restart.</p>
<p>Depending on your needs, you can select one of the following:</p>
<h4 id="us">US</h4>
<p>I’ve assembled a 32-bit AMI (<code class="language-plaintext highlighter-rouge">ami-39c52450</code>) for small instances, and
a 64-bit AMI (<code class="language-plaintext highlighter-rouge">ami-1fc52476</code>) for large instances. They show up as
the following:</p>
<ul>
<li>ami-39c52450 - northscale/community-memcached-1.4.0-i386.manifest.xml</li>
<li>ami-1fc52476 - northscale/community-memcached-1.4.0-x86_64.manifest.xml</li>
</ul>
<p>For example, using the ec2 command-line tools, you can start an extra-large
64-bit instance with the following invocation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ec2-run-instances ami-1fc52476 --instance-type m1.xlarge
</code></pre></div></div>
<p>After this instance comes up, you’ll find memcached 1.4.0 listening on
port 11211 with about 15GB of RAM at your disposal.</p>
<p>No maintenance should be necessary, but <code class="language-plaintext highlighter-rouge">ec2-run-instances</code>’s <code class="language-plaintext highlighter-rouge">-k</code>
parameter for supplying a root ssh key is still honored in case you
still want to look around.</p>
<h4 id="eu">EU</h4>
<p>There are European versions of the same images as <code class="language-plaintext highlighter-rouge">ami-818ba3f5</code>
for 32-bit and <code class="language-plaintext highlighter-rouge">ami-838ba3f7</code> for 64-bit.</p>
<ul>
<li>ami-818ba3f5 - northscale-eu/community-memcached-1.4.0-i386.manifest.xml</li>
<li>ami-838ba3f7 - northscale-eu/community-memcached-1.4.0-x86_64.manifest.xml</li>
</ul>
Project Skyscraper2009-07-11T00:00:00+00:00http://dustin.github.com/2009/07/11/skyscraper<h1 id="project-skyscraper">Project Skyscraper</h1>
<p>It occurred to me that there’s a lot of value in building xmpp
services – much like web services, but using existing connections and
xmpp instead of http.</p>
<p>In collaboration with <a href="http://github.com/ga2arch">ga2arch</a>, I launched an xmpp service
called skyscraper.im. This has actually been running for a while now,
but I’ve been too caught up in writing code to write anything about
code.</p>
<h2 id="translateskyscraperim">translate.skyscraper.im</h2>
<div>
<img class="floatright" src="/images/skyscraper.png" alt="skyscraper" />
</div>
<p>The first part of this service is an xmpp <a href="http://xmpp.org/extensions/xep-0050.html">adhoc</a> interface to
google translate. It actually does support IM, but that’s incidental;
the real value is in the adhoc interface.</p>
<p>If you’re unfamiliar with xmpp adhoc, you can think of it much like
CGI, but using xmpp as a transport. You take a bunch of simple
key/multi-value pairs and send them to a resource somewhere, and it
sends you something back. The nice thing about xmpp, though, is that
the mechanism for determining what things exist and what parameters
they take are very programmatically accessible.</p>
<p>You can discover available commands through <code class="language-plaintext highlighter-rouge">translate.skyscraper.im</code>
as shown in <a href="http://www.vimeo.com/5558475">this video</a>, but I’ll just tell you what
it’ll tell you:</p>
<h3 id="the-input">The Input</h3>
<p>There is one field called <code class="language-plaintext highlighter-rouge">in</code> which is the input language in the form
of a two-character language code. You may have only one of these (it
is of type <code class="language-plaintext highlighter-rouge">list-single</code>).</p>
<p>There is one field called <code class="language-plaintext highlighter-rouge">out</code> which is the output language and is
also in the form of a two-character language code. You can have as
many of these as you like (it is of type <code class="language-plaintext highlighter-rouge">list-multi</code>).</p>
<p>Finally there’s a field called <code class="language-plaintext highlighter-rouge">text</code> which is the stuff you want to
translate. You may only have one of these.</p>
<h3 id="the-output">The Output</h3>
<p>The response is a form much like the one you sent: the keys are
language codes and the values are the text translated in that
language.</p>
<p>Note that you will not receive more language translations than you
asked for, but you may receive fewer in the case where the upstream
translation service can’t perform such a translation.</p>
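<p>Shaped as plain data (a hypothetical illustration – the real
exchange is an xmpp data form, not Python dicts), a request and its
response look like:</p>

```python
# The adhoc form fields described above, as plain Python data.
request = {
    "in": "en",              # list-single: exactly one input language
    "out": ["fr", "de"],     # list-multi: any number of output languages
    "text": "good morning",  # the text to translate
}

# Response keys are language codes; a code may be missing when the
# upstream service couldn't produce that translation.
response = {"fr": "bonjour", "de": "guten Morgen"}
assert set(response) <= set(request["out"])
```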
<p>The obvious benefit here over doing it yourself is that you get full
translations all at the same time without having to do any kind of
coordination as things are completing (i.e., I do that for you).</p>
<h2 id="conferenceskyscraperim">conference.skyscraper.im</h2>
<div>
<img class="floatright" src="/images/skyscraper-chat.png" alt="skyscraper chat" />
</div>
<p>A fun thing built atop the translate component is the skyscraper muc
– an xmpp multi-user chat with automatic translation.</p>
<p>What this means is that you can have several people enter a room with
no language in common, all speaking and reading their native language.</p>
<p>Of course, the dream is limited by the translation service, but it
<em>does</em> work within the reasonable limits.</p>
<p>If you’d like to try it out, find a friend who speaks another language
and both join a chat room at <code class="language-plaintext highlighter-rouge">conference.skyscraper.im</code>. Start by
each of you telling it your respective languages (e.g. <code class="language-plaintext highlighter-rouge">/lang en</code>) and
then talk.</p>
<p>I’ve spent very little time on this, so I imagine it falls apart in
all kinds of places, but it was <em>really</em> easy to get going with the
translate service from above, and as it’s an xmpp server component, it
does all this with just one file descriptor and the necessary state to
keep up with who’s in what room and what translations are outstanding.</p>
My Github Anniversary (Sort Of)2009-03-01T00:00:00+00:00http://dustin.github.com/2009/03/01/github-anniversary<h1 id="my-github-anniversary-sort-of">My Github Anniversary (Sort Of)</h1>
<div>
<img alt="[anniversary]" class="floatright" src="http://img.skitch.com/20090301-jqf9yrkfniqf2ysaa88ebshwqe.png" />
</div>
<p>I joined <a href="http://github.com/">github</a> about a year ago today. Kind of. It was
actually February 29th, but there isn’t one of those this year, so I’m
going to have to wait a few more years before I can properly have an
anniversary.</p>
<p>Between the time I joined and the time I typed this line, I’ve
generated 206 pages of activity (<a href="http://github.com/dustin?page=174">174 pages public</a>).</p>
<p>By the end of my first day, I had migrated two <a href="http://github.com/dustin/java-memcached-client">java</a>
<a href="http://github.com/dustin/photo">projects</a> over from mercurial, converted a <a href="http://github.com/dustin/ruby-freebase">ruby
project</a> from subversion, had that repo forked, watched
another <a href="http://github.com/mojombo/god">ruby project</a>, started a <a href="http://github.com/dustin/buildwatch">new objective c
project</a>, wrote some new code for some of my projects and
pushed it and invited a <a href="http://github.com/chriseppstein">couple</a> of <a href="http://github.com/verbal">friends</a> (both of
whom now share my leap-year-only start date) and added them to a
couple of my projects.</p>
<p><a href="http://github.com/chriseppstein">Chris</a> became somewhat a github evangelist and made some
really <a href="http://github.com/chriseppstein/compass">cool</a> <a href="http://github.com/chriseppstein/freebase">stuff</a> there (some of it
cooler than most people can comprehend). <a href="http://github.com/verbal">Ian</a> throws awesome
parties (I’ll eventually make him give me code).</p>
<p>At the point where I started using github, I’d probably been a
(somewhat casual) git user for about two weeks. git is great, but the
documentation and tutorials were more about laying out an infinitely
complex decision tree – that is, git itself is easy to do anything
with, but you can do a lot with it, so it comes across as
unnecessarily complex.</p>
<p>Github has been really good about making really common paths really
easy so that you naturally fall into workflows that minimize the work
required to contribute to open source projects down to the point where
you can clone a repo, branch, edit some stuff, and notify the
maintainer of a project in just a few clicks on the web site.</p>
<p>Overall, it’s been a pretty <a href="http://calendaraboutnothing.com/~dustin">good year</a>. Just three more until
my real anniversary.</p>
Making Use of Caps Lock2009-02-09T00:00:00+00:00http://dustin.github.com/2009/02/09/caps-lock<h1 id="making-use-of-caps-lock">Making Use of Caps Lock</h1>
<div>
<img src="/images/capslock-pref.png" alt="caps lock" class="floatright" />
</div>
<p>If you’re like me (and who isn’t), the caps lock key is an annoying
waste of plastic. Its only value seems to be to type things to offend
people. Luckily, most operating systems allow you to map it to
control, or another useful key.</p>
<p>As a fairly new emacs user (and a long-term shell user), having a
control key near where my fingers already are makes many things far more
accessible to me. Highly recommended.</p>
<p>But there’s another thing that the caps lock provides that quickly
moves from annoyance to useful feature:</p>
<div>
<img src="/images/capslock-key.jpg" alt="caps lock" class="floatleft" />
</div>
<p>Just about every keyboard ever made has a caps lock indicator. Such a
wonderful thing when used correctly.</p>
<p>Amit Singh over at google had a blog post about
<a href="http://googlemac.blogspot.com/2008/04/manipulating-keyboard-leds-through.html">manipulating keyboard LEDs</a> which inspired me to add this feature
to my <a href="http://code.google.com/p/buildwatch/">buildwatch</a> app pretty much immediately.</p>
<p>Due to a fairly dumb bug I fixed today, it hasn’t been working (and I
wasn’t paying attention to it anyway), but now, when anyone does a
build against my build farm and the build breaks, my keyboard light
will come on.</p>
<p>Sort of makes me want to write some bad code.</p>
Buildbot2009-01-30T00:00:00+00:00http://dustin.github.com/2009/01/30/buildbot<h1 id="buildbot">Buildbot</h1>
<p>I’ve used a few different continuous integration systems, but
<a href="http://buildbot.net/">buildbot</a> has been my favorite for quite a while. It’s got a
really nice architecture, a great codebase, and all the tools I need.</p>
<div>
<img class="floatright" alt="buildwatch" src="/images/buildbot.png" />
</div>
<p>Most of these things seem to assume that if it builds anywhere, you’ve
done your job. buildbot assumes there may be a relatively large
number of build workers and an even larger number of configurations.</p>
<p>For example, I’m building out a buildbot configuration for a
portable software project I’m working on that seems to be most popular
on Linux, possibly followed by OS X. However, depending on the
distribution, toolchain, architecture, and compile-time options, some
things just don’t work correctly. I also use it on FreeBSD; in fact,
adding a FreeBSD slave turned up a small compiler error.</p>
<h2 id="tbyb">TBYB</h2>
<p>Most CI systems are all about telling you when you’ve committed code
that breaks the build for other people. Isn’t it rather late by that
point?</p>
<p>buildbot has a <code class="language-plaintext highlighter-rouge">try</code> command that allows you to run a complete build
across whichever nodes you want (or all of them) <em>before</em> making your
code available to anyone else.</p>
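A small wrapper makes “try before push” a habit. This is only a sketch of the shape of such a script, not anything shipped with buildbot; it assumes your <code>try_*</code> connection settings already live in <code>~/.buildbot/options</code>, and it skips cleanly when the buildbot client isn’t installed:

```shell
#!/bin/sh
# Sketch: refuse to push until "buildbot try" succeeds.  Assumes the
# try_* connection options are configured in ~/.buildbot/options; the
# function skips cleanly when the buildbot client isn't installed.
try_before_push() {
  if ! command -v buildbot >/dev/null 2>&1; then
    echo "buildbot not installed; skipping try"
    return 0
  fi
  # --wait blocks until every requested builder reports a result.
  buildbot try --vc=git --wait && git push "$@"
}
```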
<p>One of the guys I’m working with on this project does most of his work
in Solaris. He wrote some code, tested his code, and sent me a
patch. I committed his patch to my local git repo, but before pushing
it, ran <code class="language-plaintext highlighter-rouge">buildbot try</code> to make sure nothing weird happened. There
were two different problems that caused build and/or test failures on
every OS that wasn’t Solaris.</p>
<p>I was able to fix up his changes so that they worked everywhere, and
they never actually made it into a public tree in their broken form.</p>
<h2 id="the-code">The Code</h2>
<p>buildbot’s <a href="http://github.com/djmitche/buildbot">codebase</a> has some very robust plumbing, and it seems
to support just about anything you might want to do (what other
systems let you subscribe to the tail of the current step’s log in
real time without having access to the slave?).</p>
<p>I’ve had to make some changes to get some features working as I
expect, or to fix bugs in edge cases, though.</p>
<p>I’ve been doing some work lately with the <code class="language-plaintext highlighter-rouge">git</code> support in <code class="language-plaintext highlighter-rouge">try</code>.
Rather than repeating myself, you can see what I’ve done in my portion
of the changelog and imagine how much better your project would be if
every potential change could be tested across all your supported
platforms before you publish.</p>
<pre style="font-size: smaller">
commit dfb18e6c177d490da9dcab29e431eff22cfedfec
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Wed Jan 28 20:26:00 2009 -0800

    Allow users to specify the remote git branch.

    This allows for a case where someone has a repository that tracks
    someone else's repository, has arbitrary local branches, but wants to
    run tries with the delta from the reference repository (i.e. the one
    the master knows about) to the local changes.

    Without this, it's likely the reference repository will not have the
    necessary objects to pull down a base revision to be able to apply
    patches for the try to succeed.

    This also ensures that the current client's view of the reference
    repository is honored. That is, if the reference repository has moved
    forward, the trier's current tip of the remote is used to compute the
    delta, and that's sent along as the baserev.

commit 38a9c7fc719b44e2cdfa47884182da7128b369d2
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Wed Jan 28 16:30:08 2009 -0800

    Added --dry-run (-n) support to buildbot try.

    Need to be able to try try when I just want to know what it's even
    going to consider doing.

commit f43143835cba3ca5963e07874da17c1416a031c2
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Wed Jan 28 08:34:04 2009 -0800

    Refactored try buildName validation for reuse.

commit a88238cae5000c3481877aa354e3c76fc45770b8
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Wed Jan 28 08:25:52 2009 -0800

    Don't require a list of builders for buildbot try.

    This maintains the current restrictions around builder lists that
    prevent one from trying a build that isn't in the list, but allows the
    user to delegate the selection to the server by not listing the
    builders at all.

    I want my users to always try their builds on every build
    configuration, but I don't want to be sending out buildbot options all
    the time.

commit 99240ada38677a143971fe390beb714c3017c20b
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Sun Jan 25 10:47:57 2009 -0800

    git_buildbot should show the author, not the committer

    When I'm looking at my waterfall, I'd like to see the names of the
    people who wrote code, not just mine because I happened to have
    cherry-picked or am'd a bunch of changes.

commit 2c6865d83e967ca135acd3810e08af2dfab727b3
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Wed Jan 21 20:23:35 2009 -0800

    Look at the remote tracking branch in git for buildbot try.

    This allows us to try committed, but not pushed code.

commit a079d84d4056dbf5ab3489cb7f2f8f0e20d91b87
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Thu Jan 22 09:59:59 2009 -0800

    Try to reclobber on retry.

    On a failed git update in clobber mode, I was getting the following
    error on the second try:

        exceptions.OSError: [Errno 17] File exists: '/path/to/build'

    It seems that the clobber only occurs once, and any error that happens
    during the checkout should redo the clobber.

commit 6c36579a63b58bc986ec56e0272362038be08112
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Wed Nov 19 04:56:38 2008 +0800

    Get rid of git- commands in git_buildbot.

    Signed-off-by: Dustin J. Mitchell &lt;dustin@zmanda.com&gt;

commit ac70a83fa05c2b1b31dd9411ffc28876fb9e9f20
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date:   Sat Apr 19 08:40:20 2008 +0800

    Send merge changes from git.

    Signed-off-by: Dustin J. Mitchell &lt;dustin@zmanda.com&gt;
</pre>
Publishing Changelogs2009-01-17T00:00:00+00:00http://dustin.github.com/2009/01/17/changelog<h1 id="publishing-changelogs">Publishing Changelogs</h1>
<p>A user filed a bug against my <a href="http://code.google.com/p/spymemcached/">memcached client</a> because he
couldn’t find the changelog and wanted to know what went into the new
version.</p>
<p>I have a decent structure around releases, especially with this
project. I tag it and write a good summary of changes in the tag
including an abbreviated shortlog output, then I send the same out to
the mailing list.</p>
<p>Somehow, I expected anyone not on the mailing list to just dig through
my tags to find out what’s changed. I suppose that’s asking quite a
bit.</p>
<p>Since I’ve been keeping good information in my tags since moving over
to git (which actually has proper tag objects), I’ve found it quite
easy to automate this process. My new <a href="http://github.com/dustin/bindir/blob/master/git-htmlchangelog">git htmlchangelog</a>
takes a list of tags and generates a reasonable changelog
automatically from this.</p>
<p>For example, the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git htmlchangelog `git tag | egrep -v pre\|rc` > changelog.html
</code></pre></div></div>
<p>created <a href="http://dustin.github.com/java-memcached-client/changelog.html">the changelog</a> for my memcached client.</p>
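The core of the idea is small enough to sketch inline. This is not the real <code>git htmlchangelog</code> (which formats things rather more nicely); it only shows how the messages of annotated tags become a changelog:

```shell
#!/bin/sh
# Sketch of the idea behind git-htmlchangelog (not the real script):
# render the message of each annotated tag as a definition list.
# Assumes the tags given are annotated tag objects (lightweight tags
# have no message to show).
htmlchangelog() {
  echo "<dl>"
  for tag in "$@"; do
    echo "<dt>$tag</dt>"
    # %(contents) is the annotation message stored in the tag object.
    echo "<dd><pre>$(git for-each-ref "refs/tags/$tag" --format='%(contents)')</pre></dd>"
  done
  echo "</dl>"
}
```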
Visualizing Git Contributors2009-01-16T00:00:00+00:00http://dustin.github.com/2009/01/16/visualizing-contributors<p>I was looking for something quick to do today, so I started drawing
<a href="http://github.com/dustin/bindir/blob/master/git-contributors">pie charts</a> showing who commits to various projects.</p>
<p>Pie charts are particularly terrible for communicating something
useful to people, but they kind of look nice, so whatever.</p>
<p>Here are some examples:</p>
<h2 id="linux">Linux</h2>
<div>
<img src="http://chart.apis.google.com/chart?cht=p&chs=600x300&chd=s:CBBBBB2&chl=Linus|Al|David|Adrian|Ralf|Jeff|Other" alt="Linux" />
</div>
<h2 id="git">Git</h2>
<div>
<img src="http://chart.apis.google.com/chart?cht=p&chs=600x300&chd=s:WFECBBa&chl=Junio|Shawn|Linus|Johannes|Eric|Jakub|Other" alt="Git" />
</div>
<h2 id="memcached">Memcached</h2>
<div>
<img src="http://chart.apis.google.com/chart?cht=p&chs=600x300&chd=s:THGGEEP&chl=Brad|dormando|Paul|Dustin|Trond|Toru|Other" alt="Memcached" />
</div>
<h2 id="emacs">Emacs</h2>
<div>
<img src="http://chart.apis.google.com/chart?cht=p&chs=600x300&chd=s:OEEDDDe&chl=Richard|Gerd|Eli|Stefan|Kenichi|Glenn|Other" alt="Emacs" />
</div>
<h2 id="rails">Rails</h2>
<div>
<img src="http://chart.apis.google.com/chart?cht=p&chs=600x300&chd=s:VPDDCCO&chl=David|Jeremy|Michael|Rick|Jamis|Joshua|Other" alt="Rails" />
</div>
Git Timecard2009-01-11T00:00:00+00:00http://dustin.github.com/2009/01/11/timecard<h1 id="git-timecard">Git Timecard</h1>
<p>I really like <a href="http://github.com/blog/159-one-more-thing">github’s punch card</a> feature. It’s a nice
way to quickly see when a project is worked on.</p>
<p>However, it’s very limited. You can only see everyone’s work on the
master branch. I have <em>lots</em> of other ways I want to look at my
repos (including some that aren’t on github).</p>
<p>So I wrote <a href="http://github.com/dustin/bindir/tree/master/git-timecard">my own</a>.</p>
<p>Here are the neat kinds of things I can do now:</p>
<h2 id="punch-card-of-my-work-repo">Punch Card of my Work Repo</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/work-project/ % git timecard
</code></pre></div></div>
<p><em>[Punch-card scatter chart: commits plotted by hour of day (0-23) and day of week (Sun-Sat) for my work repo.]</em></p>
<h2 id="alternate-branch">Alternate Branch</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/memcached % git timecard rewritten-bin
</code></pre></div></div>
<p><em>[Punch-card scatter chart: commits by hour of day (0-23) and day of week (Sun-Sat) for the rewritten-bin branch of memcached.]</em></p>
<h2 id="one-users-work">One User’s Work</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/memcached % git timecard --author=dustin
</code></pre></div></div>
<p><em>[Punch-card scatter chart: commits by hour of day (0-23) and day of week (Sun-Sat) for memcached commits authored by dustin.]</em></p>
<h2 id="last-weeks-work">Last Week’s Work</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/twitterspy % git timecard '@{1 week ago}'..
</code></pre></div></div>
<p><em>[Punch-card scatter chart: commits by hour of day (0-23) and day of week (Sun-Sat) for the last week of twitterspy work.]</em></p>
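All of these charts start from the same raw data: commit counts bucketed by weekday and hour; only the rendering differs. A sketch of that counting step (not the actual git-timecard, which goes on to build the chart URL):

```shell
#!/bin/sh
# Sketch of the counting half of a timecard (the real git-timecard also
# renders the chart).  Extra arguments pass straight through to git log,
# so --author=..., a branch name, or a revision range all work.
timecard_counts() {
  git log --pretty='%ad' --date=format:'%a %H' "$@" | sort | uniq -c
}
```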
Git Reroot - When Rebase is Too Gentle2009-01-06T00:00:00+00:00http://dustin.github.com/2009/01/06/git-reroot<h1 id="git-reroot---when-rebase-is-too-gentle">Git Reroot - When Rebase is Too Gentle</h1>
<div>
<img src="/images/transplant.jpg" alt="transplant" class="floatright" />
</div>
<p>The fun thing about git is that it’ll do whatever you tell it.</p>
<p>Many newcomers look at it as this really complicated beast that is
impossible to understand, but less resistant users find that it’s
very happy to just sit back and do whatever you ask of it
(even if you ask it to do something stupid).</p>
<p>Recently, I was working on a project, and wanted to rebase a branch
that had drifted quite a bit away from the master branch. <code class="language-plaintext highlighter-rouge">rebase</code>
itself wasn’t getting me anywhere due to various conflicts from some
partial merges and manual merges.</p>
<p>As an attempt towards a solution, I created <a href="http://gitorious.org/projects/bindir/repos/mainline/blobs/master/git-reroot">git reroot</a>.</p>
<h2 id="what-does-it-do">What Does it Do?</h2>
<p><code class="language-plaintext highlighter-rouge">git reroot</code> is very similar to <code class="language-plaintext highlighter-rouge">rebase</code> conceptually, with one subtle
detail – <code class="language-plaintext highlighter-rouge">rebase</code> works by rewinding to a merge point and replaying
deltas (while dropping duplicates). <code class="language-plaintext highlighter-rouge">reroot</code> works by taking a range
of commits and placing the commits at the end of the current <code class="language-plaintext highlighter-rouge">HEAD</code> by
exact tree state.</p>
<p>The distinction is subtle, but important. git does not record
changes; it snapshots tree states with some additional metadata.
Commit deltas may be computed between arbitrary trees, so the
representations you often see are these deltas.</p>
<h2 id="when-should-i-use-it">When Should I Use It?</h2>
<p>Quite likely never. It was not appropriate for the project for which
I created it.</p>
<p>However, if</p>
<ol>
<li>you find yourself with a branch that has diverged too far,</li>
<li>you consider the result of this branch to be the desired state, and</li>
<li>it’s OK to think of the commits as snapshots of work instead of
changes to previous state,</li>
</ol>
<p>then you may find <code class="language-plaintext highlighter-rouge">reroot</code> helpful.</p>
<h2 id="how-do-i-use-it">How Do I Use It?</h2>
<p>The invocation recipes are different from those of <code class="language-plaintext highlighter-rouge">rebase</code> because
it’s more of a “do what I want” kind of tool.</p>
<p>In a really simple case, let’s say you have a branch <code class="language-plaintext highlighter-rouge">new-development</code>
that diverged from <code class="language-plaintext highlighter-rouge">master</code> a while back. Some work has been done on
master, but you really just want <code class="language-plaintext highlighter-rouge">new-development</code> to be master. For
whatever reason, you don’t want to do a merge to get it there, and
rebase fails you due to conflicts you really don’t care about.</p>
<p>You would invoke <code class="language-plaintext highlighter-rouge">reroot</code> as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git reroot master..new-development
</code></pre></div></div>
<p>You should see some output that’s showing you progress, and then a
line that looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The newly created history is available as 2015200[...]
</code></pre></div></div>
<p>This command is <em>completely non-destructive</em>, and will not affect <em>any</em>
ref, so it’s safe to run whenever and wherever you like.</p>
<p>This output is telling you that the new tree is available, but not
linked. You may use log (<code class="language-plaintext highlighter-rouge">git log 2015200</code>) to examine it, and when
you’re ready to overwrite the current ref:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git reset --hard 2015200
</code></pre></div></div>
<p>If you look through the deltas (<code class="language-plaintext highlighter-rouge">git log -p</code>), you may see some
changes that are much larger than you’d expect (especially towards the
beginning, or any merge points), but at any given commit, the source
tree is guaranteed to be in the exact state it was in when the author
committed it.</p>
Git Archaeology2008-12-31T00:00:00+00:00http://dustin.github.com/2008/12/31/archaeology<h1 id="git-archaeology">Git Archaeology</h1>
<div>
<img class="floatright" src="/images/indiana_jones_small.jpg" alt="indy" />
</div>
<p>I just spent a while reconstructing the history of my code
<a href="http://github.com/dustin/snippets" title="snippets">junk drawer</a>. It’s on its fourth revision control system now
(<a href="http://www.nongnu.org/cvs/" title="concurrent version system">cvs</a> → <a href="http://www.gnu.org/software/gnu-arch/" title="gnu arch">tla</a> → <a href="http://www.selenic.com/mercurial/" title="mercurial">mercurial</a> →
<a href="http://git-scm.com/" title="git">git</a>) and has been through a lot of different tree states.</p>
<p>CVS really only versions files, but allows you to arrange things into
a hierarchy, so I had a natural hierarchy and reflected it in a
similar way in CVS.</p>
<p>Gnu arch favored smaller repositories, so when I did the conversion
from CVS, I broke the snippets down into several different “branches”
and versioned each language independently. I had one container branch
that had a build config that would recreate the tree. This codebase
lived through three different archives (repositories) and some of the
individual snippets had a couple versions within that.</p>
<p>Once I started using mercurial more, I needed my snippets with me, but
mercurial didn’t have a similar mechanism for managing a collection of
repositories (even today, the <a href="http://www.selenic.com/mercurial/wiki/index.cgi/ForestExtension" title="forest">forest extension</a> is not
distributed with mercurial). I had attempted to use <a href="http://darcs.net/" title="darcs">darcs</a> to
reconstruct a single tree with full history and the trees renamed, but
darcs would never complete on even a subset of what needed to be
converted. I ended up just snapshotting what was in gnu arch and
dropping it into a single mercurial repository.</p>
<p>Having moved into git, I finally have the tools to actually put the
history back together correctly. By “correctly”, I mean I wanted a
single repository with all of the changes in it ordered
chronologically (the order in which junk was placed in the drawer)
without lots of weird merges that didn’t actually happen. I <em>also</em>
needed to dig up all of the history prior to the snapshot I took for
mercurial and get it all in place.</p>
<h2 id="bringing-up-snippets">Bringing up Snippets</h2>
<p>Just to add to the complexity story, keep the following in mind:</p>
<ol>
<li>After tla, code was committed into mercurial from a snapshot.</li>
<li>That snapshot was (cleanly) converted to git, and more code was
committed there.</li>
<li>One failed archaeological excursion had a few commits as well.</li>
</ol>
<p>I started by going to the latest gnu arch versions of each snippet set
and converting them to git repositories (by way of mercurial – but
that’s a different story).</p>
<h2 id="setting-up-the-repo">Setting up the Repo</h2>
<p>I created a repo with a single empty commit in it as the eventual root
of all of the other repos.</p>
<p>Once each repository was converted to an individual git repository, I
added them all as remotes to the conversion repository. Each
branch needs to be considered related in order to facilitate the
eventual merge, so I created grafts that placed the root of each
branch atop my empty commit using the following script:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/sh</span>
<span class="nv">empty</span><span class="o">=</span>6c417dd379ccdb46de57e7a3860379633c270c9e
<span class="k">for </span>b <span class="k">in</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">do
</span><span class="nv">oldest</span><span class="o">=</span><span class="sb">`</span>git rev-list <span class="nt">--reverse</span> <span class="nv">$b</span> | <span class="nb">head</span> <span class="nt">-1</span><span class="sb">`</span>
<span class="nb">echo</span> <span class="s2">"Grafting </span><span class="nv">$b</span><span class="s2">"</span>
<span class="nb">echo</span> <span class="s2">"</span><span class="nv">$oldest</span><span class="s2"> </span><span class="nv">$empty</span><span class="s2">"</span> <span class="o">>></span> .git/info/grafts
<span class="k">done</span></code></pre></figure>
<p>This was run for every remote repo and then each branch was run
through <code class="language-plaintext highlighter-rouge">git filter-branch</code> to place the changes atop the empty branch
in a real history.</p>
<h3 id="rewriting-tree-structures">Rewriting Tree Structures</h3>
<p>These weren’t quite ready to merge just yet. Before I could even
consider an actual merge, I needed to modify the tree structures
(e.g. take all of the stuff at the toplevel of the <code class="language-plaintext highlighter-rouge">eiffel</code> directory
and move it under an <code class="language-plaintext highlighter-rouge">eiffel/</code> directory). The previous excursion had
done this using a recipe I’d found on the internet somewhere which
<em>worked</em>, but did the wrong thing with my version of sed. Using gsed
cleaned this up.</p>
<p>For each remote branch, I’d run the following filter:</p>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"><span class="c">#!/bin/sh</span>
git filter-branch <span class="nt">-f</span> <span class="nt">--index-filter</span> <span class="se">\</span>
<span class="s1">'git ls-files -s | gsed "s-\t-&eiffel/-" |
GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
git update-index --index-info &&
mv $GIT_INDEX_FILE.new $GIT_INDEX_FILE'</span> <span class="nv">$1</span>/master</code></pre></figure>
<p><strong>Note</strong>: Out of pure laziness, I would edit this script for every
invocation and then run it for a single remote.</p>
<h3 id="doing-the-merge">Doing the Merge</h3>
<div>
<img alt="merge" class="floatleft" src="http://img.skitch.com/20090101-6s52spepjx7qgjaj3yuscuasa.png" />
</div>
<p>The merge was really rather exciting. The image to the left shows a
24-way octopus merge.</p>
<p>That is, after grafting the empty changeset to the bottom of every
branch, they now had common ancestry, making a merge possible. Since
each branch got its paths rewritten all throughout history, there was
no chance of conflict.</p>
<p>So enter the octopus.</p>
<p><br clear="both" /></p>
<h3 id="performing-a-linear-rewrite">Performing a Linear Rewrite</h3>
<p>As cool as it was to do a massive octopus merge, I wanted linear
history.</p>
<p>It would be possible to produce a graft file to place each change atop
a single parent, but that seemed quite hard.</p>
<p>The strategy I employed was to dump the entire history using
<code class="language-plaintext highlighter-rouge">git format-patch</code> and then write <a href="http://github.com/dustin/snippets/tree/master/python/misc/rewrite-patches.py">a script</a> to rename
all of the patches to be in chronological order so I could use
<code class="language-plaintext highlighter-rouge">git am</code> to reconstruct the tree.</p>
<p>So I created a new branch from “empty”, and ran <code class="language-plaintext highlighter-rouge">git am</code> for a while.
A nice bonus is that <code class="language-plaintext highlighter-rouge">git apply</code> strips off trailing whitespace for
me, so the changes were slightly cleaned on the way in (I could’ve
disabled that, but I rather liked it).</p>
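The dump-and-replay step looks roughly like this (a sketch: the chronological renaming pass is elided, the <code>linearize</code> name is mine, and its first argument stands for the synthetic “empty” root described above):

```shell
#!/bin/sh
# Sketch of the dump-and-replay step.  "$1" is the synthetic root
# ("empty" above) and "$2" is the tip whose history is being
# linearized; the chronological renaming pass is elided here.
linearize() {
  patches=$(mktemp -d)
  git format-patch -o "$patches" "$1..$2" >/dev/null
  git checkout -q -b linear "$1"
  git am -q "$patches"/*.patch
}
```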
<h3 id="removing-emptiness">Removing Emptiness</h3>
<p>I no longer needed the “empty” changeset after <code class="language-plaintext highlighter-rouge">git am</code> was complete,
so I had to get rid of that. The root node is generally a bit
difficult to touch, but I sort of guessed that I could add a graft of
a hash without a parent and it’d make that change the new root.</p>
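In recipe form, that guess looks something like this (a sketch, not the exact commands I ran: a graft line containing only a commit hash declares that commit parentless, and filter-branch then bakes the graft into real history):

```shell
#!/bin/sh
# Sketch of the root-removal trick: graft the second-oldest commit as
# parentless, then rewrite history so the graft becomes permanent.
# This rewrites refs; run it on a scratch clone.
drop_root() {
  second=$(git rev-list --reverse HEAD | sed -n 2p)
  # A graft line with no parent hashes makes this commit the new root.
  echo "$second" > .git/info/grafts
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f -- --all
  rm .git/info/grafts
}
```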
<p>So another trip through <code class="language-plaintext highlighter-rouge">git filter-branch</code> and I’ve now got a pretty
decent set of history up through the snapshot that was taken for the
mercurial conversion.</p>
<h3 id="catching-up-to-the-present">Catching up to the Present</h3>
<p>So now that I’ve got everything up to the snapshot, what do I do?</p>
<p>I had a lot of options here – cherry-picking, grafting,
format-patch. I think I went with format-patch arbitrarily.
Basically, I did a <code class="language-plaintext highlighter-rouge">git format-patch</code> of the full history from the
latest git repo and applied those changes to the newly created one.</p>
<h3 id="verification">Verification</h3>
<p>So now that everything has been all hacked up and history is rewritten
and changesets grafted, etc… how do I have any idea whether it’s
even close to where it was before?</p>
<p>This is where git’s content tracking stuff really saves the day. With
the git repo I’ve been using as a remote, I can do a simple diff
across the trees from the latest branches (and various other states).
The only differences I saw were that some new scripts and such had been
added.</p>
<p>All’s well. I certainly learned a lot.</p>
Using Git Alternates2008-12-30T00:00:00+00:00http://dustin.github.com/2008/12/30/git-alternates<h1 id="using-git-alternates">Using Git Alternates</h1>
<p>Now that you’re happily using
<a href="/2008/12/29/github-sync.html">github sync</a> to pull down all your
repos into local bare trees, you may want to free up a bit of disk
space from duplicate objects (about 120MB for me).</p>
<p>git has a way for multiple repos to share object space by way of
alternates. You can read more about alternates in the
<a href="http://www.kernel.org/pub/software/scm/git/docs/gitrepository-layout.html">repository layout documentation</a>, but essentially it’s a text file
that contains the location of another <code class="language-plaintext highlighter-rouge">objects</code> directory from which
objects may be fetched when needed.</p>
<h2 id="example">Example</h2>
<p>Let’s say you’re me and have checked out my photo album. You’d end up
with a .git directory that looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dhcp-39:/tmp/photo 599% du -sh .git
18M .git
</code></pre></div></div>
<p>By setting up an alternate using my <a href="https://github.com/dustin/bindir/blob/master/git-alternate">git alternate</a> command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dhcp-39:/tmp/photo 600% git alternate ~/prog/github/photo.git
.git/objects -> /Users/dustin/prog/github/photo.git/objects
</code></pre></div></div>
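<p>Under the hood there’s nothing magic going on — the whole mechanism boils down to one text file naming another object store. A scratch demo of the same effect (temp-dir paths, not my real layout):</p>

```shell
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$work/src"
echo data > "$work/src/file" && git -C "$work/src" add file
git -C "$work/src" commit -qm "a commit"
git clone -q "$work/src" "$work/copy"
# Point the clone at the original's object store...
echo "$work/src/.git/objects" > "$work/copy/.git/objects/info/alternates"
# ...after which gc can drop every object the alternate already has.
git -C "$work/copy" gc --quiet
git -C "$work/copy" log --oneline   # history is still fully readable
```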
<p>You can then gc and free up gangs of disk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dhcp-39:/tmp/photo 601% git gc
Nothing new to pack.
Removing duplicate objects: 100% (256/256), done.
dhcp-39:/tmp/photo 602% du -sh .git
144K .git
</code></pre></div></div>
<p>From 18MB to 144KB, and everything pretty much works as it did before.</p>
<p>You don’t need my <a href="https://github.com/dustin/bindir/blob/master/git-alternate">git alternate</a> command for that, of course, but
it makes it a bit easier when you’ve got a lot of them to do.</p>
Using Github Sync to Track Your Projects2008-12-29T00:00:00+00:00http://dustin.github.com/2008/12/29/github-sync<h1 id="using-github-sync-to-track-your-projects">Using Github Sync to Track Your Projects</h1>
<div>
<img alt="octocat syncing" class="floatright" src="/images/octocat-sync-small.png" />
</div>
<p>When <a href="http://github.com/">github</a> announced their
<a href="http://github.com/guides/the-github-api">API</a>, I very quickly threw
together a <a href="http://github.com/dustin/py-github">python implementation</a>.</p>
<p>I didn’t end up doing very much with the project as a whole, but I did
write one tool in here that I end up using quite a bit:
<code class="language-plaintext highlighter-rouge">githubsync.py</code>.</p>
<p><code class="language-plaintext highlighter-rouge">githubsync.py</code> takes a github username and a directory and makes sure
I’ve got a local copy of every public repo that user has on github.</p>
<p>Grab the repo and try it out:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone git://github.com/dustin/py-github.git
cd py-github
./src/githubsync.py dustin /tmp/dustinatgithub
</code></pre></div></div>
<p>Once that finishes, you will have all of my current public repos in
<code class="language-plaintext highlighter-rouge">/tmp/dustinatgithub</code>, and if you run it periodically, you’ll see new
repos I add appear while the existing ones are being updated.</p>
<p>But what about private repos, or even repos that aren’t on github?</p>
<p>The file <code class="language-plaintext highlighter-rouge">~/.github-private</code> is read as a tab-delimited list of repos
and their sources and those will also be synchronized. For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cool-stuff git@github.com:dustin/cool-stuff.git
</code></pre></div></div>
<p>With that in place, the <code class="language-plaintext highlighter-rouge">cool-stuff</code> repo will be created and
synchronized along with all of the stuff found through the API.</p>
Wasted Time Developing for iPhone2008-12-26T00:00:00+00:00http://dustin.github.com/2008/12/26/wasted-time-on-iphone<div>
<img alt="wasted time" class="floatright" src="http://img.skitch.com/20081227-qukmwpnbu6u9qnruimsqnrdj2y.jpg" />
</div>
<p>OK, everybody’s written about this, but I just wasted a bunch of time making an
iPhone app.</p>
<p>I don’t actually feel too bad about it because it was a <a href="http://github.com/dustin/twister-iphone">pretty stupid</a>
iPhone app, anyway, but I’m not going to finish it because my first attempt to
run it outside of the simulator was going to cost me a hundred bucks.</p>
<p>The application is an iPhone port of my <a href="http://dustin.github.com/2008/12/25/twister.html">twister</a> app, but with worse
graphics and sound (though the sound is at least <em>potentially</em> better). It’s
functional enough to play a few games, but not fully polished.</p>
<p>I was hoping I could stick it on my daughter’s iPhone so she could play, but
that doesn’t seem to be the case.</p>
<p>If anyone wants to do something with it, it’s over on <a href="http://github.com/dustin/twister-iphone">github</a>.</p>
<p>Of course, I’d love to find out I was wrong and I can actually run my own
program on my own phone without paying more…</p>
pfetch2008-12-26T00:00:00+00:00http://dustin.github.com/2008/12/26/pfetch<h1 id="about-pfetch">About pfetch</h1>
<div>
<img alt="octopus" src="/images/octopus.png" class="floatright" />
</div>
<p>For a long time now, I’ve had various cron jobs running to fetch various web
resources with which I’d build out parts of my own site, or supply myself with
custom RSS feeds after a pass through xsltproc.</p>
<p>This mostly worked OK, but there were a few things wrong with it:</p>
<ol>
<li>I had to be careful to avoid putting stuff in place upon fetch failure.</li>
<li>Fetch failures would send me email unless I put effort into avoiding that.</li>
<li>Network timeouts would cause cron jobs to start piling up.</li>
<li>I’ve actually had cron get sick of running my jobs and just stop altogether.</li>
<li>Various jobs that ran at various frequencies would be in various scripts and hard to keep up with.</li>
<li>Running through cron means all jobs start at the exact same moment in time, thus are more likely to cause strain on web servers (if everybody does it).</li>
<li>Conditional gets require cross-invocation state to be stored (though I wrote <a href="http://github.com/dustin/snippets/tree/master/python/net/http/fetch.py">a tool</a> for this).</li>
<li>Sequential processing meant the whole thing took longer.</li>
</ol>
<p>After a while, the problems added to enough of an annoyance that I decided to
do something about it, so a couple months ago I started
<a href="http://github.com/dustin/pfetch">pfetch</a>.</p>
<p>pfetch is a simple <a href="http://twistedmatrix.com/">twisted</a> app that does scheduled parallel http requests
and optionally runs scripts after successful execution.</p>
<p>Given a list of URLs, each with a destination, a frequency, and an optional
command (with arguments) to run after each successful (200) response, each URL
will begin a fetch cycle starting at a random offset from the start time and
loop on the defined interval.</p>
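<p>The random offset is the part that keeps everybody from hammering servers at the same instant. A small sketch of the idea (my illustration, not pfetch’s actual API):</p>

```python
import random

def fire_times(interval, horizon, rng):
    """Seconds-from-start at which one fetcher fires: a random initial
    offset within the interval, then a fixed loop on that interval."""
    t = rng.uniform(0, interval)
    times = []
    while t < horizon:
        times.append(t)
        t += interval
    return times

rng = random.Random(42)
# Two feeds polled every 300s over a half hour: same cadence, but each
# starts at its own offset, so their requests never line up in lockstep.
a = fire_times(300, 1800, rng)
b = fire_times(300, 1800, rng)
```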
Twister2008-12-25T00:00:00+00:00http://dustin.github.com/2008/12/25/twister<h1 id="twister">Twister</h1>
<div>
<img alt="twister" class="floatright" src="http://upload.wikimedia.org/wikipedia/en/thumb/0/09/1966_Twister_Cover.jpg/275px-1966_Twister_Cover.jpg" />
</div>
<p>So, on Christmas, my kids decided they wanted to play twister. They wanted me
to spin the thingy and call out moves for them. That got <em>really</em> boring after
about five minutes.</p>
<p>I wrote a really simple python script to start calling the moves for me since
the spinny thing was getting annoying, and would sometimes end up pointing
between two colors or otherwise be too difficult to call.</p>
<p>The <a href="http://gist.github.com/40015">first version</a> of the script looked like
this:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#!/usr/bin/env python
</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">colors</span><span class="o">=</span><span class="p">(</span><span class="s">'red'</span><span class="p">,</span> <span class="s">'green'</span><span class="p">,</span> <span class="s">'yellow'</span><span class="p">,</span> <span class="s">'blue'</span><span class="p">)</span>
<span class="n">limbs</span><span class="o">=</span><span class="p">(</span><span class="s">'left foot'</span><span class="p">,</span> <span class="s">'right foot'</span><span class="p">,</span> <span class="s">'left hand'</span><span class="p">,</span> <span class="s">'right hand'</span><span class="p">)</span>
<span class="k">print</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">limbs</span><span class="p">),</span> <span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span></code></pre></figure>
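<p>(That’s 2008-era Python 2; if you want to try it today, the Python 3 version just needs <code class="language-plaintext highlighter-rouge">print()</code> as a function:)</p>

```python
#!/usr/bin/env python3
import random

colors = ('red', 'green', 'yellow', 'blue')
limbs = ('left foot', 'right foot', 'left hand', 'right hand')
print(random.choice(limbs), random.choice(colors))
```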
<p>That was fine, but I still had to run it and call it out. Then I remembered
that someone made a <a href="http://www.gnufoo.org/macosx/">talking cat</a> for OS X. All
I needed to do was run the output of this thing through that, and there’d be
speech and then I could go about my business and let the computer call moves
for them.</p>
<p>I thought that was kind of cool, but wanted something a little…more. I ended
up writing a full <a href="http://github.com/dustin/twister">OS X desktop version</a>
complete with images, icons, a preference pane, etc…</p>
<p>The kids finished playing (using the prototype) long before I finished writing
the app. It was fun for all of us, though. :)</p>
<p>If anyone wants to play a two-player version of twister, though, you can grab
a copy.</p>
<h2 id="download">Download</h2>
<p><a href="http://public.west.spy.net/app/Twister_1.1.zip">Version 1.1</a></p>
Moody Bots2008-12-24T00:00:00+00:00http://dustin.github.com/2008/12/24/moody-bots<h1 id="moody-bots">Moody Bots</h1>
<p><a href="/twitterspy/">Twitterspy</a> is a rather brute-force way to achieve xmpp
functionality for twitter. It makes very heavy use of twitter search to
provide track-like functionality to end users.</p>
<p>I’ve noticed in watching the logs that I often seem to get more errors than
successes when attempting searches. I wanted a way to look at this
information via xmpp.</p>
<p>Initially, it seemed like status would be a good way to do this. Currently,
the status is used to show stats on how many users and queries the bot knows
about. This is already a little weird, and I wouldn’t want to try to shove too
much stuff into it.</p>
<p>Next, I thought about using the vcard for it. The bio is a fine way to
describe such things. That wasn’t quite right, either. The bio is a better
general description of the bot, and not so much status.</p>
<p>Then I discovered <a href="http://xmpp.org/extensions/xep-0107.html">XEP-0107</a> – user
moods. User moods in a <a href="http://xmpp.org/extensions/xep-0163.html">PEP</a>
transport provide exactly the kind of thing I’m looking for.</p>
<div><img src="http://img.skitch.com/20081225-g8nbh7s3np2amubspgkas2ab1f.png" class="floatright" alt="twitterspy angry" /></div>
<p>Twitterspy keeps track of how many of its searches are successful, and how many
fail. When many searches are successful, it’s in a good mood; when few are,
it’s in a bad mood.</p>
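<p>The selection itself is little more than bucketing a success ratio into XEP-0107 mood names. A sketch with made-up thresholds (the real rules were my daughter’s, and fancier):</p>

```python
def pick_mood(successes, failures):
    """Map recent search results onto an XEP-0107 mood name.
    Thresholds here are illustrative, not twitterspy's actual ones."""
    total = successes + failures
    if total == 0:
        return 'calm'
    ratio = successes / total
    if ratio > 0.9:
        return 'happy'
    if ratio > 0.5:
        return 'contented'
    if ratio > 0.25:
        return 'annoyed'
    return 'angry'

# e.g. a bad stretch of searches (3 hits, 17 errors) leaves the bot angry
mood = pick_mood(3, 17)
```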
<div><img src="http://ralphm.net/images/mood/knology/excited.gif" class="floatleft" alt="excited!" /></div>
<p>I had my kid look through the XEP to come up with some rules for how to select
a mood based on how successful recent searches are. I’ve applied many of her
changes, but some still require me to keep a bit more state than I do
currently. It’s kind of an exciting thing, though few people will ever
actually see it.</p>
<p>The pubsub mechanism will hopefully show itself to be useful, though. I’m
hoping to do something cool like have a web status showing moods and all.
<a href="http://ralphm.net/">Ralph Meijer’s</a> <a href="http://ralphm.net/moods">moods page</a> is
quite inspirational here – as long as I’m capturing the data.</p>
<p>For the rest of you out there: Bring your XMPP services to life. Show their
moods.</p>
Building Your Site Connectivity2008-12-24T00:00:00+00:00http://dustin.github.com/2008/12/24/building-site-connectivity<h1 id="building-your-site-connectivity">Building Your Site Connectivity</h1>
<p>When I started building this jekyll site, I thought it’d be nice to link to
all the other places I leave junk around the internet. Rather than manually
building a list, I took a bit of time to write something to do it for me using
the <a href="http://code.google.com/apis/socialgraph/">google social graph API</a>.</p>
<p>I made a simple <a href="http://public.west.spy.net/autolinks.html">web form</a> to do
this that generates HTML source so it can be further hand-edited if needed, but
more importantly, so that I can paste the results into my github page
and have it actually count for site connectivity.</p>
<p>If you don’t maintain a list of links on your own page, you may find it helpful
to link to your <a href="http://friendfeed.com/">friendfeed</a> account.</p>
<p>For example, you can see
<a href="http://public.west.spy.net/autolinks.html?u=http://friendfeed.com/dlsspy">how friendfeed links me</a>.
Change the username from dlsspy to yours for results that make more sense to
you.</p>
<p>As it’s just a simple chunk of HTML, I’ve created
<a href="http://gist.github.com/39613">a gist</a> to house it for now. If you’d like to
change this for the better, do it there and let me know about it.</p>
<p>I hope someone (else) finds this useful.</p>
Trying out Jekyll2008-12-23T00:00:00+00:00http://dustin.github.com/2008/12/23/trying-jekyll<h1 id="trying-out-jekyll">Trying out Jekyll</h1>
<p>Since <a href="http://github.com/mojombo/jekyll">Jekyll</a> seems to be all the craze, I
figured I’d give it a shot and see if it solved any problems for me.</p>
<p>So far, I like it.</p>