a series of short posts

Reflecting on the posts that I’ve written for my website, I’m concerned that my writing feels a bit fake. I doubt it feels fake to other people, as I’ve written my honest opinions and my style is hardly constrained by traditional conventions and norms. I am worried because my writing feels fake to me. It doesn’t reflect my own way of thinking; it’s far too tidy, too orderly, too put together to truly express my thoughts.

Of course, it would be terrible if writing actually revealed a person’s unfiltered ideas. The same sentences or meaningless words would be repeated over and over as the author relives inconsequential moments that occurred earlier in the day. No one really has control over their own thoughts, and so we’d be forced to peer into the hell hole that is the human mind. It would be deeply unpleasant and pointless.

However, I do want to be true to myself, and something about the orderly nature of my posts is starting to annoy me. Inside my head I feel chaos; I don’t understand my own interests, and my ideas feel disjointed and clumsy. My mind is full of competing, seemingly disparate topics fighting for attention, and I want to find a way to understand that. This post is for me, and in some ways is about me as well.

To slightly cut down on the weirdness, I will at least tell you what each part is about. Part One is about Facebook, Cambridge Analytica, data sharing, Google, and the GDPR. Part Two is very short, and touches on the concept of design. Part Three is about data and programming, the value of old data, and real-time processing. Part Four is about the media industry generally.


The Facebook / Cambridge Analytica story is an odd one. There are so many pieces to it that I feel people aren’t getting. This data isn’t your data; it’s Facebook’s data about you. Sure, there is a consent issue here: you agreed to give it to Facebook, albeit probably unknowingly, not to Cambridge Analytica. But is that really what gets people so worked up? Are we really that mad that it was Cambridge Analytica, not Facebook, doing the “microtargeting”? Plus, all the media organizations making a big deal out of this are being a bit hypocritical. After all, open dev tools on any of their sites and you’ll quickly see the ad tech hell hole they operate in. Your data is being smeared across the internet.

On the other hand, this is all very timely. The GDPR is two months away and no one is ready. That is not especially surprising, given that media companies have never really had to be involved in big regulatory compliance efforts before. What is surprising is how unprepared Google seems to be. Just a few weeks ago I wrote a post arguing that Google Analytics collects personal data, and that Google Analytics does not offer the necessary tools to handle that. Then last week Google sent out an email basically agreeing, and promising to make those tools before the GDPR comes into force. It surprises me that it took Google this long.

But what is even more surprising is Google’s new position on ads. Google makes money by tracking people everywhere they go on the internet and using that data to connect buyers to target audiences. The data from one website alone just isn’t good enough; to really understand a person, companies need to track that person’s activity across sites. Google does that. But now, Google realizes the GDPR poses a problem. This data is undeniably personal, and therefore Google needs consent (or legitimate interest, but that just isn’t going to work here). So now Google is telling companies that it needs to collect and pass on user consent in order to serve targeted ads.

Will that work? I have my doubts. In order for consent to be valid it must be given for a specific purpose, and it must be presented “in an intelligible and easily accessible form, using clear and plain language”. In addition, “the data subject should be aware at least of the identity of the controller and the purposes of the processing for which the personal data are intended”. How on earth do you explain Google’s ad-tech ecosystem in a simple way that communicates the necessary information? And what is the specific purpose? Google is going to use that data to target ads everywhere you go, on totally unrelated sites. That’s pretty non-specific.

We’re still waiting on the details of how Google plans on collecting this consent. Like I said though, I am not sure it is possible at all. And even if it is possible, will people actually consent? If a large number of people do consent, that almost feels like evidence that the consent is not clear enough. Once you know how Google is using the data across sites, I can’t see why people would want to agree to that.

It is entirely possible to approach design from a rigorous, scientific perspective. Maybe web design is really a numbers game where the goal is to maximize time spent on a page, or number of purchases, or some other metric. Design can be quantified, and it’s easy to see how my real job as a data engineer could overlap with my design work.

But I didn’t approach design from that perspective. I first started thinking about design by talking about what I didn’t like. The words I used to describe my preferences were hardly scientific. Time-based animations are fascist because they strip away personal control, treating website visitors as hostages. Snap scrolling is theft, as it takes control over what is rightfully mine, my browser’s scroll functionality, and uses it against me. These are not the measured, well-reasoned professional opinions of a data scientist.

The design of this piece is clearly excessive. It is probably fairly annoying, for which I am sorry. However, I think the various techniques on display here are interesting, and that is really what I wanted to highlight. Used with more moderation, I think the style in this piece could be nice.


Old, unused data is incredibly overvalued. Companies collect all sorts of data they never use. The data sits idly in a data warehouse somewhere, waiting for an enthusiastic data scientist to come along and turn it into gold. Except that day never comes, and it just keeps on sitting. At my current job, I pretty much have free rein when it comes to using our data. I can access whatever I want, collect whatever I want, and do whatever I want. It’s pretty cool. And you know what I don’t do? Play with old data.

Old data sucks. One, it’s never in the format that I need; after all, it was collected without any specific purpose in mind. If I want to do something with the data, I first have to set up a connection to the database, and then I have to start pulling data into memory so that I can process it. Two, old data doesn’t tell the whole story. I can use the data for analysis purposes, but I then have to walk around the building and ask people to explain my findings. I don’t know what happened in April 2016! Interpreting results can be very difficult.

That’s why I’ve started taking the Bill O’Reilly approach to data science, also known as the “Fuck it, we’ll do it live” approach. Instead of collecting a lot of data and dumping it into some non-relational storage system to gather dust, I have started writing scripts to process data on the fly. Sure, by not saving the raw data I kind of limit our options in the future. However, no one was going to do anything with it in the future anyway, so we might as well make it useful now.

To follow the Bill O’Reilly approach, I use PubSub and Redis. Data flows into PubSub and is sent out to subscribers. A subscriber takes that data, calculates the metrics I am interested in, and stores those metrics in Redis. If the data processing can’t be done in real time, I store the granular data in a Redis hash, and have another app running in the background calculating the metrics from those hashes. It’s all very easy once you’ve done it a few times, and it makes life better for everyone. You can then create a simple API to return the data, or just let other developers access the Redis server directly.

The other cool thing about Redis is that you can set expiration times, so the data deletes itself. In other words, no old data.
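To make the pattern concrete, here is a minimal sketch of a subscriber callback that increments per-page counters and sets an expiration so the data cleans itself up. The event shape, key names, and TTL are all hypothetical, and the `FakeRedis` class is just an in-memory stand-in for the two Redis commands used (`HINCRBY` and `EXPIRE`), so the example runs without a server; with the real `redis-py` client, `store` would be a `redis.Redis()` instance and the method calls keep the same names.

```python
import time

class FakeRedis:
    """In-memory stand-in for the Redis HINCRBY/EXPIRE/HGETALL commands."""

    def __init__(self):
        self._hashes = {}
        self._expiry = {}

    def hincrby(self, key, field, amount=1):
        self._evict(key)
        h = self._hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
        return h[field]

    def expire(self, key, seconds):
        # Real Redis deletes the key when the TTL elapses; we emulate
        # that lazily by checking the deadline on each access.
        self._expiry[key] = time.time() + seconds

    def hgetall(self, key):
        self._evict(key)
        return dict(self._hashes.get(key, {}))

    def _evict(self, key):
        if key in self._expiry and time.time() >= self._expiry[key]:
            self._hashes.pop(key, None)
            self._expiry.pop(key, None)

def handle_event(store, event, ttl_seconds=7 * 24 * 3600):
    """Hypothetical PubSub subscriber callback: update a metric on the fly.

    Each message is an event dict; we bump a counter in a per-day hash
    and (re)set its expiration, so old data deletes itself.
    """
    key = "pageviews:" + event["date"]        # e.g. "pageviews:2018-04-01"
    store.hincrby(key, event["page"], 1)      # compute the metric incrementally
    store.expire(key, ttl_seconds)            # no old data

store = FakeRedis()
for event in [
    {"date": "2018-04-01", "page": "/home"},
    {"date": "2018-04-01", "page": "/home"},
    {"date": "2018-04-01", "page": "/about"},
]:
    handle_event(store, event)

print(store.hgetall("pageviews:2018-04-01"))
# → {'/home': 2, '/about': 1}
```

A simple API, or direct Redis access, can then read the `pageviews:*` hashes; nothing ever needs to query a warehouse of raw events.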


Every time a big investigative story breaks, we hear journalists confidently asserting, “This is why we need good, quality journalism now more than ever.” This is usually meant to imply that people should pay for journalism. What’s awkward is that journalists make the same argument every time journalism fails. If a story gets something clearly wrong, journalists again assert the apparent need for quality journalism. “This is why we need good, quality journalism. You need to pay us so we have the resources to do our job properly.”

At the moment this seems to be working. Democracy Dies in Darkness, after all, so you need to hand over your money to the Washington Post. Nevermind that the Washington Post prohibits journalists from criticizing advertisers. I guess they wouldn’t have to be corporate shills if you paid them more. Still, “pay us or we’ll shill for our advertisers” is a weird sales pitch.

Journalists also tend to use the wrong stories to hold up as examples of “good, quality journalism”. What if we didn’t have strong media companies? According to some journalists we would’ve never known about the Facebook / Cambridge Analytica story! Except that is clearly not true. The “whistleblower” dude could have blown the whistle on himself on his own blog, or gone to an MP, or gone to any terrible blogger in the world. There are some legal benefits of going to a media company, but media companies are going to continue to exist; they don’t have to be big and powerful and employ hundreds of people in order to reveal these stories. These types of stories aren’t at risk of going away. All it really takes is one person with a computer.

What might go away are the stories that don’t need to be told, but nonetheless provide value, mostly in the form of entertainment. For example, Jerry and Marge Go Large, a story about a small town couple that exploited poorly designed lotteries to make tons of money. Or The Wetsuitman, a story about identifying a dead body that washed up on the shores of Norway. Or The Strange and Curious Tale of the Last True Hermit, which tells the story of the North Pond Hermit, a guy who lived undetected in the wilderness of Maine for decades.

I would be sad if those stories didn’t exist. They added something to my life. However, they aren’t of national importance, which in some ways is what makes them so valuable. They aren’t guaranteed to continue to exist; no one is going to feel compelled to write those stories for free. No one really stands to benefit from telling the story of a small town couple that found exploits in the lottery. I can at least imagine a world in which this type of journalism is less common, and that makes me sad.

I can’t imagine that world for news, opinion pieces, or “analysis articles” covering current events. Those stories aren’t valuable; they are everywhere, and people are happy to provide that content for free. People share news events live from their phones. Everyone has an opinion, which explains why blogs exist. You don’t need to pay for these things. They’re not going away.