Dealing with the impermanence of public data sets

Credit: Ula Kuźma

One worry that I always have when downloading data sets off the internet is their impermanence. Links die, data changes, ashes to ashes, dust to dust.

That’s why I’ve been introducing the Wayback Machine into my workflow. But even then, it’s hard to stay consistent about whether I’m downloading data from an archived page or a live one, and hard to reconstruct what I did in the past.

What I’ve done through my work with the Survey of Consumer Finances (SCF) is implement a system that archives and logs the data I use at the same time. Below is a summary of what I’ve done, but if you just want to see the code, scroll to the bottom of this post for a gist of the functions I’ve implemented for the SCF. …
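To make the archive-and-log idea concrete, here is a minimal sketch, not the gist mentioned above. It assumes the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`) and its `archived_snapshots.closest` JSON shape; the function names, the CSV log layout, and the example URL are all hypothetical choices made for illustration.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint (assumed; check archive.org's docs).
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query that asks the Wayback Machine for the closest snapshot."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20201001" to prefer a given date
    return f"{AVAILABILITY_API}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return the archived URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def _fetch_json(query):
    """Default fetcher: performs the real network request."""
    with urlopen(query, timeout=30) as resp:
        return json.load(resp)


def resolve_source(url, fetch_json=_fetch_json):
    """Prefer an archived copy; fall back to the live URL if none exists."""
    snapshot = closest_snapshot(fetch_json(availability_query(url)))
    return (snapshot, "archive") if snapshot else (url, "live")


def log_download(log_path, original_url, used_url, source):
    """Append one row per download so past choices can be reconstructed."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["retrieved_at_utc", "original_url", "used_url", "source"])
        writer.writerow(
            [datetime.now(timezone.utc).isoformat(), original_url, used_url, source]
        )
```

A typical call would be `used_url, source = resolve_source("https://example.com/data.zip")` followed by `log_download("downloads.csv", "https://example.com/data.zip", used_url, source)`. Recording whether each file came from the archive or the live site is what makes the workflow consistent after the fact.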


How to deal with the SCF’s multiple datasets

Credit: Michael Longmire

Last month the Federal Reserve released its triennial survey on the state of household finances in the U.S. in 2019: the Survey of Consumer Finances (SCF). Although the Fed provided a good summary of what’s changed since 2016, what is on everyone’s mind now is COVID-19’s impact on the economy, and all it got in the Fed’s summary was a footnote.

Despite that, the SCF can provide a good baseline of the before-times for future analysis of COVID’s impact on our financial wellbeing. Over 90% of the surveys conducted for the 2019 SCF happened before February of this year. …


Photo by Chris Ried on Unsplash

I spent the better part of the long weekend trawling through all of the publicly available data on consumer finance and the labor force. It’s part of a long-term project, but I wanted to share now some of the difficulties I encountered that I couldn’t solve through Stack Overflow or other resources.

The lesson is clear: working with CPS data in Python is hard. I leaned on a few resources, such as Brian Dew’s econ blog and Tom Augspurger’s tools, which helped immensely given the lack of an API. But since I was primarily looking to import as much data as possible and read it into a pandas DataFrame, I had to build upon their techniques to make something that fit my purposes. …
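As a rough illustration of the kind of import involved, here is a sketch that reads a fixed-width text extract into a pandas DataFrame with `pandas.read_fwf`. It assumes the data arrives as fixed-width records; the column names and character positions below are made up for the example and are not taken from any CPS data dictionary, and this is not necessarily the approach the resources above use.

```python
import io

import pandas as pd

# Hypothetical layout: (start, end) character offsets, end-exclusive.
# Real positions would come from the survey's data dictionary.
LAYOUT = {
    "month": (0, 2),
    "year": (2, 6),
    "weight": (6, 10),
}


def read_fixed_width(source, layout=LAYOUT):
    """Read fixed-width records into a DataFrame, keeping codes as strings."""
    df = pd.read_fwf(
        source,
        colspecs=list(layout.values()),
        names=list(layout.keys()),
        header=None,
        dtype=str,  # preserve leading zeros in coded fields such as "01"
    )
    # Convert only the columns that are genuinely numeric.
    df["weight"] = pd.to_numeric(df["weight"])
    return df


# Two made-up records matching the hypothetical layout above.
sample = io.StringIO("012019  52\n022019 104\n")
df = read_fixed_width(sample)
```

Reading everything as strings first and converting selected columns afterwards avoids silently turning coded values like `"01"` into the integer `1`.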


Credit: Nik Shuliahin

I’ve always been fascinated by Hal Abelson’s introductory lecture to his course on the Structure and Interpretation of Computer Programs.

Typically what you hear about computer science is that it’s about the study of . . . computers. But Abelson points out how weird that sounds. It would be as if biologists said their studies were primarily about microscopes when, in fact, the core of biology is far more grandiose: it’s about LIFE.

Abelson similarly puts computer science in more compelling terms. He posits that in the future people will look back at the people of the late 20th and early 21st centuries as amateurs thinking that they were primarily playing with gadgets when really they were beginning to formalize a language to talk about processes and “how-to” knowledge. …

About

Dan Valenzuela
