Posts

Featured Post

On using less popular R packages and data validation

Note: This post is geared towards data practitioners. If you don't fall into this group, feel free to ignore it. I have a saying about R (it probably holds for Python too, but I know it holds for R) -- there's a package for everything. The beauty of open source languages is that anyone can write their own packages [libraries] and publish them on GitHub for anyone to download. The vast majority of useful R packages are collected and hosted in the CRAN package repository, which currently features over 18,000 packages. When I say that there's a package for everything, I really mean it. There are the "standard" packages to augment base R, like the Tidyverse to help clean and structure data. But there are also more niche packages that can help you do useful things like import SAS files. There are also totally random packages, like the awesome Brooke Watson's package that can make your computer output Rapper Adlibs when your script is finished running (I think my favorite is Waka'
Recent posts

Coronavirus Strikes Back (ft. Delta)

It seems like it's been an eternity since the mad rush to get vaccinated in the United States. Some of us took desperate measures to skip ahead in line to get the first dose. Initially, the results were astounding -- in just about four months, 100 million Americans (~1/3rd of the population) had received their first dose. There was an unmistakable effect on the spread of COVID-19. Daily new cases dropped from a peak of roughly 250,000 in January to just 10,000 a day in July. Then, Something Happened What happened can best be illustrated using a chart (duh, this is Stats with Sasa). Just after it seemed we had gotten a handle on Coronavirus, it came back with a vengeance. Even though more Americans are fully vaccinated than ever, COVID is currently spreading twice as fast as last year's Second Wave. Although there were a few potential explanations, a clear one emerged: the virus had mutated. Enter the Delta Variant Originating in India, the Delta strain of the Coro

Determining NFL Quarterback Archetypes (with stats!)

We're obsessed with grouping things together. We self-select each other into groups based on which political candidate we support, which sports team we root for, and which arbitrary country we're born in. People also spend hours on the internet arguing over "tiers", or groupings, of their favorite athletes and sports teams. For example, which NBA players are "elite" vs. "great" vs. just "good"? Did Carmelo Anthony belong on the Banana Boat? When engaging in these arguments, we typically use statistics like points or rebounds per game to back up our points, but at the end of the day, the groups are more or less arbitrary. But what if there was a way to algorithmically sort observations into groups based on shared characteristics using machine learning methods? Enter clustering, which is the methodology of grouping similar observations into groups, or "clusters", using a mathematical distance metric derived from a set

The Minimum Wage, the Living Wage, and the Wardrobe

The Senate is currently in intense debate regarding raising the federal minimum wage. Several potential wages have been proposed, including a $10/hour plan from Senators Romney and Cotton and a more generous $15/hour plan from the progressive Democrats. The current federal minimum wage stands at $7.25 per hour, which 21 states (including my notably Blue home state of Virginia) adhere to. While the debate rages on, I wanted to take a closer look at the history of the minimum wage, the concept of a "living wage", and how these two terms invariably tie together across the United States. More importantly, at some point, there are diminishing returns and increasing costs to increasing the minimum wage. So where should we settle? The History of the Minimum Wage This isn't a history blog, so I'll be brief. The minimum wage was established under the Fair Labor Standards Act in 1938 and set at $0.25/hour, which is worth around $4.60/hour today. Since then, it has

Why isn't Robinhood letting me trade? (hint: there's probably not a conspiracy against you)

Today's been a big day in the stock market. Lots of people have lost a lot of money, and a lot of people are understandably really upset. Here's a quick breakdown of what's happened so far: A subreddit called /r/wallstreetbets (visit at your own peril), which has exploded in popularity recently and has over 5 million subscribers (and counting), got really excited about three stocks: GME (Gamestop), AMC (the movie theater place), and BB (Blackberry). Gamestop was the main stock. Yes, I know all three companies are doing terribly in the real world. I won't go into why they got excited about the stocks here. They convinced a lot of other people to buy the stocks and they did well. Really well. Take a look at their Yahoo Finance pages and look at their 1 month price charts (then ignore the past two days). GME, BB, AMC. Everyone got in on it, and I mean it. When a lot of people buy a single stock, the price rises. It turns out, this was hurting a lot of hedge funds and I

On Post Frequency

As you might have noticed, I post rather erratically and infrequently. There are a lot of reasons why, but the main one is that writing is hard. Behind the scenes, I often spend 40+ hours on some of my blogposts between idea generation, data cleaning, learning about the topic, analysis, and writing. Even worse, I frequently spend a lot of time on an idea and find that my results aren't worth writing about (uninteresting results, messy data, etc.). I've currently got a "Blog" folder on my computer with dozens of folders with ideas with input data and code analyzing said data, but only around 10 published blogposts to show for it. This is a hobby that I do for fun, but I would still like to see the blog grow (and it kind of has), and I recognize that's probably not going to happen without some consistency in posts. Some of the more successful bloggers I know blog far more frequently (I'm talking weekly), but that isn't exactly a pace I can sustain

Analyzing Hip Hop - Who's Most Lyrical, What Determines Popularity, and More

Have you ever thought about bringing cold, hard statistics to one of life's greatest artistic joys? Well, fear not, because in our increasingly data-driven world, our analyst friends are hard at work attempting to statistisize (numerize?) everything you can think of, so we can analyze and therefore optimize it. One of the art realms that is increasingly falling under the purview of data science is music. We all benefit from it in the form of curated daily Spotify playlists and Pandora stations that allow us to find new artists and songs. I was recently able to get my hands on a Spotify dataset that contains data on over 160k tracks dating from 1921 through December 2020. Aside from containing some basic features like track name, duration, and release date, it also contains some advanced metrics as calculated by Spotify like "track positivity" (is it a sad, depressed song, or a happy, positive song?), "danceability", "energy", "speechiness" (