Lightning

2023-04-27

Lightning is a mapping/data vis project for finding EV charging stations. It uses Martin to serve tiles generated from OpenStreetMap data to a MapLibre frontend. Additional layers are added on top via Deck.gl using data from EVChargerFinder made by my friend Kevin.

Cartman is public!

2022-10-20

Cartman is trained by combining Microsoft's DialoGPT-medium NLP model (GPT2 model trained on 147M samples of multi-turn dialogue from Reddit) with 17 seasons of South Park transcripts.

Requests are routed from Nginx through WireGuard to a Raspberry Pi 4B 8GB running FastAPI and the Cartman model on PyTorch. It has enough RAM for more, but the CPU is pretty much at its limit. Expect it to take a few seconds; I'm cheap. Sorry (kinda).

You can download a Docker image if you'd like to run it on your own hardware for either x86_64 or aarch64.

More info here, as well as example scripts to talk to the Docker container.

fps

2022-10-09

fps is a Godot/WebGL experiment from scratch with multiplayer using websockets and a master/slave architecture. Invite a friend or open multiple instances!

balls

2022-09-13

balls is another demo to test WebGL performance, this time using Godot Engine.

adam

2022-09-11

adam is a quick fps demo to test how well WebGL performs using Unity.

What goes into a successful Reddit post?

2022-06-16

In an attempt to find out what makes a Reddit post successful, I use some classification models to determine which features have the highest influence on making a correct prediction. In particular, I use Random Forest and KNeighbors classifiers. Then I score the results and see what the highest predictors are.

To find what goes into making a successful Reddit post we'll have to do a few things, first of which is collecting data:

Introducing Scrapey!

Scrapey is my scraper script. About every 12 minutes it takes a snapshot of Reddit's /r/all hot listing and saves the data, including a calculated age for each post, to a .csv file. Run time is about 2 minutes per iteration, and each run adds about 100 unique posts to the list while updating any post it has already seen.

I run this in the background in a terminal and it updates my data set every ~12 minutes. I have records of all posts within about 12 minutes of them disappearing from /r/all.
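Here's a minimal sketch of the update-or-append bookkeeping described above, assuming each post has a unique id and a creation timestamp. The field names (`id`, `created_utc`, `age_hours`, and so on) and the helpers `merge_snapshot` and `save_csv` are hypothetical, and the actual fetching from Reddit is left out.

```python
import csv
import time

# Hypothetical column layout for the .csv file.
FIELDS = ["id", "title", "subreddit", "num_comments", "created_utc", "age_hours"]

def merge_snapshot(seen, snapshot, now=None):
    """Update posts already seen and add new ones, recomputing each post's age."""
    now = time.time() if now is None else now
    for post in snapshot:
        row = dict(post)
        row["age_hours"] = round((now - float(row["created_utc"])) / 3600, 2)
        seen[row["id"]] = row  # overwrites the old record or adds a new one
    return seen

def save_csv(seen, path):
    """Write every known post to a .csv file, one row per post."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(seen.values())

# Two toy snapshots taken an hour apart (timestamps in seconds).
seen = {}
merge_snapshot(seen, [{"id": "a1", "title": "first", "subreddit": "memes",
                       "num_comments": 10, "created_utc": 0}], now=3600)
merge_snapshot(seen, [{"id": "a1", "title": "first", "subreddit": "memes",
                       "num_comments": 25, "created_utc": 0},
                      {"id": "b2", "title": "second", "subreddit": "aww",
                       "num_comments": 3, "created_utc": 3600}], now=7200)
```

Keying the records by post id is what lets a later snapshot refresh the comment count and age of a post that's already in the list instead of duplicating it.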

EDA

Next I take a quick look to see what looks useful, what doesn't, and check for outliers that will throw off the model. There were a few outliers to drop from the num_comments column.

Chosen Features:

Title
Subreddit
Over_18
Is_Original_Content
Is_Self
Spoiler
Locked
Stickied
Num_Comments (Target)

Then I split the data I'm going to use into two dataframes (numeric and non) to prepare for further processing.

Clean

Cleaning the data further consists of:

Scaling numeric features between 0-1
Converting '_' and '-' to whitespace
Removing any non a-z or A-Z or whitespace
Stripping any leftover whitespace
Deleting any titles that were reduced to empty strings
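
The cleaning steps above can be sketched with the standard library alone. The function names `clean_title` and `min_max_scale` and the sample titles are made up for illustration; the notebook may well do this differently.

```python
import re

def clean_title(title):
    """Apply the title-cleaning steps listed above, in order."""
    text = re.sub(r"[_-]", " ", title)        # '_' and '-' become whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep only letters and whitespace
    return text.strip()                       # strip leftover whitespace

def min_max_scale(values):
    """Scale numbers into the 0-1 range; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up sample titles.
titles = ["My_first-post!!", "  1234 ", "Good day, today"]
cleaned_all = [clean_title(raw) for raw in titles]
cleaned = [text for text in cleaned_all if text]  # drop titles reduced to ""

scaled = min_max_scale([0, 5, 10])
```

The second title contains no letters, so it ends up as an empty string and is dropped.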

Model

If a post's number of comments is greater than the median number of comments, it's assigned a 1, otherwise a 0. This is the target column. I then tried some lemmatizing, but it didn't seem to add much. After that I create and join some dummies, then split and feed the new dataframe into Random Forest and KNeighbors classifiers. Both actually scored the same with cross validation, so I mainly used the forest.
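
A rough sketch of that flow with pandas and scikit-learn. The data here is synthetic and random, so the scores mean nothing; the column names, hyperparameters, and the use of `feature_importances_` to rank predictors are my assumptions for illustration, not a copy of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real dataset comes from the scraper.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "subreddit": rng.choice(["memes", "news", "aww"], size=n),
    "is_self": rng.integers(0, 2, size=n),
    "over_18": rng.integers(0, 2, size=n),
    "num_comments": rng.integers(0, 500, size=n),
})

# Target: 1 if a post has more comments than the median, otherwise 0.
y = (df["num_comments"] > df["num_comments"].median()).astype(int)

# One-hot encode the categorical column and join it to the other features.
dummies = pd.get_dummies(df["subreddit"], prefix="subreddit")
X = df[["is_self", "over_18"]].join(dummies).astype(float)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "k_neighbors": KNeighborsClassifier(n_neighbors=5),
}
# Mean cross-validated accuracy for each classifier.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

# One way to rank predictors: the forest's feature importances.
forest = models["random_forest"].fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```

The dummy columns come out named like `subreddit_memes`, which is the same shape as the predictor names in the conclusion below.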

Notebook Here

Conclusion

Some Predictors from Top 25:

Is_Self
Subreddit_Memes
OC
Over_18
Subreddit_Shitposting
Is_Original_Content
Subreddit_Superstonk

Popular words: 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', 'dont', and 'love'.

People on Reddit (at least in the past few days) like their memes, porn, and talking about their day. And it's preferred if the content is original and self posted. So yes, post your memes to memes and shitposting, tag them NSFW, use some words from the list, and rake in all that sweet karma!

But it's not that simple. This is a fairly simple model with simple data. To go beyond this, I think the comments would have to be analyzed. I thought lemmatisation would be the most influential piece, and I still think that reasoning is sound. But in this case it doesn't apply because there is no real meaning to be had from Reddit post titles, at least to a computer (or I did something wrong).

There's a lot more seen by a human than just the text in the title: there's often an image attached, most posts reference a recent or current event, and some are an inside joke of sorts. Some posts have emojis in the title, and depending on their combination they can take on a meaning completely different from their individual meanings. I believe the next step is to analyze the comments section of these posts, because right now I think that's the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only able to get 10% above baseline, and I think that's all there is to be had. We can probably tweak for a few percent, but I don't think there's much left on the table.

Predicting Housing Prices

2022-05-29

A recent project I had for class was to use scikit-learn to create a regression model that will predict the price of a house based on some features of that house.

How?

1. Pick out and analyze certain features from the dataset. Used here is the Ames, Iowa Housing Data set.
2. Do some signal processing to provide a clearer input down the line, improving accuracy.
3. Make predictions on sale price.
4. Compare the predicted prices to recorded actual sale prices and score the results.

What's important?

Well, I don't know much about appraising houses. But I have heard the term "price per square foot" so we'll start with that:

There is a feature for 'Above Grade Living Area', meaning floor area that's not basement. It looks linear. There were a couple of outliers to take care of, but this should be a good signal.

Next I calculated the age of every house at time of sale and plotted it:

Exactly what I'd expect to see. Price drops as age goes up, a few outliers. We'll include that in the model.
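
The age calculation can be sketched in a couple of lines with pandas. The column names `Year Built` and `Yr Sold` are assumptions about how the dataset is labeled, and the rows are toy values for illustration only.

```python
import pandas as pd

# Toy rows for illustration; the column names are assumed, not confirmed.
houses = pd.DataFrame({
    "Year Built": [1950, 2005, 2009],
    "Yr Sold": [2008, 2006, 2009],
})

# Age at time of sale, in years.
houses["Age"] = houses["Yr Sold"] - houses["Year Built"]
```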

Next I chose the area of the lot:

Lot area positively affects sale price because land has value. Most of the houses here have similarly sized lots.

Pre-Processing

Here is an example where using StandardScaler() just doesn't cut it. The values are all scaled in a way where they can be compared to one another, but outliers have a huge effect on the clarity of the signal as a whole.

You should clearly see in the second figure that an old shed represented in the top left corner will sell for far less than a brand new mansion represented in the bottom right corner. This is the result of using the QuantileTransformer() for scaling.
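
To make the difference concrete, here's a small sketch on synthetic lot areas (99 similar values plus one extreme outlier). The numbers are made up, and I'm assuming the default uniform output for QuantileTransformer(); the original plots may have used different settings.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

# Synthetic lot areas: mostly similar sizes plus one extreme outlier (last row).
rng = np.random.default_rng(0)
lot_area = np.append(rng.normal(10_000, 1_500, size=99), 200_000).reshape(-1, 1)

standard = StandardScaler().fit_transform(lot_area)
quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform"
).fit_transform(lot_area)

# How much of each scale the 99 "normal" lots get to use.
bulk_spread_standard = standard[:-1].max() - standard[:-1].min()
bulk_spread_quantile = quantile[:-1].max() - quantile[:-1].min()

print("StandardScaler outlier value:", standard[-1, 0])
print("StandardScaler spread of the other 99:", bulk_spread_standard)
print("QuantileTransformer spread of the other 99:", bulk_spread_quantile)
```

With StandardScaler() the single outlier lands many standard deviations out and squeezes everything else into a narrow band. QuantileTransformer() works on ranks, so the outlier is simply the top of the 0-1 range and the rest of the lots stay spread out.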

The Model

A simple LinearRegression() should do just fine, with QuantileTransformer() scaling of course.
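
A minimal sketch of what that looks like as a scikit-learn pipeline. The features and prices are synthetic (generated from a made-up linear rule plus noise), and scoring with mean absolute error and R^2 is my assumption about how "within about X on average" could be measured.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-ins for living area, age at sale, and lot area.
rng = np.random.default_rng(0)
n = 300
living_area = rng.uniform(600, 3000, n)
age = rng.uniform(0, 100, n)
lot_area = rng.uniform(4000, 15000, n)
X = np.column_stack([living_area, age, lot_area])

# Made-up pricing rule with noise, only so there is something to fit.
price = 100 * living_area - 500 * age + 2 * lot_area + rng.normal(0, 10_000, n)

X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

# Scaling lives inside the pipeline so it is fit on the training split only.
model = make_pipeline(QuantileTransformer(n_quantiles=100), LinearRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)  # average miss, in dollars
r2 = model.score(X_test, y_test)                # 1.0 would be a perfect fit
```

Keeping the transformer and the regression in one pipeline means the same scaling is applied automatically at prediction time.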

Predictions were within about $35k-$40k on average.

It's a little fuzzy in the higher end of prices, I believe due to the small sample size. There are a few outliers that can probably be reduced with some deeper cleaning; however, I was worried about going too far and creating a different story. An "ideal" model in this case would look like a straight line.

Conclusion

This model was designed with a focus on quality and consistency. With some refinement, the margin of error could be reduced to a reasonable number, and then reliable, accurate predictions could be made for any application where there is a need to assess the value of a property.

I think a large limiting factor here is the size of the dataset compared to the quality of the features provided. There are more features from this dataset that can be included but I think the largest gains will be had from simply feeding in more data. As you stray from the "low hanging fruit" features, the quality of your model overall starts to go down.

Here's an interesting case, Overall Condition of Property:

You would expect sale price to increase with quality, no? Yet it goes down. Why?

I believe it's because a lot of sellers want to say that their house is of the highest quality, no matter the condition. It seems that most normal people (who aren't liars) don't care to rate their property and just say it's average. Both of these combined actually create a negative trend for quality, which definitely won't help predictions!

I would like to expand this in the future, maybe scraping websites like Zillow to gather more data.

We'll see.

snek

2022-05-20

snek is a simple snake game made with JS/Canvas.