Maya's Project Blog: Talk Data to Me

Building a US College Application Tool

Project Design

Goals

Having just graduated from college and with grad school somewhere on the horizon, I've been thinking a lot about how students apply to schools. When I was planning this project, I knew I wanted to do something related to education. Millions of students attend college, and that number is generally trending upwards. Yet the names of schools that come up in the media, from teachers, and from parents haven't changed. I personally think the quality of an education depends on both the student and the school, and as such there are many underrated schools out there.

Plan

Given that a student considering colleges probably knows at least one school they want to attend and their chances at that school, I wanted to suggest other schools that were similar but that they may not have considered. I planned to do this without including rankings, since rankings tend to reinforce pre-existing assumptions. I was also curious to see which schools clustered together, and whether that pointed to a better classification system.

Process

Data

Gathering

I gathered the data from four sources.

  1. College Scorecard: This was my main dataset. It's released every year by the US Department of Education and is available via CSV downloads and an API. I took the CSV download for the 2015-2016 school year, the most recent year available in that format.
    • This dataset had 1500+ columns about 7000+ schools across America.
  2. US News: Based on my personal experiences, US News rankings are the most commonly used by applicants. This is where I scraped liberal arts college rankings for my initial visualizations, along with tuition and fees, enrollment, location, and whether the school was public or private.
    • I used some of this in my cleaning
    • There were ~200 ranked schools
    • Rankings change only marginally from year to year (see my earlier point about rankings reinforcing assumptions), so matching the data across years wasn't a major problem.
  3. Wikipedia: I scraped NCAA divisions from their respective Wikipedia pages. I also collected some more location and school type (public/private) data to use while cleaning.
    • ~1000 Schools
  4. Data.world: There was a pre-made CSV here of US News University Rankings, including the rank and other info that was again useful for cleaning.

Cleaning

There was a significant amount of cleaning involved. First, I had to get the scraped datasets into a clean format, and then I had to impute missing data. After cleaning the individual datasets, I joined the tables in SQL. Because of the sheer number of nulls and columns in the data, I only imputed values for columns that students would actually be interested in, such as the percentage of white students (diversity), gender, and SAT scores. I filled nulls with the mode or mean of a column conditioned on related columns, and pulled values directly from other columns where applicable. I also dropped schools that don't offer accredited four-year bachelor's degrees, since those weren't relevant to my target audience.
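As a rough sketch of the kind of imputation described above (the column names here are hypothetical, not the actual College Scorecard field names):

```python
import pandas as pd

# Hypothetical file and column names for illustration only.
df = pd.read_csv("scorecard_2015_2016.csv")

# Fill SAT averages with the mean within each state, falling back to the overall mean.
df["sat_avg"] = (
    df.groupby("state")["sat_avg"]
      .transform(lambda s: s.fillna(s.mean()))
      .fillna(df["sat_avg"].mean())
)

# Fill the public/private flag from another source's column where available,
# then fall back to the mode.
df["control"] = df["control"].fillna(df["control_from_usnews"])
df["control"] = df["control"].fillna(df["control"].mode()[0])

# Keep only schools awarding four-year bachelor's degrees (threshold is illustrative).
df = df[df["highest_degree_awarded"] >= 3]
```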

EDA

Enough columns were highly correlated with one another that I felt comfortable being liberal about dropping features. Here is a pairplot in all its glory for your viewing pleasure.
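As a rough illustration of the kind of pruning I mean, here's a minimal sketch that drops one feature from every highly correlated pair; the 0.9 threshold and the `df` DataFrame are assumptions, not values from the project.

```python
import numpy as np

# Assumes `df` is the cleaned, numeric feature table from the previous step.
corr = df.select_dtypes("number").corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair with |correlation| above the (assumed) threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
```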

Algorithms

Content Recommender

I built a cosine-similarity-based content recommender, but it gave me results that didn't match what I knew about the colleges I tested it on.
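For reference, a minimal sketch of this kind of content recommender, assuming a numeric feature table `df` indexed by school name (the scaling step and the function name are my own illustration):

```python
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Assumes `df` is indexed by school name and contains only numeric feature columns.
X = StandardScaler().fit_transform(df)
sim = cosine_similarity(X)

def recommend(school, n=5):
    """Return the n schools most similar to `school` by cosine similarity."""
    idx = df.index.get_loc(school)
    order = sim[idx].argsort()[::-1]
    return [df.index[i] for i in order if i != idx][:n]
```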

Clustering I

This was the major part of my project. I built a class that acted as a pipeline, running multiple clustering algorithms and producing t-SNE plots, and it also let me test different numbers of PCA components. Before clustering, I first converted all of my data to int/float format. The remainder of the process was as follows (a condensed sketch of the pipeline follows the list):

  1. Dimensionality Reduction: Tested different numbers of components in PCA.
  2. Perplexity: Tested different perplexities within t-SNE for visualization.
  3. Clustering: Tried the following, tuning parameters on each to identify the best result (except spectral clustering, due to time constraints): 1. KMeans, 2. DBSCAN, 3. Mean Shift, 4. Spectral Clustering
  4. Choosing and Analyzing Clusters: Chose the best clustering based on the t-SNE plots, then looked at the closest and furthest points within each cluster to discern meaning.
  5. Making Recommendations: Created a cosine-similarity-based function to pull the 3 closest schools to an inputted school within each cluster.
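Here's what that pipeline boils down to in a condensed sketch; the specific parameter values are placeholders rather than the ones from the actual class:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN, MeanShift

X = StandardScaler().fit_transform(df)  # numeric features only

# 1. Dimensionality reduction: try different numbers of components.
X_pca = PCA(n_components=100).fit_transform(X)

# 2. t-SNE for visualization, sweeping perplexity.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X_pca)

# 3. Clustering with different algorithms/parameters (placeholder values).
models = {
    "kmeans": KMeans(n_clusters=10, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=3.0, min_samples=5),
    "mean_shift": MeanShift(),
}
labels = {name: m.fit_predict(X_pca) for name, m in models.items()}
```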

TSNE on Reduced Data

This was a simple visualization of the data with a significantly reduced set of features for each school, for the purposes of future work.

Results

I went with KMeans with 10 clusters and 100 components in my PCA (explaining 80% of the variance), and used a perplexity of 30 for my t-SNE.

Flask App

The last thing I did was build a Flask app modelled on the typical application format, wherein students choose 4 schools across the following three categories: reach, target, safety.
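A minimal sketch of what such a Flask endpoint could look like; the form field names, templates, and the `recommend` helper are assumptions for illustration, not the app's actual code:

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        # Hypothetical form fields: one school per category.
        choices = {c: request.form.get(c) for c in ("reach", "target", "safety")}
        # `recommend` is the cosine-similarity helper sketched earlier.
        suggestions = {c: recommend(s) for c, s in choices.items() if s}
        return render_template("results.html", suggestions=suggestions)
    return render_template("index.html")
```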

Going Forward

Looking Back

I was very pleased with how this project turned out, although I was concerned about the amount of time I spent cleaning data. Had I not run into that data roadblock, I would have gone ahead and done what I'd outlined. Next time I would prepare for this ahead of time by spending more time thinking about where to get the data before starting.

More To Do

As stated in my presentation, I'd like to simplify my model to accommodate students who have no idea which schools they're interested in, i.e. those who are just starting the application process (high school juniors). I'd also want to drop acceptance rate, which I'd kept to retain the "reach, target, safety" schema but which can be a proxy for rank.

Data: 4 Sources

Tools: SQL, Pandas, Flask, Tableau

Algorithms: Clustering, Recommender Systems

Detecting Categories of Feedback in Foundations

Table of Contents

  1. Scrape: scraping_sephora.ipynb
  2. Topic Modeling Overall: topic_modeling.ipynb
  3. Pipeline Class for Models: clustering_algorithms.py
  4. TSNE Dimensions and N Clusters: checking_dimensions.ipynb
  5. Running Models: f’{model_name}.ipynb’
  6. For Negative Reviews: Same names as for overall + '_negative'
  7. Slides: fletcher_slides.pdf

Overview

Throughout this project, my goal was to produce something that would actually be useful to Fenty Beauty or Sephora as a business. The brunt of what I did therefore did not make it into my final presentation, because it wasn't successful in that respect.

The general process I had was more or less:

  1. Scraping
  2. Overall Reviews:
    • Topic Model
    • Classify
    • Sentiment Analysis
  3. Negative Reviews:
    • Topic Model
    • Classify

Scraping

I did an involved scrape of Sephora, a company which for some reason didn't build a site that would be friendly to data mongers. A primarily brick-and-mortar beauty company doesn't care about techies trying to creep on them? Who would've thought.

The site was rendered with JavaScript, and you could only click through 6 reviews at a time, all of which accumulated on the page as you loaded more. Eventually the page would time out because it couldn't handle all the reviews on the screen. Most of the class names were obfuscated CSS hashes, something like css-123asjkdf, because apparently they couldn't just call a class 'review'.

The reviews were auto-sorted by "most helpful", which is a poor idea because everyone ends up seeing the same reviews (hence my helpfulness predictor), although there is an option to sort the page by most recent. Weirdly, even though the page claimed to sort by most helpful, the reviews weren't actually in that order.

I tried a 9-hour, a 6-hour, and a full-day scrape, reminding myself throughout that this is not primarily a tech company. The most I got was 2,657 reviews. Some people had posted the same review multiple times, and some hadn't written anything, so I dropped those. Thankfully there were only a few.
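For illustration, here's a hedged sketch of the scraping loop with Selenium; the URL, button text, and selectors are placeholders, since the real class names were obfuscated:

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.sephora.com/product/...")  # placeholder URL

# Keep clicking the "show more reviews" button until it disappears or the page times out.
while True:
    try:
        # Placeholder locator; the real classes looked like css-123asjkdf.
        driver.find_element(By.XPATH, "//button[contains(., 'Show More Reviews')]").click()
        time.sleep(3)  # give the JavaScript time to load the next batch of 6 reviews
    except Exception:
        break

# Collect the review text; the selector here is a placeholder.
reviews = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "[data-comp='ReviewText']")]
driver.quit()
```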

Topic Modeling on Overall and Negative Reviews

I ended up using CountVectorizer for my topic modeling.

I tried all three topic modeling methods we talked about in class: LSA, LDA, and NMF.

This was iterative, so I went back and played with parameters as needed. For example, I removed stop words as I saw them, changed the max_df a few times, etc…

In the end, I got similar results for both the overall and negative reviews, finding the most coherent topics with LSA (via TruncatedSVD) and a max_df of 0.6. This is what I used, as needed, for the remainder of the project.
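A minimal sketch of the comparison, assuming a `reviews` list of texts; the number of topics is a guess rather than the value I settled on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation, NMF

# In practice, extra stop words were added iteratively as they showed up in the topics.
vectorizer = CountVectorizer(stop_words="english", max_df=0.6)
doc_term = vectorizer.fit_transform(reviews)
terms = vectorizer.get_feature_names_out()

models = {
    "lsa": TruncatedSVD(n_components=8),                             # n_components is a guess
    "lda": LatentDirichletAllocation(n_components=8, random_state=0),
    "nmf": NMF(n_components=8, init="nndsvd", random_state=0),
}

# Print the top words per topic for each model, to eyeball coherence.
for name, model in models.items():
    model.fit(doc_term)
    print(f"--- {name} ---")
    for i, weights in enumerate(model.components_):
        top = [terms[j] for j in weights.argsort()[::-1][:10]]
        print(f"topic {i}: {', '.join(top)}")
```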

Classification on overall reviews

I built a pseudo-package for this, essentially a pipeline for trying different clustering methods and easily fiddling with their parameters.

Within the pipeline there was a class with functions for each of four models: KMeans, DBSCAN, Spectral Clustering, and Mean Shift. There was also a function to print out a t-SNE plot and the number of items in a given cluster.

Outside of the class, I had two functions to print out the 4 reviews closest to the centroids for Mean Shift and KMeans, using cosine similarity as the metric.
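A hedged sketch of that closest-reviews helper, assuming `doc_topic` is the topic-space matrix, `labels` the cluster assignments, and `centers` the cluster centroids (e.g. `KMeans.cluster_centers_`):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def closest_reviews(doc_topic, labels, centers, reviews, n=4):
    """Print the n reviews closest (by cosine similarity) to each cluster centroid."""
    for k, center in enumerate(centers):
        members = np.where(labels == k)[0]
        sims = cosine_similarity(doc_topic[members], center.reshape(1, -1)).ravel()
        for idx in members[sims.argsort()[::-1][:n]]:
            print(f"cluster {k}: {reviews[idx][:120]}")
```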

I used the pipeline-package to decide on the “best” clustering technique.

First, I checked different numbers of dimensions by running t-SNE plots; for the overall reviews, the best was 50.

Then I checked the SSE/Silhouette score plots to find the best number of clusters and printed the results.
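A minimal sketch of that sweep over the number of clusters (the range of k values is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sweep k, tracking inertia (SSE) and silhouette score on the topic space.
sse, sil = {}, {}
for k in range(2, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_topic)
    sse[k] = km.inertia_
    sil[k] = silhouette_score(doc_topic, km.labels_)
```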

Classification on negative reviews

I used the clustering method that was found for the overall reviews, i.e. KMeans on the topic space in order to maintain consistency. At this point I still thought both of these would make it to my final presentation.

The rest was essentially a repeat of the classification on overall reviews; i.e. changing dimensions, checking the SSE/silhouette score for the number of clusters, printing out the silhouette score per cluster, trying a few different cluster numbers, printing the four closest reviews, and skimming them for some relationship.

Star Rating

This was a classification problem whose purpose was to check that reviewers were leaving star ratings consistent with the text of their reviews.

I vectorized the data, split it into train and test sets, oversampled, then ran 6 models and compared their accuracy scores. I also checked the confusion matrices for each model. The best one by accuracy was Bernoulli Naive Bayes, with a score of ~0.6.
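A rough sketch of this setup; the vectorizer choice and the use of imbalanced-learn's RandomOverSampler are assumptions about the details, and `reviews`/`stars` stand in for the actual data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix
from imblearn.over_sampling import RandomOverSampler  # assumption: how oversampling was done

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

X_train, X_test, y_train, y_test = train_test_split(X, stars, test_size=0.25, random_state=0)
X_train, y_train = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

clf = BernoulliNB().fit(X_train, y_train)
preds = clf.predict(X_test)
print(accuracy_score(y_test, preds))       # ~0.6 in the project
print(confusion_matrix(y_test, preds))
```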

Helpfulness

The general format was the same as for star rating, but this was not a classification problem because the outcome was continuous (the number of people who found the review helpful), so the models I ran were different, and I didn't need to oversample. The comparison metric was MSE, and the best-performing model was Ridge, with LassoCV a close second (it just took an impractical amount of time to run, so it wasn't ideal for making predictions).

After picking the model, I made a prediction function that outputs how many people would find a given review helpful.
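A hedged sketch of the regression and the prediction helper, reusing the vectorized features `X` and `vectorizer` from the previous sketch; `helpful_counts` is a stand-in for the helpful-vote column:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# `helpful_counts` is the number of people who marked each review helpful.
X_train, X_test, y_train, y_test = train_test_split(
    X, helpful_counts, test_size=0.25, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print(mean_squared_error(y_test, ridge.predict(X_test)))

def predict_helpfulness(text, vectorizer=vectorizer, model=ridge):
    """Predict how many people would find a single review helpful."""
    return float(model.predict(vectorizer.transform([text]))[0])
```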

Sephora could use this model to promote reviews that were predicted as more helpful. This would both improve user experience and aid in marketing insights.

Final Thoughts

Looking back, there are a couple of things I would do differently. I'm definitely happy that I took on something I wasn't sure would have an answer (i.e. a good clustering), but I would have wanted to prepare for that possibility earlier on.

Stepping back a bit and thinking about the role of these projects as portfolios for businesses, I wish I’d used MongoDB.