I live in a suburban area in New York. I drive, of course, but most of the time I walk to my local pharmacy to pick up some essentials. During my afternoon cardio, I notice empty buildings around me with signs that say “Rent” and a phone number displayed underneath. Next to the pharmacy, I take a small peek through the window, just to get a good look inside. I can’t help but think, “You could definitely open an animation studio in there.”
As promised in our series, “A Data Science Story,” this blog post provides insights into data from neighboring villages in Nassau County as it relates to education and the digital divide. The data set was downloaded from the US Census for the following villages: Rockville Centre, Freeport Village, Garden City, The Village of Hempstead, and Lynbrook.
Step 1: In Data Science terms, we “wrangled” the data. That means removing blanks and organizing the data in a structure that we can use. We used Excel, along with a nice trick to “transpose” the rows and columns, then saved the result as a comma-separated values (CSV) file.
Step 2: Next we need to explore the data, so we will use Google Colab. It is an excellent tool for data exploration and analysis. In Data Science terms, this is EDA, or Exploratory Data Analysis.
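For readers who prefer code to spreadsheets, here is a minimal sketch of the same wrangling idea in Python: drop blank rows, then swap rows and columns. We did this step in Excel, so this is only an illustration. The function name `transpose_csv` and the sample values ("Village A", "Village B" and their numbers) are placeholders we made up for the example, not Census figures.

```python
import csv
import io


def transpose_csv(text: str) -> str:
    """Drop blank rows from CSV text, then swap its rows and columns."""
    rows = [
        row
        for row in csv.reader(io.StringIO(text))
        if any(cell.strip() for cell in row)  # skip rows with no content
    ]
    if not rows:
        return ""
    # Pad short rows so every row has the same number of cells.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    out = io.StringIO()
    # zip(*padded) turns each column into a row.
    csv.writer(out, lineterminator="\n").writerows(zip(*padded))
    return out.getvalue()


# Placeholder input: fields as rows, villages as columns, one blank line.
sample = (
    "Field,Village A,Village B\n"
    "\n"
    "Total Households,100,200\n"
    "Persons Per Household,2.5,3.0\n"
)
result = transpose_csv(sample)
print(result)
```

After the transpose, each village sits on its own row with one column per data field, which is the shape most analysis tools expect.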
During our EDA, we noticed that “computer and internet access” were reported in percentages. And while the percentages looked good all around, we wondered what the impact would look like not in percentages but in terms of individual persons.
Here is our analysis:
Step 1: Data Wrangling
Using the US Census data (estimates for 2019), we wrangled the data and created a utility matrix to use for the calculations. Since we are focused on the digital divide and how it affects education, we used the following data fields:
Total Households
Persons Per Household
Percent of Households with Computers
Percent of Households with Broadband Internet Access
Step 2: EDA & Hypothesis
Let’s look at the bar chart of the percentages.
As you can see, no dramatic differences are exposed in this visualization. Our hypothesis, or the question we asked ourselves, was: Would the impact look the same if we converted from percentages to actual numbers?
So we computed the number of households and the number of persons affected by the digital divide in each of these villages. We took the complement of the percentages for computers and internet access (that is, 1 minus the share of households that have them) and used it in our computation.
Total households without computers = Total Households x (1 – PCT with Computers)
Total households without internet = Total Households x (1 – PCT with Internet)
Total persons without computers = (Total households without computers) x (Persons per Household)
Total persons without internet = (Total households without internet) x (Persons per Household)
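The four formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook we ran in Colab. The class `VillageStats`, the function `digital_divide`, and the numbers for "Example Village" are hypothetical placeholders rather than Census values, and we assume the percentages are stored as fractions between 0 and 1 so that "1 – PCT" works as written.

```python
from dataclasses import dataclass


@dataclass
class VillageStats:
    name: str
    total_households: int
    persons_per_household: float
    pct_with_computers: float  # assumed to be a fraction, e.g. 0.90 for 90%
    pct_with_internet: float   # assumed to be a fraction, e.g. 0.80 for 80%


def digital_divide(v: VillageStats) -> dict:
    """Convert access percentages into counts of households and persons."""
    households_no_computer = v.total_households * (1 - v.pct_with_computers)
    households_no_internet = v.total_households * (1 - v.pct_with_internet)
    return {
        "households_without_computers": round(households_no_computer),
        "households_without_internet": round(households_no_internet),
        "persons_without_computers": round(
            households_no_computer * v.persons_per_household
        ),
        "persons_without_internet": round(
            households_no_internet * v.persons_per_household
        ),
    }


# Placeholder values for illustration only (not real Census figures).
example = VillageStats(
    name="Example Village",
    total_households=1000,
    persons_per_household=3.0,
    pct_with_computers=0.90,
    pct_with_internet=0.80,
)
impact = digital_divide(example)
for label, count in impact.items():
    print(f"{label}: {count}")
```

Rounding happens only at the end, so the person counts are computed from the unrounded household figures. If your data stores percentages as 0 to 100, divide by 100 before passing them in.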
The results are shown in the table below:
Step 3: Conclusion & Impact
Taking the percentages and converting them to numbers shows the real impact the “Digital Divide” has on communities in our area. Sometimes showing impact as a percentage does not bring to light the seriousness of the problem. In the case of Hempstead, we can see that the lack of access to computers and internet affects 5,010 + 10,891 persons. A total of over 15,000 persons are impacted by the digital divide.
Now let’s take a look at the visualization, not as percentages but as actual persons affected.
The results are obvious. The difference in digital divide impact between the villages in our selected data set is now clear. As aspiring data scientists, we are anomaly spotters, and we let the data speak for itself.
Data scientists are “big data” wranglers, gathering and analyzing large sets of structured and unstructured data. A data scientist’s role combines computer science, statistics, and mathematics. Data scientists are in high demand. They analyze, process, and model data then interpret the results to create actionable plans for companies and other organizations.
“Data scientists are anomaly spotters,” said Dr. Steven C. Lindo, Chairman & CEO of SpringBoard Incubators Inc., meaning that they follow a technique called Exploratory Data Analysis (or EDA). This method uses data visualization techniques to look for outliers in datasets.
At SpringBoard, our Data Science workshops use the Python programming language for data analysis. We use it natively or with platforms like Google Colab or Jupyter IPython.
Python is well suited to scientific computing. Here are the main components you will learn to use at SpringBoard: