These are my notes for the 8th course (of 23) in the ‘Machine Learning Scientist with Python’ skill track.
You can find the original course HERE.
Course Description
This course covers the basics of how and when to perform data preprocessing. This essential step in any machine learning project is when you get your data ready for modeling. Between importing and cleaning your data and fitting your machine learning model is when preprocessing comes into play. You’ll learn how to standardize your data so that it’s in the right form for your model, create new features to best leverage the information in your dataset, and select the best features to improve your model fit. Finally, you’ll have some practice preprocessing by getting a dataset on UFO sightings ready for modeling.
1.1 What is data preprocessing?
1.1.1 Missing data – columns
We have a dataset comprised of volunteer information from New York City. The dataset has a number of features, but we want to get rid of features that have at least 3 missing values.
How many features are in the original dataset, and how many features are in the set after columns with at least 3 missing values are removed?
```python
volunteer.shape

# Note: thresh=3 keeps only columns that have at least 3 non-missing values
volunteer.dropna(axis=1, thresh=3).shape
```
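These snippets run inside the DataCamp exercise environment, where the DataFrames and the common libraries are already loaded. If you want to follow along locally, a rough setup sketch looks like the one below; the file names are placeholders of my own, and the Naive Bayes import is my guess at the estimator behind the preloaded `nb` object used in later chapters.

```python
# Assumed local setup -- not part of the course code.
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA

# The course preloads these DataFrames; the file names below are placeholders.
volunteer = pd.read_csv('volunteer_opportunities.csv')
wine = pd.read_csv('wine_types.csv')
hiking = pd.read_csv('hiking.csv')
ufo = pd.read_csv('ufo_sightings.csv')
```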
1.1.2 Missing data – rows
Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We'll do this with boolean indexing: check which values are null, then filter the dataset so that only rows with a non-null `category_desc` remain.
```python
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the dataset to keep only rows where category_desc is not null
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)
```
1.2 Working with data types
1.2.1 Exploring data types
Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we’ll be working with as we start to do more preprocessing.
Which data types are present in the `volunteer` dataset?
```python
set(volunteer.dtypes.values)
# {dtype('int64'), dtype('float64'), dtype('O')}
```
1.2.2 Converting a column type
If you take a look at the `volunteer` dataset types, you'll see that the `hits` column is type `object`. But if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type `int`.
```python
volunteer["hits"].dtype

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype('int')

volunteer["hits"].dtype
```
1.3 Class distribution
1.3.1 Class imbalance
In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label. Which descriptions occur fewer than 50 times in the `volunteer` dataset?
```python
volunteer.category_desc.value_counts()

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
```
1.3.2 Stratified sampling
We know that the distribution of values in the `category_desc` column in the `volunteer` dataset is uneven. If we wanted to train a model to try to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.
```python
# Create a DataFrame with all columns except category_desc
volunteer_X = volunteer.drop('category_desc', axis=1)

# Create a DataFrame with just the category_desc labels
volunteer_y = volunteer[['category_desc']]

# Use stratified sampling to split the data according to the class distribution
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts in the training labels
print(y_train['category_desc'].value_counts())
```
```
Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
```
2.1 Standardizing Data
2.1.1 When to standardize
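(My one-line summary of the lesson, rather than a transcript: standardization is useful when a model operates in a linear space, such as k-nearest neighbors or linear regression, when features are on very different scales, or when a column's variance is high enough to dominate the others.)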
2.1.2 Modeling without normalizing
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first. Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.
```python
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))
# 0.5333333333333333
```
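The exercise preloads `X`, `y`, and `knn`. As a sketch of what that setup might look like (the `Type` column is the wine label used elsewhere in these notes; the classifier parameters are my assumption):

```python
# Assumed setup for the wine modeling exercises -- not the course's own code.
from sklearn.neighbors import KNeighborsClassifier

X = wine.drop('Type', axis=1)   # feature columns
y = wine['Type']                # class labels

knn = KNeighborsClassifier()    # defaults to n_neighbors=5
```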
2.2 Log normalization
2.2.1 Checking the variance
Check the variance of the columns in the `wine` dataset. Which column is a candidate for normalization?
```python
wine.var()

Type                                 0.600679
Alcohol                              0.659062
Malic acid                           1.248015
Ash                                  0.075265
Alcalinity of ash                   11.152686
Magnesium                          203.989335
Total phenols                        0.391690
Flavanoids                           0.997719
Nonflavanoid phenols                 0.015489
Proanthocyanins                      0.327595
Color intensity                      5.374449
Hue                                  0.052245
OD280/OD315 of diluted wines         0.504086
Proline                          99166.717355
dtype: float64
```
The `Proline` column, with a variance of about 99,167, is the candidate for normalization.
2.2.2 Log normalization in Python
Now that we know that the `Proline` column in our wine dataset has a large amount of variance, let's log normalize it.
```python
# Print out the variance of the Proline column
print(wine.Proline.var())
# 99166.71735542436

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine.Proline)

# Check the variance of the log-normalized column
print(wine.Proline_log.var())
# 0.17231366191842012
```
2.3 Scaling data for feature comparison
2.3.1 Scaling data – investigating columns
We want to use the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model.
```python
wine[['Ash', 'Alcalinity of ash', 'Magnesium']].describe()

              Ash  Alcalinity of ash   Magnesium
count  178.000000         178.000000  178.000000
mean     2.366517          19.494944   99.741573
std      0.274344           3.339564   14.282484
min      1.360000          10.600000   70.000000
25%      2.210000          17.200000   88.000000
50%      2.360000          19.500000   98.000000
75%      2.557500          21.500000  107.000000
max      3.230000          30.000000  162.000000
```
2.3.2 Scaling data – standardizing columns
Since we know that the `Ash`, `Alcalinity of ash`, and `Magnesium` columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.
```python
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

# Create the scaler
ss = StandardScaler()

# Take a subset of the DataFrame you want to scale
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to the DataFrame subset
wine_subset_scaled = ss.fit_transform(wine_subset)
```
```python
wine_subset_scaled[:5]

array([[ 0.23205254, -1.16959318,  1.91390522],
       [-0.82799632, -2.49084714,  0.01814502],
       [ 1.10933436, -0.2687382 ,  0.08835836],
       [ 0.4879264 , -0.80925118,  0.93091845],
       [ 1.84040254,  0.45194578,  1.28198515]])
```
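Note that `fit_transform` returns a plain NumPy array, so the column labels are lost. If you prefer to keep working with a labeled DataFrame, one option (my own addition, not part of the exercise) is to wrap the result back up:

```python
# Rebuild a DataFrame so the scaled values keep their column names and index
wine_subset_scaled = pd.DataFrame(wine_subset_scaled,
                                  columns=wine_subset.columns,
                                  index=wine_subset.index)
```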
2.4 Standardized data and modeling
2.4.1 KNN on non-scaled data
Let's first take a look at the accuracy of a k-nearest neighbors model on the `wine` dataset without standardizing the data.
```python
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))
# 0.6444444444444445
```
2.4.2 KNN on scaled data
The accuracy score on the unscaled `wine` dataset was decent, but we can likely do better if we scale the dataset. The process is mostly the same as the previous exercise, with the added step of scaling the data.
```python
# Create the scaling method
ss = StandardScaler()

# Apply the scaling method to the dataset used for modeling
X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))
# 0.9555555555555556
```
3.1 Feature engineering
3.1.1 Examples for creating new features
Timestamps can be broken into days or months, and headlines can be used for natural language processing.
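As a tiny illustration of both ideas (the DataFrame `df` and its `created_date` and `headline` columns here are hypothetical):

```python
# Break a timestamp down into a coarser date part
df['created_month'] = pd.to_datetime(df['created_date']).dt.month

# Turn free text into numeric features with a tf/idf vectorizer
headline_vectors = TfidfVectorizer().fit_transform(df['headline'])
```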
3.1.2 Identifying areas for feature engineering
```python
volunteer[['title', 'created_date', 'category_desc']].head(1)

                                               title     created_date              category_desc
0  Volunteers Needed For Rise Up & Stay Put! Home...  January 13 2011  Strengthening Communities
```
All of these columns will require some feature engineering before modeling.
3.2 Encoding categorical variables
3.2.1 Encoding categorical variables – binary
Take a look at the `hiking` dataset. There are several columns here that need encoding before they can be modeled, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, `Y` or `N`, so it needs to be encoded into 1s and 0s. Use scikit-learn's `LabelEncoder` to do that transformation.
```python
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking.Accessible)

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())
```
```
  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0
```
3.2.2 Encoding categorical variables – one-hot
One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically.
```python
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
print(category_enc.head())
```
```
   Education  Emergency Preparedness  ...  Helping Neighbors in Need  Strengthening Communities
0          0                       0  ...                          0                          0
1          0                       0  ...                          0                          1
2          0                       0  ...                          0                          1
3          0                       0  ...                          0                          1
4          0                       0  ...                          0                          0

[5 rows x 6 columns]
```
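To actually use these one-hot columns in a model, you would typically join them back onto the feature set, the same way the UFO chapter does later with `pd.concat` (a usage note of my own, not part of this exercise):

```python
# Concatenate the encoded columns back onto the volunteer DataFrame
volunteer = pd.concat([volunteer, category_enc], axis=1)
```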
3.3 Engineering numerical features
3.3.1 Engineering numerical features – taking an average
A good use case for taking an aggregate statistic to create a new feature is taking the mean of columns. Here, you have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their 5 run times.
```python
# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']

# Use apply to create a mean column (axis=1 applies the function row-wise)
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results
print(running_times_5k)
```
```
      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44
```
3.3.2 Engineering numerical features – datetime
There are several columns in the `volunteer` dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.
```python
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

# Take a look at the converted and new month columns
print(volunteer[['start_date_converted', 'start_date_month']].head())
```
```
  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2
```
3.4 Text classification
3.4.1 Engineering features from strings – extraction
The `Length` column in the `hiking` dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in pandas to apply the extraction to the DataFrame.
```python
# Write a pattern to extract numbers and decimals
def return_mileage(length):
    pattern = re.compile(r"\d+\.\d+")

    # Search the text for matches
    mile = re.match(pattern, length)

    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking['Length'].apply(lambda row: return_mileage(row))
print(hiking[["Length", "Length_num"]].head())
```
```
       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
```
3.4.2 Engineering features from strings – tf/idf
Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next exercise.
```python
# Take the title text
title_text = volunteer['title']

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
```
```python
text_tfidf
<665x1136 sparse matrix of type '<class 'numpy.float64'>'
	with 3397 stored elements in Compressed Sparse Row format>
```
3.4.3 Text classification using tf/idf vectors
Now that we've encoded the `volunteer` dataset's `title` column into tf/idf vectors, let's use those vectors to try to predict the `category_desc` column.
```python
text_tfidf.toarray()
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

text_tfidf.toarray().shape
# (617, 1089)

volunteer["category_desc"].head()
1    Strengthening Communities
2    Strengthening Communities
3    Strengthening Communities
4                  Environment
5                  Environment
Name: category_desc, dtype: object
```
```python
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))
# 0.567741935483871
```
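(The `nb` object is preloaded by the exercise. Given that the tf/idf matrix is densified with `.toarray()` before fitting, it is presumably something like `GaussianNB()` from `sklearn.naive_bayes`, though that is an assumption on my part.)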
Notice that the model doesn’t score very well. We’ll work on selecting the best features for modeling in the next chapter.
4.1 Feature selection
4.1.1 Identifying areas for feature selection
Take an exploratory look at the post-feature-engineering `hiking` dataset. Which of the following columns is a good candidate for feature selection?
```python
hiking.columns

Index(['Accessible', 'Difficulty', 'Length', 'Limited_Access', 'Location',
       'Name', 'Other_Details', 'Park_Name', 'Prop_ID', 'lat', 'lon',
       'Length_extract', 'accessible_enc', '', 'Easy', 'Easy ',
       'Easy/Moderate', 'Moderate', 'Moderate/Difficult', 'Various'],
      dtype='object')
```
All three of the quiz's candidate columns are good choices for feature selection, since they duplicate information that the engineered features above already capture.
4.2 Removing redundant features
4.2.1 Selecting relevant features
Now let's identify the redundant columns in the `volunteer` dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the `volunteer` dataset in the console, you'll see three features which are related to location: `locality`, `region`, and `postalcode`. They contain repeated information, so it would make sense to keep only one of them.

There are also features that have gone through the feature engineering process: columns like `Education` and `Emergency Preparedness` are a product of encoding the categorical variable `category_desc`, so `category_desc` itself is redundant now.

Take a moment to examine the features of `volunteer` in the console, and try to identify the redundant features.
```python
# Create a list of redundant column names to drop
to_drop = ["locality", "region", "category_desc", "created_date", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
print(volunteer_subset.head())
```
```python
volunteer_subset.columns

Index(['title', 'hits', 'postalcode', 'vol_requests_lognorm', 'created_month',
       'Education', 'Emergency Preparedness', 'Environment', 'Health',
       'Helping Neighbors in Need', 'Strengthening Communities'],
      dtype='object')
```
4.2.2 Checking for correlated features
Let's take a look at the `wine` dataset again, which is made up of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for elimination. Then, remove those columns from the DataFrame.
```python
# Print out the column correlations of the wine dataset
print(wine.corr())

# Take a minute to find the column where the correlation value is greater than 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)
```
```python
print(wine.corr())

                              Flavanoids  Total phenols  Malic acid  OD280/OD315 of diluted wines       Hue
Flavanoids                      1.000000       0.864564   -0.411007                      0.787194  0.543479
Total phenols                   0.864564       1.000000   -0.335167                      0.699949  0.433681
Malic acid                     -0.411007      -0.335167    1.000000                     -0.368710 -0.561296
OD280/OD315 of diluted wines    0.787194       0.699949   -0.368710                      1.000000  0.565468
Hue                             0.543479       0.433681   -0.561296                      0.565468  1.000000
```
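Rather than eyeballing the matrix for a column that exceeds 0.75 at least twice, you could automate the check; here is a small sketch using that same threshold:

```python
# For each column, count how many *other* columns it correlates with above 0.75
corr = wine.corr().abs()
high_corr_counts = (corr > 0.75).sum() - 1   # subtract 1 for the self-correlation
print(high_corr_counts[high_corr_counts >= 2])
```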
4.3 Selecting features using text vectors
4.3.1 Exploring text vectors, part 1
Let's expand on the text vector exploration method we just learned about, using the `volunteer` dataset's `title` tf/idf vectors. In this first part of text vector exploration, we're going to add to the function we learned about in the slides and have it return a list of numbers. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our `text_tfidf` vector.
```python
vocab
{1048: 'web',
 278: 'designer',
 1017: 'urban',
 ...}

tfidf_vec
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

tfidf_vec.vocabulary_
{'web': 1048,
 'designer': 278,
 'urban': 1017,
 ...}

text_tfidf
<617x1089 sparse matrix of type '<class 'numpy.float64'>'
	with 3172 stored elements in Compressed Sparse Row format>
```
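Notice that `vocab` is simply `tfidf_vec.vocabulary_` with keys and values swapped (column index to word instead of word to index). If it isn't already provided, you can rebuild it with a dict comprehension:

```python
# Reverse the word -> index mapping so words can be looked up by column index
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}
```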
```python
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))

    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]: zipped[i] for i in vector[vector_index].indices})

    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the indices of the top weighted words for document 8
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, vector_index=8, top_n=3))
# [189, 942, 466]
```
4.3.2 Exploring text vectors, part 2
Using the function we wrote in the previous exercise, we’re going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.
```python
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):

        # Call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)

    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

filtered_text
<617x1008 sparse matrix of type '<class 'numpy.float64'>'
	with 2948 stored elements in Compressed Sparse Row format>
```
4.3.3 Training Naive Bayes with feature selection
Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the `volunteer` dataset's `title` and `category_desc` columns.
```python
# Split the dataset according to the class distribution of category_desc, using the filtered_text vector
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))
# 0.567741935483871
```
You can see that our accuracy score wasn't that different from the score at the end of chapter 3. That's okay; the `title` field is a very small text field, appropriate for demonstrating how filtering vectors works.
4.4 Dimensionality reduction
4.4.1 Using PCA
Let's apply PCA to the `wine` dataset, to see if we can get an increase in our model's accuracy.
```python
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)
```
```
[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
```
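Nearly all of the variance sits in the first component, most likely because the unscaled `Proline` column dominates the others. As a side note beyond the exercise, scikit-learn's `PCA` also accepts a fractional `n_components` to keep just enough components for a target share of variance:

```python
# Keep only enough components to explain ~95% of the variance (a common heuristic)
pca_95 = PCA(n_components=0.95, svd_solver='full')
reduced_X = pca_95.fit_transform(wine_X)
print(reduced_X.shape)
```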
4.4.2 Training a model with PCA
Now that we have run PCA on the `wine` dataset, let's try training a model with it.
```python
# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
knn.score(X_wine_test, y_wine_test)
# 0.6444444444444445
```
5.1 UFOs and preprocessing
```python
ufo.head()

                  date               city state country      type  seconds  \
2  2002-11-21 05:45:00           clemmons    nc      us  triangle    300.0
4  2012-06-16 23:00:00          san diego    ca      us     light    600.0
7  2013-06-09 00:00:00  oakville (canada)    on      ca     light    120.0
8  2013-04-26 23:27:00              lacey    wa      us     light    120.0
9  2013-09-13 20:30:00           ben avon    pa      us    sphere    300.0

    length_of_time                                               desc  \
2  about 5 minutes  It was a large, triangular shaped flying ob...
4       10 minutes  Dancing lights that would fly around and then ...
7        2 minutes  Brilliant orange light or chinese lantern at o...
8        2 minutes  Bright red light moving north to north west fr...
9        5 minutes  North-east moving south-west. First 7 or so li...

     recorded        lat        long
2  12/23/2002  36.021389  -80.382222
4    7/4/2012  32.715278 -117.156389
7    7/3/2013  43.433333  -79.666667
8   5/15/2013  47.034444 -122.821944
9   9/30/2013  40.508056  -80.083333
```
5.1.1 Checking column types
Take a look at the UFO dataset's column types using the `dtypes` attribute. Two columns jump out for transformation: the `seconds` column, which is a numeric column but is being read in as `object`, and the `date` column, which can be transformed into the `datetime` type. That will make our feature engineering efforts easier later on.
```python
# Check the column types
print(ufo.dtypes)

# Change the type of seconds to float
ufo["seconds"] = ufo.seconds.astype('float')

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo['date'])

# Check the column types
print(ufo[['seconds', 'date']].dtypes)
```
```
date                      object
city                      object
state                     object
country                   object
type                      object
seconds                   object
length_of_time            object
desc                      object
recorded                  object
lat                       object
long                     float64
dtype: object

seconds           float64
date       datetime64[ns]
dtype: object
```
5.1.2 Dropping missing data
Let's remove some of the rows where certain columns have missing values. We're going to look at the `length_of_time`, `state`, and `type` columns. If any of the values in these columns are missing, we're going to drop the rows.
```python
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[['length_of_time', 'state', 'type']].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo['length_of_time'].notnull() &
                     ufo['state'].notnull() &
                     ufo['type'].notnull()]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)
```
```
length_of_time    143
state             419
type              159
dtype: int64

(4283, 4)
```
5.2 Categorical variables and standardization
5.2.1 Extracting numbers from strings
The `length_of_time` field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from the text field using regular expressions.
```python
def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")

    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))

# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[['length_of_time', 'minutes']].head())
```
```
    length_of_time  minutes
2  about 5 minutes      NaN
4       10 minutes     10.0
7        2 minutes      2.0
8        2 minutes      2.0
9        5 minutes      5.0
```
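Note that "about 5 minutes" comes back as `NaN` because `re.match` only matches at the very start of the string, and that entry starts with a word rather than a digit. If you wanted to capture those rows too, `re.search` scans the whole string instead (a tweak of my own, not part of the exercise):

```python
def return_minutes_anywhere(time_string):
    # re.search looks for the pattern anywhere in the string, not just at the start
    num = re.search(r"\d+", time_string)
    return int(num.group(0)) if num is not None else None
```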
5.2.2 Identifying features for standardization
In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the `seconds` and `minutes` columns, you'll see that the variance of the `seconds` column is extremely high. Because `seconds` and `minutes` are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the `seconds` column.
```python
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo[['seconds']])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())
```
```
seconds    424087.417474
minutes       117.546372
dtype: float64

1.1223923881183004
```
5.3 Engineering new features
5.3.1 Encoding categorical variables
There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.
```python
# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x == 'us' else 0)

# Print the number of unique type values
print(len(ufo['type'].unique()))
# 21

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)
```
5.3.2 Features from dates
Another feature engineering task to perform is month and year extraction. Perform this task on the `date` column of the `ufo` dataset.
```python
# Look at the first 5 rows of the date column
print(ufo['date'].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda x: x.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda x: x.year)

# Take a look at the head of all three columns
print(ufo[['date', 'month', 'year']].head())
```
```
0   2002-11-21 05:45:00
1   2012-06-16 23:00:00
2   2013-06-09 00:00:00
3   2013-04-26 23:27:00
4   2013-09-13 20:30:00
Name: date, dtype: datetime64[ns]

                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013
```
5.3.3 Text vectorization
Let's transform the `desc` column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.
```python
# Take a look at the head of the desc field
print(ufo["desc"].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"])

# Look at the number of columns this creates
print(desc_tfidf.shape)
```
```
0    It was a large, triangular shaped flying ob...
1    Dancing lights that would fly around and then ...
2    Brilliant orange light or chinese lantern at o...
3    Bright red light moving north to north west fr...
4    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object

(1866, 3422)
```
5.4 Feature selection and modeling
5.4.1 Selecting the ideal dataset
Let's get rid of some of the unnecessary features. Because we have an encoded country column, `country_enc`, keep it and drop the other columns related to location: `city`, `country`, `lat`, `long`, and `state`.

We have `month` and `year` columns, so we don't need the `date` or `recorded` columns.

We vectorized `desc`, so we don't need it anymore. For now we'll keep `type`.

We'll keep `seconds_log` and drop `seconds` and `minutes`.

Let's also get rid of the `length_of_time` column, which is unnecessary after extracting `minutes`.
```python
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds', 'seconds_log', 'minutes']].corr())

# Make a list of features to drop
to_drop = ['city', 'country', 'date', 'desc', 'lat', 'length_of_time', 'long',
           'minutes', 'recorded', 'seconds', 'state']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
# (vocab is assumed to be the index-to-word mapping rebuilt from vec.vocabulary_,
#  as in section 4.3.1)
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)
```
```
             seconds  seconds_log   minutes
seconds      1.000000     0.853371  0.980341
seconds_log  0.853371     1.000000  0.824493
minutes      0.980341     0.824493  1.000000
```
5.4.2 Modeling the UFO dataset, part 1
In this exercise, we're going to build a k-nearest neighbors model to predict which country a UFO sighting took place in. Our `X` dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The `y` labels are the encoded country column, where 1 is `us` and 0 is `ca`.
```python
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))
# 0.8693790149892934
```
```
Index(['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',
       'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')
```
5.4.3 Modeling the UFO dataset, part 2
Finally, let's build a model using the text vector we created, `desc_tfidf`, using the `filtered_words` list to create a filtered text vector. Let's see if we can predict the `type` of the sighting based on the text. We'll use a Naive Bayes model for this.
```python
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
print(nb.score(test_X, test_y))
# 0.16274089935760172
```
As you can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of the text improves the model, and whether any of the other features are useful in predicting `type`.
Thank you for reading, and I hope you've learned a lot.