AI tales, your first Machine Learning model with Python, Conda, Scikit-learn & Jupyter.

Loïc Combis
ShopStyle Engineering
11 min read · Jun 11, 2021

Wow! This is very exciting!! The first (of many) AI-related posts. 😁 In just a few minutes, you will have your first Machine Learning model running. This is the first tale of a series focusing on Artificial Intelligence.

Artificial intelligence tales by Loïc Combis
Source: Pixabay

Once upon a time, a young and curious engineer wanted to get a better understanding of “What is AI?”, get hands-on experience with real-life tools, and of course, have fun (why not, after all 😃)! So, little by little, I started building knowledge, and models (lots of them…). Looking back retrospectively, a couple of questions popped into my mind…

How do you actually get started? And how can you navigate in such a large & diverse ecosystem as AI?

Well, the answer is not as trivial as you’d imagine… One needs to ask about theory vs practice, which tools are best to achieve their objectives, how much time one has… Which raises even more questions than before…

In the end, I figured that you want to build AI, right? Use real-life tools. Not just build knowledge about the mathematical models involved 🤯… Therefore, these tales will focus on hands-on experience. We’ll sprinkle theoretical ideas here and there, but we’ll mainly spend time actually developing AI. Because I believe that with practice, the theory will naturally come along the way.

🚨 Let’s take a deep breath and remember that moment 🧘‍♂️… From zero to hero… Let’s cross that bridge together.

Source: Pixabay

Setup

We’ll need a bunch of tools to get started: Python, Pandas, NumPy, Matplotlib, Scikit-Learn, Jupyter.

OMG, what’s all this 😩… Don’t worry, we’re about to discover it. As you can imagine, it would be quite painful to install all these one by one, so we’ll use a package & environment manager called Conda. There are different versions (Anaconda, the full manager, and MiniConda, the light version). We’ll use ⚠️⚠️ MiniConda ⚠️⚠️. You can download it here.

🚨 Note: Make sure to download MiniConda for python 3.8+.

MiniConda Installer Download Page

Once you’ve downloaded the installer, open the package and follow the instructions (the process might differ if you’re on Windows or Linux):

MiniConda installation window

After finishing installing Conda, you should be able to run the following command in a terminal:

conda --version

In my case, I’m using conda 4.9.2. If you run into issues while installing conda, refer to the installation guide. You can also find similar issues/solutions on Stackoverflow.

Now that we have conda running, we can create the environment containing all the tools we need and get started. To do so, go to your tutorial root directory and run the following command:

conda create --prefix ./first-model/env/

We just created an empty environment. Now you can activate this environment and install the tools we’ll need.

# Activate the first-model environment.
conda activate ./first-model/env/
# Install the tools we need.
conda install python jupyter pandas numpy scikit-learn matplotlib seaborn

It will ask you if you want to install a bunch of packages, approve and let conda’s magic happen.

🚨 Note: Deactivate an environment by running conda deactivate, which will bring you back to the “base” environment.

You just set up your AI ecosystem. Well done! 👏👏👏

Source: Giphy

Your first notebook

AI model development is quite different in the way we approach it. It’s not just about writing a succession of methods/classes that, given a set of parameters, logically output a consistent result. The actual “code” or logic is pretty straightforward in our case. The most important aspect is data analysis and preparation.

→ To build a good model, we need to understand the data we have. We need to explore, get familiar with the meaning of each feature (or type of data), and get an overview of the general trends. We need to make sense out of the raw data.

→ Machine Learning algorithms expect formatted data. We can’t just take raw data and plug it into an algorithm. At best, we’d get random results, but in most cases, it just won’t work.

→ After we’ve trained our model, we want to measure the results so we can iterate with different parameters and optimize the results.

All this process mixes both logic writing and data visualization. So instead of separating the analysis/visualization from the code, we will use a Jupyter Notebook. Here is how Jupyter describes its notebook feature:

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

So basically, using a notebook, we will build AI in a sensible way. In today’s article, we won’t go into much detail as this is more of a setup-and-get-started post. But in the following ones, we are going to go through these steps several times and explain more about the underlying reasoning for each one of them.

→ Setup the notebook and define its purpose.

→ Exploratory data analysis (EDA).

→ Data preparation.

→ Model configuration & training.

→ Result measurement.

No need for more explanations, let’s get started!! In your terminal, run the following command (make sure the correct environment is activated).

jupyter notebook

This will open a window in your default browser that should look like this:

Jupyter Notebook browser
Jupyter Web client

To create a new notebook, click New > Python 3. This will open a new tab with an empty notebook. You can rename it “first-notebook” as I did so you can find it more easily later.

First Jupyter Notebook

If you go back to the first tab, you will notice that the notebook appears in the file list. Same if you go to your project folder, the notebook will be there as well.

So, here is our empty notebook. A notebook works with cells, following these rules:

Each cell has two modes, edit and view. Use your keyboard arrows to navigate cells and press Enter on a selected cell to switch to edit mode. Press ESC to switch back to view mode.

There are two types of cells, Code (to run Python code) & Text (to write markdown). You can switch between the two with ESC + y or ESC + m when the cell is selected. Type CTRL + Enter (or CMD + Enter on Mac) to run a cell.

The “code” cells share the same runtime, meaning that you can access & modify the state defined in other cells.

Let’s play with all this for a moment. Select the first cell and type the following:

# First Notebook

Then type ESC + m (text cell) and then type CTRL/CMD + Enter (run the cell).

In a second cell below, type ESC + y (code cell) and paste the following:

print("Hello World")a = 1
b = 2

Then run the cell. Finally, paste the following in a third cell:

print(a + b)
b = 3
print(a + b)

Run this cell as well, and normally your notebook should look like this:

Jupyter Notebook with Code & Markdown cells.

🚨 Note: If you face difficulties using the notebook, I would recommend reading this Introduction to Jupyter Notebooks. You will find all the documentation you need to get unblocked.

Feel free to experiment more and get familiar with the tool. The next section will focus on building your first AI model…

What?! Already?!!

Indeed…

Your first model

Let’s start by creating a new notebook (as explained in the previous section), and call it “first-model”.

Load the data

The first part consists in loading our dataset and peeking at the different features to understand the data. In the first cell, paste the following:

from sklearn import datasets
import pandas as pd
import matplotlib.pyplot as plt

The sklearn package contains a number of Machine Learning models implementing different algorithms that can be trained on a dataset, as well as useful methods to preprocess data. In addition, it contains several example datasets that can be used to learn machine learning. More about Scikit-Learn.

pandas is a Python library defining two main objects — DataFrame & Series — which facilitate data analysis. More about Pandas.

matplotlib is another library, used to plot all sorts of charts. More about Matplotlib.

Then we load the Iris dataset.

# Load Scikit Learn Iris Demo dataset.
iris = datasets.load_iris()

Turn the dataset into a DataFrame and print the first five rows.

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.head()

🚨🚨 Note: Don’t forget to run the cells in the correct order!! You can see the index at which the cell was last run in the In [...] marker.

First Model, Data Analysis.

As you can see in the screenshot above, the dataset contains 4 features: sepal length, sepal width, petal length & petal width. The target column determines the type of Iris the row represents.

0 → Setosa

1 → Versicolour

2 → Virginica
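If you want to double-check this mapping yourself, the loaded dataset object exposes the class names directly (their index corresponds to the target value above):

# The index of each name matches the target value (0, 1, 2).
list(iris.target_names)
# ['setosa', 'versicolor', 'virginica']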

len(df) # gives us the number of rows in the dataframe.

We also notice that the dataset is not missing any values, so we’re good to go for the data preparation. 😁
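A quick way to confirm that with pandas:

# Count missing values per column; all zeros means nothing is missing.
df.isna().sum()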

🚨🚨 Note: As we’ll see in the following posts, this does not constitute an actual Exploratory Data Analysis (EDA). We’re facing a pretty straightforward dataset, but in the real world, the data is generally way more complex… And requires taking more time to study the different features, their shapes, etc…

Add & run the following cell:

from sklearn.utils import shuffle

# Shuffle dataset rows.
shuffled = shuffle(df)
# Separate features (X) from the prediction target.
X = shuffled.drop('target', axis=1)
y = shuffled['target']

As the dataset is ordered (all rows with 0, then all rows with 1…), we shuffle the dataset. ⚠️ This is very important!! Otherwise the model could pick up misleading patterns while training.
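If you want to see this ordering (and the effect of the shuffle) for yourself, here is a quick sanity check, assuming the cells above have already run:

# The raw dataset is sorted by class: the first rows are all 0s, the last rows all 2s.
print(df['target'].head().tolist(), df['target'].tail().tolist())
# After shuffling, the classes should appear in a random order.
print(shuffled['target'].head().tolist())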

Second, we’re going to split the data set in two parts: Training & Testing. To do so, we’ll use the very handy method train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split

# Split the dataset for training & validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

test_size defines the ratio of the dataset to be used for testing (20% here).

random_state defines the seed to be used for randomizing the split. It gives us the ability to reproduce the same results despite the randomness of the split.
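To convince yourself of that reproducibility, here is a small sketch reusing the X and y from above: splitting twice with the same seed yields exactly the same rows.

# Two splits with the same random_state produce identical training sets.
X_train_a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train_a.equals(X_train_b))
# True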

Finally, verify the sizes of the different subsets.

len(X_train), len(X_test), len(y_train), len(y_test)
# (120, 30, 120, 30)

Almost there… T minus 2 minutes before your first model becomes reality!!

Source: Giphy.

Train & test the model

Although we already prepared the data, to reach better accuracy, we’re going to standardize the values (rescale each feature to zero mean and unit variance). This prevents some features from weighing more than others due to scale/unit differences (e.g.: comparing a human height in meters with their weight in grams).
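To make the effect of scaling concrete, here is a tiny standalone sketch (the height/weight numbers are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales:
# height in meters and weight in grams.
data = np.array([[1.60, 55000.0],
                 [1.75, 72000.0],
                 [1.90, 90000.0]])

scaled = StandardScaler().fit_transform(data)

# Each column now has mean 0 and unit variance, so neither feature
# dominates simply because of its unit.
print(scaled)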

The good thing is that sklearn already has everything we need to create a pipeline normalizing the data and then fitting (training) the model.

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
# Create a pipeline to normalize the data & fit the model.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())
# Train the model
pipe.fit(X_train, y_train)
# Score the model with the test data
pipe.score(X_test, y_test)

In the cell above, StandardScaler will take care of normalizing the data and RandomForestClassifier is our actual Machine Learning model!!!

Run the cell and you should obtain an accuracy around 93.33%.

OMG.. 3 lines of code to scale our data, train & score the model… Are you kidding me?

Unfortunately no… sklearn makes it really easy for us to build AI… ⚠️ But, remember that we are using a small demo dataset. You’ll see in the next posts that it gets trickier really quickly!
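As a small bonus (not required for the rest of this post), you can also peek inside the trained pipeline to see which features the random forest relied on most. A minimal sketch, assuming the pipe from above has already been fitted:

# make_pipeline names each step after its lowercased class name.
forest = pipe.named_steps['randomforestclassifier']

# feature_importances_ holds one weight per feature column.
for name, importance in zip(X.columns, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")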

Let’s go a bit further in the assessment of our model. First, let’s get the actual predictions:

y_preds = pipe.predict(X_test)
y_preds
# array([1, 0, 1, 1, 0, 1, 1, 2, 2, 0, 2, 0, 2, 1, 0, 2, 1, 1, 1, 0,
#        2, 2, 0, 1, 2, 2, 2, 0, 1, 2])

Now, using matplotlib and seaborn (a visualization library built on top of matplotlib), we are going to print out the confusion_matrix.

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

sns.set(font_scale=1.5)  # Increase font size
fig, ax = plt.subplots(figsize=(3, 3))
ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                 annot=True,  # Annotate the boxes
                 cbar=False)
plt.xlabel("Prediction")
plt.ylabel("True Label")
Model Confusion Matrix.

But what does that mean? This matrix compares the predicted label vs the actual label: each row corresponds to a true label, each column to a prediction.

→ On the diagonal, you have the counts of correct predictions (e.g.: cell (0, 0): the model predicted 0 and the actual label is 0).

→ The sum of each row (minus the diagonal value) gives us the false negatives for that label (e.g.: in the row for true label 2, we predicted 1 twice for flowers that are actually 2).

→ The sum of each column (minus the diagonal value) gives us the false positives (e.g.: in the column for prediction 1, we predicted 1 twice whereas the true label was 2).

The confusion matrix can give us interesting insights into which labels are the most difficult to tell apart.
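If the row/column reading still feels abstract, here is a tiny standalone example (with made-up labels, unrelated to our Iris split) showing how the matrix is laid out:

from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows are true labels, columns are predictions: the off-diagonal 1
# means one sample with true label 1 was predicted as 2.
print(confusion_matrix(y_true, y_pred))
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]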

Another interesting table to assess our model is the classification report.

print(classification_report(y_test, y_preds))
Model Classification Report.

Precision. Ratio of true positives / predicted occurrences (true positives + false positives).

Recall. Ratio of true positives / (true positives + false negatives). It is the number of times we predicted a label versus how many times we should have.

F1-Score. Harmonic mean of the precision and recall.

Support. Number of occurrences “supporting” the scores (e.g.: there were 8 occurrences of the label 0 in y_test).
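To make these definitions concrete, here is a small standalone sketch (with made-up binary labels, separate from our Iris results) where the numbers are easy to check by hand:

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up binary example: 1 is the "positive" class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# 3 true positives, 4 predicted positives, 4 actual positives:
# precision = 3/4, recall = 3/4, f1 = their harmonic mean = 0.75.
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))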

Conclusion

We’ve done it! Our first Machine Learning model… Today, we saw how one can quickly get started with AI and install all the tools (Conda, Python, Scikit-Learn, Pandas). How to create Jupyter notebooks, load and prepare a dataset, and finally, how to create a pipeline allowing you to scale your data, train and assess a model. But… This is just the beginning. There is so much more going on… So many more ML models to build… Next time, we’ll dive deep into Exploratory Data Analysis, Machine Learning models, Fine Tuning and Model Assessment.

All the notebooks in this post are available on the AI tales repo.

I really hope you enjoyed this first step into the world of AI. Do not hesitate to clap if that’s the case. If you didn’t, feel free to yell at me on Twitter.

⚠️⚠️⚠️ Coming Soon — AI tales, Machine Learning in Depth. ⚠️⚠️⚠️

More Links

If you’re interested in other things I might be doing 😋, here is my personal website.

My other posts on medium.

My Github.

What do we do at ShopStyle?

ShopStyle main website.
