Data Science Flashcards

Expand your knowledge on Data Science and Machine Learning through this comprehensive collection of flashcards. Contribute

Statistics

Types of data

What are the three main types of data you can encounter in data science?
The three main types of data are:
- numerical data: the data represents some kind of quantitative value that you can measure. It could be a temperature or a weight. Numerical data can be divided into
  - discrete data: it is integer based and can take a limited set of values, for example like the amount of money you have on your account.
  - continuous data: it has an infinite number of possible values like the weight of a person, which can be 60 Kg or 60.345 Kg or 60.345678 Kg. In this example you can take the weight with as many decimals as you like.
- categorical data: it is data that can have numerical values but those numbers have no intrinsic mathematical meaning. Some examples are gender, race, religion. You can map values to numbers, like say male is 0 and female is 1. But those numbers have no mathematical meaning, i.e. the male value has no numerical relationship with the female value.
- ordinal data: this is a mixture of numerical and categorical data. More specifically, it is categorical data with mathematical meaning. For example the hotel rating stars from 1 to 5 have a numerical meaning: a five star hotel is better than a two star hotel.
What is a random variable? Can you make an example?

A random variable is a variable that associates a numerical value with each possible outcome of a random phenomenon. Each numerical value is linked to a specific probability according to a probability distribution

A classic example of random variable is the outcome of rolling a six-sided dice. In this case, the random variable represents the possible results that can result from rolling the dice. Let X be the random variable denoting the outcome of the dice. The random variable X can take on the values 1, 2, 3, 4, 5, 6, with each value having an equal probability of 1/6.
What is the difference between discrete and continuous random variables? Can you make a couple of examples of both?

Discrete random variables have a finite number of possible outcomes. An example of discrete random variables is the result of tossing a coin. There are two possible outcomes (head or tail).

Continuous random variables have a non-finite number of possible outcomes. In other words a continuous random variable can assume any value in an interval. An example of continuous random variable is the temperature in a room. The temperature can take any value within a range.

Averages

There are three different sorts of average: mean, median, mode. How do they differ?

Mean (μ) =
sum all items / number of items

Median = value in the middle when the values are sorted. When the number of items is even there is no a single middle value and the mean of the two middle numbers is taken instead.

Mode = most frequent value in a dataset

For example given 1,3,4,4,2006 we have
Mean = (1+3+4+4+2006) / 5
Median = 4
Mode = 4

For example given 1,3,4,5,6,8 we have 6 numbers. The median is given by the mean of the two middle values i.e. (4+5)/2
When is the mean more useful and when the median is more useful?

The mean is more useful when there are no outliers. Indeed the mean is more affected by outliers than the median. The mean tends to move in the directions of the outliers.

The median is more useful when there are outliers.

In other words, the mean is more representative when the data distribution is normal while the median is more representative when the data distribution is skewed.
What does it mean when the median and the mean have the same value or different values?

If the distribution is a normal distribution then the mean and variance are equals. However, when the mean and variance are equal, this does not mean the distribution is normal.

If mean > median, the distribution is skewed to the right, i.e. the tail is on the right.

If mean < median, the distribution is skewed to the left, i.e. the tail is on the left.
Can you use the mean/median with categorical data?

No, it is not possible to use the mean/median with categorical data. Mean/median can be used only with numerical data. The average that you can use with categorical data is the mode.
When is the mode useful as an average quantity?
The mode is the most frequent value in a dataset. The mode is useful as an average quantity when
- the data is categorical. Indeed in this case the mode is the only average that can be used since the mean and median are used with numerical data.
- the data is numerical or categorical and there are two or more clusters. In that case it is more representive to take the node of each cluster.
  
  For example if you have the numbers (2, 2, 3, 3, 3, 4, 4, 40, 40, 41, 41, 41, 42, 42), we have two clusters. The more representative average values are then the modes 3 (mode of first cluster) and 41 (mode of second cluster).

Variability and spread

What quantities can you use to measure variability and spread?

You can use the following quantities: range, interquartile range, variance and standard deviation.
What is the range?

The range is given by the difference between the largest and the smallest values in a dataset. The range is one of the ways of measuring how much a dataset is spread out.
What are some advantages / disadvantages of the range?

The range has the advantage that is pretty simple and thus easy to understand.

The disadvantages of the range are that it does not provide information about the data within the range and it is heavily affected by outliers. Indeed outliers are always at the border of the dataset.
What is the interquartile range?
The interquartile range (IQR) is another way of measung how spread a data is. The idea is to provide a value that is less affected by outliers than it is the range. In simple terms the interquartile range can be thought of as a mini range around the center of the data.

The interquartile range can be found as follows:
1. sort the data.
2. Find 3 values Q1, Q2, Q3 that split your dataset into four equal parts. Q1, Q2, Q3 are called quartiles.
3. The interquartile range is then given by IQR = Q3 - Q1.
  Q3 is called upper quartile, Q1 is called lower quartile.
Notice that we take the quartiles around the center of the data. In this way outliers are excluded automatically from the computation of IQR.
When you compute the interquartile range, you have to find the so-called quartiles, points that divide the dataset into four equal parts. The middle quartile is equal to a certain type of average. What average is that?

The middle quartile is equal to the median.
What is the variance? What is its main disadvantage?

The variance is a way of measuring how the data is spread. The variance is defined as the average of the squared distance of all points from the mean.
Variance =
Σ (x - μ)² / n
=
Σ x² / n
- μ²
where μ = mean, n = number of points

The main disadvantage of the variance is that it describes the data spread using squared distances, and this is less intuitive.
What is the standard deviation?

The standard deviation ρ describes how spread the data is. It is actually the squared value of the variance. In this way the main weak point of the variance (the fact of using squared distances) is compensated.

ρ = √variance

A small standard deviation means that most of the points are close to the mean value. A high standard deviation means the points are spread out and far away from the mean.
What is the minimum value of the standard deviation?

The smallest value of the standard deviation is 0. This happens when all data points are equal.
What are standard scores? What is the main idea behind?

Standard scores (denoted by z) are numerical quantities that allow you to compare data distributions with different means and variances. In mathematical terms we have
z =
x - μ / ρ

where μ = mean, ρ = standard deviation.

The main idea behind standard scores is that the data distributions to compare are transformed into a new theoretical data distribution with μ = 0 and ρ = 1. This makes the comparison possible.

Probabilities

What is the definition of probability of occurence of event A, that is indicated by P(A)? What are the maximum and minimum values of P(A)?

The probability of occurence of event A is indicated by P(A) and is given by the number of ways that event A can happen divided by the number of all possible outcomes.

For example when you toss a coin, the probability of getting head is given by the number of ways you can get head (1 actually) divided by all possible outcomes (possible outcomes are 2: hear or tail), i.e. the probability of getting head is equal to 1 / 2.

In mathematical terms we have
P(A) = num cases for event A / num all cases

P(A) is a value between 0 (event not happening) and 1 (event happening for sure).
What is the meaning of the complementary event of A, that is indicated by A’?

P(A’) is the complementary event of A, i.e. it is the event that event A does not happen. The complementary event is given by
P(A') = 1-P(A)
What does it mean two events are mutually exclusive or disjoint?

Two events are mutually exclusive or disjoint when only one of the two events can occur.
What happens when two events intersect?

When two events intersect, they can occur simultaneously.
What is the meaning of the probability P(A ∪ B)? What is the formula for the probability P(A ∪ B)?

P(A ∪ B) is the probability that either event A or event B happens.

In mathematical terms we have
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Notice that it is necessary to remove P(A ∩ B) because otherwise some probabilities would be counted twice.
If events A and B are disjoint, what is the formula for the probability P(A ∪ B)?

Since events A and B are disjoint, we have that P(A ∩ B) = 0. As a result
P(A ∪ B) = P(A) + P(B)
When are events A and B exhaustive?

Events A and B are exhaustive when P(A ∪ B) = 1.
What does it mean that two events are independent?

Two independent events are two events that do not affect each other. If one event occurs, the probability of the other occuring remains exactly the same.

When events A and B are independent, the conditional probability of A happening given that B has already happened is equal to the probability of A happening. We can write
P(A|B) = P(A)
If events A and B are independent, what is the probability of both events happen?

Since events A and B are independent, we can write P(A|B) = P(A).
If we also recall that P(A|B) = P(A ∩ B) / P(B) we can then write
P(A ∩ B) = P(A) x P(B)
What diagram is particularly useful for visualizing probabilities? Can you use it to visualize P(A), P(A'), P(A ∪ B), P(A ∩ B)?
The Venn diagram is very useful for visualizing probabilities. The Venn diagram is a diagram where the space of all possible outcomes is represented by a rectangle and events are represented by circles.

In the following image you can see the following probabilities represented with the Venn diagram:
- P(A) = probability of event A happening.
- P(A') = probability of event A' happening, i.e. probability of event A not happening.
- P(A ∪ B) = probability of event A or event B happening.
- P(A ∩ B) = probability of events A and B happening.
What is the meaning of conditional probability P(A|B)? What is its formula?

The conditional probability P(A|B) is the probability that event A happens given that event B has already happened. P(A|B) can be calculated as follows

P(A|B) =
P(A ∩ B) / P(B)
What diagram is particularly useful for visualizing conditional probabilities? Can you use this diagram to visualize the conditional probability P(A|B)?

Conditional probabilities can be well visualized with probability trees, while Venn diagram are not so useful for conditional probabilities.

In the picture below you can see an example of probability tree. It is a tree structure with the leaves A, A', B, B' that are grouped as sets of exaustive events. More specifically, B and B' are exhaustive as well as A and A'. The first set of branches shows the probabilities P(B) and P(B'). The second set of branches shows the probabilities of P(A), P(A') given that the probabilities of the first set of branches have happened. Notice that the probabilites of each set of branches sums to 1.
When is the Venn diagram useful and when is the probability tree useful?

The Venn diagram is useful when you want to show intersecting and mutually exclusive events.

The probability tree is useful when you want to show conditional probabilities.
What is the Law of Total Probability?

The Law of Total Probability allows you to find the probability of an event by using conditional probabilities. The key idea is that we find the probability of an event by adding together the probabilities of all different ways that event can happen. In the example below P(B) can happen in two ways, either with event A or without event A. In mathematical terms we have

P(B) = P(B ∩ A) + P(B ∩ A’) = P(A) P(B|A) + P(A’) P(B|A’)
What is the Bayes' Theorem? Can you find its formula?

The Bayes' Theorem allows you to find reverse conditional probabilites. In other words, given the probabilities P(B|A) and P(B|A'), the Bayes' Theorem allows you to find the reverse conditional probability P(A|B).

The formula for the Bayes' Theorem is

P(A|B) =
P(A ∩ B) / P(B)
=

P(A) * P(B|A) / P(B ∩ A) + P(B ∩ A')
=

P(A) * P(B|A) / P(A) * P(B|A) + P(A') * P(B|A')

Notice that the denominator is the law of total probability.
if two events are independent, can they be mutually exclusive?

If two events A and B are independent, both events can happen so they cannot be mutually exclusive. Indeed A and B would be mutually exclusive when only one of the events could happen.

Discrete probability distributions

What is a probability distribution?

A probability distribution is a plot that displays the probabilities associated with all possible outcomes of a random variable.
What is the symbol for a random variable and the symbol for the possible values it can take?

A random variable is represented by a capital letter like X or Y.

The possible values that the random variable can take are represented by a lowercase letter like x or y.
What is the expectation of a probability distribution?

The expectation E(X) (also indicated by μ) of a probability distribution X is basically the average value of the probability distribution. In other words, the expectation is the expected long-term average value. It is defined as

E(X) = Σ x * P(X = x)

For example given x = 1,2,3 and P(x) = 0.2, 0.5, 0.3 we have that E(X) equals 1*0.2 + 2*0.5 + 3*0.3 = 2.1
What are the formulas for the variance and standard deviation of a probability distribution?

The variance of a probability distribution is defined as:
Var(X) = E(X - μ)² = Σ (x - μ)² P(X=x)

The standard deviation of a probability distribution is defined as the root square of the variance.
ρ = √Var(X)
Let's suppose you know the expectation and variance of a probability distribution X, write the formulas of expectation and variance of the distribution Y = aX + b.

Y = aX + b means that Y is obtained from a linear transformation of X. With linear transformations the values are changed but the probabilities remain the same. This means that in the given example Y has different values from X but the same probabilities. As a result we can use the following shortcuts to compute E(Y) from E(X) and Var(Y) from Var(X).

E(Y) = E(aX + b) = a E(x) + b
Var(Y) = Var(aX+b) = a² Var(X)
Given a probability distribution X, what are the expectation and variance of n independent observations (X₁, X₂, ...X_n) that follow the same distribution?

Given n independent observations (X₁, X₂, ...X_n) we have

E(X₁, X₂, ...X_n) = n X(n)

Var(X₁, X₂, ...X_n) = n Var(x)
Given two independent probability distributions P(X) and P(Y) with expectations E(X) and E(Y) and variances Var(X) and Var(Y) respectively, what are the expectations and variances of the probability distributions P(X + Y) and P(X - Y)?

Because P(X) and P(Y) are independent probability distributions, we can write

E(X + Y) = E(X) + E(Y)

Var(X + Y) = Var(X) + Var(Y)

E(X - Y) = E(X) - E(Y)

Var(X - Y) = Var(X) + Var(Y)
Notice that even if we subtract X and Y, the variance increases.

Permutations and combinations

Given n objects, in how many ways can you arrange them in a line? In how many ways can you arrange them in a circle?

Given n objects, you can arrange them

n! times in a line

(n-1)! times in a circle
Given n objects where k of them are of a type and j of them of another type, in how many ways can you arrange them by type?

Given n objects, where k are of a type and j are of another type, the total numbers of ways the objects can be arranged by type is
n! / j! k!
In how many ways can you choose k items out of n items when the order matters? In how many ways can you choose k items when the order does not matter? How is it called the number of ways when the order counts and does not count?

Given n objects, you can pick k items in a fixed order in a number of ways that is indicated by ⁿP_k and is defined as
ⁿP_k =
n! / (n - k)!

Given n objects, you can pick k items without caring of the order in a number of ways that is indicated by ⁿC_k and is defined as
ⁿC_k =
n! / k! (n - k)!

ⁿP_k is the number of permutations (for which the order matters). ⁿC_k is the number of combinations (for which the order does not matter).
Given n items, when you pick k items, what is the difference between permutations and combinations?

Permutations are about choosing k items in a particular order.

Combinations are about choosing k items without caring of the order.

For example if you pick an apple and then a banana or if you pick a banana and then an apple, we have two different permutations (order matters) but only one combination (order does not matter).
Can it happen that the number of permutations is less than the number of combinations?

No, it can never happen that the number of permutations is less than the number of combinations. This is due to the fact that the order does matter for permutations. For example if we have an apple and a banana to pick, we have one possible combination but two possible permutations because picking first the banana is a different permutation from picking first the apple. On the contrary it is the same combination if we pick first the banana or the apple since the order does not matter for combinations.
What is an uniform distribution? Draw a plot of it.

An uniform distribution is a distribution where all values have the same probability.

Binomial, Poisson and geometric distributions

If the probability of success of an event is p and its probability of failure is q = 1 - p, what is the formula of the probability that the event succeeds after n attempts? What is the name of this formula?

Let p be the probability of success and q = 1 - p the probability of failure. The probability of success after n attempts is

P(X = n) = q^n-1 p

This formula is called the geometric distribution.
What real-world scenarios are described by a geometric distribution? What is its formula and distribution? What is the mode of any geometric distribution?
The scenarios that are described by a geometric distribution are those where
- you have a set of independent trials
- each trial can either succeed or fail. The probability of success at each trial is the same
- you are interested in the probability of succeeding after n trials.
Let X be the number of trials needed to get a successful outcome. The probability of X being a certain value r is
P(X = r) = q^r-1 p

The probability distribution of a geometric distribution is given by the image below.

The mode of any geometric distribution is 1 because P(X=r) has the maximum value at r=1 and decreases as r increases.
Given a geometric distribution X, what is the probability that more than r trials are needed to get the first successful outcome? What is the probability that r or less trials are needed to get the first succesful outcome? The probability of success of each trial is denoted by p.

In order to need more than r trials to succeed, we must fail at least r times. We can then write
P(X > r) = (1 - p)^r

The probability of needing r or less trials to succeed is given by
P(X <= r) = 1 - P(X > r) = 1 - (1 - p)^r
How to find the expectation of a geometric distribution X with p being the probability of success of a single trial? What is the meaning of the expectation of a geometric distribution? What is the formula of the variance instead?

The expectation E(X) of a geometric distribution X with p being the probability of success of a single trial is given by
E(X) =
1 / p

The expectation of a geometric distribution is the number of attempts that are needed in order to obtain a successful outcome.

The variance Var(X) of the geometric distribution above is given by
Var(X) =
q / p²
where q = 1 - p.
What are the conditions where we have a binomial distribution? What is the formula of a binomial distribution where we have n trials, probabilities of success p and insuccess q of each trial, and r successes?
A binomial distribution describes the following cases
- we have a number of independent trials
- each trial can either succeed or fail. The probability of success at each trial is the same
- You are interested in the number of successes out of n trials. The number of trials n is finite.
Let X be the number of succesful outcomes out of n trials, r the number of successes, p the probability of success and q the probability of insuccess of each trial. The formula of a binomial distribution is
P(X = r) = ⁿC_r p^r q^{n - r}
What are the formulas of expectation and variance of a binomial distribution?

The expectation of a binomial distribution is defined as
E(X) = np

The variance of a binomial distribution is defined as
Var(X) = npq
What scenarios are covered by a Poisson distribution?
The Poisson distribution covers scenarios where:
1. we have a set of independent events occurring at random in a given interval of space or time.
2. we know the mean number of occurrences in that interval. This mean is represented by the symbol λ (lambda).
How to indicate that the variable X follows a Poisson distribution with a mean of occurencies of λ? What is the formula for finding the probability that there are r occurencies in a specific interval?

To indicate that X follows a Poisson distribution with a mean of occurencies of λ we can write
X ~ Po(λ)

The formula for finding the probability of having r occurencies in a specific interval is
P(X = r) =
e^-λ λ^r / r!
What are the expectation and variance for the Poisson distribution?

Given a Poisson distribution X ~ Po(λ), the expectation E(X) and variance Var(X) are
E(X) = Var(X) = λ
What is the main difference between the Poisson distribution and the binomial and geometric distributions?

The biggest difference between the Poisson distribution and the binomial and geometric distributions is that the Poisson distribution does not involve a series of independent trials. The Poisson distribution describes the number of occurencies in a particular time or space interval. By contrast binomial and geometric distributions do involve a set of independent trials.
Given two independent Poisson distributions X ~ Po(λ_x) and Y ~ Po(λ_y), what is the distribution of X+Y?

If two distributions X and Y are independent, then
P(X + Y) = P(X) + P(Y)
E(X + Y) = E(X) + E(Y).
It follows that
X + Y ~ Po(λ_x + λ_y).
In other words, if X and Y are Poisson distributions, so is X + Y a Poisson distribution.
What types of discrete probability distributions are there?

There are the Bernoulli distribution, the Binomial distribution and the Poisson distribution.

The Bernoulli distribution is a random experiment that has 2 possible outcomes with probability p and q, where p+q = 1

The Binomial distribution is a sequence of repeated Bernoulli trials performed indipendently.
n = number of items, k = num items I pickup
All possible combinations are n! / ( (n-k)! * k! )

The Poisson distribution is a limiting version of the Binomial distribution where the number of trials n gets really large but the probability of success p gets smaller at the same time. In other words if you have a Binomial distribution where you let n tends to infinite and p tends to zero such that np = λ, then that distribution approaches a Poisson distribution with parameter λ.

Continuous probability distributions

When you have measured data, is this usually discrete or continuous data? When you have data that you can count, is this discrete or continuous data?

Measured data is continuous data because in that case you have an infinite number of values you can measure depending on the accuracy of the instrument of measure. For example a distance can be 4.5mm or 4.56mm, but there are potentially an infinite number of decimal numbers that are given by the accuracy of the measurement.

Data that you can count is discrete data since it has a finite number of values.
What is the difference between what we want to find for discrete and continuous probability distributions?

For discrete probabilitiy distributions we are interested in finding the probability of getting a particular value.

For continuous probabilitiy distributions we are interested in the probability of getting a range of values. The size of the range is given by the degree of accuracy. Indeed for continuous data you cannot get the probability of a specific value because that value can have an infinite number of decimals.
What is a probability density function? How to find the probabilities?

A probability density function describes the probability distribution of a continuous random variable.

The probabilities are given by the area under the curve.

Normal distribution

What is the normal distribution? What parameters do define the normal distribution?

The normal distribution is a density distribution where the data is distributed around a central value in the form of a bell-shaped curve. Random variables typically have a normal distribution.

The normal distribution is defined by the mean μ and the standard deviation σ². μ defines the center of the curve and σ² defines the spread of the data.

In mathematical terms the normal distribution is indicated as X ~ N(μ,σ²).

A normal distribution looks like the following image. In this example we have μ = 4 and σ = 1.
Why is the normal distribution called normal?

The normal distribution is called normal because it is seen as the distribution that you would expect for many continuous data in real life. In practice many distributions of continuous data resemble the normal distribution.
What is the Central Limit Theorem? Why is it important?

The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger.

The Central Limit Theorem is important because under certain conditions you can approximate a non-normal distribution with a normal distribution.
What is the difference between a parametric and a non-parametric test?

Parametric tests are statistical tests that make assumptions about the distribution of the data. For example, a t-test assumes that the data is normally distributed. Non-parametric tests do not make assumptions about the distribution of the data. Parametric tests are generally more powerful than non-parametric tests, but they are only valid if the data meets the assumptions of the test. Non-parametric tests are less powerful than parametric tests, but they can be used on any type of data, regardless of the distribution. It is important to choose the right type of test for your data.

Data Analysis

What are the two main types of data?

The two main types of data are
- numeric.
- categorical. The variable can be only one of a finite set of values.
You can divide numerical data into two types. What are they?
The two types of numerical data are:
- interval data. Interval data is a data type that is measured on a scale where all points are at the same distance. Interval data does not have a zero point. Without a zero point the concept of multiplication and division is not applicable and thus only sum and difference operations can be applied to interval data. An example of interval data is the temperature in C and F. You cannot say that 20C is twice as hot as 10C. This is equivalent to 50F and 68F which is clearly not as twice as hot.
- ratio data. Ratio data is similar to interval data except that it has a meaningful zero value so that all the sum, subtraction, multiplication and division operations can be applied. Some examples of ratio data are the length and the weight.
You can divide categorical data into two types. What are they?
There are two types of categorical data:
- ordinal data = this data has some kind of order but the difference between items is not important. An example of ordinal data are the months or the happiness state (sad, happy, very happy etc).
- nominal data = data without order and without any quantitative value. Some examples of nominal data are all cities in the USA, the gender (male/female), the color hair (brown, black, blonde etc).
How to find outliers in a dataset?
Here are some tools and techniques you can use to find outliers:
- a histogram.
- a scatter plot.
- a boxplot.
- the interquartile range (IQR) and say that outliers are outside of the 1.5 IQR limit.
- The standard deviation and consider as outliers those points that are far more than 3 standard deviations from the mean.
- sort the data and look for unusual low or high values.

Distance metrices

What is the hamming distance?

The hamming distance is used for categorical features. If all the variables are categorical then the hamming distance is defined as the number of mismatches.
What is the Manhattan (or city block) distance?

It is the number of horizontal and/or vertical steps to go from point p1 to point p2
What is the Levenshtein distance?

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) to change one word into one another

Feature engineering

What are the main techniques of feature engineering?
The main techniques of feature engineering are
- feature selection. It selects a subset of features without transforming them
- feaure learning. Use ML learning algorithms to extract the best features
- feature extraction. transform the existing features into new features. This is usually done by dim reduction
- feature combination. Combine single features into a more powerful feature
What are the main purposes of feature selection?
Reducing the number of features by feature selection is useful for
- making models more understandable
- reducing the training times
- avoid the curse of dimensionality
- reduce the risk of overfitting by removing noise in the data
What are the main techniques of feature selection?

The main techniques of feature selection are

- filter methods. Statistical techniques are used to evaluate the relationship between indiviual features and the target features. This technique is completely independent of the model

- wrapper method. This consists in training a machine learning model by adding/removing features so you can evauate their effect on the model based on a certain scoring

- embedded methods. The features are selected during the model training and more important features are assigned an higher rank. Examples of such models are decision trees and lasso regression models
What are some differences between the main techniques of feature selection?

Some differences between filter and wrapper methods are

- wrapper methods rely on machine learning models while filter methods do not. Filter methods rely on statistical values.

- filter methods can fail when there is not enough data to extract meaningful statistical values

- wrapper methods take much more time since they involve training a number of machine learning models.

- using features extracted with wrapper methods can lead to machine learning models affected by overfitting since those features have been extracted using machine learning models.
What is the Z-score normalization (also called standardization)?

The Z-score normalization is a technique that scales the values so that they will have the properties of a standard distribution with mean μ = 0 and standard deviation σ = 1.
The formula for the normalization is
z =
x - μ / σ
What is the min-max scaling (also called normalization)?

It is a technique that scales the values between 0 and 1.
The formula for the min max scaling is
value_scaled =
value_to_scale - min / max - min
When is Z-score normalization used? When is min-max scaling used instead?

The Z-score normalization is used when the data has a gaussian distribution.

The min-max scaling is more used when the distribution is not gaussian or not known. In addition, the min-max scaling works better than the Z-normalization when the standard deviation is very small. Indeed in such case the effect of scaling the data between 0 and 1 will not affect the standard deviation significantly since it is already small.
What is the disadvantage of min-max scaling compared to Z-score normalization concerning outliers?

The min-max scaling is more sensitive to outliers since the data is bounded between 0 and 1 which automatically leads to information loss about outliers.
The Z-score normalization is more suited when there are outliers since the data is not bounded but simply rescaled to have mean 0 and standard deviation 1.
What kind of algorithms are not affected by feature scaling? Can you make some examples of such algorithms?

Any algorithm that is not distance based is not affected by feature scaling.
Examples of algorithms not affected by feature scaling are Naive Bayes, Tree-based algorithms and Linear Discriminant Analysis.
How to treat missing values?
There are several possible approaches to treat missing values:
- remove items with missing values
- replace the missing values
When is it reasonably possible to drop items with missing values?

When the items to remove are few compared to the size of the training set
What are the common techniques for replacing missing values?
When values are missing randomly you can:
- replace the missing values with the mean
- interpolate values
- take the previous or next value

A/B Testing

What is A/B testing?

A/B testing is a method of testing two versions of a product, for example a website or an app, to see which version performs better. In an A/B test, two versions of the products are randomly assigned to users, and the performance of the two versions is compared.

A/B testing is a valuable tool for optimizing the performance of websites and apps. It can be used to test different designs, features, and content to see what works best for users.

Correlation

What is correlation? Is correlation related with causation? What is the difference between them?

Correlation is a measure of the strength and direction of the relationship between two variables.

Causation is the relationship between two variables where one variable causes the other variable to change.

Just because two variables are correlated does not mean that there is a causal relationship between them. For example, there is a correlation between the number of ice creams sold and the number of shark attacks. However, this does not mean that eating ice cream causes shark attacks.
How to determine if there is a causal relationhip between two variables?

To determine whether there is a causal relationship between two variables, it is necessary to conduct a controlled experiment. In a controlled experiment, the researcher manipulates one variable (the independent variable) and observes the effect on the other variable (the dependent variable). If the dependent variable changes when the independent variable is changed, then the researcher can conclude that there is a causal relationship between the two variables.
What range of values can have correlation? What values are significant?

Correlation is a statistical value that quantifies the dependency between two variables. The correlation value ranges from -1 to +1. The -1 (+1) value represents a perfectly negative (positive) correlation. The close the value to 0 is, the weakest the linear correlation is. As a thumb rule you can consider values < -0.5 or > +0.5 to have a significant relationship.
There are three main types of correlation, what are they and when are they used?

The three most common type of correlation are

- Pearson. It is the most common correlation and is used only with numerica data It measures the linear relationship between two variables. It also assumes that the variables are normally distributed.

- Kendall. It is a rank correlation measure. It works with interval, ratio and ordinal data. It measures how similar ranked orderings are. No linear correlation is assumed.

- Spearman. It is is a rank correlation measure. No linearity is assumed but a monotonic relationship, i.e. an increasing (var1 increases and also var2 increases) or decreasing (var1 decreases and also var2 decreases) relationship.
How can you use correlation for feature selection?

There are at least two approaches:
- use the the correlation of each feature with the target. You can keep those features that have a significant correlation, i.e. a correlation value < -0.5 or > +0.5.
- You can use the pairwise correlation between features. If two features are highly correlated then you can drop one of them since they provide the same information.
How can you use variance for feature selection?

Features with a low variance can be dropped since (almost) constant features do not provide any information.

Dimensionality reduction

What is dimensionality reduction?

Dimensionality reduction is a technique employed to decrease the number of features in a dataset while retaining as much relevant information as possible.

The primary goal of dimensionality reduction is to produce a simplified and more compact representation of the data. By reducing the dimensionality of a dataset, it addresses various challenges associated with high-dimensional data, including the curse of dimensionality, computational complexity, and overfitting in machine learning models.
What are the main types of dimensional reduction?
The main types of dimensionality reduction are:
- Linear Dimensionality Reduction: These techniques focus on linear transformations to reduce dimensionality while preserving the most important information in the data. Examples include Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
- Nonlinear Dimensionality Reduction: Nonlinear techniques are used when the relationships between variables are not well captured by linear methods. They aim to preserve the inherent structure in the data without imposing a linear constraint. Examples include t-Distributed Stochastic Neighbor Embedding (t-SNE), Isomap, and Locally Linear Embedding (LLE).
- Neural Network-Based Dimensionality Reduction: These methods use neural networks, specifically autoencoders, to learn reduced-dimensional representations of the data. Autoencoders can be adapted for both linear and nonlinear dimensionality reduction.
- Sparse Dimensionality Reduction: Sparse coding techniques aim to represent data points using a sparse combination of basis vectors, effectively reducing dimensionality while maintaining a sparse representation.
- Manifold Learning: Manifold learning techniques focus on modeling the underlying structure of data as a lower-dimensional manifold embedded within a high-dimensional space. Examples include Isomap, LLE, and spectral embedding.
- Feature Selection: Instead of transforming the data, feature selection techniques choose a subset of the most informative features to reduce dimensionality. This can be done through filter, wrapper, or embedded methods.
What are some common algorithms of dimensionality reduction?
Some common algorithms of dimensionality reduction are:
- Principal Component Analysis (PCA): PCA is a linear technique that reduces the dimensionality of data by transforming it into a new coordinate system based on the principal components, which are orthogonal and capture the most variance in the data.
- t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear technique used for visualization and dimensionality reduction by preserving pairwise similarities between data points, making it suitable for exploring high-dimensional data.
- Linear Discriminant Analysis (LDA): LDA is a supervised technique that aims to maximize the separation between classes by projecting data points onto a lower-dimensional space.
- Autoencoders: Autoencoders are neural network-based models that can learn efficient representations of data by encoding it into a lower-dimensional space and then decoding it back to the original dimension. Variational Autoencoders (VAEs) are a popular variant.
- Isomap: Isomap is a non-linear technique that computes a lower-dimensional representation of data while preserving geodesic distances between data points in a manifold, which is useful for data with non-linear structures.
- Locally Linear Embedding (LLE): LLE is another non-linear method that preserves local relationships between data points, making it suitable for capturing the intrinsic structure of data.
What is PCA? When do you use it?

Principal component analysis (PCA) is a statistical method used in Machine Learning. It consists in projecting data in a higher dimensional space into a lower dimensional space by maximizing the variance of each dimension.

PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often used to visualize genetic distance and relatedness between populations.
PCA is affected by scale. To get the optimal performance out of PCA, which is a crucial first step?

For optimal performance in Principal Component Analysis (PCA), a crucial first step is standardization or scaling of the data. This is essential because PCA is affected by the scale of the input features.
Standardizing the data involves transforming each feature to have a mean of 0 and a standard deviation of 1. This ensures that all features are on the same scale and prevents features with larger variances from dominating the PCA results. Standardization is typically performed using the following formula for each feature:

Standardized Value = (Original Value - Mean) / Standard Deviation

Text processing

What are some steps involved in the preprocessing of a text?
The preprocessing of a text can involve
- make text lower case
- remove punctuation
- tokenize the text (split up the words in a sentence)
- remove stop words as they convey grammar rather than meaning
- word stemming (reduce words to their stems)
What is a word cloud?

The word cloud is a visualization technique for text processing where the most frequent words in a text are shown in bigger.

The word cloud is not a really scientific method but it can be very useful to capture the attention of people, for example in presentations. An example of word cloud is

Charts

What is a pie chart? When should it be used / not used?

A pie chart is basically a circle that is divided into slices. Each slice represents a group. The size of each slice is proportional to the numbers of items of that group. For example in the picture below, given a set of different fruits, you can see the proportion of each type of fruit.

Pie charts are good at showing the relative quantities of items in an intuitive manner. In particular, pie chars are good at showing what groups are more or less predominant. In the picture apples are the predominant fruit.

Pie charts should not be used for exact quantitative comparisons of slices, in particular when slices have a similar size. Indeed it is hard for the human eye to compare areas. For example, considering the picture, pie charts should not be used to find how much the cherries slide is bigger/smaller than the bananas slice.
What is a bar chart? When should it be used?

A bar chart is a chart that represents categorical data with rectangular bars. The size of each bar is proportional to the value represented.

A bar chart is useful when you want to compare values. In particular a bar chart can also be used to compare values that are pretty similar. In this case a bar chart is a better choice than a pie chart because pie charts are not particularly suitable for comparing similar values.
What are some difference between a bar chart and a histogram?
A bar chart
- shows the frequency of each value of a categorical variable.
- each column represents a single value.
- there are gaps between columns.
An histogram
- is used to represent the distribution of a continuous variable.
- each column represents a range of values.
- there are no gaps between columns.
- the area of each column is proportional to the frequency.
What is a histogram particularly good at?

A histogram is particularly good at representing grouped data. The frequency of each group is represented by the area of the column. Because the width of the columns is proportional to the width of the groups we can have columns with different column widths. The height of the column is then given by frequency / width.
Can you draw a boxplot and explain it?
A boxplot is a chart specialized in visualizing and comparing ranges.

The elements of a boxplot are:
- upper extreme (95th Percentile) = exactly 5 percent of the values in the data are greater than this value (may instead be the highest value if not displayed separately as dots).
- lower extreme (5th Percentile) = exactly 5 percent of the values in the data are less than this value (may instead be the smallest value if not displayed separately as dots).
- lower quartile = lowest value of all quartiles. Quartiles are a set of three points that divide the dataset into four equal parts.
- upper quartile = highest value of all quartiles
- median = middle value of the sorted dataset
What is a heatmap? When is it used?
A heatmap is a two-dimensional grid that is used for visualizing numerical data organized in a table-like format.

Each cell of the heatmap corresponds to a specific row and column in the data, while the cell color represents the magnitude of the corresponding value. As a result, different values are mapped to different colors, allowing to identify similarities in the data.

Heatmaps are used to explore relationships and patterns in datasets. Some common uses are
- Correlation Analysis: positive and negative correlations between variables are visualized as different colors and thus easy to recognize.
- Confusion Matrix Analysis: the performance of classification models can be put in matrix form and displayed by heatmaps.
- Genome Analysis: heatmaps are widely used in genomics to visualize gene expression levels.
An example of table-like dataset and respective heatmap is shown below:
What is a violin plot? When is it particularly useful?
A violin plot is a statistical data visualization that can be thought of as an extension of a boxplot. To be more precise, a violin plot includes all the information contained in a boxplot with the addition of a kernel density plot.

While a boxplot shows statistical data such as mean, median and interquartile ranges, a violin plot also includes the data distribution.

A violin plot is particularly useful in the following scenarios:
- comparing distributions: by placing the distributions side by side, it is easier to compare distributions, allowing for a better exploratory analysis.
- multiple modes, skewness or outliers in the data: detecting such elements gives you a better understanding of the data.
A violin plot is shown in the following image. You can see the distributions of the average salaries for engineers and workers in a violin shape. The three dotted lines in a violin plot represent the lower quartile, the median, and the upper quartile of the data set.

Machine Learning

Machine Learning basic concepts

What is the difference between data science and machine learning?

Data science and Machine learning are closely related fields but they differ in their goals.

Data science is more focused on the gathering and analysis of data. Data science includes gathering data from different sources, cleaning the data, and analyzing the data to extract useful information.

Machine learning is more focused on creating a mathematical model from a dataset to predict something. For example given a dataset of house prices, you would like to predict how the house prices will evolve in the next months based on the historical data in the dataset. With machine learning algorithms you can create mathematical models that can make such a prediction.
What are the most common types of problems that machine learning can help solve?
Machine learning can help solve a large number of problems which can be grouped into the following types:
- Classification: the goal of classification is to predict the category of an input item. Examples of classification are email spam detection and disease diagnosis.
- Regression: the goal of regression is the prediction of a numerical value. Examples of regression are predicting the price of houses or the future demand for products.
- Clustering: the goal of clustering is to group items in a way that similar items are within the same group. Examples of clustering are grouping customers based on purchasing behavior or grouping documents by topic.
- Recommendation: the goal of recommendation is to provide personalized suggestions. Examples of recommendations are movie recommendations and personal marketing based on personal preferences.
- Anomality detection: the goal of anomaly detection is to identify unusual patterns in the data. Examples of anomaly detection are credit fraud detection and intrusion detection in computer networks.
- Time series analysis: the goal of time series analysis is to analyze data points collected at specific time intervals. Examples of time series analysis are energy consumption prediction and weather forecasting.
What is bias in machine learning?
Bias in machine learning is a phenomenon that occurs when a model produces consistently unfair results or inaccurate for certain groups of people due to erroneous assumptions in the machine learning (ML) process.

Bias can happen for several reasons:
- data bias: data bias can occur when the training data fails to adequately capture the diversity and complexity present in the real-world data for a specific problem.
- Algorithms bias: an algorithm itself can be biased, either due to the way it is trained or designed. For example, it can occur that an algorithm inadvertently learns biased patterns in the data. This can happen if an algorithm is not carefully designed to account for bias in the training data.
What does high bias mean for a machine learning model?

In machine learning, high bias refers to a situation where a model has a strong and often simplistic bias or assumption about the underlying data.

High bias can lead to the model underfitting the data, meaning it fails to capture the true underlying patterns and relationships in the data. In other words, the model is too simple or constrained to represent the complexity of the data, resulting in poor performance.
What is the variance of a machine learning model?

Variance in machine learning is the amount by which a model's predictions change when it is trained on different subsets of the training data.

In other words, variance measures how much the model overfits the training data. A model with high variance will work well with the training data but not generalize well to the test data.

A model with low variance will be less sensitive to the training data and will perform well also with the test data.
What are the implications of high variance in a machine learning model?

A machine learning with high variance is a model for which small changes in the training data set lead to big changes in the model prediction. This is equivalent to saying that the model overfits the training data, i.e. the model works well on the training data but does not generalize well on the test data.

In general, high variance in a machine learning model can lead to poor generalization performance, meaning that the model will not perform well on new data that it has not seen before.
What is the difference between bias and variance in machine learning?

Bias and variance are two important concepts in machine learning. Bias is the error that occurs when the model is not able to learn the true relationship between the input and output variables. Variance is the error that occurs when the model is too sensitive to the training data and does not generalize well to new data.

A model with high bias will underfit the training data and will not be able to capture the true relationship between the input and output variables. A model with high variance will overfit the training data and will not generalize well to new data. It is important to find a balance between bias and variance to create a model that is both accurate and generalizes well to new data.
Explain the bias-variance trade-off in the context of model complexity.
The bias-variance trade-off is a fundamental concept in machine learning that deals with the relationship between model complexity and its performance:
- High Bias (Underfitting): When a model is too simple or has too few parameters to capture the underlying patterns in the data, it exhibits high bias. This leads to underfitting, where the model performs poorly on both training and test data.
- High Variance (Overfitting): When a model is overly complex or has too many parameters, it becomes highly sensitive to noise and random variations in the training data. This results in high variance, leading to overfitting, where the model performs well on training data but poorly on new, unseen data.
Achieving a balance between bias and variance is crucial. Regularization techniques, cross-validation, and hyperparameter tuning can help manage the bias-variance trade-off.
Explain overfitting and underfitting in the context of machine learning models.

Overfitting occurs when a machine learning model performs very well on the training data but poorly on new, unseen data. Overfitting happens when the model is too complex, capturing noise and random fluctuations in the training data, rather than the underlying patterns.

Underfitting, on the other hand, happens when a model is too simple to capture the underlying patterns in the data. It performs poorly both on the training data and unseen data. It is characterized by high bias and low variance.
What are some techniques that you can use to prevent overfitting in machine learning?

Overfitting happens when your model works well with the training dataset but does not work well with new data, which means the model cannot generalize well.

To prevent overfitting you can for example use regularization and cross-validation.
What is regularization?

Regularization is a technique used to prevent overfitting in machine learning models. Regularization involves adding a penalty term to the loss function during training, which discourages the model from assigning too much importance to any particular feature. Common forms of regularization include L1 (Lasso) and L2 (Ridge) regularization.

Regularization helps in reducing the complexity of the model, leading to better generalization of unseen data. It is controlled by a regularization parameter, which determines the trade-off between fitting the training data well and keeping the model's complexity in check.
What is the curse of dimensionality?

The curse of dimensionality is a tendency that it is easier to overfit a dataset when there are few points and many features. Data needs to increase exponentially with the number of features in order not to have overfitting.
What is supervised learning?

Supervised learning is a machine learning process where the learning algorithms are trained on labeled datasets, which means that each point in the dataset is associated with a class or label.

The primary goal of supervised learning is to create a mathematical model that maps input data to the corresponding output or target variable, so that the model can make accurate predictions or classifications when given new, unseen input data.
What is unsupervised learning? Make also some examples of algorithms
Unsupervised learning is a machine learning process with unlabeled data.

Some examples of unsupervised learning are
- Clustering algorithms (K-means, hierarchical custering, Probabilistic clustering)
- Dimension reduction algorithms (PCA, Single Value Decomposition SVD)
What is the difference between supervised and unsupervised learning?

Supervised learning is a type of machine learning where the algorithm is trained on a labeled dataset, meaning that the input data is paired with corresponding output labels. The goal of supervised learning is to learn a mapping from inputs to outputs, making it suitable for tasks like classification and regression.

Unsupervised learning, on the other hand, deals with unlabeled data. The algorithm tries to find patterns or structure in the data without the guidance of labeled output. It includes tasks like clustering and dimensionality reduction.
What are the two main types of supervised learning algorithms?
The two main types of supervised learning algorithms are
- Classification: its goal is to predict the class label of new instances based on past observations. The algorithms learn from a labeled dataset, where each instance has a class label assigned. Some examples of classification problems are spam detection and image recognition.
- Regression: it deals with predicting a continuous numerical value rather than a categorical label. Some example of regression problems include predicting house prices, stock prices or the temperature.
Can you mention some supervised machine learning algorithms?

The most common supervised machine learning algorithms are
K-nearest neighbor, Naive Bayes, Decision Trees, Linear Regression, Support Vector Machines.
What is reinforcement learning?

Reinforcement learning is a machine learning algorithm that receives feedback on its output so that the accuracy of the output is improved based on this feedback. In other words, the algorithm learns through trial and error.
How do you handle imbalanced datasets in machine learning?

Imbalanced datasets are datasets where the target variable is not evenly distributed. For example, a dataset of fraudulent transactions may be imbalanced, with only a small percentage of transactions being fraudulent. There are several ways to handle imbalanced datasets in machine learning. One approach is to oversample the minority class. This involves creating synthetic data points for the minority class so that it is represented more evenly in the dataset. Another approach is to undersample the majority class. This involves removing data points from the majority class so that it is represented more evenly in the dataset. It is important to carefully select the approach to handling imbalanced datasets, as the wrong approach can lead to biased models.
What are hyperparameters?

Hyperparameters are parameters that are passed to machine learning models to control the learning process. The hyperparameters are set before starting the training model and their selection plays a critical role in achieving optimal results.

Examples of hyperparameters are the k value in the k-nearest neighbor algorithm or the number of nodes in a neural network.
What are the main differences between parameters and hyperparameters?

Parameters and hyperparameters are crucial in machine learning models, but they play different roles. Here are the key differences between them:

Parameters are internal variables within machine learning models. They are computed automatically by the machine learning algorithms during the training process. The purpose of parameters is to fine-tune the model's performance based on the training data, to minimize the model error or loss function.

Hyperparameters are passed to machine learning algorithms before the training process begins. Hyperparameters influence how the algorithms learn and generalize from data. Data scientists or machine learning engineers set these values, often utilizing various hyperparameter tuning techniques to optimize the model's performance.
What are ensemble methods in machine learning?

Ensemble methods are techniques that combine the results of several machine learning models rather than using a single model. The basic idea behind ensemble methods is that the combination of the results of several models should be more accurate than using a single model.

To provide an example in real life, ensemble methods can be compared to asking several doctors for a diagnosis rather than relying on a single doctor. In the former case, one would expect the diagnosis to be more accurate.

Evaluate machine learning methods

What is cross-validation in machine learning?
Cross-validation is a technique used to assess the performance and generalization of a machine learning model. It involves splitting the dataset into multiple subsets, typically a training set and a validation set, multiple times. The model is trained and evaluated on different subsets to get a more accurate estimate of its performance.

Cross-validation is essential because it helps in:
- Detecting overfitting or underfitting.
- Providing a more reliable estimate of a model's performance.
- Ensuring that the model can generalize well to new, unseen data.
What is K-fold cross-validation?
K-fold cross-validation is a specific type of cross-validation where the dataset is split into K equal parts.
The steps of K-fold cross-validation are:
- split a dataset into K equal parts. Each part is called a fold.
- take one fold i as the test set and take the remaining (K-1) folds as the training set.
- train your classification algorithm on the training set and evaluate its performance on the test set.
- repeat the process for each fold. As a result, for each fold, we get an accuracy value and in total, we get K accuracy values.
- the final accuracy value for your classification algorithm is given by the average of the K accuracy values.
What matrix can you use to evaluate the performance of a classifier? How does an ideal matrix look like?
Let's suppose we have a binary classifier that can assign an item to two possible classes, which are typically labeled as positive (1) and negative (0).

To evaluate the performance of the classifier you can use the confusion matrix. The confusion matrix displays the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
- True positives (TP) = number of items with label positive (1) that are classified correctly as positive.
- True negatives (TN) = number of items with label negative (0) that are classified correctly as negative.
- False positives (FP) = number of items with label negative (0) that are classified wrongly as positive.
- False negatives (FN) = number of items with label positive (1) that are classified wrongly as negative.
The confusion matrix looks like

actual 1 actual 0

predicted 1 TP FP

predicted 0 FN TN

An ideal confusion matrix has high values of TP and TN because these quantities indicate the correctly classified items.
Why is the confusion matrix called like that?

The confusion matrix is a table that is used to evaluate the performance of a classifier. It displays the number of correctly classified items (true positives and true negatives) and wrongly classified items (false positives and false negatives).

The confusion matrix is called like that because it helps you identify where the classifier has made wrong predictions, which is where the classifier was confused. As a result, you can gain insight about the performance of the classifier and take corrective action to minimize the number of wrongly classified items.
Can you mention some of the most common metrics used to evaluate binary classifiers?
A binary classifier is a machine learning model that assigns an item to one of two possible classes. The two classes are typically referred to as the positive class and the negative class.
In order to evaluate the performance of binary classifiers, the following terms are used:
- True positives (TP) = number of the positive class items that are classified correctly as positive.
- True negatives (TN) = number of the negative class items that are classified correctly as negative.
- False positives (FP) = number the negative class items that are classified wrongly as positive.
- False negatives (FN) = number of the positive class items that are classified wrongly as negative.
Now that we have defined TP, TN, FP and FN and given n as the total number of items, we can introduce some of the most common metrics for evaluating binary classifiers.
- Accuracy is the percentage of the correctly classified items.
  Accuracy =
  TP + TN / n
- true positive rate or sensitivity or recall is the percentage of the positive class items that are actually classified as positive.
  Sensitivity =
  TP / TP + FN
- true negative rate or specificity is the percentage of the negative class items that are actually classified as negative.
  Specificity =
  TN / TN + FP
- Precision is the percentage of the items classified as positive that actually belong to the positive class.
  Precision =
  TP / TP + FP
- False positive rate (FPR) represents the proportion of negative class items that are incorrectly classified as positive.
  FPR =
  FP / TN + FP
- False negative rate (FNR) represents the proportion of positive class items that are incorrectly classified as negative.
  FNR =
  TN / TP + FN
What is the difference between precision and recall in binary classification? How are they related?
Precision and recall are two important metrics in binary classification.
- Precision measures the proportion of true positive predictions among all positive predictions made by the model.
  Precision is calculated as
  TP / TP + FP
  , where TP is true positive, and FP is false positive.
- Recall (Sensitivity) measures the proportion of true positive predictions among all actual positive instances.
  Recall is calculated as
  TP / TP + FN
  , where FN is false negative.
Precision and recall are related through a trade-off. Increasing one often leads to a decrease in the other. A model with high precision makes fewer false positive errors but may miss some true positives. Conversely, a model with high recall captures more true positives but may produce more false positives.
What metrics are more meaningful with class-balanced datasets and which ones work better for class-unbalanced datasets?

In a class-balanced dataset, the number of items in each class is approximately the same.

When the dataset is class-balanced, accuracy is a good metric. Indeed, accuracy assigns equal weight to both true positives and true negatives. Accuracy is calculated as the ratio of correctly classified instances (both true positives and true negatives) to the total number of instances, as indicated by the formula
Accuracy =
TP + TN / n

When the dataset is class-unbalanced, one class has significantly more items than the other classes. In that case precision and recall are better metrics. In class-imbalanced scenarios, the class with fewer instances is often the class of more significance. For example, in fraud detection, the number of actual fraud cases is much lower than non-fraud cases, but correctly identifying fraud is crucial.
Let TP represent the number of correctly identified frauds. Precision and recall provide a better indication of the model's ability to identify the minority class accurately, as they have TP in the numerator as follows
Precision =
TP / TP + FP

Recall =
TP / TP + FN

By focusing on the true positives, precision and recall provide insights into how well the model performs in the context that matters most.

Precision: High precision indicates that when the model predicts the positive class, it is often correct. This is important when misclassifying positive instances (false positives) could have significant consequences. Precision tells you the proportion of correctly identified positive predictions out of all predicted positive instances.

Recall: High recall indicates that the model is effective at capturing most of the actual positive instances. This is crucial when you want to make sure you don't miss many of the positive instances. Recall tells you the proportion of correctly identified positive predictions out of all actual positive instances.

However, it's important to remember that precision and recall are trade-offs: increasing one can sometimes lead to a decrease in the other. Therefore, you should choose the metric that aligns with your specific goals and priorities in your application.
To evaluate machine learning models for cancer detection what metrics do doctors typically use?
Machine learning models for cancer detection aim to classify patients as having cancer (positive) or healthy (negative). In cancer detection, it is important to keep in mind that the datasets are typically strongly unbalanced, with few cancer samples and many healthy samples.

Because of this, it is evident that accuracy cannot be used since it requires balanced datasets. Instead, the typical metrics used for cancer detection are as follows:
- Sensitivity: sensitivity measures the ability of the model to correctly identify patients who have cancer among all the patients who have cancer. A high sensitivity indicates that the model is good at catching cancer cases.
- Specificity: specificity measures the ability of the model to correctly identify patients who do not have cancer among all the patients who do not have cancer. A high specificity indicates that the model is good at avoiding false alarms for patients without cancer.
The choice of what metric to use should be tailored to the specific clinical context, as different cancer types and clinical scenarios may have varying requirements.
What is the meaning and use of the F-measure (also called F1-score)?

The F-measure, also known as the F1-score, is a metric for the evaluation of binary classifiers that combines precision and recall using the harmonic mean.

F-measure =
2 * (Precision * Recall) / (Precision + Recall)

The F-measure value ranges between 0 (worst case) and 1 (best case). The F-measure can be used when you want to evaluate the performance of a binary classifier, taking into account both precision and recall. A high F-measure value suggests a good balance between precision and recall, while a low F-measure value indicates that either precision or recall might be low, highlighting potential issues with the classifier's performance.
If you want to evaluate the performance of a classifier with a curve what do you use?

To evaluate the performance of a classifier you can use the ROC curve (Receiver Operating Characteristic). The ROC curve is a graphical representation of the performance of a binary classification model for different values of a threshold that affects the classification performance. The ROC curve illustrates the trade-off between the classifier's sensitivity (True Positive Rate) and its specificity (the complement of True Negative Rate, often referred to as False Positive Rate) as you vary that threshold.

Sensitivity = probability of predicting that a real positive will be positive.
Specificity = probability of predicting that a real negative will be negative.

The goal when assessing a classifier's performance is to maximize sensitivity while minimizing False Positive Rate (1-specificity). In other words, a high sensitivity and low False Positive Rate are desirable.

The ROC curve is a graphical representation of this trade-off. It is a plot of sensitivity against 1-specificity for different threshold values. The closer the ROC curve is to the top-left corner (point [0, 1]), the better the classifier's performance. Alternatively, the farther the curve is from the diagonal line connecting points [0, 0] and [1, 1], the better the classifier performs. This diagonal line represents the expected performance of a random classifier.

To quantify the overall performance, the area under the ROC curve (AUC) is calculated. A perfect classifier would have an AUC of 1, indicating perfect discrimination. A random classifier would have an AUC of 0.5, corresponding to the diagonal line. Therefore, a higher AUC value indicates better classifier performance, with a value close to 1 being the ideal outcome."
To evaluate the performance of a machine learning model, you can use metrics like accuracy, precision, recall, and F1-score. When to use what?

The most common metrics that you can use for evaluating the performance of a machine learning model are accuracy, precision, recall, and score. It is important to be aware of in what case a metric better fits for evaluation.

Accuracy: is a good choice to start with if the classes are balanced.

Precision and Recall become more important when the classes are unbalanced. If the false positive predictions are worse than false negatives, aim for higher precision. If the false negative predictions are worse than the false positive, aim for higher recall.

F1-score is a combination of precision and recall. It is a good choice when you need to consider both false positives and false negatives. The F1 score is especially useful when you want a single metric that combines both precision and recall, and you are dealing with imbalanced datasets.

	actual 1	actual 0
predicted 1	TP	FP
predicted 0	FN	TN

Naive Bayes

What is Naive Bayes?

Naive Bayes is a family of machine learning algorithms that use Bayes' theorem to perform classification tasks. In particular Bayes' theorem is used to predict the probability of a class label based on some features. The naive assumption of Bayes' theorem is that the features are conditionally independent of each other, given the class label. This simplifies the computation and makes the algorithms scalable and efficient. Naive Bayes classifiers can handle both discrete and continuous features, and can be trained using maximum likelihood estimation
Why is Naive Bayes called naive?

The Naive Bayes is called naive because it is based on the idea that the predictor variables are independent of each other while in nature this is often not the case.

A Naive Bayes classifier calculates the probability of an item belonging to each possible output class. The class with the highest probability is then chosen as output.
What is Bayes' theorem?

Bayes' theorem is a mathematical formula for finding P(A|B) from P(B|A). The formula is

P(A|B) =
P(B|A) * P(A) / P(B)

where
P(A) = probability that event A happens.

P(A|B) = probability that event A happens assuming that event B has already happened.

P(B) = probability that event B happens.

P(B|A) = probability that event B happens assuming that event A has already happened.
What are the main advantages of Naive Bayes?

The Naive Bayes is a relatively simple and very quick algorithm since it is a probabilistic model that does not require a training step. This makes it very scalable.
When is the Naive Bayes usually used?
Naive Bayes is often used in
- text classification like spam filter
- real-time classification since it is fast
- multi-class prediction
When Naive Bayes is used with numerical variables, what condition is assumed on the data?

Numerical variables used in Naive Bayes are expected to have a normal distribution. This is not always the case in practice.
How does the Naive Bayes perform with categorical variables?

It performs well with categorical variables since no assumption is made on the data distribution. By contrast, with numerical variables, a normal distribution is assumed.

Regression

What is linear regression?

Linear Regression is a supervised Machine Learning algorithm. It is used to find the linear relationship between the dependent and the independent variables for predictive analysis
What are the drawbacks of a Linear Model?
Some drawbacks of a linear model are:
- The assumption of linearity of the model.
- It can't be used for count outcomes or binary outcomes.
- There are overfitting or underfitting problems that it can't solve.
What do you predict with linear regression?

Linear regression is used to predict the value of a dependent variable y based on the values of one or more independent variable.
A representation of a linear regression is
y = x₀·k₀ + x₁·k₁ + ... + x_n·k_n
where y is the dependent variable, x₀...x_n the independent variables and k₀...k_n the linear coefficients that represent the weight of each independent variable.
What are the most common techniques used for computing the coefficients in linear regression?
The most common techniques used for computing the coefficients in linear regression are
- The Ordinary Least Squares method aims to minimize the sum of the squared residuals, i.e. the distance between the points and the regression line.
- Gradient descendent works by inizializing randomly the coefficients. A learning rate is used as a scale factor and the coefficients are updated in the direction of minimizing the error. The process is repeated until a minimum sum squared error is achieved or no further improvement is possible.
What is logistic regression?

Logistic regression is a statistical technique that is used to analyze a dataset and predict the binary outcome. The outcome has to be a binary outcome that is either zero or one or a yes or no. In other words, the output of a logistic regression is always categorical (= discrete)
What do you predict with logistic regression?

Logistic regression predicts the probability that a variable belongs to one class or another.
Can you explain a real use case for logistic regression?

You can use logistic regression to predict the probability that a customer will buy a product.
How can you turn logistic regression into a classifier?

To turn logistic regression into a classifier you can use a threshold. If the the output of the logistic regression is above (below) the threshold, then the output can be classified as class zero (class one).
What are some metrics that you can use to evaluate regression models?
To evaluate regression models you can use a number of metrics including
- $R^2$ (also known as the coefficient of determination) represents how well the variables of the regression model describe the variance of the target variable.
- Mean absolute error (MAE) is the average of the absolute distances between predictions and actual values.
- Mean squared error (MSE) is the mean of the square of the absolute distances between predictions and actual values. Notice that larger values, i.e. outliers, are amplified because the values are squared.
To evaluate the performance of regression models you can use the R-squared ($R^2$) metric. What is R-squared? How is it computed?
Given a regression model, the R-squared ($R^2$) metric, also known as coefficient of determination, is a measure that represents how well the variables of the regression model describe the variance of the target variable.

$R^2$ ranges from negative infinity to 1:
- $R^2$ = 1 indicates that all the variance of the target variable is explained by the independent variable(s)
- $R^2$ > 0 indicates that a portion of the variance of the target variable is explained by the independent variable(s)
- $R^2$ = 0 indicates that no information variance of the target variable is explained by the independent variable(s)
- $R^2$ < 0 suggests that the independent variable(s) have no predictive power or could be negatively correlated with the target variable.
Given the real values $yreal_i$ and the predicted values $ypred_i$, the formula of R-squared $R^2$ is $$R^2 = 1 - \frac{\sum_{i=1}^{n} (yreal_i - ypred_i)^2} {\sum_{i=1}^{n} (yreal_i - \overline{yreal})^2} $$

K-Nearest neighbor algorithm (KNN)

What is the K-nearest neighbor (KNN) algorithm? For what is KNN used?

The k-nearest neighbor (KNN) algorithm is a machine learning algorithm that finds the K closest (= most similar) data points of a given data point based on a certain metric, typically the Euclidean distance.

The KNN algorithm can be used both for classification and regression. If you want to classify a data point, you can assign it to the most common class among the K neighbors. If you want to predict a value, you can compute the average value among the K neighbors.
What are some advantages and disadvantages of the KNN algorithm?
Some advantages of the KNN algorithm are
- simple concept
- builiding a model is cheap
- no assumption is made about the data distribution
Some disadvantages of the KNN algorithm are
- Classifying unknown records is expensive, in particular with big data sets, since you have to find the K nearest neighbors
- pretty affected by data missing and data scaling
What are some advantages of using the KNN algorithm on smaller datasets?
The KNN (K-nearest neighbors) algorithm has some advantages when used on smaller datasets:
1. computational complexity: the computational complexity of the KNN algorithm does increase as the dataset size increases because KNN needs to calculate all the distances between datapoints. This process is faster when the dataset is smaller.
2. curse of dimensionality: the performance of the KNN algorithm can degrade with high dimensional datasets. Indeed, in high-dimensional spaces, the data points become more sparsely distributed and this makes it more challenging to find the nearest neighbors. This can affect negatively the model performance.
3. data density: in smaller datasets, the data points are often more densely distributed. As a result, there is a higher likelihood of finding nearby data points that are relevant to capturing complex and nonlinear relationships within the datasets.
4. ease of interpretation and manual inspection: KNN provides interpretable results by directly identifying the nearest neighbors that affect a prediction. This makes KNN transparent and interpretable, allowing domain experts to manually inspect and verify the predictions. In situations where domain knowledge is crucial, this interpretability can be valuable.
5. simplicity: KNN is conceptually a simple algorithm that has only one parameter, that is the k the number of neighbors. The simplicity of the model and the limited number of parameters make it easier to fine-tune and experiment with different k values, particularly on smaller datasets.

Decision trees

What is a decision tree?

A decision tree is a supervised machine learning algorithm that generates a flowchart from a dataset. This flowchart can be represented as a tree-like structure. You can also think of a decision tree as a sequence of if-else conditions where a feature is evaluated at each if-else condition.

For example, the decision tree in the picture below models the question of going to the beach. According to this decision tree, we go to the beach only when the weather is sunny and the temperature is higher than 25 °C.
How are decision trees built?

Building a decision tree is a recursive process. At each step the algorithm searches for the best possible feature that can be used to split the dataset while minimizing the entropy within each subset. The subsets are then split repeatedly into even smaller subsets, and so on and so forth until the process stops when the algorithm determines the data within the subsets are sufficiently homogenous, or another stopping criterion has been met.

The entropy describes the variety in a dataset. An entropy of 0 means that all items are in the same class. The maximum entropy possible is when each item is in a different class.

This algorithm is greedy since it chooses a local optimal solution at each step.
What are the main advantages of decision trees?
The main advantages of decision trees are:
- Output simple to understand: the output tree is human readable, i.e. it can be interpreted easily. This makes decision trees particularly interesting for cases where the decision mechanism needs to be transparent (for example in medical diagnostics).
- Minimum data preparation: decision trees do require less data cleaning and data preparation. Indeed they do not need data normalization/scaling and they can work well even with missing values.
What are some disadvantages of decision trees?
Some disadvantages of decision trees are:
- Overfitting: decision trees can easily overfit the data.
- Instability: small changes in the input data can lead to a very different tree output structure.
- Complex data structure: the structure of decision trees can become complex for example when the dataset has a lot of features.
What technique can you use to reduce the overfitting of a decision tree? How many approaches are there?
To reduce the overfitting of a decision tree you can use pruning, which reduces the size of the tree. There are two approaches to pruning
- Pre-pruning: the tree stops growing when it reaches a certain number of decisions or when the decision nodes contain too few items. A disadvantage of this approach is that some important data could be pruned.
- Post-pruning: the tree can grow as needed and only at the end it is checked if the tree is too big. This approach guarantees that the important data is not pruned.
What are the main techniques used to combine decision trees to obtain a more accurate model (so-called ensemble techniques)?
There are two main ensemble techniques:
- Boosting algorithms: this is a sequential process. In boosting algorithms learners are learned sequentially with early learners fitting simple models to the data and then analyzing data for errors. After each training step, the weights are redistributed. Misclassified data gets an increased weight so this data will be more of a focus in the next training step.
- Bagging techniques: this is a parallel process. In a bagging technique, the dataset is divided into n samples using randomized sampling. Then a model is built on each sample. After that, the resulting models are combined using voting or averaging.
What is Random Forest? How does it work?
Random forest is a versatile machine learning method capable of performing:
- Regression
- Classification
- Dimensionality Reduction
- Treat Missing values
Random forest is an ensemble learning method, where a group of weak models combine to form a powerful model. The random forest starts with a standard machine learning technique called a “decision tree” which, in ensemble terms, corresponds to our weak learner. In a decision tree, an input is entered at the top and as it traverses down the tree the data gets bucketed into smaller and smaller sets.

In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on attributes, each tree gives a classification. The forest chooses the classification having the most votes (Overall the trees in the forest) and in the case of regression, it takes the average of outputs by different trees.

Support Vector Machine

What is a Support Vector Machine and what is its main idea?

Support Vector Machine (SVM) is a supervised machine learning algorithm that tries to find high-dimensional planes that divide a dataset into clusters. The planes are found in such a way that they are as far as possible from any point.

The position of the planes depends on the so-called support vectors, which are the points that are closest to the planes.
What is the technique called kerneling that is used by Support Vector Machine?

Under the hood, SVM uses a technique called kerneling to find clusters that might not be apparent in lower dimensions. More specifically, the dataset is mapped via kernel functions into a higher dimension where the data is potentially easier to cluster.
What is a Support Vector Machine particularly good at? What are some disadvantages?
Support Vector Machine (SVM) is particularly good at classifying high-dimensional data, i.e. data with a lot of features.

SVM has some disadvantages:
- It does not perform well with large datasets because of high computational costs.
- The kernel must be chosen carefully and this can be tricky.
What is the difference between batch gradient descent and stochastic gradient descent?

Batch gradient descent computes the gradient of the cost function with respect to the entire training dataset,
whereas stochastic gradient descent updates the model's parameters based on the gradient of the cost function with
respect to only one training example at a time. Mini-batch gradient descent is a compromise between the two,
using a subset of the training data.

Time Series

What is time series data and what are the characteristics and types of time series data?
Time series data consists of a series of data points collected or recorded at specific time intervals. These data points are typically ordered chronologically and represent values or observations at successive time points. Time series data is prevalent in various fields, including finance, economics, engineering, environmental science, and many others.

Time series data exhibits the following characteristics:
- time order: Data points are arranged chronologically from the past to the present or from the present to the future.
- sequential dependence: The value of a data point at a particular time is dependent on values at previous time points.
There are two main types of time series data:
- univariate: in this type, a single variable is measured over time (e.g., daily temperature readings).
- multivariate: this type involves the measurement of multiple variables measured over the same time periods (e.g., hourly recordings of stock prices and trading volumes).
What is the difference between time series data and cross-sectional data?
The primary difference between time series data and cross-sectional data lies in how the data is collected and the nature of the observations:
- Time series data is collected over a period of time, with observations recorded sequentially at regular intervals (e.g., stock prices over days).
- Cross-sectional data is collected at a single point in time or over a very short period, capturing a snapshot of multiple variables at once (e.g., survey data collected from different individuals).

Deep Learning

Introduction to Deep Learning

What are the differences between machine learning and deep learning?
Deep learning is a subset of machine learning. Both machine learning and deep learning try to create a mathematical model of a dataset. This mathematical model can then be used to make predictions on new input data.
However deep learning and machine learning differ in some points:
- Techniques used: machine learning can use several different techniques to create a mathematical model, like regression, neural networks, nearest neighbors and decision trees. By contrast, deep learning is based only on neural networks. These neural networks have a large number of hidden layers, which is what the word deep refers to.
- Quantity of data: while machine learning algorithms can also work with small datasets, deep learning requires very large datasets.
- Feature selection: in machine learning, the user selects the most important features or creates new ones for the task at hand. By contrast, in deep learning the feature selection is done automatically by the algorithms during the training process.
What is Deep Learning used for?

Deep learning is used in a wide range of applications, including image and speech recognition, natural language processing, autonomous vehicles, recommendation systems, and more.
To what does the term deep refer in deep learning?

Deep learning is a subset of machine learning methods that relies on artificial neural networks (ANNs) with representation learning. The term deep specifically refers to the use of multiple hidden layers within the neural network architecture. These layers allow the network to progressively extract higher-level features from raw input data, resulting in more complex and abstract representations.
What is a deepfake?

A deepfake refers digital content, like videos or pictures, that are created using deep learning algorithms. These creations convey information or events that are either false or have never occured in reality.

Essentially, deepfakes manipulate media to present a distorted version of the truth.

Neural Networks

What is an Artificial Neural Network (ANN)?

An artificial neural network is a computational model inspired by the structure and functioning of the human brain. It consists of interconnected nodes (neurons) organized in layers, typically an input layer, one or more hidden layers, and an output layer.
What are Neurons in a Neural Network?

Neurons (also called nodes) in a neural network are basic computational units that receive inputs, perform a weighted sum of those inputs, and apply an activation function to produce an output. They are the building blocks of the network.
What is a Shallow Neural Network and Deep Neural Network?

A shallow neural network is a neural network which typically has only one or two hidden layers (layers between the input layer and the output layer).

By contrast, Deep neural networks have multiple hidden layers. They are capable of learning complex representations, whereas shallow networks are less capable of capturing intricate patterns.

Types of Neural Networks

What are some types of neural networks? What are they used for?
Some types of neural networks are:
- Convolutional Neural Networks (CNNs): used for image and video analysis, object recognition, and computer vision tasks.
- Recurrent Neural Networks (RNNs): ideal for sequence data, like natural language processing, time series analysis, and speech recognition.
- Long Short-Term Memory (LSTM) Networks: a type of RNN that's effective at learning long-term dependencies in sequences.
- Gated Recurrent Unit (GRU) Networks: another variant of RNNs, useful for many of the same tasks as LSTMs but with lower computational complexity.
- Autoencoders: often employed for dimensionality reduction, feature learning, and data denoising.
- Generative Adversarial Networks (GANs): used for generating new data samples, image-to-image translation, and creating realistic images.
- Transformers: ideal for natural language processing tasks, like machine translation, language understanding, and text generation.
What is a Convolutional Neural Network (CNN)?

Convolutional Neural Networks (CNN) are a type of neural network designed for processing grid-like data, such as images and video. They use convolutional layers to automatically learn features from the input data.
What is a Recurrent Neural Network (RNN)?

Recurrent Neural Networks (RNN) are neural networks with loops that allow information to be passed from one step of the network to the next. They are commonly used for sequential data, like time series and natural language processing.
What kind of neural networks works particularly well with image data?

Convolutional neural networks are a kind of neural network that works particularly well with image data.
The basic idea behind convolutional neural networks is to divide the input image into small fragments and apply a number of filters on each fragment. Applying a filter is equivalent to searching for a pattern in the fragment. Each filter has a specific task, like searching for edges or for circles.
In this way convolutional neural networks can better deal with images that are not very similar, for example because the images are rotated differently.
What kind of neural network is particulary good at processing text data?

A kind neural network that is particulary good at processing text data is recurrent neural networks.
A recurrent neural network has an intern loop, that means the data is not processed all at once but rather is processed in different steps. In particular in the case of text, each word is processed once at a time. This process is similar to the way humans read text.
What is the self-attention technique that is used in deep learning?

Self-attention is a powerful mechanism used in deep learning, particularly in natural language processing (NLP) and computer vision tasks. Self-attention allows neural networks to capture contextual dependencies and relationships within input sequences.

For each token, self-attention computes a weighted sum of all other tokens in the sequence, where the weights reflect the relevance of each token to the current one. The attention mechanism learns to focus on relevant tokens dynamically, rather than relying on fixed positional embeddings.

Self-attention allows a model to consider all positions in the input sequence when making predictions for a specific position. By constrast, traditional neural network models (like RNNs and CNNs) have fixed receptive fields, limiting their ability to capture long-range dependencies. Self-attention overcomes this limitation.
What is the transformer architecture?

The transformer architecture is a neural network architecture introduced at Google in 2017. In general, a transformer is something that transforms one sequence into one another, like a translation from one language into another language.

Transformers consist of two parts: an encoder and a decoder. The encoder takes the input sequence and generates encodings which define what parts of the input sequence are related with each other. The decoder takes the generated encodings to generate the output sequence. Transformers use sequence-to-sequence learning in the sense that they take a sequence of input tokens and try to predict the next token in the output sequence. This is done by iterating through the encoding layers.

Transformers do not necessarily process data in order, rather run multiple sequences in parallel. Transformers do use the so-called attention mechanism, that provides context to the tokens of the input sequence, i.e. context information that gives a meaning to each token.
What is the attention mechanism used by the transformer architecture?

Training and Optimization

How are weights initialized in a neural network?
There are several ways to initialize weights in a neural network. Here are some of the most common methods:
- Initialize the weights to zero: this makes your model similar to a linear model. All the neurons and every layer perform the same operation, giving the same output and making the deep net useless.
- Initializing all weights randomly: the weights are assigned randomly by initializing them very close to 0. It gives better accuracy to the model since every neuron performs different computations. This is the most commonly used method
What is an activation function?

An activation function introduces non-linearity into a neural network. It determines whether a neuron should be activated (output a non-zero value) or not. Common activation functions include ReLU (Rectified Linear Unit), Sigmoid, and Tanh.
What is the purpose of an activation function in a neural network?

The purpose of activation functions is to add non-linearity to a neural network, allowing the network to model complex relationships within the data. Activation functions Activation functions also determine the output of neurons, enabling the network to make decisions and learn from a vast array of features.
Explain different types of activation functions and its uses
These are common examples of activation functions:
- Sigmoid: It converts the input into the value between 0 and 1 using the formula
  g(z) =
  1 / 1+e^(-z)
  
  It is used in Binary classification and multilabel classification
- ReLU: It converts the negitive input to 0 using the formula g(z) = max(0, z)
  It is used in Regression when expected output is always positive and mostly used in hidden layers for better gradient flow.
- Linear: It gives the output same as input using formula g(z) = z
  It is used in output layers for regression models.
- Softmax: It gives vector as output using the formula g(z) = e**zi/sum(e**z) ∀(zi ∈ z) where z = [z1, z2, z3...]
  It is used in multiclassification
- Tanh: It converts the input into the value between -1 and 1 using the formula g(z) = (e**x-e**-x)/e**x+e**-x)
  It is used in hidden layers when the input data is centered around 0 to maintain the scale especially in RNN to handle sequential data.
What is a cost function?
A cost function (or also called loss function) is a mathematical measure that quantifies how well a neural network's predictions match the actual target values. The goal in training a neural network is to minimize this function.
It is primarily used for two purposes:
- Model Training: During the training phase, it quantifies the disparity between the predictions of a machine learning model and the true target values, guiding the adjustment of model parameters for better predictions.
- Evaluation: After training, the cost function serves as an evaluation metric, measuring how well the model generalizes to new data. Lower cost function values indicate better model performance.
What is batch normalization?

Batch normalization is a technique for improving the performance and stability of neural networks. It involves normalizing the outputs of each layer, i.e., transforming them to have zero mean and unit variance. This helps to reduce the internal covariate shift problem, which is the change in the distribution of network activations due to the change in network parameters during training.
What is backpropagation?

Backpropagation is a fundamental algorithm used to train neural networks. It involves iteratively adjusting the network's weights and biases to minimize the difference between the predicted output and the actual target values.
What is the purpose of dropout regularization in deep learning?

Dropout regularization prevents overfitting by randomly dropping a fraction of neurons during training, forcing the network to learn more robust features and reducing reliance on specific neurons.

Transfer Learning

What is Transfer Learning? When should it be used?
Transfer learning is a technique where a pre-trained neural network is adapted for a new, related task. This can significantly reduce the amount of data and training time required for the new task.
The most important scenarios to consider using transfer learning are:
- Limited Data: When you have a small dataset that is insufficient to train a deep model effectively, transfer learning can provide a substantial boost in performance.
- Similar Tasks: If your problem is closely related to a task for which a pre-trained model exists, transfer learning is highly advantageous, as it leverages the knowledge from the related task.
- Resource Constraints: When computational resources, including time and hardware, are limited, using pre-trained models is efficient compared to training a model from scratch.
- Domain Adaptation: When you need to work in a new domain but have access to data from a related domain, transfer learning can be used to adapt the model from the related domain to the new one.
- You do not have enough data to train a model from scratch. Starting with a pre-trained can allow you to train the model for your task.
- You do not have enough time to train a model from scratch. Because training models can take days or even weeks, using a pre-trained model is much more faster.
Explain the concept of transfer learning in deep learning. Provide a real-world example.

Transfer learning in deep learning is a technique where a pre-trained neural network, typically on a large dataset, is used as a starting point to solve a different but related problem. The idea is to leverage the knowledge gained during training on the initial task and adapt it to the new task, often with less training data and computation.

For example, a pre-trained convolutional neural network (CNN) that has learned to recognize a wide range of objects in images can be fine-tuned for a specific image classification task, like classifying different species of flowers. The pre-trained model already possesses features that are generally useful for recognizing edges, shapes, and textures, making it easier to adapt to the new task with a smaller dataset.
What is the vanishing gradient problem in deep learning, and how can it be mitigated?
The vanishing gradient problem is a challenge in training deep neural networks, particularly in recurrent neural networks (RNNs) and deep feedforward networks. It occurs when gradients become extremely small during backpropagation, causing the network's weights to update very slowly or not at all. This can lead to slow convergence and difficulty in training deep models.

To mitigate the vanishing gradient problem, several techniques can be used:
- Activation Functions: replace activation functions like sigmoid with alternatives such as ReLU (Rectified Linear Unit), which are less prone to vanishing gradients.
- Weight Initialization: use appropriate weight initialization techniques like Xavier/Glorot initialization to ensure that weights have reasonable initial values.
- Batch Normalization: apply batch normalization to normalize the inputs at each layer, helping gradients to flow more smoothly.
- Gated Architectures: architectures like LSTMs and GRUs use gating mechanisms to control the flow of information, reducing the vanishing gradient problem in sequential data models.
- Gradient Clipping: clip gradients during training to prevent them from becoming excessively large or small.

Large Language Models (LLMs)

What is a Large Language Model (LLM)?

A Large Language Model (LLM) is a type of language model that can achieve advanced language understanding and generation. LLMs are highly efficient at capturing the complex entity relationships in the text at hand and can generate the text using the semantic and syntactic of that particular language in which we wish to do so. They can perform tasks like text generation, machine translation, summary writing, image generation from texts, machine coding, chat-bots, or Conversational AI.

LLMs are artificial neural networks, mainly transformers, that are trained on massive amount of data and take millions of parameters. As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.
How do Large Language Models (LLMs) work?

Large Language Models (LLMs) are a type of artificial intelligence model that is designed to understand and generate human language. LLMs represent words as long lists of numbers, known as word vectors. For example, the word “cat” might be represented as a list of 300 numbers.

Most LLMs are built on a neural network architecture known as the transformer. This architecture enables the model to identify relationships between words in a sentence, irrespective of their position in the sequence.

LLMs are trained using a technique known as transfer learning, where a pre-trained model is adapted to a specific task. They learn by taking an input text and repeatedly predicting the next word. This requires massive amounts of data and computational resources.

Once trained, LLMs can generate text by taking a series of words as input and predicting the most likely next word, over and over, until they’ve generated a full piece of text.
What is pre-training and fine-tuning in the context of Large Language Models (LLMs)?
Pre-training and fine-tuning are two key stages in the training of Large Language Models (LLMs):
- Pre-training: this is the initial stage involving the training of the model on a large and diverse dataset that contains vast amounts of text. The model learns to predict the next word in a given sentence or sequence, developing a profound understanding of grammar, context and semantics.
- Fine-tuning: after the pre-training phase on a general text dataset, the model is fine-tuned, or in simple words, trained to perform a specific task such as sentiment analysis, translation, text summarization and so on. Fine-tuning involves updating the weights of a pre-trained language model on a new task and dataset.
The division of training into pre-training and fine-tuning is a form of transfer learning. The idea is that the model, having learned a broad understanding of language during pre-training, can then be fine-tuned on specific tasks with less computational power. This approach is efficient, especially when dealing with limited task-specific datasets, as the model can transfer its knowledge from the pre-training phase to the task at hand.
How are Large Language Models (LLMs) evaluated?
Large Language Models (LLMs) are evaluated based on the ability to generate accurate and relevant responses. Here are some common methods that are used to measure the performance of LLMs.
- Controlled Generation Tasks: These tasks assess the model’s ability to generate content under specific constraints. For example, a study found that LLMs often struggle with meeting fine-grained hard constraints.
- Zero-Shot Classification Tasks: In these tasks, the model is evaluated on its ability to correctly classify inputs without having seen any labeled examples during training. This method is popular for evaluating LLMs as they have been shown to learn capabilities during training without explicitly being shown labeled examples.
- Comparative Evaluation: This involves comparing the performance of large models against smaller, fine-tuned models. The comparison can reveal whether large models fall behind, are comparable, or exceed the ability of smaller models.
- Use of Evaluation Metrics: Certain metrics like Sentence-BERT (SBERT), BS-F1, and Universal Sentence Encoder (USE) can be used with confidence when evaluating the efficacy of LLMs, particularly in generative tasks like summarization where machine/human comparisons are appropriate.
What are some applications of Large Language Models (LLMs)?
Large Language Models (LLMs) have many potential applications in different domains and industries. Some of the most common and promising ones are:
- Translation: LLMs can translate texts from one language to another. For example, a user can enter text into a chatbot and ask it to translate into another language, and the LLM will automatically generate the translation.
- Content creation: LLMs can generate various types of written content, such as blogs, articles, summaries, scripts, questionnaires, surveys, and social media posts. They can also help with ideation and inspiration for content creation. Additionally, some LLMs can generate images based on a written prompt.
- Chatbots and virtual assistants: LLMs can power conversational agents that can interact with humans in natural language, providing information, guidance, entertainment, and support. For example, ChatGPT is a popular AI chatbot that can engage in a wide range of natural language processing tasks.
- Knowledge work: LLMs can help knowledge workers with tasks such as information retrieval, summarization, analysis, synthesis, and extraction. They can also provide insights and recommendations based on large and complex data sets.
- Malware analysis: LLMs can scan and explain the behavior of scripts and files, and detect whether they are malicious or not. For example, Google’s SecPaLM LLM can analyze and classify malware samples without running them in a sandbox.
- Law: LLMs can assist lawyers and legal professionals with tasks such as document review, contract generation, compliance checking, and legal research.
- Medicine: LLMs can support medical professionals and patients with tasks such as diagnosis, treatment, drug discovery, clinical documentation, and health education.
- Robotics and embodied agents: LLMs can enable robots and other physical agents to communicate with humans and other agents, and understand and execute natural language commands.
- Social sciences and psychology: LLMs can help researchers and practitioners with tasks such as data collection, analysis, interpretation, and intervention. They can also generate synthetic data and texts that can be used for experiments and studies.
These are some of the applications of LLMs that have been explored or implemented so far. However, there are many more possibilities and challenges that await further research and development. LLMs are an exciting and rapidly evolving field of AI that can have a significant impact on various aspects of human society and culture.
What are the limitations of Large Language Models (LLMs)?
Large Language Models (LLMs) have several limitations:
- Data Dependence: LLMs are trained on large amounts of data and their performance heavily depends on the quality and diversity of this data. If the training data is biased or unrepresentative, the model’s outputs will likely reflect these biases.
- Lack of Understanding: Despite their ability to generate human-like text, LLMs do not truly understand the content they are generating. They do not have a concept of the world, facts, or truth, and are merely predicting the next word based on patterns learned during training.
- Inability to Verify Information: LLMs cannot verify the information they generate against real-world facts. They do not have the ability to access real-time, up-to-date information or databases.
- Ethical Concerns: LLMs can generate harmful or inappropriate content if not properly controlled. This includes content that is biased, offensive, or misleading.
- Resource Intensive: Training LLMs requires significant computational resources, which can be a barrier to development and use.
- Lack of Creativity and Innovation: While LLMs can generate text based on patterns they’ve learned, they are not capable of true creativity or innovation. They cannot come up with new ideas or concepts that were not present in their training data.
How do Large Language Models (LLMs) handle multilingual tasks?
Large Language Models (LLMs) are AI systems that can understand and generate human language by processing vast amounts of text data. They can perform a wide range of natural language processing tasks, such as translation, summarization, question answering, and more.
There are different ways that LLMs can handle multilingual tasks, depending on how they are trained and used. Some common approaches are
- Multilingual pre-training. This involves training a single LLM on text data from multiple languages, usually with a shared vocabulary and subword tokenization. This allows the LLM to learn cross-lingual representations and transfer knowledge across languages. Examples of multilingual LLMs are mBERT, XLM-R, mT5, and BLOOM.
- Continued pre-training. This involves further training a pre-trained LLM on additional text data from specific languages or domains, to improve its performance and adaptability. For example, Tower is a multilingual LLM that is built on top of LLaMA2 and continued pre-trained on 20 billion tokens of text from 10 languages.
- Multi-task fine-tuning. This involves fine-tuning a pre-trained LLM on multiple tasks simultaneously, using task-specific prompts or instructions. This can help the LLM generalize to new tasks in a zero-shot setting, without requiring labeled data. For example, BLOOMZ and mT0 are multilingual LLMs that are fine-tuned on multiple translation-related tasks using multi-task prompted fine-tuning.
- Adapters and parameter-efficient fine-tuning. This involves adding small modules or layers to a pre-trained LLM that can be fine-tuned for specific tasks or languages, without affecting the original parameters. This can reduce the computational cost and memory footprint of fine-tuning, and enable more flexible and modular LLMs. Examples of adapter-based LLMs are AdapterHub, MAD-X, and XGLM.
These are some of the main methods that LLMs use to handle multilingual tasks, but there are also other aspects to consider, such as data collection, evaluation, interpretability, and responsible AI.
What ethical considerations are there when using Large Language Models (LLMs)?
The use of Large Language Models (LLMs) raises several ethical considerations. Some of the most relevant ones are:
- generate harmful content: LLMs have the potential to create harmful content, including hate speech and extremist propaganda. While they are not inherently harmful, the biases present in their training data can be reflected in their outputs.
- Disinformation & Influencing Operations: LLMs can inadvertently produce misinformation or be exploited for disinformation campaigns.
- Weapon Development: in the wrong hands, LLMs could be used to create malicious content, including cyber threats or weapons.
- Privacy: LLMs may inadvertently reveal sensitive information or violate privacy rights.
- Environmental Impact: training large models consumes significant computational resources.
- Accountability and Responsibility: clear guidelines on authorship, disclosure, and intellectual property of the generated content are essential.
- Hallucinations: LLMs sometimes produce text that appears correct but contains incorrect information. This can lead to misinformation and confusion.
How can bias in the training data affect the outputs of Large Language Models (LLMs)?

Coding Dojos

Statistics Dojos

Given the conditional tree in the picture in the answer, find the probabilities P(Cake|Coffee), P(Cake'∩Coffee), P(Cake), P(Coffee|Cake).

P(Cake|Coffee), can be read directly from the conditional diagram.
P(Cake|Coffee) = (3/4)

To find P(Cake' ∩ Coffee), we use the formula of the conditional probability.
P(Cake' ∩ Coffee) = P(Cake'|Coffee) * P(Coffee) = (1/4) * (2/3) = 1/6

To find P(Cake), we need to add P(Cake ∩ Coffee) and P(Cake ∩ Coffee')
P(Cake) = P(Cake ∩ Coffee) + P(Cake ∩ Coffee') = P(Cake|Coffee) * P(Coffee) + P(Cake|Coffee') * P(Coffee') = (3/4)*(2/3) + (3/5)*(1/3) = 1/2 + 1/5 = 7/10 = 0.7

To find P(Coffee|Cake), we use the formula for the conditional probability and P(Cake) found previously.
P(Coffee|Cake) = P(Coffee ∩ Cake) / P(Cake) = P(Cake|Coffee) * P(Coffee) / P(Cake) = (3/4) * (2/3) / (7/10) = 1/2 * 10/7 = 0.71
Given a set of people, 80% of them drink coffee while 20% drink tea. Out of those drinking coffee, 70% put sugar in the coffee. Out of those drinking tea, 40% put sugar in the tea. Can you draw the probability tree for this scenario? Use the Bayes's Theorem to compute P(coffee|sugar free).

The probability tree is visible below. Given the Bayes formula
P(A|B) =
P(A) * P(B|A) / P(A) * P(B|A) + P(A') * P(B|A')

we can find P(coffee|sugar free) as follows

P(coffee|sugar free) =
P(coffee) * P(sugar free|coffee) / P(coffee) * P(sugar free|coffee) + P(tea) * P(sugar free|tea)
=
0.8 * 0.3 / 0.8 * 0.3 + 0.2 * 0.6
= 0.24 / 0.36 = 2/3
Given 10 people, 7 people likes coffe, 5 likes tea, 4 people likes both. Are the probabilities P(coffee) and P(tea) dependent or independent?

P(coffee) and P(tea) are independent if P(coffee) x P(tea) = P(coffee ∩ tea).

Let's check whether this is the case or not.

P(coffee) = 7/10 = 0.7
P(tea) = 5/10 = 0.5
P(coffee ∩ tea) = 4 / 10 = 0.4
P(coffee) x P(tea) = 0.7 x 0.5 = 0.35

Because P(coffee) x P(tea) is different from P(coffee ∩ tea), it follows that P(coffee) and P(tea) are dependent events.
At a local shop you can buy magic boxes that can contain some money prizes. The box costs 1$. The probability of winning 5$ is 0.07 and the probability of winning 10$ is 0.03. If X is the net gain, what is the probability distribution P(X)? What are the values of the expectation E(X) and the variance Var(X)? How do P(X), E(X) and Var(X) change if the price of a magic box is increased to 1.5$?

The possible values are -1$ (buying a magic box without winning), 4$ (buying a magic box and winning 5$) and 9$ (buying a magic box and winning 10$). The probability distribution P(X) is given by all possible values (-1, 7, 14) and their probabilities (0.9, 0.07, 0.03).

The expectation E(X) is
E(X) = μ = -1*0.9 + 4*0.07 + 9 * 0.03 = -0.9 + 0.28 + 0.27 = -0.35
The expected long term net gain is -0.35$

The variance Var(X) is
Var(X) = (-1+0.35)²*0.9 + (4+0.35)²*0.07 + (9+0.35)²*0.03 = 0.38 + 1.32 + 2.62 = 4.32$

If the price of the magic box is increased to 1.5$, the values change but the probabilities remain the same. The new data distribution Y can then be obtained from a linear transformation of the original data distribution as follows
P(Y) = aP(X) + b.
In this example a = 1 and b = 0.5. E(Y) and Var(Y) can be then computed as

E(Y) = a E(X) + b = -0.35 + 0.5 = 0.15$
Var(Y) = a²Var(X) = 4.32$
Given a restaurant 1 with prices (3$, $4$, 5$) and probability (0.6, 0.3, 0.1) and a restaurant 2 with prices (2$, 6$, 7$) and probability (0.8, 0.15, 0.05), what is the difference in price between the two restaurants?

The average price for each restaurant is
E(R1) = 3*0.6 + 4*0.3 + 5*0.1 = 3.5$

E(R2) = 2*0.8 + 6.0.15 + 7*0.05 = 2.85$

The variance of the price is
Var(R1) = (3-3.5)²*0.6 + (4-3.5)²*0.3 + (5-3.5)²*0.1 = 0.45
Var(R2) = (2-2.85)²*0.8 + (6-2.85)²*0.15 + (7-2.85)²*0.05 = 2.93

The difference in price between the two restaurant and the respective variance is
E(R1 - R2) = E(R1) - E(R2) = 3.5 - 2.85 = 0.65$
Var(R1 - R2) = Var(R1) + Var(R2) = 0.45 + 2.93 = 3.38
There is a race between 12 Ferrari cars, 8 Mercedes cars and 5 Lamborghini cars. What is the probability that the 5 Lamborghini cars will finish the race consecutively? Suppose that each Lamborghini has the same probability of winning.

We are not interested in the order of the individual cars but on the car type at each position. Since the Lamborghini cars are at consecutive positions, they can be treated as a single item. We need to find the number of ways in which the 5 Lamborghini cars can finish the race consecutively. We also need to find the number of ways the race can be finished irrespective of the car types. We use the formula for computing the number of arrangements of k objects where j and k of them are of the same type
n! / j! k!

The number of ways 5 Lamborghini cars can finish the race consecutively is (20 + 1)! / 12! 8!
. Notice that we added 1 to the numerator because we consider the 5 Lamborghini cars as a single item.
The number of ways the race can be finished irrespective of the car types is 25! / 12! 8! 5!
.
The searched probability is (20 + 1))! / 12! 8!
·
12! 8! 5! / 25!
=
21! 5! / 25!
= 3.95 · 10^-4
Given 12 animals, how many distinct arrangements are there if you pick 5 of them?

To find the number of distinct arrangements we can use the formula for the combinations
n! / k! (n-k)!
= 12! / 5! (12-5)!
= 792
The probability of winning a game is 0.2. Can you find
- the probability you will succeed at your 5th attempt after 4 failures
- the probability you will succeed in 7 or less attempts
- the probability you will need more than 7 attempts to succeed
- the expected number of attempts you need to succeed
- the variance of the expected number of attempts you need to succeed
This is an example of geometric distribution because we have a set of independent trials. Let p = 0.2 be the probability of success and q = 1-p = 0.8 the probability of failure.
- The probability you will succeed at your 5th attempt after 4 failures is
  P(X=5) = q⁴ · p = 0.8⁴ · 0.2 = 0.082
- The probability you will succeed in 7 or less attempts is 1 minus the probability of failing 7 times.
  P(X<=7) = 1 - q⁷ = 1 - 0.8⁷ = 0.79
- The expected number of attempts you need to succeed is the expectation of the geometric distribution
  E(X) = 1 / p = 1 / 0.2 = 5
- the variance of the expected number of attempts you need to succeed is found with the variance of a geometric distribution
  Var(X) = q / p² = 0.8 / 0.2² = 20
Given 6 questions where each question has 4 possible answers, what is the probability of answering 5 questions correctly?
For this case we can use the formula of the binomial distribution P(X = r) = ⁿC_r p^r q^{n - r}. Indeed we have the following conditions, that are typical for a binomial distribution:
- a limited number of trials (5)
- the probability of answering correctly a question is independent of the other questions
- each answer can either be correct or wrong.
In our case we have r = 5 number of answers correct, p = 0.25 probability of answering correctly a question, q = 0.75 probability of answering wrongly a question, n = 6 total number of questions.
P(X = 5) = ⁶C₅ 0.25⁴ 0.75^{6 - 5} =
6! / 5! (6-5)!
0.25⁴ 0.75 = 0.018
Suppose that you live in a town where in average it can rain twice a week. What is the probability that next week it will not rain?

You know that in this town it rains twice a week. Let X be the distribution of the number of times it rains. Since we have the scenario with a set of events happening in a certain interval (in a week it can rain twice) and we know the mean values of times it rains in a week, we can say that X follows a Poisson distribution:
X ~ Po(λ) with λ = 2 (number of times it rains in a week).
To calculate the probability that next week it will not rain, we can use the formula of the Poisson distribution:
P(X = r) =
e^-λ λ^r / r!
.
In our case we have λ = 2 and r = 0 (0 rains in a week). It follows that the probability that it will not rain next week is
P(X = 0) =
e^-2 2⁰ / 0!
= e^-2 = 0.153
Suppose that you have a restaurant with a kitchen where, in average, the oven can break once a year and the fridge can break twice a year. What is the probability that nothing will break in the kitchen next year?

Let X and Y be the distributions of the number of malfunctions of the oven and the fridge. Because we know the number of malfunctions in a certain time interval (one year), it follows that X and Y follow a Poisson distribution. Thus we can write X ~ Po(1) and Y ~ Po(2).

To find the probability that nothing breaks in the kitchen next year we need to find P(X + Y) = 0. Because X and Y are indipendent distributions, we can sum the distributions as follows
X + Y ~ Po(3).

P(X + Y = 0) =
e^-λ λ^r / r!
=
e^-3 3⁰ / 0!
= e^-3 = 0.05
The probability that nothing in the kitchen will break next year is 5%.

SQL Dojos

Given a table person(name,age), sort the records by age ascending and by name descending

                        
SELECT * FROM person 
ORDER BY age ASC,name DESC

Given a table person(name,age), select the minimum age, the maximum age, the average age. In addition rename the average age to meanAge

                        
SELECT MIN(age), MAX(age), AVG(age) AS meanAge
FROM person

Given a table person(name,age), select all names that do not start with 'R'

                        
SELECT name FROM person 
WHERE name NOT LIKE 'R%'

Given a table person(name,age), select all names that start with 'R' and end with 'A' and do not contain 'C'

                        
SELECT name FROM person 
WHERE name LIKE 'R%A' 
and name NOT LIKE '%C%'

Given a table person(name,age), select all names where the second letter is a 'i' and the letter before the last letter is a 'n'

                        
SELECT * FROM person 
WHERE name LIKE '_i%' 
AND name LIKE '%n_'

Given a table person(name,age), select all items whose age is one of the following values: 23,56,75. Then select all items whose age is NOT one of the previous values.

                        
-- query 1: items with one of the ages below
SELECT * FROM person 
WHERE age IN (23,56,75)

-- query 2: items with NONE of the ages below
SELECT * FROM person 
WHERE age NOT IN (23,56,75)

Given a table person(name,age), select all records whose age is between 20 and 40. Then select all records whose age is NOT between the previous values.

                        
-- age is between 20 and 40
SELECT * FROM person
WHERE age between 20 AND 40

-- age is NOT between 20 and 40
SELECT * FROM person
WHERE age NOT between 20 AND 40

What is a JOIN clause? What are the four types of JOIN clauses in SQL?
A JOIN clause is a query that retrieves records from two or more tables.

In SQL there are the following types of JOIN clauses (lets suppose we have two tables):
- INNER JOIN: returns records that have matching values in both tables.
- LEFT (OUTER) JOIN: returns all records from the left table and all matching records from the right table.
- RIGHT (OUTER) JOIN: returns all records from the right table and all matching records from the left table.
- FULL (OUTER) JOIN: returns all records that match in the left or in the right table.

Given the tables person(name,age), salary(name,money), select all records that match in both tables

                        
SELECT * FROM person
INNER JOIN salary
ON person.name = salary.name

Given the tables person(name,age), salary(name,money), select all records from person and all matching records from salary.

                        
SELECT * FROM salary
LEFT JOIN person 
ON person.name = salary.name

Given a table person(name,age), show the number of records for each name and rename that column to n.

                        
SELECT name, COUNT(*) as n
FROM person
GROUP BY name

Given a table person(name,age,gender), show the number of records for each gender, labeling the result as either MALE or FEMALE depending on the value of the gender column. Assume that the gender column can have only values M or F.

                        
SELECT count( * ),
       CASE gender
            WHEN 'M' THEN 'male'
            WHEN 'F' THEN 'female'
       END AS gender
FROM person
GROUP BY gender

Given a table person(name) find out the people with the longest name (as number of chars).

                        
SELECT name
FROM person
WHERE length( name ) = ( SELECT max( length( name ) )
                         FROM person );

Python Dojos

Given a list [3,5,7,9,2,4,7], print the first two items, all items from the third element and the last two items

                        
mylist = [3,5,7,9,2,4,7]
# first two items
mylist[:2]
# all items from the third
mylist[3:]
# last two items
mylist[-2:]

Take a list and concatenate it with another list

                        
list1 = [5,2,7]
list2 = [6,1,9]
list1.extend(list2)
# list1 is the concatenation of list1 and list2
print(list1)

Create a tuple with 3 and 5 and print the second integer

                        
myTuple = (3,5)
# prints 5
print(myTuple[1])

Create a dictionary, add a (key, value) pair, try to retrieve that key then a non-existing key.

                        
myDict = {}
myDict['myKey'] = 'myValue'
print(myDict.get('myKey')) # prints myValue
# retrieve the non-existing key
print(myDict.get('myKey2')) # prints None

Create a doSomething function that calls a createTuple function. createTuple must create a tuple from input params x,y. In addition call doSomething with a lambda function that sums two params.

                        
def createTuple(x,y):
  return (x,y)

def doSomething(f,x,y):
  return f(x,y)

t = doSomething(createTuple,3,6)
print(type(t))

doSomething(lambda x,y: x+y, 4,6)

Loop over the first 10 integers and print only the even numbers.

                        
for i in range(10):
  if i%2 == 0:
      print(i)

Given a list of integers, add 1 to each element of the list by using list comprehension.

                        
a = [4,5,6]
# 5,6,7
b = [x+1 for x in a]

Given a string 'hello everybody', print the first three characters, the last two characters and the substring between the third and fifth characters included.

                        
s = 'hello everybody'
# first three characters
print(s[:3])
# last two characters
print(s[-2:])
# substring between third and fifth characters included
print(s[2:6])

Given a string 'hello everybody', find at what position the word 'everybody' starts.

                        
s = 'hello everybody'
# let's find the position of 'everyboey'. It starts at position 6.
pos =  s.find('everybody')

Create a dictionary, add the key-value pairs ('one', 1) and ('two', 2) and retrieve the value for key 'two'.

                        
# create dictionary
mydict = dict()
# insert key-value ('one', 1)
mydict['one'] = 1
# insert key-value ('two', 2)
mydict['two'] = 2

# you could also insert both pairs with a single command
# mydict.update({'one':1, 'two':2})

# retrieve value for key 'two'
print(mydict['two'])

Given a dictionary myDic = {'one':1, 'two':2}, get the value for key 'three' and take 3 as default in case the key does not exist.

                        
myDic = {'one':1, 'two':2}
# since the key 'three' does not exist, the default value 3 is returned
print(myDic.get('three', 3))

Numpy Dojos

What is Numpy?

Numpy is a Python library for numerical computation. Numpy uses libraries written in C and Fortran and this make it very efficient to work with arrays and matrices.

An interesting aspect of Numpy is that you can define upfront how many bytes a variable will need in memory. This feature is not supported in Python standard.

Create a Numpy array with values (2,6,4) where each value takes 1 byte. Then print out the first and third elements.

                        
import numpy as np
a = np.array([2,6,4], dtype = np.int8)
# print first and third elements
print(a[[0,2]])

Create a Numpy array with values from 0 to 4. Then add 3 to each element. Finally show those elements that are smaller than 6.

                        
# create array 0..4
a = np.arange(5)
# add 3 to each element
a += 3
# show elements smaller than 6.
print(a[a < 6])

Create a numpy array with values (1,4,7,10).

                        
np.arange(start=1, stop=11, step=3)

Create a two-dimensional Numpy array with values from 1 to 4. Then sum the elements by row and by column.

                        
# create a two-dim array
A = np.array([[1,2],[3,4]])
# A = array([[1, 2],
#            [3, 4]])

# sums by row
A.sum(axis=0)
# = array([4, 6])

# sums by column 
A.sum(axis=1)
# = array([3, 7])

Create a 2x2 array of random integers with values from 0 to 10 inclusive.

                        
np.random.randint(low=1, high=11, size=(2,2))

Create an 3x2 array of only zeros. Then print the type of data.

                        
z = np.zeros(shape=(3,2))
# print the type of data
print(z.dtype)

Create a NumPy array with values (3.14, 2.71, 1.62) with a specified data type, and then convert it to a Python list.

                        
a = np.array([3.14, 2.71, 1.62], dtype=np.float32)
# convert to list
a.tolist()

Create a NumPy array with 10 equally spaced values between 0 and 1.

                        
np.linspace(start=0, stop=1, num=10)

Given two NumPy arrays arr1 = [1, 2, 3] and arr2 = [4, 5, 6], perform element-wise addition and multiplication.

                        
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# element-wise addition
arr1 + arr2
# element-wise multiplication
arr1 * arr2

Reshape the following 1D NumPy array [1, 2, 3, 4, 5, 6] into a 2D array with 2 rows and 3 columns.

                        
a = np.array([1, 2, 3, 4, 5, 6])
# reshape to 2x3
a.reshape(2,3)

Compute the mean, median, and standard deviation of a NumPy array with random values.

                        
a = np.random.rand(10)
# mean
a.mean()
# median
np.median(a)
# standard deviation
a.std()

Matplotlib and Seaborn Dojos

Visualize a pie chart using Matplotlib for the data fruits = ["Apples", "Bananas", "Cherries", "Apricots", "Pears"] and values = [14, 35, 23, 25, 15].

                        
import matplotlib.pyplot as plt
values = [14, 35, 23, 25, 15]
fruits = ["Apples", "Bananas", "Cherries", "Apricots", "Pears"]
plt.pie(values, labels = fruits)
plt.show()

Create a chart with plt.subplots(), plot the values x=[2,3,4], y=[3,5,4]. Assign chart test to the title, test x to the x-axis and test y to the y axis.

                        
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = [2,3,4]
y = [3,5,4]
ax.plot(x,y)
ax.set(title="chart test", xlabel="test x", ylabel="test y")

Create a Python dictionary containing 5 bananas, 4 apples and 2 cherries. Then create a plot with plt.subplots(), display the values in a bar chart. Set the chart title to Fruit quantities, the x-axis to Fruit and the y-axis to Quantity.

                        
fruits = {'banana': 5, 'apple':4, 'cherry': 2}
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.bar(fruits.keys(), fruits.values())
ax.set_title('Fruit quantities')
ax.set_xlabel('Fruit')
ax.set_ylabel('Quantity')

Given x=[3,5,7] and y=[6,8,2], create two subplots. In the first one draw the line plot of x, y and in the second one draw a scatter plot of x,y.

                        
import matplotlib.pyplot as plt
x = [3,5,7]
y = [6,8,2]
fig, ax = plt.subplots(ncols=2)
ax[0].plot(x,y)
ax[1].scatter(x,y)

Generate two random arrays x and y with a normal distribution, each of 100 samples. Then create three plots: plot the histogram of x, the histogram of y and the scatter plot of x versus y.

                        
import matplotlib.pyplot as plt
x = np.random.normal(size=100)
y = np.random.normal(size=100)
fig, ax = plt.subplots(ncols=3, figsize=(12,5))
ax[0].hist(x)
ax[1].hist(y)
ax[2].scatter(x,y)

Generate a dataframe with three columns a,b,c, where each column contains 10 values with an uniform distribution. Then plot the three columns in a bar chart.

                        
import pandas as pd
import numpy as np
# create a 2D array with 10 rows and 3 columns. Each column
# has then 10 values with uniform distribution
x = np.random.rand(10,3)
# create a dataframe
df = pd.DataFrame(x, columns=['a','b','c'])
df.plot.bar()

Create the following dataset df = pd.DataFrame({'age':[44, 67, np.nan], 'money':[500,700,550], 'color':['red', 'green','blue']}). Replace the np.nan value with the mean of the other ages. Then display a scatter plot where each point has the given color.

                        
# create dataframe
df = pd.DataFrame({'age':[44, 67, np.nan], 'money':[500,700,550], 'color':['red', 'green','blue']})
# replace na value with mean age
df.fillna(df.age.mean(), inplace=True)
# display the scatter plot
df.plot.scatter(x='age', y='money', c='color')

Create two arrays x and y, each containing 30 values that are normally distributed. Then display them in a scatter plot and include a vertical and a horizontal line that divide each axis into two equal parts.

                        
import numpy as np
import matplotlib.pyplot as plt
# generate x,y with random distr.
x = np.random.randn(30)
y = np.random.randn(30)
fig, ax = plt.subplots()
# scatter plot
ax.scatter(x,y)
# horizontal line
ax.axhline(y.mean())
# vertical line
ax.axvline(x.mean())

Generate two random arrays: x1, which should contain 30 integers between 2 and 20, and x2, which should contain 100 integers between 1 and 10, both uniformly distributed. Next, create two plots side by side that share the same y-axis, one for x1 and one for x2. Label the title of each plot as x1 plot and x2 plot respectively. Lastly, give the entire figure a common title such as My Plots.
```
                        
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, sharey = True)
x1 = np.random.randint(2,20,30)
x2 = np.random.randint(1,10,100)
ax1.plot(x1)
ax1.set_title("x1 plot")
ax2.plot(x2)
ax2.set_title("x2 plot")
# set common title
fig.subtitle("My plots")

                        
```

Generate a random array with 10 rows and 3 columns. Display the values in a bar chart with the dark_background style.

                        
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# create a random array of size 10x4
data = np.random.randn(10,4)
df = pd.DataFrame(data)
# use the 'dark_background' style 
plt.style.use('dark_background')
# display a bar chart
df.plot.bar()

Generate a random dataset with 10 rows and 2 columns. Display the values in a scatter plot and limit the x-axis between 0.2 and 0.6.

                        
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.random.rand(10,2)
df = pd.DataFrame(x, columns=['a','b'])
fig, ax = plt.subplots()
ax.set_xlim([0.2, 0.6])
ax.scatter(x=df['a'],y=df['b'])

Given the two series s1 = [3,4] and s2 = [5,7], create a dataframe and call the columns c1 and c2. Finally display the dataframe in a heatmap with title My heatmap.

                        
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# create dataframe
s1 = [3,4]
s2 = [5,7]
df = pd.DataFrame([s1,s2], columns=['c1','c2'])
# Creating the heatmap using Seaborn
sns.heatmap(df, annot=True)
# set the title
plt.title('My heatmap')
plt.show()

Given some salaries of engineers [70,72,76,82,69,75] and workers [50,52,56,32,49,55], create a dataframe with a column for each salary group and display the dataframe in a violin plot.

                        
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

engineer_salary = [70,72,76,82,69,75]
worker_salary = [50,52,56,32,49,55]
data = {
    'engineer salary': [70,72,76,82,69,75],
    'worker salary': [50,52,56,32,49,55]
}
df = pd.DataFrame(data=data) 
sns.violinplot(data=df, palette='muted', inner='quart')
plt.title('Salary Distribution')
plt.xlabel('Job')
plt.ylabel('Salary')
plt.show()

Create a line plot for the function y = x^2 for x values ranging from -5 to 5. Label the axes and give the plot a title.

                        
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(-5,5)
y = x**2
plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('y = x^2')

Visualize a horizontal bar chart for the top 10 countries with the highest population. Use sample data or fetch real data from a dataset. Label the countries on the y-axis and their populations on the x-axis.

                        
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# create a dataframe with countries and population
countries = ['China', 'India', 'USA', 'Indonesia', 'Pakistan', 'Brazil', 'Nigeria', 'Bangladesh', 'Russia', 'Mexico']
population = [1439323776, 1380004385, 331002651, 273523615, 220892340, 212559417, 206139589, 164689383, 145934462, 128932753]
df = pd.DataFrame({'country':countries, 'population':population})
# sort by population
df.sort_values('population', ascending=False, inplace=True)
# plot horizontal bar chart
plt.barh(df.country[:10], df.population[:10])
plt.xlabel('Population')
plt.ylabel('Country')
plt.title('Top 10 countries by population')

Visualize a 3D surface plot of a mathematical function, such as z = x^2 + y^2. Add labels to the axes and a title to the plot.

                        
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
# create a 3D plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# create x,y
x = np.arange(-5,5)
y = np.arange(-5,5)
# create a meshgrid
X, Y = np.meshgrid(x, y)
# create z
Z = X**2 + Y**2
# plot the surface
ax.plot_surface(X, Y, Z)
# set labels
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
# set title
ax.set_title('z = x^2 + y^2')

Pandas Dojos

What are the two main data structures of Pandas?
The two main data structures of Pandas are
- Series: it is given by a one-dimensional array and an associated array of labels, called indices. Each element in the array has a respective index.
- Dataframe: it is a tabular data structure with rows and columns that behind the scenes is a collection of Series.
What is the differences between a Pandas series and a standard Python list?

A Python list and a Pandas series look at first sight equal. The difference is that a Pandas series has a set of indices, one for each element. By default these indices are integers but they can be changed to another data type, strings for example.
```
                        
import pandas as pd
standard_array = [2,4,6]
pandas_series = pd.Series([2,4,6])
# default values are integers
pandas_series.index
# let's change them to chars
pandas_series.index = ['a','b','c']

                        
```

Create a series with 3 values (4,5,7) and 3 indices ('a','b','c'). Then print the values and the indices separately.

                        
from pandas import Series
mySeries = Series([4, 7, 5], index=['a','b','c'])
# print values 4,7,5
print(mySeries.values)
# print index 'a','b','c'
print(mySeries.index)

Given a series with values (3,5,7), return the values less than 4 or the values bigger than 6

                        
import pandas as pd
s = pd.Series([3,5,7])
# prints [3,7]
print(s[ (s < 4) | (s > 6) ])

Given a series pd.Series([4,6,7,4]), find the number of distinct values in it.

                        
mySeries = pd.Series([4,6,7,4])
num_distinct_values = mySeries.nunique()
# num_distinct_values is equal to 3. The distinct values are 4,6,7.

Given a series pd.Series(['male', 'female', 'male']), map 'male' to 0 and 'female' to 1.

                        
myseries = pd.Series(['male', 'female', 'male'])
mapping = {'male': 0, 'female': 1}
# map 'male' to 0 and 'female' to 1
myseries.map(mapping)

Given a series pd.Series([5,-7,2]), replace the negative numbers with 0.

                        
mySeries = pd.Series([5,-7,2])
# -7 is replaced with 0
mySeries[mySeries < 0] = 0

Given a series pd.Series([5,np.nan,2]), replace the nan number with the mean.

                        
mySeries = pd.Series([5,np.nan,2])
# replace all nan numbers with the mean of the remaining numbers
mySeries[mySeries.isnull()] = mySeries.mean()

Given a series pd.Series(['John', 'Paul', 'Ringo']), return a new series where each string is replaced with its first character.

                        
mySeries = pd.Series(['John', 'Paul', 'Ringo'])
# for each string return the first char
mySeries.apply(lambda s : s[0]) # [J,P,R]

Given a series pd.Series([4,6,8]), transform it to a new series where values less than 5 are mapped to 0 and the other values are mapped to 1.

                        
mylist = pd.Series([4,6,9])
# lambda function with conditional statement
mylist.apply(lambda x: 0 if x < 5 else 1)

Given list of ages, divide the ages into groups of range ages. For examples ages between 0 and 5 are babys, between 5 and 10 are children and so on.

                        
pd.cut([3, 4, 13, 14, 23, 24], bins=[0, 5, 15, 25], labels=['Baby', 'Child', 'Young'])

Open a csv file, show the first and last two rows, show the number of rows and columns and show the column names.

                        
# content of file.csv
# name,age,money
# john,43,2500
# ringo,32,3420
# paul,42,5600

import pandas as pd
ds = pd.read_csv('file.csv')
# first row
ds.head(1)
# last two rows
ds.tail(2)
# number of rows and columns
ds.shape
# column names
ds.columns

Create a dataframe with values [4,6,8] for column c1, [6,7,2] for column c2 and [3,5,6] for the index. Retrieve the row with index value equals to 3 and the row at position 1.

                        
# create the dataframe
df = pd.DataFrame({"c1":[4,6,8], "c2": [6,7,2]})
# set the index array
df.index = [3,5,6]
# get row with index 3
df.loc[3]
# get row at position 1
df.iloc[1]

Given a dataframe, sort the values by a column and show how many different items there are in that column

                        
# sort by column 'name'
ds.sort_values(['name'])
# show different number of items in column
ds['name'].value_counts()

Given a dataframe, show all columns of row 2, then show the number of null values for all columns.

                        
import pandas as pd
data = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[48, 26, 17, 64],
        'Role':['Tecnician', 'Programmer', 'Student', np.nan],
        'Salary':[46000, 57000, 8000, 120000]}
df = pd.DataFrame(data) 
# show all columns of row 2
df.loc[2,:]
# number of null values for each column
df.isnull().sum()

Given a dataframe, drop the name and age columns. Try to use at least two different approaches.

                        
# let's suppose you have a dataframe df

# approach 1 to remove the columns. 
df.drop(['name','age'], axis=1)

# approach 2
df.drop(columns = ['name','age'])

Given a dataframe with a numeric column Age that has some null values, replace the null values with the average age.

                        
# Lets create a dataframe that has a column Age where one value is null
pd.DataFrame(data)
data = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[np.nan, 26, 17, 64]}
df = pd.DataFrame(data)
# let's find the average age
mean_age = df.Age.mean()
# replace null values with mean_age.
df.Age.fillna(mean_age, inplace=True)

Given a list of ages [3, 7, 14, 18, 20, 25, 29], divide the ages into three bins: child (0-14), young (14-21) and adult (> 21).

                        
# define a list of ages
ages = [3, 7, 14, 18, 20, 25, 29]
# define the bins (0-14), (14-21), (21-100)
mybins = [0, 14, 21, 100]
# each bin is assigned a label
mylabels = ['child', 'young', 'adult']
res = pd.cut(ages, bins = mybins, labels = mylabels)
# the result is 
# ['child', 'child', 'child', 'young', 'young', 'adult', 'adult']

Given a dataset 'age': [35, 40, 45, 56, 22, 19], 'salary': [32, 43, 56, 75, 18, 16], divide the salary into three bins: poor (0-20), medium (20-45), rich (>45). Then find the average age for the poor, medium and rich bins.

                        
# create the dataset
df = pd.DataFrame({'age': [35, 40, 45, 56, 22, 19], 'salary': [32, 43, 56, 75, 18, 16]})
# define salary bins
mybins = [0, 20, 45, 100]
# define a label for each bin
mylabels = ['poor', 'medium', 'rich']
# create a new column 'salary_bin'
df['salary_bin'] = pd.cut(df.salary, bins = mybins, labels = mylabels)
# create a pivot table with 
df.pivot_table(values='age', index='salary_bin')

Create a DataFrame from a dictionary where each key represents a column name and the corresponding value is a list of values for that column. Display the first few rows of the DataFrame.

                        
import pandas as pd
data = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[48, 26, 17, 64],
        'Role':['Tecnician', 'Programmer', 'Student', 'Manager'],
        'Salary':[46000, 57000, 8000, 120000]}
df = pd.DataFrame(data) 
df.head()

Filter a DataFrame to show only rows that meet specific conditions. For example, display rows where a certain column's value is greater than a particular threshold.

                        
import pandas as pd
data = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[48, 26, 17, 64],
        'Role':['Tecnician', 'Programmer', 'Student', 'Manager'],
        'Salary':[46000, 57000, 8000, 120000]}
df = pd.DataFrame(data) 
# show only rows where salary is greater than 50000
df[df['Salary'] > 50000]

Reshape a DataFrame using pivot tables or other reshaping techniques to make it suitable for analysis or visualization.

                        
import pandas as pd
data = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[48, 26, 17, 64],
        'Role':['Tecnician', 'Programmer', 'Student', 'Manager'],
        'Salary':[46000, 57000, 8000, 120000]}
df = pd.DataFrame(data) 
# reshape the dataframe
df.pivot_table(values='Age', index='Role')

Merge two DataFrames together based on a common column or index and display the resulting merged DataFrame.

                        
import pandas as pd
data1 = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[48, 26, 17, 64],
        'Role':['Tecnician', 'Programmer', 'Student', 'Manager'],
        'Salary':[46000,57000,8000,120000]}
data2 = {'Name':['Jeff', 'Paul', 'Roger', 'George'], 'Age':[48, 26, 17, 64],
        'Role':['Tecnician', 'Programmer', 'Student', 'Manager'],
        'Salary':[46000, 57000, 8000, 120000]}
df1 = pd.DataFrame(data1) 
df2 = pd.DataFrame(data2) 
# merge the two dataframes
pd.merge(df1, df2, on='Name')

Machine Learning Dojos

Create a dataframe from {'a': [2, 4, 6], 'b': [9, 5, 3], 'y':[1, 2, 5]}, where y denotes the label column. Create a X dataframe with the feature columns and a y dataframe with the label column. Finally split each dataframe into a training and a test dataframe, where the test dataframe has 30% of the items.

                        
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'a':[2, 4, 6], 'b':[9, 5, 3], 'y':[1, 2, 5]})
X = df.drop(columns='y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Create a dataframe from {'a':['a', 'b', 'c']} and convert the text feature to a numerical feature using the OneHotEncoder class of the sklearn library.

                        
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
df = pd.DataFrame({'a':['a', 'b', 'c']})
one_hot = OneHotEncoder()
transformer = ColumnTransformer( [ ('one_hot', one_hot, ['a']) ], remainder = 'passthrough')
fitted = transformer.fit_transform( df )
transformed_df = pd.DataFrame( fitted )

Create a dataframe from {'a' : [6, 8, 6, np.NaN]} and replace the NaN value with the most frequent value using the SimpleImputer class of the sklearn library.

                        
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
my_imputer = SimpleImputer(strategy = "most_frequent")
df = pd.DataFrame({'a' : [6, 8, 6, np.NaN]})
imputer = ColumnTransformer( [ ('imp', my_imputer, ['a']) ] )
fitted = imputer.fit_transform(df)
df_filled = pd.DataFrame(fitted)

Create a DataFrame with sample data where you have features ('X1', 'X2', 'X3') and a target column ('y'). Split the data into training and test sets using a 70/30 split ratio.

                        
import pandas as pd
from sklearn.model_selection import train_test_split
# create a dataframe with sample data
data = {'X1':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'X2':[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'X3':[21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'y':[31, 32, 33, 34, 35, 36, 37, 38, 39, 40]}
df = pd.DataFrame(data) 
# split the data into training and test sets
X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Load the california housing dataset from sklearn and create a training and a test dataset. The test dataset must contains 20% of the total data points. Then train a linear regression with the Ridge algorithm and print the score.

                        
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

np.random.seed(42)
ds = fetch_california_housing()
df = pd.DataFrame(ds['data'], columns = ds['feature_names'])
df['MedHouseVal'] = ds['target']
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = Ridge()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
# With seed 42 you should get score = 0.575

Load the Iris dataset from the sklearn library. Then, train the RandomForestClassifier on it and finally evaluate the result with the cross_val_score metric.

                        
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
np.random.seed(12)
ds = load_iris()
df = pd.DataFrame(ds.data, columns=ds.feature_names)
df['target'] = pd.Series(ds.target)
X = df.drop('target', axis=1)
y = df['target']
model = RandomForestClassifier()
cross_val_score(model, X, y)
# As result you should get array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1.])

Suppose we have some actual results y_act = [1, 0, 1, 1, 0] and let y_pred = [1, 1, 0, 1, 0] be the results predicted with a classifier. Evaluate the performance of that classifier using the ROC curve.

                        
from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score
y_act = [1, 0, 1, 1, 0] 
y_pred = [1, 1, 0, 1, 0]
# compute ROC
fpr, tpr, threshold = roc_curve(y_act, y_pred)
# compute the area under curve (auc)
auc = roc_auc_score(y_act, y_pred)
# display the ROC and AUC
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc).plot()
# you will get auc = 0.5833

Apply the random forest regressor to the california housing dataset. Then evaluate the model performance with the R-squared metric.

                        
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

np.random.seed(1234)
# load dataset
ds = fetch_california_housing()
df = pd.DataFrame(ds['data'], columns = ds['feature_names'])
df['target'] = ds['target']
# dependent variables
X = df.drop('target', axis=1)
# independent variable
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestRegressor( n_estimators=100 )
model.fit(X_train, y_train)
model.score(X_test, y_test)
y_pred = model.predict(X_test)
r2_score(y_test, y_pred)
# you should get 0.8049

Write a Python script to perform a web scraping task. Visit a website, such as "https://example.com" and scrape the titles and links of the top articles on the homepage. Use the requests library to fetch the HTML content and BeautifulSoup for parsing.

                        
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
# Find and print the titles and links of the top articles
top_articles = soup.find_all("div", class_="article")
for article in top_articles:
title = article.find("h2").text
link = article.find("a")["href"]
print(f"Title: {title}")
print(f"Link: {link}")

Create a movie recommendation system using user ratings data. The goal is to recommend the top 3 movies for a specific user based on their similarity to other users. For the solution you can use the user_ratings.csv dataset.

                        
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Load user ratings dataset
ratings_data = pd.read_csv("user_ratings.csv")

# Calculate the similarity matrix between users
user_similarity = cosine_similarity( ratings_data )

# Recommend top 3 movies for a specific user (e.g., user 123)
user_id = 123
similar_users = list( enumerate( user_similarity[user_id] ) )
similar_users.sort(key = lambda x: x[1], reverse=True)

# Extract and print the top 3 movie recommendations
top_movie_ids = [user[0] for user in similar_users[1:4]]
recommended_movies = ratings_data.columns[ top_movie_ids ]

Implement a Python script that simulates a basic multiplayer game. The game should involve multiple players, each with unique attributes, and allow them to interact in a virtual world. Create classes and methods for character creation, movement, and combat.

                        class Player:
  def __init__(self, name, health, attack):
    self.name = name
    self.health = health
    self.attack = attack

  def move(self, direction):
    # Implement movement logic
    pass

  def attack_player(self, target_player):
    # Implement combat logic
    pass

# Create multiple player objects
player1 = Player("Alice", 100, 20)
player2 = Player("Bob", 120, 15)

# Simulate player interactions, movement, and combat
player1.move("north")
player2.attack_player(player1)

Implement k-fold cross-validation on a dataset and use it to train and evaluate a machine learning model. Print the cross-validation scores and metrics.

                        
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# create a dataframe with sample data
data = {'X1':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'X2':[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'X3':[21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'y':[31, 32, 33, 34, 35, 36, 37, 38, 39, 40]}
df = pd.DataFrame(data) 
# split the data into training and test sets
X = df.drop('y', axis=1)
y = df['y']
# create a linear regression model
model = LinearRegression()
# perform k-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
# print the scores
print(scores)
# print the mean score
print(scores.mean())

Apply Principal Component Analysis (PCA) to reduce the dimensionality of a dataset and then train a machine learning model on the reduced data. Compare the model's performance before and after dimensionality reduction.

                        
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# create a dataframe with sample data
data = {'X1':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'X2':[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'X3':[21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'y':[31, 32, 33, 34, 35, 36, 37, 38, 39, 40]}
df = pd.DataFrame(data) 
# split the data into training and test sets
X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# apply PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# train a linear regression model
model = LinearRegression()
model.fit(X_train_pca, y_train)
# evaluate the model
y_pred = model.predict(X_test_pca)
r2_score(y_test, y_pred)
# you should get 0.9999

Load a dataset with imbalanced classes and implement strategies to deal with class imbalance, such as oversampling, undersampling, or using different evaluation metrics like F1-score or AUC-ROC.

                        
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
# create a dataset with imbalanced classes
X, y = make_classification(n_samples = 1000, n_classes = 2, weights = [0.99, 0.01], random_state = 1)
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
y_pred = model.predict( X_test )
f1_score(y_test, y_pred)
# you should get 0.0

Load the Iris dataset, create a classifier using the gaussian Naive Bayes algorithm (GaussianNB) and evaluate its performance with the accuracy metric.

                        
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load the Iris dataset
X, y = load_iris(return_X_y=True)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the classifier
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# You should get accuracy = 0.97

Deep learning dojos

Implement a simple feedforward neural network using TensorFlow and Keras. Train it on a dataset to perform image classification.

                        
# Here's a basic example of a feedforward neural network using TensorFlow and Keras for image classification:

import tensorflow as tf
from tensorflow import keras
# Load the dataset (e.g., MNIST)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Preprocess the data
x_train, x_test = x_train / 255.0, x_test / 255.0
# Build the neural network model
model = keras.Sequential([
    keras.layers.Flatten( input_shape = (28, 28) ),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10)
])
# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy( from_logits = True ),
              metrics = ['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs = 5)
# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)

# This code demonstrates building a simple neural network for image classification on the MNIST dataset.