⚡Day 3 - Feature Engineering, Data Preprocessing

Well, today I tried to be extra productive and an extra explorer. This morning I searched for a few open source projects I could contribute to. After browsing GitHub's trending repositories on topics like machine learning, artificial intelligence, and Python, I found an open source project called Danswer, which sounds like (dancer 🕺🏾) but has an even cooler objective: integrating Generative AI into enterprises. I scrolled through the documentation and tried to understand a few files of code. Since I know how Python works, I can follow what the functions are doing, but I lack skills in web development, so let's see how it goes. I've looked into an issue I felt I could contribute to, talked about it in the Discord community, and the maintainer has shown full support, which felt so nice.

Now coming to the study part, I again looked into the syllabus and chose to refine my feature engineering knowledge. I googled a few articles and found some nice ones explaining how feature engineering plays a vital role in data science. Feature engineering is the process of transforming raw data into a format that machine learning algorithms can understand and effectively utilize. It involves creating new features or modifying existing ones to enhance the performance of a machine learning model, and it plays a crucial role in extracting meaningful information from the data and improving the predictive power of a model.

✨Reference: chat.openai.com

✨Reference: medium.com/@ndleah/eda-data-preprocessing-f..

Some nice lecture notes explained feature engineering in an illustrative format, which was nice and easy to understand. The image below is taken from those lecture notes.

The above image shows how the feature engineering cycle works: you first build features by brainstorming, then check whether they suit the dataset well and give more accurate predictions from the model. There are different techniques involved in feature engineering; some are listed below, followed by a small code sketch:

  1. Imputation: This technique is used to handle missing values in the dataset. Missing values can hinder the performance of machine learning models, so imputation involves filling in those missing values with reasonable estimates. For example, if a dataset has missing age values, they can be imputed by using the mean or median age of the available data.

  2. One-Hot Encoding: One-Hot Encoding is used to convert categorical variables into binary vectors. It creates new binary features for each unique category in a categorical variable. For instance, if there is a "color" feature with categories "red," "blue," and "green," one-hot encoding would transform it into three binary features: "color_red," "color_blue," and "color_green," with values of 1 or 0 depending on the original category.

  3. Feature Scaling: Scaling features is necessary when variables have different scales or units. It ensures that all features contribute equally to the model. Common scaling techniques include normalization (scaling values between 0 and 1) and standardization (scaling to have zero mean and unit variance). For example, if a dataset has features like "age" ranging from 0 to 100 and "income" ranging from 0 to 100,000, scaling can help bring them to a similar range.

  4. Binning: Binning involves dividing continuous numerical features into discrete bins. This process can help capture non-linear relationships and reduce the impact of outliers. For instance, if a dataset has an "age" feature, it can be binned into categories such as "child," "teenager," "adult," and "elderly" based on specific age ranges.

  5. Polynomial Features: Polynomial features can be created by generating higher-order terms from existing features. This approach allows the model to capture non-linear relationships between features and target variables. For example, given a feature "x," creating a polynomial feature of degree 2 would involve adding a new feature "x^2" to the dataset.

  6. Feature Interaction: Creating new features by combining existing ones can help the model capture interaction effects. For instance, if a dataset has features "length" and "width," creating a new feature called "area" by multiplying the two can provide additional information for the model.

  7. Time-Based Features: If the dataset includes a temporal component, extracting time-based features can be valuable. Examples include day of the week, month, or season. These features can help the model identify patterns or seasonality in the data.

  8. Domain-Specific Features: Incorporating domain knowledge can lead to informative features. For instance, in a fraud detection system, features such as transaction frequency, deviation from usual spending patterns, or suspicious keywords can be engineered based on the understanding of fraudulent behavior.

  9. Feature Selection: Sometimes, not all features in the dataset are relevant or contribute significantly to the model's performance. Feature selection techniques, such as statistical tests or feature importance algorithms, help identify the most informative features and discard irrelevant ones. This process can enhance model efficiency and reduce overfitting.
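
Here's a minimal sketch of how a few of these techniques (imputation, one-hot encoding, binning, and scaling) look in practice, assuming a made-up toy dataset and using pandas and scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy dataset (invented for illustration) with a missing age value
df = pd.DataFrame({
    "age": [12.0, 32.0, None, 67.0],
    "income": [0, 65_000, 52_000, 90_000],
    "color": ["red", "blue", "green", "red"],
})

# Imputation: fill the missing age with the median of the column
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encoding: expand "color" into binary color_* columns
df = pd.get_dummies(df, columns=["color"])

# Binning: bucket age into discrete groups (done before scaling,
# while the values are still in years)
df["age_group"] = pd.cut(df["age"], bins=[0, 13, 20, 60, 120],
                         labels=["child", "teenager", "adult", "elderly"])

# Feature scaling: bring age and income into the 0-1 range
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

print(df)
```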

The list above is ChatGPT's answer when I asked it about these techniques, which was kinda nice. Later, moving on to...

Data Preprocessing

It is a crucial step in ML pipelines, and feature engineering itself falls under data preprocessing. As the name suggests, we work on the data beforehand so that the model is trained on the best data possible, rather than on data that is full of empty cells, wildly scaled, mislabeled, and dirty. Data preprocessing involves various steps like data cleaning, data transformation, handling imbalanced data, handling time-series data, splitting data, handling skewed target variables, and many more. Not every step is applied to every dataset; they are chosen accordingly.
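
As a rough sketch of a typical cleaning-and-splitting flow (the column names and values here are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw dataset with a missing value
df = pd.DataFrame({
    "feature_a": [1, 2, 2, 3, 4, 5, 5, 6],
    "feature_b": [10.0, 9.0, 8.0, None, 6.0, 5.0, 4.0, 3.0],
    "target":    [0, 0, 1, 0, 1, 1, 0, 1],
})

# Data cleaning: drop exact duplicate rows, fill the empty cell
df = df.drop_duplicates()
df["feature_b"] = df["feature_b"].fillna(df["feature_b"].mean())

# Splitting: hold out a test set; stratify=y keeps the class ratio,
# which helps when the target is imbalanced
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```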

Lil Bit of Linear Regression..🙂

That wasn't sufficient, so I also read some prerequisites and introductory information about Linear Regression: the assumptions involved, the loss function used, the model evaluation metrics used, the regularization techniques used, the types of linear regression, and much more. I asked ChatGPT for some basic info about linear regression to keep in mind, and it responded with a nice answer that goes as follows:

  1. Purpose: Linear regression is a supervised learning algorithm used for predicting a continuous target variable based on one or more independent variables. It assumes a linear relationship between the independent variables and the target variable.

  2. Linearity Assumption: Linear regression assumes that the relationship between the independent variables and the target variable can be represented by a linear equation. However, it can also capture non-linear relationships by incorporating transformations of the independent variables.

  3. Model Interpretability: Linear regression offers good interpretability as it provides coefficients that quantify the impact of each independent variable on the target variable. These coefficients can indicate the direction and magnitude of the relationship.

  4. Loss Function: The most common loss function used in linear regression is the Mean Squared Error (MSE). The objective is to minimize the sum of the squared differences between the predicted and actual target values.

  5. Model Evaluation: To assess the performance of a linear regression model, various metrics can be used, such as the coefficient of determination (R-squared), root mean squared error (RMSE), mean absolute error (MAE), etc. These metrics provide insights into how well the model fits the data and the extent of its predictive accuracy (a short code sketch follows this list).

  6. Assumptions:

    • Linearity: The relationship between independent variables and the target variable is linear.

    • Independence: The observations in the dataset are independent of each other.

    • Homoscedasticity: The variance of the residuals (the differences between predicted and actual values) is constant across all levels of the independent variables.

    • Normality: The residuals are normally distributed, indicating that the errors follow a normal distribution.

  7. Feature Engineering: Similar to other regression algorithms, feature engineering can play a vital role in linear regression. It involves selecting relevant features, transforming variables, handling outliers, dealing with missing data, and incorporating domain knowledge to improve model performance.

  8. Multicollinearity: Linear regression assumes that the independent variables are not highly correlated with each other. Multicollinearity, where two or more independent variables are highly correlated, can impact the interpretability and stability of the model. Techniques such as variance inflation factor (VIF) analysis or dimensionality reduction methods can help mitigate multicollinearity.

  9. Regularization Techniques: In cases where overfitting is a concern or when dealing with high-dimensional data, regularization techniques like Ridge regression or Lasso regression can be employed. These techniques add penalty terms to the loss function to control the complexity of the model and prevent overfitting.

  10. Model Diagnostics: It's crucial to assess the assumptions of linear regression and perform diagnostic checks on the model. These checks include examining residual plots, detecting outliers or influential data points, and verifying the fulfillment of assumptions.
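
To tie a few of these points together (the loss, the evaluation metrics, and Ridge regularization), here's a minimal sketch on synthetic data, where the true slope and intercept are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Synthetic data: y = 3x + 5 plus Gaussian noise (invented for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 1, size=100)

# Fit ordinary least squares; the coefficients give interpretability (point 3)
model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Evaluate with the metrics from point 5
pred = model.predict(X)
print("R^2 :", r2_score(y, pred))
print("RMSE:", np.sqrt(mean_squared_error(y, pred)))
print("MAE :", mean_absolute_error(y, pred))

# Ridge adds an L2 penalty to the loss (point 9) to control model complexity
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge slope:", ridge.coef_[0])
```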

Coming to the types of linear regression, there are majorly three: Simple Linear Regression, Multiple Linear Regression, and Polynomial Regression.

  1. Simple Linear Regression: In simple linear regression, there is a single independent variable (predictor) and a single dependent variable (target). The relationship between the predictor variable (x) and the target variable (y) is assumed to be linear. The equation of a simple linear regression model can be represented as y = b0 + b1 * x, where y is the target variable, x is the predictor variable, b0 is the y-intercept (bias term), and b1 is the coefficient or slope of the line.

The objective is to estimate the values of b0 and b1 that minimize the sum of squared differences between the predicted and actual target values.
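
As a quick sketch with made-up points, the least-squares estimates of b0 and b1 can even be computed by hand with NumPy:

```python
import numpy as np

# Made-up points roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# Least-squares estimates:
# b1 = cov(x, y) / var(x),  b0 = mean(y) - b1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"y = {b0:.2f} + {b1:.2f} * x")  # close to y = 1.03 + 2.01 * x
```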

  2. Multiple Linear Regression: Multiple linear regression extends simple linear regression to include multiple independent variables (predictors). The equation is expanded to account for the additional predictors: y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn, where x1, x2, ..., xn are the independent variables, and b1, b2, ..., bn are the corresponding coefficients.

The goal remains the same: to estimate the coefficients that minimize the sum of squared differences between the predicted and actual target values. Multiple linear regression allows for modeling more complex relationships between the predictors and the target variable.

  3. Polynomial Regression: Polynomial regression is an extension of linear regression that allows for non-linear relationships by introducing polynomial terms of the predictors. The equation takes the form: y = b0 + b1 * x + b2 * x^2 + ... + bn * x^n, where x^2, x^3, ..., x^n represent the squared, cubed, and higher-order terms of the predictor variable x.

Polynomial regression can capture non-linear patterns by fitting curves instead of straight lines. The coefficients b0, b1, ..., bn are estimated through a process similar to linear regression, aiming to minimize the difference between predicted and actual target values.
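
Here's a small sketch of polynomial regression using scikit-learn's PolynomialFeatures on synthetic data (the true coefficients are invented for illustration); the generated x^2 term lets an ordinary linear model fit the curve:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curve: y = 0.5x^2 - x + 2 plus noise
rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(0, 0.3, size=50)

# Degree-2 polynomial features turn [x] into [x, x^2]; the model stays linear
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
lin = model.named_steps["linearregression"]
print("b1, b2:", lin.coef_)    # should be close to (-1, 0.5)
print("b0:", lin.intercept_)   # should be close to 2
```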

Most of this is known to me, as I'm restarting this journey with the aim of leaving no voids this time. I completed it in a rush before and skipped lots and lots of concepts, which made me a little, actually no, very underconfident about presenting myself to the community. So this time we'll take it to the top gradually and at a decent pace: "NO RUSHING THIS TIME!!". So yeah, that's it for today. Always remember: the urge to rush through topics is 90% caused by comparing yourself to others and losing your calm. So never lose your calm, trust what you are doing, and do things at your pace.

If you're liking this series, do follow and subscribe to the newsletter, and follow my 🐦 Twitter account, where I dump thoughts and share great resources from Twitter that will make you a tech geek and 100x more productive.

Today's notes 📝: Tap Here

Please let me know in the comments how you're enjoying this series and if you'd like me to change something in the blog 🙂.

Happy Coding 👽.