Mirror from Kaggle: Titanic In-Depth Analysis and KNN from Scratch
This Jupyter Notebook provides an in-depth analysis of the Titanic dataset and demonstrates the implementation of the K-Nearest Neighbors (KNN) algorithm from scratch. It includes data loading, exploration, visualization, preprocessing, model building, and submission.
- Libraries: `sqrt` (from `math`), `numpy`, `pandas`, `os`, `matplotlib.pyplot`, `seaborn`, `wordcloud`, `sklearn`, `scipy`.
- Purposes: Data manipulation, statistical analysis, visualization, machine learning, and distance calculations.
- Data Source: Kaggle's Titanic dataset.
- Files: `train.csv`, `test.csv`, `gender_submission.csv`.
- Dataframes: `df_train` and `df_test`.
- Examining data structure, types, and initial insights.
- Key observations on features like `Pclass`, `Sex`, `Ticket`, `Cabin`, `Embarked`, etc.
- Analysis of features' value ranges, uniqueness, and potential transformations.
- Categorical feature handling: Mapping and one-hot encoding.
- Exploratory visualizations for features like `Survived`, `Sex`, `Pclass`, `Age`, `SibSp`, `Parch`, `Fare`, and `Embarked`.
- Correlation matrix to understand feature relationships.
- Imputing missing values and handling NaNs.
- Normalizing and scaling numeric data.
- One-hot encoding of categorical data.
- Preparing train and test datasets for the model.
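
The preprocessing steps above could look roughly like the sketch below. It is not the notebook's actual code: the `preprocess` helper, the median/mode imputation strategy, the min-max scaling choice, and the toy values are all assumptions made for illustration. Only the column names come from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Toy rows using Titanic column names; the values are made up for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

def preprocess(df, numeric_cols, categorical_cols):
    """Hypothetical helper: impute, scale, and one-hot encode a copy of df."""
    out = df.copy()
    # Impute: median for numeric columns, most frequent value for
    # categorical columns (an assumed strategy, not the notebook's).
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    # Min-max scale numeric columns to [0, 1] so that no single feature
    # dominates a Euclidean distance.
    for col in numeric_cols:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    # One-hot encode the categorical columns as float indicator columns.
    return pd.get_dummies(out, columns=categorical_cols, dtype=float)

features = preprocess(df, ["Age", "Fare"], ["Sex", "Embarked"])
```

When separate train and test sets are involved, the imputation and scaling statistics would normally be computed on the training set and reused for the test set, so that both end up on the same scale.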
- Custom implementation of KNN algorithm.
- Euclidean distance calculation.
- Vectorized approach for performance optimization.
- Finding the best `k` value through brute force.
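
As a rough illustration of the KNN steps above, here is a minimal NumPy sketch. It is not the notebook's implementation: the function names (`euclidean_distances`, `knn_predict`, `best_k`), the tie-breaking rule, the use of a held-out validation split for choosing `k`, and the toy data are assumptions.

```python
import numpy as np

def euclidean_distances(A, B):
    """Pairwise Euclidean distances between rows of A (m, d) and rows of B (n, d)."""
    # Broadcasting builds an (m, n, d) array of differences in one step,
    # avoiding Python loops; memory use grows with m * n * d.
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def knn_predict(X_train, y_train, X_query, k):
    """Predict a 0/1 label per query row by majority vote of its k nearest neighbours."""
    dists = euclidean_distances(X_query, X_train)
    # Indices of the k smallest distances for each query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # shape (n_query, k)
    # For binary labels, a mean of at least 0.5 means half or more voted 1
    # (ties go to class 1, an arbitrary choice).
    return (votes.mean(axis=1) >= 0.5).astype(int)

def best_k(X_train, y_train, X_val, y_val, candidates):
    """Brute-force search: score every candidate k on a validation set."""
    scores = {
        k: float((knn_predict(X_train, y_train, X_val, k) == y_val).mean())
        for k in candidates
    }
    return max(scores, key=scores.get), scores

# Toy data: two well-separated clusters (made up for illustration).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_val = np.array([[0.5, 0.5], [5.5, 5.5]])
y_val = np.array([0, 1])

k, scores = best_k(X_train, y_train, X_val, y_val, candidates=[1, 3, 5])
```

The broadcasting version trades memory for speed. For larger datasets, computing distances in batches of query rows keeps the intermediate array small.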
- Preparing the submission file based on the KNN model predictions.
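
A submission file for this competition has two columns, `PassengerId` and `Survived`, matching the layout of `gender_submission.csv`. The sketch below shows one way to build it with pandas. The IDs and predictions are placeholder values; in the notebook they would come from `df_test` and the KNN model.

```python
import io
import pandas as pd

# Placeholder values for illustration only.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Written to an in-memory buffer here; passing a path such as
# "submission.csv" to to_csv would write a file instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Setting `index=False` matters because the DataFrame index would otherwise be written as an extra unnamed column.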