Learning Through My First Kaggle Projects

A short reflection on my first month exploring Kaggle and learning through real data.

From Learning to Practice

It’s been seven weeks since my last post. During this time, I finished the Google Advanced Data Analytics Professional Certificate, an eight-course series. It was a wonderful journey that led me step by step into the world of data science. The course content is thoughtfully designed, the instructors are superb, and the learning curve feels just right. I cannot thank Google enough, and I feel very lucky to have found this series of courses.

After finishing the certificate, I wanted more exposure to real data analysis and more hands-on experience, so I joined Kaggle, a community for data scientists and machine learning engineers. (My Kaggle profile can be found here: Kaggle.) Within my first month, I tried two competitions, both from the Playground Series, which features month-long tabular data competitions.

With some beginner’s luck, my code was upvoted and reused, and I earned one silver and one bronze code medal. Both projects involve making probability predictions for classification tasks and are evaluated by ROC AUC score. The February theme was predicting heart disease, and the March task was predicting customer churn for a telecom company.

Working with a Blank Canvas

These projects were my first experience of working with a completely blank canvas in data analysis, brush in hand. My biggest challenge when writing a notebook is forming a clear overall picture and designing a coherent pipeline.

I typically apply three to four models, each requiring slightly different feature engineering. This means I need to think in advance about what should be handled globally and what should remain model-specific. At the same time, I must ensure a proper validation strategy—such as train-test split or Stratified K-Fold—perform hyperparameter tuning with cross-validation, and avoid data leakage throughout the entire process.
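The validation setup described above can be sketched as follows. This is a minimal illustration on synthetic data with a logistic-regression stand-in, not my actual competition pipeline; keeping the preprocessing inside the pipeline is what prevents leakage from the validation folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a tabular competition dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Scaling inside the pipeline is fit only on each training fold,
# so no information from the validation fold leaks into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aucs = []
for train_idx, val_idx in skf.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[val_idx])[:, 1]
    aucs.append(roc_auc_score(y[val_idx], probs))

print(f"Mean CV ROC AUC: {np.mean(aucs):.3f}")
```

The same fold indices can then be reused across models so that their out-of-fold scores stay comparable.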

Simultaneously, I pay attention to local details—such as formatting plots clearly, explaining how tree-based models behave (for example, whether they are shallow, learn quickly, or split easily), and interpreting results through coefficients or feature importance.

Along the way, I picked up new tools, including CatBoost, RandomizedSearchCV (I later found that Optuna works more efficiently, but I haven’t had the chance to apply it yet), and StratifiedKFold; strengthened my feature engineering; and experimented with simple ensembling techniques such as averaging model outputs.
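The simplest of those ensembling techniques, averaging predicted probabilities, can be sketched like this; the two scikit-learn models here are stand-ins for the boosting models I actually used, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

models = [GradientBoostingClassifier(random_state=0),
          RandomForestClassifier(random_state=0)]
preds = []
for m in models:
    m.fit(X_tr, y_tr)
    preds.append(m.predict_proba(X_val)[:, 1])

# Simple ensembling: average the predicted probabilities of all models.
blend = np.mean(preds, axis=0)
print(f"Blend ROC AUC: {roc_auc_score(y_val, blend):.3f}")
```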

I still remember the first time I opened the Titanic project (a 101-level training project on Kaggle) six months ago. The code felt distant and difficult to understand. But now, everything on Kaggle has started to make sense. I am solving real data problems independently and have completed notebooks that have been upvoted by several Experts and Masters.

Experimenting and Learning

Kaggle is a collaborative community where many participants share their code on competition pages. It’s interesting to draw inspiration from others’ ideas and run your own experiments.

For example, in the churn project, I explored a two-step modeling approach by adding a correction term to an initial model’s prediction.

The idea is as follows: I first train a model such as XGBoost or CatBoost to obtain predicted probabilities. Then, I compute the residuals and train a Ridge model on those residuals. Finally, I adjust the original prediction using the following transformation:

final = expit(logit(model1_prob) + α · ridge_resid_pred)

Here, logit and expit are standard transformations between probability space and log-odds space, and α is a tunable parameter controlling how much correction to apply. The key point is that ridge_resid_pred is not a probability, but a correction term that can be either positive or negative.
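The two-step idea can be sketched as follows, on synthetic data and with GradientBoostingClassifier standing in for XGBoost/CatBoost; all names and parameter values here are illustrative.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=1)

# Step 1: base model produces predicted probabilities.
base = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
p_tr = base.predict_proba(X_tr)[:, 1]
p_val = base.predict_proba(X_val)[:, 1]

# Step 2: fit Ridge on the base model's residuals (which can be negative).
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr - p_tr)
resid_pred = ridge.predict(X_val)

# Step 3: apply the correction in log-odds space; alpha tunes its strength.
alpha = 0.5
eps = 1e-6  # clip probabilities so logit stays finite
final = expit(logit(np.clip(p_val, eps, 1 - eps)) + alpha * resid_pred)
```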

As it turned out, this second model on the residual did not provide a meaningful improvement, no matter how α was tuned. At first glance, the idea of exploiting residuals felt quite appealing. But after stepping back, I realized that this approach was largely redundant. For boosting-based models such as XGBoost, residual-like corrections are already learned stage by stage during training. In that sense, I was trying to manually add something the model had already done internally.

Still, I found the process valuable. Writing the code, testing the idea, and observing the results helped me develop a deeper understanding of how these models work. Rather than simply accepting a technique because it sounds reasonable, I learned to validate it through experimentation.

I also experimented with frequency encoding and seed averaging afterward. The improvements were marginal given the performance of my existing models.

This, again, was a useful reminder that added complexity does not always translate into better results—especially when the baseline is already strong.
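Frequency encoding itself is almost a one-liner in pandas; a toy sketch (the column name and values are made up for illustration):

```python
import pandas as pd

# Toy categorical column; frequency encoding maps each category
# to its relative frequency in the training data.
df = pd.DataFrame({"contract": ["monthly", "yearly", "monthly",
                                "monthly", "two_year"]})
freq = df["contract"].value_counts(normalize=True)
df["contract_freq"] = df["contract"].map(freq)
print(df)
```

The resulting numeric column can then be fed to any model; "monthly" maps to 0.6 here because it appears in three of the five rows.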

Balancing Learning and Direction

Another topic that draws me into deeper thought is how I should best direct my efforts toward my career goal. My current goal is to start a career in data analysis, where the focus is on practical model application and interpretation.

However, I often find myself drawn to more advanced techniques. I have started to notice a tension between data analysis and machine learning engineering, and between model interpretability and leaderboard-driven optimization. There are many appealing ideas I would like to experiment with, even though they may not be necessary for my future work.

For example, inspired by a leaderboard-winning solution, I was eager to design a pipeline that combines multiple models built on different feature engineering strategies, generates out-of-fold (OOF) predictions, and ensembles them using a Ridge meta-model. Such an approach could potentially improve my leaderboard score. However, at some point, I realized that designing this kind of pipeline is no longer just about building a model—it becomes the design of an entire system. I also began to recognize that an overly complex modeling pipeline can be difficult to apply in real business settings, where interpretability and transparency are often essential.
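A minimal sketch of such an OOF-plus-meta-model pipeline, with scikit-learn models standing in for the diverse base models of a real leaderboard solution:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

# Out-of-fold predictions: every training row is predicted by a model
# that never saw that row during fitting, so the meta-model's inputs
# are leakage-free.
base_models = [GradientBoostingClassifier(random_state=2),
               RandomForestClassifier(random_state=2)]
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The Ridge meta-model learns how to weight the base predictions.
meta = Ridge(alpha=1.0).fit(oof, y)
print("meta-model weights:", meta.coef_)
```

Even this toy version hints at the system-design burden: fold indices, base-model variants, and meta-features all have to be tracked consistently.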

A similar tension appears in feature engineering. More complex transformations can sometimes improve model performance, but they are often difficult to interpret or explain in a business context. Features that are heavily engineered or abstract may lose their intuitive meaning, making it harder to connect model outputs back to real-world behavior or actionable insights.

After thinking more carefully, I felt that an improvement of 0.001 in ROC AUC is unlikely to justify the loss of clarity that comes with complex model stacking or overly engineered features.

For now, I will continue to focus on a data analysis–oriented approach, while exploring more complex pipelines like multi-model ensembling simply out of curiosity and for learning.

Looking Ahead

I will keep asking myself the following questions along the way:

  • Am I becoming fluent in writing code?
  • Do I truly understand the fundamentals of statistics and business through applying these models?
  • Can I explain what I am doing clearly?

Looking ahead, I plan to:

  • Continue working on a housing price prediction project on Kaggle. It focuses on value prediction, which is different from the classification problems I have worked on so far. It will involve different models, and with around 40 features, there will be more room to explore feature engineering. I expect this project to be especially exciting.
  • Keep practicing pandas and SQL through LeetCode.
  • Complete an A/B testing course on Udacity.
  • Regularly summarize and review what I have learned.

I believe an idea is not fully understood until it can be explained clearly and precisely.

There is still a long way to go, but I feel that I am on the right path.

Learning Slowly: Notes on Regularization in Logistic Regression and a Small K-means Experiment

1. Learning Slowly

Over the past six weeks on Coursera, I’ve finished Google’s Course 5 Regression Analysis and am now halfway through Course 6 Machine Learning. Regression Analysis took longer than I expected, yet Machine Learning feels more basic than I initially anticipated. I’ve also decided to temporarily suspend my study of IBM’s certificate, as Google’s advanced certificate overlaps with it and covers slightly more ground.

I’ve been learning at a slower pace recently for several reasons.

I took an eleven-day vacation to Australia, followed by an additional four to five days of rest. Shortly after returning, I moved from one room to another (south-facing) at home. New furniture was bought, and a new built-in wardrobe was installed, so the whole process took a while. For all the effort, the room is super tidy and cozy now. Both Australia and my new room provide me with plenty of sunshine.

More importantly, I prefer depth over speed in my studies. That mindset inevitably slows things down. I tend to linger on topics that genuinely interest me, sometimes longer than planned.

Occasionally, a single question pulls me into a long chain of thinking. I might explore it through extended conversations, revisiting assumptions, clarifying definitions, and trying to reconcile different explanations. These detours are not always efficient, but they are often where my understanding changes most. One such detour came from learning about regularization in logistic regression. The following section is a light reflection on how my understanding evolved.

2. A Light Discussion on Logistic Regression Regularization

2.1 My First Cognition Gap – The Default Is penalty = ‘l2’

I initially assumed that, unless otherwise stated, statistical software faithfully reflects the mathematical definitions and applies the formulas as written. In that sense, I thought logistic regression’s penalty would default to none (no regularization at all) for a pure model. However, I learned incidentally through lab code that the default setting for LogisticRegression() in scikit-learn is l2, which corresponds to Ridge regularization. The lab doesn’t cover this topic in depth, so I did my own research.
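The default is easy to confirm directly; this snippet just reads scikit-learn’s documented defaults off a freshly constructed estimator:

```python
from sklearn.linear_model import LogisticRegression

# The default configuration already includes L2 (Ridge) regularization,
# with inverse regularization strength C = 1.0.
clf = LogisticRegression()
print(clf.penalty, clf.C)
```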

Unlike linear regression, where regularization is not applied by default, logistic regression is solved by iteratively optimizing the log-loss. During this process, it can run into the issue of separable data, where the best solution corresponds to infinite coefficients.

Data are linearly separable if there exists a vector β and an intercept β₀ such that

\beta_0 + \beta^\top x_i > 0 \quad \text{for all } y_i = 1,
\beta_0 + \beta^\top x_i < 0 \quad \text{for all } y_i = 0.

In other words, a single linear boundary can perfectly separate the two classes.

An extreme one-feature example: for every churned user ($y = 1$), $X = 1$, whereas for every retained user ($y = 0$), $X = 0$. This means that $P(y = 1 \mid X = 0) \approx 0$ and $P(y = 1 \mid X = 1) \approx 1$.

We know that the logistic model takes the form $p = \frac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 X$.

When $X = 0$, we want $p \approx 0$, which drives $\beta_0 \to -\infty$.

When $X = 1$, we want $p \approx 1$, which requires $\beta_0 + \beta_1 \to +\infty$, and therefore $\beta_1 \to +\infty$.
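This runaway behavior is easy to observe numerically. In the small illustrative experiment below, the data are perfectly separable in one feature, and the fitted coefficient keeps growing as the regularization is weakened (larger $C$), chasing the infinite unregularized solution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable one-feature data: X = 0 -> y = 0, X = 1 -> y = 1.
X = np.array([[0.0]] * 20 + [[1.0]] * 20)
y = np.array([0] * 20 + [1] * 20)

# As C grows (weaker penalty), beta1 keeps inflating toward infinity.
coefs = {}
for C in [1, 100, 10000]:
    coefs[C] = LogisticRegression(C=C, max_iter=10000).fit(X, y).coef_[0, 0]
    print(f"C={C:>6}: beta1 = {coefs[C]:.2f}")
```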

Another way of thinking about this situation is to focus on the optimization process itself. When the data are separable and the predictions are already near the asymptotic regions of the sigmoid, even a large change in the betas only marginally improves the loss. This behavior can be seen from the shape of the sigmoid function $\pi(x) = \frac{1}{1 + e^{-x}}$, as well as from $\log \pi(x)$, which contributes to the loss when $y = 1$, and $\log(1 - \pi(x))$, which contributes when $y = 0$. The latter two terms form the log-loss.

Figures: the sigmoid function; the loss contribution when $y = 1$; the loss contribution when $y = 0$.

We want to keep the betas from inflating and exploding while pushing for a better result (lower loss) during the optimization process.

Introducing a penalty to the model is like preinstalling seatbelts in a car. Large beta coefficients get penalized before they explode in the presence of separable or nearly separable data.

2.2 My Second Cognition Gap – Does the Default Penalty Parameter Make Sense?

Understanding why regularization exists naturally led me to a second question: does the default penalty parameter itself make sense?

It is always a good time to do some boring math.

For the logistic model, for each observation $i$,

p_i = P(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-z_i}},

where

z_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}.

The likelihood function over the dataset is

\mathcal{L}(\beta) = \prod_{i=1}^N p_i^{y_i} (1 - p_i)^{1 - y_i},

for binary responses $y_i \in \{0, 1\}$.

Taking the negative log leads to the loss for a single observation,

\ell_i(\beta) = - y_i \log(p_i) - (1 - y_i) \log(1 - p_i).

The loss over the entire dataset, which scikit-learn uses in its mean form, is

J(\beta) = \frac{1}{N} \sum_{i=1}^N \ell_i(\beta).

When adding L2 regularization, the objective becomes the log-loss plus a penalty term:

J(\beta) = \frac{1}{N} \sum_{i=1}^N \ell_i(\beta) + \frac{1}{2CN} \sum_{j=1}^p \beta_j^2,

where $C$ is the inverse regularization strength. In scikit-learn, the default value of $C$ is 1.

While I was initially picking up this concept, the formula ChatGPT originally provided to me missed the $1/N$ factor in the penalty term, which made the penalty appear excessively large. As a result, the penalty and the original loss looked completely out of proportion, and the regularization term seemed arbitrary and dominating. This confused me for quite some time. I spent hours discussing this with GPT, trying to make sense of the formula, until we eventually realized that the original expression was incorrect! (I just wanted to scream at that point, and I probably did.)

After correcting the formula by including the proper $1/N$ scaling, the default parameter choice $C = 1$ started to make sense. It is reasonable and tolerant enough not to hinder overall model performance on generally good data. By “good data,” I mean data with no (near) separation, no extreme multicollinearity, a sufficient sample size relative to the number of features, and coefficients that are stable across resamples.

When the data are good, logistic regression with Ridge and without Ridge produces almost identical results. The regularization term quietly disappears. Regularization mainly matters when the unregularized solution is unstable. In that sense, it really behaves like a seatbelt: always present, but only noticeable when something goes wrong. This behavior differs from Ridge in linear regression, where coefficient shrinkage is present even when the data are well behaved.

I also learned that the parameter $C$ often requires tuning in special cases, and this tuning is typically performed on a logarithmic scale (for example, $C = 0.1 \rightarrow 1 \rightarrow 10$).
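A log-scale search like that can be written with GridSearchCV; the synthetic data, grid, and scoring below are illustrative, not a recommendation for any particular problem:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=3)

# Search C over a logarithmic grid, as is conventional.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-2, 2, 5)},  # 0.01, 0.1, 1, 10, 100
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
```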

Finally, it is worth noting that, in practice, we almost always apply StandardScaler() before fitting the model. Since the scale of the coefficients directly affects the penalty, unscaled features can lead to coefficients being penalized unevenly. For example, without standardization, features with smaller scales may be penalized more heavily because they require larger coefficients.
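One way to see this effect is a small illustrative experiment on synthetic data: with StandardScaler in the pipeline, rescaling a feature’s units leaves the fitted predictions unchanged, while without scaling the L2 penalty reacts to the rescaling and the predictions shift.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=4)
X_rescaled = X.copy()
X_rescaled[:, 0] /= 100  # same information, 100x smaller units

def probs(model, data):
    return model.fit(data, y).predict_proba(data)[:, 1]

# With StandardScaler, changing a feature's units does not change
# the fitted model's predictions.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
p_scaled_a = probs(pipe, X)
p_scaled_b = probs(pipe, X_rescaled)
print("scaled models agree:", np.allclose(p_scaled_a, p_scaled_b, atol=1e-6))

# Without scaling, the shrunken feature needs a ~100x larger coefficient,
# which the penalty suppresses, so the predictions change.
bare = LogisticRegression(max_iter=1000)
p_raw_a = probs(bare, X)
p_raw_b = probs(bare, X_rescaled)
print("unscaled models agree:", np.allclose(p_raw_a, p_raw_b, atol=1e-3))
```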

3. A Small K-means Color Quantization Experiment

Not all learning moments need to be heavy. As another part of my recent study, I experimented with k-means clustering through a simple color quantization project, which I found quite interesting. In this exercise, we ignore the spatial position of each pixel in the image and group pixels into $k$ representative colors based on their RGB values. Each pixel is then replaced by its cluster’s representative color to approximate the original image.

Below are my current profile picture (taken in Australia), the centroid colors learned by k-means (ordered by brightness) with $k = 9$, and the corresponding quantized image (yes, it comprises only 9 colors).

Figures: Centroid colors

Original image and quantized image
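The whole procedure fits in a few lines; here is a sketch on a synthetic random image standing in for the real photo:

```python
import numpy as np
from sklearn.cluster import KMeans

# A synthetic stand-in image: height x width x 3 RGB values in [0, 255].
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3)).astype(float)

# Ignore pixel positions: cluster the flat list of RGB triples.
pixels = img.reshape(-1, 3)
k = 9
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with its cluster's centroid color.
quantized = km.cluster_centers_[km.labels_].reshape(img.shape)
print("distinct colors after quantization:",
      len(np.unique(quantized.reshape(-1, 3), axis=0)))
```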

For now, this is where my understanding stands.