Tech Notebook: A Machine Learning & Deep Learning Journey

The actual blog (with code) is at: https://xianzhiwang1.github.io/tech-blog/

I maintain a personal technical blog where I write up what I learn as I work through machine learning and deep learning, both inside coursework and on my own. This project grows out of the CS0451 Machine Learning class I took at Middlebury College. The overarching theme of this tech blog is understanding by building: each post takes a single idea, walks through the math behind it, implements it in Python (usually from scratch first, then alongside a library reference), and then puts the implementation to work on a real dataset. The site is built with Quarto and published from a stack of Jupyter notebooks, so every post is a fully reproducible mix of derivations, explanations, code, plots, and embedded source files.

The blog is split roughly into three threads — classical machine learning implemented from scratch, applied projects on real-world data, and a deeper dive into modern deep learning. Below I summarize a few of the posts and what I learned while writing them. The full posts, including the code, the plots, and the appendix containing the sources, live at the link above.


Classical Machine Learning, from scratch

A large part of the blog is a sequence of “from scratch” implementations of foundational classification and regression algorithms, written with nothing heavier than numpy and matplotlib. Every post follows the same arc: state the empirical risk minimization problem, derive the loss and its gradient (or Hessian) by hand, implement the optimizer in a small Python class, and finally validate the result by comparing predictions against the corresponding model in scikit-learn. Treating sklearn as a reference solution rather than the answer has been one of the most useful study habits I have picked up — when my hand-rolled model agrees with LogisticRegression or SVC to within numerical tolerance, I know I have the correct implementation.

The Perceptron

The journey starts with the perceptron, the canonical first algorithm in any introductory machine learning class (for me, that class was CS0451 at Middlebury College). The post walks through the geometric intuition (a hyperplane that separates two linearly separable point clouds), the update rule derived from the 0–1 loss, and a small training loop. I generate synthetic, linearly separable data with sklearn.datasets and watch the decision boundary rotate into place. This post is mostly about getting comfortable with the feature matrix $X \in \mathbb{R}^{n\times p}$, the target vector $y$, and the weight vector $w$ — vocabulary that every later post reuses.
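As a rough sketch of what that training loop looks like (the function name and details here are illustrative, not the blog's exact code):

```python
import numpy as np

def perceptron_fit(X, y, max_steps=1000):
    """Classic perceptron on labels y in {-1, +1}."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb the bias into w
    w = np.zeros(X.shape[1])
    for _ in range(max_steps):
        # a point is misclassified when y_i * <w, x_i> <= 0
        scores = y * (X @ w)
        wrong = np.where(scores <= 0)[0]
        if wrong.size == 0:          # data separated: done
            break
        i = np.random.choice(wrong)  # perceptron update on one mistake
        w += y[i] * X[i]
    return w
```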

Gradient Descent and Logistic Regression

The follow-up post replaces the 0–1 loss with the logistic loss and introduces gradient descent as a general-purpose optimization algorithm. The bulk of this post is a careful derivation of the gradient of the empirical risk

$$ L(w) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(\langle w, x_i \rangle, y_i\bigr), $$

followed by a clean numpy implementation that updates the weights one step at a time. I also experiment with momentum and convergence tolerance to get a feel for what the hyperparameters actually do. This same gradient machinery resurfaces in almost every subsequent post.
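A minimal numpy sketch of that loop, with momentum and a convergence tolerance, might look like this (labels assumed in $\{0, 1\}$; the names are mine, not the blog's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient(w, X, y):
    # gradient of the empirical logistic risk for labels y in {0, 1}
    return X.T @ (sigmoid(X @ w) - y) / X.shape[0]

def gradient_descent(X, y, alpha=0.1, beta=0.9, tol=1e-8, max_iter=100_000):
    w = np.zeros(X.shape[1])
    w_prev = w.copy()
    for _ in range(max_iter):
        grad = logistic_gradient(w, X, y)
        # momentum: reuse part of the previous step's direction
        w, w_prev = w + beta * (w - w_prev) - alpha * grad, w
        if np.linalg.norm(grad) < tol:  # converged
            break
    return w
```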

Linear Regression and Kernel Logistic Regression

Two more posts continue the from-scratch theme. The linear regression post derives ordinary least squares from the squared-error loss $\ell(\hat y, y) = (\hat y - y)^2$ and solves it both with the closed-form normal equations and with gradient descent. The kernel logistic regression post then asks what happens when the data is no longer linearly separable: rather than fit a straight line, I replace the raw feature vector $x_i$ with the kernelized feature vector $\kappa(x_i) = (k(x_1, x_i), \dots, k(x_n, x_i))^\top$ and let a positive definite kernel pick out the right notion of similarity. The RBF kernel makes non-linear decision boundaries fall out naturally from what is still, under the hood, logistic regression.
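For a flavor of those two posts, here is a sketch of the normal-equations solve and an RBF kernel matrix in numpy (assuming a full-rank design matrix; the function names are illustrative):

```python
import numpy as np

def least_squares(X, y):
    # normal equations: w = (X^T X)^{-1} X^T y, via a linear solve
    return np.linalg.solve(X.T @ X, X.T @ y)

def rbf_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# kernelized features: row i of K = rbf_kernel(X, X) is
# kappa(x_i) = (k(x_1, x_i), ..., k(x_n, x_i)); logistic regression then
# runs unchanged on K instead of X.
```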

Newton–Raphson and Support Vector Machines

The next two posts move beyond first-order optimization. The Newton–Raphson post derives the Hessian of the logistic loss and implements the classic second-order update $w \leftarrow w - H^{-1} \nabla L(w)$, which converges in a handful of iterations on well-conditioned problems — a striking contrast to the many-thousand-iteration runs of plain gradient descent in the earlier post. I compare my implementation against sklearn.linear_model.LogisticRegression on the same data and discuss when Newton-style methods are and are not the right tool.
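A compact sketch of that second-order loop (labels in $\{0, 1\}$; solving the linear system rather than forming $H^{-1}$ explicitly):

```python
import numpy as np

def newton_logistic(X, y, max_iter=20, tol=1e-10):
    """Newton-Raphson for logistic regression, labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
        grad = X.T @ (p - y) / len(y)            # gradient of the logistic risk
        H = (X.T * (p * (1 - p))) @ X / len(y)   # Hessian: X^T diag(p(1-p)) X / n
        step = np.linalg.solve(H, grad)          # solve H step = grad
        w -= step
        if np.linalg.norm(step) < tol:           # quadratic convergence kicks in fast
            break
    return w
```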

The support vector machine post is the most involved of the classical implementations. I derive the hinge loss $\xi_i = \max(1 - y_i \hat y_i, 0)$ and the regularized SVM objective, work through the subgradient with respect to $w$, and implement subgradient descent with a $1/(\lambda j)$ learning rate schedule. The final model is again validated against sklearn.svm.SVC with a linear kernel, and the decision boundary is visualized with mlxtend.plotting.plot_decision_regions.
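In the spirit of that post, a full-batch subgradient loop with that schedule might look like this (a sketch, with labels in $\{-1, +1\}$ and illustrative names):

```python
import numpy as np

def svm_subgradient(X, y, lam=0.01, n_steps=10_000):
    """Subgradient descent on the regularized hinge loss; y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for j in range(1, n_steps + 1):
        eta = 1.0 / (lam * j)                      # the 1/(lambda * j) schedule
        margins = y * (X @ w)
        active = margins < 1                       # points with nonzero hinge loss
        # subgradient of (1/n) sum_i max(1 - y_i <w, x_i>, 0) + (lam/2) ||w||^2
        g = lam * w - (X[active].T @ y[active]) / n
        w -= eta * g
    return w
```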


Unsupervised Learning with Linear Algebra

A separate post leaves the supervised setting behind and uses linear algebra directly. The first half is low-rank image compression with the SVD: a grayscale image is treated as a matrix, factored into $U \Sigma V^\top$, and reconstructed using only the top $k$ singular triplets. Watching the reconstruction sharpen as $k$ grows is a wonderful illustration of why the singular value decomposition is the right object for “best $k$-dimensional approximation.”
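The core of that computation is only a few lines of numpy (a sketch, assuming a 2-D grayscale array):

```python
import numpy as np

def compress(image, k):
    """Best rank-k approximation of a grayscale image via the SVD."""
    U, S, Vt = np.linalg.svd(image, full_matrices=False)
    # keep only the top-k singular triplets
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
```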

The second half is spectral community detection on a graph. I build the graph Laplacian, take its smallest non-trivial eigenvectors with numpy.linalg, and cluster the resulting low-dimensional embedding with $k$-means. The post connects the algebraic story (eigenvectors of the Laplacian) to the combinatorial one (relaxations of the normalized cut), and shows the clusters that emerge on a synthetic graph.
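A minimal sketch of that pipeline, using the unnormalized Laplacian and sklearn's $k$-means (the post itself may make different choices, e.g. a normalized Laplacian):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_communities(A, n_clusters=2):
    """Cluster graph nodes from a symmetric adjacency matrix A."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                  # combinatorial graph Laplacian
    vals, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    # skip the trivial constant eigenvector; embed with the next ones
    embedding = vecs[:, 1:n_clusters]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```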


Applied Projects on Real Data

The blog also contains a few longer applied posts that move beyond synthetic data.

Auditing Allocative Bias

In the bias auditing post I use the American Community Survey PUMS data for Indiana in 2018, downloaded with the folktables package, to predict whether an individual is employed. The technical work is straightforward — a logistic regression from sklearn and confusion matrices from sklearn.metrics — but the interesting part is the bias that the confusion matrices reveal. The post discusses false-positive and false-negative rates across protected groups, and the limits of any single fairness metric. It is a post I return to when I want to remember that a high test accuracy is not the same as a model behaving well.
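A per-group audit of that kind reduces to a few lines around sklearn.metrics.confusion_matrix; a sketch, assuming binary labels and that both classes appear in every group:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def rates_by_group(y_true, y_pred, group):
    """False-positive and false-negative rates within each protected group."""
    for g in np.unique(group):
        mask = group == g
        # labels=[0, 1] pins the matrix layout to (tn, fp, fn, tp)
        tn, fp, fn, tp = confusion_matrix(
            y_true[mask], y_pred[mask], labels=[0, 1]
        ).ravel()
        print(f"group {g}: FPR = {fp / (fp + tn):.3f}, FNR = {fn / (fn + tp):.3f}")
```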

Final Project: Incorporation in Late Imperial Russia

The ML final project is a CS0451 joint write-up with a classmate at Middlebury College that applies classification methods to a historical economics dataset: factory-level records from Late Imperial Russia (1894–1908), digitized by the historian Amanda Gregg (Middlebury College) and distributed by the AEA. We ask whether a factory’s observable characteristics — productivity, ownership structure, location, year — can predict whether it incorporates. The dataset is heavily imbalanced (most factories never incorporated), which makes it a natural testbed for the fairness and class-imbalance concepts introduced in the earlier audit post. The post combines pandas for cleaning, the from-scratch Newton–Raphson logistic regression from earlier in the blog, and sklearn baselines, and the final discussion ties the model’s coefficients back to the economic-history literature.

Classifying Palmer Penguins

The Palmer Penguins post is a shorter, more cheerful piece that uses the well-known three-species penguin dataset and standard sklearn pipelines — feature scaling, train/test splits, and a comparison across logistic regression, decision trees, and random forests — to classify penguins as Adelie, Chinstrap, or Gentoo. It is a nice exercise after the more theoretical posts and a good demonstration of how quickly one can go from raw data to a working classifier with the standard scientific Python stack.
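A sketch of that kind of comparison (here loading the dataset through seaborn, which is one convenient source; the post may load it differently):

```python
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

penguins = sns.load_dataset("penguins").dropna()
X = penguins[["bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g"]]
y = penguins["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scale features, fit, and compare test accuracy across the three models
for model in (LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(),
              RandomForestClassifier()):
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    print(type(model).__name__, pipe.score(X_test, y_test))
```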


Deep Learning: Denoising Diffusion Probabilistic Models

The most recent post grew out of a graduate-level deep learning course: a from-scratch implementation of the Denoising Diffusion Probabilistic Model (DDPM) of Ho, Jain, and Abbeel (2020), written in PyTorch and trained on MNIST. The post is structured as a self-contained derivation followed by a code walkthrough.

On the math side I work through the forward Gaussian Markov chain $q(x_t \mid x_0)$, the closed-form marginal that lets us sample any timestep directly, the reverse process $p_\theta(x_{t-1} \mid x_t)$, and the noise-prediction reparameterization that collapses the variational lower bound into a simple regression objective on $\epsilon_\theta$.
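For reference, the two equations that carry most of the weight in that derivation (in Ho et al.'s notation, with $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s \le t} \alpha_s$) are the closed-form marginal

$$ q(x_t \mid x_0) = \mathcal{N}\bigl(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\bigr) $$

and the simplified noise-prediction objective

$$ L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\Bigl[\, \bigl\lVert \epsilon - \epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\ t\bigr) \bigr\rVert^2 \,\Bigr]. $$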

On the implementation side I walk through every component of a time-conditional U-Net: residual blocks with group normalization, self-attention at the lower-resolution feature maps, sinusoidal timestep embeddings projected through a small MLP, and a linear $\beta$ schedule. The training loop is a few lines of PyTorch built on top of torch.utils.data.DataLoader and the standard MNIST loader from torchvision. After training, the post shows samples drawn by running the reverse process from pure Gaussian noise, and discusses the trade-offs between the number of diffusion steps, sample quality, and wall-clock sampling time.
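As a sketch of what one such training step boils down to (assuming a noise-prediction model, an optimizer, and a precomputed tensor alpha_bar holding the cumulative products $\bar\alpha_t$; the names are mine, not the post's exact code):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar, T=1000):
    """One DDPM step: sample t, noise x0 in closed form, regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timesteps
    eps = torch.randn_like(x0)                                 # target noise
    ab = alpha_bar[t].view(-1, 1, 1, 1)                        # broadcast over C,H,W
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps               # closed-form q(x_t | x_0)
    loss = F.mse_loss(model(x_t, t), eps)                      # predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```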

This post is where the threads of the rest of the blog come together: the loss-and-gradient discipline from the classical posts, the linear-algebra fluency from the unsupervised post, and the willingness to read and reproduce a recent paper.


Tooling and Implementation Notes

A few notes on how the blog is put together:

- The site is generated with Quarto from a stack of Jupyter notebooks, so each post renders its derivations, code, plots, and embedded source files from a single notebook.
- The from-scratch posts use nothing heavier than numpy and matplotlib; scikit-learn appears only as a reference implementation to validate against.
- The applied posts use the standard scientific Python stack (pandas, sklearn, folktables, mlxtend), and the deep-learning post is written in PyTorch with torchvision's MNIST loader.
- Every post includes its source files in an appendix at the bottom, so the whole pipeline is reproducible.


A work in progress

This blog is something I keep expanding as I move further through coursework and side projects, and the rough plan is to keep filling in the gaps: more deep-learning posts (transformers, RNNs, score-based models), more rigorous experimental write-ups, and more applied projects on real datasets. If you want to dig into the derivations, see the plots, or read the actual implementations (every post has its source code in an appendix at the bottom), please head over to the live site:

https://xianzhiwang1.github.io/tech-blog/

Thanks for reading.