The purpose of approximation is finding optimal point \(x^*\) i.e. \(\nabla F(x^*) = 0\). We need a step/search direction \(\Delta x\) and step size \(t\). Taylor approximation has polynomial arguments that is a step and parameters of derivatives at the start point. The first degree of Taylor approximation has one adding term from start point \((x_0, F(x_0))\). The adding term \(\nabla F(x) \Delta x\) is consistent with a parameter (gradient \(\nabla F(x)\)) and a argument (step \(\Delta x\)).

Singular vector decomposition

Bases are the central idea of linear algebra. An invertable square matrix has eigenvectors. A symetric matrix has orthogonal eigenvectors with non-negative eigenvalues, i.e. positive semidefinite. A matrix has two types of singular vectors, left and right signular vectors, \(A=U\Sigma V^{T}\). When we think the matrix \(A\) is data points of rows \(A=U\Sigma V^{T}\) like data table, The right singular vectors \(V\) build bases, the sigular values \(\Sigma\) are magnitude of the bases and the left singular values \(U\) becomes new data points on new bases.

Low rank matrix and compressed sensing

This is a note for part III of Linear Algebra and learning from data, Gilbert Strang The main themes are sparsity (Low rank), Information theory (compression), and of course linear transformation. A full rank matrix is inefficient. Finding low lank matrix which is close with original matrix can save computation. The rank one matrix \(uv^{T}\) is a unit of a matrix. The full rank matrix can be decomposed by sum of rank one matrices i.

Steady state equilibrium

The meaning of \(A^{T}\) Steady state equilibrium Graph Laplacian matrix \(A^{T}CA\) Differential equation and Laplacian matrix Derivative is a graph without branch. Row space and column space are dual. \(A\) and \(A^{T}\) are dual. ref) Linear algebra and learning from data, Part IV, Gilbert Strang

Differential equations and Fourier transformation

Differential equations describe the change of state. The change relates to the state. The solutions of the differential equations are the status equations. The initial conditions set the time \(t\) and status \(y\). The boundary conditions are the value of boundary \(y_0\) and \(y_1\). \(dy \over dt\) \(= ay + q(t)\) starting from \(y(0)\) at \(t = 0\). inital conditions \(t = 0\) and \(y=1\) \(q(t)\) is a input and \(y(t)\) is a response.


Information relates to uncertainty. The Shannon information content of an outcome \(x\) is \(h(x)=-log_{2}P(x)\). The rare event has larger information than a common event. The unit of information is a bit (binary digit). Coding is a mapping from an outcome of an ensemble to binary digits \(\{0,1\}^+\). A symbol code is a code for a single ensemble. A block code is a code for a sequence ensemble. A set of sequences of the ensemble has a typical subset.

Taylor series

\[ f(x) = \sum_{k=0}^\infty c_k x^k = c_0 + c_1 x + c_2 x^2 + \dotsb. \] This is an approximation that is a function of h and derivatives of \(f(x)\) are elements of parameters. \(f(x \pm h) = f(x) \pm hf'(x) + \frac{h^2}{2}f''(x) \pm \frac{h^3}{6}f'''(x) + O(h^4)\) Let’s think about \(\sin(x)\). \[ f(x) = \sin(x) \ f(0) = 0, f'(x)=\cos(x)\ f'(0)=1, f''(x)=-\sin(x)\ f''(0)=0 \] Thus, \[\begin{align*} \sin(x) &= 0 + \frac{1}{1!


This is a note for Elements of information theory of Thomas M. Cover. The entropy (\(H\)) is a measure of uncertainty of a variable which is the answer to what is the ultimate data compression. Is the conditional probability \(p(x|y)\) considered as a probability of the “conditional variable” \((X|Y=y)\)? Yes, it is the subset of \(X\) given \(Y=y\). If you sum all of the subset probabilities, it becomes the cardinality of \(X\).


This is summary of Boyd convex optimization. Steepest descent method is a convex optimization algorithm. The normalized steepest descent direction \(x_{nsd}\) is a vector of unit ball of a norm that extends in the direction \(-\nabla f(x)\). The inner product of \(x_{nsd}\) and \(-\nabla f(x)\) is maximized. The first order Taylor approximation of \(f(x+v) = f(x) + \nabla f(x)^{T} v\) is most efficient when \(v = x_{nsd}\). The \(x_{nsd}\) is unnormalized into \(x_{sd}\).

Strong convexity and implications

This is a summary of the Boyd convex optimization book. The strong convexity assumption can be useful to explain the iterative minimization algorithms like gradient descent, steepest descent, and Newton’s method. The smallest and largest eigen value of Hessian \(m \preceq \nabla^{2}f(x) \preceq M\) with norm of gradient \(\| \nabla f(x)\|_2\) determine the boundary of optimal value \(p^{*}\). The condition number of cond(\(C_\alpha\)) \(\leq {M \over m}\), where \(C_\alpha\) is \(\alpha\)-sublevel.