One promise I took was seriously reading the SICP book. I started reading it on vacation in parallel to other exciting activities activities like watching Cesarini and Thompson videos on Erlang, and starting learning a Haskell... for my great good (aficionados will understand).

Joining the reading of Michealson book with these other activities shook me in a positive way as all this functional programming stuff started to rearrange nicely in a flow of coherent concepts. Road is still long, but at least things are getting clearer.

Having read the first part of SICP (practicing the exercises in Scheme and Clojure) helped me in changing my approach of playing with higher order function, recursion, meanwhile re-learning nice mathematical stuff.

Of course I am far from mastering functional programming but it becomes more natural facing some problems to start writing something in Clojure or Scala (and soon in Erlang and Haskell I hope). This functional programming sympathy showed itself when I started watching Andrew Ng lectures on machine learning Stanford lecture CS 229), lectures kindly pointed out to me by Debashish Gosh (thank you Sir).

In addition to the Stanford video resources, Andrew Ng, also presents short lectures nicely supporting and detailing the concepts exposed during the classroom sessions. I did not have much time yet to set an Octave installation, but quickly felt a lot of frustration while watching some examples.

As a matter of fact I had a hunch, I could code adopting a functional language because of their mathematical expression so suitable in functional programming languages. I found the least square method applied to linear regression modelling demonstration so appealing that I did not resist in trying something.

After a first shot I was not satisfied with, I read the first part of SICP. As some concepts became clearer I took the problem again and rewrote it, faster and better. Naturally I am far from perfection, but at least, I can understand the code each time I come back to it. So there have been improvements :)

Here we go. I strongly recommend you make your self familiar with the least square method before starting reading the explanations in the open classroom.

Roughly speaking we are going to be provided a set of so-called training samples, we will use to fit a prediction model.

Typically a data set can be the measurements of heights of children versus their age, or prices of houses versus their areas etc... We will try to model a behavior using these training data sets in order to predict new values.

We will label as Y the data to be predicted and X the data provided as input in order to support our prediction attempt. More precisely, there can be multiple data inputs X for a given Y. Some given input data can be then detailed as X1, X2 etc.. depending on the number of inputs for given Y value. X is then a vector.

This aspect leads to very elegant matrices calculus we are not concerned with today.

We can come up with a starting hypothesis for our prediction model represented by the formula

```
hθ(X)
```

where in a linear regression approach, the expression takes the form:

hθ(X) = θ0X0 + θ1X1 +X2θ2 etc...

In vector representation, θ is also assumed a vector with components θk. Traditionally X0 is added as an additional virtual input and set to 1.

Model fitting can be achieved starting up with a

*cost function*expression that will be minimized versus our model parameters. In essence the values predicted by our model must be as close as possible to the real sample values. A way of formulating this difference between model and reality can be expressed as:

J(θ) =(1/2n) ∑in(hθ(Xi) - Yi)2

where n is the total number of training samples The variable part in the formula being the θk parameters, we are going to look for converging values of these parameters. Grossly the algorithm can be summarized as:

Start with a given set of θk values, like 0 for example keep changing the values until we have convergence

At each step of the algorithm will changed the hypothetical θk values applying the following formula:

θk: = θk - α∂θJ(θ)

Where α is called the

*learning rate*. Naturally its value influences the speed of convergence of the process. Regarding a linear regression approach , the last expression can be simplified into:

θk : = θk - (α/n)∑in(hθ(Xi) - Yi)Xki

I tried to keep things simple, sorry for the invasive formulas.

I tempted a rewriting of my first shot solution using both the wishful thinking advocated by Abelson and Sussman, and testing. This is to not-open the debate of understanding what you are doing versus TDD. Both served my purpose, no discussion about that.

Let's get driven by wishful thinking, at least once. My entry point function tries to replicate the generic algorithm :

(defn find-parameters[datas nb-try learning-rate] (let [ nb-datas (count (:input (first datas))) thetas (create-thetas nb-datas)] (letfn [ (improve [parameters remaining] (println remaining " " parameters) (let [hypothesis (hypothesis-on parameters)] (if (= 0 remaining) parameters (recur (new-parameters-from parameters datas hypothesis learning-rate) (dec remaining)))))] (improve thetas nb-try))))

In order to mimic Andrew Ng approach while dealing with the exercises,

*find-parameters*functions take as input

- The datas
- the number of iterations
- the learning rate

*improve*, recures on itself in order to refine the input parameters until reaching the number of iterations.

On each loop, we create a new hypothesis expression using the more recently processed parameters. So the definition of the hypothesis is abstracted in an external function.

The θ's parameters get created in the

*create-thetas*method. By convention we always start with all parameters values set to 0.We use the number of input data in a sample in order to set the parameters.

The creation functions for data samples and parameter generation where implemented using TDD.

We expect the input data sets to be created using the data sample creation functions:

*create-sample*. The driving tests:

(deftest create-sample-with-tests-value-should-rpoduce-expected-structure (is (= {:input [1 2] :output 3} (create-sample [[2] [3]])))) (deftest create-theta-should-create-expected-structure (is (= [{:value 0.0 :at 0} {:value 0.0 :at 1}] (create-thetas 2))) (is (= [{:value 0.0 :at 0} {:value 0.0 :at 1} {:value 0.0 :at 2}] (create-thetas 3))))

lead to the creation functions:

(defn create-sample[[input output]] {:input (vec (cons 1 input)) :output (first output)}) (defn create-samples[from-x from-y] (map create-sample (zipmap from-x from-y))) (defn create-theta [parameter position] {:value parameter :at position}) (defn create-thetas [for-number] (letfn [ (create-thetas-rec [thetas remaining] (if (= 0 remaining) thetas (recur (cons (create-theta 0.0 (dec remaining)) thetas) (dec remaining))))] (create-thetas-rec [] for-number)))

The astute reader will notice we have prep-ended the imaginary sample value 1 (aka X0) to the input data as explained before, in the

*create-sample*function.

Y and Xi data sets being taken from two different files, the

*create-samples*function has been introduced so to zip data incoming from both files.

A hash map represents each sample, indexing the input values with the

*:input*symbol, and the output with the

*:output*symbol.

A hash map structure also represents the θ parameters, referencing in each parameter the value with a

*:value*symbol (hurray !) and the parameter index with the

*:at*symbol.

Good, we have data structures. Remains, the foundation of our model, the hypothesis, hθ(X). I also called

*bias*the difference between the model predicted values and the real model value (hθ(Xi) - Yi) as it is also a recurring structure. Both the hypothesis and the bias are mathematical functions. Using a predefined (simple) set of test values I came with :

(deftest hypothesis-with-null-parameters-should-bring-zero (is (= 0.0 ((hypothesis-on (create-thetas 3)) {:input [1 2 3] :output 4}))) (is (= 6.0 ((hypothesis-on [(create-theta 1.0 0) (create-theta 1.0 1) (create-theta 1.0 2)]) {:input [1 2 3] :output 4})))) (deftest bias-for-with-zero-parameter-should-provide-output (let [ parameters (create-thetas 3) hypothesis (hypothesis-on parameters) data {:input [1 2 3] :output 4}] (is (= -4.0 ((bias-for hypothesis) data))))) (deftest bias-for-with-unitary-parameter-should-provide-output (let [ parameters [(create-theta 1.0 0) (create-theta 1.0 1) (create-theta 1.0 2)] hypothesis (hypothesis-on parameters) data {:input [1 2 3] :output 4}] (is (= 2.0 ((bias-for hypothesis) data)))))

where the bias function is challenged using both (0 0 0) θ values and (1 1 1) θ values. Both implementations are given by

(defn hypothesis-on [parameters] (fn[data] (let [values (:input data)] (reduce + (map #(* (:value %) (values (:at %))) parameters))))) (defn bias-for [hypothesis] (fn[data] (- (hypothesis data) (:output data))))

being

*higher order functions*as returning functions taking respectively as input , the thetas values for the hypothesis and a data sample for the bias. From my point of view this is where functional programming nicely fit the algorithm modeling, abstracting the mathematical functions as basic bricks in our construction.

Going back to the

*find-parameters*function, we now need to implement the

*new-parameters-from*function, in charge of processing the next parameters values, extracted for a previous set of parameter values. This time I must confess I fully decomposed the problem in wishful thinking splitting from top to down the problem in smaller pieces, testing only two of the methods, when there was doubt. The obtained implementation

(defn data-correction[at-index bias] (fn [data] (* (bias data) ((:input data) at-index)))) (defn correction-for[parameter datas bias] (let [ at-index (:at parameter) correction-per-data (data-correction at-index bias)] (reduce + (map correction-per-data datas)))) (defn corrected-value[parameter nb datas bias learning-rate] (- (:value parameter) (* learning-rate (/ (correction-for parameter datas bias) nb)))) (defn new-parameters-from[parameters datas hypothesis learning-rate] (let [nb (count datas) bias (bias-for hypothesis)] (map #(create-theta (corrected-value % nb datas bias learning-rate) (:at %)) parameters)))

where

- the
*new-parameters-from*function simply maps onto the list of parameters in order to set up the next list of parameters, - the
*corrected-value*function really does the job of altering an input parameter value. - The
*correction-for*function then processes the correction for a given parameter without the scaling learning rate, and this using all the data samples (logical, thinking about that), delegating the task of processing the correction on a per data sample base to the*data-correction*function...and that's all.

(ns ml.least-square) (defn square [x] (* x x)) (defn create-sample[[input output]] {:input (vec (cons 1 input)) :output (first output)}) (defn create-samples[from-x from-y] (map create-sample (zipmap from-x from-y))) (defn create-theta [parameter position] {:value parameter :at position}) (defn create-thetas [for-number] (letfn [ (create-thetas-rec [thetas remaining] (if (= 0 remaining) thetas (recur (cons (create-theta 0.0 (dec remaining)) thetas) (dec remaining))))] (create-thetas-rec [] for-number))) (defn hypothesis-on [parameters] (fn[data] (let [values (:input data)] (reduce + (map #(* (:value %) (values (:at %))) parameters))))) (defn bias-for [hypothesis] (fn[data] (- (hypothesis data) (:output data)))) (defn data-correction[at-index bias] (fn [data] (* (bias data) ((:input data) at-index)))) (defn correction-for[parameter datas bias] (let [ at-index (:at parameter) correction-per-data (data-correction at-index bias)] (reduce + (map correction-per-data datas)))) (defn corrected-value[parameter nb datas bias learning-rate] (- (:value parameter) (* learning-rate (/ (correction-for parameter datas bias) nb)))) (defn new-parameters-from[parameters datas hypothesis learning-rate] (let [nb (count datas) bias (bias-for hypothesis)] (map #(create-theta (corrected-value % nb datas bias learning-rate) (:at %)) parameters))) (defn find-parameters[datas nb-try learning-rate] (let [ nb-datas (count (:input (first datas))) thetas (create-thetas nb-datas)] (letfn [ (improve [parameters remaining] (println remaining " " parameters) (let [hypothesis (hypothesis-on parameters)] (if (= 0 remaining) parameters (recur (new-parameters-from parameters datas hypothesis learning-rate) (dec remaining)))))] (improve thetas nb-try))))

The code has been written into a

*least_square.clj*file located in a

*ml*directory under the source root. The Leinengen project definition is:

(defproject practice "1.0.0-SNAPSHOT" :description "Essays on Machine Learning" :dependencies[[org.clojure/clojure "1.3.0"]] :repositories {"sonatype-oss-public" "https://oss.sonatype.org/content/groups/public/"})

In a

*sample_one.clj*file, still under the

*ml*directory the code was challenged using exercise2 samples:

(ns ml.sample-one (:use [ml.file-reader]) (:use [ml.least-square])) (def from-x-sample "/dev/projects/clojure/leinengen/ml/ex2x.dat") (def from-y-sample "/dev/projects/clojure/leinengen/ml/ex2y.dat") (defn x-raw-data[] (to-number (raw-data from-x-sample))) (defn y-raw-data[] (to-number (raw-data from-y-sample))) (defn load-datas[] (create-samples (x-raw-data) (y-raw-data))) (defn sample-one[iterations learning-rate] (find-parameters (load-datas) iterations learning-rate))

reusing the previous blog entry

*file_reader.clj*file:

(ns ml.file-reader (:use [clojure.string :only [split trim-newline trim]]) (:import [java.io File BufferedInputStream BufferedReader FileReader])) (defn read-lines [from-file] (let [reader (BufferedReader. (FileReader. (File. from-file)))] (letfn [(flush-content [line] (lazy-seq (if line (cons line (flush-content (.readLine reader))) (.close reader))))] (flush-content (.readLine reader))))) (defn parseDouble [x] (Double/parseDouble x)) (defn to-number [data] (map #(vec (map parseDouble %)) data)) (defn not-empty? [data] (not (empty? data))) (defn raw-data[from-file] (map #(filter not-empty? (split % #" ")) (map #(trim (trim-newline %)) (read-lines from-file))))

Executing

(sample-one 1500 0.07)

for 1500 iterations and a learning rate value of 0.07, provides us with the exercise expected result

({:value 0.7501503917693574, :at 0} {:value 0.0638833754997108, :at 1})

which matches Andrew Ng results.

This settles a personal story I had with least square method while I was a physicist. These times are buried, the least square problem is tackled too and that's nice.

For the people who were not bored, I will briefly present the content of the

*sample_two.clj*file I used to challenge the exercise three, as it is necessary to apply a little scaling processing onto the input datas in order to proceed with a set of two inputs per sample. The solution works fine though.

Pffew!!!!!!!!! Going back to Haskell and Michaelson book.

Be seeing you !!! :)

## 2 comments:

This is great! I'll have to read your code closely. I'm also in the ML class, BTW. I have been using DrRacket rather than Clojure. It's fun and often more compact to look for a functional solution. The folks who built DrRacket strongly advocate writing test cases before one writes the code in a function. So most of my Racket files take the form ml.rkt and ml-test.rkt; I put my test cases in the latter and begin that file with (require rackunit), a unit testing module.

Thank you for your feedback. I started working with DrRacket while reading seriously the SICP. I find the language fascinating in both its simplicity and power. One can do so much with it !

## Post a Comment