Today we are going to work on least squares exercise number 3 from Andrew Ng's Machine Learning OpenClassroom.

Last time, we challenged our least squares method (applied to linear regression) using only single-input data. The input data was composed of the ages of young pupils whose heights were sampled.

Today, we try to find the parameter values required to predict house prices given, for each house, the living area and the number of bedrooms. As exposed by Professor Ng during his lectures, the living-area values being about a thousand times greater than the numbers of bedrooms, some scaling of the input data must be applied.

Having formerly been a physicist in my youth, I would call this step a normalization. We will scale both types of inputs by their standard deviations and set their means to zero. As a reminder, the mean of an x-data set can be expressed as:

μ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ

where n is the total number of data points and Xᵢ is each data value.

The standard deviation is expressed by

σ = √( (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − μ)² )

(Yes, the little symbol in front is a square root :)) This time we must go a little further than simply calling the raw-data function as in the previous exercise.
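As a quick sanity check, both formulas transcribe directly at the REPL; here on the small data set [2 4 4 4 5 5 7 9], which we will meet again in the tests below:

```clojure
(def sample [2 4 4 4 5 5 7 9])

;; μ = (1/n) ∑ Xᵢ  →  40/8 = 5
(def mu (/ (reduce + sample) (count sample)))

;; σ = √((1/n) ∑ (Xᵢ − μ)²)  →  √(32/8) = 2.0
(def sigma
  (Math/sqrt (/ (reduce + (map #(* (- % mu) (- % mu)) sample))
                (count sample))))
```

Evaluating `mu` yields 5 and `sigma` yields 2.0, matching the two formulas by hand.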

We will reuse our *file-reader* module from last session:

```clojure
(ns ml.file-reader
  (:use [clojure.string :only [split trim-newline trim]])
  (:import [java.io File BufferedInputStream BufferedReader FileReader]))

(defn read-lines [from-file]
  (let [reader (BufferedReader. (FileReader. (File. from-file)))]
    (letfn [(flush-content [line]
              (lazy-seq
                (if line
                  (cons line (flush-content (.readLine reader)))
                  (.close reader))))]
      (flush-content (.readLine reader)))))

(defn parseDouble [x]
  (Double/parseDouble x))

(defn to-number [data]
  (map #(vec (map parseDouble %)) data))

(defn not-empty? [data]
  (not (empty? data)))

(defn raw-data [from-file]
  (map #(filter not-empty? (split % #" "))
       (map #(trim (trim-newline %)) (read-lines from-file))))
```

naturally starting by collecting the file contents:

```clojure
(ns ml.sample-two
  (:use [ml.file-reader])
  (:use [ml.least-square]))

(def from-x-sample "/dev/projects/clojure/leinengen/ml/ex3x.dat")
(def from-y-sample "/dev/projects/clojure/leinengen/ml/ex3y.dat")

(defn x-raw-data []
  (to-number (raw-data from-x-sample)))

(defn y-raw-data []
  (to-number (raw-data from-y-sample)))
```

You will notice I am working in a new *namespace*, referencing both the *file-reader* and *least-square* namespaces from the previous blog entry. Having browsed Andrew Ng's OpenClassroom lecture, you might recognize the file names bound to the *from-x-sample* and *from-y-sample* variables.

Before going any further, I know I will have to compute:

- mean of a data set
- standard deviation of a data set
- normalization of data

```clojure
(ns ml.test.sample-two-spec
  (:use ml.sample-two)
  (:use clojure.test))

(def sample [2 4 4 4 5 5 7 9])

(deftest mean-test-sample-should-be-equal-to-five
  (is (= 5 (mean-of sample))))

(deftest deviation-test-sample-should-be-equal-to-two
  (is (= 2.0 (deviation sample (mean-of sample)))))

(deftest normalised-datas-sample-should-have-mean-of-zero
  (is (= [-1.5 -0.5 -0.5 -0.5 0.0 0.0 1.0 2.0] (normalised-datas sample)))
  (is (= 0.0 (mean-of (normalised-datas sample)))))
```

The extrapolated implementation came to be:

```clojure
;; square comes from the ml.least-square namespace of the previous article
(defn deviation [dataset mean]
  (let [squares (map #(square (- % mean)) dataset)]
    (Math/sqrt (/ (reduce + squares) (count dataset)))))

(defn mean-of [dataset]
  (let [nb-data (count dataset)]
    (/ (reduce + dataset) nb-data)))

(defn normalised-datas [dataset]
  (let [mean (mean-of dataset)
        sigma (deviation dataset mean)]
    (vec (map #(/ (- % mean) sigma) dataset))))
```

This is where we can see that *anonymous* functions (prefixed by the '#' symbol) should not be long, as the '#' and '%' symbols may quickly clutter the code and hurt readability. The formulas remaining simple in this situation, the purpose of each anonymous function stays clear.

Our tool methods are ready. But wait a minute: the *raw-data* function from my *file-reader* module produces a list of vectors for both the x-data and y-data sets:

```clojure
ml.file-reader=> (to-number (raw-data "/dev/projects/clojure/leinengen/ml/ex3x.dat"))
([2104.0 3.0] [1600.0 3.0] [2400.0 3.0] [1416.0 2.0] [3000.0 4.0] [1985.0 4.0] [1534.0 3.0] ...
```

In order to normalize the x-data, I will have to collect the column of values holding the living areas and the column holding the numbers of bedrooms. This step is easy, whereas gathering the two normalized columns back into the same list of vectors will be a little trickier.
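Extracting a column from such a list of vectors is indeed just a matter of mapping first or second over it; a quick sketch on a truncated sample of the data shown above:

```clojure
(def xs [[2104.0 3.0] [1600.0 3.0] [2400.0 3.0]])

(map first xs)   ; → (2104.0 1600.0 2400.0)  the living areas
(map second xs)  ; → (3.0 3.0 3.0)           the numbers of bedrooms
```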

I am sure some kind soul will point out to me the useful function that zips vectors in Clojure. But after one week of reading and exercising the recursion part of my brain on the first part of SICP, I immediately wrote a (*tail*) recursive solution to the problem.

One of the beautiful aspects of Lisp languages is the ease of developing one's own adapted solution, here and now, for a given problem. What we expect is:

```clojure
(deftest zip-datas-with-sample-vectors-should-produce-vector-of-vectors
  (is (= [[1 4] [2 5] [3 6]] (zip-datas [1 2 3] [4 5 6]))))
```

which we can easily solve with:

```clojure
(defn zip-datas [x y]
  (letfn [(zip-iteration [result partial-x partial-y]
            (if (empty? partial-x)
              result
              (recur (conj result (vector (first partial-x) (first partial-y)))
                     (rest partial-x)
                     (rest partial-y))))]
    (zip-iteration [] x y)))
```
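For the record, here is what that kind soul would answer: Clojure's map accepts several collections and applies its function to their elements in lockstep, so the hand-rolled zip can be written in one line:

```clojure
(map vector [1 2 3] [4 5 6])        ; → ([1 4] [2 5] [3 6])
(vec (map vector [1 2 3] [4 5 6]))  ; → [[1 4] [2 5] [3 6]]
```

Still, writing the recursive version by hand was the whole point of the SICP exercise.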

There, we find a *locally scoped function* calling itself recursively. The final result is accumulated in the result variable while we iterate on an ever smaller tail of the vectors being processed: a typical *recursive procedure* generating an *iterative process* (check the first part of SICP). The complete file content is:

```clojure
(ns ml.sample-two
  (:use [ml.file-reader])
  (:use [ml.least-square]))

(def from-x-sample "/dev/projects/clojure/leinengen/ml/ex3x.dat")
(def from-y-sample "/dev/projects/clojure/leinengen/ml/ex3y.dat")

(defn x-raw-data []
  (to-number (raw-data from-x-sample)))

(defn y-raw-data []
  (to-number (raw-data from-y-sample)))

(defn deviation [dataset mean]
  (let [squares (map #(square (- % mean)) dataset)]
    (Math/sqrt (/ (reduce + squares) (count dataset)))))

(defn mean-of [dataset]
  (let [nb-data (count dataset)]
    (/ (reduce + dataset) nb-data)))

(defn normalised-datas [dataset]
  (let [mean (mean-of dataset)
        sigma (deviation dataset mean)]
    (vec (map #(/ (- % mean) sigma) dataset))))

(defn zip-datas [x y]
  (letfn [(zip-iteration [result partial-x partial-y]
            (if (empty? partial-x)
              result
              (recur (conj result (vector (first partial-x) (first partial-y)))
                     (rest partial-x)
                     (rest partial-y))))]
    (zip-iteration [] x y)))

(defn load-datas []
  (let [x (x-raw-data)
        y (y-raw-data)
        norm-first-x (normalised-datas (map first x))
        norm-second-x (normalised-datas (map second x))]
    (create-samples (zip-datas norm-first-x norm-second-x) y)))

(defn sample-two [iterations learning-rate]
  (find-parameters (load-datas) iterations learning-rate))
```

At the bottom, the *load-datas* function takes charge of reading, normalizing and reassembling the x input data, while the *sample-two* function lets us fire the regression using Andrew Ng's parameters: the number of iteration steps and the learning rate.
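For readers who missed the previous article, *find-parameters* lives in the *ml.least-square* namespace and is not reproduced here. As a sketch only — this is the textbook batch gradient descent such a function typically implements, written with my own assumed sample format of [features y] pairs and an implicit intercept term, not the actual ml.least-square code:

```clojure
;; θ0 + θ1·x1 + θ2·x2 + ... with an implicit leading 1 for the intercept
(defn hypothesis [theta xs]
  (reduce + (map * theta (cons 1 xs))))

;; one simultaneous update of every θj, averaged over the m samples
(defn gradient-step [theta samples rate]
  (let [m (count samples)
        grad (fn [j]
               (/ (reduce +
                          (map (fn [[xs y]]
                                 (* (- (hypothesis theta xs) y)
                                    (nth (cons 1 xs) j)))
                               samples))
                  m))]
    (vec (map-indexed (fn [j t] (- t (* rate (grad j)))) theta))))

;; start from the zero vector and iterate the update
(defn descend [samples iterations rate]
  (let [n (inc (count (first (first samples))))]
    (nth (iterate #(gradient-step % samples rate) (vec (repeat n 0.0)))
         iterations)))
```

On a toy data set such as y = 2x, `(descend [[[1] 2.0] [[2] 4.0] [[3] 6.0]] 1000 0.1)` converges to parameters close to [0.0 2.0].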

Done. After 100 iterations with a learning rate of 1, we get:

```clojure
({:value 340412.6595744681, :at 0} {:value 109447.7964696418, :at 1} {:value -6578.354854161278, :at 2})
```

...which is close to Andrew Ng's result, with a precision of 1 to 2%. This bias from the Octave/Matlab result should encourage us to work on error-amplitude estimation when working with numerical data. In the near future, I hope.
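Once the parameters are found, predicting the price of a new house only requires normalizing its features with the same means and deviations as the training set. A minimal sketch, in which area-mean, area-sigma, rooms-mean and rooms-sigma are hypothetical placeholder values — in practice they would come from mean-of and deviation applied to the columns of x-raw-data:

```clojure
;; hypothetical training-set statistics (placeholders, not the real values)
(def area-mean 2000.0)  (def area-sigma 800.0)
(def rooms-mean 3.2)    (def rooms-sigma 0.8)

;; the parameter values found above: θ0, θ1 (area), θ2 (bedrooms)
(def theta [340412.6595744681 109447.7964696418 -6578.354854161278])

(defn predict-price [area rooms]
  (+ (theta 0)
     (* (theta 1) (/ (- area area-mean) area-sigma))
     (* (theta 2) (/ (- rooms rooms-mean) rooms-sigma))))

(predict-price 1650 3)  ; example prediction for a 1650 sq ft, 3-bedroom house
```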

Thank you very much to the people who have kindly followed these last three articles, and thank you for the feedback.

I have to pack my things in order to head back to the muddy and smelly suburbs of Paris, but now that the reading of SICP has seriously started, and the discovery of Haskell and Erlang is in progress, I won't be disturbed by that environment. As a matter of fact, I won't have to bear this situation very long, as I am looking for new positions in Europe :)

Be seeing you !!! :)