Machine Learning & Neural Computation 424, 2013/2014
Module leader & Lecturer: Dr Aldo Faisal – Imperial College London

Coursework issue date: 26/10/2013
Coursework submission: 18/11/2013 (online via CATE; new submission date with 3 extra days)
Coursework return & feedback: within 14 days after submission

Coursework has to be submitted individually and will be marked as such. You are welcome to discuss your work with other students and your GTAs during the lab sessions. All text-based files must contain a commented line identifying the name and CID of the submitter. Submissions may be individually checked for plagiarism.

Download and open the coursework zip-file from https://www.dropbox.com/s/f95x4km5kxbyewf/MLNCAssessedCoursework2013-2014.zip

The focus of this coursework is supervised learning, and it has three questions: A, B and C. Please read the whole coursework through carefully and make sure you understand it before you start working on it. Your coursework submission should not be longer than 2,000 words (shorter submissions are quite welcome; code is of course excluded from the word count). All figures should conform to the guidelines for visualising scientific data explained at the end of this coursework.

Space-time series of 1-bedroom/studio rents in London

You have been provided with a spatio-temporal dataset of over 57,000 rental prices (pcm) for single bedroom/studio properties in London, along with the corresponding Lat/Long coordinates. Also provided is a set of coordinates for all stations on the London Underground network. This is a real dataset collected using automated scripts from current sources (our crawler has been running since October 2012). Please note that, as with every real-world dataset, there may be some corrupted or strange data points. For example, there may be a number of properties that are not located within Greater London.
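As an illustration of how corrupted or out-of-area points might be screened before any fitting, here is a minimal sketch. It assumes the four-column layout given in the file list for rental.mat (rent, time stamp, latitude, longitude). The small matrix, the bounding box values and the names latRange, lonRange, valid, rentalClean and dates are all made up for this example and are not part of the coursework files.

```matlab
% Sketch only: a tiny made-up matrix stands in for the real data so that the
% snippet runs on its own. With the coursework files you would instead use
%   load('rental.mat');   % assumed to provide the variable 'rental'
% Columns (from the file list): 1 rent [GBP pcm], 2 datenum, 3 latitude, 4 longitude.
rental = [1200, datenum(2012,10,15), 51.50, -0.18;   % central London
          1450, datenum(2013, 3, 1), 51.53, -0.10;   % central London
           650, datenum(2013, 6,20), 53.48, -2.24;   % far outside London
           NaN, datenum(2013, 7, 4), 51.45,  0.05];  % corrupted rent

% Hypothetical bounding box for Greater London (rough assumption, tune it).
latRange = [51.28 51.70];
lonRange = [-0.52  0.34];

% Keep rows with a finite, positive rent that fall inside the box.
valid = isfinite(rental(:,1)) & rental(:,1) > 0 ...
      & rental(:,3) >= latRange(1) & rental(:,3) <= latRange(2) ...
      & rental(:,4) >= lonRange(1) & rental(:,4) <= lonRange(2);
rentalClean = rental(valid, :);

% Convert the datenum time stamps into calendar dates [Y M D h m s].
dates = datevec(rentalClean(:,2));

fprintf('Kept %d of %d rows\n', size(rentalClean,1), size(rental,1));
```

A fixed rectangle is the simplest possible filter. Whether it is appropriate, and where its edges should lie, is a judgement you would need to justify from the data.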
The aim of this project is to use supervised learning techniques from the course to learn, from the supplied sample data, the mapping from geographical location and date to rental cost per calendar month. For example, for an arbitrary location in Greater London and a date in the recent past or present, we want to know the rental price predicted by your machine learning system. You can choose any of the methods discussed in the course, provided that you implement them yourself from first principles. Since this dataset contains geolocation information, you are invited to use Google Maps (http://maps.google.com) to look up the specific locations of rental prices (e.g. paste the longitude and latitude numbers into the map search window).

Figure 1: Coordinates of properties (blue dots) in the dataset and tube stations (red circles). Axes: Longitude [°] (horizontal), Latitude [°] (vertical). This figure was created using the Matlab commands below (footnote 1).

Footnote 1: Matlab commands to generate the figure:
    plot(rental(:,4),rental(:,3),'.')
    hold on
    plot(tube.location(:,2),tube.location(:,1),'ro')
    axis equal
    ylabel('Latitude [^\circ]')
    xlabel('Longitude [^\circ]')

In your coursework zip directory you will find the following files:
• rental.mat contains the location and price data that you will be using for the exercise, stored as a structure named rental, with four columns: 1. rent (£ rent per calendar month), 2. time stamp (see footnote 2), and finally the geographic location in degrees, 3. latitude and 4. longitude. In addition it contains a “struct” named tube, which has two fields, station and location, covering all London Underground stations.
• trainRegressor.m – an empty stub file you have to write in
• testRegressor.m – an empty stub file you have to write in
• sanityCheck.m – see explanation in Question B

In the following sub-questions you will be required to generate plots from data or your results.
Note: Plots that do not have appropriate labels (i.e. the name of the dimension and the units in square brackets, e.g. “Rent [£]”) will not be counted. Figures should be exported so that lines and data points are clearly visible and any text in the figures is clearly readable. Use “export figure” to export the figure from Matlab and store it as PNG. Untidy or grainy figures will not be considered.

Footnote 2: This date information is given in Matlab datenum format; use datevec to convert it into normal calendar dates.

A [20%] Familiarise yourself with the data set. Plot the rental prices over time. Now we want to know whether the rental prices increased with time or remained constant. Fit a 0th-order and a 1st-order polynomial basis function set to this plot.
a) Give your ML estimates of all parameters involved, including the noise in the model.
b) Which model is more likely: the 0th-order (flat prices) or the 1st-order (linearly increasing prices)? Justify your answer with the data.
c) What may be a problem in comparing the 1st-order and 0th-order models using likelihoods?

B [45%] Next, we are going to pool the data together across time (i.e. we just have price and position):
a) Visualise the raw rental price data for Greater London in a 3D plot (plot3), rotating the view appropriately to be as informative as possible.
b) Write the two functions trainRegressor and testRegressor mentioned above and train your system with the data set. Your task is to write two functions (the function files are provided for your convenience), so as to be able to answer the sub-questions described subsequently:
1. trainRegressor.m – a Matlab function that accepts training inputs trainIn (a two-column matrix of Longitude and Latitude pairs, where the rows correspond to different training data points) and training outputs trainOut (a column vector of rent prices, where each row's rental price corresponds to the rental location in the same row of trainIn).
The function is to return a single variable (potentially a structure or matrix) params.
>>params = trainRegressor(trainIn, trainOut)
2. testRegressor.m – a Matlab function that accepts testing inputs testIn (a two-column matrix of Longitude and Latitude pairs, where the rows correspond to different test data points) and params (the parameter variable created by trainRegressor). The function should return a column vector of rental prices for the corresponding locations, called results.
>>results = testRegressor(testIn, params)

To test your code, sanityCheck.m is a Matlab function that checks whether your two functions match the specifications we use. Use the function with the following command in Matlab:
>>sanityCheck(@trainRegressor, @testRegressor)
Note: If your functions do not pass the automatic tests performed by this function, we will be unable to automatically evaluate their performance and may deduct marks. Your code should be documented with sufficient comments so that a stranger is able to follow what you did and why. Any free parameters that you do not want to learn, or cannot learn, from the provided data set have to be set as constants within the trainRegressor function. Explain and discuss your solution approach and how you chose any parameters that you set as constants. We suggest that you use Gaussian basis functions.
c) Test your two functions by calculating the Root Mean Square Error (RMSE) of your rental price predictions under cross-validation (you will have to choose suitable sizes for the training and testing data chunks). How does the number of training points affect the accuracy of your regression? Hint: an acceptable solution will typically have an RMSE lower than £950 per calendar month.
d) Calculate the predicted rental price at all London tube stations. What are the mean and standard deviation of these rental prices?
e) Use a bar chart (bar) to visualise the rent profile for the Central Line from Ealing to Stratford.
f) What is the rental price at the Imperial College Campus (Latitude 51.499019°, Longitude −0.176256°)? Discuss whether you believe the predicted value, based on the data and the machine learning strategy you used.
g) What is the rental price at Upminster station? Discuss whether you believe the predicted value, based on the data and the machine learning strategy you used.
h) Generate a finely spaced grid of sample points (e.g. you could use the Matlab function meshgrid) and visualise the “landscape” of London rental prices using the function surf. Add the tube stations (but not necessarily all tube station names) to the plot using plot3 and text. Rotate the view so that you see London from above; you will thus get a “heatmap”. Use colorbar to further annotate the plot. This allows you to answer the following questions:
1. What is your predictor's most expensive location in London and where is it located? Provide the price, latitude and longitude, and use Google Maps to visualise the location.
2. Where is the price gradient largest, i.e. where is the transition from an expensive area to a cheaper area steepest?
i) How well does your machine learner predict your own rent, and what is the relative error?

C [35%] Now we want to account for the passage of time and changes in rental prices. To this end we will augment our data space to include position, price and time.
a) There are three main ways to achieve this:
1. Augment your sum of basis functions that operate on space (but not on time) with additional basis functions that operate on time (but not on space).
2. Use basis functions that all take space and time as a joint argument.
3. Chunk the data in time and repeat the analysis of B on each chunk.
Discuss each approach individually and sketch out a solution approach in a paragraph each. What could be the challenges or problems in explaining the data with each approach? Choose approach 1 or 2, implement it, and show and compare your results with those of approach 3.
If your choice requires you to write new regressor functions (e.g. ones that take 3-column input arguments, etc.), please call them trainRegressorTime and testRegressorTime (you will have to generate the files yourself, following the template from question B). Plot the change in prices for each month of the data set, as predicted by your regressor, for each station of the Central Line (choose a suitable way of visualising all this in a single figure).
1. trainRegressorTime.m – a Matlab function that accepts training inputs trainIn (a three-column matrix of 1. Datevec dates, 2. Longitude and 3. Latitude, where the rows correspond to different training data points) and training outputs trainOut (a column vector of rent prices, where each row's rental price corresponds to the rental location in the same row of trainIn). The function is to return a single variable (potentially a structure or matrix) params.
>>params = trainRegressorTime(trainIn, trainOut)
2. testRegressorTime.m – a Matlab function that accepts testing inputs testIn (a three-column matrix of 1. Datevec dates, 2. Longitude and 3. Latitude, where the rows correspond to different test data points) and params (the parameter variable created by trainRegressorTime). The function should return a column vector of rental prices for the corresponding locations, called results.
>>results = testRegressorTime(testIn, params)
b) Which coordinates displayed the highest increases in rental prices? Show and state your results, visualised on a London map.
c) Is the temporal change in rental prices a strong effect? We can test this by building a naïve Bayesian classifier that, given a rental price and location (and using your regressors from the previous sub-questions to give you a likelihood in time), predicts whether a data point came from the early (e.g. first half) or late (e.g. second half) part of the data set. Choose a suitable and convincing way to test your classifier on the data.
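To make the calling convention concrete, the following is a minimal sketch of a constant (mean-rent) baseline that follows the trainRegressorTime / testRegressorTime interface described above, evaluated with a simple hold-out RMSE. It is deliberately not a solution to question C. The data are randomly generated placeholders, the time column is represented by made-up numeric time stamps, and the names trainFn, testFn, allIn, allOut and meanRent are hypothetical. Anonymous functions are used only so that the snippet runs as a single script; for submission the same bodies would go into the files trainRegressorTime.m and testRegressorTime.m.

```matlab
% Sketch only: a constant baseline that respects the required interface.
% Made-up stand-in data: [time stamp, longitude, latitude] and rent [GBP pcm].
rng(0);                                  % reproducible placeholder data
n      = 200;
allIn  = [735000 + 365*rand(n,1), -0.5 + rand(n,1), 51.3 + 0.4*rand(n,1)];
allOut = 1000 + 200*randn(n,1);

% Training: takes (trainIn, trainOut) and returns ONE variable, params.
% Here params is a struct holding the mean rent, the only learned quantity.
trainFn = @(trainIn, trainOut) struct('meanRent', mean(trainOut));

% Testing: takes (testIn, params) and returns a column vector with one
% predicted rent per row of testIn.
testFn = @(testIn, params) params.meanRent * ones(size(testIn,1), 1);

% Simple hold-out split: first 150 rows for training, the rest for testing.
trainIdx = 1:150;
testIdx  = 151:n;

params  = trainFn(allIn(trainIdx,:), allOut(trainIdx));
results = testFn(allIn(testIdx,:), params);

% Root mean square error of the predictions on the held-out rows.
rmse = sqrt(mean((results - allOut(testIdx)).^2));
fprintf('Baseline hold-out RMSE: %.1f [GBP pcm]\n', rmse);
```

A baseline of this kind gives a reference RMSE that any spatial or spatio-temporal model should beat, and checking the shape of results early helps avoid failures in automated testing.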
IT IS VERY IMPORTANT THAT YOU FOLLOW THE FUNCTION CALL CONVENTIONS THAT ARE SPECIFIED. WE USE AUTOTESTING SCRIPTS TO TEST YOUR CODE. IF YOUR CODE BREAKS DOWN IN AUTOTESTING BECAUSE IT DOES NOT FOLLOW THE SPECIFIED CONVENTIONS, THE GTAS CANNOT BE EXPECTED TO FIX YOUR CODE TO MAKE IT WORK. CODE THAT DOES NOT WORK MAY RESULT IN A DOWNGRADE OF YOUR GRADE. PLEASE SUBMIT ALL THE CODE THAT GENERATED YOUR FIGURES OR RESULTS, ALONGSIDE THE EXPLICITLY REQUESTED MATLAB FILES. YOUR REPORT SHOULD BE SUBMITTED AS A PDF FILE.

How to visualise and present your scientific data for this coursework (and many other applications):

A figure says more than 1000 words: ideally about what it should say, or else it raises, in as many words, doubts about the author's abilities. Therefore:
Each figure deserves to look good. So make figures in Matlab/Illustrator/Powerpoint/GIMP/TGIF etc. It is well worth making a good-looking figure.
A figure is almost always preferable to text.
Figures are usually viewed in a small form. So use large fonts.
It is often a good idea for the caption of the figure to allow the reader to understand the conclusion from the figure. Ideally, the figures allow the entire paper to be understood without reading the text.
Use the same font across all figures and at most 3 sizes.
A figure should not have more than 8 panels and usually at least 2.
Reference all your figures and subfigures in your main text (everything here also holds for tables, numbered equations, etc.). A figure that is not referenced will be considered invisible by the reader. Also, if there is no meaningful way to refer to your figure/subfigure in your main text, this perhaps suggests that you do not need that figure.
Your figure caption should explain in direct terms what one can see in the figure. If it is a graph containing data, the type of graph (e.g. scatter plot), the axes, and the data symbols (e.g. circles, dots) should be linked to the entities discussed.
The main text should refer to the figure and explain first what the reader is supposed to see in the figure (imagine the reader is blind: what message should they be able to extract from the figure?).

Ask yourself: Does each figure present evidence in favor of exactly one point that you are trying to make with it, and does it present such evidence as clearly as you can imagine? E.g., if you are conveying that one thing does better than another, are you showing relative performance, delta performance or absolute performance? Absolute performance requires the reader to subtract in their head, but may be important when absolute performance is in itself meaningful (e.g. patient survival). Relative performance imposes no such requirement, but the absolute value is lost. Delta performance may be helpful to convey that conditions differ from each other. This is certainly the most important consideration: it is the only question I'd answer if I were standing on one foot; the rest are details.

Check list for your figures:
Are all axes labeled, e.g. ”Height”? Are the axes tipped with arrows?
Do all axis labels have units, and do you specify them using the square bracket convention, e.g. ”Height [cm]”?
If there is 1 color, is it black? If there are 2, are they black and grey (or blue and red, but please do not use red and green)?
If there are multiple lines/dots, does each have a different line style and color?
Are all lines sufficiently thick? (If you used Matlab and they are the default thickness, the answer is no; they should be at least 2pt.)
Are all font sizes legible? (If you used Matlab and they are the default font size, the answer is no; they should be at least 16-18pt.)
Is the figure exported at a sufficiently high DPI (400+)?
If there are multiple lines, is there a legend, or another textual description of each line?
If errorbars make sense, are they there? If they are there, does the caption explain whether they show standard error or standard deviation?
If they are not either, is there a good justification provided for that?
Is your method in the figure named something other than “proposed method” or “our method”? If not, name it and use that name throughout.
Are your axes tight (that is, are the bounds of the axes just larger than the max and min of what you want to show)? If not, do you have a good reason for the excess, e.g. because you need to show absolute levels?
Are all graphics that can be vector graphics actually vector graphics (unless you export 600 DPI bitmaps, which may sometimes actually be the better way in Matlab)?
Do all axes have either clear tick marks or gridlines indicating the magnitudes of everything?
Are all the letters/numbers fully visible (i.e., not obscured by part of the figure)?
Is the aspect ratio correct? For data where the different axes have the same units, it may be important to show equal units as equal widths on all axes. (Hint: if you rescaled the width and the height separately, it probably is not.)
If the data are 2D, are you displaying them in 2D? (If not, remove that additional 3rd dimension; it is just confusing and obfuscatory.)
Does the caption begin with a sentence (fragment) stating what the figure is demonstrating (i.e., why it is there)?
Does the caption end by pointing out particularly interesting aspects of the figure that one should note?
Does the caption define all acronyms used in the figure (especially if they are not used anywhere else)?
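Several items on the check list translate directly into Matlab plotting settings. The sketch below shows one possible way to apply them: 2pt lines, 16pt fonts, black and grey with distinct line styles, labelled axes with units in square brackets, a legend, tight axes, and PNG export at 400 DPI. The two data series, their names and the output file location are made-up placeholders, not coursework results.

```matlab
% Sketch only: placeholder data, not coursework results.
t     = linspace(0, 12, 50);             % made-up time axis [months]
rentA = 1000 + 20*t;                     % made-up series A [GBP pcm]
rentB = 1100 + 5*t;                      % made-up series B [GBP pcm]

fig = figure('Visible', 'off');          % off-screen so it also runs headless
plot(t, rentA, 'k-', 'LineWidth', 2);    % black solid line, 2 pt
hold on
plot(t, rentB, '--', 'Color', [0.5 0.5 0.5], 'LineWidth', 2);  % grey dashed
hold off

set(gca, 'FontSize', 16);                % large, legible axis text
xlabel('Time [months]');                 % dimension name and units
ylabel('Rent [GBP pcm]');
legend('Series A', 'Series B', 'Location', 'northwest');
axis tight                               % bounds just around the data
grid on                                  % gridlines indicating magnitudes

outFile = [tempname '.png'];             % hypothetical output location
print(fig, outFile, '-dpng', '-r400');   % PNG at 400 DPI
close(fig);
```

The same settings can be collected in a small helper of your own so that every figure in the report shares one font and one line style scheme.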