In regression problems, we try to predict a continuous value as the output. This differs from classification, where the output is a category or class (the input is 'this', not 'that'). We support several types of regression problems using the following algorithms:
- Linear Regression
- Radial Basis Functions
- Regression Trees (e.g. Random Forest)
- Support Vector Regression (SVR)
In this example, we will build a model to predict house prices (a price is a number from a defined range, so this is a regression task). We will use linear regression to predict the sale price based on multiple attributes.
You can download the house price dataset here.
Suppose you want to sell your house and are wondering what you can get for it. You would usually look at other homes similar to yours, in the same area and of about the same age. We will do something similar, but using linear regression. The dataset contains the following attributes:
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centers
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. PRICE True value of owner-occupied homes in $1000's
We will train our model to predict PRICE.
If you haven't already, sign up for a Knowi account. Go to the machine learning workspace area and create a new workspace for predicting housing prices. Select REGRESSION as the type.
Select a training dataset or upload a .CSV file that contains your training data. In this case, we have already loaded our training dataset, called House_prices_training_dataset.
Since I know I'm going to run a linear regression, I'm selecting the Ordinary Least Squares algorithm. You can also select all the available algorithms to see which one produces the best results.
You have the option to use Cloud9QL to do anything needed to prep your data, including removing outliers, creating aggregations, limiting the rows selected, etc.
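To make the outlier-removal idea concrete, here is a minimal sketch in plain Python rather than Cloud9QL. It is not how Knowi implements data prep; the `remove_outliers` helper, the 1.5 x IQR rule, and the sample values are all assumptions chosen for illustration.

```python
import statistics


def remove_outliers(records, field, k=1.5):
    """Drop records whose `field` lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper: the IQR rule is one common convention,
    not necessarily what any particular tool applies.
    """
    values = [r[field] for r in records]
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in records if low <= r[field] <= high]


# Made-up prices (in $1000's); 500 is an obvious outlier.
records = [{"PRICE": p} for p in (20, 22, 21, 23, 24, 19, 500)]
cleaned = remove_outliers(records, "PRICE")
print(len(records), "->", len(cleaned))
```

Removing extreme rows before training matters for linear regression in particular, because a squared-error fit is pulled strongly toward outlying points.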
- Select the algorithm(s) from the algorithm list. For the first run, we will select all with default settings.
- Select attributes to be included in the training from the field list.
- Select the predict (class) field from the drop-down at the bottom of the field list. In our case, we select "price" for this.
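For readers curious about what an ordinary least squares fit does with the selected attributes and the predict field, the following is a rough, self-contained sketch. It is not Knowi's implementation; `fit_ols`, `predict`, and the toy numbers are hypothetical and only illustrate the technique of solving the normal equations.

```python
def fit_ols(rows, targets):
    """Fit y ~ b0 + b1*x1 + ... + bk*xk by solving (X^T X) b = X^T y."""
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("attributes are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coeffs, row):
    """Apply the fitted intercept and slopes to one row of attributes."""
    return coeffs[0] + sum(c * float(v) for c, v in zip(coeffs[1:], row))


# Toy data (made up, not from the dataset): two attributes standing in
# for RM and LSTAT, with price = 5 + 3*RM - 0.5*LSTAT exactly.
rows = [(6.0, 5.0), (7.0, 4.0), (5.0, 10.0), (8.0, 3.0), (6.5, 12.0)]
prices = [20.5, 24.0, 15.0, 27.5, 18.5]
coeffs = fit_ols(rows, prices)
print([round(c, 3) for c in coeffs])
print(round(predict(coeffs, (6.2, 8.0)), 2))
```

The selected attributes play the role of `rows`, and the predict (class) field plays the role of `targets`. Real data will not fit a line exactly, so the coefficients are the ones that minimize the sum of squared differences between predicted and true prices.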
Click the Train button and wait for training to complete to view the results.
a. The Results panel will be populated with your run results (one per algorithm). You can expand each entry to see more detailed results by clicking on the + symbol. You can also compare the output of your model vs the training value by clicking on the eye symbol (more on this in the next section). Last but not least, you can publish your model by clicking on the save icon.
b. At the bottom of the page is the History section which lists all the runs that you have done on this workspace. Selecting these will update the Results panel with detailed information about that run.
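When comparing a model's output against the training values, a few standard error measures are useful. The sketch below computes three common ones; the source does not say which metrics the Results panel reports, so the `regression_metrics` function and the sample numbers are illustrative assumptions only.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs(e) for e in errors) / n,    # mean absolute error
        "rmse": math.sqrt(ss_res / n),             # root mean squared error
        "r2": 1.0 - ss_res / ss_tot,               # share of variance explained
    }


# Made-up true and predicted prices (in $1000's).
actual = [24.0, 21.6, 34.7, 33.4]
predicted = [25.0, 20.6, 33.7, 34.4]
metrics = regression_metrics(actual, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Lower MAE and RMSE and an R^2 closer to 1 indicate a closer fit, which gives one way to rank the runs when several algorithms were trained at once.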
5. Save your model so you can apply it to your analytics workflow.
Applying a Saved Model to Data
- Go to the query page by clicking on the Data Feeds / Queries icon in the left menu bar and create a new (or edit an existing) query.
- At the bottom of the query editing page, there is an "Apply Model" button. Clicking on this shows a drop-down with a list of published models.
Select our newly created model to be applied to the data. A few important notes:
a. The model is applied after the query has been executed. This allows us to perform all necessary data manipulation using the query section as long as the result of our query contains all the attribute fields required by the model.
b. The predict (class) "price" attribute is automatically added to the result after applying the model.
c. The model is applied before the ad-hoc (grid) query. This allows further manipulation of the output data.
d. Existing query functionality remains as is.
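The ordering described in these notes can be pictured as a three-stage pipeline: the query runs, the model adds its predicted field, and the ad-hoc query then works on the enriched rows. The sketch below is a conceptual illustration only; `run_pipeline`, its parameters, and the stand-in model are hypothetical and do not represent Knowi's API.

```python
def run_pipeline(run_query, model, required_fields, adhoc=None, target="price"):
    """Hypothetical illustration of the order: query -> model -> ad-hoc query."""
    rows = run_query()                      # 1. the query executes first
    scored = []
    for row in rows:
        missing = [f for f in required_fields if f not in row]
        if missing:
            # The query result must contain every attribute the model needs.
            raise KeyError(f"query result is missing fields: {missing}")
        out = dict(row)
        out[target] = model(row)            # 2. the predicted field is added
        scored.append(out)
    return adhoc(scored) if adhoc else scored   # 3. the ad-hoc query runs last


# Stand-in query and model with made-up values.
def sample_query():
    return [{"rm": 6.0, "lstat": 5.0}, {"rm": 5.0, "lstat": 10.0}]


def toy_model(row):
    return 5 + 3 * row["rm"] - 0.5 * row["lstat"]


result = run_pipeline(
    sample_query,
    toy_model,
    required_fields=("rm", "lstat"),
    adhoc=lambda rows: [r for r in rows if r["price"] > 20],
)
print(result)
```

Because the prediction is added before the last stage, the ad-hoc step can filter, sort, or aggregate on the predicted field just like any other column.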
You can see the new predicted price field and compare it to the true price. Click on Data Statistics to see the maximum and minimum values, mean, standard deviation, etc. You can also view a scatterplot of your data.
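For reference, the summary figures mentioned above can be reproduced for any numeric column with a few lines of Python. This is a sketch under assumptions: `summarize` is a hypothetical helper, the values are made up, and the sample standard deviation is used here, which may or may not match the convention the Data Statistics view applies.

```python
import statistics


def summarize(values):
    """Return min, max, mean and sample standard deviation of a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


# Made-up predicted prices (in $1000's).
summary = summarize([10.0, 20.0, 30.0])
print(summary)
```

Running the same summary on the true price and the predicted price gives a quick check that the predictions cover a similar range and spread.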