CODING CLUB: MACHINE LEARNING

MACHINE LEARNING

Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that model to make predictions or decisions, rather than following only explicitly programmed instructions.

Deploying a machine learning model typically takes the following five steps:

1. Data collection.

2. Data preprocessing:

a. Data cleaning;

b. Data transformation;

c. Divide data into training and testing sets.

3. Model Building: Build a model on training data.

4. Model Evaluation: Evaluate the model on the test data.

5. Model Deployment: If the performance is satisfactory, deploy the model to the real system.

This process can be iterative, meaning we can restart from step 1. For example, after a model is deployed, we can collect new data and repeat the process. Let’s look at the details of each step:

1. Data Collection:

At this stage, we want to collect all relevant data. For an online business, user clicks, search queries, and browsing information should all be captured and saved in the database.
In manufacturing, log data capture machine status and activities. Such data are used to produce maintenance schedules and to predict which parts will need replacement.

2. Data Preprocessing:

The data used in machine learning describes factors, attributes, or features of an observation. A simple first step in examining the data is to find missing values and ask what each one signifies. Would replacing a missing value with the median value for the feature be acceptable? For example, a person filling out a questionnaire may not want to reveal their salary, perhaps because it is very low or very high. In this case, using other features to predict the missing salary might be appropriate; one might infer the salary from the person’s zip code. The fact that the value is missing may itself be informative. Some machine learning methods can ignore missing values, and one of these could be used for such a data set.
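As a rough illustration of the median option discussed above, the following sketch fills missing entries with the median of the observed values and keeps a flag recording which entries were originally missing, since the missingness itself may matter. The function name `impute_median` and the salary figures are hypothetical, chosen only for this example.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values.

    Returns the filled list plus a parallel list of flags marking which
    entries were originally missing, so that information is not lost.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Hypothetical salary column; None marks a respondent who left it blank.
salaries = [42000, None, 58000, 61000, None, 39000]
filled, was_missing = impute_median(salaries)
print(filled)
print(was_missing)
```

The `was_missing` flags can be kept as an extra feature, so a later model can still make use of the fact that a value was withheld.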

Data Transformation:

In general we work with both numerical and categorical data. Numerical data consist of actual numbers, while categorical data take one of a few discrete values. Examples of categorical data include eye color, species type, marital status, or gender. A zip code is also categorical: it is written as a number, but adding two zip codes has no meaning. Categorical data may or may not have an order. For instance, good, better, best is categorical data that has an order.
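One common way to transform categorical data into numbers is sketched below, under the assumption that unordered categories are turned into 0/1 indicator columns (often called one-hot encoding) and ordered categories are replaced by their rank. The helper names `one_hot` and `ordinal` and the sample values are hypothetical.

```python
def one_hot(values):
    """Encode unordered categories as 0/1 indicator columns."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal(values, order):
    """Encode ordered categories as their position in `order`."""
    rank = {label: i for i, label in enumerate(order)}
    return [rank[v] for v in values]

# Eye color has no natural order, so each color gets its own column.
eye_color = ["brown", "blue", "green", "brown"]
categories, encoded = one_hot(eye_color)
print(categories)
print(encoded)

# good < better < best has an order, so a single rank column is enough.
ratings = ["better", "good", "best"]
ranks = ordinal(ratings, ["good", "better", "best"])
print(ranks)
```

Using ranks for unordered data such as zip codes would suggest an order that does not exist, which is why the two cases are handled separately here.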

Data Splitting:

After the data has been cleaned and transformed, it needs to be split into a training set and a test set.
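A minimal sketch of such a split is shown below. It assumes a random shuffle and a 25% test share; both the `train_test_split` helper written here and the chosen fraction are illustrative assumptions rather than values given in the text.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 20 observations, identified by index.
rows = list(range(20))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every observation ends up in exactly one of the two sets, so the test set stays unseen while the model is being built.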

3. Model Building:

The training data set is used to create the model, which is then used to predict the answers for new cases in which the answer, or target, is unknown. Several different modeling techniques have been introduced and will be discussed in detail in future sections. Various models can be built using the same training data set.
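To make the idea of building a model from training data concrete, here is a small sketch using a nearest-centroid classifier. This technique is chosen only for illustration and is not one named in this post; the function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist

def fit_centroids(features, targets):
    """Build the model: one mean feature vector (centroid) per target class."""
    grouped = defaultdict(list)
    for x, y in zip(features, targets):
        grouped[y].append(x)
    return {
        label: tuple(sum(column) / len(points) for column in zip(*points))
        for label, points in grouped.items()
    }

def predict(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: dist(model[label], x))

# Hypothetical training data: two numeric features and a known target.
train_x = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
train_y = ["small", "small", "large", "large"]

model = fit_centroids(train_x, train_y)
print(model)
print(predict(model, (2.0, 1.0)))  # a new case whose target is unknown
```

A different technique trained on the same `train_x` and `train_y` would give a different model, which is what makes the later comparison step useful.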

4. Model Evaluation:

Once the model is built with the training data, it is used to predict the targets for the test data. First, the target values are removed from the test data set. The model is then applied to the test data set to predict those target values, and each predicted value is compared with the actual target value. The accuracy of the model is the percentage of predictions that are correct. These accuracies can be used to compare the different models.
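The accuracy calculation described above can be sketched as follows. The labels and the two sets of predictions are made up for illustration, and `accuracy` is a helper written for this example.

```python
def accuracy(predicted, actual):
    """Return the percentage of predictions that match the actual targets."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical test-set targets, held back while the models predicted.
actual = ["spam", "ham", "ham", "spam", "ham"]

# Hypothetical predictions from two models built on the same training data.
model_a = ["spam", "ham", "spam", "spam", "ham"]
model_b = ["ham", "ham", "ham", "spam", "spam"]

scores = {
    "model_a": accuracy(model_a, actual),
    "model_b": accuracy(model_b, actual),
}
best = max(scores, key=scores.get)
print(scores, best)
```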

5. Model Deployment:

This is the most important step. If the speed and accuracy of the model are acceptable, then that model should be deployed in the real system. The model that is used in production should be built with all the available data, since models generally improve as more data is used to create them. The results of the model need to be incorporated into the business strategy. Data mining models provide valuable information that can give companies a great advantage.

Real World Applications of Machine Learning

1. Speech Recognition

2. Computer Vision

3. Bio-surveillance

4. Robot Control

5. Accelerating Empirical Sciences

Speech Recognition

Currently available commercial systems for speech recognition all use machine learning in one fashion or another to train the system to recognize speech. The reason is simple: speech recognition accuracy is greater if one trains the system than if one attempts to program it by hand. In fact, many commercial speech recognition systems involve two distinct learning phases: one before the software is shipped (training the general system in a speaker-independent fashion), and a second phase after the user purchases the software (to achieve greater accuracy by training in a speaker-dependent fashion).

Computer Vision

Many current vision systems, from face recognition systems, to systems that automatically classify microscopic images of cells, are developed using machine learning, again because the resulting systems are more accurate than hand-crafted programs. One massive-scale application of computer vision trained using machine learning is its use by the US Post Office to automatically sort letters containing handwritten addresses. Over 85% of handwritten mail in the US is sorted automatically, using handwriting analysis software trained to very high accuracy using machine learning over a very large data set.

Bio-surveillance

A variety of government efforts to detect and track disease outbreaks now use machine learning. For example, the RODS project involves real-time collection of admissions reports to emergency rooms across western Pennsylvania, and the use of machine learning software to learn the profile of typical admissions so that it can detect anomalous patterns of symptoms and their geographical distribution. Current work involves adding a rich set of additional data, such as retail purchases of over-the-counter medicines, to increase the information flow into the system. This even more complex data set further increases the need for automated learning methods.

Robot Control

Machine learning methods have been successfully used in a number of robot systems. For example, several researchers have demonstrated the use of machine learning to acquire control strategies for stable helicopter flight and helicopter aerobatics. The recent DARPA-sponsored competition involving a robot driving autonomously for over 100 miles in the desert was won by a robot that used machine learning to refine its ability to detect distant objects (training itself from self-collected data consisting of terrain seen initially in the distance, and seen later up close).

Accelerating Empirical Sciences

Many data-intensive sciences now make use of machine learning methods to aid in the scientific discovery process. Machine learning is being used to learn models of gene expression in the cell from high-throughput data, to discover unusual astronomical objects from massive data collected by the Sloan sky survey, and to characterize the complex patterns of brain activation that indicate different cognitive states of people in fMRI scanners. Machine learning methods are reshaping the practice of many data-intensive empirical sciences, and many of these sciences now hold workshops on machine learning as part of their field’s conferences.

Tuesday 23 December 2014