Posts

Showing posts from February, 2018

What is k-Nearest Neighbors?

k-Nearest Neighbors, or kNN, is a classification algorithm that assigns a class to a new data point based on known data points that are similar, or nearby. What do you mean, 'nearby'? To determine the similarity of data points you can use a few different distance measures. For example the Euclidean distance, which is the square root of the sum of squared differences of the parameters of data points v and w: sqrt(sum_i (v_i - w_i)^2). Or the City Block/Manhattan distance (yes, that's what it's called), which is the sum of the absolute differences of v and w: sum_i |v_i - w_i|. Or the Cosine distance, which measures the "angle" between two parameter vectors v and w: 1 - (v·w)/(|v||w|). What does kNN do? It goes through all the known data points and measures the distance to the new point. Then it assigns the class of the majority of the k nearest data points to the new measurement. Yes, it goes through all training data every time you run this algorithm! No model is actually created. This is called a lazy learner.
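A minimal sketch of this idea in Python, using NumPy and the Euclidean distance (the toy data and class labels below are my own illustration, not taken from the post):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k nearest training points."""
    # Euclidean distance from x_new to every known data point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two small clusters of 2D points
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
y = np.array(["blue", "blue", "red", "red"])
print(knn_predict(X, y, np.array([0.1, 0.2]), k=3))  # -> "blue"
```

Note that there is no training step at all: the "model" is just the stored data, which is exactly why kNN is called a lazy learner.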

Kaggle Tensorflow Speech Recognition Challenge

Implementation of the ResNet and CTC models at https://github.com/chrisdinant/speech. In November of 2017 the Google Brain team hosted a speech recognition challenge on Kaggle. The goal of this challenge was to write a program that can correctly identify one of 10 words being spoken in a one-second-long audio file. Having just made up my mind to start seriously studying data science, with the goal of turning a new corner in my career, I decided to tackle this as my first serious Kaggle challenge. In this post I will talk about ResNets, RNNs, 1D and 2D convolution, Connectionist Temporal Classification, and more. Let's go! Exploratory Data Analysis The training data supplied by Google Brain consists of ca. 60,000 one-second-long .wav files in 32 directories that are named after the word spoken in the files. Only 10 of these are classes you need to identify; the others should go into the 'unknown' or 'silence' classes. There are a couple of things you can do to get a…
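As a hedged sketch of how that directory layout might be turned into labels (the root path, and the assumption that non-target words map to 'unknown' and the background-noise recordings to 'silence', are my reading of the setup rather than code from the post):

```python
from pathlib import Path

# Assumed location of the unpacked training data (one sub-directory per spoken word)
DATA_DIR = Path("train/audio")

# The ten target words of the challenge; everything else becomes 'unknown',
# and 'silence' is built from the _background_noise_ recordings.
TARGETS = {"yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"}

def label_for(wav_path: Path) -> str:
    """Map a .wav file to one of the 12 classes from its parent directory name."""
    word = wav_path.parent.name
    if word == "_background_noise_":
        return "silence"
    return word if word in TARGETS else "unknown"

files = sorted(DATA_DIR.glob("*/*.wav"))
labels = [label_for(f) for f in files]
print(len(files), "files,", len(set(labels)), "classes")
```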

What is Machine Learning?

When a computer learns from data, it is called machine learning (ML). How do you teach a computer? You use ML algorithms. What is a machine learning algorithm? An ML algorithm is a function that takes in data and outputs a prediction. What kind of prediction? Predictions like, "This is a picture of a bicycle", or "Tomorrow it is going to rain", or "This customer will probably leave our company soon". OK, so how does an algorithm do this? ML algorithms are designed to find attributes of datasets that describe groupings or trends within the data. This can be something as simple as: "if a pixel in this location is blue you're looking at a smurf", or something complicated like: "a combination of this pixel intensity with this shape in this orientation here and this other shape there and so on, means this is a picture of Robert on a bicycle". Or another example: "if it rained yesterday and the air pressure is so and so and…
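The post stays at the conceptual level, but the "function that takes in data and outputs a prediction" idea can be sketched with a toy example (the weather features and the scikit-learn model below are my own illustration, not from the post):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: [rained_yesterday (0/1), air_pressure_hPa] -> will it rain tomorrow?
X = [[1, 1002], [1, 1008], [0, 1015], [0, 1020], [1, 995], [0, 1012]]
y = [1, 1, 0, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The fitted model is exactly that: a function from new data to a prediction.
print(model.predict([[0, 1000]]))  # -> [0] or [1], depending on the learned splits
```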

Bayes and Binomial Theorem

Bayes' Theorem In statistics there are many situations where you want to determine the probability that a sample for which you have a certain measurement belongs to a certain set. Say you want to know the chance that you have HIV if you test positive. No test is perfect, so this probability will depend on the test sensitivity, but also on the specificity and on the incidence in the population, or set, that you belong to. Bayes' Theorem is simply the logic you have to apply to estimate such probabilities. As a cancer researcher my attention was naturally drawn to this paper currently trending on PubMed: Detection and localization of surgically resectable cancers with a multi-analyte blood test. This is a perfect practical example for applying Bayes' rule! And most of the information we need is right there in the abstract: "The sensitivities ranged from 69% to 98% for the detection of five cancer types (ovary, liver, stomach, pancreas, and esophagus)..." and "The…
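The preview cuts off before the specificity and incidence figures, but the calculation itself is just Bayes' rule. A sketch in Python with illustrative values (the specificity and prevalence below are placeholders I chose, not numbers from the paper):

```python
def p_disease_given_positive(sensitivity, specificity, prevalence):
    """Posterior probability of disease given a positive test, via Bayes' rule."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity          # false-positive rate
    p_positive = (p_pos_given_disease * prevalence
                  + p_pos_given_healthy * (1.0 - prevalence))
    return p_pos_given_disease * prevalence / p_positive

# A sensitivity of 70% falls in the range quoted in the abstract; the other two
# numbers are made up for illustration only.
print(p_disease_given_positive(sensitivity=0.70, specificity=0.99, prevalence=0.005))
# -> roughly 0.26: even a fairly specific test yields many false positives
#    when the condition is rare.
```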