Hi Ho, Hi Ho…

This week I’m going to talk a little about Mining. Data mining, naturally.

[Image: Grumpy_mine]

Data mining is a term that’s bandied around a lot lately, but why? What’s all the fuss about?

First off, what does the term even mean?

“…data mining is the process of analyzing data from different perspectives and summarizing it into useful information – information that can be used to increase revenue, cut costs, or both.”

Definition

Basically what that means is that by looking at the past we can hopefully start to see relationships, correlations and patterns in the data history, and use this to help our organisation.

There are a couple of common misconceptions about data mining, which, if you don’t mind, I’ll take a few minutes to clear up.

One is that you need massive quantities of data to take part. Huge data warehouses are nice to have, but as with so much in life it’s really more a case of quality over quantity. It doesn’t really matter how much data you have as long as it’s reliable. (For more on reliable data please refer to my earlier post on data cleaning)

On a simple level, an example of this is resource planning. Often you’ll see shops and bars hiring part-time staff around October – with a simple analysis of the last few years’ trading patterns, mixed in with a little observation and common sense, the average business owner will be able to assess how many ‘seasonal staff’, if any, they need to manage the increase in traffic.
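
If you fancy seeing what that look-back might involve, here’s a minimal sketch in R, assuming you had a simple CSV of monthly takings (the file name and column names here are invented purely for illustration):

> # Hypothetical file with columns: year, month, sales
> trading <- read.csv("monthly_sales.csv")
> # Average takings for each calendar month across the last few years
> monthly_avg <- aggregate(sales ~ month, data = trading, FUN = mean)
> # Months running well above the overall average are your 'seasonal' months
> monthly_avg[monthly_avg$sales > 1.2 * mean(trading$sales), ]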

Another misconception people tend to have is that effective data mining is all about the algorithm. The perfect algorithm will unlock the secrets hidden within and a fountain of knowledge will come gushing out.
Sadly, while algorithms are important, they’re not the be-all and end-all.

Ultimately what we’re trying to achieve with data mining is meeting a business need, improving business processes, cutting costs and maximising profits.
There is an incredible amount of work that must go into defining these issues before the data analyst even gets to work on the algorithm.

Business goals must be defined, the data has to be gathered, investigated and processed (cleansed), etc. These are the dependencies on which algorithms are written, and therefore the answer is only going to be valuable if you’re asking a good, well-thought-out question.

Douglas Adams addressed this before in ‘The Hitchhiker’s Guide to the Galaxy’, and nothing has really changed since then. Managers and CEOs seem to think the data will solve everything, not realising the important role they themselves play in the process, and find themselves disappointed or perplexed by the result.

[Image: DeepThough]

So to summarize, data mining is not necessarily about Big Data or algorithms. Not to detract from those elements – they are and will always be important players.
Data mining is more about business. It’s about data quality, sound business decisions and clear objectives. If you ignore those you’ll end up building an elaborate supercomputer and fancy algorithms which give you a meaningless answer of 42, because you don’t understand the question you’re asking.

Squeaky Clean

I’ve alluded to the importance of data cleaning in previous posts, so I think the time has come to go into a little more detail for you.

[Image: cleandata]

So what is Data Cleaning?
In a nutshell, this is the process of removing or correcting any invalid data in your data set.
Invalid data includes (a quick R sketch of some of these fixes follows below):
  • Null values
  • Spelling mistakes
  • Incorrect values
  • Incomplete values
  • Duplicated entries
  • Irrelevant data
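
If you’d like to see what some of those fixes look like in practice, here’s a rough sketch in R – the file name and column names are made up purely for illustration:

> df <- read.csv("my_data.csv", stringsAsFactors = FALSE)   # hypothetical data set
> df <- unique(df)                                          # drop duplicated entries
> df <- na.omit(df)                                         # drop rows containing null (NA) values
> df$county <- trimws(df$county)                            # tidy up stray whitespace
> df$county[df$county == "Dubiln"] <- "Dublin"              # correct a known spelling mistake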

Why do I care?
We use data and data mining techniques to make decisions, right?
And we need as much relevant information as possible to make good decisions, yeah?
If unclean data is full of irrelevant, incorrect, duplicated data, that’s not exactly good information, is it?
So what happens if you make a decision based on sucky information?
Sucky decisions, that’s what.

Unclean or dirty data can lead to misleading statistics and results. For example, if your data is full of duplicate entries then you’re going to think something is far more likely than it actually is.
Or if some categories are spelled incorrectly, then they won’t be grouped together properly in your analysis. This all leads you down the garden path into a series of potentially horrible or inaccurate decisions.

Ok, so how do I do it?

As an overview, I’ll take you back to our fusion table example from some weeks back. I mentioned how I had to ‘clean’ the data, and since you’re all loyal and diligent readers I’m sure it’s fresh in your mind. Let’s look at the county population dataset for a moment and see what I mean in a few tidy examples.

The information for Laois is spelled wrong as ‘Laoighis’; I had to correct this or my fusion table wouldn’t have been able to match the data against ‘Laois’ in my geometry data set.

[Image: Population Data Set]

Additionally, in one data set Tipperary is split into Tipp North and Tipp South, while the other data set has just Tipperary. It may anger some Tipp people to combine them into one county, but this is what I did because, again, I don’t have geometry for North v South. So they’re just going to have to accept that.

Same for Dublin: this data was provided by county council district – Dún Laoghaire-Rathdown, Fingal, etc. – but ye are all Dubs to me, north or south of the river; this monkey doesn’t care.

Furthermore, the population was also grouped into provinces, which was irrelevant for my purposes.
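
If you were doing those fixes in R rather than by hand, it might look something like this – a rough sketch only, where the file name and column headings are placeholders, and the district names would need to match whatever your data set actually uses:

> pop <- read.csv("county_population.csv", stringsAsFactors = FALSE)
> pop$county[pop$county == "Laoighis"] <- "Laois"                     # match the spelling in the geometry data
> pop$county[pop$county %in% c("Tipperary North", "Tipperary South")] <- "Tipperary"
> pop$county[pop$county %in% c("Dublin City", "Fingal", "South Dublin", "Dun Laoghaire-Rathdown")] <- "Dublin"
> pop <- pop[!pop$county %in% c("Leinster", "Munster", "Connacht", "Ulster", "State"), ]   # drop the province and State totals
> pop <- aggregate(population ~ county, data = pop, FUN = sum)        # one row per county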

This, in short, is what I mean by data cleaning: weeding out the rubbish and polishing up what’s left. A cleaned data set will be consistent with the other data sets in your system and will greatly enhance your decision-making capabilities.

If your dataset is huge there are tools you can use to trim the fat. R can be powerful in this regard, and I highly recommend you at least skim through the techniques in the following tutorial from CRAN-R:
Tutorial

Yes it’s boring
Yes it can be tedious
Yes this is the ‘clean behind your ears’ of the data analyst world. But your mum was right, it’s important to clear out anything weird or icky that’s lurking in there. So get on it monkeys! I only want to see squeaky clean data from now on!!

Beware the poop deck

What’s a Data Scientists favourite language?
Arrr!!!!

[Image: pirate_monkey]

Hang on to your booty, sea monkeys, because this week we’re delving into the world of R, courtesy of http://tryr.codeschool.com/

This is a good starter tutorial; Code School walks you through the basics and shows you some simple yet cool functionality of the R language. It’s also set out in 7 pirate-themed chapters, which is a fun little gimmick. If I had one complaint it’s that the flow of the tutorial doesn’t give you a lot of leeway to play with the language and try various options and permutations of the sample code. The program is clearly expecting one ‘right’ answer and only that will suffice. I’d like to see a little more of a ‘sandbox’ experience from Code School, but as I say – it’s a nice gentle intro to a new language.

In a rare move, I invite you all to see my booty……

[Image: Badges]

It’s quite a haul as I’m sure you’ll agree.

Now you might be asking yourself “Good for you, but why do I care?” And you care, or should, because R is a really powerful tool widely used by data analysts. You don’t NEED to learn it – there are alternatives such as Python which are also widely used and sought after. But R is a highly flexible and thriving open source language, it’s rapidly growing in popularity, and there’s a wealth of supportive resources out there to help you get started. So get on it already!

If you’re still on the fence then fine, I’ll show you some basic examples of what R can be used for. If that doesn’t convince you then I give up on you!

To continue the theme of the lesson, I pulled some quick stats on software piracy around the world, courtesy of the BSA. They quite helpfully provide an in-depth study of piracy in 2010, but for illustrative purposes I just took the table ‘Highest and Lowest Piracy Rates in 2010’.

I created a vector for the percentages (LP, for ‘Low Piracy’) and an additional vector for the country names (LPC, ‘Low Piracy Countries’):
> LP<-c(20,20,20,22,24,24,25,25,25,26,26,27,27,28,28,29,31,34,35,35,36,36,37,39,40,40,40,41,42,42)
> LPC<-c("US","Japan","Luxembourg","NewZealand","Australia","Austria","Sweden","Belgium","Finland","Switzerland","Denmark","Germany","UK","Canada","Netherlands","Norway","Israel","Singapore","SouthAfrica","Ireland","Czech","UAE","Taiwan","France","SouthKorea","Portugal","Reunion","Hungary","Slovakia","PuertoRico")

Then I used the qplot function from the ggplot2 package to draw a bar chart (you’ll need the package loaded first):

> library(ggplot2)
> qplot(LPC, LP, stat="identity", geom="bar", ylab="percentage")

[Image: LowPiracy1]

This, however, isn’t particularly pretty or clear, so we’ll need to apply a little more science.
The country names overlap, so I ‘cleaned the data’ (more on this in subsequent posts) to replace the full names with country codes.
I’ll also add more descriptive names to the axes so we know what we’re looking at.
R will also let you fill the bars with the colours of your choice, but I found a nice little function that ‘groups’ the bars together by value.
So if we apply these we get a slightly more descriptive chart:

> qplot(LPC, LP, stat="identity", geom="bar", fill=factor(LP), ylab="percentage", xlab="Low Piracy Countries")
[Image: LowPiracyEdit]

We’ll apply the same logic to the High Piracy data and get the following:

[Image: the High Piracy chart]
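
As a quick aside: if you’re wondering how the ‘full names to country codes’ swap might be done in R, here’s one rough approach – the lookup below only covers a handful of countries and the codes are purely illustrative:

> codes <- c(US = "US", Japan = "JP", Luxembourg = "LU", NewZealand = "NZ", Australia = "AU")
> # Swap in the short code where we have one, otherwise keep the original name
> LPC_short <- ifelse(LPC %in% names(codes), codes[LPC], LPC)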

So what can we extract from the above? These countries are the ‘extremes’ of software piracy: the lowest band, with a piracy rate of 20%, is Japan, Luxembourg and the US.
The highest level of piracy is 93% in Georgia, closely followed by Zimbabwe, with Bangladesh, Moldova and Yemen tied at 90%.

So the trend implies that the most likely culprits for piracy are in the developing world of ‘emerging markets’, and the least likely are your ‘western world’ developed countries. Why is this? Is it a lack of education or awareness – perhaps they don’t realise what they’re doing is piracy?
Or is it financially motivated? Software is expensive; it’s highly likely that the average Georgian corporation or consumer simply doesn’t have the capital to acquire these programs legally, in the numbers they require, so using ‘cracked’ versions is the only viable solution for them.

So how can this data be used?
Perhaps software vendors like Adobe or Microsoft can use this data to justify reducing the cost of their software – so many people are pirating it that they’re not getting maximum revenue as it stands; make it more attractive to purchase and maybe fewer people will try to crack it.

Or create a targeted package for developing markets with limited features for a lower cost?
Provide an attractive ‘bulk buy’ rate for corporations based in the developing market who require multiple copies?
Launch campaigns to create awareness of what is/isn’t piracy?
Or maybe they’re simply just going to up their security protocols to prevent piracy overall.
I’m stingy, so I’d vote for the first option. Legit copies of software are seriously overpriced.

So begone ye scurvy sea monkeys and learn Arrrrrrr!

Big Mother

Everyone who has ever moved out of home has had that phone call with their mother – Are you eating properly? How are you sleeping? Are you getting outdoors? The incessant barrage of questions. If you’re like me, you roll your eyes and say something along the lines of ‘yes… GAWD mam, I am an ADULT you know…’ I mean, how dare she?! Questioning my ability to take care of myself, like I’m some sort of child!
Alas, irony is a cruel mistress, and it’s sadly apparent that I’ve voluntarily offered up and tracked all of this information through apps, entirely of my own accord.
I intensively researched and purchased the best activity tracker my meagre budget would allow, which counts my steps and gives me a polite little dart if I’ve been still for what it deems to be too long, reminding me to go stretch my legs.
This nifty little device also detects when I’m asleep and monitors my movement, creating a handy graph each morning of how well I slept and for how long. If you don’t have a snazzy gadget like mine then there are also plenty of apps akin to ‘SleepCycle’ which simply use the accelerometer in your smart phone to sense your weight shifting on your mattress each night.
There are also multitudinous apps available to track your diet, with varying degrees of accuracy – some even include barcode scanners for exact logging of your dietary habits.
I’m not even going into fitness apps or we’d be here all day – MyFitnessPal, MapMyRun/Walk/Cycle, apps that offer training plans for everything from couch to 5ks, marathons, yoga sequences, boxing drills… Anything and everything you could dream of, all ready and waiting for you to offer up your personal health details.
This, my friends, is Big Mother. Always hovering and, in theory, watching out for your health. Helping you to take care of yourself.
I’m not against these apps – as I say, I’ve volunteered a LOT of my personal information through them, without even really thinking of the ramifications. Under the guise of getting healthy, keeping fit and meeting personal goals, it seems perfectly reasonable to hand over your personal information. But let’s just pause for a second to think about just how detailed a picture this data creates about you, and what sort of access you grant each of these apps.
They know my name, age, height, weight, email, phone number, location and preferred routes; they may even have before/after shots of me in my undies. A quick look at the developer permissions these apps request includes: Location, Photos/Media/Files, Contacts, Identity, Camera, Wifi, Calendar, SMS, permission to read and write my phone’s call log, Bluetooth information and microphone.
I handed over all of this information without really questioning why a fitness app needs to be able to read my call log, nor did I really read the privacy policy.
I’m not alone either – 7 million fitness bands were sold in Q1 of this year (2015), and reports indicate that there are upwards of 100,000 fitness apps available for download, so there are plenty of people just like me, signing away this data just so we can boast about our step count.
[Image: FitBit-Meme]
The question is, what happens to that data? Some of it is being sold to marketing companies already; other providers reserve the option to sell it later. Yes, Big Mother is watching over me, but what about Big Brother?
The list of uses for this data is almost as long as the data itself. Collating information on popular routes could impact city planning – pedestrian crossings, street lamps, city bikes, bike lanes, bus routes…
Your app could alert you when you’ve worn out your favourite shoes and order you replacements; your mileage in those shoes could be sold off to develop better shoes, or to sell you the ‘new edition’; granting access to your playlist could suggest new songs or similar artists.
My insurance company could reward me for hitting certain fitness milestones – completing my couch to 5k or losing those stubborn 5 pounds.
None of this sounds particularly harmful, to be honest it sounds pretty cool.
However, there is a dark side (isn’t there always?). During a talk in 2013, Gus Hunt, the CTO of the CIA, revealed a fascinating fact:
“…just simply by looking at the data what they can find out is with pretty good accuracy what your gender is, whether you’re tall or you’re short, whether you’re heavy or light, but what’s really most intriguing is that you can be 100% guaranteed to be identified by simply your gait – how you walk.”
Some of that information doesn’t surprise me, much of it I’ve volunteered, but what leaps out at me is that the little accelerometer I keep on my wrist is uniquely identifying me by my gait. Call me a cynic but that’s a nefarious amount of information to hold on someone.
[Image: soze]
Going back to the ‘cool’ examples, what if my insurer thinks I’m not losing the right amount of weight fast enough? Or what if I didn’t meet my target 10k time? Or went a bit OTT on the ice cream?
What if they track my phone location and decide I drove to work a little too quickly this morning? Will my premiums then go up accordingly?
What if my employer gets access to this data? What if they think I’m not too sick to come in today after all?
Sure, right now this information is mostly just being used to target ads at me. That’s a little irritating, but mostly avoidable and not particularly sinister; indeed a lot of people are happy to get these targeted ads. They like Reebok runners and want to know about Reebok’s new runners. Happy days.

The old chestnut of ‘nothing to hide, nothing to fear’ often gets trotted out at this point – basically, if you’re not doing anything illegal or wrong then you won’t mind me watching you, and if you are doing something wrong you deserve to be caught and have no right to complain about it.

But let’s be honest, none of us are perfect. We all have something we want to hide, something we don’t want to be public knowledge.
Maybe traffic wasn’t that bad this morning and I just slept in.
Maybe I wasn’t sick this morning and I just couldn’t face that meeting.
What if I didn’t go to the gym on my way home, and actually went shopping?
What I want you to think about now is this: at what point does this become surveillance? And where does it start getting restrictive?
A world where your every move is logged and tracked and where your every action is scrutinized isn’t a healthy atmosphere to live in – but it’s one we’re signing up to one app at a time without even realising.
This isn’t big data – it’s OUR data, and we need to start thinking about who we share that with.
Sleep on it, Liebchen.

It’s Getting Hot in Here….


This week’s post, should you choose to read it, relates to Google Fusion Tables. For the uninitiated, here comes the science (courtesy of Wikipedia):

Google Fusion Tables (or simply Fusion Tables) is a web service provided by Google for data management. Fusion tables can be used for gathering, visualising and sharing data tables. Data are stored in multiple tables that Internet users can view and download.

This is a product/tool/app/service I’d heard a little buzz about, but hadn’t actually sat down and played with until recently. Basically, what it does is let you do some pretty badass analytics really quickly. The two funnest things that this definition doesn’t include are:

1) You can merge multiple tables to produce some really snazzy and powerful results
2) It’s FREE!! Free toys, what more could we possibly ask for?

“That’s all well and good, but I don’t have any data to analyse,” says you. Fear not, munchkins, because once again Google has you covered. You can use your own spreadsheets if you have any information you want to visualise, but there’s a plethora of public data tables provided for you to play with too. All you have to do is search https://research.google.com/tables and uncover information on everything from butterfly populations to Toronto traffic. So basically you’ve got no excuse not to play with this thing.

There’s a ton of ways to visualise the data – all the usual bar charts, pie charts, line graphs etc. – but for the purposes of this blog post I’ve decided to go with a heat map, because it’s fancy and I’m all about the shiny.

To create the vision you see before you, all I needed was two little files – the county geometries and the population figures – both freely provided online, here and here.

The county information comes in the form of a KML file – a ‘Keyhole Markup Language’ file, which is a thing I’d never come across before. Bear in mind that if you try to open this file your poor laptop will probably be very confused and not know quite what to do with it. Don’t worry about this – Google Fusion will know precisely how to handle it, so just plonk it straight into your fusion table.

Before you can merge these files into a heat map, you will need to spend a little time ‘cleaning’ the data. If you take a look through it you’ll find there are items we don’t really need for our purposes. For example, a total value is provided for the ‘State’ and for the provinces – these are values which don’t appear in the geometry information (which simply lists the counties in the Republic). In addition to this, the big cities/regions are broken down into things like Dublin City, Fingal, Dún Laoghaire-Rathdown etc.; we don’t have geometry for these values, so all we need from them is the total for the county as a whole.

I’ll backtrack just a smidge here to explain what data cleaning means. Data cleaning is the process of removing inaccurate, inconsistent, null or other data that is just plain wrong. I won’t lie, this is simultaneously the most important and the most tedious part of a data scientist’s work. Your analysis is only as good as the data you’re working with, and if your data is full of mistakes you don’t have a snowball’s chance of extracting a worthwhile decision out of it. So clean your data diligently!

So, now that you’ve imported your KML file, cleaned your population data of all the stuff we’re not going to need and corrected all the spelling mistakes, we’re good to go. You will now have something that looks like this:
[Image: heatmap1]

Which admittedly, right now, doesn’t look like a whole lot. But with just a little simple wizardry we’ll have something fabulous in a moment. If you proceed to the ‘Feature Map’ on the left of your table, we’re going to change the ‘feature style’, which will let us choose a ‘polygon’ with a fill. This is the part where we can apply different colours to the counties so we can visualise our population. The trickiest thing here is divvying up your ‘buckets’ into sizes that give you nice, tidy, relevant groupings for our county populations. So you wind up with something like this:

[Image: the finished heat map]

Which we can all agree looks a lot better. You can choose any colours you like, but I’ve gone with shades of red: the darker the colour, the denser the population.

So I’m sure you can already see how this tool can be used to convey complex ideas quickly and simply, without using rows and rows of tedious information or the ‘same old same old’ pie charts. Here are a couple of simple examples of ways you can use a heat map like this to make important decisions:

  • Identifying areas of high population which need better roads or public transport
  • Identifying areas of population growth to propose building new schools, garda stations, hospitals etc
  • Identifying problem areas – crime, traffic accidents etc

To illustrate this I’ve whipped up some simple heat maps of penalty points issued in March 2015, generously provided by the RSA.

In the first map, we pretty much see what we expect to – penalty points are concentrated in the areas of the highest population.

But if we quickly work out and map the ‘points per capita’, we get a different story: the concentration of points around the big cities drops, and suddenly the midlands can’t look so smug anymore.
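
If you wanted to do that per-capita sum outside of Fusion Tables, it’s only a few lines of R (again, the file and column names below are placeholders, not the RSA’s actual headings):

> points <- read.csv("penalty_points_march2015.csv")   # hypothetical columns: county, points
> pop    <- read.csv("county_population.csv")          # hypothetical columns: county, population
> merged <- merge(points, pop, by = "county")
> merged$points_per_capita <- merged$points / merged$population
> merged[order(-merged$points_per_capita), ]           # counties ranked by points per head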

We can dive into this data even more by using filters to show where the most and least penalty points are dished out – for example, drivers with 12 penalty points vs drivers with 1 penalty point.

[Image: 1points12points]

If you had access to the full data set to really do a deep dive, you could analyse the who, what, when, where and why of penalty points. This would allow the strategic placement of speed traps or breathalysers, targeted road safety campaigns, etc.

So, even from this high level I’m sure you can already see just how powerful this tool can be. It’s surprisingly simple to use, and in just a few minutes you can create visuals that tell a compelling data story in a very digestible way.

On that note I release you, my fledgling data monkeys, to go play with this wonderful tool yourselves, report back on your findings in the comments below!

Hello world!

Welcome welcome! This is merely a quick post to welcome you, my fuzzy little followers, to my blogatorium.
Within this webspace you will find many things: insights, data, rants, pictures of monkeys. Mostly the monkey thing.

Abandon hope all ye who enter here, for beyond this point lies my peculiar and twisted little mind.

[Image: speccymonkey]