Machine learning is iterative in nature: the same steps are repeated over and over. To understand machine learning, then, one has to understand the machine learning process. That process is tricky and challenging, and it is rarely easy. The reason is clear: it involves a large amount of complex data, from which we try to extract meaningful predictive patterns and models. That’s why, as I mentioned in my last article, this work is handled by data scientists, who are specialists in this space. In my last article I also mentioned how rewarding a machine learning process can be. The benefits can be outstanding, but we should keep in mind that the process does not always succeed. It can fail, though that is rare. In this article, let’s focus on the processes and scenarios used in machine learning.
We’ll cover machine learning concepts, processes, scenarios and terminology in the form of a series. This is the second article of the series, and it focuses largely on machine learning processes and scenarios. Here are the articles in the series:
- Introduction to Machine Learning
- Machine learning processes and scenarios
- Machine learning: Deep dive
In machine learning, asking the right question is the most important part of the process. Once we know the question, we should ask ourselves whether we have enough (and correct) data to answer it. If you ask the wrong question, or you do not have enough correct data, the answer you get will never be the one you expect. For example, consider Internet banking transaction fraud. We ask, “How can we predict that a transaction is going to be fraudulent?” It may be that a large share of the predictive data lies in which city the customer lives in, what their occupation or business is, or how long they have lived at their current address.
We might not have all of this data, and we might not get it until some later point. In that case we should ask ourselves: do we have enough data (or at least the correct data) to start? If we don’t, we are not going to get the answer we are looking for from the machine learning process. We should then also ask ourselves what the criteria for success are. At the end of the process we get only a model that makes predictions from the data; it does not give us an exact answer. So we should ask how good those predictions need to be for the entire process to be called a success. In our example, if the fraud prediction turns out to be right in, say, 16 out of 20 cases, is that good enough? What about 14 out of 20, or should it be 18 out of 20? How do we decide? Knowing the answers to these questions is really important, because without them we won’t get the desired result, and we will never know when the process is complete and we have our actual predictive model.
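The success question above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not a prescribed method: the function names and the 85% bar are assumptions chosen only for illustration, and in practice the bar would be agreed with the people who own the problem.

```python
def prediction_accuracy(correct: int, total: int) -> float:
    """Fraction of cases in which the prediction turned out to be right."""
    if total <= 0:
        raise ValueError("total must be a positive number of cases")
    return correct / total


def is_success(correct: int, total: int, required_accuracy: float) -> bool:
    """True when the observed accuracy reaches the agreed success bar."""
    return prediction_accuracy(correct, total) >= required_accuracy


# Hypothetical bar (an assumption for this sketch, not a recommendation).
REQUIRED_ACCURACY = 0.85

# The three cases from the text: 14, 16 and 18 correct out of 20.
for correct in (14, 16, 18):
    accuracy = prediction_accuracy(correct, 20)
    verdict = "success" if is_success(correct, 20, REQUIRED_ACCURACY) else "not yet"
    print(f"{correct}/20 = {accuracy:.0%} -> {verdict}")
```

Writing the bar down before modelling starts is what lets us say, later on, that the process is complete.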
Going into the details of the machine learning process, we first identify, choose and obtain the data we want to work with. We often need to work with domain experts: in our example, people who know a lot about fraudulent transactions, or more generally, people who know the actual problem we need to solve. Being experts, they know which data is likely to be predictive. But the data we start with is raw and unstructured, and it is never in the form needed for actual processing. It can contain duplicates, it can have missing values, and it can include lots of extra data that is not needed. The data may also come from various sources, which can itself produce duplicate or redundant data. This creates the need to pre-process the data so that the process can understand it, and the good thing is that machine learning products usually provide data pre-processing modules for raw or unstructured data.
For example, in capital markets there is always a need for price predictions for instruments, equities or other assets, and an algorithm is applied to a huge amount of unstructured data coming from various feed providers. Multiple feed providers may provide the same data, and some providers may deliver data with gaps while others deliver it complete. So before applying the actual algorithm, we need to turn the unstructured data into structured, shaped data. This requires a pre-massaging step through which the data is passed to produce a candidate copy, which can then be processed through the algorithm to get the actual golden copy.
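To make the feed-provider case a little more tangible, here is a rough Python sketch of such a pre-processing step. It assumes each feed is a list of records with hypothetical `instrument`, `timestamp` and `price` fields, where a missing price is `None`; real feeds and real pre-processing modules will look different.

```python
def build_candidate_copy(feeds):
    """Merge several feeds into one de-duplicated list of priced records.

    Records with the same (instrument, timestamp) are treated as duplicates.
    A missing price (None) is filled from another feed when one has it.
    """
    merged = {}
    for feed in feeds:
        for record in feed:
            key = (record["instrument"], record["timestamp"])
            price = record.get("price")
            # Keep the first known price; only overwrite a gap.
            if key not in merged or merged[key] is None:
                merged[key] = price
    # Drop records for which no feed supplied a price at all.
    return [
        {"instrument": instrument, "timestamp": timestamp, "price": price}
        for (instrument, timestamp), price in sorted(merged.items())
        if price is not None
    ]


# Made-up sample data: feed_b repeats one record and fills one gap.
feed_a = [
    {"instrument": "ABC", "timestamp": "2024-01-02T09:00", "price": 10.5},
    {"instrument": "XYZ", "timestamp": "2024-01-02T09:00", "price": None},
]
feed_b = [
    {"instrument": "ABC", "timestamp": "2024-01-02T09:00", "price": 10.5},
    {"instrument": "XYZ", "timestamp": "2024-01-02T09:00", "price": 20.0},
    {"instrument": "DEF", "timestamp": "2024-01-02T09:00", "price": None},
]

candidate_copy = build_candidate_copy([feed_a, feed_b])
for row in candidate_copy:
    print(row)
```

The sketch keeps the first known price when feeds disagree, which is only one possible policy; choosing between conflicting sources is exactly the kind of decision the domain experts should weigh in on.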
After the data is pre-processed, we have good structured data, which is now the input for machine learning. But is this a one-time job? Of course not. The process has to be iterative, and it has to iterate until the data is ready. In machine learning, a major chunk of time is spent on this step, that is, working on the data to make it structured, clean, ready and available. Once the data is available, algorithms can be applied to it. Machine learning products offer not only pre-processing tools but also a large number of machine learning algorithms. The result of applying an algorithm to the data is a model, but the question now is: is this the final model we needed?
No, it is a candidate model. A candidate model is the first reasonably appropriate model we get, but one that still needs to be massaged. And do we get only one candidate model? Of course not. Since this is an iterative process, we do not know which candidate model is best until we have produced several of them through repeated iterations. We keep going until we get a model that is good enough to be deployed. Once the model is deployed, applications start making use of it. So there is iteration at the small levels and at the largest level as well.
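The loop over candidate models can be sketched in a few lines of Python. To keep the example self-contained, each "model" here is just a hypothetical rule of the form "flag the transaction if its amount is above a threshold", which is a deliberately naive stand-in for a real learning algorithm. The sample data, the thresholds and the 85% bar are all made up for illustration.

```python
def accuracy(threshold, samples):
    """Share of samples where 'amount > threshold' matches the fraud label."""
    correct = sum((amount > threshold) == is_fraud for amount, is_fraud in samples)
    return correct / len(samples)


def search_candidates(thresholds, samples, required_accuracy):
    """Try candidate models one by one until one is good enough to deploy.

    Returns the best (threshold, accuracy) pair seen so far, stopping early
    as soon as the required accuracy is reached.
    """
    best_threshold, best_score = None, -1.0
    for threshold in thresholds:
        score = accuracy(threshold, samples)
        if score > best_score:
            best_threshold, best_score = threshold, score
        if best_score >= required_accuracy:
            break
    return best_threshold, best_score


# Made-up labelled data: (transaction amount, was it fraudulent?)
validation_samples = [
    (50, False), (120, False), (300, False), (80, False),
    (900, True), (1500, True), (2000, True), (400, True),
]

chosen_threshold, chosen_accuracy = search_candidates(
    thresholds=[1000, 500, 350],
    samples=validation_samples,
    required_accuracy=0.85,
)
print(chosen_threshold, chosen_accuracy)
```

Note that the loop stops at the first candidate that clears the bar rather than the best possible one, which mirrors the idea of a model that is "good enough to be deployed".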
We need to repeat the entire process again and again and re-create the model at regular intervals. The reason is simple: scenarios and factors change, and we need to keep our model up to date and realistic at all times. This may eventually mean processing new data or applying new algorithms altogether.
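One simple way to picture "regular intervals" is a check that an application or scheduler runs before trusting the current model. The Python sketch below is only an assumption about how such a check might look: the function name and parameters are hypothetical, and the optional accuracy check is an addition of this sketch rather than something the process above prescribes.

```python
from datetime import datetime, timedelta


def needs_retraining(last_trained, now, max_age,
                     recent_accuracy=None, required_accuracy=None):
    """Decide whether the deployed model should be re-created.

    The model is considered stale when it is older than max_age. As an
    optional extra (an assumption of this sketch), it is also considered
    stale when its recent accuracy has dropped below the required bar.
    """
    if now - last_trained >= max_age:
        return True
    if recent_accuracy is not None and required_accuracy is not None:
        return recent_accuracy < required_accuracy
    return False


# Hypothetical schedule: re-create the model at least every 30 days.
trained_on = datetime(2024, 1, 1)
print(needs_retraining(trained_on, datetime(2024, 1, 15), timedelta(days=30)))
print(needs_retraining(trained_on, datetime(2024, 2, 15), timedelta(days=30)))
```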
Let’s look at a few scenarios showing how we can actually use machine learning.
Let’s again take the example of fraudulent Internet banking transactions. Assume that a certain number of bank customers use their Internet banking facility to pay through some third-party payment application or gateway. There should be a point at which a fraudulent transaction gets rejected. The challenge is to identify the fraudulent transactions.
In that scenario, we could take all the historical transaction data, run it through the machine learning process we saw in the earlier section, and eventually get a predictive model that an application could use to make decisions.
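As a toy illustration of "historical data in, model out, application decides", the Python sketch below learns a fraud rate per customer city from past transactions and lets an application reject a payment when that rate is too high. This per-city lookup is a deliberately simplistic stand-in for a real machine learning algorithm, and the city names, data and 50% cut-off are invented for the example.

```python
from collections import defaultdict


def train_fraud_rates(history):
    """Build a 'model': the observed fraud rate for each customer city."""
    counts = defaultdict(lambda: [0, 0])  # city -> [fraud count, total count]
    for city, is_fraud in history:
        counts[city][0] += int(is_fraud)
        counts[city][1] += 1
    return {city: fraud / total for city, (fraud, total) in counts.items()}


def should_reject(model, city, max_fraud_rate, default_rate=0.0):
    """Application-side decision: reject when the predicted risk is too high.

    Cities never seen in the history fall back to default_rate.
    """
    return model.get(city, default_rate) > max_fraud_rate


# Made-up historical transactions: (customer city, was it fraudulent?)
history = [
    ("Springfield", False), ("Springfield", False),
    ("Springfield", False), ("Springfield", True),
    ("Shelbyville", True), ("Shelbyville", True), ("Shelbyville", False),
]

fraud_model = train_fraud_rates(history)
for city in ("Springfield", "Shelbyville", "Ogdenville"):
    print(city, should_reject(fraud_model, city, max_fraud_rate=0.5))
```

The point of the split between `train_fraud_rates` and `should_reject` is that the application only ever sees the finished model, never the historical data behind it.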
Another example is finding out how likely a customer is to switch. Take an Internet data provider or a mobile company. In this space customers usually call the call centers. For every customer, the call center employee needs to know how likely the customer is to switch to a competitor. Knowing that, a call center executive can offer a better or more attractive deal to keep the customer from switching. The challenge is how to identify those customers, and the answer is again machine learning. Data providers and mobile companies usually have a lot of recorded call data. The data may be vast and very detailed, so an application could be built around it to consolidate it. That application could use technologies like Spark, Hadoop or any other big data technology.
The company may then need to combine the consolidated data with more data, such as data coming from its CRM systems, to build up a sufficient amount of the right data for machine learning to use. This is not uncommon. The machine learning process can take data from multiple sources. The result is a predictive model that the call center application can use to make decisions and predictions about a customer’s likelihood of switching. This adds real value to the business and helps its overall growth.
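Combining call data with CRM data usually comes down to joining records on a shared customer identifier. Here is a small Python sketch of that idea using plain dictionaries. The field names (`customer_id`, `is_complaint`, `plan`, `tenure_months`) are hypothetical, and at real scale this join would run on a big data platform rather than in memory.

```python
def consolidate(call_records, crm_records):
    """Join per-customer call statistics with CRM attributes.

    call_records: list of dicts with 'customer_id' and 'is_complaint'.
    crm_records:  dict mapping customer_id -> dict of CRM attributes.
    Returns one row per CRM customer, ready to be used as model input.
    """
    calls_per_customer = {}
    for call in call_records:
        stats = calls_per_customer.setdefault(
            call["customer_id"], {"call_count": 0, "complaints": 0}
        )
        stats["call_count"] += 1
        stats["complaints"] += int(call["is_complaint"])

    rows = []
    for customer_id, crm in sorted(crm_records.items()):
        # Customers who never called still get a row, with zero counts.
        stats = calls_per_customer.get(
            customer_id, {"call_count": 0, "complaints": 0}
        )
        rows.append({"customer_id": customer_id, **crm, **stats})
    return rows


# Made-up sample data from the two sources.
calls = [
    {"customer_id": "c1", "is_complaint": True},
    {"customer_id": "c1", "is_complaint": False},
    {"customer_id": "c2", "is_complaint": True},
]
crm = {
    "c1": {"plan": "basic", "tenure_months": 6},
    "c2": {"plan": "premium", "tenure_months": 30},
    "c3": {"plan": "basic", "tenure_months": 12},
}

training_rows = consolidate(calls, crm)
for row in training_rows:
    print(row)
```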
It all begins with asking the right question. After that we need the right, structured data to answer that question, and this is the part that takes most of the time in the machine learning process. Then the process runs through any number of iterations until we get the desired predictive model. That model is deployed and then updated from time to time to adapt to the changes that happen periodically. In the next article we’ll focus on terminology and look at the machine learning process more closely.
- Pluralsight Course – Understanding Machine Learning