Data Science: Magic for Industry of the 21st Century

Júlia Bergmann, Dávid Gyulai

The Director was annoyed. Barely a week earlier the entire production line had been stopped for cleaning, and now the factory was idle again because a single spare part had to be replaced. Could this not have been anticipated? Data analysis is supposed to be a wonder weapon that can bring significant productivity gains in many areas of a manufacturing company, isn’t it?

Incantations and what is behind them 

Industry 4.0, IoT, data science, machine learning, big data, artificial intelligence: these buzzwords are heard frequently nowadays. We might even have a rough idea of what is behind them, but do we really know how much knowledge has accumulated over the past decades in the various scientific disciplines, and what it can be harnessed for? What information is hiding in the data? How can its inherent value be explored? What should it be used for? The aim of our series of articles is to demonstrate and summarise how modern data analysis tools work in an industrial environment, how they may be applied to predict throughput times, optimize scheduling or reduce the waste rate, and thus how companies may enhance their efficiency and decrease their losses.

Chapman’s CRISP-DM diagram

Most companies collect data of various types, but only a few take advantage of them. The accumulated data might not be stored in the right way; moreover, sometimes the wrong data are collected and stored in the first place. Collecting data “well” is a tiresome and complex task. Furthermore, interpreting unstructured and complex data is difficult and requires time and stamina. If we cannot see the results in advance, the invested work may not seem worthwhile. Many studies on the web aim to help us overcome this fear, e.g. the McKinsey Analytics case study of how the route planning software of a logistics company was optimized on the basis of available data. The firm gained a 16% profit increase without amortization. Many other examples also provide evidence that it is rewarding to invest in our data and in the detailed analysis that can underpin our decision making.

Developments based on data analysis are lengthy processes. Chapman’s deservedly famous CRISP-DM diagram (Fig. 1) illustrates the complexity of the task and the constant, necessary feedback loops (Chapman, 1999). The initial step is always a standardized process: learning the given business area, specifying the targets and preparing a project plan that is as detailed as possible. This is followed by looking at the data: finding and understanding their sources, mapping them and checking their quality. In practically 99% of cases the data need cleansing, correcting and structuring. This stems from the simple fact that when data collection started (which may well have been years ago), the system designer was not, and could not have been, aware of today’s data analysis and utilization requirements. It is important to realize that these three steps constitute the major part of a data analysis project. Once our data are “digestible” and “nice” enough, we may begin the most exciting part of the work, i.e. data modelling. This is the phase where the magic happens, a magic that can be described in the language of mathematics by pages of equations and logical formulae. Among other things, the parameters of the learning algorithms are tuned, the most appropriate classification methods are selected and/or predictive models are constructed. Next, the results obtained by evaluating the model or models are compared with the targets set in the first step. It is very unusual for them to match at the first attempt. Accordingly, we repeat the first four phases until the results conform to the targets. In the last step of the process, the models created and a summary of the results are used to build the development plan for the company.
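The feedback loop described above can be made concrete with a toy sketch. The example below is only an illustration under loudly stated assumptions: the records (batch size, throughput time in hours), the two candidate models and the error target are all hypothetical and do not come from any real project, and for brevity the model is evaluated on the same data it was fitted to, which a real project would avoid by using held-out data.

```python
from statistics import mean

def clean(records):
    """Data preparation: drop records that contain missing values."""
    return [(x, y) for x, y in records if x is not None and y is not None]

def fit_constant(data):
    """Simplest candidate model: always predict the average."""
    avg = mean(y for _, y in data)
    return lambda x: avg

def fit_line(data):
    """Second candidate model: least-squares straight line."""
    xs = [x for x, _ in data]
    mx, my = mean(xs), mean(y for _, y in data)
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def mean_abs_error(model, data):
    """Evaluation: average absolute deviation of predictions from observations."""
    return mean(abs(model(x) - y) for x, y in data)

def run_project(records, target_error):
    """Try candidate models in turn until the business target is met."""
    data = clean(records)
    error = None
    for name, fit in [("constant", fit_constant), ("line", fit_line)]:
        model = fit(data)
        error = mean_abs_error(model, data)
        if error <= target_error:
            return name, error   # target reached, stop iterating
    return None, error           # no candidate was good enough

# Hypothetical records: (batch size, throughput time in hours), with gaps.
records = [(1, 2.1), (2, 3.9), (None, 5.0), (3, 6.2),
           (4, 7.8), (5, None), (6, 12.1)]

chosen, error = run_project(records, target_error=0.5)
print(chosen, round(error, 3))
```

With these made-up numbers the constant model misses the target and the loop moves on to the straight line, mirroring the repeated pass through the phases until the results conform to the targets.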

Industrial Application        

If there is any sector of the economy that may expect substantial results from data analysis, then industry is definitely one. Every individual and every company is affected by manufacturing, so the sector possesses a vast amount of data. With living standards rising worldwide, the appetite for high-quality, affordable and extremely customized products is ever-growing. Today, proper data analysis is one of the tools that offer a chance to meet this demand. If manufacturing companies are able to utilize the data in their possession, they will be able to predict customer demand more accurately and to provide products that meet customers’ needs in both quantity and quality, even in the short run.

Data analysis applications in manufacturing

While we frequently make serious efforts to separate consumer and industrial IoT from each other, from the data science perspective the real potential for efficiency gains lies in their combined application. Collecting data under real-life usage conditions and feeding them back to the production systems for analysis opens a new horizon of revenue generation.

But this is only one side of the story. Even more interesting for value-added manufacturing is the possibility of automating the analysis of the data generated by the sensors inside the devices, so that anomalies are discovered and failures predicted autonomously. The lifetime and operational time of the machines may be predicted, errors identified and inspection intervals optimally scheduled. This provides a unique way to reduce downtime and increase machine utilization. As the saying goes, a manufacturing company is only as good as the machines that make its products. Although these examples are very convincing and many firms are eager to operate similar applications, it is important to note that the active use of well-functioning support and prediction modules and subsystems presupposes an advanced level of maturity in industrial data analysis. Most companies are laggards in this respect.
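To give a feeling for what automated anomaly discovery on sensor data can look like, here is a minimal sketch. It uses one simple technique, a rolling z-score, chosen purely for illustration; the function name, the window size, the threshold and the vibration readings are all hypothetical assumptions, and production systems typically rely on far more sophisticated models.

```python
from statistics import mean, stdev

def find_anomalies(readings, window=5, threshold=3.0):
    """Return the indices of readings that deviate from the mean of the
    preceding `window` readings by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip perfectly flat histories to avoid comparing against zero spread.
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical vibration readings from a machine sensor (arbitrary units).
vibration = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 3.5, 1.0, 1.1]

print(find_anomalies(vibration))  # indices of suspicious readings
```

A flagged index could, for instance, trigger an inspection before the machine actually fails, which is the idea behind scheduling checks according to the condition of the equipment rather than the calendar.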

To cite another very informative example, let’s look at data collected not on the shop floor but during the use of the finished goods. This also contributes to an increase in production efficiency. To guarantee the lifetime of their product, manufacturers tend to make it more robust and complex than its use alone would require. In many cases this leads to higher production costs and thus to a higher product price. If, however, we receive feedback on product usage and analyze it, we may identify the factors that have no impact on product lifetime and stop over-engineering them. In this way we may attain savings.

It is worth pausing here: a real breakthrough may be achieved by applying advanced machine learning, which enables manufacturers to model products, machines and tools, simulate different scenarios and find ways of maximizing efficiency in a given situation. This is only one of the many options that data analysis offers to industry; others concern sales activity, e.g. automating operation-based purchase orders or opening new revenue sources, thus providing a new feeling of exclusivity for the customers.

All that glitters is not gold

Like every technology and scientific discipline, data analysis has its shadow side and its limitations. It is generally accepted that data science requires three distinct types of knowledge: you have to be familiar with mathematical statistics, with programming, and with the peculiarities of the application area and of the investigation itself, so that you can produce trustworthy, rapid and useful results for the business. The next figure presents an example of what happens when somebody lacks the third and probably most relevant competency: specific professional knowledge.

In the graph, the similar shape of the two curves is evident even to a layman. In mathematical terms, the correlation between the two variables is very high, 94.71%. However, the variables being compared are of completely different nature: per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets have nothing to do with each other. Nevertheless, an automated data analysis tool would detect a very strong correlation between them.
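How easily such a spurious correlation arises can be checked with a few lines of code. The sketch below computes the Pearson correlation coefficient of two series that simply both grow over time; the numbers are invented for illustration and are not the data behind the figure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Two invented yearly series with no causal link; both merely trend upward.
series_a = [10, 12, 13, 15, 18, 20]
series_b = [100, 115, 135, 150, 185, 195]

r = pearson(series_a, series_b)
print(f"correlation: {r:.2%}")
```

The coefficient comes out very close to 1 although nothing connects the two series. The calculation itself cannot tell a meaningful relationship from a coincidental one; that judgement requires knowledge of the application area.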

A real correlation?

Obviously, this is a simple example, but let’s imagine how difficult it would be to work with an “expert” who is ignorant of the difference between scheduling and capacity planning, or who cannot distinguish between cycle time and throughput time. It is therefore important that data analysis be performed by those who possess the appropriate competencies, good programming skills and satisfactory professional knowledge.


The vast subject of data analysis cannot be described exhaustively in a couple of pages. In the articles that continue this introduction, we shall provide detailed insight into the industrial utilization of the different scientific disciplines. We’ll demonstrate through practical examples how to build a regression model for the prediction of throughput time, how to cluster orders for production scheduling, and how to predict quality parameters in advance.


Algorithmic route optimization improves revenue for a logistics company. (n.d.). McKinsey Analytics.

Chapman, P. (1999, March). The CRISP-DM User Guide. 4th CRISP-DM SIG Workshop in Brussels.

Manditereza, K. (2017, August 11). What Data Science Actually Means To Manufacturing. IIoT World.