Keras 2.x Projects

Defining a regression problem

Regression analysis is a natural starting point in data science, because regression models are among the best understood models in numerical simulation. Once we are familiar with how regression models work, many other machine learning algorithms become easier to understand. Regression models are easy to interpret because they rest on solid mathematical foundations, such as matrix algebra. In the following sections, we will see that linear regression allows us to derive a mathematical formula that represents the corresponding model, which is perhaps why such techniques are comparatively easy to understand.

Regression analysis is a statistical process used to study the relationship between a set of independent variables (explanatory variables) and a dependent variable (response variable). With this technique, we can understand how the value of the response variable changes when an explanatory variable is varied.
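For the linear case, this relationship can be sketched as a formula. The symbols are notation assumed here for illustration and do not come from the text: y is the response variable, x_1 to x_p are the explanatory variables, \beta_0 to \beta_p are the coefficients to be estimated, and \varepsilon is an error term that collects whatever the explanatory variables do not capture.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Under this form, each coefficient \beta_j is the expected change in y for a one-unit increase in x_j while the other explanatory variables are held fixed.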

Consider data collected on a group of students: the number of study hours per day, school attendance, and the scores obtained on the final exam. Through regression techniques, we can quantify the average increase in the final exam score for each additional hour of study. Similarly, we can measure how lower school attendance, which reduces the student's classroom experience, lowers the final exam score.
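The student example can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's own code: the eight observations are made up, and the scores are generated from a known, noise-free rule (an assumption chosen so that the result of the fit can be checked by eye).

```python
import numpy as np

# Made-up observations for eight students (not real data).
hours = np.array([1, 2, 3, 4, 5, 6, 2, 4], dtype=float)               # study hours per day
attendance = np.array([60, 80, 70, 90, 85, 95, 75, 65], dtype=float)  # attendance in percent

# Scores are generated from a known rule so the fit can be checked:
# score = 30 + 4 * hours + 0.3 * attendance (no noise, purely illustrative).
score = 30 + 4 * hours + 0.3 * attendance

# Design matrix with an intercept column, solved by ordinary least squares.
A = np.column_stack([np.ones_like(hours), hours, attendance])
coef, residuals, rank, _ = np.linalg.lstsq(A, score, rcond=None)
intercept, per_hour, per_attendance_point = coef

print(f"Average change in score per extra study hour: {per_hour:.2f}")
print(f"Average change in score per attendance point: {per_attendance_point:.2f}")
```

Because the data contains no noise, the fit recovers the generating coefficients (4 points per study hour, 0.3 points per attendance point) up to rounding. With real data, the estimates would only approximate the underlying relationship.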

A regression analysis can have two objectives:

  • Explanatory analysis: To understand and weigh the effects of the independent variables on the dependent variable, according to a particular theoretical model

  • Predictive analysis: To find a linear combination of the independent variables that optimally predicts the value assumed by the dependent variable
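The two objectives can be contrasted on the same fitted model. The sketch below is one possible illustration under stated assumptions: the data is made up, `fit_ols` is a hypothetical helper written for this example, and standardized coefficients are used here as one common way to weigh effects measured in different units.

```python
import numpy as np

# Made-up data: columns are study hours per day and attendance in percent.
X = np.array([[1, 60], [2, 80], [3, 70], [4, 90],
              [5, 85], [6, 95], [2, 75], [4, 65]], dtype=float)
y = np.array([52, 63, 62, 74, 75, 83, 61, 66], dtype=float)  # final exam scores

def fit_ols(features, target):
    """Hypothetical helper: least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Explanatory analysis: standardize every variable so that the
# coefficients are unit-free and their sizes can be compared.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()
standardized_effects = fit_ols(X_std, y_std)[1:]

# Predictive analysis: use the fitted linear combination to estimate
# the score of a new (hypothetical) student.
coef = fit_ols(X, y)
new_student = np.array([3.5, 88.0])
predicted_score = coef[0] + new_student @ coef[1:]

print("Standardized effects (hours, attendance):", standardized_effects)
print("Predicted score for the new student:", round(float(predicted_score), 1))
```

The same least-squares fit serves both goals; what changes is whether we read the coefficients or use them to produce a prediction.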

Given its cross-disciplinary nature, regression has numerous and varied areas of application, from psychology to agriculture, and from economics to medicine and business management, to name just a few.

As a statistical tool, regression serves two purposes, namely to synthesize and to generalize, as we can see in the following diagram:

Synthesize: The first purpose (synthesize) means arranging the collected data into a form (tables, graphs, or numerical summaries) that allows you to better understand the phenomena on which the data was collected. Synthesis answers the need to simplify, which in turn results from the limited ability of the human mind to handle articulated, complex, or multidimensional information. In this way, we can use techniques that allow for a global study of a large amount of quantitative and qualitative information to highlight features, ties, differences, or associations between the observed variables.

Generalize: The second purpose (generalize) is to extend the result of an analysis performed on the data of a limited group of statistical units (the sample) to the entire group (the population). The contribution of regression is not limited to the data analysis phase. Its added value also lies in the formulation of research hypotheses, the argumentation of theses, the adoption of appropriate solutions and methodologies, the choice of detection methods, the design of the sample, and the procedure for extending the results to the reference population.

Keeping these phases under control means producing reliable and economically useful results, and mastering both descriptive and inferential statistics. Recall that descriptive statistics are concerned with describing experimental data with a few significant numbers or graphs; they photograph a given situation and summarize its salient characteristics. Inferential statistics use statistical data, appropriately summarized by descriptive statistics, to make probabilistic forecasts about future or otherwise uncertain situations.
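The difference between describing a sample and inferring something about the population can be sketched with Python's standard library. The scores are made up, and the interval uses a normal approximation, which is an assumption of this example; for a sample this small, a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Made-up final exam scores for a sample of twelve students.
scores = [52, 63, 62, 74, 75, 83, 61, 66, 70, 58, 79, 68]

# Descriptive statistics: summarize the sample itself.
sample_mean = mean(scores)
sample_sd = stdev(scores)  # sample standard deviation (n - 1 denominator)

# Inferential statistics: a rough 95% confidence interval for the
# population mean, based on a normal approximation (assumed here).
z = NormalDist().inv_cdf(0.975)
half_width = z * sample_sd / sqrt(len(scores))
ci_low, ci_high = sample_mean - half_width, sample_mean + half_width

print(f"Sample mean: {sample_mean:.2f}, sample standard deviation: {sample_sd:.2f}")
print(f"Approximate 95% interval for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first two numbers describe only the twelve students observed; the interval is a probabilistic statement about the wider population from which they are assumed to be drawn.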

People, families, businesses, public administrations, mayors, ministers, and researchers constantly make decisions. For most of these decisions, the outcome is uncertain, in the sense that it is not known exactly what will result, although the expectation is that they will achieve the (positive) effects hoped for. Decisions would be better, and their effects closer to those desired, if they were made on the basis of data relevant to the decision-making context.

Let's look at some applications of regression in the real world:

  • A student who graduates this year must choose the faculty and university degree in which to enroll. Perhaps the student has already developed a vocation for a future profession or field of study, and may have confirmed a predisposition for a particular discipline. Maybe a well-established family tradition suggests following a parent's profession. In these cases, the uncertainty of choice will be greatly reduced. However, a student who has no genuine vocation or no particular inclination toward specific choices may want to know something about the professional outcomes of graduates. In this regard, statistical studies on graduate data from previous years may help the student make a decision.

  • A distribution company such as a supermarket chain wants to open a new sales outlet in a big city and must choose the best location. It will use and analyze a large amount of statistical data on the density of the population in different neighborhoods, the presence of young families, the presence of children under the age of six (if it is interested in selling to this category of consumers), and the presence of schools, offices, other supermarkets, and retail outlets.

  • Another company wants to invest its profits. It must make a portfolio choice and it has to decide whether to invest in government bonds, national shares, foreign securities, funds, or real estate. To make this choice, it will first conduct an analysis of the returns and risks of different investment alternatives based on statistical data.

  • National governments are often called upon to make choices and decisions. To do this, they rely on statistical production systems. They have population data and forecasts of population evolution over the coming years, which help them calibrate their interventions. A strong decline in birth rates will, for example, suggest school consolidation policies; a growing presence of children from immigrant (non-community) families will signal the need to review multiethnic programs and, more generally, school integration policies. On the other hand, statistical data on the presence of national products in foreign markets will suggest the need for export support actions or interventions to promote innovation and business competitiveness.

The examples we have seen so far make clear how useful statistical techniques, and regression in particular, are in the most diverse working situations. They also show how much information and data companies need in order to ensure that those who direct them make rational decisions and act in an economically sound way.