Machine learning in causality studies using large datasets

The researchers want to develop machine learning methods that facilitate the study of complex causal relationships in the social and health sciences, based on large-scale record-linked databases.

Big datasets make it possible to study complex causal relationships in the social and health sciences.

In several countries, including Sweden, administrative records can be linked with health records for research purposes. Linking records for entire populations in this way over multiple decades produces large-scale databases covering millions of individuals. The data encompass hundreds of thousands of characteristics, including how socioeconomic conditions and health status develop over time for each individual, as well as for their partners, relatives, neighbors and co-workers.

The research team, which is studying socioeconomic health inequalities, has access to record-linked data infrastructures of this kind. The issues they are addressing are of a causal nature. For instance, if it is found that breast cancer survival differs between income groups, the researchers want to study the mechanisms causing this inequality.

One scientific challenge is that classical statistical methods were not designed to cope with such large volumes of data, and applying them can lead to erroneous conclusions. The researchers are therefore developing machine learning methods for causal analysis, e.g. using neural networks. They hope to achieve results similar in quality to those obtained when machine learning is used for prediction, e.g. in automated tumor identification.

To avoid the risk of underestimating uncertainty in the final statistical results, the researchers plan to develop machine learning methods that take into account key sources of uncertainty in the assumptions on which the analysis is based. They will also be developing optimal estimation methods, i.e. methods that yield the most reliable conclusions.
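
To make the idea concrete, here is a minimal sketch of one well-known approach from the causal inference literature, augmented inverse probability weighting (AIPW). It is an illustration only and not necessarily the method the project will use. Everything in it is an assumption made for the example: the data are simulated, there is a single binary confounder, the true effect is set to 1.0, and the function names are hypothetical. In a real application the simple group averages would be replaced by flexible learners such as neural networks.

```python
import random
import statistics


def simulate(n, seed=0):
    """Simulate (confounder, treatment, outcome) rows; the true effect is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0      # hypothetical binary confounder
        p_treat = 0.8 if x == 1 else 0.2        # confounder drives treatment
        t = 1 if rng.random() < p_treat else 0
        y = 1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)  # confounder drives outcome
        rows.append((x, t, y))
    return rows


def naive_difference(rows):
    """Difference in mean outcome, ignoring the confounder (biased here)."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def aipw(rows):
    """AIPW estimate of the average effect and its standard error."""
    propensity = {}    # estimated P(T = 1 | X = x)
    outcome_mean = {}  # estimated E[Y | T = t, X = x]
    for xv in (0, 1):
        stratum = [(t, y) for x, t, y in rows if x == xv]
        propensity[xv] = statistics.fmean([t for t, _ in stratum])
        for tv in (0, 1):
            outcome_mean[(tv, xv)] = statistics.fmean(
                [y for t, y in stratum if t == tv]
            )

    # One score per individual: model-based contrast plus weighted residuals.
    scores = []
    for x, t, y in rows:
        m1, m0, e = outcome_mean[(1, x)], outcome_mean[(0, x)], propensity[x]
        scores.append(
            m1 - m0 + t * (y - m1) / e - (1 - t) * (y - m0) / (1 - e)
        )

    estimate = statistics.fmean(scores)
    # The spread of the scores gives a standard error for the estimate.
    std_error = statistics.stdev(scores) / len(scores) ** 0.5
    return estimate, std_error


rows = simulate(20000)
estimate, std_error = aipw(rows)
print("naive difference:", round(naive_difference(rows), 2))
print("AIPW estimate:", round(estimate, 2), "+/-", round(1.96 * std_error, 2))
```

In this simulation the naive comparison overstates the effect because the confounder influences both treatment and outcome, while the adjusted estimate comes with an explicit measure of its own uncertainty. Note that this standard error only reflects sampling variability; it does not account for uncertainty in the underlying assumptions, such as the absence of unmeasured confounders.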

The aim of the project is thus to develop tools that enable new and more reliable conclusions to be drawn from studies of the determinants of health inequalities, as well as other complex causal relationships in the social and health sciences based on large-scale datasets.

Project:
Machine learning to study causality with big datasets: towards methods yielding valid statistical conclusions

Principal investigator:
Xavier de Luna

Co-investigators:
Tetiana Gorbach
Per Gustafsson

Institution:
Umeå University

Grant:
SEK 6 million