Modern information technology allows for collecting huge amounts of data, both in terms of units (sample size) and in terms of variables (multivariate observations). However, the mere availability of Big Data does not necessarily lead to further insight into causal structures within the data. Instead, the sheer amount of data may cause severe problems for statistical analysis. Moreover, in many situations some variables in the data may be cheap to obtain, while other variables of interest may be expensive. Prediction of the expensive variables would therefore be desirable. This can be achieved by standard statistical methods when a suitable subsample is available on which the expensive variables have been observed.
Our project aims at identifying optimal subsampling schemes that reduce costs or improve the accuracy of the prediction. Concepts of optimal design theory, originally developed for technical experiments, may be deployed in a non-standard way to generate efficient sampling strategies. Basic concepts such as relaxation to continuous distributions of the data and symmetry properties may lead to a substantial reduction in complexity and, hence, to feasible solutions. The aim of our project is to make these general ideas more precise and to put them on a sound foundation for applications to real data.