Optimal Sampling Design for Big Data

Modern information technology makes it possible to collect huge amounts of data, both in the number of units (size) and in the number of variables (multivariate observations). However, the mere availability of Big Data does not necessarily lead to further insight into causal structures within the data. Instead, the sheer amount of data may cause severe problems for statistical analysis. Moreover, in many situations some variables may be cheap to obtain while other variables of interest are expensive. Prediction of the expensive variables is therefore desirable, and it can be achieved by standard statistical methods when a suitable subsample of the expensive variables is available.

Our project aims to identify optimal subsampling schemes that reduce costs or improve the accuracy of prediction. Concepts of optimal design theory, originally developed for technical experiments, may be deployed in a non-standard way to generate efficient sampling strategies. Basic concepts such as relaxation to continuous distributions of the data and symmetry properties may lead to a substantial reduction in complexity and, hence, to feasible solutions. The aim of our project is to make these general ideas more precise and to put them on a sound foundation for applications to real data.
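
To illustrate how a design criterion can guide subsampling, here is a minimal sketch under assumptions that are not part of the project description: a simple linear regression with intercept and a single cheap covariate x drawn from a standard normal distribution. The function names, the sample sizes, and the selection rule (keep the units with the largest |x|) are hypothetical choices made for illustration only. The sketch compares the determinant of the information matrix X'X, the quantity that a D-criterion seeks to maximize, for this rule and for a uniform random subsample of the same size.

```python
import numpy as np


def information_det(x):
    """Determinant of X'X for the model y = b0 + b1 * x + error."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.det(X.T @ X)


def subsample_extremes(x, k):
    """Indices of the k units with the largest |x| (illustrative heuristic)."""
    return np.argsort(np.abs(x))[-k:]


rng = np.random.default_rng(0)

# Hypothetical full data: the cheap covariate is known for all units.
x = rng.standard_normal(100_000)
k = 1_000  # number of units for which the expensive response is observed

idx_tail = subsample_extremes(x, k)
idx_rand = rng.choice(x.size, size=k, replace=False)

det_tail = information_det(x[idx_tail])
det_rand = information_det(x[idx_rand])

# A ratio above 1 means the tail subsample carries more information
# about (b0, b1) in the D-criterion sense than the random subsample.
print("determinant ratio (tail / random):", det_tail / det_rand)
```

In this toy setting the tail subsample typically yields a much larger determinant than the random one, which is the kind of gain a well-chosen subsampling scheme aims for. For other models or covariate distributions, a different selection rule may be appropriate.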

  • Dec 5, 2024: Torsten Reuter successfully defended his PhD thesis on "D-optimal Subsampling Design for Massive Data"
  • Dec 3, 2024: Xiangying Chen successfully defended his PhD thesis on "Conditional Erlangen Program"
