Data-Centric AI

Training Data Pruning

It is common practice to collect as much training data as possible. However, large datasets increase data-management and model-training costs, and not all examples are equally important. Given a large training dataset, what is the smallest subset you can sample that still meets a target level of performance?
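One common way to frame this is to score each example by some notion of importance (for instance, the loss a small proxy model assigns to it) and keep only the top-scoring fraction. The sketch below illustrates that idea in plain Python; the function name `prune_by_score` and the toy scores are illustrative assumptions, not a specific library's API.

```python
# Hypothetical sketch: prune a training set by a per-example importance
# score (e.g., a proxy model's loss on that example), keeping only the
# highest-scoring fraction of the data.

def prune_by_score(examples, scores, keep_fraction=0.5):
    """Return the `keep_fraction` of examples with the highest scores."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must align")
    k = max(1, int(len(examples) * keep_fraction))
    # Rank indices by descending score and keep the top k.
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original dataset order
    return [examples[i] for i in keep]

# Toy usage: the scores stand in for how "informative" each example is.
data = ["a", "b", "c", "d"]
losses = [0.1, 0.9, 0.5, 0.7]
subset = prune_by_score(data, losses, keep_fraction=0.5)
print(subset)  # → ['b', 'd']
```

In practice the interesting question is which scoring rule to use and how small `keep_fraction` can go before performance drops below the target threshold; that trade-off is typically measured empirically by retraining on the pruned subset.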