Too little data? FRP researchers come together at workshop to address this problem in machine learning

BY: NATALIE GOLD

Machine learning powers many everyday technologies, from search-engine autofill to the ordering of social media feeds. It uses statistical algorithms to find patterns in data, whether numbers, words, or images, and from those patterns it makes predictions about what will happen next. Sometimes, though, there is not enough data for a machine to make stable and accurate predictions: a model can become too tailored to the examples it has seen and fail to apply to new situations. So what do researchers do when there's not enough data?

On February 5, 2021, the Hariri Institute's Machine Learning for Chemistry & Materials Science Focused Research Program (FRP) worked through this question. The "Sparse data & small datasets in machine learning" workshop brought together researchers from different fields to address what happens when there is not enough data to train an accurate model, and to discuss how to prevent the resulting inaccuracies.

Konstantinos Spiliopoulos, Professor of Mathematics & Statistics, spoke about methodologies in machine learning.

After a general introduction from FRP leaders Aaron Beeler, Associate Professor in Chemistry, and Emily Ryan, Associate Professor in Mechanical Engineering, participants heard from three experts. Eric Kolaczyk, the Hariri Institute's Director, Brian Kulis, Associate Professor of Electrical & Computer Engineering, and Konstantinos Spiliopoulos, Professor of Mathematics & Statistics, presented on how machine learning performance depends on dataset size, the problems that arise from working with small datasets, and what can be done to overcome these challenges.

Brian Kulis, Associate Professor of Electrical & Computer Engineering, presented on how overfitting can lead to inaccurate models.

One of the biggest issues in working with limited data is overfitting. Kulis explained that overfitting occurs when a model is trained too closely on one particular dataset, so that it cannot work effectively on other data. When this happens, the model's predictions are inaccurate and do not generalize to most situations. One solution is data augmentation, in which researchers take the data they already have, make slight changes to it, and present the modified copies to the model as new examples. With these additions, there is enough data to avoid making the model too specific.
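
The article does not describe the specific augmentations discussed at the workshop. As an illustration of the idea, here is a minimal Python sketch for image-like data; the flip-and-noise transformations and the toy batch are assumptions chosen for simplicity, not the presenters' method.

```python
import numpy as np

def augment(images, rng=None):
    """Produce simple augmented copies of a batch of images.

    Each input image yields two extra training examples: a
    horizontally flipped copy and a copy with small Gaussian noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    augmented = []
    for img in images:
        augmented.append(img)                       # keep the original
        augmented.append(img[:, ::-1])              # horizontal flip
        noisy = img + rng.normal(0.0, 0.01, img.shape)
        augmented.append(np.clip(noisy, 0.0, 1.0))  # slightly perturbed copy
    return np.stack(augmented)

# A toy batch of 4 grayscale "images": augmentation triples the dataset.
batch = np.random.default_rng(0).random((4, 8, 8))
print(augment(batch).shape)  # (12, 8, 8)
```

Because each variant is a plausible version of a real example, the model sees more data without memorizing any single image.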

Eric Kolaczyk, Hariri Institute Director, explained that regularization helps a model make more accurate predictions.

The presenters discussed another way to prevent overfitting: regularization. This is a class of techniques that places constraints on the model being learned, for example by penalizing overly large parameter values. Kolaczyk pointed out how regularization can improve the accuracy of a model's predictions, which is essential to a well-functioning model.
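
The article does not say which regularization techniques the presenters covered. One standard instance is L2 (ridge) regularization, sketched below in Python; the penalty weight `alpha` and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Fit linear regression with an L2 (ridge) penalty.

    The term alpha * I shrinks the learned weights toward zero,
    constraining the model so it cannot fit noise in a small
    dataset too closely.
    """
    n_features = X.shape[1]
    # Closed-form ridge solution: (X^T X + alpha * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Toy example: 10 noisy samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=10)

print(ridge_fit(X, y, alpha=0.0))   # unregularized least squares
print(ridge_fit(X, y, alpha=10.0))  # heavily regularized: smaller weights
```

Raising `alpha` shrinks the fitted weights toward zero, trading a little bias for much lower variance on small datasets.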

The FRP workshop brought together researchers from different backgrounds to share their expertise and pose questions. This convergence is essential to the advancement of research, and it remains possible even during a global pandemic. Workshop participants found ways of working through dataset-size problems to ensure the accuracy of their machine learning algorithms. The FRP researchers will convene again at their next workshop in March to discuss problems and possible solutions in their projects.
