Bridging the Gap: New Datasets Push Recommender Research Toward Real-World Scale


Sponsored Content

 

 

 

Recommender systems rely on data, but access to truly representative data has long been a challenge for researchers. Most academic datasets pale in comparison to the complexity and volume of user interactions in real-world environments, where data is typically locked away inside companies due to privacy concerns and commercial value.
That’s beginning to change.

In recent years, several new datasets have been made public that aim to better reflect real-world usage patterns, spanning music, e-commerce, advertising, and beyond. One notable recent release is Yambda-5B, a 5-billion-event dataset contributed by Yandex, based on data from its music streaming service and now available via Hugging Face. Yambda comes in three sizes (50M, 500M, and 5B events) and ships with baselines intended to make it easier to pick up and use. It joins a growing list of resources helping to close the research-to-production gap in recommender systems.

Below is a brief survey of key datasets currently shaping the field.

 

A Look at Publicly Available Datasets in Recommender Research

 

MovieLens

One of the earliest and most widely used datasets. It includes user-provided movie ratings (1–5 stars) but is limited in scale and diversity—ideal for initial prototyping but not representative of today’s dynamic content platforms.

Netflix Prize

A landmark dataset in recommender history (~100M ratings), though now dated. Its static snapshot and lack of detailed metadata limit modern applicability.


Yelp Open Dataset

Contains 8.6M reviews, but coverage is sparse and city-specific. Valuable for local business research, yet not optimal for large-scale generalizable models.

Spotify Million Playlist

Released for RecSys 2018, this dataset helps analyze short-term and sequential listening behavior. However, it lacks long-term history and explicit feedback.

Criteo 1TB

A massive ad click dataset that showcases industrial-scale interactions. While impressive in volume, it offers minimal metadata and prioritizes click-through rate (CTR) over recommendation logic.

Amazon Reviews

Rich in content and widely used for sentiment analysis and long-tail recommendation. However, the data is notoriously sparse, with a steep drop-off in interaction for most users and products.

Last.fm (LFM-1B)

Previously a go-to for music recommendations. Licensing limitations have since restricted access to newer versions of the dataset.
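Several of the datasets above are described as sparse, and that property can be put into a single number: the share of possible user-item pairs that have no recorded interaction. The snippet below is a minimal sketch of that calculation in plain Python. The function name `interaction_sparsity` and the toy data are illustrative assumptions, not part of any of the datasets listed.

```python
def interaction_sparsity(interactions):
    """Return the fraction of user-item pairs with no observed interaction.

    `interactions` is an iterable of (user_id, item_id) pairs.
    This helper is hypothetical and exists only for illustration.
    """
    pairs = set(interactions)  # drop repeated interactions on the same pair
    if not pairs:
        raise ValueError("at least one interaction is required")
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}
    possible = len(users) * len(items)  # size of the full user-item matrix
    return 1.0 - len(pairs) / possible


# Toy data: 3 users, 3 items, 4 distinct observed pairs out of 9 possible.
toy = [(1, "a"), (1, "b"), (2, "a"), (3, "c"), (1, "a")]
print(f"sparsity: {interaction_sparsity(toy):.3f}")
```

On real review or rating data, where most users touch only a handful of items, this value is typically very close to 1.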

 

Moving Toward Industrial-Scale Research

 

While each of these datasets has helped shape the field, they all present limitations—either in scale, data freshness, user diversity, or metadata completeness. That’s where new entries, such as Yambda-5B, are particularly promising.

This dataset offers anonymized, large-scale user-item interaction data across music streaming sessions, including metadata such as timestamps, feedback type (explicit vs. implicit), and recommendation context (organic vs. suggested). Importantly, it includes a global temporal split, enabling more realistic model evaluation that mirrors online system deployment. Researchers will also find value in the multimodal nature of the dataset, which includes precomputed audio embeddings for over 7.7 million tracks, enabling content-aware recommendation strategies out of the box.
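To make the idea of a global temporal split concrete, here is a minimal sketch in plain Python. It assumes each interaction is a record with a user, an item, and a timestamp. The `Event` type, the field names, and the cutoff value are hypothetical and do not reflect Yambda's actual schema or its official split.

```python
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical record layout, used only for this illustration.
    user_id: int
    item_id: int
    timestamp: int


def global_temporal_split(events, cutoff):
    """Split events at a single timestamp shared by all users.

    Everything before `cutoff` goes to train; everything at or after it
    goes to test, so the model never sees interactions from the "future".
    """
    train = [e for e in events if e.timestamp < cutoff]
    test = [e for e in events if e.timestamp >= cutoff]
    return train, test


events = [
    Event(1, 10, 100),
    Event(1, 11, 250),
    Event(2, 10, 120),
    Event(2, 12, 300),
    Event(3, 13, 310),  # user 3 appears only after the cutoff
]
train, test = global_temporal_split(events, cutoff=300)
print(len(train), "train events,", len(test), "test events")
```

Unlike a per-user leave-last-out split, a single shared cutoff means some users (user 3 here) can appear only in the test period, which is closer to what a deployed system encounters.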

Privacy has been carefully considered in the design of the dataset. Unlike earlier examples such as the Netflix Prize dataset, which was eventually withdrawn due to re-identification risks, all user and track data in the Yambda dataset is anonymized, using numeric identifiers to meet privacy standards.


 

Closing the Loop: From Theory to Production

 

As recommender research moves toward practical application at scale, access to robust, varied, and ethically sourced datasets is essential. Resources like MovieLens and Netflix Prize remain foundational for benchmarking and testing ideas. But newer datasets—such as Amazon’s, Criteo’s, and now Yambda—offer the kind of scale and nuance needed to push models from academic novelty to real-world utility.

Read the original article at Turing Post, the newsletter for over 90,000 professionals who are serious about AI and ML.

By Avi Chawla, who is highly passionate about approaching and explaining data science problems with intuition. Avi has been working in data science and machine learning for over 6 years, across both academia and industry.

 
 
