Top Qs
Timeline
Chat
Perspective

Croissant (metadata format)

From Wikipedia, the free encyclopedia

Remove ads

Croissant is a metadata format design to support sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories like Hugging Face, kaggle, Dataverse and OpenML.[1][2]

Structure

Croissant builds upon schema.org, uses primarily JSON-LD, and divides metadata in four "layers": Dataset Metadata, Resource, Structure and Semantic:[1][3]

  • The Dataset Metadata layer constrains which schema.org properties should be used, including additional properties, linking together the resources (files) of the dataset with general metadata, like licensing and citation information.
  • The Resource layer describes the individual files and sets of those using two new classes, FileObject and FileSet. A FileSet may be a collection of related images.
  • The Structure layer specifies how the files are organized in the dataset. A RecordSet class describes how resources are present, configurations that may very a lot between modality. This specification facilitates interoperability of the datasets.
  • Finally, the Semantic layer adds information for practical reuse of the dataset, such as splits for train, test and validation subsets.

It also provides a default extension for metadata related to responsible AI.[1][2]

The use of a standard machine-readable structure increases, for example, the discoverability of datasets in search engines such as Google Dataset Search.[4][5]

Remove ads

History

Croissant was shared in arXiv in March 2024 and published in the proceedings of NeurIPS 2024.[1][6][7] It started as community driven as a MLCommons Croissant Working Group, including stakeholders organizations from academia and industry, including Google, the open data institute, Sage Bionetworks and King's College London.[1][8]

Variations of Croissant are developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets.[9] Other technical extensions, such as support for RDF, soon followed.[10][11]

Remove ads

References

Loading related searches...

Wikiwand - on

Seamless Wikipedia browsing. On steroids.

Remove ads