Croissant (metadata format)
From Wikipedia, the free encyclopedia
Croissant is a metadata format designed to support the sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories such as Hugging Face, Kaggle, Dataverse and OpenML.[1][2]
Structure
Croissant builds upon schema.org, primarily uses JSON-LD, and divides metadata into four "layers": Dataset Metadata, Resource, Structure and Semantic:[1][3]
- The Dataset Metadata layer constrains which schema.org properties should be used and adds further properties, linking the resources (files) of the dataset to general metadata such as licensing and citation information.
- The Resource layer describes individual files and sets of files using two new classes, FileObject and FileSet. A FileSet may be, for example, a collection of related images.
- The Structure layer specifies how the files are organized in the dataset. A RecordSet class describes how the content of the resources is presented, a configuration that may vary considerably between modalities. This specification facilitates interoperability between datasets.
- Finally, the Semantic layer adds information for practical reuse of the dataset, such as splits for train, test and validation subsets.
It also provides a default extension for metadata related to responsible AI.[1][2]
A standard machine-readable structure makes datasets easier to discover, for example in search engines such as Google Dataset Search.[4][5]
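As an illustration of how the four layers fit together, the following Python sketch builds a small Croissant-style JSON-LD document as a dictionary. Only the class names FileObject, FileSet and RecordSet, the use of schema.org, and the JSON-LD serialization come from the description above. The dataset name, URLs, identifiers, field names, the "cr:" namespace URI, and the exact property names (for example "cr:containedIn", "cr:includes" and "cr:source") are assumptions made for this example and should be checked against the Croissant specification before use.

```python
import json


def build_croissant_sketch() -> dict:
    """Return a toy Croissant-style JSON-LD document as a plain dict.

    All names, URLs and property spellings below are illustrative
    assumptions, not a validated Croissant file.
    """
    return {
        "@context": {
            "sc": "https://schema.org/",
            # Assumed namespace for Croissant-specific terms.
            "cr": "http://mlcommons.org/croissant/",
        },
        # Dataset Metadata layer: general schema.org properties.
        "@type": "sc:Dataset",
        "sc:name": "toy_flowers",  # hypothetical dataset
        "sc:license": "https://creativecommons.org/licenses/by/4.0/",
        "sc:citation": "Doe (2024), Toy Flowers (hypothetical citation)",
        # Resource layer: one single file and one set of files.
        "sc:distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "labels.csv",
                "sc:contentUrl": "https://example.org/toy_flowers/labels.csv",
                "sc:encodingFormat": "text/csv",
            },
            {
                "@type": "cr:FileSet",
                "@id": "images",
                "cr:containedIn": {"@id": "images.zip"},  # hypothetical archive
                "cr:includes": "*.jpg",
                "sc:encodingFormat": "image/jpeg",
            },
        ],
        # Structure layer: how the records are laid out across resources.
        "cr:recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "examples",
                "cr:field": [
                    {
                        "@type": "cr:Field",
                        "@id": "examples/label",
                        "cr:dataType": "sc:Text",
                        "cr:source": {
                            "cr:fileObject": {"@id": "labels.csv"},
                            "cr:extract": {"cr:column": "label"},
                        },
                    },
                    {
                        # Semantic layer: a column carrying the train/test/
                        # validation split. How the specification marks a
                        # split field is not reproduced here.
                        "@type": "cr:Field",
                        "@id": "examples/split",
                        "cr:dataType": "sc:Text",
                        "cr:source": {
                            "cr:fileObject": {"@id": "labels.csv"},
                            "cr:extract": {"cr:column": "split"},
                        },
                    },
                ],
            }
        ],
    }


def resource_types(doc: dict) -> list:
    """List the @type of every resource in the distribution."""
    return [resource["@type"] for resource in doc["sc:distribution"]]


if __name__ == "__main__":
    document = build_croissant_sketch()
    print(json.dumps(document, indent=2))
    print(resource_types(document))
```

Published Croissant files typically define shorter aliases in the "@context" so that keys can be written without prefixes; the explicit "sc:" and "cr:" prefixes are used here only to make clear which vocabulary each term is assumed to belong to.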
History
Croissant was shared on arXiv in March 2024 and published in the proceedings of NeurIPS 2024.[1][6][7] It started as a community-driven effort of the MLCommons Croissant Working Group, which includes stakeholder organizations from academia and industry such as Google, the Open Data Institute, Sage Bionetworks and King's College London.[1][8]
Variations of Croissant have been developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets.[9] Other technical extensions, such as support for RDF, soon followed.[10][11]