Apache Arrow
Software framework. From Wikipedia, the free encyclopedia.
Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It defines a standardized column-oriented memory format that can represent both flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[2][3][4][5][6] This reduces or eliminates factors that limit the feasibility of working with large data sets, such as the cost, volatility, or physical constraints of dynamic random-access memory.[7]
Interoperability
Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas, and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python (PyArrow[8]), R, Ruby, and Rust. Arrow allows zero-copy reads and fast data access and interchange between these languages and systems without serialization overhead.[2]
Applications
Arrow has been used in diverse domains, including analytics,[9] genomics,[10][7] and cloud computing.[11]
Comparison to Apache Parquet and ORC
Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.[12] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[13] The Arrow and Parquet projects include libraries that allow reading and writing data between the two formats.[14]
Governance
Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[15] with development led by a coalition of developers from other open-source data analytics projects.[16][17][6][18][19] The initial codebase and Java library were seeded by code from Apache Drill.[15]