Exploring Data Formats
JSON - key-value pairs
pros:
- readability
- serialize / deserialize tools
- some typing (string, number, boolean, null)
cons:
- hard to split for parallel processing (a single multi-line document must be parsed as a whole)
- repeated keys for every record => wasted space
- not fully typed (no native date or binary type)
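The cost of repeated keys and the limits of JSON typing can be estimated with a minimal sketch like the one below. The sample records and the "keys once, then rows" layout are illustrative assumptions, not a standard format.

```python
import json

# Hypothetical sample records; field names are made up for illustration.
records = [{"user_id": i, "country": "US", "active": True} for i in range(1000)]

# Standard JSON: every record repeats all three keys.
as_objects = json.dumps(records)

# Same data with the keys written once, followed by rows of values.
keys = list(records[0])
as_rows = json.dumps({"keys": keys, "rows": [[r[k] for k in keys] for r in records]})

# Compare the serialized sizes in characters.
print(len(as_objects), len(as_rows))

# Typing is only partial: a tuple does not survive a round trip as a tuple.
round_trip = json.loads(json.dumps({"point": (1, 2)}))
print(type(round_trip["point"]).__name__)
```

The exact saving depends on key length and record count, so the numbers printed here only indicate the direction of the effect.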
CSV
pros:
- readability
cons:
- hard to split safely when quoted fields contain line breaks
- no types
- requires preprocessing to clean up unwanted strings
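A minimal sketch of what "no types" and "pre-process" mean in practice, using only the standard `csv` module. The sample content, the column names, and the `parse_price` helper are hypothetical.

```python
import csv
import io

# Hypothetical CSV content; column names are made up for illustration.
raw = 'id,price,city\n1, 9.50 ,Paris\n2,N/A,"Lyon, FR"\n'

def parse_price(text):
    """Return a float, or None for blank or placeholder values."""
    text = text.strip()
    if text in ("", "N/A"):
        return None
    return float(text)

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # Every value arrives as str; types must be restored by hand.
    rows.append({
        "id": int(row["id"]),
        "price": parse_price(row["price"]),
        "city": row["city"].strip(),
    })

print(rows)
```

Which placeholder strings count as missing (`N/A` here) is a per-dataset decision, which is why this cleanup step cannot be avoided with CSV.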
Avro
pros:
- binary
- typed
- schema attached
- fast to write
- schema evolution fully supported
cons:
- slow lookup (row-oriented: whole records must be read)
- slow aggregation (every field is read even if only one is needed)
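The idea behind schema evolution can be sketched in plain Python: a reader with a newer schema fills in defaults for fields that older records lack and ignores fields it does not know. This is a toy model only. It does not use an Avro library, and `reader_schema`, `NO_DEFAULT`, and `resolve` are hypothetical names.

```python
# Reader schema in a simplified, hypothetical form: field name -> default value.
# NO_DEFAULT marks fields that must be present in the written record.
NO_DEFAULT = object()
reader_schema = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": "unknown"}

def resolve(record, schema):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for field, default in schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not NO_DEFAULT:
            # Field added after the record was written: fall back to its default.
            resolved[field] = default
        else:
            raise ValueError(f"missing field without default: {field}")
    # Fields the reader does not know about (e.g. legacy_flag) are dropped.
    return resolved

old_record = {"id": 1, "name": "Ada", "legacy_flag": True}
print(resolve(old_record, reader_schema))
```

Because the schema travels with the data, a real reader can perform this kind of resolution without out-of-band agreement between writer and reader.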
Parquet
pros:
- binary
- typed
- schema attached
- supports predicate and projection pushdown
- fast lookup
- fast aggregation
- compresses well
cons:
- slow to write
- slow to update and delete
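Why a column-oriented layout helps aggregation and pushdown can be sketched in plain Python. This is a toy model of the layout only, not the Parquet file format or any library API. The records, the chunking, and the threshold are made up for illustration.

```python
# Hypothetical records; field names are made up for illustration.
rows = [
    {"id": 1, "city": "Paris", "amount": 10.0},
    {"id": 2, "city": "Lyon", "amount": 5.0},
    {"id": 3, "city": "Paris", "amount": 7.5},
]

# Column-oriented layout: one list per field.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Row layout: every whole record is visited to sum one field.
total_from_rows = sum(r["amount"] for r in rows)

# Column layout: only the "amount" column is read (projection).
total_from_columns = sum(columns["amount"])

# Predicate pushdown idea: per-chunk min/max statistics let a reader
# skip chunks that cannot contain a matching value.
chunks = [[10.0, 5.0], [7.5]]
stats = [(min(c), max(c)) for c in chunks]
matching = [c for c, (low, high) in zip(chunks, stats) if high > 8.0]

print(total_from_rows, total_from_columns, matching)
```

The same sketch hints at the write-side cost: records arrive row by row and must be pivoted into columns first, and changing one record means touching every column.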