Build a datalake on top of BigQuery

Google BigQuery is a very powerful, serverless data warehosue that lets you ingest unlimited data on a pay-per-use basis (storage + querying). The primary advantage of data warehouses is the ability to quickly query and analyze immense amounts of structured data. Modern data warehouses support new, unstructured data types such as JSON, Avro, and so on, which makes these data warehouses a great contender for data lakes. BigQuery recently added native JSON column type, which we can leverage for our semi-structured lakehouse. With JSON support we can skip the traditional datalakes that are just blob storage with files, from classic Hadoop + hive through GCS, S3, Athena etc. Maintaining such infrastructure and understanding the low-level components, such as metastore, ORC, and Parquet is an expensive process for most startups, financially and with regard to domain expertise. ...

March 14, 2022 · 4 min · Or Elimelech