YARN and Hive in Layman's Terms
About YARN
YARN manages resources for all the nodes in its cluster. When Hive, Spark, or another application submits a job, the request has to specify how much RAM and how many cores it needs.
- YARN ResourceManager: Master
- YARN NodeManager: Slaves (one per worker node)
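To make the "how much RAM and how many cores" idea concrete, here is a minimal sketch of how many identical requests could fit on one node. The function name and the numbers are hypothetical, and it assumes both memory and cores are enforced, which depends on how the scheduler is configured.

```python
def containers_per_node(node_memory_mb, node_cores, request_memory_mb, request_cores):
    """How many identical requests fit on one node (toy model, not YARN code)."""
    # The scarcer of the two resources decides the answer.
    by_memory = node_memory_mb // request_memory_mb
    by_cores = node_cores // request_cores
    return min(by_memory, by_cores)


if __name__ == "__main__":
    # Hypothetical node: 64 GB RAM and 16 cores; each request asks for 4 GB and 2 cores.
    print(containers_per_node(65536, 16, 4096, 2))
```

In this made-up example memory alone would allow 16 requests, but the cores run out after 8, so cores are the limiting resource.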
CapacityScheduler and Fair Scheduler
When multiple organizations share the Hadoop cluster's resources, CapacityScheduler helps guarantee each of them the capacity it needs to meet its SLA. It is a stringent set of limits implemented through queues.
In contrast, Fair Scheduler ensures that all applications get, on average, an equal share of resources over time. It can make decisions based on memory only (the default) or on memory and CPU together.
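The difference between the two schedulers can be sketched with a toy model of memory only. This is a simplification under stated assumptions, not how YARN is implemented: the function names, queue names, and numbers are hypothetical, and the real schedulers have many more knobs (weights, minimum shares, preemption, and so on) that are ignored here.

```python
def capacity_shares(total_memory_gb, queue_percentages):
    """Guaranteed memory per queue when each queue owns a fixed percentage."""
    return {queue: total_memory_gb * pct / 100
            for queue, pct in queue_percentages.items()}


def fair_shares(total_memory_gb, demands):
    """Split memory equally among applications (max-min fairness, toy version).

    An application never receives more than it asks for; whatever it leaves
    unused is split equally among the applications that still want more.
    """
    shares = {app: 0.0 for app in demands}
    remaining = float(total_memory_gb)
    active = set(demands)
    while active and remaining > 1e-9:
        equal = remaining / len(active)
        # Applications whose outstanding demand fits inside an equal split.
        satisfied = {a for a in active if demands[a] - shares[a] <= equal}
        if not satisfied:
            # Everyone wants more than an equal split: hand out equal parts.
            for a in active:
                shares[a] += equal
            break
        for a in satisfied:
            remaining -= demands[a] - shares[a]
            shares[a] = float(demands[a])
        active -= satisfied
    return shares


if __name__ == "__main__":
    # Hypothetical 100 GB cluster.
    print(capacity_shares(100, {"org_a": 60, "org_b": 40}))
    print(fair_shares(100, {"small_job": 10, "big_job_1": 60, "big_job_2": 60}))
```

In the fair-share example the small job gets everything it asks for, and the two large jobs split what is left equally.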
DynamicAllocation is a mechanism that adjusts resources based on workload. Executors are added through rounds of requests, and each round asks for exponentially more executors than the previous one. This ensures that, when needed, the number of executors can grow quickly. Executors are removed after they have been idle for x seconds.
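The ramp-up and the idle removal can be simulated in a few lines. This is a sketch that assumes the text refers to Spark's dynamic allocation, where the request size doubles each round (1, 2, 4, 8, ...). The function names and numbers are hypothetical; in Spark the corresponding limits are set through properties such as `spark.dynamicAllocation.maxExecutors` and `spark.dynamicAllocation.executorIdleTimeout`.

```python
def ramp_up(target_executors, max_executors, current=0):
    """Executor count after each request round, doubling the request each time."""
    history = []
    to_add = 1
    ceiling = min(target_executors, max_executors)
    while current < ceiling:
        # Never go past what the workload needs or what the cap allows.
        current = min(current + to_add, ceiling)
        history.append(current)
        to_add *= 2
    return history


def idle_executors(idle_seconds, idle_timeout_s):
    """Names of executors that have been idle longer than the timeout."""
    return sorted(name for name, idle in idle_seconds.items()
                  if idle > idle_timeout_s)


if __name__ == "__main__":
    print(ramp_up(target_executors=20, max_executors=100))
    print(idle_executors({"exec-1": 5, "exec-2": 90}, idle_timeout_s=60))
```

With a target of 20 executors, the simulation gets there in five rounds instead of twenty, which is the point of growing the request exponentially.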
About Hive
Metastore is a relational database that keeps track of which tables exist, what columns they have, and where their data is located.
HiveServer2 is built on Apache Thrift. It parses the SQL query and translates it into a MapReduce job that runs on the YARN cluster (or a Tez or Spark job, depending on the configured execution engine).
External Hive tables are tables whose content is managed by other applications, such as Spark. Dropping an external table does not affect the data, because the data lives in files on HDFS and Hive only removes its own metadata. By contrast, a managed table is managed entirely by Hive: dropping a managed table wipes out the data at the same time.
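The difference between dropping a managed table and dropping an external one can be illustrated with a toy model. This is not Hive code: `ToyMetastore` and the table names are hypothetical, and a temporary local directory stands in for HDFS.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Hypothetical stand-in for the Metastore: table name -> metadata."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external):
        self.tables[name] = {"location": location, "external": external}

    def drop_table(self, name):
        # The metadata is always removed.
        table = self.tables.pop(name)
        # The data files are removed only for managed tables.
        if not table["external"]:
            shutil.rmtree(table["location"], ignore_errors=True)


def demo():
    root = tempfile.mkdtemp()  # local folder playing the role of HDFS
    try:
        store = ToyMetastore()
        for name, external in (("managed_sales", False), ("external_logs", True)):
            location = os.path.join(root, name)
            os.makedirs(location)
            with open(os.path.join(location, "part-00000"), "w") as f:
                f.write("some,rows\n")
            store.create_table(name, location, external)
        store.drop_table("managed_sales")
        store.drop_table("external_logs")
        return {
            "managed_data_exists": os.path.exists(os.path.join(root, "managed_sales")),
            "external_data_exists": os.path.exists(os.path.join(root, "external_logs")),
            "tables_left": sorted(store.tables),
        }
    finally:
        shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    print(demo())
```

After both drops the metastore is empty, but only the managed table's files are gone; the external table's files are still there for whichever application owns them.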