Hive is a data warehouse system for
Hadoop that facilitates easy data summarization, ad-hoc queries, and the
analysis of large datasets stored in Hadoop-compatible file systems.
Hive structures data into well-understood database concepts such as
tables, rows, columns, and partitions. It supports primitive types such as
integers, floats, doubles, and strings, as well as complex types:
associative arrays (maps), lists (arrays), and structs. A serialization
and deserialization (SerDe) API is used to move data in and out of tables.
Let’s look at the Hive data models in detail.
Hive Data Models:
The Hive data model contains the following components:
- Databases
- Tables
- Partitions
- Buckets or clusters
Partitions:
Partitioning means dividing a table into
coarse-grained parts based on the value of a partition column such as
‘date’. This makes queries on slices of the data faster, because only the
relevant partitions need to be read.
So, what is the function of a partition?
The partition keys determine how data is stored. Each unique value
of the partition key defines one partition of the table. Partitions
are often named after dates for convenience. The idea is loosely similar to ‘Block
Splitting’ in HDFS.
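To make the idea concrete, here is a small Python sketch of how rows might be grouped into partitions by the value of a partition column. It is only a model of the concept, not Hive's implementation: the table name `page_views`, the column names, and the sample rows are all made up for illustration. The one detail taken from Hive itself is the `column=value` naming of partition directories.

```python
from collections import defaultdict

def partition_rows(rows, partition_col, table_dir="page_views"):
    """Group rows by the value of partition_col, one directory-like key per value."""
    partitions = defaultdict(list)
    for row in rows:
        # Hive names partition directories as <column>=<value>.
        path = f"{table_dir}/{partition_col}={row[partition_col]}"
        # The partition value lives in the path, so it is dropped from the stored row.
        data = {k: v for k, v in row.items() if k != partition_col}
        partitions[path].append(data)
    return dict(partitions)

# Hypothetical sample data.
rows = [
    {"user_id": 1, "url": "/home", "date": "2024-01-01"},
    {"user_id": 2, "url": "/cart", "date": "2024-01-01"},
    {"user_id": 1, "url": "/home", "date": "2024-01-02"},
]

layout = partition_rows(rows, "date")

# A query filtered on a single date only has to read that one partition.
one_day = layout["page_views/date=2024-01-01"]
```

Each distinct date ends up under its own key, which is why a query restricted to one date can skip every other partition.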
Buckets:
Buckets give extra structure to the data
that may be used for more efficient queries. A join of two tables that are
bucketed on the same columns, including the join column, can be
implemented as a map-side join. Bucketing by user ID means we can
quickly evaluate a user-based query by running it on a randomized sample
of the total set of users.
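The following Python sketch illustrates both uses of bucketing described above. It rests on simplifying assumptions: user IDs are integers, the bucket is chosen as the ID modulo a hypothetical bucket count of 4, and the tables and rows are invented for the example. It models the idea only and does not reproduce Hive's actual hashing or join machinery.

```python
NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration

def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    # Simplified assumption: an integer key maps to a bucket by modulo.
    return user_id % num_buckets

def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Distribute rows into num_buckets lists based on their user_id."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row["user_id"], num_buckets)].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of one table only against bucket i of the other."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small in-memory lookup for this bucket only.
        lookup = {}
        for r in right:
            lookup.setdefault(r["user_id"], []).append(r)
        for l in left:
            for r in lookup.get(l["user_id"], []):
                joined.append({**l, **r})
    return joined

# Hypothetical tables, both bucketed on user_id with the same bucket count.
users = bucketize([{"user_id": i, "name": f"user{i}"} for i in range(8)])
orders = bucketize([
    {"user_id": 1, "item": "book"},
    {"user_id": 5, "item": "pen"},
    {"user_id": 6, "item": "lamp"},
])

joined = bucketed_join(users, orders)

# Sampling: reading a single bucket touches roughly 1/NUM_BUCKETS of the users.
sample = users[0]
```

Because matching user IDs always land in the same bucket number in both tables, each bucket pair can be joined independently and held in memory, which is the property a map-side join relies on. The sample is only representative if the IDs are spread evenly across buckets.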