HBase Optimizations

HBase regions - Split

- Hot Spotting: An uneven key-space distribution can funnel a huge number of requests to a single HBase region, overloading the RegionServer process and causing slow response times.

- Pre Regions: If the key distribution is known in advance, it is good practice to pre-split the table into regions when creating it. This spreads the data uniformly across region servers and speeds up access.

- Please note that too many regions can degrade performance too. The number of regions should be balanced against the data volume and the cluster resources available.

- How to create pre-regions: HBase pre-regions can be created in two ways:

o While creating HBase tables
create 'sample_hbase_table','c',{SPLITS=>['a','b']}

o Splitting regions when HBase is online
Regions can be split even after the table is created and data is loaded. (This is the typical case when the distribution of the data is not known beforehand.) There are several ways of doing it.

§ Via HBase Terminal: 'split' command
eg: split 'sample_hbase_table'
eg: split 'sample_hbase_table','c'

§ Via HBase GUI: The HBase master page lists all tables (both catalog and user tables), region servers, dead region servers, backup masters, etc. The URL to the UI is http://<HBase-Master-Name>:60010/master-status (60010 is the default master UI port in older releases such as 0.94; HBase 1.0 and later default to 16010).
The GUI has a link to each HBase table the master knows about.
eg: http://<HBase-Master-Name>:60010/table.jsp?name=sample_hbase_table
This table page has an option to split the regions.
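
Choosing the split keys is the part that depends on the data. The following is a minimal sketch, assuming row keys start with a uniformly distributed two-character hex prefix (for example, a hash prefix). The helper names hex_split_points and shell_create_statement are hypothetical and are not part of HBase; the second one only formats the shell 'create' statement shown earlier.

```python
from typing import List


def hex_split_points(num_regions: int, width: int = 2) -> List[str]:
    """Return num_regions - 1 evenly spaced, zero-padded hex split keys.

    Assumption: row keys begin with a hex prefix of `width` characters
    that is uniformly distributed over the whole hex range.
    """
    if num_regions < 2:
        return []  # a single region needs no split keys
    space = 16 ** width
    if num_regions > space:
        raise ValueError("more regions requested than distinct prefixes")
    return [format(i * space // num_regions, "0{}x".format(width))
            for i in range(1, num_regions)]


def shell_create_statement(table: str, family: str, splits: List[str]) -> str:
    """Format an HBase shell 'create' statement with explicit split keys."""
    quoted = ",".join("'{}'".format(s) for s in splits)
    return "create '{}','{}',{{SPLITS=>[{}]}}".format(table, family, quoted)


if __name__ == "__main__":
    splits = hex_split_points(4)
    print(shell_create_statement("sample_hbase_table", "c", splits))
```

If the real keys are not uniformly distributed, evenly spaced split keys will still leave some regions hotter than others, so the split keys should follow the observed key distribution instead.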

HBase regions - Merge

- There can be scenarios where the number of regions gets out of hand and regions need to be merged. HBase provides a way to merge two regions into one.

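- From the HBase home directory, issue a command like the following, passing the table name and the full names of the two regions to merge. Note that this Merge utility is an offline tool: HBase must be shut down while it runs.

- ./bin/hbase org.apache.hadoop.hbase.util.Merge sample_hbase_table sample_hbase_table,,1447622532624.5c6b185e5aa64c61f5915a4aa1ed96e4. sample_hbase_table,00,1447685933570.dcb1b5c1dedf69bd7bab08e12427f6ae.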

Controlling the requests sent to HBase server by client

- Controlling the number of requests sent by the HBase client to the server is good practice. There can be scenarios where the HBase server is loaded and cannot handle all the requests it receives. The client program should be intelligent enough to delay its requests in such conditions rather than bombarding the region servers with more and more of them. If the client keeps sending requests, they queue up on the server, the wait can exceed the normal timeout, and the client eventually fails with a retries-exhausted exception (for example 'RetriesExhaustedException').
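
One common way for a client to slow down is exponential backoff between retries. The sketch below is illustrative only and does not use the HBase client API: ServerBusyError, RetriesExhausted, backoff_delays and call_with_backoff are hypothetical names, and send_request stands for whatever call the client makes. (The HBase client has its own retry settings, such as hbase.client.retries.number and hbase.client.pause, which should be tuned first.)

```python
import random
import time


class ServerBusyError(Exception):
    """Hypothetical error meaning the server is overloaded."""


class RetriesExhausted(Exception):
    """Hypothetical error raised when every attempt has failed."""


def backoff_delays(count, base=0.1, factor=2.0, max_delay=5.0):
    """Yield `count` delays that grow by `factor`, capped at max_delay."""
    delay = base
    for _ in range(count):
        yield min(delay, max_delay)
        delay *= factor


def call_with_backoff(send_request, attempts=5, base=0.1, factor=2.0,
                      max_delay=5.0, sleep=time.sleep, jitter=random.random):
    """Call send_request, waiting longer after each busy response.

    The wait is scaled by a random factor in [0.5, 1.0) so that many
    clients do not all retry at the same instant.
    """
    delays = list(backoff_delays(attempts - 1, base, factor, max_delay))
    last_error = None
    for attempt in range(attempts):
        try:
            return send_request()
        except ServerBusyError as err:
            last_error = err
            if attempt < attempts - 1:  # no point sleeping after the last try
                sleep(delays[attempt] * (0.5 + 0.5 * jitter()))
    raise RetriesExhausted("gave up after %d attempts" % attempts) from last_error
```

The sleep and jitter parameters are injectable only so the behaviour can be checked without real waiting; production code would leave them at their defaults.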

HBase Write Path

- The write path is how HBase completes PUT/DELETE operations. The path starts at the client, moves to the region server, and eventually ends in the HBase data file known as the HFile. Region servers serve the HBase tables. Tables can be large, so they are partitioned into regions, and each region server handles one or more regions. The client contacts the region servers for all requests. A write received by a region server cannot be applied directly to an HFile, because the data in an HFile is sorted and HFiles are immutable. Writes are therefore held in the 'memstore' until enough data accumulates, and the memstore contents are then written to HDFS.

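- The Write Ahead Log (WAL) is there to prevent data loss if the system crashes. The memstore is held in memory (volatile), so its data would be lost if the server crashed. Each write is therefore also recorded in the WAL, which can be replayed to recover the lost edits.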
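
The role of the WAL can be illustrated with a toy model. This is a simplified sketch and not HBase code: ToyRegion is a hypothetical class, the WAL is a plain in-memory list standing in for a durable log file, and a delete simply removes the row (real HBase writes a delete marker instead).

```python
class ToyRegion:
    """Toy model of a region: a durable log plus a volatile memstore."""

    def __init__(self):
        self.wal = []       # stands in for the durable write-ahead log
        self.memstore = {}  # volatile, lost when the server crashes

    def put(self, row, value):
        # Log the edit first, then apply it to the in-memory store.
        self.wal.append(("put", row, value))
        self.memstore[row] = value

    def delete(self, row):
        self.wal.append(("delete", row, None))
        self.memstore.pop(row, None)

    def crash(self):
        # Simulate a server crash: memory is wiped, the log survives.
        self.memstore = {}

    def recover(self):
        # Replay the log in order to rebuild the in-memory state.
        self.memstore = {}
        for op, row, value in self.wal:
            if op == "put":
                self.memstore[row] = value
            else:
                self.memstore.pop(row, None)


if __name__ == "__main__":
    region = ToyRegion()
    region.put("row1", "a")
    region.put("row2", "b")
    region.crash()
    region.recover()
    print(region.memstore)
```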

HBase Memstore

- When a region server receives a write request, it directs the request to the specific region. The data is written into the memstore first. The memstore is kept in the main memory of the region server.

- The main reason for using a memstore is that the data must be stored in HDFS in sorted order.

- When the memstore reaches its size limit, the data is flushed to an HFile.

- Each memstore flush creates a new HFile for each column family.

- On a read, HBase first checks the memstore for the requested data and then goes to the HFiles.
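
The flush and read behaviour described above can be sketched with a toy store for a single column family. This is an illustration under simplifying assumptions and not HBase code: ToyStore is a hypothetical class, the flush limit counts rows rather than bytes (HBase uses a size limit such as hbase.hregion.memstore.flush.size), and each "HFile" is just a sorted, immutable tuple.

```python
class ToyStore:
    """Toy model of one column family: a memstore plus flushed files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # rows, not bytes (simplified)
        self.memstore = {}
        self.hfiles = []  # each entry: tuple of (row, value) sorted by row

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore out as one new sorted, immutable file.
        if not self.memstore:
            return
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row):
        # The memstore holds the newest data, so check it first.
        if row in self.memstore:
            return self.memstore[row]
        # Then search the flushed files from newest to oldest.
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row:
                    return value
        return None


if __name__ == "__main__":
    store = ToyStore(flush_threshold=3)
    for row, value in [("c", 3), ("a", 1), ("b", 2)]:
        store.put(row, value)
    print(store.hfiles)
```

Because every flush adds another file, a read may have to look in several files; this is why the number of HFiles per store matters for read performance.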

HBase Table Migration

- The use case is replicating a large HBase table from one cluster to another cluster.

- There are multiple ways to accomplish this task:

1. Sequence Files: Back up the existing HBase table with HBase's 'Export' utility, copy the data to the new cluster, and then load the new HBase table with 'Import'. ('Export' generates sequence files, which 'Import' uses to load the data.)

2. HFiles: Generate HFiles from the original HBase table, copy them to the new cluster, and then bulk load them into the new table. (one of the more efficient ways)

3. Copy Command: Use the 'CopyTable' utility to copy the source table directly to the destination table in another cluster.

4. Snapshots: Take a 'snapshot' of the original table and create the new table from it in the destination cluster. (efficient way)

- By default, snapshots are turned off in HBase 0.94. Also, the API for generating HFiles is not available in the 0.94 version.

- Restoring the table was painful because normal writes to HBase follow the write path. Under heavy load this can cause the region server to block writes, and the client then fails with:

'org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException'