Apache HBase Logo

Preface

This is the official reference guide for the HBase version it ships with.

Herein you will find either the definitive documentation on an HBase topic as of its standing when the referenced HBase version shipped, or it will point to the location in Javadoc or JIRA where the pertinent information can be found.

About This Guide

This reference guide is a work in progress. The source for this guide can be found in the src/main/asciidoc directory of the HBase source. This reference guide is marked up using AsciiDoc, from which the finished guide is generated as part of the 'site' build target. Run

mvn site

to generate this documentation. Amendments and improvements to the documentation are welcomed. Click this link to file a new documentation bug against Apache HBase with some values pre-selected.

Contributing to the Documentation

For an overview of AsciiDoc and suggestions to get started contributing to the documentation, see the relevant section later in this documentation.

Heads-up if this is your first foray into the world of distributed computing…​

If this is your first foray into the wonderful world of Distributed Computing, then you are in for some interesting times. First off, distributed systems are hard; making a distributed system hum requires a disparate skillset that spans systems (hardware and software) and networking.

Your cluster’s operation can hiccup because of any of a myriad set of reasons from bugs in HBase itself through misconfigurations — misconfiguration of HBase but also operating system misconfigurations — through to hardware problems whether it be a bug in your network card drivers or an underprovisioned RAM bus (to mention two recent examples of hardware issues that manifested as "HBase is slow"). You will also need to do a recalibration if up to this your computing has been bound to a single box. Here is one good starting point: Fallacies of Distributed Computing.

That said, you are welcome.
It’s a fun place to be.
Yours, the HBase Community.

Reporting Bugs

Please use JIRA to report non-security-related bugs.

To protect existing HBase installations from new vulnerabilities, please do not use JIRA to report security-related bugs. Instead, send your report to the mailing list private@apache.org, which allows anyone to send messages, but restricts who can read them. Someone on that list will contact you to follow up on your report.

Support and Testing Expectations

The phrases /supported/, /not supported/, /tested/, and /not tested/ occur in several places throughout this guide. In the interest of clarity, here is a brief explanation of what is generally meant by these phrases, in the context of HBase.

Commercial technical support for Apache HBase is provided by many Hadoop vendors. This is not the sense in which the term /support/ is used in the context of the Apache HBase project. The Apache HBase team assumes no responsibility for your HBase clusters, your configuration, or your data.
Supported

In the context of Apache HBase, /supported/ means that HBase is designed to work in the way described, and deviation from the defined behavior or functionality should be reported as a bug.

Not Supported

In the context of Apache HBase, /not supported/ means that a use case or use pattern is not expected to work and should be considered an antipattern. If you think this designation should be reconsidered for a given feature or use pattern, file a JIRA or start a discussion on one of the mailing lists.

Tested

In the context of Apache HBase, /tested/ means that a feature is covered by unit or integration tests, and has been proven to work as expected.

Not Tested

In the context of Apache HBase, /not tested/ means that a feature or use pattern may or may not work in a given way, and may or may not corrupt your data or cause operational issues. It is an unknown, and there are no guarantees. If you can provide proof that a feature designated as /not tested/ does work in a given way, please submit the tests and/or the metrics so that other users can gain certainty about such features or use patterns.

Getting Started

1. Introduction

Quickstart will get you up and running on a single-node, standalone instance of HBase.

2. Quick Start - Standalone HBase

This section describes the setup of a single-node standalone HBase. A standalone instance has all HBase daemons — the Master, RegionServers, and ZooKeeper — running in a single JVM persisting to the local filesystem. It is our most basic deploy profile. We will show you how to create a table in HBase using the hbase shell CLI, insert rows into the table, perform put and scan operations against the table, enable or disable the table, and start and stop HBase.

Apart from downloading HBase, this procedure should take less than 10 minutes.

Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to 127.0.1.1, and this will cause problems for you. See Why does HBase care about /etc/hosts? for details.

The following /etc/hosts file works correctly for HBase 0.94.x and earlier, on Ubuntu. Use this as a template if you run into trouble.

127.0.0.1 localhost
127.0.0.1 ubuntu.ubuntu-domain ubuntu

This issue has been fixed in hbase-0.96.0 and beyond.

2.1. JDK Version Requirements

HBase requires that a JDK be installed. See Java for information about supported JDK versions.

2.2. Get Started with HBase

Procedure: Download, Configure, and Start HBase in Standalone Mode
  1. Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This will take you to a mirror of HBase Releases. Click on the folder named stable and then download the binary file that ends in .tar.gz to your local filesystem. Do not download the file ending in src.tar.gz for now.

  2. Extract the downloaded file, and change to the newly-created directory.

    $ tar xzvf hbase-2.0.0-beta-2-bin.tar.gz
    $ cd hbase-2.0.0-beta-2/
  3. You are required to set the JAVA_HOME environment variable before starting HBase. You can set the variable via your operating system’s usual mechanism, but HBase provides a central mechanism, conf/hbase-env.sh. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system. The JAVA_HOME variable should be set to a directory which contains the executable file bin/java. Most modern Linux operating systems provide a mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently switching between versions of executables such as Java. In this case, you can set JAVA_HOME to the directory containing the symbolic link to bin/java, which is usually /usr.

    JAVA_HOME=/usr
  4. Edit conf/hbase-site.xml, which is the main HBase configuration file. At this time, you only need to specify the directory on the local filesystem where HBase and ZooKeeper write data. By default, a new directory is created under /tmp. Many servers are configured to delete the contents of /tmp upon reboot, so you should store the data elsewhere. The following configuration will store HBase’s data in the hbase directory, in the home directory of the user called testuser. Paste the <property> tags beneath the <configuration> tags, which should be empty in a new HBase install.

    Example 1. Example hbase-site.xml for Standalone HBase
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///home/testuser/hbase</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/testuser/zookeeper</value>
      </property>
    </configuration>

    You do not need to create the HBase data directory. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.

    The hbase.rootdir in the above example points to a directory in the local filesystem. The 'file:/' prefix is how we denote local filesystem. To home HBase on an existing instance of HDFS, set the hbase.rootdir to point at a directory up on your instance: e.g. hdfs://namenode.example.org:8020/hbase. For more on this variant, see the section below on Standalone HBase over HDFS.
  5. The bin/start-hbase.sh script is provided as a convenient way to start HBase. Issue the command, and if all goes well, a message is logged to standard output showing that HBase started successfully. You can use the jps command to verify that you have one running process called HMaster. In standalone mode HBase runs all daemons within this single JVM, i.e. the HMaster, a single HRegionServer, and the ZooKeeper daemon. Go to http://localhost:16010 to view the HBase Web UI.

    Java needs to be installed and available. If you get an error indicating that Java is not installed, but it is on your system, perhaps in a non-standard location, edit the conf/hbase-env.sh file and modify the JAVA_HOME setting to point to the directory that contains bin/java on your system.
Procedure: Use HBase For the First Time
  1. Connect to HBase.

    Connect to your running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install. In this example, some usage and version information that is printed when you start HBase Shell has been omitted. The HBase Shell prompt ends with a > character.

    $ ./bin/hbase shell
    hbase(main):001:0>
  2. Display HBase Shell Help Text.

    Type help and press Enter to display some basic usage information for HBase Shell, as well as several example commands. Notice that table names, rows, and columns must all be enclosed in quote characters.

  3. Create a table.

    Use the create command to create a new table. You must specify the table name and the ColumnFamily name.

    hbase(main):001:0> create 'test', 'cf'
    0 row(s) in 0.4170 seconds
    
    => Hbase::Table - test
  4. List Information About your Table

    Use the list command to confirm your table exists.

    hbase(main):002:0> list 'test'
    TABLE
    test
    1 row(s) in 0.0180 seconds
    
    => ["test"]
  5. Put data into your table.

    To put data into your table, use the put command.

    hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
    0 row(s) in 0.0850 seconds
    
    hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
    0 row(s) in 0.0110 seconds
    
    hbase(main):005:0> put 'test', 'row3', 'cf:c', 'value3'
    0 row(s) in 0.0100 seconds

    Here, we insert three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in HBase are comprised of a column family prefix, cf in this example, followed by a colon and then a column qualifier suffix, a in this case.

  6. Scan the table for all data at once.

    One of the ways to get data from HBase is to scan. Use the scan command to scan the table for data. You can limit your scan, but for now, all data is fetched.

    hbase(main):006:0> scan 'test'
    ROW                                      COLUMN+CELL
     row1                                    column=cf:a, timestamp=1421762485768, value=value1
     row2                                    column=cf:b, timestamp=1421762491785, value=value2
     row3                                    column=cf:c, timestamp=1421762496210, value=value3
    3 row(s) in 0.0230 seconds
  7. Get a single row of data.

    To get a single row of data at a time, use the get command.

    hbase(main):007:0> get 'test', 'row1'
    COLUMN                                   CELL
     cf:a                                    timestamp=1421762485768, value=value1
    1 row(s) in 0.0350 seconds
  8. Disable a table.

    If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the disable command. You can re-enable it using the enable command.

    hbase(main):008:0> disable 'test'
    0 row(s) in 1.1820 seconds
    
    hbase(main):009:0> enable 'test'
    0 row(s) in 0.1770 seconds

    Disable the table again if you tested the enable command above:

    hbase(main):010:0> disable 'test'
    0 row(s) in 1.1820 seconds
  9. Drop the table.

    To drop (delete) a table, use the drop command.

    hbase(main):011:0> drop 'test'
    0 row(s) in 0.1370 seconds
  10. Exit the HBase Shell.

    To exit the HBase Shell and disconnect from your cluster, use the quit command. HBase is still running in the background.

Procedure: Stop HBase
  1. In the same way that the bin/start-hbase.sh script is provided to conveniently start all HBase daemons, the bin/stop-hbase.sh script stops them.

    $ ./bin/stop-hbase.sh
    stopping hbase....................
    $
  2. After issuing the command, it can take several minutes for the processes to shut down. Use the jps command to be sure that the HMaster and HRegionServer processes are shut down.

The above has shown you how to start and stop a standalone instance of HBase. In the next sections we give a quick overview of other modes of hbase deploy.

2.3. Pseudo-Distributed Local Install

After working your way through quickstart standalone mode, you can re-configure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process: in standalone mode all daemons ran in one jvm process/instance. By default, unless you configure the hbase.rootdir property as described in quickstart, your data is still stored in /tmp/. In this walk-through, we store your data in HDFS instead, assuming you have HDFS available. You can skip the HDFS configuration to continue storing your data in the local filesystem.

Hadoop Configuration

This procedure assumes that you have configured Hadoop and HDFS on your local system and/or a remote system, and that they are running and available. It also assumes you are using Hadoop 2. The guide on Setting up a Single Node Cluster in the Hadoop documentation is a good starting point.

  1. Stop HBase if it is running.

    If you have just finished quickstart and HBase is still running, stop it. This procedure will create a totally new directory where HBase will store its data, so any databases you created before will be lost.

  2. Configure HBase.

    Edit the hbase-site.xml configuration. First, add the following property which directs HBase to run in distributed mode, with one JVM instance per daemon.

    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>

    Next, change the hbase.rootdir from the local filesystem to the address of your HDFS instance, using the hdfs:// URI syntax. In this example, HDFS is running on the localhost at port 8020.

    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:8020/hbase</value>
    </property>

    You do not need to create the directory in HDFS. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.

  3. Start HBase.

    Use the bin/start-hbase.sh command to start HBase. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running.

  4. Check the HBase directory in HDFS.

    If everything worked correctly, HBase created its directory in HDFS. In the configuration above, it is stored in /hbase/ on HDFS. You can use the hadoop fs command in Hadoop’s bin/ directory to list this directory.

    $ ./bin/hadoop fs -ls /hbase
    Found 7 items
    drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/.tmp
    drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/WALs
    drwxr-xr-x   - hbase users          0 2014-06-25 18:48 /hbase/corrupt
    drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/data
    -rw-r--r--   3 hbase users         42 2014-06-25 18:41 /hbase/hbase.id
    -rw-r--r--   3 hbase users          7 2014-06-25 18:41 /hbase/hbase.version
    drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/oldWALs
  5. Create a table and populate it with data.

    You can use the HBase Shell to create a table, populate it with data, scan and get values from it, using the same procedure as in shell exercises.

  6. Start and stop a backup HBase Master (HMaster) server.

    Running multiple HMaster instances on the same hardware does not make sense in a production environment, in the same way that running a pseudo-distributed cluster does not make sense for production. This step is offered for testing and learning purposes only.

    The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which makes 10 total HMasters, counting the primary. To start a backup HMaster, use the local-master-backup.sh. For each backup master you want to start, add a parameter representing the port offset for that master. Each HMaster uses three ports (16010, 16020, and 16030 by default). The port offset is added to these ports, so using an offset of 2, the backup HMaster would use ports 16012, 16022, and 16032. The following command starts 3 backup servers using ports 16012/16022/16032, 16013/16023/16033, and 16015/16025/16035.

    $ ./bin/local-master-backup.sh 2 3 5

    To kill a backup master without killing the entire cluster, you need to find its process ID (PID). The PID is stored in a file with a name like /tmp/hbase-USER-X-master.pid. The only contents of the file is the PID. You can use the kill -9 command to kill that PID. The following command will kill the master with port offset 1, but leave the cluster running:

    $ cat /tmp/hbase-testuser-1-master.pid |xargs kill -9
  7. Start and stop additional RegionServers

    The HRegionServer manages the data in its StoreFiles as directed by the HMaster. Generally, one HRegionServer runs per node in the cluster. Running multiple HRegionServers on the same system can be useful for testing in pseudo-distributed mode. The local-regionservers.sh command allows you to run multiple RegionServers. It works in a similar way to the local-master-backup.sh command, in that each parameter you provide represents the port offset for an instance. Each RegionServer requires two ports, and the default ports are 16020 and 16030. However, the base ports for additional RegionServers are not the default ports since the default ports are used by the HMaster, which is also a RegionServer since HBase version 1.0.0. The base ports are 16200 and 16300 instead. You can run 99 additional RegionServers that are not a HMaster or backup HMaster, on a server. The following command starts four additional RegionServers, running on sequential ports starting at 16202/16302 (base ports 16200/16300 plus 2).

    $ ./bin/local-regionservers.sh start 2 3 4 5

    To stop a RegionServer manually, use the local-regionservers.sh command with the stop parameter and the offset of the server to stop.

    $ ./bin/local-regionservers.sh stop 3
  8. Stop HBase.

    You can stop HBase the same way as in the quickstart procedure, using the bin/stop-hbase.sh command.

2.4. Advanced - Fully Distributed

In reality, you need a fully-distributed configuration to fully test HBase and to use it in real-world scenarios. In a distributed configuration, the cluster contains multiple nodes, each of which runs one or more HBase daemon. These include primary and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.

This advanced quickstart adds two more nodes to your cluster. The architecture will be as follows:

Table 1. Distributed Cluster Demo Architecture
Node Name           Master  ZooKeeper  RegionServer
node-a.example.com  yes     yes        no
node-b.example.com  backup  yes        yes
node-c.example.com  no      yes        yes

This quickstart assumes that each node is a virtual machine and that they are all on the same network. It builds upon the previous quickstart, Pseudo-Distributed Local Install, assuming that the system you configured in that procedure is now node-a. Stop HBase on node-a before continuing.

Be sure that all the nodes have full access to communicate, and that no firewall rules are in place which could prevent them from talking to each other. If you see any errors like no route to host, check your firewall.
Procedure: Configure Passwordless SSH Access

node-a needs to be able to log into node-b and node-c (and to itself) in order to start the daemons. The easiest way to accomplish this is to use the same username on all hosts, and configure password-less SSH login from node-a to each of the others.

  1. On node-a, generate a key pair.

    While logged in as the user who will run HBase, generate an SSH key pair, using the following command:

    $ ssh-keygen -t rsa

    If the command succeeds, the location of the key pair is printed to standard output. The default name of the public key is id_rsa.pub.

  2. Create the directory that will hold the shared keys on the other nodes.

    On node-b and node-c, log in as the HBase user and create a .ssh/ directory in the user’s home directory, if it does not already exist. If it already exists, be aware that it may already contain other keys.

  3. Copy the public key to the other nodes.

    Securely copy the public key from node-a to each of the nodes, by using scp or some other secure means. On each of the other nodes, create a new file called .ssh/authorized_keys if it does not already exist, and append the contents of the id_rsa.pub file to the end of it. Note that you also need to do this for node-a itself.

    $ cat id_rsa.pub >> ~/.ssh/authorized_keys
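
    Where available, the ssh-copy-id utility combines the copy-and-append steps above; the following is a minimal sketch, assuming the same username on every host and the default key location (repeat for node-a itself, as noted above):

    $ ssh-copy-id -i ~/.ssh/id_rsa.pub node-b.example.com
    $ ssh-copy-id -i ~/.ssh/id_rsa.pub node-c.example.com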
  4. Test password-less login.

    If you performed the procedure correctly, you should not be prompted for a password when you SSH from node-a to either of the other nodes using the same username.

  5. Since node-b will run a backup Master, repeat the procedure above, substituting node-b everywhere you see node-a. Be sure not to overwrite your existing .ssh/authorized_keys files, but concatenate the new key onto the existing file using the >> operator rather than the > operator.

Procedure: Prepare node-a

node-a will run your primary master and ZooKeeper processes, but no RegionServers. Stop the RegionServer from starting on node-a.

  1. Edit conf/regionservers and remove the line which contains localhost. Add lines with the hostnames or IP addresses for node-b and node-c.
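
    With the hostnames used in this walk-through, conf/regionservers would then contain just these two lines:

    node-b.example.com
    node-c.example.com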

    Even if you did want to run a RegionServer on node-a, you should refer to it by the hostname the other servers would use to communicate with it. In this case, that would be node-a.example.com. This enables you to distribute the configuration to each node of your cluster without hostname conflicts. Save the file.

  2. Configure HBase to use node-b as a backup master.

    Create a new file in conf/ called backup-masters, and add a new line to it with the hostname for node-b. In this demonstration, the hostname is node-b.example.com.

  3. Configure ZooKeeper

    In reality, you should carefully consider your ZooKeeper configuration. You can find out more about configuring ZooKeeper in zookeeper section. This configuration will direct HBase to start and manage a ZooKeeper instance on each node of the cluster.

    On node-a, edit conf/hbase-site.xml and add the following properties.

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/usr/local/zookeeper</value>
    </property>
  4. Everywhere in your configuration that you have referred to node-a as localhost, change the reference to point to the hostname that the other nodes will use to refer to node-a. In these examples, the hostname is node-a.example.com.

Procedure: Prepare node-b and node-c

node-b will run a backup master server and a ZooKeeper instance.

  1. Download and unpack HBase.

    Download and unpack HBase to node-b, just as you did for the standalone and pseudo-distributed quickstarts.

  2. Copy the configuration files from node-a to node-b and node-c.

    Each node of your cluster needs to have the same configuration information. Copy the contents of the conf/ directory to the conf/ directory on node-b and node-c.

Procedure: Start and Test Your Cluster
  1. Be sure HBase is not running on any node.

    If you forgot to stop HBase from previous testing, you will have errors. Check to see whether HBase is running on any of your nodes by using the jps command. Look for the processes HMaster, HRegionServer, and HQuorumPeer. If they exist, kill them.
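
    A quick way to check, sketched here with standard JDK and shell tools (the process IDs reported will differ on your system):

    $ jps | grep -E 'HMaster|HRegionServer|HQuorumPeer'
    $ kill <pid>    # repeat for each process ID the previous command reports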

  2. Start the cluster.

    On node-a, issue the start-hbase.sh command. Your output will be similar to that below.

    $ bin/start-hbase.sh
    node-c.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-c.example.com.out
    node-a.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-a.example.com.out
    node-b.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-b.example.com.out
    starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-a.example.com.out
    node-c.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-c.example.com.out
    node-b.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-b.example.com.out
    node-b.example.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-nodeb.example.com.out

    ZooKeeper starts first, followed by the master, then the RegionServers, and finally the backup masters.

  3. Verify that the processes are running.

    On each node of the cluster, run the jps command and verify that the correct processes are running on each server. You may see additional Java processes running on your servers as well, if they are used for other purposes.

    Example 2. node-a jps Output
    $ jps
    20355 Jps
    20071 HQuorumPeer
    20137 HMaster
    Example 3. node-b jps Output
    $ jps
    15930 HRegionServer
    16194 Jps
    15838 HQuorumPeer
    16010 HMaster
    Example 4. node-c jps Output
    $ jps
    13901 Jps
    13639 HQuorumPeer
    13737 HRegionServer
    ZooKeeper Process Name

    The HQuorumPeer process is a ZooKeeper instance which is controlled and started by HBase. If you use ZooKeeper this way, it is limited to one instance per cluster node and is appropriate for testing only. If ZooKeeper is run outside of HBase, the process is called QuorumPeer. For more about ZooKeeper configuration, including using an external ZooKeeper instance with HBase, see zookeeper section.

  4. Browse to the Web UI.

    Web UI Port Changes

    In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer.

    If everything is set up correctly, you should be able to connect to the UI for the Master http://node-a.example.com:16010/ or the secondary master at http://node-b.example.com:16010/ using a web browser. If you can connect via localhost but not from another host, check your firewall rules. You can see the web UI for each of the RegionServers at port 16030 of their IP addresses, or by clicking their links in the web UI for the Master.

  5. Test what happens when nodes or services disappear.

    With a three-node cluster such as you have configured, things will not be very resilient. You can still test the behavior of the primary Master or a RegionServer by killing the associated processes and watching the logs.

2.5. Where to go next

The next chapter, configuration, gives more information about the different HBase run modes, system requirements for running HBase, and critical configuration areas for setting up a distributed HBase cluster.

Apache HBase Configuration

This chapter expands upon the Getting Started chapter to further explain configuration of Apache HBase. Please read this chapter carefully, especially the Basic Prerequisites to ensure that your HBase testing and deployment goes smoothly, and prevent data loss. Familiarize yourself with Support and Testing Expectations as well.

3. Configuration Files

Apache HBase uses the same configuration system as Apache Hadoop. All configuration files are located in the conf/ directory, which needs to be kept in sync for each node on your cluster.

HBase Configuration File Descriptions
backup-masters

Not present by default. A plain-text file which lists hosts on which the Master should start a backup Master process, one host per line.

hadoop-metrics2-hbase.properties

Used to connect HBase to Hadoop’s Metrics2 framework. See the Hadoop Wiki entry for more information on Metrics2. Contains only commented-out examples by default.

hbase-env.cmd and hbase-env.sh

Script for Windows and Linux / Unix environments to set up the working environment for HBase, including the location of Java, Java options, and other environment variables. The file contains many commented-out examples to provide guidance.

hbase-policy.xml

The default policy configuration file used by RPC servers to make authorization decisions on client requests. Only used if HBase security is enabled.

hbase-site.xml

The main HBase configuration file. This file specifies configuration options which override HBase’s default configuration. You can view (but do not edit) the default configuration file at docs/hbase-default.xml. You can also view the entire effective configuration for your cluster (defaults and overrides) in the HBase Configuration tab of the HBase Web UI.

log4j.properties

Configuration file for HBase logging via log4j.

regionservers

A plain-text file containing a list of hosts which should run a RegionServer in your HBase cluster. By default this file contains the single entry localhost. It should contain a list of hostnames or IP addresses, one per line, and should only contain localhost if each node in your cluster will run a RegionServer on its localhost interface.

Checking XML Validity

When you edit XML, it is a good idea to use an XML-aware editor to be sure that your syntax is correct and your XML is well-formed. You can also use the xmllint utility to check that your XML is well-formed. By default, xmllint re-flows and prints the XML to standard output. To check for well-formedness and only print output if errors exist, use the command xmllint -noout filename.xml.
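
For example, to check the hbase-site.xml edited in the quickstart (no output means the file is well-formed):

$ xmllint -noout conf/hbase-site.xml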

Keep Configuration In Sync Across the Cluster

When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the contents of the conf/ directory to all nodes of the cluster. HBase will not do this for you. Use rsync, scp, or another secure mechanism for copying the configuration files to your nodes. For most configurations, a restart is needed for servers to pick up changes. Dynamic configuration is an exception to this, to be described later below.
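
A minimal sketch of doing this with rsync, assuming password-less SSH is already set up and HBase is installed at the same path on every node (the hostnames and path below are illustrative):

$ for node in node-b.example.com node-c.example.com; do rsync -az conf/ "$node:/opt/hbase/conf/"; done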

4. Basic Prerequisites

This section lists required services and some required system configuration.

Table 2. Java
HBase Version  JDK 7          JDK 8
2.0            Not Supported  yes
1.3            yes            yes
1.2            yes            yes
1.1            yes            Running with JDK 8 will work but is not well tested.

HBase will neither build nor compile with Java 6.
You must set JAVA_HOME on each node of your cluster. hbase-env.sh provides a handy mechanism to do this.
Operating System Utilities
ssh

HBase uses the Secure Shell (ssh) command and utilities extensively to communicate between cluster nodes. Each server in the cluster must be running ssh so that the Hadoop and HBase daemons can be managed. You must be able to connect to all nodes via SSH, including the local node, from the Master as well as any backup Master, using a shared key rather than a password. You can see the basic methodology for such a set-up in Linux or Unix systems at "Procedure: Configure Passwordless SSH Access". If your cluster nodes use OS X, see the section, SSH: Setting up Remote Desktop and Enabling Self-Login on the Hadoop wiki.

DNS

HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolving must work in versions of HBase previous to 0.92.0. The hadoop-dns-checker tool can be used to verify DNS is working correctly on the cluster. The project README file provides detailed instructions on usage.

Loopback IP

Prior to hbase-0.96.0, HBase only used the IP address 127.0.0.1 to refer to localhost, and this was not configurable. See Loopback IP for more details.

NTP

The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism on your cluster and that all nodes look to the same service for time synchronization. See the Basic NTP Configuration at The Linux Documentation Project (TLDP) to set up NTP.
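
As a rough check, you can look at the clock offset a node currently reports; which command applies depends on the time daemon your distribution ships (each assumes the corresponding package is installed):

$ ntpq -p            # ntpd: list peers and their offsets
$ chronyc tracking   # chrony: show the current system clock offset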

Limits on Number of Files and Processes (ulimit)

Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to 1024 (or 256 on older versions of OS X). You can check this limit on your servers by running the command ulimit -n when logged in as the user which runs HBase. See the Troubleshooting section for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception increateBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

It is recommended to raise the ulimit to at least 10,000, but more likely 10,240, because the value is usually expressed in multiples of 1024. Each ColumnFamily has at least one StoreFile, and possibly more than six StoreFiles if the region is under load. The number of open files required depends upon the number of ColumnFamilies and the number of regions. The following is a rough formula for calculating the potential number of open files on a RegionServer.

Calculate the Potential Number of Open Files
(StoreFiles per ColumnFamily) x (regions per RegionServer)

For example, assuming that a schema had 3 ColumnFamilies per region with an average of 3 StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration files, and others. Opening a file does not take many resources, and the risk of allowing a user to open too many files is minimal.

Another related setting is the number of processes a user is allowed to run at once. In Linux and Unix, the number of processes is set using the ulimit -u command. This should not be confused with the nproc command, which controls the number of CPUs available to a given user. Under load, a ulimit -u that is too low can cause OutOfMemoryError exceptions. See Jack Levin’s major HDFS issues thread on the hbase-users mailing list, from 2011.

Configuring the maximum number of file descriptors and processes for the user who is running the HBase process is an operating system configuration, rather than an HBase configuration. It is also important to be sure that the settings are changed for the user that actually runs HBase. To see which user started HBase, and that user’s ulimit configuration, look at the first line of the HBase log for that instance. A useful read on setting configuration for your Hadoop cluster is Aaron Kimball’s Configuration Parameters: What can you just ignore?

Example 5. ulimit Settings on Ubuntu

To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf, which is a space-delimited file with four columns. Refer to the man page for limits.conf for details about the format of this file. In the following example, the first line sets both soft and hard limits for the number of open files (nofile) to 32768 for the operating system user with the username hadoop. The second line sets the number of processes to 32000 for the same user.

hadoop  -       nofile  32768
hadoop  -       nproc   32000

The settings are only applied if the Pluggable Authentication Module (PAM) environment is directed to use them. To configure PAM to use these limits, be sure that the /etc/pam.d/common-session file contains the following line:

session required  pam_limits.so
Linux Shell

All of the shell scripts that come with HBase rely on the GNU Bash shell.

Windows

Prior to HBase 0.96, running HBase on Microsoft Windows was limited only for testing purposes. Running production systems on Windows machines is not recommended.

4.1. Hadoop

The following table summarizes the versions of Hadoop supported with each version of HBase. Based on the version of HBase, you should select the most appropriate version of Hadoop. You can use Apache Hadoop, or a vendor’s distribution of Hadoop. No distinction is made here. See the Hadoop wiki for information about vendors of Hadoop.

Hadoop 2.x is recommended.

Hadoop 2.x is faster and includes features, such as short-circuit reads, which will help improve your HBase random read profile. Hadoop 2.x also includes important bug fixes that will improve your overall HBase experience. HBase does not support running with earlier versions of Hadoop. See the table below for requirements specific to different HBase versions.

Hadoop 3.x is still in early access releases and has not yet been sufficiently tested by the HBase community for production use cases.

Use the following legend to interpret this table:

Hadoop version support matrix
  • "S" = supported

  • "X" = not supported

  • "NT" = Not tested

                    HBase-1.1.x  HBase-1.2.x  HBase-1.3.x  HBase-2.0.x
Hadoop-2.0.x-alpha  X            X            X            X
Hadoop-2.1.0-beta   X            X            X            X
Hadoop-2.2.0        NT           X            X            X
Hadoop-2.3.x        NT           X            X            X
Hadoop-2.4.x        S            S            S            X
Hadoop-2.5.x        S            S            S            X
Hadoop-2.6.0        X            X            X            X
Hadoop-2.6.1+       NT           S            S            S
Hadoop-2.7.0        X            X            X            X
Hadoop-2.7.1+       NT           S            S            S
Hadoop-2.8.0        X            X            X            X
Hadoop-2.8.1        X            X            X            X
Hadoop-3.0.0        NT           NT           NT           NT

Hadoop Pre-2.6.1 and JDK 1.8 Kerberos

When using pre-2.6.1 Hadoop versions and JDK 1.8 in a Kerberos environment, HBase server can fail and abort due to Kerberos keytab relogin error. Late version of JDK 1.7 (1.7.0_80) has the problem too. Refer to HADOOP-10786 for additional details. Consider upgrading to Hadoop 2.6.1+ in this case.

Hadoop 2.6.x

Hadoop distributions based on the 2.6.x line must have HADOOP-11710 applied if you plan to run HBase on top of an HDFS Encryption Zone. Failure to do so will result in cluster failure and data loss. This patch is present in Apache Hadoop releases 2.6.1+.

Hadoop 2.7.x

Hadoop version 2.7.0 is not tested or supported as the Hadoop PMC has explicitly labeled that release as not being stable. (reference the announcement of Apache Hadoop 2.7.0.)

Hadoop 2.8.x

Hadoop versions 2.8.0 and 2.8.1 are not tested or supported, as the Hadoop PMC has explicitly labeled those releases as not being stable. (Reference the announcement of Apache Hadoop 2.8.0 and the announcement of Apache Hadoop 2.8.1.)

Replace the Hadoop Bundled With HBase!

Because HBase depends on Hadoop, it bundles an instance of the Hadoop jar under its lib directory. The bundled jar is ONLY for use in standalone mode. In distributed mode, it is critical that the version of Hadoop that is out on your cluster matches what is under HBase. Replace the hadoop jar found in the HBase lib directory with the hadoop jar you are running on your cluster to avoid version mismatch issues. Make sure you replace the jar in HBase across your whole cluster. Hadoop version mismatch issues have various manifestations, but often they all look like the cluster is hung.

4.1.1. dfs.datanode.max.transfer.threads

An HDFS DataNode has an upper bound on the number of files that it will serve at any one time. Before doing any loading, make sure you have configured Hadoop’s conf/hdfs-site.xml, setting the dfs.datanode.max.transfer.threads value to at least the following:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>

Be sure to restart your HDFS after making the above configuration.
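
On a small test cluster managed with the scripts bundled with Hadoop 2, the restart might look like the sketch below, assuming HADOOP_HOME points at your Hadoop installation; production clusters usually prefer a rolling restart instead:

$ $HADOOP_HOME/sbin/stop-dfs.sh
$ $HADOOP_HOME/sbin/start-dfs.sh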

Not having this configuration in place makes for strange-looking failures. One manifestation is a complaint about missing blocks. For example:

10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block
          blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes
          contain current block. Will get new block locations from namenode and retry...

See also casestudies.max.transfer.threads and note that this property was previously known as dfs.datanode.max.xcievers (e.g. Hadoop HDFS: Deceived by Xciever).

4.2. ZooKeeper Requirements

ZooKeeper 3.4.x is required. HBase makes use of the multi functionality that is only available since Zookeeper 3.4.0. The hbase.zookeeper.useMulti configuration property defaults to true. Refer to HBASE-12241 (The crash of regionServer when taking deadserver’s replication queue breaks replication) and HBASE-6775 (Use ZK.multi when available for HBASE-6710 0.92/0.94 compatibility fix) for background. The property is deprecated and useMulti is always enabled in HBase 2.0.

5. HBase run modes: Standalone and Distributed

HBase has two run modes: standalone and distributed. Out of the box, HBase runs in standalone mode. Whatever your mode, you will need to configure HBase by editing files in the HBase conf directory. At a minimum, you must edit conf/hbase-env.sh to tell HBase which java to use. In this file you set HBase environment variables such as the heapsize and other options for the JVM, the preferred location for log files, etc. Set JAVA_HOME to point at the root of your java install.
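
A minimal sketch of such settings in conf/hbase-env.sh; the Java path and heap size below are illustrative and depend on your system:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_HEAPSIZE=4G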

5.1. Standalone HBase

This is the default mode. Standalone mode is what is described in the quickstart section. In standalone mode, HBase does not use HDFS — it uses the local filesystem instead — and it runs all HBase daemons and a local ZooKeeper all up in the same JVM. ZooKeeper binds to a well known port so clients may talk to HBase.

5.1.1. Standalone HBase over HDFS

A sometimes useful variation on standalone hbase has all daemons running inside the one JVM but rather than persist to the local filesystem, instead they persist to an HDFS instance.

You might consider this profile when you are intent on a simple deploy profile, the loading is light, but the data must persist across node comings and goings. Writing to HDFS where data is replicated ensures the latter.

To configure this standalone variant, edit your hbase-site.xml setting hbase.rootdir to point at a directory in your HDFS instance but then set hbase.cluster.distributed to false. For example:

要配置这个独立的变体,编辑您的hbase站点。xml设置hbase。在HDFS实例中指向一个目录,然后设置hbase.cluster。分发给假。例如:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>

5.2. Distributed

5.2。分布式

Distributed mode can be subdivided into distributed but all daemons run on a single node — a.k.a. pseudo-distributed — and fully-distributed where the daemons are spread across all nodes in the cluster. The pseudo-distributed vs. fully-distributed nomenclature comes from Hadoop.

分布式模式可以被细分为分布式的,但是所有的守护进程都运行在一个节点上,即a.k.a。伪分布式和完全分布式,守护进程分布在集群中的所有节点上。伪分布式和完全分布的命名来自于Hadoop。

Pseudo-distributed mode can run against the local filesystem or it can run against an instance of the Hadoop Distributed File System (HDFS). Fully-distributed mode can ONLY run on HDFS. See the Hadoop documentation for how to set up HDFS. A good walk-through for setting up HDFS on Hadoop 2 can be found at http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide.

伪分布模式可以针对本地文件系统运行,也可以针对Hadoop分布式文件系统(HDFS)的实例运行。全分布式模式只能在HDFS上运行。参见Hadoop文档了解如何设置HDFS。可以在http://www.alexjf.net/blog/distributedsystems/hadoop -yarn- installationguide中找到一个用于在Hadoop 2上设置HDFS的好方法。

5.2.1. Pseudo-distributed

5.2.1。伪分布

Pseudo-Distributed Quickstart

A quickstart has been added to the quickstart chapter. See quickstart-pseudo. Some of the information that was originally in this section has been moved there.

快速启动已经添加到快速启动章节。看到quickstart-pseudo。本节中最初的一些信息已经被移到了那里。

A pseudo-distributed mode is simply a fully-distributed mode run on a single host. Use this HBase configuration for testing and prototyping purposes only. Do not use this configuration for production or for performance evaluation.

伪分布模式是在单个主机上运行的完全分布式模式。仅使用此HBase配置进行测试和原型设计。不要使用此配置用于生产或性能评估。

5.3. Fully-distributed

5.3。全分布

By default, HBase runs in standalone mode. Both standalone mode and pseudo-distributed mode are provided for the purposes of small-scale testing. For a production environment, distributed mode is advised. In distributed mode, multiple instances of HBase daemons run on multiple servers in the cluster.

默认情况下,HBase以独立模式运行。为小型测试的目的,提供了独立模式和伪分布式模式。对于生产环境,建议采用分布式模式。在分布式模式中,HBase守护进程的多个实例在集群中的多个服务器上运行。

Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the hbase.cluster.distributed property to true. Typically, the hbase.rootdir is configured to point to a highly-available HDFS filesystem.

就像在伪分布模式中一样,一个完全分布式的配置要求您设置hbase.cluster。分布式属性为true。通常,hbase。rootdir配置为指向高可用的HDFS文件系统。

In addition, the cluster is configured so that multiple cluster nodes enlist as RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers. These configuration basics are all demonstrated in quickstart-fully-distributed.

此外,还配置了集群,以使多个集群节点作为区域服务器、ZooKeeper quorumpeer和备份的HMaster服务器。这些配置基础都在quickstart-full -distributed中演示。

Distributed RegionServers

Typically, your cluster will contain multiple RegionServers all running on different servers, as well as primary and backup Master and ZooKeeper daemons. The conf/regionservers file on the master server contains a list of hosts whose RegionServers are associated with this cluster. Each host is on a separate line. All hosts listed in this file will have their RegionServer processes started and stopped when the master server starts or stops.

通常,您的集群将包含多个在不同服务器上运行的区域服务器,以及主服务器和备份主机和ZooKeeper守护进程。主服务器上的conf/区域服务器文件包含一个主机列表,其区域服务器与此集群关联。每个主机都在一个单独的行上。在该文件中列出的所有主机将在主服务器启动或停止时启动并停止区域服务器进程。

ZooKeeper and HBase

See the ZooKeeper section for ZooKeeper setup instructions for HBase.

请参见ZooKeeper部分的ZooKeeper设置说明。

Example 6. Example Distributed HBase Cluster

This is a bare-bones conf/hbase-site.xml for a distributed HBase cluster. A cluster that is used for real-world work would contain more custom configuration parameters. Most HBase configuration directives have default values, which are used unless the value is overridden in the hbase-site.xml. See "Configuration Files" for more information.

这是一个裸体的conf/hbase网站。用于分布式HBase集群的xml。用于实际工作的集群将包含更多自定义配置参数。大多数HBase配置指令都有默认值,除非在HBase -site.xml中被覆盖,否则将使用该值。有关更多信息,请参见“配置文件”。

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
  </property>
</configuration>

This is an example conf/regionservers file, which contains a list of nodes that should run a RegionServer in the cluster. These nodes need HBase installed and they need to use the same contents of the conf/ directory as the Master server.

这是一个示例conf/ RegionServer文件,其中包含一个节点列表,该列表应该在集群中运行一个区域服务器。这些节点需要安装HBase,并且它们需要使用conf/目录的相同内容作为主服务器。

node-a.example.com
node-b.example.com
node-c.example.com

This is an example conf/backup-masters file, which contains a list of each node that should run a backup Master instance. The backup Master instances will sit idle unless the main Master becomes unavailable.

这是一个conf/backup-masters文件的示例,其中包含每个节点的列表,该列表应该运行一个备份主实例。除非主主变得不可用,否则备份主实例将处于空闲状态。

node-b.example.com
node-c.example.com
Distributed HBase Quickstart

See quickstart-fully-distributed for a walk-through of a simple three-node cluster configuration with multiple ZooKeeper, backup HMaster, and RegionServer instances.

请参阅quickstart-完全分布式,以完成一个简单的三节点集群配置,其中包含多个ZooKeeper、备份HMaster和区域服务器实例。

Procedure: HDFS Client Configuration
  1. Of note, if you have made HDFS client configuration changes on your Hadoop cluster, such as configuration directives for HDFS clients, as opposed to server-side configurations, you must use one of the following methods to enable HBase to see and use these configuration changes:

    值得注意的是,如果您在Hadoop集群上做了HDFS客户端配置更改,比如HDFS客户机的配置指令,而不是服务器端配置,那么您必须使用以下方法之一来启用HBase查看和使用这些配置更改:

    1. Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.

      在hbase-env.sh中添加一个指向您的hadoop op_conf_dir到HBASE_CLASSPATH环境变量的指针。

    2. Add a copy of hdfs-site.xml (or hadoop-site.xml) or, better, symlinks, under ${HBASE_HOME}/conf, or

      添加一个hdfs-site拷贝。在${HBASE_HOME}/conf下的xml(或hadoop-site.xml)或更好的符号链接。

    3. If only a small set of HDFS client configurations is needed, add them to hbase-site.xml.

      如果只有一小部分HDFS客户机配置,将它们添加到hbase-site.xml中。

An example of such an HDFS client configuration is dfs.replication. If, for example, you want to run with a replication factor of 5, HBase will create files with the default of 3 unless you do the above to make the configuration available to HBase.

这种HDFS客户机配置的一个示例是dfs.replication。例如,如果您想要以5的复制因子运行,HBase将创建默认为3的文件,除非您执行上述操作,以使HBase可用配置。
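
As a minimal sketch of the third option above, the dfs.replication override could be added directly to conf/hbase-site.xml; the value 5 is only an example:

<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>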

6. Running and Confirming Your Installation

6。运行和确认您的安装。

Make sure HDFS is running first. Start and stop the Hadoop HDFS daemons by running bin/start-hdfs.sh over in the HADOOP_HOME directory. You can ensure it started properly by testing the put and get of files into the Hadoop filesystem. HBase does not normally use the MapReduce or YARN daemons. These do not need to be started.

确保HDFS首先运行。通过运行bin/ Start - HDFS启动和停止Hadoop HDFS守护进程。在hadoop - home目录中。您可以通过在Hadoop文件系统中测试put和get来确保它正确地启动。HBase通常不使用MapReduce或纱线守护进程。这些不需要启动。
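
As a quick sketch of such a check (the file and directory names are illustrative), run the following from the HADOOP_HOME directory:

bin/hdfs dfs -mkdir -p /tmp/hbase-check
bin/hdfs dfs -put README.txt /tmp/hbase-check/
bin/hdfs dfs -cat /tmp/hbase-check/README.txt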

If you are managing your own ZooKeeper, start it and confirm it’s running, else HBase will start up ZooKeeper for you as part of its start process.

如果你正在管理你自己的动物管理员,启动它并确认它正在运行,否则HBase将启动你作为它开始进程的一部分的动物管理员。

Start HBase with the following command:

使用以下命令启动HBase:

bin/start-hbase.sh

Run the above from the HBASE_HOME directory.

运行上面的HBASE_HOME目录。

You should now have a running HBase instance. HBase logs can be found in the logs subdirectory. Check them out especially if HBase had trouble starting.

现在应该有一个运行的HBase实例。可以在日志子目录中找到HBase日志。检查他们,特别是如果HBase有麻烦开始。

HBase also puts up a UI listing vital attributes. By default it’s deployed on the Master host at port 16010 (HBase RegionServers listen on port 16020 by default and put up an informational HTTP server at port 16030). If the Master is running on a host named master.example.org on the default port, point your browser at http://master.example.org:16010 to see the web interface.

HBase还设置了一个用于列出重要属性的UI。默认情况下,它被部署在16010端口的主主机上(HBase区域服务器默认监听16020端口,并在16030端口上安装一个信息HTTP服务器)。如果Master在默认端口上运行名为master.example.org的主机,请将浏览器指向http://master.example.org:16010查看web界面。

Once HBase has started, see the shell exercises section for how to create tables, add data, scan your insertions, and finally disable and drop your tables.

一旦HBase启动,请参见shell练习小节,了解如何创建表、添加数据、扫描插入,最后禁用和删除表。

To stop HBase after exiting the HBase shell, enter

在退出HBase shell后停止HBase。

$ ./bin/stop-hbase.sh
stopping hbase...............

Shutdown can take a moment to complete. It can take longer if your cluster is comprised of many machines. If you are running a distributed operation, be sure to wait until HBase has shut down completely before stopping the Hadoop daemons.

关闭可能需要一段时间才能完成。如果您的集群由许多机器组成,则需要更长的时间。如果您正在运行一个分布式操作,一定要等到HBase完全关闭之后才停止Hadoop守护进程。

7. Default Configuration

7所示。默认配置

7.1. hbase-site.xml and hbase-default.xml

7.1。hbase-site。xml和hbase-default.xml

Just as in Hadoop where you add site-specific HDFS configuration to the hdfs-site.xml file, for HBase, site specific customizations go into the file conf/hbase-site.xml. For the list of configurable properties, see hbase default configurations below or view the raw hbase-default.xml source file in the HBase source code at src/main/resources.

就像在Hadoop中,将特定于站点的HDFS配置添加到HDFS站点。对于HBase,特定于站点的定制会进入文件conf/ HBase -site.xml。对于可配置属性的列表,请参见下面的hbase默认配置或查看原始hbase-default。xml源文件在src/main/resources的HBase源代码中。

Not all configuration options make it out to hbase-default.xml. Some configurations would only appear in source code; the only way to identify these changes are through code review.

并不是所有的配置选项都显示为hbase-default.xml。一些配置只出现在源代码中;识别这些变化的唯一方法是通过代码复查。

Currently, changes here will require a cluster restart for HBase to notice the change.

当前,这里的更改将要求HBase重新启动集群以注意更改。

7.2. HBase Default Configuration

7.2。HBase默认配置

The documentation below is generated using the default hbase configuration file, hbase-default.xml, as source.

下面的文档是使用默认的hbase配置文件hbase-default生成的。xml源。

hbase.tmp.dir
Description

Temporary directory on the local filesystem. Change this setting to point to a location more permanent than '/tmp', the usual resolve for java.io.tmpdir, as the '/tmp' directory is cleared on machine restart.

本地文件系统上的临时目录。更改此设置以指向比“/tmp”更持久的位置,即java.io的通常解决方案。tmpdir,作为“/tmp”目录在机器重新启动时被清除。

Default

${java.io.tmpdir}/hbase-${user.name}

$ { java.io.tmpdir } / hbase - $ { user.name }

hbase.rootdir
Description

The directory shared by region servers and into which HBase persists. The URL should be 'fully-qualified' to include the filesystem scheme. For example, to specify the HDFS directory '/hbase' where the HDFS instance’s namenode is running at namenode.example.org on port 9000, set this value to: hdfs://namenode.example.org:9000/hbase. By default, we write to whatever ${hbase.tmp.dir} is set to — usually /tmp — so change this configuration or else all data will be lost on machine restart.

该目录由区域服务器共享,并且HBase继续存在。URL应该是“完全限定的”,以包括文件系统方案。例如,要指定HDFS目录“/hbase”,其中HDFS实例的namenode在namenode.example.org上运行,请将该值设置为:HDFS://namenode.example.org:9000/hbase。默认情况下,我们写入任何${hbase.tmp。设置too - usually /tmp—因此更改此配置,否则将在机器重新启动时丢失所有数据。

Default

${hbase.tmp.dir}/hbase

$ { hbase.tmp.dir } / hbase

hbase.cluster.distributed
Description

The mode the cluster will be in. Possible values are false for standalone mode and true for distributed mode. If false, startup will run all HBase and ZooKeeper daemons together in the one JVM.

集群将进入的模式。可能的值对于独立模式来说是错误的,对于分布式模式是正确的。如果false,启动将在一个JVM中运行所有HBase和ZooKeeper守护进程。

Default

false

hbase.zookeeper.quorum
Description

Comma separated list of servers in the ZooKeeper ensemble (This config. should have been named hbase.zookeeper.ensemble). For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com". By default this is set to localhost for local and pseudo-distributed modes of operation. For a fully-distributed setup, this should be set to a full list of ZooKeeper ensemble servers. If HBASE_MANAGES_ZK is set in hbase-env.sh this is the list of servers which hbase will start/stop ZooKeeper on as part of cluster start/stop. Client-side, we will take this list of ensemble members and put it together with the hbase.zookeeper.property.clientPort config. and pass it into zookeeper constructor as the connectString parameter.

在ZooKeeper组中,逗号分隔的服务器列表(此配置)。应该被命名为hbase.zookeeper.。例如,host1.mydomain.com,host2.mydomain.com,host3.mydomain.com。默认情况下,这将设置为本地和伪分布式操作的本地主机。对于完全分布的设置,应该将其设置为ZooKeeper集成服务器的完整列表。如果HBASE_MANAGES_ZK设置在hbase-env中。这是hbase将在集群开始/停止时启动/停止ZooKeeper的服务器列表。客户端,我们将把这个集合成员的列表和hbase.zookeeper.property一起放在一起。clientPort配置。并将其作为connectString参数传递给zookeeper构造函数。

Default

localhost

本地主机
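
As an illustrative hbase-site.xml override (the host names are examples only), a three-node ensemble together with the client port might be declared as:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>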

zookeeper.recovery.retry.maxsleeptime
Description

Max sleep time in milliseconds before retrying ZooKeeper operations; a maximum is needed here so that the sleep time does not grow unboundedly.

在重新尝试zookeeper操作几毫秒之前,最大的睡眠时间在这里是需要的,这样睡眠时间就不会无限制地增长。

Default

60000

60000年

hbase.local.dir
Description

Directory on the local filesystem to be used as a local storage.

本地文件系统上的目录,用作本地存储。

Default

${hbase.tmp.dir}/local/

$ { hbase.tmp.dir } /地方/

hbase.master.port
Description

The port the HBase Master should bind to.

HBase主机的端口应该绑定到。

Default

16000

16000年

hbase.master.info.port
Description

The port for the HBase Master web UI. Set to -1 if you do not want a UI instance run.

HBase主web UI的端口。如果您不想运行UI实例,则设置为-1。

Default

16010

16010年

hbase.master.info.bindAddress
Description

The bind address for the HBase Master web UI

HBase主web UI的绑定地址。

Default

0.0.0.0

0.0.0.0

hbase.master.logcleaner.plugins
Description

A comma-separated list of BaseLogCleanerDelegate invoked by the LogsCleaner service. These WAL cleaners are called in order, so put the cleaner that prunes the most files in front. To implement your own BaseLogCleanerDelegate, just put it in HBase’s classpath and add the fully qualified class name here. Always add the above default log cleaners in the list.

LogsCleaner服务调用的一个逗号分隔的BaseLogCleanerDelegate列表。这些WAL -清洁工是按顺序被调用的,所以把最前面的文件清除干净。要实现您自己的BaseLogCleanerDelegate,只需将它放在HBase的类路径中,并在这里添加完全限定的类名。总是在列表中添加上面的默认日志清除器。

Default

org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveProcedureWALCleaner

org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveProcedureWALCleaner

hbase.master.logcleaner.ttl
Description

How long a WAL remains in the archive ({hbase.rootdir}/oldWALs) directory, after which it will be cleaned by a Master thread. The value is in milliseconds.

在归档({hbase.rootdir}/oldWALs)目录中,一个WAL保持多长时间,之后将由一个主线程进行清理。这个值是以毫秒为单位的。

Default

600000

600000年

hbase.master.procedurewalcleaner.ttl
Description

How long a Procedure WAL will remain in the archive directory, after which it will be cleaned by a Master thread. The value is in milliseconds.

一个过程会在存档目录中保留多长时间,之后将由一个主线程进行清理。这个值是以毫秒为单位的。

Default

604800000

604800000

hbase.master.hfilecleaner.plugins
Description

A comma-separated list of BaseHFileCleanerDelegate invoked by the HFileCleaner service. These HFiles cleaners are called in order, so put the cleaner that prunes the most files in front. To implement your own BaseHFileCleanerDelegate, just put it in HBase’s classpath and add the fully qualified class name here. Always add the above default log cleaners in the list as they will be overwritten in hbase-site.xml.

由HFileCleaner服务调用的以逗号分隔的BaseHFileCleanerDelegate列表。这些HFiles清洗器是按顺序被调用的,所以请将最前面的文件清除干净。要实现您自己的BaseHFileCleanerDelegate,只需将它放在HBase的类路径中,并在这里添加完全限定的类名。总是在列表中添加上面的默认日志清除器,因为它们将被覆盖在hbase-site.xml中。

Default

org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner

org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner

hbase.master.infoserver.redirect
Description

Whether or not the Master listens to the Master web UI port (hbase.master.info.port) and redirects requests to the web UI server shared by the Master and RegionServer. Config. makes sense when Master is serving Regions (not the default).

无论主服务器是否侦听主web UI端口(hbase.master.info.port),并将请求重定向到主服务器和区域服务器共享的web UI服务器。配置。当Master是服务区域(而不是默认区域)时,这是有意义的。

Default

true

真正的

hbase.master.fileSplitTimeout
Description

Splitting a region, how long to wait on the file-splitting step before aborting the attempt. Default: 600000. This setting used to be known as hbase.regionserver.fileSplitTimeout in hbase-1.x. Split is now run master-side, hence the rename (if a 'hbase.master.fileSplitTimeout' setting is found, it will be used to prime the current 'hbase.master.fileSplitTimeout' configuration).

分割一个区域,在中止尝试之前等待文件拆分步骤需要多长时间。默认值:600000。该设置以前称为hbase.区域性服务器。在hbase fileSplitTimeout - 1. x。Split现在运行主端,因此重命名(如果是“hbase.master”。fileSplitTimeout的设置,将使用它来启动当前的hbase.master。fileSplitTimeout的配置。

Default

600000

600000年

hbase.regionserver.port
Description

The port the HBase RegionServer binds to.

HBase区域服务器绑定到的端口。

Default

16020

16020年

hbase.regionserver.info.port
Description

The port for the HBase RegionServer web UI. Set to -1 if you do not want the RegionServer UI to run.

如果您不希望区域服务器UI运行,那么HBase区域服务器web UI的端口设置为-1。

Default

16030

16030年

hbase.regionserver.info.bindAddress
Description

The address for the HBase RegionServer web UI

HBase区域服务器web UI的地址。

Default

0.0.0.0

0.0.0.0

hbase.regionserver.info.port.auto
Description

Whether or not the Master or RegionServer UI should search for a port to bind to. Enables automatic port search if hbase.regionserver.info.port is already in use. Useful for testing, turned off by default.

主服务器或区域服务器UI是否应该搜索一个端口来绑定。支持自动端口搜索,如果hbase.org .info.port已经在使用中。用于测试,默认关闭。

Default

false

hbase.regionserver.handler.count
Description

Count of RPC Listener instances spun up on RegionServers. The same property is used by the Master for the count of master handlers. Too many handlers can be counter-productive. Make it a multiple of CPU count. If mostly read-only, a handler count close to the CPU count does well. Start with twice the CPU count and tune from there.

在区域服务器上旋转的RPC侦听器实例的计数。主处理程序的主机使用相同的属性。太多的处理程序可能会适得其反。使它成为一个多CPU计数。如果大多数都是只读的,那么处理程序数接近于cpu的数就很好了。从CPU数量的两倍开始。

Default

30

30.
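
For example, on a hypothetical 16-core RegionServer you might start at roughly twice the CPU count and tune from there; this hbase-site.xml sketch is illustrative, not a recommendation for every workload:

<property>
  <name>hbase.regionserver.handler.count</name>
  <!-- illustrative: roughly 2 x 16 cores -->
  <value>32</value>
</property>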

hbase.ipc.server.callqueue.handler.factor
Description

Factor to determine the number of call queues. A value of 0 means a single queue shared between all the handlers. A value of 1 means that each handler has its own queue.

决定调用队列数量的因素。值0表示所有处理程序之间共享一个队列。值1意味着每个处理程序都有自己的队列。

Default

0.1

0.1

hbase.ipc.server.callqueue.read.ratio
Description

Split the call queues into read and write queues. The specified interval (which should be between 0.0 and 1.0) will be multiplied by the number of call queues. A value of 0 indicates to not split the call queues, meaning that both read and write requests will be pushed to the same set of queues. A value lower than 0.5 means that there will be fewer read queues than write queues. A value of 0.5 means there will be the same number of read and write queues. A value greater than 0.5 means that there will be more read queues than write queues. A value of 1.0 means that all the queues except one are used to dispatch read requests.

Example: given a total of 10 call queues:
a read.ratio of 0 means that the 10 queues will contain both read and write requests.
a read.ratio of 0.3 means that 3 queues will contain only read requests and 7 queues will contain only write requests.
a read.ratio of 0.5 means that 5 queues will contain only read requests and 5 queues will contain only write requests.
a read.ratio of 0.8 means that 8 queues will contain only read requests and 2 queues will contain only write requests.
a read.ratio of 1 means that 9 queues will contain only read requests and 1 queue will contain only write requests.

将调用队列拆分为读和写队列。指定的间隔(应该在0.0到1.0之间)将被调用队列的数量乘以。值0表示不拆分调用队列,这意味着读和写请求将被推到同一组队列。低于0.5的值意味着要比写队列少读取队列。值0.5意味着会有相同数量的读写队列。大于0.5的值意味着会有更多的读取队列,而不是写队列。值为1.0意味着除了一个队列以外的所有队列都用于分派读取请求。示例:给定读取队列的总数为10。0的比率意味着:10个队列将包含读/写请求。一个阅读。0.3的比率意味着:3个队列只包含读请求,7个队列只包含写请求。一个阅读。比率0.5表示:5个队列只包含读请求,5个队列只包含写请求。一个阅读。0.8表示:8个队列只包含读请求,2个队列只包含写请求。一个阅读。比率1表示:9个队列只包含读请求,1个队列只包含写请求。

Default

0

0

hbase.ipc.server.callqueue.scan.ratio
Description

Given the number of read call queues, calculated from the total number of call queues multiplied by the callqueue.read.ratio, the scan.ratio property will split the read call queues into small-read and long-read queues. A value lower than 0.5 means that there will be fewer long-read queues than short-read queues. A value of 0.5 means that there will be the same number of short-read and long-read queues. A value greater than 0.5 means that there will be more long-read queues than short-read queues. A value of 0 or 1 indicates to use the same set of queues for gets and scans.

Example: given a total of 8 read call queues:
a scan.ratio of 0 or 1 means that 8 queues will contain both long and short read requests.
a scan.ratio of 0.3 means that 2 queues will contain only long-read requests and 6 queues will contain only short-read requests.
a scan.ratio of 0.5 means that 4 queues will contain only long-read requests and 4 queues will contain only short-read requests.
a scan.ratio of 0.8 means that 6 queues will contain only long-read requests and 2 queues will contain only short-read requests.

给定读调用队列的数量,计算从调用队列的总数乘以callqueue。read。比,扫描。ratio属性将read调用队列拆分为小读和长读队列。低于0.5的值意味着,长读队列比短读队列少。值为0.5意味着有相同数量的短读和长读队列。一个大于0.5的值意味着,将会有比短读队列更多的长读队列,一个0或1的值表示使用相同的队列来获取和扫描。示例:给定读取调用队列的总数为8次扫描。0或1的比率意味着:8个队列将包含长时间和短读请求。扫描。0.3的比率意味着:2个队列只包含长读请求,6个队列只包含短读请求。扫描。比率0.5表示:4个队列只包含长读请求,4个队列只包含短读请求。扫描。0.8的比率意味着:6个队列只包含长读请求,2个队列只包含短读请求。

Default

0

0
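
To make the interaction of these three properties concrete, here is a small worked example using assumed, non-default values:

hbase.regionserver.handler.count          = 30
hbase.ipc.server.callqueue.handler.factor = 0.1   ->  30 x 0.1 = 3 call queues
hbase.ipc.server.callqueue.read.ratio     = 0.66  ->  about 2 read queues, 1 write queue
hbase.ipc.server.callqueue.scan.ratio     = 0.5   ->  of the 2 read queues, 1 long-read and 1 short-read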

hbase.regionserver.msginterval
Description

Interval between messages from the RegionServer to Master in milliseconds.

区域服务器之间的消息间隔以毫秒为单位。

Default

3000

3000年

hbase.regionserver.logroll.period
Description

Period at which we will roll the commit log regardless of how many edits it has.

在此期间,我们将滚动提交日志,而不管它有多少编辑器。

Default

3600000

3600000

hbase.regionserver.logroll.errors.tolerated
Description

The number of consecutive WAL close errors we will allow before triggering a server abort. A setting of 0 will cause the region server to abort if closing the current WAL writer fails during log rolling. Even a small value (2 or 3) will allow a region server to ride over transient HDFS errors.

在触发服务器中止之前,我们将允许连续的WAL - close错误数。如果在日志滚动期间关闭当前的WAL writer失败,那么设置0将导致该区域服务器中止。即使是很小的值(2或3),也会允许一个区域服务器通过短暂的HDFS错误。

Default

2

2

hbase.regionserver.hlog.reader.impl
Description

The WAL file reader implementation.

WAL - file阅读器实现。

Default

org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader

org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader

hbase.regionserver.hlog.writer.impl
Description

The WAL file writer implementation.

WAL - file writer的实现。

Default

org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter

org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter

hbase.regionserver.global.memstore.size
Description

Maximum size of all memstores in a region server before new updates are blocked and flushes are forced. Defaults to 40% of heap (0.4). Updates are blocked and flushes are forced until size of all memstores in a region server hits hbase.regionserver.global.memstore.size.lower.limit. The default value in this configuration has been intentionally left empty in order to honor the old hbase.regionserver.global.memstore.upperLimit property if present.

在新更新被阻塞和刷新之前,区域服务器中所有memstore的最大大小。默认值为堆的40%(0.4)。更新被阻塞和刷新,直到一个区域服务器的所有memstore的大小到达hbase.regionserver. global.memstore.lower .limit。这个配置中的默认值是为了纪念旧的hbase.org .global.memstore而故意留下的。如果存在upperLimit属性。

Default

none

没有一个

hbase.regionserver.global.memstore.size.lower.limit
Description

Maximum size of all memstores in a region server before flushes are forced. Defaults to 95% of hbase.regionserver.global.memstore.size (0.95). A 100% value for this value causes the minimum possible flushing to occur when updates are blocked due to memstore limiting. The default value in this configuration has been intentionally left empty in order to honor the old hbase.regionserver.global.memstore.lowerLimit property if present.

在刷新之前,区域服务器中所有memstore的最大大小都是强制的。默认值为95%的hbase.org .global.memstore。大小(0.95)。此值的100%值会导致当更新由于memstore限制而阻塞时可能发生的最小刷新。这个配置中的默认值是为了纪念旧的hbase.org .global.memstore而故意留下的。如果存在lowerLimit属性。

Default

none

没有一个

hbase.systemtables.compacting.memstore.type
Description

Determines the type of memstore to be used for system tables like META, namespace tables, etc. By default NONE is the type, and hence we use the default memstore for all the system tables. If we need to use a compacting memstore for system tables, then set this property to BASIC/EAGER.

确定用于系统表的memstore类型,如元、名称空间表等。缺省情况下,NONE是类型,因此我们为所有系统表使用默认的memstore。如果需要对系统表使用压缩memstore,则将此属性设置为BASIC/EAGER。

Default

NONE

没有一个

hbase.regionserver.optionalcacheflushinterval
Description

Maximum amount of time an edit lives in memory before being automatically flushed. Default 1 hour. Set it to 0 to disable automatic flushing.

在自动刷新之前,在内存中编辑生命的最长时间。默认的1小时。设置为0以禁用自动刷新。

Default

3600000

3600000

hbase.regionserver.dns.interface
Description

The name of the Network Interface from which a region server should report its IP address.

一个区域服务器应该报告其IP地址的网络接口的名称。

Default

default

默认的

hbase.regionserver.dns.nameserver
Description

The host name or IP address of the name server (DNS) which a region server should use to determine the host name used by the master for communication and display purposes.

名称服务器(DNS)的主机名或IP地址,区域服务器应该使用它来确定主用于通信和显示目的使用的主机名。

Default

default

默认的

hbase.regionserver.region.split.policy
Description

A split policy determines when a region should be split. The various other split policies that are available currently are BusyRegionSplitPolicy, ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy, DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy, and SteppingSplitPolicy. DisabledRegionSplitPolicy blocks manual region splitting.

分割策略决定一个区域何时应该被分割。当前可用的各种其他分裂策略包括busysplitpolicy、ConstantSizeRegionSplitPolicy、DisabledRegionSplitPolicy、delimitedkeyprefixsplitpolicy、keyprefixsplitpolicy和SteppingSplitPolicy。DisabledRegionSplitPolicy阻止手动区域分割。

Default

org.apache.hadoop.hbase.regionserver.SteppingSplitPolicy

org.apache.hadoop.hbase.regionserver.SteppingSplitPolicy

hbase.regionserver.regionSplitLimit
Description

Limit for the number of regions after which no more region splitting should take place. This is not a hard limit for the number of regions but acts as a guideline for the regionserver to stop splitting after a certain limit. Default is set to 1000.

在没有更多区域分裂的区域的数量限制应该发生。这对于区域的数量并不是严格的限制,而是作为区域服务器在一定限度后停止分裂的指南。默认设置为1000。

Default

1000

1000年

zookeeper.session.timeout
Description

ZooKeeper session timeout in milliseconds. It is used in two different ways. First, this value is used in the ZK client that HBase uses to connect to the ensemble. It is also used by HBase when it starts a ZK server and it is passed as the 'maxSessionTimeout'. See http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions. For example, if an HBase region server connects to a ZK ensemble that’s also managed by HBase, then the session timeout will be the one specified by this configuration. But, a region server that connects to an ensemble managed with a different configuration will be subjected that ensemble’s maxSessionTimeout. So, even though HBase might propose using 90 seconds, the ensemble can have a max timeout lower than this and it will take precedence. The current default that ZK ships with is 40 seconds, which is lower than HBase’s.

ZooKeeper会话超时以毫秒为单位。它有两种用途。首先,该值用于HBase用于连接集成的ZK客户端。HBase在启动ZK服务器时也使用它,并作为“maxSessionTimeout”传递。见http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html # ch_zkSessions。例如,如果HBase区域服务器连接到由HBase管理的ZK集成,那么会话超时将由该配置指定。但是,连接到一个由不同配置管理的集成的区域服务器将会受到集成的maxSessionTimeout的影响。因此,尽管HBase可能建议使用90秒,但是集成可以有一个比这个更低的超时值,而且它将优先。ZK船现在的默认值是40秒,低于HBase。

Default

90000

90000年

zookeeper.znode.parent
Description

Root ZNode for HBase in ZooKeeper. All of HBase’s ZooKeeper files that are configured with a relative path will go under this node. By default, all of HBase’s ZooKeeper file paths are configured with a relative path, so they will all go under this directory unless changed.

在ZooKeeper中HBase的根ZNode。所有配置了相对路径的HBase的ZooKeeper文件都将在这个节点下运行。默认情况下,所有HBase的ZooKeeper文件路径都配置了相对路径,所以它们都将在这个目录下运行,除非发生更改。

Default

/hbase

/ hbase

zookeeper.znode.acl.parent
Description

Root ZNode for access control lists.

访问控制列表的根ZNode。

Default

acl

acl

hbase.zookeeper.dns.interface
Description

The name of the Network Interface from which a ZooKeeper server should report its IP address.

一个ZooKeeper服务器应该报告其IP地址的网络接口的名称。

Default

default

默认的

hbase.zookeeper.dns.nameserver
Description

The host name or IP address of the name server (DNS) which a ZooKeeper server should use to determine the host name used by the master for communication and display purposes.

名称服务器(DNS)的主机名或IP地址,由ZooKeeper服务器使用,以确定主用于通信和显示目的使用的主机名。

Default

default

默认的

hbase.zookeeper.peerport
Description

Port used by ZooKeeper peers to talk to each other. See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper for more information.

ZooKeeper的同伴使用的端口,可以互相交谈。查看http://hadoop. apache.org/zookeeper/docs/r3.1.1/zookeeperstar.html #sc_RunningReplicatedZooKeeper获取更多信息。

Default

2888

2888年

hbase.zookeeper.leaderport
Description

Port used by ZooKeeper for leader election. See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper for more information.

用于领导选举的动物管理员使用的港口。查看http://hadoop. apache.org/zookeeper/docs/r3.1.1/zookeeperstar.html #sc_RunningReplicatedZooKeeper获取更多信息。

Default

3888

3888年

hbase.zookeeper.property.initLimit
Description

Property from ZooKeeper’s config zoo.cfg. The number of ticks that the initial synchronization phase can take.

ZooKeeper的配置zoo.cfg的属性。初始同步阶段可以使用的滴答数。

Default

10

10

hbase.zookeeper.property.syncLimit
Description

Property from ZooKeeper’s config zoo.cfg. The number of ticks that can pass between sending a request and getting an acknowledgment.

ZooKeeper的配置zoo.cfg的属性。可以在发送请求和得到确认之间传递的滴答数。

Default

5

5

hbase.zookeeper.property.dataDir
Description

Property from ZooKeeper’s config zoo.cfg. The directory where the snapshot is stored.

ZooKeeper的配置zoo.cfg的属性。存储快照的目录。

Default

${hbase.tmp.dir}/zookeeper

$ { hbase.tmp.dir } /动物园管理员

hbase.zookeeper.property.clientPort
Description

Property from ZooKeeper’s config zoo.cfg. The port at which the clients will connect.

ZooKeeper的配置zoo.cfg的属性。客户端连接的端口。

Default

2181

2181年

hbase.zookeeper.property.maxClientCnxns
Description

Property from ZooKeeper’s config zoo.cfg. Limit on number of concurrent connections (at the socket level) that a single client, identified by IP address, may make to a single member of the ZooKeeper ensemble. Set high to avoid zk connection issues running standalone and pseudo-distributed.

ZooKeeper的配置zoo.cfg的属性。对单个客户端(由IP地址标识)的并发连接数量的限制,可以向ZooKeeper集成的单个成员发送。设置高,以避免zk连接问题运行独立和伪分布。

Default

300

300年

hbase.client.write.buffer
Description

Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory — on both the client and server side since server instantiates the passed write buffer to process it — but a larger buffer size reduces the number of RPCs made. For an estimate of server-side memory-used, evaluate hbase.client.write.buffer * hbase.regionserver.handler.count

BufferedMutator的默认大小以字节为单位编写缓冲区。一个更大的缓冲区会占用更多的内存——因为服务器实例化了已传递的写缓冲区来处理它——但是更大的缓冲区大小减少了RPCs的数量。对于服务器端内存使用的估计,评估hbase.client.write。缓冲* hbase.regionserver.handler.count

Default

2097152

2097152
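
As a rough worked example of that estimate, using the defaults listed in this guide:

hbase.client.write.buffer x hbase.regionserver.handler.count
  = 2097152 bytes x 30
  = 62914560 bytes, or about 60 MB of potential write-buffer memory per RegionServer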

hbase.client.pause
Description

General client pause value. Used mostly as value to wait before running a retry of a failed get, region lookup, etc. See hbase.client.retries.number for description of how we backoff from this initial pause amount and how this pause works w/ retries.

一般客户暂停价值。在运行一个失败的get、区域查找等重试之前,主要使用的是等待的值。描述我们如何从最初的暂停数量和暂停如何工作w/重试的描述。

Default

100

One hundred.

hbase.client.pause.cqtbe
Description

Whether or not to use a special client pause for CallQueueTooBigException (cqtbe). Set this property to a higher value than hbase.client.pause if you observe frequent CQTBE from the same RegionServer and the call queue there keeps full.

是否使用一个特殊的客户端暂停调用CallQueueTooBigException (cqtbe)。将此属性设置为高于hbase.client的值。如果您在同一区域服务器上观察频繁的CQTBE,并且在那里的调用队列保持完整,请暂停。

Default

none

没有一个

hbase.client.retries.number
Description

Maximum retries. Used as maximum for all retryable operations such as the getting of a cell’s value, starting a row update, etc. Retry interval is a rough function based on hbase.client.pause. At first we retry at this interval but then with backoff, we pretty quickly reach retrying every ten seconds. See HConstants#RETRY_BACKOFF for how the backoff ramps up. Change this setting and hbase.client.pause to suit your workload.

最大重试。对于所有可重新尝试的操作,如获取单元的值、开始行更新等,使用的最大。重试间隔是基于hbase.client的粗糙函数。一开始我们在这个时间间隔重试,但之后我们会很快的重新尝试每十秒。请参阅hconstant #RETRY_BACKOFF,以了解备份是如何提高的。更改此设置和hbase.client。暂停以适应你的工作量。

Default

15

15
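
As a hedged sketch of tuning both properties together in hbase-site.xml for a workload that prefers to fail fast (the values are illustrative only):

<property>
  <name>hbase.client.retries.number</name>
  <value>5</value>
</property>
<property>
  <name>hbase.client.pause</name>
  <value>50</value>
</property>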

hbase.client.max.total.tasks
Description

The maximum number of concurrent mutation tasks a single HTable instance will send to the cluster.

单个HTable实例将发送到集群的并发突变任务的最大数量。

Default

100

One hundred.

hbase.client.max.perserver.tasks
Description

The maximum number of concurrent mutation tasks a single HTable instance will send to a single region server.

单个HTable实例将发送到单个区域服务器的并发突变任务的最大数量。

Default

2

2

hbase.client.max.perregion.tasks
Description

The maximum number of concurrent mutation tasks the client will maintain to a single Region. That is, if there is already hbase.client.max.perregion.tasks writes in progress for this region, new puts won’t be sent to this region until some writes finishes.

客户机将维护到单个区域的并发突变任务的最大数量。也就是说,如果已经有hbase.client.max.perregion。在这个区域的任务写入过程中,新的put将不会被发送到这个区域,直到一些写完为止。

Default

1

1

hbase.client.perserver.requests.threshold
Description

The max number of concurrent pending requests for one server in all client threads (process level). Requests that exceed this limit will immediately be thrown a ServerTooBusyException to prevent the user’s threads from being occupied and blocked by only one slow region server. If you use a fixed number of threads to access HBase in a synchronous way, setting this to a suitable value related to the number of threads will help you. See https://issues.apache.org/jira/browse/HBASE-16388 for details.

在所有客户端线程(流程级别)中,一个服务器的并发挂起请求的最大数量。超过请求将立即抛出ServerTooBusyException,以防止用户的线程被一个缓慢的区域服务器占用和阻塞。如果您使用固定数量的线程以同步的方式访问HBase,将其设置为与线程数相关的适当值将帮助您。有关详细信息,请参阅https://issues.apache.org/jira/browse/hbase - 16388。

Default

2147483647

2147483647

hbase.client.scanner.caching
Description

Number of rows that we try to fetch when calling next on a scanner if it is not served from (local, client) memory. This configuration works together with hbase.client.scanner.max.result.size to try and use the network efficiently. The default value is Integer.MAX_VALUE, so that the network will fill the chunk size defined by hbase.client.scanner.max.result.size rather than be limited by a particular number of rows, since the size of rows varies table to table. If you know ahead of time that you will not require more than a certain number of rows from a scan, this configuration should be set to that row limit via Scan#setCaching. Higher caching values will enable faster scanners but will eat up more memory, and some calls of next may take longer and longer times when the cache is empty. Do not set this value such that the time between invocations is greater than the scanner timeout; i.e. hbase.client.scanner.timeout.period.

如果没有从(本地、客户端)内存中提供服务,那么我们在调用next时尝试获取的行数。该配置与hbase.client.scanner.max.result一起工作。大小尝试并有效地使用网络。默认值是整数。默认的MAX_VALUE,这样网络将填充由hbase.client.scanner.max.result定义的块大小。大小,而不是受特定的行数限制,因为行的大小因表而异。如果您提前知道,您将不需要从扫描中获得一定数量的行,那么该配置应该通过scan # setcache设置为该行限制。更高的缓存值将启用更快的扫描器,但会消耗更多的内存,而当缓存为空时,接下来的一些调用可能会花费更长的时间。不要设置此值,因为调用之间的时间大于扫描超时;即hbase.client.scanner.timeout.period

Default

2147483647

2147483647
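
If you know ahead of time that your scans never need more than, say, 500 rows per batch, one way to express that cluster-wide (rather than per-Scan via Scan#setCaching) is an hbase-site.xml override; the value shown is illustrative:

<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>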

hbase.client.keyvalue.maxsize
Description

Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split it helps avoiding that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.

指定键值实例的最大允许大小。这是为保存在存储文件中的单个条目设置一个上限。因为它们不能被分割,所以避免了一个区域不能被分割,因为数据太大了。将其设置为最大区域大小的一小部分似乎是明智的。将其设置为零或更少的禁用该检查。

Default

10485760

10485760

hbase.server.keyvalue.maxsize
Description

Maximum allowed size of an individual cell, inclusive of value and all key components. A value of 0 or less disables the check. The default value is 10MB. This is a safety setting to protect the server from OOM situations.

允许单个单元的最大允许大小,包括值和所有关键组件。值为0或更小的值将会禁用该检查。默认值是10MB。这是一个安全设置,以保护服务器不受OOM情况的影响。

Default

10485760

10485760

hbase.client.scanner.timeout.period
Description

Client scanner lease period in milliseconds.

客户端扫描仪租期以毫秒为单位。

Default

60000

60000年

hbase.client.localityCheck.threadPoolSize
Default

2

2

hbase.bulkload.retries.number
Description

Maximum retries. This is the maximum number of iterations that atomic bulk loads are attempted in the face of splitting operations. 0 means never give up.

最大重试。这是对原子批量负载的最大迭代次数的尝试,在分裂操作0意味着永不放弃。

Default

10

10

hbase.master.balancer.maxRitPercent
Description

The max percent of regions in transition when balancing. The default value is 1.0, so there is no balancer throttling. If this config is set to 0.01, it means that at most 1% of regions are in transition when balancing, so the cluster’s availability is at least 99% while balancing.

在平衡时,过渡区域的最大百分比。默认值是1.0。所以没有平衡。如果将此配置设置为0.01,则意味着在平衡时,在转换过程中最多有1%的区域。然后,在平衡时,集群的可用性至少达到99%。

Default

1.0

1.0

hbase.balancer.period
Description

Period at which the region balancer runs in the Master.

区域平衡器在主内运行的周期。

Default

300000

300000年

hbase.normalizer.period
Description

Period at which the region normalizer runs in the Master.

在主程序中,区域正常程序运行的时间。

Default

300000

300000年

hbase.regions.slop
Description

Rebalance if any regionserver has average + (average * slop) regions. The default value of this parameter is 0.001 in StochasticLoadBalancer (the default load balancer), while the default is 0.2 in other load balancers (i.e., SimpleLoadBalancer).

如果任何区域服务器有平均+(平均* slop)区域,则重新平衡。这个参数的默认值是StochasticLoadBalancer(默认负载均衡器)的0.001,而其他负载均衡器的默认值为0.2。SimpleLoadBalancer)。

Default

0.001

0.001

hbase.server.thread.wakefrequency
Description

Time to sleep in between searches for work (in milliseconds). Used as sleep interval by service threads such as log roller.

在搜索工作(以毫秒为单位)之间的时间。以服务线程(如日志滚轮)作为睡眠时间间隔。

Default

10000

10000年

hbase.server.versionfile.writeattempts
Description

How many times to retry attempting to write a version file before just aborting. Each attempt is separated by the hbase.server.thread.wakefrequency milliseconds.

有多少次尝试在放弃之前尝试写一个版本文件。每个尝试都由hbase.server.thread分隔。wakefrequency毫秒。

Default

3

3

hbase.hregion.memstore.flush.size
Description

Memstore will be flushed to disk if size of the memstore exceeds this number of bytes. Value is checked by a thread that runs every hbase.server.thread.wakefrequency.

如果Memstore的大小超过这个字节数,Memstore将被刷新到磁盘。值是由运行每个hbase.server. server.thread.wakefrequency的线程来检查的。

Default

134217728

134217728

hbase.hregion.percolumnfamilyflush.size.lower.bound.min
Description

If FlushLargeStoresPolicy is used and there are multiple column families, then every time that we hit the total memstore limit, we find out all the column families whose memstores exceed a "lower bound" and only flush them while retaining the others in memory. The "lower bound" will be "hbase.hregion.memstore.flush.size / column_family_number" by default unless value of this property is larger than that. If none of the families have their memstore size more than lower bound, all the memstores will be flushed (just as usual).

如果使用FlushLargeStoresPolicy,并且有多个列族,那么每次我们达到内存存储限制时,我们就会发现所有的列家庭,它们的memstore超过了一个“下界”,只在内存中保留其他的时候刷新它们。“下界”将是“hbase.hzone .memstore.flush”。默认情况下,除非这个属性的值大于这个值,否则默认值为“size / column_family_number”。如果没有一个家庭的memstore大小超过下界,那么所有的memstore都会被刷新(就像往常一样)。

Default

16777216

16777216

hbase.hregion.preclose.flush.size
Description

If the memstores in a region are this size or larger when we go to close, run a "pre-flush" to clear out memstores before we put up the region closed flag and take the region offline. On close, a flush is run under the close flag to empty memory. During this time the region is offline and we are not taking on any writes. If the memstore content is large, this flush could take a long time to complete. The preflush is meant to clean out the bulk of the memstore before putting up the close flag and taking the region offline so the flush that runs under the close flag has little to do.

如果一个区域的memstore在我们关闭的时候是这个大小或者更大,那么在我们挂起该区域关闭标志并将该区域关闭之前,运行一个“预刷新”来清空memstores。在关闭的情况下,一个刷新在关闭的标志下运行到空内存。在此期间,该区域处于脱机状态,我们不承担任何写入操作。如果memstore内容很大,那么这个刷新可能需要很长时间才能完成。preflush的意思是在挂起关闭的标志和关闭该区域之前清除内存中的大部分,所以在关闭标志下运行的刷新没有什么作用。

Default

5242880

5242880

hbase.hregion.memstore.block.multiplier
Description

Block updates if memstore has hbase.hregion.memstore.block.multiplier times hbase.hregion.memstore.flush.size bytes. Useful for preventing runaway memstore during spikes in update traffic. Without an upper-bound, memstore fills such that when it flushes, the resultant flush files take a long time to compact or split, or worse, we OOME.

如果memstore有hbase. hzone .memstore.block,就可以对其进行块更新。hbase.hregion.memstore.flush乘数倍。字节大小。在更新流量的峰值期间防止失控的memstore。如果没有上界,memstore就会填充,这样当它刷新合成的刷新文件时,就需要很长的时间来压缩或拆分,或者更糟,我们OOME。

Default

4

4
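
With the defaults listed in this guide, the per-region blocking threshold works out as:

hbase.hregion.memstore.block.multiplier x hbase.hregion.memstore.flush.size
  = 4 x 134217728 bytes (128 MB)
  = 536870912 bytes (512 MB) of memstore per region before updates are blocked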

hbase.hregion.memstore.mslab.enabled
Description

Enables the MemStore-Local Allocation Buffer, a feature which works to prevent heap fragmentation under heavy write loads. This can reduce the frequency of stop-the-world GC pauses on large heaps.

启用MemStore-Local分配缓冲区,该特性可以防止在重写负载下的堆碎片。这可以减少在大型堆上停止世界GC暂停的频率。

Default

true

真正的

hbase.hregion.max.filesize
Description

Maximum HFile size. If the sum of the sizes of a region’s HFiles has grown to exceed this value, the region is split in two.

最大HFile大小。如果一个区域的HFiles的大小总和已经超过这个值,那么该区域将被一分为二。

Default

10737418240

10737418240

hbase.hregion.majorcompaction
Description

Time between major compactions, expressed in milliseconds. Set to 0 to disable time-based automatic major compactions. User-requested and size-based major compactions will still run. This value is multiplied by hbase.hregion.majorcompaction.jitter to cause compaction to start at a somewhat-random time during a given window of time. The default value is 7 days, expressed in milliseconds. If major compactions are causing disruption in your environment, you can configure them to run at off-peak times for your deployment, or disable time-based major compactions by setting this parameter to 0, and run major compactions in a cron job or by another external mechanism.

主要压实之间的时间,以毫秒表示。设置为0,以禁用基于时间的自动主要压缩。用户请求和基于大小的主要压缩仍然会运行。这个值乘以hbase。hbase。在给定时间窗口内的某个随机时间,抖动导致压缩。默认值为7天,以毫秒表示。如果主要的压缩在您的环境中造成了破坏,您可以配置它们在非高峰时间运行您的部署,或者通过将此参数设置为0来禁用基于时间的主要压缩,并在cron作业或其他外部机制中运行主要的压缩。

Default

604800000

604800000
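
As a sketch of the external-scheduling approach mentioned above, you might disable time-based major compactions in hbase-site.xml:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>

and then trigger them yourself, for example from a cron entry such as the following (the table name, schedule, and install path are hypothetical):

# major-compact 'mytable' every Sunday at 03:00
0 3 * * 0  echo "major_compact 'mytable'" | /usr/local/hbase/bin/hbase shell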

hbase.hregion.majorcompaction.jitter
Description

A multiplier applied to hbase.hregion.majorcompaction to cause compaction to occur a given amount of time either side of hbase.hregion.majorcompaction. The smaller the number, the closer the compactions will happen to the hbase.hregion.majorcompaction interval.

一个用于hbase.hregion的乘数。主要压实作用,使压实发生在某一给定的时间内,即hbase.h。数字越小,hbase.hregion就会越紧密。majorcompaction区间。

Default

0.50

0.50

hbase.hstore.compactionThreshold
Description

If more than this number of StoreFiles exist in any one Store (one StoreFile is written per flush of MemStore), a compaction is run to rewrite all StoreFiles into a single StoreFile. Larger values delay compaction, but when compaction does occur, it takes longer to complete.

如果在任何一个商店中存在超过此数量的存储文件(每个存储文件都有一个存储文件),那么就会运行一个compaction来将所有的存储文件重写为一个单一的StoreFile。较大的值延迟了压缩,但是当压缩确实发生时,需要更长的时间才能完成。

Default

3

3

hbase.hstore.flusher.count
Description

The number of flush threads. With fewer threads, the MemStore flushes will be queued. With more threads, the flushes will be executed in parallel, increasing the load on HDFS, and potentially causing more compactions.

刷新线程的数量。使用更少的线程,MemStore刷新将被排队。有了更多的线程,这些刷新将并行执行,增加了对HDFS的负载,并可能导致更多的压缩。

Default

2

2

hbase.hstore.blockingStoreFiles
Description

If more than this number of StoreFiles exist in any one Store (one StoreFile is written per flush of MemStore), updates are blocked for this region until a compaction is completed, or until hbase.hstore.blockingWaitTime has been exceeded.

如果在任何一个存储库中存在超过此数量的存储文件(每个存储文件都有一个存储文件),那么该区域的更新将被阻塞,直到完成压缩,或者直到hbase.hstore。blockingWaitTime已经超过了。

Default

16

16

hbase.hstore.blockingWaitTime
Description

The time for which a region will block updates after reaching the StoreFile limit defined by hbase.hstore.blockingStoreFiles. After this time has elapsed, the region will stop blocking updates even if a compaction has not been completed.

在到达由hbase.hstore.blockingStoreFiles定义的存储文件限制后,该区域将阻塞更新的时间。在这段时间过后,即使没有完成压缩,该区域也将停止阻塞更新。

Default

90000

90000年

hbase.hstore.compaction.min
Description

The minimum number of StoreFiles which must be eligible for compaction before compaction can run. The goal of tuning hbase.hstore.compaction.min is to avoid ending up with too many tiny StoreFiles to compact. Setting this value to 2 would cause a minor compaction each time you have two StoreFiles in a Store, and this is probably not appropriate. If you set this value too high, all the other values will need to be adjusted accordingly. For most cases, the default value is appropriate. In previous versions of HBase, the parameter hbase.hstore.compaction.min was named hbase.hstore.compactionThreshold.

在compaction运行之前,必须符合压缩条件的最小存储文件数。优化hbase.hstore.compaction.min的目的是避免使用太多的小存储文件来压缩。将此值设置为2会导致每次在存储中有两个storefile时都会产生一个小的压缩,而这可能是不合适的。如果将此值设置得过高,则需要相应地调整所有其他值。对于大多数情况,默认值是合适的。在HBase的以前版本中,HBase .hstore.compaction.min被命名为hbase.hstore.compactionThreshold。

Default

3

3

hbase.hstore.compaction.max
Description

The maximum number of StoreFiles which will be selected for a single minor compaction, regardless of the number of eligible StoreFiles. Effectively, the value of hbase.hstore.compaction.max controls the length of time it takes a single compaction to complete. Setting it larger means that more StoreFiles are included in a compaction. For most cases, the default value is appropriate.

将为单个小型压缩而选择的存储文件的最大数量,而不考虑合格的存储文件的数量。实际上,hbase.hstore.compaction.max控制了单个压缩完成的时间长度。更大的设置意味着在压缩中包含更多的存储文件。对于大多数情况,默认值是合适的。

Default

10

10

hbase.hstore.compaction.min.size
Description

A StoreFile (or a selection of StoreFiles, when using ExploringCompactionPolicy) smaller than this size will always be eligible for minor compaction. HFiles this size or larger are evaluated by hbase.hstore.compaction.ratio to determine if they are eligible. Because this limit represents the "automatic include" limit for all StoreFiles smaller than this value, this value may need to be reduced in write-heavy environments where many StoreFiles in the 1-2 MB range are being flushed, because every StoreFile will be targeted for compaction and the resulting StoreFiles may still be under the minimum size and require further compaction. If this parameter is lowered, the ratio check is triggered more quickly. This addressed some issues seen in earlier versions of HBase but changing this parameter is no longer necessary in most situations. Default: 128 MB expressed in bytes.

小于这个大小的StoreFile(或使用ExploringCompactionPolicy的存储文件的选择)总是符合小型压缩的条件。这个大小的HFiles由hbase.hstore.compaction.ratio来计算,以确定它们是否符合条件。因为这个限制代表了“自动包括“限制StoreFiles小于这个值,这个值可能需要减少write-heavy环境中1 - 2 MB的许多StoreFiles范围被刷新,因为每个StoreFile将针对压实和由此产生的StoreFiles仍可能受到最小的大小和需要进一步压实。如果这个参数被降低,比率检查被触发的更快。这解决了在早期版本的HBase中所看到的一些问题,但是在大多数情况下更改此参数不再是必要的。默认值:128 MB以字节表示。

Default

134217728

134217728

hbase.hstore.compaction.max.size
Description

A StoreFile (or a selection of StoreFiles, when using ExploringCompactionPolicy) larger than this size will be excluded from compaction. The effect of raising hbase.hstore.compaction.max.size is fewer, larger StoreFiles that do not get compacted often. If you feel that compaction is happening too often without much benefit, you can try raising this value. Default: the value of LONG.MAX_VALUE, expressed in bytes.

大于这个大小的StoreFile(或使用ExploringCompactionPolicy的存储文件的选择)将被排除在compaction之外。增大hbase.hstore.compaction.max.size的效果更小,更大的存储文件不经常被压缩。如果你觉得压实经常发生,没有太多好处,你可以试着提高这个值。默认值:LONG的值。MAX_VALUE,表示字节。

Default

9223372036854775807

9223372036854775807

hbase.hstore.compaction.ratio
Description

For minor compaction, this ratio is used to determine whether a given StoreFile which is larger than hbase.hstore.compaction.min.size is eligible for compaction. Its effect is to limit compaction of large StoreFiles. The value of hbase.hstore.compaction.ratio is expressed as a floating-point decimal. A large ratio, such as 10, will produce a single giant StoreFile. Conversely, a low value, such as .25, will produce behavior similar to the BigTable compaction algorithm, producing four StoreFiles. A moderate value of between 1.0 and 1.4 is recommended. When tuning this value, you are balancing write costs with read costs. Raising the value (to something like 1.4) will have more write costs, because you will compact larger StoreFiles. However, during reads, HBase will need to seek through fewer StoreFiles to accomplish the read. Consider this approach if you cannot take advantage of Bloom filters. Otherwise, you can lower this value to something like 1.0 to reduce the background cost of writes, and use Bloom filters to control the number of StoreFiles touched during reads. For most cases, the default value is appropriate.

对于较小的压缩,此比率用于确定一个给定的存储文件是否大于hbase.hstore.hstore.comaction .min.size具有压缩的条件。它的作用是限制大型存储文件的压缩。hbase.hstore.compaction.ratio被表示为浮点小数。一个大的比率,例如10,将产生一个单一的巨大的存储文件。相反,像.25这样的低值将产生类似于BigTable compaction算法的行为,生成4个StoreFiles。建议在1.0和1.4之间有一个适中的值。在调优这个值时,您需要平衡写成本和读取成本。提高值(大约1.4)会有更多的写成本,因为你会压缩更大的存储文件。但是,在读取过程中,HBase需要通过更少的存储文件来完成读取操作。如果不能利用Bloom过滤器,请考虑这种方法。否则,您可以将这个值降低到类似于1.0这样的东西,以降低写的后台成本,并使用Bloom过滤器来控制在读取过程中被触摸的存储文件的数量。对于大多数情况,默认值是合适的。

Default

1.2F

1.2度

hbase.hstore.compaction.ratio.offpeak
Description

Allows you to set a different (by default, more aggressive) ratio for determining whether larger StoreFiles are included in compactions during off-peak hours. Works in the same way as hbase.hstore.compaction.ratio. Only applies if hbase.offpeak.start.hour and hbase.offpeak.end.hour are also enabled.

允许您设置不同的(默认的,更积极的)比率,以确定在非繁忙时间内是否包含更大的存储文件。与hbase.hstore.c .compaction.ratio相同。仅适用于如果hbase.offpeak.start。小时,hbase.offpeak.end。小时也启用。

Default

5.0F

5.0度

hbase.hstore.time.to.purge.deletes
Description

The amount of time to delay purging of delete markers with future timestamps. If unset, or set to 0, all delete markers, including those with future timestamps, are purged during the next major compaction. Otherwise, a delete marker is kept until the major compaction which occurs after the marker’s timestamp plus the value of this setting, in milliseconds.

在未来时间戳中延迟删除标记的时间。如果未设置或设置为0,所有删除标记,包括那些具有未来时间戳的标记,将在接下来的主要压缩过程中被清除。否则,在标记的时间戳加上该设置的值(以毫秒为单位)后,将保留一个删除标记,直到主压缩。

Default

0

0

hbase.offpeak.start.hour
Description

The start of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set to -1 to disable off-peak.

非高峰时间的开始,表示为0到23之间的整数,包括。设置为-1,以禁用非峰值。

Default

-1

1

hbase.offpeak.end.hour
Description

The end of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set to -1 to disable off-peak.

非高峰时间的结束,表示为0到23之间的整数,包括。设置为-1,以禁用非峰值。

Default

-1

1
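
A hedged example that enables an off-peak window from midnight to 6 AM, so that hbase.hstore.compaction.ratio.offpeak above can take effect (the hours are illustrative):

<property>
  <name>hbase.offpeak.start.hour</name>
  <value>0</value>
</property>
<property>
  <name>hbase.offpeak.end.hour</name>
  <value>6</value>
</property>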

hbase.regionserver.thread.compaction.throttle
Description

There are two different thread pools for compactions, one for large compactions and the other for small compactions. This helps to keep compaction of lean tables (such as hbase:meta) fast. If a compaction is larger than this threshold, it goes into the large compaction pool. In most cases, the default value is appropriate. Default: 2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size (which defaults to 128MB). The value field assumes that the value of hbase.hregion.memstore.flush.size is unchanged from the default.

compaction有两个不同的线程池,一个用于大的压缩,另一个用于小的压缩。这有助于保持瘦表(如hbase:meta)的紧凑。如果一个压缩比这个阈值大,它就会进入大的压缩池。在大多数情况下,默认值是适当的。默认值:2 x hbase. hstore.m。hbase。大小(默认为128MB)。值字段假定hbase.harea .memstore.flush的值。大小与默认值保持不变。

Default

2684354560

2684354560

hbase.regionserver.majorcompaction.pagecache.drop
Description

Specifies whether to drop pages read/written into the system page cache by major compactions. Setting it to true helps prevent major compactions from polluting the page cache, which is almost always required, especially for clusters with low/moderate memory to storage ratio.

指定是否通过主要的压缩文件将读/写入到系统页面缓存中。将其设置为true有助于防止主要的压缩行为污染页面缓存,这几乎总是必需的,特别是对于低/中等内存的集群来说。

Default

true

真正的

hbase.regionserver.minorcompaction.pagecache.drop
Description

Specifies whether to drop pages read/written into the system page cache by minor compactions. Setting it to true helps prevent minor compactions from polluting the page cache, which is most beneficial on clusters with low memory to storage ratio or very write heavy clusters. You may want to set it to false under moderate to low write workload when bulk of the reads are on the most recently written data.

指定是否通过较小的压缩来将读/写进系统页面缓存。将其设置为true可以帮助防止轻微的压缩导致页面缓存的污染,这对低内存的集群非常有益,也可以很好地编写集群。当大部分读操作都在最近的书面数据上时,您可能希望将其设置为在中等到低的写工作负载下。

Default

true

真正的

hbase.hstore.compaction.kv.max
Description

The maximum number of KeyValues to read and then write in a batch when flushing or compacting. Set this lower if you have big KeyValues and problems with Out Of Memory Exceptions. Set this higher if you have wide, small rows.

当刷新或压缩时,要读取的键值的最大数量,然后在批处理中写入。如果你有大的键值,如果你有很大的内存不足,那么设置这个值,如果你有宽的小的行,就设置这个值。

Default

10

10

hbase.storescanner.parallel.seek.enable
Description

Enables StoreFileScanner parallel-seeking in StoreScanner, a feature which can reduce response latency under special conditions.

在StoreScanner中启用StoreFileScanner并行搜索,这一特性可以在特殊情况下减少响应延迟。

Default

false

hbase.storescanner.parallel.seek.threads
Description

The default thread pool size if the parallel-seeking feature is enabled.

如果启用了并行查找功能,则默认的线程池大小。

Default

10

10

hfile.block.cache.size
Description

Percentage of maximum heap (-Xmx setting) to allocate to block cache used by a StoreFile. Default of 0.4 means allocate 40%. Set to 0 to disable but it’s not recommended; you need at least enough cache to hold the storefile indices.

最大堆(-Xmx设置)的百分比,用于分配存储文件使用的块缓存。默认的0.4意味着分配40%。设置为0禁用,但不建议使用;您需要至少足够的缓存来保存storefile索引。

Default

0.4

0.4

hfile.block.index.cacheonwrite
Description

This allows non-root multi-level index blocks to be put into the block cache at the time the index is being written.

这允许在编写索引时将非根的多级索引块放到块缓存中。

Default

false

hfile.index.block.max.size
Description

When the size of a leaf-level, intermediate-level, or root-level index block in a multi-level block index grows to this size, the block is written out and a new block is started.

当一个多层块索引的叶级、中间层或根级索引块的大小增长到这个大小时,块就被写出来了,一个新的块就开始了。

Default

131072

131072年

hbase.bucketcache.ioengine
Description

Where to store the contents of the bucketcache. One of: offheap, file, files or mmap. If a file or files, set it to file(s):PATH_TO_FILE. mmap means the content will be in an mmaped file. Use mmap:PATH_TO_FILE. See http://hbase.apache.org/book.html#offheap.blockcache for more information.

在哪里存储bucketcache的内容。一个:offheap, file, files或mmap。如果文件或文件,将其设置为file(s):PATH_TO_FILE。mmap意味着内容将在mmaped文件中。用mmap:PATH_TO_FILE。参见http://hbase.apache.org/book.html#offheap.blockcache获取更多信息。

Default

none

没有一个

hbase.bucketcache.size
Description

A float that EITHER represents a percentage of total heap memory size to give to the cache (if < 1.0) OR, it is the total capacity in megabytes of BucketCache. Default: 0.0

一个浮点数,它代表总堆内存大小的百分比给缓存(如果< 1.0),或者,它是总容量的兆字节的BucketCache。默认值:0.0

Default

none

没有一个

hbase.bucketcache.bucket.sizes
Description

A comma-separated list of sizes for buckets for the bucketcache. Can be multiple sizes. List block sizes in order from smallest to largest. The sizes you use will depend on your data access patterns. Must be a multiple of 256 else you will run into 'java.io.IOException: Invalid HFile block magic' when you go to read from cache. If you specify no values here, then you pick up the default bucketsizes set in code (See BucketAllocator#DEFAULT_BUCKET_SIZES).

一个逗号分隔的桶的大小列表,用于桶缓存。可以多个大小。列表块大小从最小到最大。您使用的大小将取决于您的数据访问模式。必须是256的倍数,否则你会遇到“java.io”。当你从缓存中读取时,IOException:无效的HFile块魔法。如果您在这里指定了没有值,那么您可以选择代码中设置的默认大小(参见BucketAllocator# default_bucket_size)。

Default

none

没有一个
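
Tying the three bucketcache properties above together, here is a minimal hbase-site.xml sketch that puts the L2 cache off-heap. The 4096 MB capacity is purely illustrative, and an off-heap deployment also assumes the direct-memory allocation is sized accordingly (commonly via HBASE_OFFHEAPSIZE in hbase-env.sh); verify both against your version's documentation.

<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value> <!-- >= 1.0, so interpreted as a capacity in MB rather than a heap fraction -->
</property>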

hfile.format.version
Description

The HFile format version to use for new files. Version 3 adds support for tags in hfiles (See http://hbase.apache.org/book.html#hbase.tags). Also see the configuration 'hbase.replication.rpc.codec'.

用于新文件的HFile格式版本。版本3在hfiles中添加了对标记的支持(参见http://hbase.apache.org/book.html#hbase.tag)。也可以看到配置'hbase. replices.rpc.codec '。

Default

3

3

hfile.block.bloom.cacheonwrite
Description

Enables cache-on-write for inline blocks of a compound Bloom filter.

启用复合Bloom filter的内联块的缓存。

Default

false

io.storefile.bloom.block.size
Description

The size in bytes of a single block ("chunk") of a compound Bloom filter. This size is approximate, because Bloom blocks can only be inserted at data block boundaries, and the number of keys per data block varies.

一个复合Bloom filter的一个块(“块”)的字节大小。这个大小是近似的,因为Bloom块只能插入到数据块边界,每个数据块的键数也不同。

Default

131072

131072

hbase.rs.cacheblocksonwrite
Description

Whether an HFile block should be added to the block cache when the block is finished.

当块完成时,是否应该将HFile块添加到块缓存中。

Default

false

hbase.rpc.timeout
Description

This is for the RPC layer to define how long (millisecond) HBase client applications take for a remote call to time out. It uses pings to check connections but will eventually throw a TimeoutException.

这是用于RPC层来定义远程调用的远程调用需要多长(毫秒)的HBase客户机应用程序。它使用pings来检查连接,但最终会抛出一个TimeoutException。

Default

60000

60000

hbase.client.operation.timeout
Description

Operation timeout is a top-level restriction (millisecond) that makes sure a blocking operation in Table will not be blocked more than this. In each operation, if rpc request fails because of timeout or other reason, it will retry until success or throw RetriesExhaustedException. But if the total time being blocking reach the operation timeout before retries exhausted, it will break early and throw SocketTimeoutException.

操作超时是一个顶级限制(毫秒),确保表中的阻塞操作不会被阻塞。在每个操作中,如果rpc请求由于超时或其他原因而失败,它将重试直到成功或抛出retriesdexception。但是,如果在重试耗尽之前阻塞的总时间到达操作超时,那么它将提前断开并抛出SocketTimeoutException。

Default

1200000

1200000

hbase.cells.scanned.per.heartbeat.check
Description

The number of cells scanned in between heartbeat checks. Heartbeat checks occur during the processing of scans to determine whether or not the server should stop scanning in order to send back a heartbeat message to the client. Heartbeat messages are used to keep the client-server connection alive during long running scans. Small values mean that the heartbeat checks will occur more often and thus will provide a tighter bound on the execution time of the scan. Larger values mean that the heartbeat checks occur less frequently

在心跳检查之间扫描的细胞数量。在扫描过程中发生心跳检查,以确定服务器是否应该停止扫描,以便将心跳消息发送给客户端。Heartbeat消息用于在长时间运行扫描期间保持客户机-服务器连接。小的值意味着心跳检查会更频繁地发生,从而在扫描的执行时间上提供更严格的绑定。更大的值意味着心跳检查频率降低。

Default

10000

10000

hbase.rpc.shortoperation.timeout
Description

This is another version of "hbase.rpc.timeout". For those RPC operation within cluster, we rely on this configuration to set a short timeout limitation for short operation. For example, short rpc timeout for region server’s trying to report to active master can benefit quicker master failover process.

这是另一个版本的“hbase.rpc.timeout”。对于集群内的RPC操作,我们依赖于此配置来为短操作设置短的超时限制。例如,区域服务器试图向active master报告的短rpc超时可以受益更快的主故障转移过程。

Default

10000

10000

hbase.ipc.client.tcpnodelay
Description

Set no delay on rpc socket connections. See http://docs.oracle.com/javase/1.5.0/docs/api/java/net/Socket.html#getTcpNoDelay()

不要延迟rpc套接字连接。看到http://docs.oracle.com/javase/1.5.0/docs/api/java/net/Socket.html getTcpNoDelay()

Default

true

true

hbase.regionserver.hostname
Description

This config is for experts: don’t set its value unless you really know what you are doing. When set to a non-empty value, this represents the (external facing) hostname for the underlying server. See https://issues.apache.org/jira/browse/HBASE-12954 for details.

这个配置是针对专家的:除非你真的知道你在做什么,否则不要设置它的值。当设置为非空值时,这表示底层服务器的(外部面向)主机名。有关详细信息,请参阅https://issues.apache.org/jira/browse/hbase - 12954。

Default

none

没有一个

hbase.regionserver.hostname.disable.master.reversedns
Description

This config is for experts: don’t set its value unless you really know what you are doing. When set to true, regionserver will use the current node hostname for the servername and HMaster will skip reverse DNS lookup and use the hostname sent by regionserver instead. Note that this config and hbase.regionserver.hostname are mutually exclusive. See https://issues.apache.org/jira/browse/HBASE-18226 for more details.

这个配置是针对专家的:除非你真的知道你在做什么,否则不要设置它的值。当设置为true时,区域服务器将使用当前的节点主机名,而HMaster将跳过反向DNS查找并使用地区服务器发送的主机名。注意,这个配置和hbase.区域性服务器。主机名是互斥的。有关更多细节,请参见https://www.apache.org/jira/browse/hbase -18226。

Default

false

hbase.master.keytab.file
Description

Full path to the kerberos keytab file to use for logging in the configured HMaster server principal.

在配置的HMaster服务器主体中使用kerberos keytab文件的完整路径。

Default

none

没有一个

hbase.master.kerberos.principal
Description

Ex. "hbase/_HOST@EXAMPLE.COM". The kerberos principal name that should be used to run the HMaster process. The principal name should be in the form: user/hostname@DOMAIN. If "_HOST" is used as the hostname portion, it will be replaced with the actual hostname of the running instance.

例如 "hbase/_HOST@EXAMPLE.COM"。用于运行HMaster进程的kerberos主体名称。主体名称的格式应为:user/hostname@DOMAIN。如果"_HOST"用作主机名部分,它将被替换为运行实例的实际主机名。

Default

none

没有一个

hbase.regionserver.keytab.file
Description

Full path to the kerberos keytab file to use for logging in the configured HRegionServer server principal.

完整路径到kerberos keytab文件,用于在已配置的hlocationserver服务器主体中进行日志记录。

Default

none

没有一个

hbase.regionserver.kerberos.principal
Description

Ex. "hbase/_HOST@EXAMPLE.COM". The kerberos principal name that should be used to run the HRegionServer process. The principal name should be in the form: user/hostname@DOMAIN. If "_HOST" is used as the hostname portion, it will be replaced with the actual hostname of the running instance. An entry for this principal must exist in the file specified in hbase.regionserver.keytab.file

例如 "hbase/_HOST@EXAMPLE.COM"。用于运行HRegionServer进程的kerberos主体名称。主体名称的格式应为:user/hostname@DOMAIN。如果"_HOST"用作主机名部分,它将被替换为运行实例的实际主机名。该主体的条目必须存在于 hbase.regionserver.keytab.file 所指定的文件中。

Default

none

没有一个

hadoop.policy.file
Description

The policy configuration file used by RPC servers to make authorization decisions on client requests. Only used when HBase security is enabled.

RPC服务器使用的策略配置文件,用于在客户端请求上进行授权决策。仅在启用HBase安全时使用。

Default

hbase-policy.xml

hbase-policy.xml

hbase.superuser
Description

List of users or groups (comma-separated), who are allowed full privileges, regardless of stored ACLs, across the cluster. Only used when HBase security is enabled.

用户或组的列表(逗号分隔),允许在集群中不考虑存储acl的所有特权。仅在启用HBase安全时使用。

Default

none

没有一个

hbase.auth.key.update.interval
Description

The update interval for master key for authentication tokens in servers in milliseconds. Only used when HBase security is enabled.

用于在服务器中以毫秒为单位的认证标志的主密钥的更新间隔。仅在启用HBase安全时使用。

Default

86400000

86400000

hbase.auth.token.max.lifetime
Description

The maximum lifetime in milliseconds after which an authentication token expires. Only used when HBase security is enabled.

授权令牌过期后的毫秒级的最大生命周期。仅在启用HBase安全时使用。

Default

604800000

604800000

hbase.ipc.client.fallback-to-simple-auth-allowed
Description

When a client is configured to attempt a secure connection, but attempts to connect to an insecure server, that server may instruct the client to switch to SASL SIMPLE (unsecure) authentication. This setting controls whether or not the client will accept this instruction from the server. When false (the default), the client will not allow the fallback to SIMPLE authentication, and will abort the connection.

当一个客户端被配置为尝试一个安全连接,但是试图连接到一个不安全的服务器时,该服务器可能指示客户端切换到SASL简单(不安全)身份验证。该设置控制客户端是否接受来自服务器的指令。当false(缺省值)时,客户端将不允许回退到简单的身份验证,并且将中止连接。

Default

false

hbase.ipc.server.fallback-to-simple-auth-allowed
Description

When a server is configured to require secure connections, it will reject connection attempts from clients using SASL SIMPLE (unsecure) authentication. This setting allows secure servers to accept SASL SIMPLE connections from clients when the client requests. When false (the default), the server will not allow the fallback to SIMPLE authentication, and will reject the connection. WARNING: This setting should ONLY be used as a temporary measure while converting clients over to secure authentication. It MUST BE DISABLED for secure operation.

当服务器配置为需要安全连接时,它将拒绝使用SASL简单(不安全)身份验证的客户端连接尝试。此设置允许安全服务器在客户端请求时从客户端接受SASL简单连接。当false(缺省值)时,服务器将不允许回退到简单的身份验证,并且将拒绝连接。警告:此设置仅在将客户端转换为安全身份验证时作为临时措施使用。必须为安全操作禁用它。

Default

false

hbase.display.keys
Description

When this is set to true the webUI and such will display all start/end keys as part of the table details, region names, etc. When this is set to false, the keys are hidden.

当这个设置为true时,webUI将显示所有的开始/结束键作为表细节的一部分,区域名称等。当这被设置为false时,密钥将被隐藏。

Default

true

true

hbase.coprocessor.enabled
Description

Enables or disables coprocessor loading. If 'false' (disabled), any other coprocessor related configuration will be ignored.

启用或禁用协处理器加载。如果“false”(禁用),任何其他协处理器相关配置将被忽略。

Default

true

true

hbase.coprocessor.user.enabled
Description

Enables or disables user (aka. table) coprocessor loading. If 'false' (disabled), any table coprocessor attributes in table descriptors will be ignored. If "hbase.coprocessor.enabled" is 'false' this setting has no effect.

启用或禁用用户(aka)。表)协处理器加载。如果“false”(禁用),表描述符中的任何表协处理器属性都将被忽略。如果“hbase.coprocessor。启用“是假的”这个设置没有效果。

Default

true

true

hbase.coprocessor.region.classes
Description

A comma-separated list of Coprocessors that are loaded by default on all tables. For any override coprocessor method, these classes will be called in order. After implementing your own Coprocessor, just put it in HBase’s classpath and add the fully qualified class name here. A coprocessor can also be loaded on demand by setting HTableDescriptor.

在所有表上默认加载的由逗号分隔的协处理器列表。对于任何覆盖的协处理器方法,这些类将被依次调用。在实现了自己的协处理器之后,只需将它放在HBase的类路径中,并在这里添加完全限定的类名。也可以通过设置HTableDescriptor来满足需求。

Default

none

没有一个
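
As a sketch of what a classpath-wide coprocessor registration might look like, the hbase-site.xml fragment below loads a single observer. The class name org.example.MyRegionObserver is a hypothetical placeholder, not a class shipped with HBase; substitute your own implementation that is already on the server classpath.

<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.example.MyRegionObserver</value>
</property>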

hbase.coprocessor.master.classes
Description

A comma-separated list of org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that are loaded by default on the active HMaster process. For any implemented coprocessor methods, the listed classes will be called in order. After implementing your own MasterObserver, just put it in HBase’s classpath and add the fully qualified class name here.

一个逗号分隔的org.apache.hadoop.hbase.coprocessor。在活动的HMaster进程中默认加载的MasterObserver协处理器。对于任何实现的协处理器方法,将按顺序调用所列出的类。在实现了自己的MasterObserver之后,只需将它放在HBase的类路径中,并在这里添加完全限定的类名。

Default

none

没有一个

hbase.coprocessor.abortonerror
Description

Set to true to cause the hosting server (master or regionserver) to abort if a coprocessor fails to load, fails to initialize, or throws an unexpected Throwable object. Setting this to false will allow the server to continue execution but the system wide state of the coprocessor in question will become inconsistent as it will be properly executing in only a subset of servers, so this is most useful for debugging only.

设置为true,以使托管服务器(主服务器或区域服务器)在协处理器未能加载时终止,无法初始化,或抛出一个意外的可抛出对象。将此设置为false将允许服务器继续执行,但是由于将在服务器的一个子集中正确地执行,因此该协处理器的系统范围将变得不一致,因此这对于调试非常有用。

Default

true

true

hbase.rest.port
Description

The port for the HBase REST server.

HBase REST服务器的端口。

Default

8080

8080

hbase.rest.readonly
Description

Defines the mode the REST server will be started in. Possible values are: false: All HTTP methods are permitted - GET/PUT/POST/DELETE. true: Only the GET method is permitted.

定义REST服务器将启动的模式。可能的值是:false:所有HTTP方法都是允许的- GET/PUT/POST/DELETE。正确:只有GET方法是允许的。

Default

false

hbase.rest.threads.max
Description

The maximum number of threads of the REST server thread pool. Threads in the pool are reused to process REST requests. This controls the maximum number of requests processed concurrently. It may help to control the memory used by the REST server to avoid OOM issues. If the thread pool is full, incoming requests will be queued up and wait for some free threads.

REST服务器线程池的最大线程数。池中的线程被重用以处理REST请求。这将控制并发处理的请求的最大数量。它可以帮助控制REST服务器使用的内存,以避免OOM问题。如果线程池是满的,传入的请求将排队等待一些空闲线程。

Default

100

100

hbase.rest.threads.min
Description

The minimum number of threads of the REST server thread pool. The thread pool always has at least these number of threads so the REST server is ready to serve incoming requests.

REST服务器线程池的最小线程数。线程池至少有这些线程数,因此REST服务器准备好为传入请求提供服务。

Default

2

2

hbase.rest.support.proxyuser
Description

Enables running the REST server to support proxy-user mode.

支持运行REST服务器以支持代理用户模式。

Default

false

hbase.defaults.for.version.skip
Description

Set to true to skip the 'hbase.defaults.for.version' check. Setting this to true can be useful in contexts other than the other side of a maven generation; i.e. running in an IDE. You’ll want to set this boolean to true to avoid seeing the RuntimeException complaint: "hbase-default.xml file seems to be for and old version of HBase (\${hbase.version}), this version is X.X.X-SNAPSHOT"

设置为true,以跳过“hbase.default .for.version”检查。将其设置为true可以在maven生成的另一端的上下文中有用;即在IDE中运行。您需要将这个布尔值设置为true,以避免看到RuntimeException的抱怨:“hbase-default。xml文件似乎是HBase的旧版本(\${HBase .version}),这个版本是X.X.X-SNAPSHOT。

Default

false

hbase.table.lock.enable
Description

Set to true to enable locking the table in zookeeper for schema change operations. Table locking from master prevents concurrent schema modifications to corrupt table state.

设置为true,以便在zookeeper中锁定表,用于模式更改操作。从主表锁定可以防止并发模式修改到损坏的表状态。

Default

true

true

hbase.table.max.rowsize
Description

Maximum size of single row in bytes (default is 1 Gb) for Get’ting or Scan’ning without in-row scan flag set. If row size exceeds this limit RowTooBigException is thrown to client.

如果没有行扫描标记集,那么就可以使用字节(默认为1gb)来获取或扫描的单行最大大小。如果行大小超过这个限制,RowTooBigException就会被抛出到客户端。

Default

1073741824

1073741824

hbase.thrift.minWorkerThreads
Description

The "core size" of the thread pool. New threads are created on every connection until this many threads are created.

线程池的“核心大小”。在创建多个线程之前,将在每个连接上创建新的线程。

Default

16

16

hbase.thrift.maxWorkerThreads
Description

The maximum size of the thread pool. When the pending request queue overflows, new threads are created until their number reaches this number. After that, the server starts dropping connections.

线程池的最大大小。当挂起的请求队列溢出时,将创建新的线程,直到它们的数量达到这个数字。在此之后,服务器开始删除连接。

Default

1000

1000

hbase.thrift.maxQueuedRequests
Description

The maximum number of pending Thrift connections waiting in the queue. If there are no idle threads in the pool, the server queues requests. Only when the queue overflows, new threads are added, up to hbase.thrift.maxQueuedRequests threads.

在队列中等待的等待的节俭连接的最大数量。如果池中没有空闲线程,则服务器队列请求。只有当队列溢出时,才会添加新的线程,直到hbase.thrift。maxQueuedRequests线程。

Default

1000

1000

hbase.regionserver.thrift.framed
Description

Use Thrift TFramedTransport on the server side. This is the recommended transport for thrift servers and requires a similar setting on the client side. Changing this to false will select the default transport, vulnerable to DoS when malformed requests are issued due to THRIFT-601.

在服务器端使用节俭的TFramedTransport。这是为节俭服务器推荐的传输方式,并且在客户端需要类似的设置。将此更改为false将选择缺省传输,当由于THRIFT-601发出错误请求时,将容易受到DoS攻击。

Default

false

hbase.regionserver.thrift.framed.max_frame_size_in_mb
Description

Default frame size when using framed transport, in MB

使用框架传输时的默认帧大小,以MB为单位。

Default

2

2

hbase.regionserver.thrift.compact
Description

Use Thrift TCompactProtocol binary serialization protocol.

使用节约TCompactProtocol二进制序列化协议。

Default

false

hbase.rootdir.perms
Description

FS Permissions for the root data subdirectory in a secure (kerberos) setup. When master starts, it creates the rootdir with this permissions or sets the permissions if it does not match.

在安全(kerberos)设置中对根数据子目录的FS权限。当主启动时,它将使用此权限创建rootdir,或设置不匹配的权限。

Default

700

700

hbase.wal.dir.perms
Description

FS Permissions for the root WAL directory in a secure(kerberos) setup. When master starts, it creates the WAL dir with this permissions or sets the permissions if it does not match.

在一个安全的(kerberos)设置中,对根WAL目录的权限。当master启动时,它会使用该权限创建WAL dir,或者如果它不匹配,则设置权限。

Default

700

700

hbase.data.umask.enable
Description

Enable, if true, that file permissions should be assigned to the files written by the regionserver

如果是true,则应该将该文件权限分配给区域服务器所写的文件。

Default

false

hbase.data.umask
Description

File permissions that should be used to write data files when hbase.data.umask.enable is true

当 hbase.data.umask.enable 为true时,用于写入数据文件的文件权限。

Default

000

000

hbase.snapshot.enabled
Description

Set to true to allow snapshots to be taken / restored / cloned.

设置为true,以允许进行快照/恢复/克隆。

Default

true

true

hbase.snapshot.restore.take.failsafe.snapshot
Description

Set to true to take a snapshot before the restore operation. The snapshot taken will be used in case of failure, to restore the previous state. At the end of the restore operation this snapshot will be deleted

设置为true,在恢复操作之前获取快照。如果出现故障,将使用快照,以恢复以前的状态。在还原操作结束时,此快照将被删除。

Default

true

true

hbase.snapshot.restore.failsafe.name
Description

Name of the failsafe snapshot taken by the restore operation. You can use the {snapshot.name}, {table.name} and {restore.timestamp} variables to create a name based on what you are restoring.

恢复操作所采取的故障安全快照的名称。您可以使用{snapshot.name}、{table.name}和{restore。时间戳变量,根据您正在恢复的内容创建一个名称。

Default

hbase-failsafe-{snapshot.name}-{restore.timestamp}

hbase-failsafe-{snapshot.name}-{restore.timestamp}

hbase.server.compactchecker.interval.multiplier
Description

The number that determines how often we scan to see if compaction is necessary. Normally, compactions are done after some events (such as memstore flush), but if region didn’t receive a lot of writes for some time, or due to different compaction policies, it may be necessary to check it periodically. The interval between checks is hbase.server.compactchecker.interval.multiplier multiplied by hbase.server.thread.wakefrequency.

这个数字决定了我们多久扫描一次,看看是否需要压缩。通常情况下,在一些事件(比如memstore flush)之后会进行压缩,但是如果区域在一段时间内没有收到大量的写入,或者由于不同的压缩策略,可能需要定期检查它。检查的间隔是hbase.server. compactchecker.interval.乘数乘以hbase.server.thread.wakefrequency。

Default

1000

1000

hbase.lease.recovery.timeout
Description

How long we wait on dfs lease recovery in total before giving up.

在放弃之前,我们在dfs租约上总共等待了多长时间。

Default

900000

900000

hbase.lease.recovery.dfs.timeout
Description

How long between dfs recover lease invocations. Should be larger than the sum of the time it takes for the namenode to issue a block recovery command as part of datanode; dfs.heartbeat.interval and the time it takes for the primary datanode, performing block recovery to timeout on a dead datanode; usually dfs.client.socket-timeout. See the end of HBASE-8389 for more.

dfs恢复租约调用之间的时间。应该大于namenode发出块恢复命令作为datanode的一部分所需的时间之和;dfs.heartbeat.interval和主datanode的时间,执行阻塞恢复到死datanode的超时;通常dfs.client.socket-timeout。请参阅HBASE-8389的结尾。

Default

64000

64000

hbase.column.max.version
Description

New column family descriptors will use this value as the default number of versions to keep.

新的列家族描述符将使用这个值作为保留的默认版本号。

Default

1

1

dfs.client.read.shortcircuit
Description

If set to true, this configuration parameter enables short-circuit local reads.

如果设置为true,则此配置参数启用了短路本地读取。

Default

false

dfs.domain.socket.path
Description

This is a path to a UNIX domain socket that will be used for communication between the DataNode and local HDFS clients, if dfs.client.read.shortcircuit is set to true. If the string "_PORT" is present in this path, it will be replaced by the TCP port of the DataNode. Be careful about permissions for the directory that hosts the shared domain socket; dfsclient will complain if open to other users than the HBase user.

这是一个UNIX域套接字的路径,它将用于DataNode和本地HDFS客户机之间的通信,如果dfs.client.read。短路设置为真。如果在这条路径中存在字符串“_PORT”,那么它将被DataNode的TCP端口所替代。要注意托管共享域套接字的目录的权限;如果对其他用户开放,dfsclient将会向HBase用户投诉。

Default

none

没有一个
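
For illustration, the two short-circuit read properties above are typically wired together as in the fragment below. The socket path is only an example; it must match what the DataNodes themselves are configured with in hdfs-site.xml, and the directory permissions caveat above still applies.

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>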

hbase.dfs.client.read.shortcircuit.buffer.size
Description

If the DFSClient configuration dfs.client.read.shortcircuit.buffer.size is unset, we will use what is configured here as the short circuit read default direct byte buffer size. DFSClient native default is 1MB; HBase keeps its HDFS files open so number of file blocks * 1MB soon starts to add up and threaten OOME because of a shortage of direct memory. So, we set it down from the default. Make it > the default hbase block size set in the HColumnDescriptor which is usually 64k.

如果DFSClient配置dfs.client.read.短路.buffer。大小是未设置的,我们将使用这里配置的,因为短路读取默认的直接字节缓冲大小。DFSClient本机默认为1MB;HBase保持它的HDFS文件打开,所以文件块的数量* 1MB很快就开始增加并威胁到OOME,因为缺乏直接内存。因此,我们将它从默认设置下。使其>在HColumnDescriptor中默认的hbase块大小设置为64k。

Default

131072

131072

hbase.regionserver.checksum.verify
Description

If set to true (the default), HBase verifies the checksums for hfile blocks. HBase writes checksums inline with the data when it writes out hfiles. HDFS (as of this writing) writes checksums to a separate file than the data file necessitating extra seeks. Setting this flag saves some on i/o. Checksum verification by HDFS will be internally disabled on hfile streams when this flag is set. If the hbase-checksum verification fails, we will switch back to using HDFS checksums (so do not disable HDFS checksums! And besides this feature applies to hfiles only, not to WALs). If this parameter is set to false, then hbase will not verify any checksums, instead it will depend on checksum verification being done in the HDFS client.

如果设置为true(默认值),HBase将验证hfile块的校验和。HBase在写入hfiles时与数据内联地编写校验和。HDFS(在撰写本文时)将校验和写入一个单独的文件,而不是需要额外查找的数据文件。设置此标志可以节省一些i/o。当设置此标志时,HDFS的校验和将在hfile流中被内部禁用。如果hbase-checksum验证失败,我们将切换回使用HDFS校验和(所以不要禁用HDFS校验和!此外,这个特性只适用于hfiles,而不适用于WALs。如果该参数被设置为false,那么hbase将不验证任何校验和,相反,它将依赖于在HDFS客户机中进行的校验和验证。

Default

true

true

hbase.hstore.bytes.per.checksum
Description

Number of bytes in a newly created checksum chunk for HBase-level checksums in hfile blocks.

在hfile块中,新创建的用于hbase级校验和的校验和块中的字节数。

Default

16384

16384

hbase.hstore.checksum.algorithm
Description

Name of an algorithm that is used to compute checksums. Possible values are NULL, CRC32, CRC32C.

用于计算校验和的算法的名称。可能的值为NULL, CRC32, CRC32C。

Default

CRC32C

CRC32C

hbase.client.scanner.max.result.size
Description

Maximum number of bytes returned when calling a scanner’s next method. Note that when a single row is larger than this limit the row is still returned completely. The default value is 2MB, which is good for 1ge networks. With faster and/or high latency networks this value should be increased.

当调用扫描器的下一个方法时返回的最大字节数。注意,当单个行大于这个限制时,行仍然完全返回。默认值为2MB,这对1ge网络很好。随着更快和/或高延迟网络,这个值应该增加。

Default

2097152

2097152

hbase.server.scanner.max.result.size
Description

Maximum number of bytes returned when calling a scanner’s next method. Note that when a single row is larger than this limit the row is still returned completely. The default value is 100MB. This is a safety setting to protect the server from OOM situations.

当调用扫描器的下一个方法时返回的最大字节数。注意,当单个行大于这个限制时,行仍然完全返回。默认值是100MB。这是一个安全设置,以保护服务器不受OOM情况的影响。

Default

104857600

104857600

hbase.status.published
Description

This setting activates the publication by the master of the status of the region server. When a region server dies and its recovery starts, the master will push this information to the client application, to let them cut the connection immediately instead of waiting for a timeout.

此设置将激活该区域服务器状态的主发布。当一个区域服务器死亡并开始恢复时,主服务器将把这个信息推送到客户端应用程序,让他们立即切断连接,而不是等待超时。

Default

false

hbase.status.publisher.class
Description

Implementation of the status publication with a multicast message.

使用多播消息实现状态发布。

Default

org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher

org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher

hbase.status.listener.class
Description

Implementation of the status listener with a multicast message.

使用多播消息实现状态监听器。

Default

org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener

org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener

hbase.status.multicast.address.ip
Description

Multicast address to use for the status publication by multicast.

多播地址用于状态发布的多点广播。

Default

226.1.1.3

226.1.1.3

hbase.status.multicast.address.port
Description

Multicast port to use for the status publication by multicast.

多播端口用于状态发布的多播。

Default

16100

16100

hbase.dynamic.jars.dir
Description

The directory from which the custom filter JARs can be loaded dynamically by the region server without the need to restart. However, an already loaded filter/co-processor class would not be un-loaded. See HBASE-1936 for more details. Does not apply to coprocessors.

自定义筛选器jar可以由区域服务器动态加载的目录,而无需重新启动。但是,已经加载的过滤器/协同处理器类不会被卸载。参见HBASE-1936,了解更多细节。不适用于协处理器。

Default

${hbase.rootdir}/lib

${hbase.rootdir}/lib

hbase.security.authentication
Description

Controls whether or not secure authentication is enabled for HBase. Possible values are 'simple' (no authentication), and 'kerberos'.

控制是否为HBase启用了安全身份验证。可能的值是“简单的”(没有身份验证)和“kerberos”。

Default

simple

简单的
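
As a rough sketch, enabling Kerberos authentication on a RegionServer combines this property with the keytab and principal settings described earlier in this list. The principal and keytab path shown here are placeholders to be replaced with your own realm and file locations.

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.regionserver.kerberos.principal</name>
  <value>hbase/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hbase.regionserver.keytab.file</name>
  <value>/etc/hbase/conf/hbase.keytab</value>
</property>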

hbase.rest.filter.classes
Description

Servlet filters for REST service.

Servlet过滤器用于REST服务。

Default

org.apache.hadoop.hbase.rest.filter.GzipFilter

org.apache.hadoop.hbase.rest.filter.GzipFilter

hbase.master.loadbalancer.class
Description

Class used to execute the regions balancing when the period occurs. See the class comment for more on how it works http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html It replaces the DefaultLoadBalancer as the default (since renamed as the SimpleLoadBalancer).

当周期发生时,用于执行区域平衡的类。请参阅类评论,以了解它如何工作的:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/hbase/master/balancer/stochasticloadbalancer.html以默认的方式替换DefaultLoadBalancer(因为它被重新命名为SimpleLoadBalancer)。

Default

org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer

org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer

hbase.master.loadbalance.bytable
Description

Factor Table name when the balancer runs. Default: false.

当平衡器运行时,元素表名。默认值:false。

Default

false

hbase.master.normalizer.class
Description

Class used to execute the region normalization when the period occurs. See the class comment for more on how it works http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/normalizer/SimpleRegionNormalizer.html

当周期发生时,用于执行区域标准化的类。请参阅类评论,了解更多关于它如何工作的信息。

Default

org.apache.hadoop.hbase.master.normalizer.SimpleRegionNormalizer

org.apache.hadoop.hbase.master.normalizer.SimpleRegionNormalizer

hbase.rest.csrf.enabled
Description

Set to true to enable protection against cross-site request forgery (CSRF)

设置为真,以防止跨站请求伪造(CSRF)

Default

false

hbase.rest-csrf.browser-useragents-regex
Description

A comma-separated list of regular expressions used to match against an HTTP request’s User-Agent header when protection against cross-site request forgery (CSRF) is enabled for REST server by setting hbase.rest.csrf.enabled to true. If the incoming User-Agent matches any of these regular expressions, then the request is considered to be sent by a browser, and therefore CSRF prevention is enforced. If the request’s User-Agent does not match any of these regular expressions, then the request is considered to be sent by something other than a browser, such as scripted automation. In this case, CSRF is not a potential attack vector, so the prevention is not enforced. This helps achieve backwards-compatibility with existing automation that has not been updated to send the CSRF prevention header.

用于与HTTP请求的用户代理头匹配的正则表达式列表,当保护针对跨站请求(CSRF)时,通过设置hbase.rest.csrf为REST服务器启用。启用为true。如果传入的用户代理与这些正则表达式相匹配,则会认为该请求是由浏览器发送的,因此将执行CSRF预防。如果请求的用户代理不匹配这些正则表达式,那么请求将被认为是由浏览器以外的其他东西发送的,比如脚本化的自动化。在这种情况下,CSRF并不是一个潜在的攻击向量,因此不强制执行。这有助于实现向后兼容现有的自动化,而目前的自动化还没有更新来发送CSRF预防报头。

Default

Mozilla.,Opera.

Mozilla.,Opera.

hbase.security.exec.permission.checks
Description

If this setting is enabled and ACL based access control is active (the AccessController coprocessor is installed either as a system coprocessor or on a table as a table coprocessor) then you must grant all relevant users EXEC privilege if they require the ability to execute coprocessor endpoint calls. EXEC privilege, like any other permission, can be granted globally to a user, or to a user on a per table or per namespace basis. For more information on coprocessor endpoints, see the coprocessor section of the HBase online manual. For more information on granting or revoking permissions using the AccessController, see the security section of the HBase online manual.

如果启用了这个设置,并且基于ACL的访问控制是活动的(AccessController协处理器是作为一个系统协处理器或作为表协处理器安装的),那么如果它们需要执行协处理器端点调用的能力,则必须授予所有相关用户EXEC特权。与任何其他权限一样,EXEC特权可以在全局范围内授予用户,也可以在每个表或每个名称空间的基础上授予用户。有关协处理器端点的更多信息,请参阅HBase在线手册的协处理器部分。有关使用AccessController授予或撤销权限的更多信息,请参见HBase在线手册的安全部分。

Default

false

hbase.procedure.regionserver.classes
Description

A comma-separated list of org.apache.hadoop.hbase.procedure.RegionServerProcedureManager procedure managers that are loaded by default on the active HRegionServer process. The lifecycle methods (init/start/stop) will be called by the active HRegionServer process to perform the specific globally barriered procedure. After implementing your own RegionServerProcedureManager, just put it in HBase’s classpath and add the fully qualified class name here.

一个逗号分隔的org.apache.hadoop.hbase.procedure。在活动的h分区服务器进程中默认加载的区域服务器过程管理程序管理器。生命周期方法(init/start/stop)将由活动的h分区服务器进程调用,以执行特定的全局隔离过程。在实现您自己的区域服务器过程管理器之后,只需将它放入HBase的类路径中,并在这里添加完全限定的类名。

Default

none

没有一个

hbase.procedure.master.classes
Description

A comma-separated list of org.apache.hadoop.hbase.procedure.MasterProcedureManager procedure managers that are loaded by default on the active HMaster process. A procedure is identified by its signature and users can use the signature and an instant name to trigger an execution of a globally barriered procedure. After implementing your own MasterProcedureManager, just put it in HBase’s classpath and add the fully qualified class name here.

一个逗号分隔的org.apache.hadoop.hbase.procedure。在活动的HMaster进程中默认加载的主进程管理程序管理器。程序由其签名标识,用户可以使用签名和即时名称来触发一个全局隔离过程的执行。在实现了自己的masterprocessduremanager之后,只需将它放入HBase的类路径中,并在这里添加完全限定的类名。

Default

none

没有一个

hbase.coordinated.state.manager.class
Description

Fully qualified name of class implementing coordinated state manager.

执行协调状态管理器类的全限定名。

Default

org.apache.hadoop.hbase.coordination.ZkCoordinatedStateManager

org.apache.hadoop.hbase.coordination.ZkCoordinatedStateManager

hbase.regionserver.storefile.refresh.period
Description

The period (in milliseconds) for refreshing the store files for the secondary regions. 0 means this feature is disabled. Secondary regions sees new files (from flushes and compactions) from primary once the secondary region refreshes the list of files in the region (there is no notification mechanism). But too frequent refreshes might cause extra Namenode pressure. If the files cannot be refreshed for longer than HFile TTL (hbase.master.hfilecleaner.ttl) the requests are rejected. Configuring HFile TTL to a larger value is also recommended with this setting.

用于刷新次要区域的存储文件的周期(以毫秒为单位)。0表示此功能被禁用。次要区域在次要区域刷新该区域的文件列表(没有通知机制)时,会从主区域看到新的文件(从刷新和压缩)。但是频繁的刷新可能会造成额外的Namenode压力。如果文件不能刷新的时间超过HFile TTL (hbase.master.hfilecleaner.ttl),请求就会被拒绝。将HFile TTL配置为更大的值也建议使用此设置。

Default

0

0

hbase.region.replica.replication.enabled
Description

Whether asynchronous WAL replication to the secondary region replicas is enabled or not. If this is enabled, a replication peer named "region_replica_replication" will be created which will tail the logs and replicate the mutations to region replicas for tables that have region replication > 1. If this is enabled once, disabling this replication also requires disabling the replication peer using shell or Admin java class. Replication to secondary region replicas works over standard inter-cluster replication.

是否启用了对辅助区域副本的异步复制。如果启用了这个功能,将创建一个名为“region_replica_replication”的复制节点,它将跟踪日志,并将这些突变复制到具有区域复制> 1的表的区域副本中。如果启用此功能,禁用此复制还需要使用shell或Admin java类禁用复制对等项。复制到次要区域副本的工作超过标准的集群复制。

Default

false

hbase.http.filter.initializers
Description

A comma separated list of class names. Each class in the list must extend org.apache.hadoop.hbase.http.FilterInitializer. The corresponding Filter will be initialized. Then, the Filter will be applied to all user facing jsp and servlet web pages. The ordering of the list defines the ordering of the filters. The default StaticUserWebFilter add a user principal as defined by the hbase.http.staticuser.user property.

一个逗号分隔的类名列表。列表中的每个类都必须扩展org.apache.hadoop.hbase.http.FilterInitializer。相应的过滤器将被初始化。然后,该过滤器将应用于所有面向jsp和servlet web页面的用户。列表的顺序定义了过滤器的顺序。默认StaticUserWebFilter添加一个由hbase.http.staticuser定义的用户主体。用户属性。

Default

org.apache.hadoop.hbase.http.lib.StaticUserWebFilter

org.apache.hadoop.hbase.http.lib.StaticUserWebFilter

hbase.security.visibility.mutations.checkauths
Description

This property if enabled, will check whether the labels in the visibility expression are associated with the user issuing the mutation

如果启用此属性,将检查可见性表达式中的标签是否与发出该突变的用户关联。

Default

false

hbase.http.max.threads
Description

The maximum number of threads that the HTTP Server will create in its ThreadPool.

HTTP服务器将在其ThreadPool中创建的最大线程数。

Default

16

16

hbase.replication.rpc.codec
Description

The codec that is to be used when replication is enabled so that the tags are also replicated. This is used along with HFileV3 which supports tags in them. If tags are not used or if the hfile version used is HFileV2 then KeyValueCodec can be used as the replication codec. Note that using KeyValueCodecWithTags for replication when there are no tags causes no harm.

在启用复制时要使用的编解码器,以便复制标记。这与HFileV3一起使用,HFileV3支持这些标签。如果没有使用标记,或者使用的hfile版本是HFileV2,那么KeyValueCodec可以作为复制编解码器。注意,在没有标记的情况下,使用KeyValueCodecWithTags进行复制不会造成任何损害。

Default

org.apache.hadoop.hbase.codec.KeyValueCodecWithTags

org.apache.hadoop.hbase.codec.KeyValueCodecWithTags

hbase.replication.source.maxthreads
Description

The maximum number of threads any replication source will use for shipping edits to the sinks in parallel. This also limits the number of chunks each replication batch is broken into. Larger values can improve the replication throughput between the master and slave clusters. The default of 10 will rarely need to be changed.

任何复制源的最大线程数将用于并行地对接收器进行编辑。这也限制了每个复制批被分解成的块的数量。更大的值可以提高主集群和从属集群之间的复制吞吐量。默认的10将很少需要更改。

Default

10

10

hbase.http.staticuser.user
Description

The user name to filter as, on static web filters while rendering content. An example use is the HDFS web UI (user to be used for browsing files).

在呈现内容时,将用户名过滤为静态web过滤器。示例使用的是HDFS web UI(用于浏览文件的用户)。

Default

dr.stack

dr.stack

hbase.regionserver.handler.abort.on.error.percent
Description

The percent of region server RPC threads that can fail before the RS aborts. -1 Disable aborting; 0 Abort if even a single handler has died; 0.x Abort only when this percent of handlers have died; 1 Abort only when all of the handlers have died.

当region server的RPC线程失败达到该比例时终止RS。-1表示禁止终止;0表示只要有一个处理线程死亡就终止;0.x表示只有当该比例的处理线程死亡时才终止;1表示只有当所有处理线程都死亡时才终止。

Default

0.5

0.5

hbase.mob.file.cache.size
Description

Number of opened file handlers to cache. A larger value will benefit reads by providing more file handlers per mob file cache and would reduce frequent file opening and closing. However, if this is set too high, this could lead to a "too many opened file handlers" The default value is 1000.

打开的文件处理程序的数量。一个更大的值将通过为每个mob文件缓存提供更多的文件处理程序而受益,并减少频繁的文件打开和关闭。但是,如果设置得太高,这可能导致“太多打开的文件处理程序”,默认值是1000。

Default

1000

1000

hbase.mob.cache.evict.period
Description

The amount of time in seconds before the mob cache evicts cached mob files. The default value is 3600 seconds.

在暴民缓存清除缓存的mob文件之前的几秒钟时间。默认值为3600秒。

Default

3600

3600

hbase.mob.cache.evict.remain.ratio
Description

The ratio (between 0.0 and 1.0) of files that remains cached after an eviction is triggered when the number of cached mob files exceeds the hbase.mob.file.cache.size. The default value is 0.5f.

当缓存的mob文件的数量超过hbase.mob.file.cache.size时,在被驱逐后缓存的文件的比率(介于0.0和1.0之间)将被触发。默认值是0。5f。

Default

0.5f

0.5f

hbase.master.mob.ttl.cleaner.period
Description

The period that ExpiredMobFileCleanerChore runs. The unit is second. The default value is one day. The MOB file name uses only the date part of the file creation time in it. We use this time for deciding TTL expiry of the files. So the removal of TTL expired files might be delayed. The max delay might be 24 hrs.

结束mobfilecleanerchore运行的期间。单位是秒。默认值是一天。MOB文件名称只使用文件创建时间的日期部分。我们使用这个时间来决定文件的TTL过期。因此,删除TTL过期的文件可能会被延迟。最大延迟可能是24小时。

Default

86400

86400

hbase.mob.compaction.mergeable.threshold
Description

If the size of a mob file is less than this value, it’s regarded as a small file and needs to be merged in mob compaction. The default value is 1280MB.

如果一个mob文件的大小小于这个值,它就被认为是一个小文件,需要合并到mob compaction中。默认值是1280MB。

Default

1342177280

1342177280

hbase.mob.delfile.max.count
Description

The max number of del files that is allowed in the mob compaction. In the mob compaction, when the number of existing del files is larger than this value, they are merged until number of del files is not larger this value. The default value is 3.

在暴民压缩中允许的del文件的最大数量。在mob compaction中,当现有del文件的数量大于这个值时,它们会被合并,直到del文件的数量不大于这个值。默认值是3。

Default

3

3

hbase.mob.compaction.batch.size
Description

The max number of the mob files that is allowed in a batch of the mob compaction. The mob compaction merges the small mob files to bigger ones. If the number of the small files is very large, it could lead to a "too many opened file handlers" in the merge. And the merge has to be split into batches. This value limits the number of mob files that are selected in a batch of the mob compaction. The default value is 100.

mob文件的最大数量允许在一组暴民压缩。暴徒的密实把小暴徒的文件合并成大的。如果小文件的数量很大,那么在合并中可能会导致“太多打开的文件处理程序”。合并必须分批进行。这个值限制了mob文件中被选中的mob文件的数量。默认值是100。

Default

100

100

hbase.mob.compaction.chore.period
Description

The period that MobCompactionChore runs. The unit is second. The default value is one week.

MobCompactionChore运行的周期。单位是秒。默认值是一个星期。

Default

604800

604800

hbase.mob.compactor.class
Description

Implementation of mob compactor, the default one is PartitionedMobCompactor.

实现了mob compactor,默认的是PartitionedMobCompactor。

Default

org.apache.hadoop.hbase.mob.compactions.PartitionedMobCompactor

org.apache.hadoop.hbase.mob.compactions.PartitionedMobCompactor

hbase.mob.compaction.threads.max
Description

The max number of threads used in MobCompactor.

MobCompactor中使用的最大线程数。

Default

1

1

hbase.snapshot.master.timeout.millis
Description

Timeout for master for the snapshot procedure execution.

用于快照过程执行的主超时。

Default

300000

300000

hbase.snapshot.region.timeout
Description

Timeout for regionservers to keep threads in snapshot request pool waiting.

区域服务器的超时,以便在快照请求池中保持线程等待。

Default

300000

300000

hbase.rpc.rows.warning.threshold
Description

Number of rows in a batch operation above which a warning will be logged.

在上面的批处理操作中,将记录一个警告的行数。

Default

5000

5000

hbase.master.wait.on.service.seconds
Description

Default is 5 minutes. Make it 30 seconds for tests. See HBASE-19794 for some context.

默认是5分钟。做30秒的测试。参见HBASE-19794,了解一些上下文。

Default

30

30

7.3. hbase-env.sh

7.3。hbase-env.sh

Set HBase environment variables in this file. Examples include options to pass the JVM on start of an HBase daemon such as heap size and garbage collector configs. You can also set configurations for HBase configuration, log directories, niceness, ssh options, where to locate process pid files, etc. Open the file at conf/hbase-env.sh and peruse its content. Each option is fairly well documented. Add your own environment variables here if you want them read by HBase daemons on startup.

在此文件中设置HBase环境变量。示例包括在HBase守护进程(如堆大小和垃圾收集器configs)启动时传递JVM的选项。您还可以为HBase配置、日志目录、niceness、ssh选项、在何处定位进程pid文件等设置配置。并仔细阅读它的内容。每个选项都有相当详细的文档。如果您想让HBase守护进程在启动时读取它们,那么在这里添加您自己的环境变量。

Changes here will require a cluster restart for HBase to notice the change.

这里的更改将要求HBase重新启动集群,以注意更改。

7.4. log4j.properties

7.4。log4j . properties

Edit this file to change rate at which HBase files are rolled and to change the level at which HBase logs messages.

编辑此文件以更改HBase文件的滚动速度,并更改HBase日志消息的级别。

Changes here will require a cluster restart for HBase to notice the change though log levels can be changed for particular daemons via the HBase UI.

这里的更改将要求HBase重新启动集群,以注意更改,但是可以通过HBase UI为特定的守护进程更改日志级别。
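For example, assuming the stock conf/log4j.properties layout that ships with HBase, raising the HBase logger to DEBUG and capping the rolled file size might look like the lines below; the variable names are taken from the shipped file, so adjust them to match your copy.

log4j.logger.org.apache.hadoop.hbase=DEBUG
hbase.log.maxfilesize=256MB
hbase.log.maxbackupindex=20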

7.5. Client configuration and dependencies connecting to an HBase cluster

7.5。连接到HBase集群的客户端配置和依赖关系。

If you are running HBase in standalone mode, you don’t need to configure anything for your client to work provided that they are all on the same machine.

如果在独立模式下运行HBase,则不需要为客户机配置任何东西,前提是它们都在同一台机器上。

Since the HBase Master may move around, clients bootstrap by looking to ZooKeeper for current critical locations. ZooKeeper is where all these values are kept. Thus clients require the location of the ZooKeeper ensemble before they can do anything else. Usually this ensemble location is kept out in the hbase-site.xml and is picked up by the client from the CLASSPATH.

由于HBase主机可以移动,客户端通过查找当前关键位置的ZooKeeper来引导。ZooKeeper是所有这些值保存的地方。因此,客户需要在他们可以做任何其他事情之前,先将ZooKeeper集成在一起。通常这个集成位置在hbase站点中被保留。xml并由来自类路径的客户机接收。

If you are configuring an IDE to run an HBase client, you should include the conf/ directory on your classpath so hbase-site.xml settings can be found (or add src/test/resources to pick up the hbase-site.xml used by tests).

如果您正在配置一个IDE来运行一个HBase客户端,那么您应该在类路径上包含conf/目录,这样HBase -site就可以了。可以找到xml设置(或者添加src/test/resources来获取hbase站点。测试所使用的xml)。

Minimally, an HBase client needs hbase-client module in its dependencies when connecting to a cluster:

在连接到集群时,HBase客户端需要HBase -client模块。

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.2.4</version>
</dependency>

A basic example hbase-site.xml for client only may look as follows:

一个基本的例子hbase-site。客户端的xml可能如下图所示:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>The directory shared by region servers.
    </description>
  </property>
</configuration>

7.5.1. Java client configuration

7.5.1。Java客户端配置

The configuration used by a Java client is kept in an HBaseConfiguration instance.

Java客户端使用的配置保存在HBaseConfiguration实例中。

The factory method on HBaseConfiguration, HBaseConfiguration.create();, on invocation, will read in the content of the first hbase-site.xml found on the client’s CLASSPATH, if one is present (Invocation will also factor in any hbase-default.xml found; an hbase-default.xml ships inside the hbase.X.X.X.jar). It is also possible to specify configuration directly without having to read from a hbase-site.xml. For example, to set the ZooKeeper ensemble for the cluster programmatically do as follows:

HBaseConfiguration的工厂方法,hbaseconfigur. create();在调用时,将读取第一个hbase站点的内容。在客户机的类路径上发现的xml(如果有的话)(调用也会导致任何hbase-default)。xml发现;一个hbase-default。xml在hbase.X.X.X.jar中。还可以直接指定配置,而不必从hbase-site.xml读取。例如,要以编程方式设置集群的ZooKeeper集合,如下所示:

Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost");  // Here we are running zookeeper locally

If multiple ZooKeeper instances make up your ZooKeeper ensemble, they may be specified in a comma-separated list (just as in the hbase-site.xml file). This populated Configuration instance can then be passed to an Table, and so on.

如果多个ZooKeeper实例组成了您的ZooKeeper集合,它们可以在逗号分隔的列表中指定(就像在hbase站点中一样)。xml文件)。然后可以将这个填充的配置实例传递给一个表,等等。
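
To make that hand-off concrete, here is a minimal sketch (the quorum hosts and table name are examples only) of passing the populated Configuration on to a Table through the standard client API:

// Connection, ConnectionFactory and Table are in org.apache.hadoop.hbase.client;
// HBaseConfiguration and TableName are in org.apache.hadoop.hbase;
// Configuration is org.apache.hadoop.conf.Configuration.
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "example1,example2,example3");
try (Connection connection = ConnectionFactory.createConnection(config);
     Table table = connection.getTable(TableName.valueOf("test_table"))) {
  // Issue gets, puts and scans against the table here.
}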

8. Example Configurations

8。示例配置

8.1. Basic Distributed HBase Install

8.1。基本的分布式HBase安装

Here is a basic configuration example for a distributed ten node cluster:

* The nodes are named example0, example1, etc., through node example9 in this example.
* The HBase Master and the HDFS NameNode are running on the node example0.
* RegionServers run on nodes example1-example9.
* A 3-node ZooKeeper ensemble runs on example1, example2, and example3 on the default ports.
* ZooKeeper data is persisted to the directory /export/zookeeper.

下面是分布式10节点集群的一个基本配置示例:*节点被命名为example0, example1,等等,在本例中通过节点example9。* HBase主机和HDFS NameNode在节点example0上运行。*区域服务器运行在节点example1-example9。*一个3节点的ZooKeeper集合在默认端口上运行在example1、example2和example3上。* ZooKeeper数据保存到目录/导出/ ZooKeeper。

Below we show what the main configuration files — hbase-site.xml, regionservers, and hbase-env.sh — found in the HBase conf directory might look like.

下面我们将展示主配置文件- hbase-site。xml、regionservers和hbase-env。在HBase conf目录中找到的sh可能是这样的。

8.1.1. hbase-site.xml

8.1.1。hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>example1,example2,example3</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/export/zookeeper</value>
    <description>Property from ZooKeeper config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://example0:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed ZooKeeper
      true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)
    </description>
  </property>
</configuration>

8.1.2. regionservers

8.1.2。regionservers

In this file you list the nodes that will run RegionServers. In our case, these nodes are example1-example9.

在这个文件中,您将列出将运行区域服务器的节点。在我们的例子中,这些节点是example1-example9。

example1
example2
example3
example4
example5
example6
example7
example8
example9

8.1.3. hbase-env.sh

8.1.3。hbase-env.sh

The following lines in the hbase-env.sh file show how to set the JAVA_HOME environment variable (required for HBase) and set the heap to 4 GB (rather than the default value of 1 GB). If you copy and paste this example, be sure to adjust the JAVA_HOME to suit your environment.

hbase-env中的以下几行。sh文件显示了如何设置JAVA_HOME环境变量(HBase所需),并将堆设置为4 GB(而不是默认值为1 GB)。如果您复制并粘贴这个示例,请确保调整JAVA_HOME以适应您的环境。

# The java implementation to use.
export JAVA_HOME=/usr/java/jdk1.8.0/

# The maximum amount of heap to use. Default is left to JVM default.
export HBASE_HEAPSIZE=4G

Use rsync to copy the content of the conf directory to all nodes of the cluster.

使用rsync将conf目录的内容复制到集群的所有节点。

9. The Important Configurations

9。重要的配置

Below we list some important configurations. We’ve divided this section into required configuration and worth-a-look recommended configs.

下面列出一些重要的配置。我们将此部分划分为所需的配置和值得推荐的配置。

9.1. Required Configurations

9.1。需要配置

Review the os and hadoop sections.

检查操作系统和hadoop部分。

9.1.1. Big Cluster Configurations

9.1.1。大集群配置

If you have a cluster with a lot of regions, it is possible that a Regionserver checks in briefly after the Master starts while all the remaining RegionServers lag behind. This first server to check in will be assigned all regions which is not optimal. To prevent the above scenario from happening, up the hbase.master.wait.on.regionservers.mintostart property from its default value of 1. See HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments for more detail.

如果您的集群中有很多region,那么在Master启动后,可能会有一个RegionServer先短暂签到,而其余的RegionServer滞后。这第一个签到的服务器将被分配所有region,这并不是最优的。为了防止上述情况发生,请将 hbase.master.wait.on.regionservers.mintostart 属性从其默认值1调高。更多细节请参见 HBASE-6389。
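
For example, on a large cluster you might raise the property in hbase-site.xml so that assignment only begins after several RegionServers have checked in; the value 3 below is purely illustrative.

<property>
  <name>hbase.master.wait.on.regionservers.mintostart</name>
  <value>3</value>
</property>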

zookeeper.session.timeout
zookeeper.session.timeout

The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery. You might need to tune the timeout down to a minute or even less so the Master notices failures sooner. Before changing this value, be sure you have your JVM garbage collection configuration under control, otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer. (You might be fine with this — you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time).

默认超时为3分钟(以毫秒为单位)。这意味着,如果服务器崩溃,将在主人注意到崩溃并开始恢复之前三分钟。您可能需要将超时时间调到一分钟或更短,这样大师就会更快地注意到故障。在更改此值之前,请确保您的JVM垃圾收集配置处于控制之下,否则,长时间的垃圾收集将超出ZooKeeper会话超时,将会占用您的区域服务器。(您可能对此很满意——如果一个区域性服务器长期处于GC状态,那么您可能希望从服务器开始恢复)。

To change this configuration, edit hbase-site.xml, copy the changed file across the cluster and restart.

要更改此配置,请编辑hbase-site。xml,在集群中复制已更改的文件并重新启动。
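
As an example of such an edit, lowering the session timeout to one minute would look like this in hbase-site.xml; 60000 ms is illustrative and should be weighed against the GC behaviour discussed above.

<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>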

We set this value high to save our having to field questions up on the mailing lists asking why a RegionServer went down during a massive import. The usual cause is that their JVM is untuned and they are running into long GC pauses. Our thinking is that while users are getting familiar with HBase, we’d save them having to know all of its intricacies. Later when they’ve built some confidence, then they can play with configuration such as this.

我们将这个值设得较高,以免在邮件列表上反复回答为什么在大量导入期间RegionServer会宕机的问题。通常的原因是他们的JVM没有调优,遇到了长时间的GC停顿。我们的想法是,在用户逐渐熟悉HBase的阶段,省得他们必须了解其所有的复杂细节。等他们建立起一些信心之后,就可以再来调整这样的配置。

Number of ZooKeeper Instances
动物园管理员实例

See zookeeper.

看到动物园管理员。

9.2.2. HDFS Configurations

9.2.2。HDFS配置

dfs.datanode.failed.volumes.tolerated
dfs.datanode.failed.volumes.tolerated

This is the "…​number of volumes that are allowed to fail before a DataNode stops offering service. By default any volume failure will cause a datanode to shutdown" from the hdfs-default.xml description. You might want to set this to about half the amount of your available disks.

这是“……在DataNode停止提供服务之前允许失败的卷的数量”。默认情况下,任何容量失败都会导致datanode关闭“从hdfs-default”。xml描述。您可能希望将其设置为可用磁盘的一半。
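
For instance, on DataNodes with 12 data disks, following the half-of-available-disks rule of thumb would mean something like the hdfs-site.xml fragment below; the value is an example, not a recommendation for every disk layout.

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>6</value>
</property>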

hbase.regionserver.handler.count
hbase.regionserver.handler.count

This setting defines the number of threads that are kept open to answer incoming requests to user tables. The rule of thumb is to keep this number low when the payload per request approaches the MB (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The total size of the queries in progress is limited by the setting hbase.ipc.server.max.callqueue.size.

该设置定义了保持打开的线程数,以响应对用户表的传入请求。经验法则是,当每个请求的有效负载接近MB(大容量、扫描使用大缓存)和高负载时(获取、小put、ICVs、删除)时,保持这个数字低。在进程中查询的总大小受设置hbase.ipc.server.max.callqueue.size的限制。

It is safe to set that number to the maximum number of incoming clients if their payload is small, the typical example being a cluster that serves a website since puts aren’t typically buffered and most of the operations are gets.

如果它们的有效负载很小,那么将这个数字设置为最大的传入客户端是安全的,典型的例子是一个服务于一个网站的集群,因为它不是典型的缓冲,而且大多数操作都是得到的。

The reason why it is dangerous to keep this setting high is that the aggregate size of all the puts that are currently happening in a region server may impose too much pressure on its memory, or even trigger an OutOfMemoryError. A RegionServer running on low memory will trigger its JVM’s garbage collector to run more frequently up to a point where GC pauses become noticeable (the reason being that all the memory used to keep all the requests' payloads cannot be trashed, no matter how hard the garbage collector tries). After some time, the overall cluster throughput is affected since every request that hits that RegionServer will take longer, which exacerbates the problem even more.

保持这个设置的高度危险的原因是,当前在区域服务器上发生的所有put的聚合大小可能会对其内存施加太大的压力,甚至引发OutOfMemoryError错误。运行在低内存上的区域服务器将触发JVM的垃圾收集器,以便更频繁地运行GC暂停变得明显的点(原因是,无论垃圾收集器如何努力,所有用于保存所有请求的内存的内存都不能被丢弃)。经过一段时间之后,整个集群的吞吐量都会受到影响,因为每个对该区域服务器的请求都将花费更长的时间,这将使问题更加严重。

You can get a sense of whether you have too little or too many handlers by rpc.logging on an individual RegionServer then tailing its logs (Queued requests consume memory).

通过rpc,您可以了解您是否拥有太多或太多的处理程序。在单个区域服务器上进行日志记录,然后跟踪其日志(队列请求消耗内存)。
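
A hedged example: a cluster serving mostly small gets and puts might raise the handler count in hbase-site.xml along the lines below. The value 100 is illustrative only; watch memory pressure and the RPC logs as described above before settling on a number.

<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>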

9.2.3. Configuration for large memory machines

9.2.3。大型内存机器的配置。

HBase ships with a reasonable, conservative configuration that will work on nearly all machine types that people might want to test with. If you have larger machines — HBase has 8G and larger heap — you might find the following configuration options helpful. TODO.

HBase有一个合理的、保守的配置,它将在几乎所有人们想要测试的机器类型上工作。如果您有更大的机器——HBase有8G和更大的堆——您可能会发现以下配置选项是有帮助的。待办事项。

9.2.4. Compression

9.2.4。压缩

You should consider enabling ColumnFamily compression. There are several options that are near-frictionless and in most all cases boost performance by reducing the size of StoreFiles and thus reducing I/O.

您应该考虑启用ColumnFamily压缩。有几种几乎无摩擦的选项,在大多数情况下,通过减小StoreFiles的大小来提高性能,从而减少I/O。

See compression for more information.

有关更多信息,请参见压缩。

9.2.5. Configuring the size and number of WAL files

9.2.5。配置WAL - files的大小和数量。

HBase uses the WAL to recover the memstore data that has not been flushed to disk in case of an RS failure. These WAL files should be configured to be slightly smaller than the HDFS block (by default a HDFS block is 64Mb and a WAL file is ~60Mb).

HBase使用wal来恢复在RS失败时没有刷新到磁盘的memstore数据。这些WAL - file应该配置为比HDFS块稍微小一点(默认情况下,HDFS块是64Mb,而一个WAL - file是~60Mb)。

HBase also has a limit on the number of WAL files, designed to ensure there’s never too much data that needs to be replayed during recovery. This limit needs to be set according to memstore configuration, so that all the necessary data would fit. It is recommended to allocate enough WAL files to store at least that much data (when all memstores are close to full). For example, with 16Gb RS heap, default memstore settings (0.4), and default WAL file size (~60Mb), 16Gb*0.4/60, the starting point for WAL file count is ~109. However, as all memstores are not expected to be full all the time, less WAL files can be allocated.

HBase还对WAL文件的数量有限制,旨在确保在恢复期间不会有太多数据需要重放。这个限制需要根据memstore配置来设置,以便容纳所有必要的数据。建议分配足够的WAL文件,至少能存储这么多数据(当所有memstore都接近满时)。例如,对于16Gb的RS堆、默认的memstore设置(0.4)和默认的WAL文件大小(~60Mb),WAL文件数量的起始点约为 16Gb*0.4/60 ≈ 109。然而,由于并非所有memstore都会一直处于满的状态,因此可以分配更少的WAL文件。
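
On 1.x releases this limit is typically governed by hbase.regionserver.maxlogs (the property name is an assumption about your version; later releases compute the limit automatically), so a sketch of applying the ~109 calculation above, rounded down because memstores are rarely all full at once, might be:

<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>100</value>
</property>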

9.2.6. Managed Splitting

9.2.6。管理的分裂

HBase generally handles splitting of your regions based upon the settings in your hbase-default.xml and hbase-site.xml configuration files. Important settings include hbase.regionserver.region.split.policy, hbase.hregion.max.filesize, hbase.regionserver.regionSplitLimit. A simplistic view of splitting is that when a region grows to hbase.hregion.max.filesize, it is split. For most usage patterns, you should use automatic splitting. See manual region splitting decisions for more information about manual region splitting.

HBase通常根据 hbase-default.xml 和 hbase-site.xml 配置文件中的设置来处理region的分割。重要的设置包括 hbase.regionserver.region.split.policy、hbase.hregion.max.filesize 和 hbase.regionserver.regionSplitLimit。一个简化的理解是:当一个region增长到 hbase.hregion.max.filesize 时,它就会被分割。对于大多数使用模式,您应该使用自动分割。有关手动region分割的更多信息,请参见手动region分割决策。

Instead of allowing HBase to split your regions automatically, you can choose to manage the splitting yourself. This feature was added in HBase 0.90.0. Manually managing splits works if you know your keyspace well, otherwise let HBase figure where to split for you. Manual splitting can mitigate region creation and movement under load. It also makes it so region boundaries are known and invariant (if you disable region splitting). If you use manual splits, it is easier doing staggered, time-based major compactions to spread out your network IO load.

不允许HBase自动分割区域,您可以选择管理您自己。该特性在HBase 0.90.0中添加。如果你知道你的关键空间,手动管理分拆工作,否则让HBase数据为你分配。手动拆分可以减轻区域的创建和负载下的移动。它也使得区域边界是已知的和不变的(如果你禁用区域分割)。如果您使用手动分割,则更容易进行交错的、基于时间的主要压缩来扩展您的网络IO负载。

Disable Automatic Splitting

To disable automatic splitting, you can set region split policy in either cluster configuration or table configuration to be org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy

为了禁用自动分割,您可以在集群配置或表配置中设置区域分割策略。
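For example, a cluster-wide setting in hbase-site.xml might look like the following sketch; the same policy class can instead be set per table via the table descriptor if you only want to disable splitting for specific tables.

<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy</value>
  <description>Disable automatic region splitting cluster-wide.</description>
</property>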

Automatic Splitting Is Recommended

If you disable automatic splits to diagnose a problem or during a period of fast data growth, it is recommended to re-enable them when your situation becomes more stable. The potential benefits of managing region splits yourself are not undisputed.

如果您禁用自动拆分来诊断问题或在快速数据增长期间,建议在您的情况变得更稳定时重新启用它们。管理地区的潜在好处并不是无可争议的。

Determine the Optimal Number of Pre-Split Regions

The optimal number of pre-split regions depends on your application and environment. A good rule of thumb is to start with 10 pre-split regions per server and watch as data grows over time. It is better to err on the side of too few regions and perform rolling splits later. The optimal number of regions depends upon the largest StoreFile in your region. The size of the largest StoreFile will increase with time if the amount of data grows. The goal is for the largest region to be just large enough that the compaction selection algorithm only compacts it during a timed major compaction. Otherwise, the cluster can be prone to compaction storms, with a large number of regions under compaction at the same time. It is important to understand that it is the data growth, not the manual split decision, that causes compaction storms.

预分割区域的最佳数量取决于您的应用程序和环境。一个很好的经验法则是,从每个服务器上的10个预分割区域开始,随着时间的推移,随着数据的增长,我们会注意到这一点。最好是在太过少的区域内犯错,之后再进行滚动分割。最优的区域数量取决于区域内最大的存储文件。如果数据量增加,最大的存储文件的大小会随着时间的增加而增加。目标是使最大的区域足够大,使得压实选择算法只在一个定时的主要压缩过程中进行压缩。否则,集群可能会在压缩的同时出现大量区域的压实风暴。重要的是要理解,数据增长会导致压实风暴,而不是手工拆分决策。

If the regions are split into too many large regions, you can increase the major compaction interval by configuring HConstants.MAJOR_COMPACTION_PERIOD. HBase 0.90 introduced org.apache.hadoop.hbase.util.RegionSplitter, which provides a network-IO-safe rolling split of all regions.

如果区域被分割为太多的大区域,您可以通过配置hconstant . major_compaction_period来增加主要的压缩时间间隔。HBase 0.90引入org.apache.hadoop.hbase.util。区域分割器,它提供了所有区域的网络安全滚动分割。
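As a hedged example of using RegionSplitter from the command line to create a pre-split table, the invocation below assumes a hypothetical table pre_split_table with column family f1 and 10 initial regions; the split algorithm (HexStringSplit here) should be chosen to match your row key distribution.

$ hbase org.apache.hadoop.hbase.util.RegionSplitter pre_split_table HexStringSplit -c 10 -f f1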

9.2.7. Managed Compactions

9.2.7。件管理

By default, major compactions are scheduled to run once in a 7-day period. Prior to HBase 0.96.x, major compactions were scheduled to happen once per day by default.

默认情况下,主要的compaction计划在7天的时间内运行一次。HBase 0.96之前。在默认情况下,主要的compaction每天都要发生一次。

If you need to control exactly when and how often major compaction runs, you can disable managed major compactions. See the entry for hbase.hregion.majorcompaction in the compaction.parameters table for details.

如果需要精确控制主压缩运行的时间和频率,则可以禁用管理的主要压缩。参见hbase.hregion的条目。majorcompaction压实。参数表细节。
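As a sketch of taking control of major compaction scheduling, setting the interval to 0 in hbase-site.xml turns off time-based major compactions; you would then trigger them yourself, for example from the HBase shell or a cron job, as discussed below.

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
  <description>0 disables time-based automatic major compactions.</description>
</property>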

Do Not Disable Major Compactions

Major compactions are absolutely necessary for StoreFile clean-up. Do not disable them altogether. You can run major compactions manually via the HBase shell or via the Admin API.

主要的压缩对于清理仓库是绝对必要的。不要完全禁用它们。您可以通过HBase shell或通过管理API手动运行主要的压缩操作。

For more information about compactions and the compaction file selection process, see compaction.

有关compaction和compaction文件选择过程的更多信息,请参见compaction。

9.2.8. Speculative Execution

9.2.8。投机执行

Speculative Execution of MapReduce tasks is on by default. For HBase clusters it is generally advised to turn off Speculative Execution at the system level unless you need it for a specific case, where it can be configured per-job. Set the properties mapreduce.map.speculative and mapreduce.reduce.speculative to false.

默认情况下,对于MapReduce任务的推测执行是在默认情况下进行的,对于HBase集群,通常建议在系统级关闭投机性的执行,除非您需要它用于特定的情况,在那里可以配置每个作业。mapreduce.map设置属性。投机和mapreduce.reduce。投机为false。
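For example, the two properties named above would be set like this in mapred-site.xml (or in the per-job configuration if you only want to disable speculation for HBase jobs):

<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>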

9.3. Other Configurations

9.3。其他配置

9.3.1. Balancer

设备上装。平衡器

The balancer is a periodic operation which is run on the master to redistribute regions on the cluster. It is configured via hbase.balancer.period and defaults to 300000 milliseconds (5 minutes).

平衡器是一个周期性的操作,在主服务器上运行,以在集群上重新分配区域。它是通过hbase.balancer配置的。期间和默认为300000(5分钟)。

See master.processes.loadbalancer for more information on the LoadBalancer.

看到master.processes。负载均衡器获取更多关于负载平衡器的信息。
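If you need to change how often the balancer runs, the period is set in hbase-site.xml; the 15-minute value below is only an illustration.

<property>
  <name>hbase.balancer.period</name>
  <value>900000</value>
  <description>Run the balancer every 15 minutes instead of the default 5 minutes.</description>
</property>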

9.3.2. Disabling Blockcache

9.3.2。禁用Blockcache

Do not turn off the block cache (you would do this by setting hfile.block.cache.size to zero). Doing so currently works out poorly, because the RegionServer will spend all its time loading HFile indices over and over again. If your working set is such that the block cache does you no good, at least size the block cache so that HFile indices stay in the cache (you can get a rough idea of the size you need by surveying the RegionServer UIs; you will see index block size accounted for near the top of the webpage).

不要关闭块缓存(您可以通过设置hfile.block.cache来完成它)。大小为零)。如果您这样做,我们现在做得不好,因为区域服务器将会一次又一次地加载HFile索引。如果您的工作集是这样的,那么块缓存对您没有好处,至少是大小块缓存,这样HFile索引就会停留在缓存中(您可以通过测量区域服务器UIs来了解您需要的大小);你会看到索引块大小在网页顶部附近。
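As a sketch of keeping a small but non-zero block cache so that HFile indices stay resident, the fraction 0.2 below is only an example; the default in recent versions is typically 0.4 of the heap.

<property>
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
  <description>Fraction of the heap given to the block cache. Do not set this to 0.</description>
</property>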

9.3.3. Nagle’s or the small package problem

9.3.3。Nagle或小包装问题。

If an occasional delay of around 40ms is seen in operations against HBase, try the Nagle's setting. For example, see the user mailing list thread, Inconsistent scan performance with caching set to 1, and the issue cited therein, where setting notcpdelay improved scan speeds. You might also see the graphs at the tail of HBASE-7008 Set scanner caching to a better default, where Lars Hofhansl tries various data sizes with Nagle's on and off, measuring the effect.

如果在HBase的操作中出现了40毫秒左右的延迟,试试Nagles的设置。例如,查看用户邮件列表线程,不一致的扫描性能和缓存设置为1,以及在其中所引用的设置notcpdelay提高扫描速度的问题。您还可以看到在HBASE-7008设置扫描高速缓存的尾部的图形,使我们的Lars Hofhansl尝试各种数据大小w/ Nagle的on和off测量结果。

9.3.4. Better Mean Time to Recover (MTTR)

9.3.4。更好的平均恢复时间(MTTR)

This section is about configurations that will make servers come back faster after a failure. See the Deveraj Das and Nicolas Liochon blog post Introduction to HBase Mean Time to Recover (MTTR) for a brief introduction.

本节讨论的是在失败后服务器恢复得更快的配置。参见Deveraj Das和Nicolas Liochon的博客文章介绍HBase平均时间恢复(MTTR)进行简要介绍。

The issue HBASE-8354 forces Namenode into loop with lease recovery requests is messy, but it has a bunch of good discussion toward the end on low timeouts and how to cause faster recovery, including citation of fixes added to HDFS. Read the Varun Sharma comments. The suggested configurations below are Varun's suggestions distilled and tested. Make sure you are running on a late-version HDFS so you have the fixes he refers to and that he himself added to HDFS to help HBase MTTR (e.g. HDFS-3703, HDFS-3712, and HDFS-4791; Hadoop 2 for sure has them and late Hadoop 1 has some). Set the following in the RegionServer configuration.

HBASE-8354将Namenode与租约恢复请求进行循环的问题很麻烦,但是在低超时时间的结束以及如何导致更快的恢复(包括添加到HDFS的修复程序)方面进行了大量的讨论。请阅读Varun Sharma的评论。下面建议的配置是Varun的建议经过蒸馏和测试。确保你运行的是最新版本的HDFS,这样你就有了他所提到的补丁,并且他自己添加到HDFS中来帮助HBase MTTR(例如HDFS-3703, HDFS-3712,和HDFS-4791 - Hadoop 2,这是肯定的,并且后期的Hadoop 1有一些)。在区域服务器中设置以下内容。

<property>
  <name>hbase.lease.recovery.dfs.timeout</name>
  <value>23000</value>
  <description>How much time we allow elapse between calls to recover lease.
  Should be larger than the dfs timeout.</description>
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 60 to 10 seconds.</description>
</property>

And on the NameNode/DataNode side, set the following to enable 'staleness' introduced in HDFS-3703, HDFS-3912.

在NameNode/DataNode方面,设置下面的“staleness”在HDFS-3703中引入,HDFS-3912。

<property>
  <name>dfs.client.socket-timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 60 to 10 seconds.</description>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 8 * 60 to 10 seconds.</description>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>3000</value>
  <description>Down from 60 seconds to 3.</description>
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>2</value>
  <description>Down from 45 seconds to 3 (2 == 3 retries).</description>
</property>
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
  <description>Enable stale state in hdfs</description>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>20000</value>
  <description>Down from default 30 seconds</description>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
  <description>Enable stale state in hdfs</description>
</property>

9.3.5. JMX

9.3.5。JMX

JMX (Java Management Extensions) provides built-in instrumentation that enables you to monitor and manage the Java VM. To enable monitoring and management from remote systems, you need to set the system property com.sun.management.jmxremote.port (the port number through which you want to enable JMX RMI connections) when you start the Java VM. See the official documentation for more information. Historically, besides the port mentioned above, JMX opens two additional random TCP listening ports, which can lead to port conflicts. (See HBASE-10289 for details.)

JMX (Java管理扩展)提供了内置的工具,使您能够监视和管理Java VM。为了从远程系统启用监视和管理,您需要设置system property .sun.management.jmxremote。在启动Java VM时,端口(希望启用JMX RMI连接的端口号)。有关更多信息,请参见官方文档。历史上,除了上述端口之外,JMX还打开了两个额外的随机TCP监听端口,这可能导致端口冲突问题。(有关详细信息,请参阅hbase - 10289)

As an alternative, you can use the coprocessor-based JMX implementation provided by HBase. To enable it in 0.99 or above, add the below property in hbase-site.xml:

作为一种替代方法,您可以使用HBase提供的基于协处理器的JMX实现。要使其在0.99或以上,在hbase-site.xml中添加以下属性:

<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.JMXListener</value>
</property>
DO NOT set com.sun.management.jmxremote.port for Java VM at the same time.

Currently it supports the Master and RegionServer Java VMs. By default, JMX listens on TCP port 10102; you can further configure the port using the below properties:

目前它支持主服务器和区域服务器Java VM。默认情况下,JMX监听TCP端口10102,您可以使用以下属性进一步配置端口:

<property>
  <name>regionserver.rmi.registry.port</name>
  <value>61130</value>
</property>
<property>
  <name>regionserver.rmi.connector.port</name>
  <value>61140</value>
</property>

The registry port can be shared with the connector port in most cases, so you only need to configure regionserver.rmi.registry.port. However, if you want to use SSL communication, the two ports must be configured to different values.

在大多数情况下,注册表端口可以与连接器端口共享,因此您只需要配置区域服务器.rmi.registry.port。但是,如果要使用SSL通信,则必须将两个端口配置为不同的值。

By default, password authentication and SSL communication are disabled. To enable password authentication, you need to update hbase-env.sh as below:

默认情况下,密码身份验证和SSL通信是禁用的。要启用密码身份验证,您需要更新hbase-env。sh像下图:

export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.authenticate=true                  \
                       -Dcom.sun.management.jmxremote.password.file=your_password_file   \
                       -Dcom.sun.management.jmxremote.access.file=your_access_file"

export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE "
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE "

See example password/access file under $JRE_HOME/lib/management.

参见$JRE_HOME/lib/management下的示例密码/访问文件。

To enable SSL communication with password authentication, follow below steps:

要启用SSL通信与密码身份验证,请遵循以下步骤:

#1. generate a key pair, stored in myKeyStore
keytool -genkey -alias jconsole -keystore myKeyStore

#2. export it to file jconsole.cert
keytool -export -alias jconsole -keystore myKeyStore -file jconsole.cert

#3. copy jconsole.cert to jconsole client machine, import it to jconsoleKeyStore
keytool -import -alias jconsole -keystore jconsoleKeyStore -file jconsole.cert

And then update hbase-env.sh like below:

然后更新hbase-env。sh像下图:

export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=true                         \
                       -Djavax.net.ssl.keyStore=/home/tianq/myKeyStore                 \
                       -Djavax.net.ssl.keyStorePassword=your_password_in_step_1       \
                       -Dcom.sun.management.jmxremote.authenticate=true                \
                       -Dcom.sun.management.jmxremote.password.file=your_password_file \
                       -Dcom.sun.management.jmxremote.access.file=your_access_file"

export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE "
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE "

Finally start jconsole on the client using the key store:

最后,使用密钥存储库在客户机上启动jconsole:

jconsole -J-Djavax.net.ssl.trustStore=/home/tianq/jconsoleKeyStore

To enable the HBase JMX implementation on the Master, you also need to add the below property in hbase-site.xml:

<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.JMXListener</value>
</property>

The corresponding properties for port configuration are master.rmi.registry.port (by default 10101) and master.rmi.connector.port (by default the same as registry.port)

端口配置的相应属性是master.rmi.registry。端口(默认为10101)和master.rmi.connector。端口(默认情况下与注册端口相同)

10. Dynamic Configuration

10。动态配置

Since HBase 1.0.0, it is possible to change a subset of the configuration without requiring a server restart. In the HBase shell, there are new operators, update_config and update_all_config, that will prompt a server or all servers to reload configuration.

由于HBase 1.0.0,可以在不需要服务器重启的情况下更改配置的子集。在HBase shell中,有新的操作符、update_config和update_all_config,它将提示服务器或所有服务器重新加载配置。
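For example, from the HBase shell (the server name below is a placeholder in the usual host,port,startcode form reported by the Master UI):

hbase> update_config 'server1.example.com,16020,1486218594004'
hbase> update_all_config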

Only a subset of all configurations can currently be changed in the running server. Here are those configurations:

在运行的服务器中,目前只有所有配置的一个子集可以被更改。这里是这些配置:

Table 3. Configurations that support dynamic change
Key

hbase.ipc.server.fallback-to-simple-auth-allowed
hbase.cleaner.scan.dir.concurrent.size
hbase.regionserver.thread.compaction.large
hbase.regionserver.thread.compaction.small
hbase.regionserver.thread.split
hbase.regionserver.throughput.controller
hbase.regionserver.thread.hfilecleaner.throttle
hbase.regionserver.hfilecleaner.large.queue.size
hbase.regionserver.hfilecleaner.small.queue.size
hbase.regionserver.hfilecleaner.large.thread.count
hbase.regionserver.hfilecleaner.small.thread.count
hbase.regionserver.flush.throughput.controller
hbase.hstore.compaction.max.size
hbase.hstore.compaction.max.size.offpeak
hbase.hstore.compaction.min.size
hbase.hstore.compaction.min
hbase.hstore.compaction.max
hbase.hstore.compaction.ratio
hbase.hstore.compaction.ratio.offpeak
hbase.regionserver.thread.compaction.throttle
hbase.hregion.majorcompaction
hbase.hregion.majorcompaction.jitter
hbase.hstore.min.locality.to.skip.major.compact
hbase.hstore.compaction.date.tiered.max.storefile.age.millis
hbase.hstore.compaction.date.tiered.incoming.window.min
hbase.hstore.compaction.date.tiered.window.policy.class
hbase.hstore.compaction.date.tiered.single.output.for.minor.compaction
hbase.hstore.compaction.date.tiered.window.factory.class
hbase.offpeak.start.hour
hbase.offpeak.end.hour
hbase.oldwals.cleaner.thread.size
hbase.procedure.worker.keep.alive.time.msec
hbase.procedure.worker.add.stuck.percentage
hbase.procedure.worker.monitor.interval.msec
hbase.procedure.worker.stuck.threshold.msec
hbase.regions.slop
hbase.regions.overallSlop
hbase.balancer.tablesOnMaster
hbase.balancer.tablesOnMaster.systemTablesOnly
hbase.util.ip.to.rack.determiner
hbase.ipc.server.max.callqueue.length
hbase.ipc.server.priority.max.callqueue.length
hbase.ipc.server.callqueue.type
hbase.ipc.server.callqueue.codel.target.delay
hbase.ipc.server.callqueue.codel.interval
hbase.ipc.server.callqueue.codel.lifo.threshold
hbase.master.balancer.stochastic.maxSteps
hbase.master.balancer.stochastic.stepsPerRegion
hbase.master.balancer.stochastic.maxRunningTime
hbase.master.balancer.stochastic.runMaxSteps
hbase.master.balancer.stochastic.numRegionLoadsToRemember
hbase.master.loadbalance.bytable
hbase.master.balancer.stochastic.minCostNeedBalance
hbase.master.balancer.stochastic.localityCost
hbase.master.balancer.stochastic.rackLocalityCost
hbase.master.balancer.stochastic.readRequestCost
hbase.master.balancer.stochastic.writeRequestCost
hbase.master.balancer.stochastic.memstoreSizeCost
hbase.master.balancer.stochastic.storefileSizeCost
hbase.master.balancer.stochastic.regionReplicaHostCostKey
hbase.master.balancer.stochastic.regionReplicaRackCostKey
hbase.master.balancer.stochastic.regionCountCost
hbase.master.balancer.stochastic.primaryRegionCountCost
hbase.master.balancer.stochastic.moveCost
hbase.master.balancer.stochastic.maxMovePercent
hbase.master.balancer.stochastic.tableSkewCost

Upgrading

升级

You cannot skip major versions when upgrading. If you are upgrading from version 0.90.x to 0.94.x, you must first go from 0.90.x to 0.92.x and then go from 0.92.x to 0.94.x.

升级时不能跳过主要版本。如果升级到0.90版本。0.94 x。x,你必须从0。90开始。0.92 x。x,然后从0。92。x 0.94.x。

It may be possible to skip across versions — for example go from 0.92.2 straight to 0.98.0 just following the 0.96.x upgrade instructions — but these scenarios are untested.

Review Apache HBase Configuration, in particular Hadoop. Familiarize yourself with Support and Testing Expectations.

回顾Apache HBase配置,特别是Hadoop。熟悉支持和测试期望。

11. HBase version number and compatibility

11。HBase版本号和兼容性。

HBase has two versioning schemes, pre-1.0 and post-1.0. Both are detailed below.

HBase有两个版本控制方案,1.0版和1.0版。两者都是详细的下面。

11.1. Post 1.0 versions

11.1。发布1.0版本

Starting with the 1.0.0 release, HBase is working towards Semantic Versioning for its release versioning. In summary:

从1.0.0版本开始,HBase正在努力实现版本控制的语义版本控制。总而言之:

Given a version number MAJOR.MINOR.PATCH, increment the:
  • MAJOR version when you make incompatible API changes,

    当你做出不兼容的API变更时,主要版本,

  • MINOR version when you add functionality in a backwards-compatible manner, and

    次要版本,当您以向后兼容的方式添加功能时,以及。

  • PATCH version when you make backwards-compatible bug fixes.

    当您进行向后兼容的错误修复时,补丁版本。

  • Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

    用于预发布和构建元数据的附加标签可以作为主要的扩展名。补丁格式。

Compatibility Dimensions

In addition to the usual API versioning considerations HBase has other compatibility dimensions that we need to consider.

除了通常的API版本控制之外,HBase还有其他我们需要考虑的兼容性维度。

Client-Server wire protocol compatibility
  • Allows updating client and server out of sync.

    允许更新客户端和服务器不同步。

  • We could only allow upgrading the server first. I.e. the server would be backward compatible to an old client, that way new APIs are OK.

    我们只能允许先升级服务器。也就是说,服务器将向后兼容老客户端,这样新的api就可以了。

  • Example: A user should be able to use an old client to connect to an upgraded cluster.

    示例:用户应该能够使用旧客户端连接到升级的集群。

Server-Server protocol compatibility
  • Servers of different versions can co-exist in the same cluster.

    不同版本的服务器可以在同一个集群中共存。

  • The wire protocol between servers is compatible.

    服务器之间的连线协议是兼容的。

  • Workers for distributed tasks, such as replication and log splitting, can co-exist in the same cluster.

    分布式任务的工作人员,例如复制和日志分割,可以在同一个集群中共存。

  • Dependent protocols (such as using ZK for coordination) will also not be changed.

    依赖协议(例如使用ZK进行协调)也不会改变。

  • Example: A user can perform a rolling upgrade.

    示例:用户可以执行滚动升级。

File format compatibility
  • Support file formats backward and forward compatible

    支持文件格式向后和向前兼容。

  • Example: File, ZK encoding, directory layout is upgraded automatically as part of an HBase upgrade. User can downgrade to the older version and everything will continue to work.

    示例:文件、ZK编码、目录布局自动升级,作为HBase升级的一部分。用户可以降级到旧版本,一切将继续工作。

Client API compatibility
  • Allow changing or removing existing client APIs.

    允许更改或删除现有客户端api。

  • An API needs to be deprecated for a major version before we will change/remove it.

    在我们更改/删除它之前,需要对一个主要版本的API进行弃用。

  • APIs available in a patch version will be available in all later patch versions. However, new APIs may be added which will not be available in earlier patch versions.

    补丁版本中可用的api将在以后的补丁版本中提供。但是,新的api可能会被添加,这在以前的补丁版本中是不可用的。

  • New APIs introduced in a patch version will only be added in a source compatible way [1]: i.e. code that implements public APIs will continue to compile.

    补丁版本中引入的新api只会添加到源兼容的方式[1]:即实现公共api的代码将继续编译。

    • Example: A user using a newly deprecated API does not need to modify application code with HBase API calls until the next major version. *

      示例:使用新弃用API的用户不需要使用HBase API调用修改应用程序代码,直到下一个主要版本。*

Client Binary compatibility
  • Client code written to APIs available in a given patch release can run unchanged (no recompilation needed) against the new jars of later patch versions.

    在一个给定的补丁版本中,为可用的api编写的客户机代码可以在以后的补丁版本的新jar中保持不变(不需要重新编译)。

  • Client code written to APIs available in a given patch release might not run against the old jars from an earlier patch version.

    在给定的补丁版本中提供给api的客户端代码可能不会从早期补丁版本的旧jar中运行。

    • Example: Old compiled client code will work unchanged with the new jars.

      示例:旧的编译后的客户机代码将与新jar保持一致。

  • If a Client implements an HBase Interface, a recompile MAY be required upgrading to a newer minor version (See release notes for warning about incompatible changes). All effort will be made to provide a default implementation so this case should not arise.

    如果客户端实现了HBase接口,则可能需要对更新的小版本进行重新编译(请参阅发行说明以警告不兼容的更改)。所有的努力都将提供一个默认的实现,所以这个案例不应该出现。

Server-Side Limited API compatibility (taken from Hadoop)
  • Internal APIs are marked as Stable, Evolving, or Unstable

    内部api被标记为稳定的、演进的或不稳定的。

  • This implies binary compatibility for coprocessors and plugins (pluggable classes, including replication) as long as these are only using marked interfaces/classes.

    这意味着对协处理器和插件的二进制兼容性(可插入类,包括复制),只要它们只使用标记的接口/类。

  • Example: Old compiled Coprocessor, Filter, or Plugin code will work unchanged with the new jars.

    示例:旧的编译过的协处理器、过滤器或插件代码将与新jar保持一致。

Dependency Compatibility
  • An upgrade of HBase will not require an incompatible upgrade of a dependent project, including the Java runtime.

    HBase的升级不需要对依赖项目进行不兼容的升级,包括Java运行时。

  • Example: An upgrade of Hadoop will not invalidate any of the compatibilities guarantees we made.

    Hadoop的升级不会使我们做出的兼容性保证失效。

Operational Compatibility
  • Metric changes

    指标的变化

  • Behavioral changes of services

    行为变化的服务

  • JMX APIs exposed via the /jmx/ endpoint

    通过/ JMX /端点公开的JMX api。

Summary
  • A patch upgrade is a drop-in replacement. Any change that is not Java binary and source compatible would not be allowed.[2] Downgrading versions within patch releases may not be compatible.

    补丁升级是替代的替代。任何非Java二进制和源兼容的更改都是不允许的。[2]在补丁版本中降级版本可能不兼容。

  • A minor upgrade requires no application/client code modification. Ideally it would be a drop-in replacement but client code, coprocessors, filters, etc might have to be recompiled if new jars are used.

    一个小的升级不需要应用程序/客户端代码修改。理想情况下,如果使用新的jar,它将是一个替代的替代品,但客户代码、协处理器、过滤器等可能需要重新编译。

  • A major upgrade allows the HBase community to make breaking changes.

    一个主要的升级可以让HBase社区做出改变。

Table 4. Compatibility Matrix [3]

                                          Major   Minor   Patch
Client-Server wire Compatibility          N       Y       Y
Server-Server Compatibility               N       Y       Y
File Format Compatibility                 N [4]   Y       Y
Client API Compatibility                  N       Y       Y
Client Binary Compatibility               N       N       Y
Server-Side Limited API Compatibility
  Stable                                  N       Y       Y
  Evolving                                N       N       Y
  Unstable                                N       N       N
Dependency Compatibility                  N       Y       Y
Operational Compatibility                 N       N       Y

11.1.1. HBase API Surface

11.1.1。HBase API表面

HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses Apache Yetus Audience Annotations to guide downstream expectations for stability.

HBase有很多API点,但是对于上面的兼容性矩阵,我们区分了客户机API、有限的私有API和私有API。HBase使用Apache Yetus听众注释来指导下游对稳定性的期望。

  • InterfaceAudience (javadocs): captures the intended audience, possible values include:

    InterfaceAudience (javadocs):捕获目标受众,可能的值包括:

    • Public: safe for end users and external projects

      公众:安全的终端用户和外部项目。

    • LimitedPrivate: used for internals we expect to be pluggable, such as coprocessors

      限制私有:用于我们期望可插入的内部构件,例如协处理器。

    • Private: strictly for use within HBase itself. Classes which are defined as IA.Private may be used as parameters or return values for interfaces which are declared IA.LimitedPrivate. Treat the IA.Private object as opaque; do not try to access its methods or fields directly.

      私有:严格地用于HBase本身的类,定义为IA。私有可以用作接口的参数或返回值,这些接口被声明为IA.LimitedPrivate。治疗IA。私有对象作为不透明;不要试图直接访问它的方法或字段。

  • InterfaceStability (javadocs): describes what types of interface changes are permitted. Possible values include:

    InterfaceStability (javadocs):描述允许哪些类型的接口更改。可能的值有:

    • Stable: the interface is fixed and is not expected to change

      稳定:接口是固定的,预计不会改变。

    • Evolving: the interface may change in future minor versions

      进化:界面可能会在未来的小verisons中发生变化。

    • Unstable: the interface may change at any time

      不稳定:界面随时可能发生变化。

Please keep in mind the following interactions between the InterfaceAudience and InterfaceStability annotations within the HBase project:

请记住在HBase项目中,InterfaceAudience和InterfaceStability注释之间的交互作用:

  • IA.Public classes are inherently stable and adhere to our stability guarantees relating to the type of upgrade (major, minor, or patch).

    IA。公共类本质上是稳定的,并遵循与升级类型(主要的、次要的或补丁)相关的稳定性保证。

  • IA.LimitedPrivate classes should always be annotated with one of the given InterfaceStability values. If they are not, you should presume they are IS.Unstable.

    IA。限定的私有类应该总是用给定的InterfaceStability值之一进行注释。如果它们不是,你应该假定它们是。不稳定的。

  • IA.Private classes should be considered implicitly unstable, with no guarantee of stability between releases.

    IA。私有类应该被认为是隐式不稳定的,不能保证发布之间的稳定性。

HBase Client API

HBase Client API consists of all the classes or methods that are marked with the InterfaceAudience.Public annotation. All main classes in hbase-client and dependent modules have either the InterfaceAudience.Public, InterfaceAudience.LimitedPrivate, or InterfaceAudience.Private marker. Not all classes in other modules (hbase-server, etc.) have the marker. If a class is not annotated with one of these, it is assumed to be an InterfaceAudience.Private class.

HBase客户端API由所有的类或方法组成,这些类或方法都被标记为InterfaceAudience。公共接口。hbase-客户端和依赖模块的所有主要类都有InterfaceAudience。公共场所,InterfaceAudience。LimitedPrivate或InterfaceAudience。私人标记。不是其他模块中的所有类(hbase-server等)都有标记。如果一个类没有被注释,那么它被假定为一个InterfaceAudience。私有类。

HBase LimitedPrivate API

LimitedPrivate annotation comes with a set of target consumers for the interfaces. Those consumers are coprocessors, Phoenix, replication endpoint implementations, or similar. At this point, HBase only guarantees source and binary compatibility for these interfaces between patch versions.

LimitedPrivate注释附带了一组用于接口的目标用户。这些消费者是协处理器、凤凰、复制端点实现或类似的。在这一点上,HBase只保证补丁版本之间的这些接口的源代码和二进制兼容性。

HBase Private API

All classes annotated with InterfaceAudience.Private or all classes that do not have the annotation are for HBase internal use only. The interfaces and method signatures can change at any point in time. If you are relying on a particular interface that is marked Private, you should open a jira to propose changing the interface to be Public or LimitedPrivate, or an interface exposed for this purpose.

所有的类都带有InterfaceAudience的注解。没有注释的私有或所有类只用于HBase内部使用。接口和方法签名可以在任何时间点发生变化。如果您依赖于一个标记为私有的特定接口,那么您应该打开jira,建议将接口更改为公共或限制私有,或为此目的公开接口。

11.2. Pre 1.0 versions

11.2。1.0之前的版本

HBase Pre-1.0 versions are all EOM
For new installations, do not deploy 0.94.y, 0.96.y, or 0.98.y. Deploy our stable version. See EOL 0.96, clean up of EOM releases, and the header of our downloads.

Before the semantic versioning scheme pre-1.0, HBase tracked either Hadoop’s versions (0.2x) or 0.9x versions. If you are into the arcane, check out our old wiki page on HBase Versioning, which tries to connect the HBase version dots. The sections below cover ONLY the releases before 1.0.

在语义版本控制方案1.0之前,HBase跟踪了Hadoop的版本(0.2x)或0.9x版本。如果您进入了这个神秘的领域,请在HBase版本控制上签出我们的旧wiki页面,该页面尝试连接HBase版本的点。下面的小节只介绍1.0之前的版本。

Odd/Even Versioning or "Development" Series Releases

Ahead of big releases, we have been putting up preview versions to start the feedback cycle turning over earlier. These "Development" Series releases, always odd-numbered, come with no guarantees, not even regarding being able to upgrade between two sequential releases (we reserve the right to break compatibility across "Development" Series releases). Needless to say, these releases are not for production deploys. They are a preview of what is coming, in the hope that interested parties will take the release for a test drive and flag us early if there are issues we’ve missed ahead of our rolling a production-worthy release.

在大发布之前,我们已经发布了预览版本,以开始反馈周期的开始。这些“开发”系列发布,总是奇数,没有保证,甚至不认为能够在两个顺序发布之间进行升级(我们保留在“开发”系列发行版中打破兼容性的权利)。不用说,这些发行版不是用于生产部署的。他们是对即将到来的希望的预演,希望有兴趣的各方将会在测试驱动下发布,如果我们在我们的产品发布之前错过了一些问题,我们就会提前通知我们。

Our first "Development" Series was the 0.89 set that came out ahead of HBase 0.90.0. HBase 0.95 is another "Development" Series that portends HBase 0.96.0. 0.99.x is the last series in "developer preview" mode before 1.0. Afterwards, we will be using semantic versioning naming scheme (see above).

我们的第一个“开发”系列是在HBase 0.90.0之前发布的0.89集。HBase 0.95是另一个“开发”系列,它将HBase 0.96.0引入。0.99。x是1.0之前“开发者预览”模式的最后一个系列。之后,我们将使用语义版本命名方案(见上文)。

Binary Compatibility

When we say two HBase versions are compatible, we mean that the versions are wire and binary compatible. Compatible HBase versions means that clients can talk to compatible but differently versioned servers. It means too that you can just swap out the jars of one version and replace them with the jars of another, compatible version and all will just work. Unless otherwise specified, HBase point versions are (mostly) binary compatible. You can safely do rolling upgrades between binary compatible versions; i.e. across point versions: e.g. from 0.94.5 to 0.94.6. See the discussion Does compatibility between versions also mean binary compatibility? on the HBase dev mailing list.

当我们说两个HBase版本是兼容的,我们的意思是版本是线和二进制兼容的。兼容的HBase版本意味着客户机可以与兼容但不同版本的服务器进行对话。这也意味着你可以把一个版本的jar换掉,用另一个版本的jar替换它们,兼容的版本都可以工作。除非另有说明,HBase point版本(大部分)是二进制兼容的。您可以安全地进行二进制兼容版本之间的滚动升级;例如:从0.94.5到0.94.6。参见链接:[在版本之间的兼容性也意味着二进制兼容性吗?]讨论HBase开发邮件列表。

11.3. Rolling Upgrades

11.3。滚动升级

A rolling upgrade is the process by which you update the servers in your cluster one server at a time. You can rolling upgrade across HBase versions if they are binary or wire compatible. See Rolling Upgrade Between Versions that are Binary/Wire Compatible for more on what this means. Coarsely, a rolling upgrade is a graceful stop of each server, an update of the software, and then a restart. You do this for each server in the cluster. Usually you upgrade the Master first and then the RegionServers. See Rolling Restart for tools that can help with the rolling upgrade process.

滚动升级是指您一次更新集群中的服务器的过程。如果它们是二进制或有线兼容,您可以在HBase版本上滚动升级。请参阅在二进制/连线版本之间的滚动升级,以了解更多关于这意味着什么。粗略地说,滚动升级是一个优雅的停止服务器,更新软件,然后重新启动。您为集群中的每个服务器都这样做。通常先升级主服务器,然后再升级区域服务器。请参阅滚动重新启动工具,以帮助使用滚动升级过程。

For example, in the below, HBase was symlinked to the actual HBase install. On upgrade, before running a rolling restart over the cluster, we changed the symlink to point at the new HBase software version and then ran

例如,在下面,HBase与实际的HBase安装相关联。在升级过程中,在对集群进行滚动重新启动之前,我们将symlink更改为指向新的HBase软件版本,然后运行。

$ HADOOP_HOME=~/hadoop-2.6.0-CRC-SNAPSHOT ~/hbase/bin/rolling-restart.sh --config ~/conf_hbase

The rolling-restart script will first gracefully stop and restart the master, and then each of the RegionServers in turn. Because the symlink was changed, on restart the server will come up using the new HBase version. Check logs for errors as the rolling upgrade proceeds.

rollingrestart脚本将首先优雅地停止并重新启动主服务器,然后依次为每个区域服务器。因为符号链接发生了改变,重新启动服务器将使用新的HBase版本。在滚动升级的过程中,检查日志中的错误。

Rolling Upgrade Between Versions that are Binary/Wire Compatible

Unless otherwise specified, HBase point versions are binary compatible. You can do a Rolling Upgrades between HBase point versions. For example, you can go to 0.94.6 from 0.94.5 by doing a rolling upgrade across the cluster replacing the 0.94.5 binary with a 0.94.6 binary.

除非另有说明,HBase point版本是二进制兼容的。您可以在HBase点版本之间进行滚动升级。例如,您可以从0.94.5到0.94.6,通过在集群中进行滚动升级,将0.94.5二进制文件替换为0.94.6二进制。

In the minor version-particular sections below, we call out where the versions are wire/protocol compatible, and in those cases it is also possible to do a Rolling Upgrade. For example, in Rolling upgrade from 0.98.x to HBase 1.0.0, we state that it is possible to do a rolling upgrade between hbase-0.98.x and hbase-1.0.0.

在下面的特定版本中,我们会指出版本是在哪里兼容的,在这种情况下,也可以进行滚动升级。例如,滚动升级从0.98。x到HBase 1.0.0,我们声明可以在HBase -0.98之间进行滚动升级。x和hbase-1.0.0。

12. Rollback

12。回滚

Sometimes things don’t go as planned when attempting an upgrade. This section explains how to perform a rollback to an earlier HBase release. Note that this should only be needed between Major and some Minor releases. You should always be able to downgrade between HBase Patch releases within the same Minor version. These instructions may require you to take steps before you start the upgrade process, so be sure to read through this section beforehand.

有时候,在尝试升级时,事情并没有按计划进行。本节解释如何执行回滚到早期的HBase版本。请注意,这只需要在主要版本和一些次要版本之间进行。您应该总是能够在同一个小版本的HBase补丁版本之间降级。这些指示可能要求您在启动升级过程之前采取步骤,所以一定要事先阅读这一节。

12.1. Caveats

12.1。警告

Rollback vs Downgrade

This section describes how to perform a rollback on an upgrade between HBase minor and major versions. In this document, rollback refers to the process of taking an upgraded cluster and restoring it to the old version while losing all changes that have occurred since upgrade. By contrast, a cluster downgrade would restore an upgraded cluster to the old version while maintaining any data written since the upgrade. We currently only offer instructions to rollback HBase clusters. Further, rollback only works when these instructions are followed prior to performing the upgrade.

本节描述如何在HBase小版本和主要版本之间执行回滚。在本文档中,rollback指的是在丢失升级后发生的所有更改的情况下,使用升级后的集群并将其恢复到旧版本的过程。相比之下,集群降级将使升级后的集群恢复到旧版本,同时维护升级后编写的任何数据。我们目前只提供回滚HBase集群的指令。而且,在执行升级之前,只有当这些指令被执行之后,回滚才会起作用。

When these instructions talk about rollback vs downgrade of prerequisite cluster services (i.e. HDFS), you should treat leaving the service version the same as a degenerate case of downgrade.

当这些指令涉及到对先决集群服务(即HDFS)的回滚vs降级时,您应该将服务版本与降级的降级事件一样对待。

Replication

Unless you are doing an all-service rollback, the HBase cluster will lose any configured peers for HBase replication. If your cluster is configured for HBase replication, then prior to following these instructions you should document all replication peers. After performing the rollback you should then add each documented peer back to the cluster. For more information on enabling HBase replication, listing peers, and adding a peer see Managing and Configuring Cluster Replication. Note also that data written to the cluster since the upgrade may or may not have already been replicated to any peers. Determining which, if any, peers have seen replication data as well as rolling back the data in those peers is out of the scope of this guide.

除非您正在执行全服务回滚,否则HBase集群将失去任何已配置的HBase复制节点。如果您的集群配置为HBase复制,那么在遵循这些指令之前,您应该记录所有的复制节点。执行回滚后,您应该将每个记录的对等点添加到集群中。有关启用HBase复制、列出对等点和添加对等点查看管理和配置集群复制的更多信息。还要注意,自升级之后写入集群的数据可能已经被复制到其他节点上了。确定哪个节点(如果有的话)已经看到了复制数据,并在这些节点中回滚数据,这超出了本指南的范围。
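A minimal sketch of documenting and restoring peers from the HBase shell follows. The peer id and cluster key are placeholders, and the add_peer syntax differs slightly across HBase versions, so check help 'add_peer' on your release before relying on it.

hbase> list_peers
hbase> add_peer '1', CLUSTER_KEY => "zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase"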

Data Locality

Unless you are doing an all-service rollback, going through a rollback procedure will likely destroy all locality for Region Servers. You should expect degraded performance until after the cluster has had time to go through compactions to restore data locality. Optionally, you can force a compaction to speed this process up at the cost of generating cluster load.

除非您正在执行全服务回滚,否则执行回滚过程可能会破坏区域服务器的所有位置。在集群有时间通过压缩恢复数据局部性之前,您应该期望降级的性能。可选地,您可以强制一个压实来加速这个过程,以产生集群负载。

Configurable Locations

The instructions below assume default locations for the HBase data directory and the HBase znode. Both of these locations are configurable and you should verify the value used in your cluster before proceeding. In the event that you have a different value, just replace the default with the one found in your configuration.
  • The HBase data directory is configured via the key 'hbase.rootdir' and has a default value of '/hbase'.
  • The HBase znode is configured via the key 'zookeeper.znode.parent' and has a default value of '/hbase'.

下面的说明将假设HBase数据目录和HBase znode的默认位置。这两个位置都是可配置的,您应该在继续之前验证集群中使用的值。如果您有不同的值,只需将默认值替换为配置* HBase数据目录中的默认值,该目录将通过key ' HBase配置。rootdir'并具有'/hbase'的默认值。* HBase znode是通过key 'zookeeper.znode配置的。父类的默认值为'/hbase'。
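For reference, these are the two hbase-site.xml entries to check before rolling back; the hdfs:// URI below is only an example of a non-default hbase.rootdir value, not a recommendation.

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value>
</property>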

12.2. All service rollback

12.2。所有服务回滚

If you will be performing a rollback of both the HDFS and ZooKeeper services, then HBase’s data will be rolled back in the process.

如果您将执行对HDFS和ZooKeeper服务的回滚,那么HBase的数据将在这个过程中回滚。

Requirements
  • Ability to rollback HDFS and ZooKeeper

    能够回滚HDFS和ZooKeeper。

Before upgrade

No additional steps are needed pre-upgrade. As an extra precautionary measure, you may wish to use distcp to back up the HBase data off of the cluster to be upgraded. To do so, follow the steps in the 'Before upgrade' section of 'Rollback after HDFS downgrade' but copy to another HDFS instance instead of within the same instance.

没有额外的步骤需要预先升级。作为额外的预防措施,您可能希望使用distcp来备份集群中的HBase数据以进行升级。要做到这一点,请按照“在HDFS降级后的回滚”部分的步骤,而不是在同一个实例中复制到另一个HDFS实例。

Performing a rollback
  1. Stop HBase

    停止HBase

  2. Perform a rollback for HDFS and ZooKeeper (HBase should remain stopped)

    对HDFS和ZooKeeper执行回滚(HBase应该停止)

  3. Change the installed version of HBase to the previous version

    将HBase的安装版本更改为上一个版本。

  4. Start HBase

    开始HBase

  5. Verify HBase contents—use the HBase shell to list tables and scan some known values.

    验证HBase内容——使用HBase shell列出表并扫描一些已知值。

12.3. Rollback after HDFS rollback and ZooKeeper downgrade

12.3。在HDFS回滚和ZooKeeper降级之后回滚。

If you will be rolling back HDFS but going through a ZooKeeper downgrade, then HBase will be in an inconsistent state. You must ensure the cluster is not started until you complete this process.

如果您将回滚HDFS,但是通过ZooKeeper降级,那么HBase将处于不一致的状态。在完成此过程之前,您必须确保集群没有启动。

Requirements
  • Ability to rollback HDFS

    回滚HDFS的能力

  • Ability to downgrade ZooKeeper

    能力降级动物园管理员

Before upgrade

No additional steps are needed pre-upgrade. As an extra precautionary measure, you may wish to use distcp to back up the HBase data off of the cluster to be upgraded. To do so, follow the steps in the 'Before upgrade' section of 'Rollback after HDFS downgrade' but copy to another HDFS instance instead of within the same instance.

没有额外的步骤需要预先升级。作为额外的预防措施,您可能希望使用distcp来备份集群中的HBase数据以进行升级。要做到这一点,请按照“在HDFS降级后的回滚”部分的步骤,而不是在同一个实例中复制到另一个HDFS实例。

Performing a rollback
  1. Stop HBase

    停止HBase

  2. Perform a rollback for HDFS and a downgrade for ZooKeeper (HBase should remain stopped)

    对HDFS执行回滚,并降级为ZooKeeper (HBase应该停止)

  3. Change the installed version of HBase to the previous version

    将HBase的安装版本更改为上一个版本。

  4. Clean out ZooKeeper information related to HBase. WARNING: This step will permanently destroy all replication peers. Please see the section on HBase Replication under Caveats for more information.

    清除与HBase相关的动物管理员信息。警告:此步骤将永久销毁所有复制伙伴。有关更多信息,请参阅“警告”中关于HBase复制的部分。

    Clean HBase information out of ZooKeeper
    [hpnewton@gateway_node.example.com ~]$ zookeeper-client -server zookeeper1.example.com:2181,zookeeper2.example.com:2181,zookeeper3.example.com:2181
    Welcome to ZooKeeper!
    JLine support is disabled
    rmr /hbase
    quit
    Quitting...
  5. Start HBase

    开始HBase

  6. Verify HBase contents—use the HBase shell to list tables and scan some known values.

    验证HBase内容——使用HBase shell列出表并扫描一些已知值。

12.4. Rollback after HDFS downgrade

12.4。回滚后HDFS降级

If you will be performing an HDFS downgrade, then you’ll need to follow these instructions regardless of whether ZooKeeper goes through rollback, downgrade, or reinstallation.

如果您将执行HDFS降级,那么您将需要遵循这些指示,而不管ZooKeeper是否通过rollback、降级或重新安装。

Requirements
  • Ability to downgrade HDFS

    降级HDFS的能力

  • Pre-upgrade cluster must be able to run MapReduce jobs

    预升级集群必须能够运行MapReduce作业。

  • HDFS super user access

    HDFS超级用户访问

  • Sufficient space in HDFS for at least two copies of the HBase data directory

    在HDFS中有足够的空间用于至少两个HBase数据目录的副本。

Before upgrade

Before beginning the upgrade process, you must take a complete backup of HBase’s backing data. The following instructions cover backing up the data within the current HDFS instance. Alternatively, you can use the distcp command to copy the data to another HDFS cluster.

在开始升级过程之前,您必须完全备份HBase的备份数据。下面的说明涵盖了在当前的HDFS实例中备份数据。或者,您可以使用distcp命令将数据复制到另一个HDFS集群。

  1. Stop the HBase cluster

    停止HBase集群

  2. Copy the HBase data directory to a backup location using the distcp command as the HDFS super user (shown below on a security enabled cluster)

    将HBase数据目录复制到一个备份位置,使用distcp命令作为HDFS超级用户(在启用安全的集群中显示)

    Using distcp to backup the HBase data directory
    [hpnewton@gateway_node.example.com ~]$ kinit -k -t hdfs.keytab hdfs@EXAMPLE.COM
    [hpnewton@gateway_node.example.com ~]$ hadoop distcp /hbase /hbase-pre-upgrade-backup
  3. Distcp will launch a mapreduce job to handle copying the files in a distributed fashion. Check the output of the distcp command to ensure this job completed successfully.

    Distcp将启动mapreduce作业以处理以分布式方式复制文件。检查distcp命令的输出,以确保完成此工作。

Performing a rollback
  1. Stop HBase

    停止HBase

  2. Perform a downgrade for HDFS and a downgrade/rollback for ZooKeeper (HBase should remain stopped)

    对HDFS进行降级和降级/回滚给ZooKeeper (HBase应该停止)

  3. Change the installed version of HBase to the previous version

    将HBase的安装版本更改为上一个版本。

  4. Restore the HBase data directory from prior to the upgrade as the HDFS super user (shown below on a security enabled cluster). If you backed up your data on another HDFS cluster instead of locally, you will need to use the distcp command to copy it back to the current HDFS cluster.

    在升级之前,将HBase数据目录恢复为HDFS超级用户(在启用安全的集群中显示)。如果您将数据备份到另一个HDFS集群而不是本地,您将需要使用distcp命令将其复制回当前的HDFS集群。

    Restore the HBase data directory
    [hpnewton@gateway_node.example.com ~]$ kinit -k -t hdfs.keytab hdfs@EXAMPLE.COM
    [hpnewton@gateway_node.example.com ~]$ hdfs dfs -mv /hbase /hbase-upgrade-rollback
    [hpnewton@gateway_node.example.com ~]$ hdfs dfs -mv /hbase-pre-upgrade-backup /hbase
  5. Clean out ZooKeeper information related to HBase. WARNING: This step will permanently destroy all replication peers. Please see the section on HBase Replication under Caveats for more information.

    清除与HBase相关的动物管理员信息。警告:此步骤将永久销毁所有复制伙伴。有关更多信息,请参阅“警告”中关于HBase复制的部分。

    Clean HBase information out of ZooKeeper
    [hpnewton@gateway_node.example.com ~]$ zookeeper-client -server zookeeper1.example.com:2181,zookeeper2.example.com:2181,zookeeper3.example.com:2181
    Welcome to ZooKeeper!
    JLine support is disabled
    rmr /hbase
    quit
    Quitting...
  6. Start HBase

    开始HBase

  7. Verify HBase contents–use the HBase shell to list tables and scan some known values.

    验证HBase内容——使用HBase shell列出表并扫描一些已知值。

13. Upgrade Paths

13。升级路径

13.1. Upgrading from 0.98.x to 1.x

13.1。从0.98升级。x 1.倍

In this section we first note the significant changes that come in with 1.0.0+ HBase and then we go over the upgrade process. Be sure to read the significant changes section with care so you avoid surprises.

在本节中,我们首先注意到1.0.0+ HBase带来的重大变化,然后我们讨论升级过程。一定要仔细阅读有意义的修改部分,以免出现意外。

13.1.1. Changes of Note!

13.1.1。变化的注意!

Here we list important changes that are in 1.0.0+ since 0.98.x, changes you should be aware of because they will go into effect once you upgrade.

在这里,我们列出了自0.98.x以来,1.0.0+的重要变化。,你应该意识到,一旦你升级了,它就会生效。

ZooKeeper 3.4 is required in HBase 1.0.0+

See ZooKeeper Requirements.

看到动物园管理员的需求。

HBase Default Ports Changed

The ports used by HBase changed. They used to be in the 600XX range. In HBase 1.0.0 they have been moved up out of the ephemeral port range and are 160XX instead (Master web UI was 60010 and is now 16010; the RegionServer web UI was 60030 and is now 16030, etc.). If you want to keep the old port locations, copy the port setting configs from hbase-default.xml into hbase-site.xml, change them back to the old values from the HBase 0.98.x era, and ensure you’ve distributed your configurations before you restart.

HBase使用的端口发生了变化。他们曾经在600XX的范围内。在HBase 1.0.0中,它们已经从短暂的端口范围内移出,而改为160XX(主web UI为60010,现在为16010;区域服务器web UI是60030,现在是16030,等等。如果您希望保留旧的端口位置,请从hbase-default复制端口设置配置。xml到hbase-site。xml,将它们从HBase 0.98转换回旧值。x时代,并确保在重新启动之前分配了配置。
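As a sketch of pinning just the two web UI ports back to their pre-1.0 values in hbase-site.xml (the other 160XX ports, such as the RPC ports, would be handled the same way with their respective keys):

<property>
  <name>hbase.master.info.port</name>
  <value>60010</value>
</property>
<property>
  <name>hbase.regionserver.info.port</name>
  <value>60030</value>
</property>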

HBase Master Port Binding Change

In HBase 1.0.x, the HBase Master binds the RegionServer ports as well as the Master ports. This behavior is changed from HBase versions prior to 1.0. In HBase 1.1 and 2.0 branches, this behavior is reverted to the pre-1.0 behavior of the HBase master not binding the RegionServer ports.

在HBase 1.0。x, HBase主绑定区域服务器端口和主端口。此行为在1.0之前由HBase版本更改。在HBase 1.1和2.0分支中,该行为被恢复到HBase主的1.0前行为,而不是绑定区域服务器端口。

hbase.bucketcache.percentage.in.combinedcache configuration has been REMOVED

You may have made use of this configuration if you are using BucketCache. If NOT using BucketCache, this change does not affect you. Its removal means that your L1 LruBlockCache is now sized using hfile.block.cache.size — i.e. the way you would size the on-heap L1 LruBlockCache if you were NOT doing BucketCache — and the BucketCache size is not whatever the setting for hbase.bucketcache.size is. You may need to adjust configs to get the LruBlockCache and BucketCache sizes set to what they were in 0.98.x and previous. If you did not set this config, its default value was 0.9. If you do nothing, your BucketCache will increase in size by 10%. Your L1 LruBlockCache will become hfile.block.cache.size times your Java heap size (hfile.block.cache.size is a float between 0.0 and 1.0). To read more, see HBASE-11520 Simplify offheap cache config by removing the confusing "hbase.bucketcache.percentage.in.combinedcache".

如果您使用的是BucketCache,您可能已经使用了这个配置。如果不使用BucketCache,此更改不会影响您。它的删除意味着您的L1 LruBlockCache现在使用的是hfile.block.cache.size -即。如果您不做BucketCache,那么您将对on-heap L1 LruBlockCache进行大小设置——而BucketCache大小并不适用于hbase.bucketcache。大小是多少。您可能需要调整configs,以使LruBlockCache和BucketCache大小设置为0.98。x和之前。如果您没有设置这个配置。其默认值为0.9。如果你什么都不做,你的背包将会增加10%。您的L1 LruBlockCache将成为hfile.block.cache。大小乘以您的java堆大小(hfile.block.cache)。大小是介于0.0和1.0之间的浮动。若要读取更多信息,请参见HBASE-11520简化offheap缓存配置,删除“hbase. bucketcache.% age.l .in.combinedcache”。
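A hedged example of sizing the two caches explicitly after the upgrade; the 0.4 heap fraction and the 4096 value are placeholders to adjust to whatever you were running in 0.98.x, and how hbase.bucketcache.size is interpreted depends on your version (check hbase-default.xml for your release).

<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
  <description>Fraction of heap for the on-heap L1 LruBlockCache.</description>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value>
  <description>BucketCache capacity; see hbase-default.xml for how the value is interpreted on your version.</description>
</property>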

If you have your own custom filters.

See the release notes on the issue HBASE-12068 [Branch-1] Avoid need to always do KeyValueUtil#ensureKeyValue for Filter transformCell; be sure to follow the recommendations therein.

在HBASE-12068 [Branch-1]上看到发行说明,避免总是要对Filter transformCell进行KeyValueUtil#ensureKeyValue;一定要遵循其中的建议。

Mismatch Of hbase.client.scanner.max.result.size Between Client and Server

If either the client or server version is lower than 0.98.11/1.0.0 and the server has a smaller value for hbase.client.scanner.max.result.size than the client, scan requests that reach the server’s hbase.client.scanner.max.result.size are likely to miss data. In particular, 0.98.11 defaults hbase.client.scanner.max.result.size to 2 MB but other versions default to larger values. For this reason, be very careful using 0.98.11 servers with any other client version.

如果客户机或服务器版本低于0.98.11/1.0.0,服务器对hbase.client.scanner.max.result的值更小。大小超过客户端,扫描请求到达服务器的hbase.client.scanner.max.result。大小可能会遗漏数据。特别是,0.98.11缺省值hbase.client.scanner.max.result。大小为2 MB,但其他版本默认为更大的值。因此,要非常小心使用0.98.11服务器和其他客户端版本。
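One way to avoid the mismatch is to pin the value explicitly and identically in both the client-side and server-side hbase-site.xml; the 2097152 (2 MB) value below matches the 0.98.11 default described above.

<property>
  <name>hbase.client.scanner.max.result.size</name>
  <value>2097152</value>
</property>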

Availability of Date Tiered Compaction.

The Date Tiered Compaction feature available as of 0.98.19 is available in the 1.y release line starting in release 1.3.0. If you have enabled this feature for any tables you must upgrade to version 1.3.0 or later. If you attempt to use an earlier 1.y release, any tables configured to use date tiered compaction will fail to have their regions open.

在1中可用的日期分级压实特性为0.98.19。从版本1.3.0开始的y发布线。如果您已经为任何表启用了这个特性,那么您必须升级到1.3.0或更高版本。如果你尝试使用更早的1。任何配置为使用日期分级压缩的表都将无法打开它们的区域。

13.1.2. Rolling upgrade from 0.98.x to HBase 1.0.0

13.1.2。从0.98滚动升级。x HBase 1.0.0

From 0.96.x to 1.0.0
You cannot do a rolling upgrade from 0.96.x to 1.0.0 without first doing a rolling upgrade to 0.98.x. See the comment in HBASE-11164 Document and test rolling updates from 0.98 → 1.0 for the why. Also, because HBase 1.0.0 enables HFile v3 by default (HBASE-9801 Change the default HFile version to V3), and support for HFile v3 only arrives in 0.98, this is another reason you cannot do a rolling upgrade from HBase 0.96.x; if the rolling upgrade stalls, the 0.96.x servers cannot open files written by the servers running the newer HBase 1.0.0 with HFiles of version 3.

There are no known issues running a rolling upgrade from HBase 0.98.x to HBase 1.0.0.

从HBase 0.98进行滚动升级,目前还没有已知的问题。x HBase 1.0.0。

13.1.3. Scanner Caching has Changed

13.1.3。扫描仪缓存已经改变了

From 0.98.x to 1.x

In hbase-1.x, the default Scan caching 'number of rows' changed. Where in 0.98.x it defaulted to 100, in later HBase versions the default became Integer.MAX_VALUE. Not setting a cache size can make for Scans that run for a long time server-side, especially if they are running with stringent filtering. See Revisiting default value for hbase.client.scanner.caching for further discussion.

在hbase-1。x,默认扫描缓存的行数改变了。在0.98。在后来的HBase版本中,默认为100,默认为Integer.MAX_VALUE。如果不设置缓存大小,则可以对运行了很长时间的服务器端进行扫描,特别是在使用严格的过滤时。请参阅重新访问hbase.client.scanner.缓存的默认值;为进一步讨论。
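If you relied on the old behaviour, one option is to restore a bounded default in the client-side hbase-site.xml (100 matches the 0.98.x default quoted above); setting the caching explicitly on each Scan in application code achieves the same effect and is more targeted.

<property>
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>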

13.1.4. Upgrading to 1.0 from 0.94

13.1.4。从0.94升级到1.0。

You cannot do a rolling upgrade from 0.94.x to 1.x.x. You must stop your cluster, install the 1.x.x software, run the migration described at Executing the 0.96 Upgrade (substituting 1.x.x wherever we make mention of 0.96.x in the section below), and then restart. Be sure to upgrade your ZooKeeper if it is a version less than the required 3.4.x.

不能从0.94滚动升级。x.x 1. x。您必须停止集群,安装1.x。x软件,运行在执行0.96升级时描述的迁移(替换1.x.x。只要提到0。96。x在下面的部分中),然后重新启动。如果您的ZooKeeper的版本小于要求的3.4.x,请确保升级。

13.2. Upgrading from 0.96.x to 0.98.x

13.2。从0.96升级。x 0.98.x

A rolling upgrade from 0.96.x to 0.98.x works. The two versions are not binary compatible.

滚动升级从0.96。0.98 x。x是有效的。这两个版本不是二进制兼容的。

Additional steps are required to take advantage of some of the new features of 0.98.x, including cell visibility labels, cell ACLs, and transparent server side encryption. See Securing Apache HBase for more information. Significant performance improvements include a change to the write ahead log threading model that provides higher transaction throughput under high load, reverse scanners, MapReduce over snapshot files, and striped compaction.

需要额外的步骤来利用0.98的一些新特性。x,包括单元可视性标签、单元格和透明的服务器端加密。有关更多信息,请参见保护Apache HBase。显著的性能改进包括修改前面的日志线程模型,该模型在高负载、反向扫描器、快照文件和条带压实的情况下提供了更高的事务吞吐量。

Clients and servers can run with 0.98.x and 0.96.x versions. However, applications may need to be recompiled due to changes in the Java API.

客户端和服务器可以运行0.98。0.96 x和。x版本。但是,由于Java API的变化,应用程序可能需要重新编译。

13.3. Upgrading from 0.94.x to 0.98.x

13.3。从0.94升级。x 0.98.x

A rolling upgrade from 0.94.x directly to 0.98.x does not work. The upgrade path follows the same procedures as Upgrading from 0.94.x to 0.96.x. Additional steps are required to use some of the new features of 0.98.x. See Upgrading from 0.96.x to 0.98.x for an abbreviated list of these features.

滚动升级从0.94。直接向0.98 x。x不工作。升级路径遵循与从0.94升级的相同程序。x 0.96.x。需要额外的步骤来使用0.98.x的一些新特性。看到从0.96升级。0.98 x。x代表这些特性的缩写列表。

13.4. Upgrading from 0.94.x to 0.96.x

13.4。从0.94升级。x 0.96.x

13.4.1. The "Singularity"

13.4.1。“奇点”

You will have to stop your old 0.94.x cluster completely to upgrade. If you are replicating between clusters, both clusters will have to go down to upgrade. Make sure it is a clean shutdown. The fewer WAL files around, the faster the upgrade will run (the upgrade will split any log files it finds in the filesystem as part of the upgrade process). All clients must be upgraded to 0.96 too.

你将不得不停止你的旧的0.94。x集群完全升级。如果在集群之间进行复制,那么这两个集群将不得不进行升级。确保它是一个干净的关机。周围的文件越少,升级的速度就越快(升级将会将文件系统中发现的任何日志文件分割为升级过程的一部分)。所有客户必须升级到0.96。

The API has changed. You will need to recompile your code against 0.96 and you may need to adjust applications to go against new APIs (TODO: List of changes).

API已经改变了。您需要将代码重新编译为0.96,您可能需要调整应用程序以适应新的api (TODO:更改列表)。

13.4.2. Executing the 0.96 Upgrade

13.4.2。执行升级到0.96

HDFS and ZooKeeper must be up!
HDFS and ZooKeeper should be up and running during the upgrade process.

HBase 0.96.0 comes with an upgrade script. Run

HBase 0.96.0附带一个升级脚本。运行

$ bin/hbase upgrade

to see its usage. The script has two main modes: -check, and -execute.

看到它的用法。该脚本有两种主要模式:-检查和-执行。

check

The check step is run against a running 0.94 cluster. Run it from a downloaded 0.96.x binary. The check step is looking for the presence of HFile v1 files. These are unsupported in HBase 0.96.0. To have them rewritten as HFile v2 you must run a compaction.

检查步骤运行在运行的0.94集群上。从下载的0.96运行它。x二进制。检查步骤是寻找HFile v1文件的存在。这些在HBase 0.96.0中是不支持的。要将它们重写为HFile v2,您必须运行一个压缩。

The check step prints stats at the end of its run (grep for “Result:” in the log), printing the absolute path of the tables it scanned, any HFile v1 files found, the regions containing said files (these regions will need a major compaction), and any corrupted files if found. A corrupt file is unreadable, and so is undefined (neither HFile v1 nor HFile v2).

检查步骤在其运行结束时打印统计数据(grep用于“结果:”在日志中)打印数据表的绝对路径,任何HFile v1文件,包含该文件的区域(这些区域将需要一个主要的压缩),如果发现任何损坏的文件。一个损坏的文件是不可读的,所以没有定义(HFile v1和HFile v2)。

To run the check step, run

运行检查步骤,运行。

$ bin/hbase upgrade -check

Here is sample output:

这是示例输出:

Tables Processed:
hdfs://localhost:41020/myHBase/.META.
hdfs://localhost:41020/myHBase/usertable
hdfs://localhost:41020/myHBase/TestTable
hdfs://localhost:41020/myHBase/t

Count of HFileV1: 2
HFileV1:
hdfs://localhost:41020/myHBase/usertable    /fa02dac1f38d03577bd0f7e666f12812/family/249450144068442524
hdfs://localhost:41020/myHBase/usertable    /ecdd3eaee2d2fcf8184ac025555bb2af/family/249450144068442512

Count of corrupted files: 1
Corrupted Files:
hdfs://localhost:41020/myHBase/usertable/fa02dac1f38d03577bd0f7e666f12812/family/1
Count of Regions with HFileV1: 2
Regions to Major Compact:
hdfs://localhost:41020/myHBase/usertable/fa02dac1f38d03577bd0f7e666f12812
hdfs://localhost:41020/myHBase/usertable/ecdd3eaee2d2fcf8184ac025555bb2af

There are some HFileV1, or corrupt files (files with incorrect major version)

In the above sample output, there are two HFile v1 files in two regions, and one corrupt file. Corrupt files should probably be removed. The regions that have HFile v1s need to be major compacted. To major compact, start up the hbase shell and review how to compact an individual region. After the major compaction is done, rerun the check step and the HFile v1 files should be gone, replaced by HFile v2 instances.

在上面的示例输出中,在两个区域中有两个HFile v1文件,以及一个损坏的文件。应该删除损坏的文件。具有HFile v1s的区域需要进行大压缩。对于主要的契约,启动hbase shell并审查如何压缩单个区域。在完成主要的压缩之后,重新运行检查步骤,HFile v1文件应该被删除,替换为HFile v2实例。

By default, the check step scans the HBase root directory (defined as hbase.rootdir in the configuration). To scan a specific directory only, pass the -dir option.

默认情况下,检查步骤扫描HBase根目录(定义为HBase)。rootdir配置)。要扫描特定的目录,请通过-dir选项。

$ bin/hbase upgrade -check -dir /myHBase/testTable

The above command would detect HFile v1 files in the /myHBase/testTable directory.

上面的命令将在/myHBase/testTable目录中检测HFile v1文件。

Once the check step reports all the HFile v1 files have been rewritten, it is safe to proceed with the upgrade.

一旦检查步骤报告所有HFile v1文件被重写,就可以安全地进行升级了。

execute

After the check step shows the cluster is free of HFile v1, it is safe to proceed with the upgrade. Next is the execute step. You must SHUTDOWN YOUR 0.94.x CLUSTER before you can run the execute step. The execute step will not run if it detects running HBase masters or RegionServers.

在检查步骤显示集群没有HFile v1之后,继续进行升级是安全的。接下来是执行步骤。你必须关闭你的0.94。在运行执行步骤之前,请先进行x集群。如果它检测到运行的HBase主机或区域服务器,则执行步骤将不会运行。

HDFS and ZooKeeper should be up and running during the upgrade process. If zookeeper is managed by HBase, then you can start zookeeper so it is available to the upgrade by running

在升级过程中,HDFS和ZooKeeper应该启动和运行。如果zookeeper由HBase管理,那么您可以启动zookeeper,这样它就可以通过运行升级。

$ ./hbase/bin/hbase-daemon.sh start zookeeper

The execute upgrade step is made of three substeps.

执行升级步骤由三个子步骤组成。

  • Namespaces: HBase 0.96.0 has support for namespaces. The upgrade needs to reorder directories in the filesystem for namespaces to work.

    名称空间:HBase 0.96.0支持名称空间。升级需要对文件系统中的目录进行重新排序,以便使用名称空间工作。

  • ZNodes: All znodes are purged so that new ones can be written in their place using a new protobuf’ed format; a few are migrated in place, e.g. the replication and table state znodes.

    znode:所有的znode都被清除了,这样新的原型就可以在它们的位置上使用新的原始格式,并且有一些被迁移到适当的地方:例如复制和表状态znode。

  • WAL Log Splitting: If the 0.94.x cluster shutdown was not clean, we’ll split WAL logs as part of migration before we startup on 0.96.0. This WAL splitting runs slower than the native distributed WAL splitting because it is all inside the single upgrade process (so try and get a clean shutdown of the 0.94.0 cluster if you can).

    如果是0。94。x集群关闭不干净,我们将在启动0.96.0之前将WAL - log分割为迁移的一部分。这个WAL - fi的运行速度比本地的分布式WAL要慢,因为它都在单一的升级过程中(所以如果可以的话,试着彻底关闭0.94.0集群)。

To run the execute step, make sure that you have first copied the HBase 0.96.0 binaries everywhere, under servers and under clients. Make sure the 0.94.0 cluster is down. Then do as follows:

要运行执行步骤,请确保首先在服务器和客户机下复制HBase 0.96.0二进制文件。确保0.94.0集群已经关闭。然后做如下:

$ bin/hbase upgrade -execute

Here is some sample output.

这是一些样本输出。

Starting Namespace upgrade
Created version file at hdfs://localhost:41020/myHBase with version=7
Migrating table testTable to hdfs://localhost:41020/myHBase/.data/default/testTable
.....
Created version file at hdfs://localhost:41020/myHBase with version=8
Successfully completed NameSpace upgrade.
Starting Znode upgrade
.....
Successfully completed Znode upgrade

Starting Log splitting
...
Successfully completed Log splitting

If the output from the execute step looks good, stop the zookeeper instance you started to do the upgrade:

如果执行步骤的输出看起来很好,请停止您开始进行升级的zookeeper实例:

$ ./hbase/bin/hbase-daemon.sh stop zookeeper

Now start up hbase-0.96.0.

现在启动hbase-0.96.0。

13.5. Troubleshooting

13.5。故障排除

Old Client connecting to 0.96 cluster

It will fail with an exception like the below. Upgrade.

它将以如下的异常失败。升级。

17:22:15  Exception in thread "main" java.lang.IllegalArgumentException: Not a host:port pair: PBUF
17:22:15  *
17:22:15   api-compat-8.ent.cloudera.com ��  ���(
17:22:15    at org.apache.hadoop.hbase.util.Addressing.parseHostname(Addressing.java:60)
17:22:15    at org.apache.hadoop.hbase.ServerName.<init>(ServerName.java:101)
17:22:15    at org.apache.hadoop.hbase.ServerName.parseVersionedServerName(ServerName.java:283)
17:22:15    at org.apache.hadoop.hbase.MasterAddressTracker.bytesToServerName(MasterAddressTracker.java:77)
17:22:15    at org.apache.hadoop.hbase.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:61)
17:22:15    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:703)
17:22:15    at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:126)
17:22:15    at Client_4_3_0.setup(Client_4_3_0.java:716)
17:22:15    at Client_4_3_0.main(Client_4_3_0.java:63)

13.5.1. Upgrading META to use Protocol Buffers (Protobuf)

13.5.1。升级META以使用协议缓冲区(Protobuf)

When you upgrade from versions prior to 0.96, META needs to be converted to use protocol buffers. This is controlled by the configuration option hbase.MetaMigrationConvertingToPB, which is set to true by default. Therefore, by default, no action is required on your part.

当您在0.96之前从版本升级时,需要将META转换为使用协议缓冲区。这是由配置选项hbase控制的。MetaMigrationConvertingToPB,默认设置为true。因此,默认情况下,不需要任何操作。

The migration is a one-time event. However, every time your cluster starts, META is scanned to ensure that it does not need to be converted. If you have a very large number of regions, this scan can take a long time. Starting in 0.98.5, you can set hbase.MetaMigrationConvertingToPB to false in hbase-site.xml, to disable this start-up scan. This should be considered an expert-level setting.

迁移是一次性事件。但是,每次集群启动时,都会扫描元数据以确保不需要转换。如果你有大量的区域,这个扫描可能需要很长时间。从0.98.5开始,您可以设置hbase。在hbase站点中,MetaMigrationConvertingToPB为false。xml,禁用此启动扫描。这应该被视为专家级别的设置。

13.6. Upgrading from 0.92.x to 0.94.x

13.6。从0.92升级。x 0.94.x

We used to think that 0.92 and 0.94 were interface compatible and that you could do a rolling upgrade between these versions, but then we realized that HBASE-5357 Use builder pattern in HColumnDescriptor changed method signatures so that rather than returning void they instead return HColumnDescriptor. This throws java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V, so 0.92 and 0.94 are NOT compatible. You cannot do a rolling upgrade between them.

我们曾经认为0.92和0.94是接口兼容的,您可以在这些版本之间进行滚动升级,但是我们发现HBASE-5357在HColumnDescriptor中使用builder模式更改了方法签名,而不是返回void,而是返回HColumnDescriptor。这将把. lang。(I)V . 0.92和0.94是不兼容的。你不能在他们之间进行滚动升级。

13.7. Upgrading from 0.90.x to 0.92.x

13.7。从0.90升级。x 0.92.x

13.7.1. Upgrade Guide

13.7.1。升级指南

You will find that 0.92.0 runs a little differently from 0.90.x releases. Here are a few things to watch out for when upgrading from 0.90.x to 0.92.0.

你会发现0。92.0和0。90有点不同。x版本。这里有几点需要注意,从0.90升级。x 0.92.0。

tl;dr

These are the important things to know before upgrading. Once you upgrade, you can’t go back.

在升级之前,这些都是重要的事情。一旦你升级了,你就不能回去了。

  1. MSLAB is on by default. Watch that heap usage if you have a lot of regions.

    MSLAB默认是on。如果有很多区域,请注意堆使用。

  2. Distributed Log Splitting is on by default. It should make RegionServer failover faster.

    在默认情况下,分布式日志分裂是打开的。它应该使区域性服务器故障转移更快。

  3. There’s a separate tarball for security.

    安全还有一个单独的tarball。

  4. If -XX:MaxDirectMemorySize is set in your hbase-env.sh, it’s going to enable the experimental off-heap cache (You may not want this).

    如果-XX:MaxDirectMemorySize设置在您的hbase-env中。sh,它将启用实验性的堆外缓存(您可能不想要这个)。

You can’t go back!

To move to 0.92.0, all you need to do is shut down your cluster, replace your HBase 0.90.x binaries with HBase 0.92.0 binaries (be sure you clear out all 0.90.x instances) and restart (you cannot do a rolling restart from 0.90.x to 0.92.x — you must restart). On startup, the .META. table content is rewritten, removing the table schema from the info:regioninfo column. Also, any flushes done post first startup will write out data in the new 0.92.0 file format, HBase file format with inline blocks (version 2). This means you cannot go back to 0.90.x once you’ve started HBase 0.92.0 over your HBase data directory.

要移动到0.92.0,需要做的就是关闭集群,替换HBase 0.90。使用HBase 0.92.0二进制文件(请确保您清除了所有0.90)。并重启(您不能从0.90开始滚动重新启动)。0.92 x。x -你必须重启)。在启动时,.META。表内容重写了从info:区域信息列中删除表模式。此外,任何完成后的第一次启动,都将以新的0.92.0文件格式、HBase文件格式和内联块(版本2)来写数据,这意味着您不能回到0.90。一旦您在HBase数据目录上启动了HBase 0.92.0。

MSLAB is ON by default

In 0.92.0, the hbase.hregion.memstore.mslab.enabled flag is set to true (See Long GC pauses). In 0.90.x it was false. When it is enabled, memstores will allocate memory in 2MB MSLAB chunks even if the memstore has zero or just a few small elements. This is usually fine, but if you had lots of regions per RegionServer in a 0.90.x cluster (and MSLAB was off), you may find yourself OOME’ing on upgrade because thousands of regions * number of column families * 2MB MSLAB (at a minimum) puts your heap over the top; for example, 1,000 regions with 3 column families each is already roughly 6GB of heap before any data is stored. Set hbase.hregion.memstore.mslab.enabled to false, or set the MSLAB size down from 2MB by setting hbase.hregion.memstore.mslab.chunksize to something less.

在0.92.0 hbase.hregion.memstore.mslab。enabled标志被设置为true(请参阅长GC暂停)。在0.90。x是假的。当启用它时,memstores将会在MSLAB 2MB内存块中分配内存,即使memstore只有零个或只是几个小的元素。这通常是可以的,但是如果每个区域服务器有很多区域在0.90。x集群(和MSLAB关闭),您可能会发现自己正在进行升级,因为成千上万的区域* * * * * * MSLAB(至少是)将您的堆放置在顶部。设置hbase.hregion.memstore.mslab。通过设置hbase. h区域性.memstore.mslab,可以将MSLAB大小从2MB设置为false或设置。chunksize的东西更少。

Distributed Log Splitting is on by default

Previously, WAL logs on crash were split by the Master alone. In 0.92.0, log splitting is done by the cluster (See HBASE-1364 [performance] Distributed splitting of regionserver commit logs or see the blog post Apache HBase Log Splitting). This should cut down significantly on the amount of time it takes splitting logs and getting regions back online again.

此前,在《撞车》(crash)中,沃斯(WAL - log)被主人单独分开。在0.92.0中,日志拆分是由集群完成的(参见HBase -1364[性能]分布式分区服务器提交日志或查看博客Apache HBase日志拆分)。这应该会大大减少分割日志和恢复区域的时间。

Memory accounting is different now

In 0.92.0, HBase file format with inline blocks (version 2) indices and bloom filters take up residence in the same LRU used for caching blocks that come from the filesystem. In 0.90.x, the HFile v1 indices lived outside of the LRU, so they took up space even if the index was on a ‘cold’ file, one that wasn’t being actively used. With the indices now in the LRU, you may find you have less space for block caching. Adjust your block cache accordingly. See the Block Cache for more detail. The block cache default size has been changed in 0.92.0 from 0.2 (20 percent of heap) to 0.25.

在0.92.0中,带有内联块(版本2)索引和bloom过滤器的HBase文件格式在相同的LRU中占用了来自文件系统的缓存块。在0.90。x, HFile v1指数在LRU之外,所以它们占据了空间,即使索引是在“冷”文件上,也没有被积极使用。使用LRU中的索引,您可能会发现阻塞缓存的空间更小。相应地调整块缓存。更多细节请参见块缓存。块大小的默认大小从0.2(20%的堆)更改为0.92.0。

On the Hadoop version to use

Run 0.92.0 on Hadoop 1.0.x (or CDH3u3). The performance benefits are worth making the move. Otherwise, our Hadoop prescription is as it has been; you need an Hadoop that supports a working sync. See Hadoop.

在Hadoop 1.0上运行0.92.0。x(或CDH3u3)。性能上的好处是值得采取行动的。否则,我们的Hadoop处方就像以前一样;您需要一个支持工作同步的Hadoop。参见Hadoop。

If running on Hadoop 1.0.x (or CDH3u3), enable local read. See Practical Caching presentation for ruminations on the performance benefits ‘going local’ (and for how to enable local reads).

如果在Hadoop 1.0上运行。x(或CDH3u3),允许本地读取。请参阅实用的缓存演示,以了解关于性能好处“本地化”(以及如何启用本地读取)的思考。

HBase 0.92.0 ships with ZooKeeper 3.4.2

If you can, upgrade your ZooKeeper. If you can’t, 3.4.2 clients should work against 3.3.X ensembles (HBase makes use of 3.4.2 API).

如果可以的话,升级你的动物园管理员。如果不能,3.4.2客户端应该针对3.3。X套件(HBase使用3.4.2 API)。

Online alter is off by default

In 0.92.0, we’ve added an experimental online schema alter facility (See hbase.online.schema.update.enable). It’s off by default. Enable it at your own risk. Online alter and splitting tables do not play well together, so be sure your cluster is quiescent while using this feature (for now).

在0.92.0中,我们添加了一个实验性的在线模式更改工具(见hbase.online.schema.update.enable)。这是默认关闭的。你可以自行承担风险。在线修改和拆分表不能很好地组合在一起,所以要确保您的集群休眠使用这个特性(现在)。

WebUI

The web UI has had a few additions made in 0.92.0. It now shows a list of the regions currently transitioning, recent compactions/flushes, and a process list of running processes (usually empty if all is well and requests are being handled promptly). Other additions include requests by region, a debugging servlet dump, etc.

web UI在0.92.0中添加了一些内容。它现在显示了当前正在转换的区域列表,最近的压缩/刷新,以及正在运行的进程的进程列表(如果一切正常,通常是空的,并且请求正在迅速处理)。其他添加包括区域请求、调试servlet转储等。

Security tarball

We now ship with two tarballs: secure and insecure HBase. Documentation on how to set up a secure HBase is on the way.

我们现在用两个tarball;安全的,不安全的HBase。关于如何设置安全HBase的文档正在进行中。

Changes in HBase replication

0.92.0 adds two new features: multi-slave and multi-master replication. The way to enable this is the same as adding a new peer, so in order to have multi-master you would just run add_peer for each cluster that acts as a master to the other slave clusters. Collisions are handled at the timestamp level, which may or may not be what you want; this needs to be evaluated on a per-use-case basis. Replication is still experimental in 0.92 and is disabled by default; run it at your own risk.

0.92.0添加了两个新特性:多奴隶和多主复制。启用这一功能的方法与添加一个新的对等点是一样的,因此,为了拥有多主机,您只需要为每个集群运行add_peer,以充当其他从属集群的主服务器。冲突是在时间戳级别处理的,它可能是您想要的,也可能不是您想要的,这需要在每个用例的基础上进行评估。复制在0.92中仍然是实验性的,在默认情况下是禁用的,在您自己的风险下运行它。

RegionServer now aborts if OOME

If an OOME occurs, we now have the JVM kill -9 the RegionServer process so it goes down fast. Previously, a RegionServer might stick around after incurring an OOME, limping along in some wounded state. To disable this facility (though we recommend you leave it in place), you’d need to edit the bin/hbase file. Look for the addition of the -XX:OnOutOfMemoryError="kill -9 %p" arguments (See HBASE-4769 - ‘Abort RegionServer Immediately on OOME’).

如果一个OOME,我们现在有JVM kill -9区域服务器进程,所以它会快速下降。以前,在一些受伤的州,一个地区服务器可能会在一瘸一拐的行进中徘徊。要禁用此功能,并建议您将其保留,您需要编辑bin/hbase文件。查找添加的- xx:OnOutOfMemoryError="kill - 9% p"参数(参见HBASE-4769 -“立即在OOME上中止区域服务器”)。

HFile v2 and the “Bigger, Fewer” Tendency

0.92.0 stores data in a new format, HBase file format with inline blocks (version 2). As HBase runs, it will move all your data from HFile v1 to HFile v2 format. This auto-migration will run in the background as flushes and compactions run. HFile v2 allows HBase to run with larger regions/files. In fact, we encourage all HBasers going forward to tend toward Facebook axiom #1: run with larger, fewer regions. If you have lots of regions now — more than 100s per host — you should look into setting your region size up after you move to 0.92.0 (in 0.92.0, the default size is now 1G, up from 256M), and then running the online merge tool (See HBASE-1621 merge tool should work on online cluster, but disabled table).

0.92.0以新的格式存储数据,HBase文件格式与内联块(版本2)。当HBase运行时,它将把所有数据从HFile v1移动到HFile v2格式。这个自动迁移将在后台运行,因为它会运行。HFile v2允许HBase运行较大的区域/文件。事实上,我们鼓励所有的hbaser都倾向于Facebook axiom #1,使用更大、更少的区域。如果你现在有很多地区——超过100年代每个主机,你应该考虑设置区域大小后搬到0.92.0(现在在0.92.0,默认大小是1克,256),然后运行在线合并工具(见hbase - 1621合并工具应该在线集群,但禁用表)。

13.8. Upgrading to HBase 0.90.x from 0.20.x or 0.89.x

13.8。升级到0.90 HBase。从0.20 x。x或0.89.x

This version of 0.90.x HBase can be started on data written by HBase 0.20.x or HBase 0.89.x. There is no need for a migration step. HBase 0.89.x and 0.90.x do write out the names of region directories differently — they name them with an md5 hash of the region name rather than a jenkins hash — so once started, there is no going back to HBase 0.20.x.

这个版本为0.90。x HBase可以从HBase 0.20的数据开始。x或HBase 0.89.x。不需要迁移步骤。HBase 0.89。0.90 x和。x确实以不同的方式写出了区域目录的名称——它用区域名称的md5哈希来命名它们,而不是jenkins哈希——所以这意味着一旦开始,就不会返回到HBase 0.20 x。

Be sure to remove the hbase-default.xml from your conf directory on upgrade. A 0.20.x version of this file will have sub-optimal configurations for 0.90.x HBase. The hbase-default.xml file is now bundled into the HBase jar and read from there. If you would like to review the content of this file, see it in the src tree at src/main/resources/hbase-default.xml or see HBase Default Configuration.

一定要删除hbase-default。从您的conf目录中的xml升级。0.20。该文件的x版本将有0.90的次优化配置。x HBase。hbase-default。xml文件现在被绑定到HBase jar中并从那里读取。如果您想查看该文件的内容,请参见src/main/resources/hbase-default中的src树。xml或查看HBase默认配置。

Finally, if upgrading from 0.20.x, check your .META. schema in the shell. In the past we would recommend that users run with a 16kb MEMSTORE_FLUSHSIZE. Run

最后,如果从0.20升级。x,检查你的.META。模式的壳。在过去,我们建议用户使用16kb的MEMSTORE_FLUSHSIZE运行。运行

hbase> scan '-ROOT-'

in the shell. This will output the current .META. schema. Check MEMSTORE_FLUSHSIZE size. Is it 16kb (16384)? If so, you will need to change this (The 'normal'/default value is 64MB (67108864)). Run the script bin/set_meta_memstore_size.rb. This will make the necessary edit to your .META. schema. Failure to run this change will make for a slow cluster. See HBASE-3499 Users upgrading to 0.90.0 need to have their .META. table updated with the right MEMSTORE_SIZE.

带壳的。这将输出当前的. meta。模式。检查MEMSTORE_FLUSHSIZE大小。这是16 kb(16384)吗?如果是,您将需要更改这个(“正常”/默认值是64MB(67108864))。bin / set_meta_memstore_size.rb运行脚本。这将对您的. meta进行必要的编辑。模式。未能运行此更改将导致集群的缓慢。看到HBASE-3499用户升级到0.90.0需要有他们的. meta。表更新了正确的MEMSTORE_SIZE。

The Apache HBase Shell

Apache HBase壳

The Apache HBase Shell is (J)Ruby's IRB with some HBase particular commands added. Anything you can do in IRB, you should be able to do in the HBase Shell.

Apache HBase Shell是(J)Ruby的IRB,添加了一些HBase特定命令。在IRB中可以做的任何事情,都可以在HBase Shell中进行。

To run the HBase shell, do as follows:

要运行HBase shell,请执行以下操作:

$ ./bin/hbase shell

Type help and then <RETURN> to see a listing of shell commands and options. Browse at least the paragraphs at the end of the help output for the gist of how variables and command arguments are entered into the HBase shell; in particular note how table names, rows, and columns, etc., must be quoted.

类型帮助,然后 <返回> 查看shell命令和选项的列表。至少浏览帮助输出末尾的段落,以了解如何将变量和命令参数输入到HBase shell中;特别要注意,必须引用表名、行和列等等。

See shell exercises for example basic shell operation.

参见shell练习,例如基本的shell操作。

Here is a nicely formatted listing of all shell commands by Rajeshbabu Chintaguntla.

下面是Rajeshbabu Chintaguntla对所有shell命令的良好格式化的列表。

14. Scripting with Ruby

14。使用Ruby脚本

For examples scripting Apache HBase, look in the HBase bin directory. Look at the files that end in *.rb. To run one of these files, do as follows:

例如,使用HBase bin目录来编写Apache HBase脚本。看看在*.rb中结束的文件。要运行其中一个文件,请执行以下操作:

$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT

15. Running the Shell in Non-Interactive Mode

15。在非交互模式下运行Shell。

A new non-interactive mode has been added to the HBase Shell (HBASE-11658). Non-interactive mode captures the exit status (success or failure) of HBase Shell commands and passes that status back to the command interpreter. If you use the normal interactive mode, the HBase Shell will only ever return its own exit status, which will nearly always be 0 for success.

在HBase Shell中添加了一个新的非交互模式(HBase -11658)。非交互模式捕获HBase Shell命令的退出状态(成功或失败),并将该状态传递回命令解释器。如果使用正常的交互模式,HBase Shell将只返回它自己的退出状态,这几乎总是为0。

To invoke non-interactive mode, pass the -n or --non-interactive option to HBase Shell.

要调用非交互模式,将-n或-非交互式选项传递给HBase Shell。

16. HBase Shell in OS Scripts

16。操作系统脚本中的HBase Shell。

You can use the HBase shell from within operating system script interpreters like the Bash shell which is the default command interpreter for most Linux and UNIX distributions. The following guidelines use Bash syntax, but could be adjusted to work with C-style shells such as csh or tcsh, and could probably be modified to work with the Microsoft Windows script interpreter as well. Submissions are welcome.

您可以在操作系统脚本解释器中使用HBase shell,如Bash shell,它是大多数Linux和UNIX发行版的缺省命令解释器。下面的指导方针使用Bash语法,但是可以调整为使用c风格的shell,例如csh或tcsh,也可以修改为与Microsoft Windows脚本解释器一起工作。欢迎提交。

Spawning HBase Shell commands in this way is slow, so keep that in mind when you are deciding whether combining HBase operations with the operating system command line is appropriate.
Example 7. Passing Commands to the HBase Shell

You can pass commands to the HBase Shell in non-interactive mode (see hbase.shell.noninteractive) using the echo command and the | (pipe) operator. Be sure to escape characters in the HBase commands which would otherwise be interpreted by the shell. Some debug-level output has been truncated from the example below.

您可以使用echo命令和|(管道)操作符,以非交互模式将命令传递给HBase Shell(参见hbase.shell.noninteractive)。一定要在HBase命令中转义字符,否则将被shell解释。一些调试级别的输出已经从下面的示例中截断。

$ echo "describe 'test1'" | ./hbase shell -n

Version 0.98.3-hadoop2, rd5e65a9144e315bb0a964e7730871af32f5018d5, Sat May 31 19:56:09 PDT 2014

describe 'test1'

DESCRIPTION                                          ENABLED
 'test1', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NON true
 E', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
  VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIO
 NS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =>
 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false'
 , BLOCKCACHE => 'true'}
1 row(s) in 3.2410 seconds

To suppress all output, echo it to /dev/null:

为了抑制所有输出,将其echo到/dev/null:

$ echo "describe 'test'" | ./hbase shell -n > /dev/null 2>&1
Example 8. Checking the Result of a Scripted Command

Since scripts are not designed to be run interactively, you need a way to check whether your command failed or succeeded. The HBase shell uses the standard convention of returning a value of 0 for successful commands, and some non-zero value for failed commands. Bash stores a command’s return value in a special environment variable called $?. Because that variable is overwritten each time the shell runs any command, you should store the result in a different, script-defined variable.

由于脚本不是用来交互运行的,所以您需要一种方法来检查您的命令是否失败或成功。HBase shell使用标准约定,返回值为0的成功命令,以及一些失败命令的非零值。Bash在一个名为$?的特殊环境变量中存储一个命令的返回值。因为每次shell运行任何命令时,该变量都被覆盖,所以您应该将结果存储在一个不同的、脚本定义的变量中。

This is a naive script that shows one way to store the return value and make a decision based upon it.

这是一个简单的脚本,它显示了一种存储返回值的方法,并基于它做出决策。

#!/bin/bash

# Run an HBase Shell command non-interactively and capture its exit status.
echo "describe 'test'" | ./hbase shell -n > /dev/null 2>&1
status=$?
echo "The status was " $status
if [ "$status" -eq 0 ]; then
    echo "The command succeeded"
else
    echo "The command may have failed."
fi
exit $status

16.1. Checking for Success or Failure In Scripts

16.1。检查脚本中的成功或失败。

Getting an exit code of 0 means that the command you scripted definitely succeeded. However, getting a non-zero exit code does not necessarily mean the command failed. The command could have succeeded, but the client lost connectivity, or some other event obscured its success. This is because RPC commands are stateless. The only way to be sure of the status of an operation is to check. For instance, if your script creates a table, but returns a non-zero exit value, you should check whether the table was actually created before trying again to create it.

获取0的退出代码意味着您所编写的命令一定成功。然而,获取非零的退出代码并不一定意味着命令失败。该命令本来可以成功,但是客户端失去了连接,或者其他一些事件掩盖了它的成功。这是因为RPC命令是无状态的。唯一确定操作状态的方法是检查。例如,如果您的脚本创建了一个表,但是返回一个非零的退出值,那么您应该检查表是否在再次尝试创建它之前被创建。

17. Read HBase Shell Commands from a Command File

17所示。从命令文件读取HBase Shell命令。

You can enter HBase Shell commands into a text file, one command per line, and pass that file to the HBase Shell.

您可以将HBase Shell命令输入到一个文本文件中,每一行一个命令,并将该文件传递给HBase Shell。

Example 9. Example Command File
create 'test', 'cf'
list 'test'
put 'test', 'row1', 'cf:a', 'value1'
put 'test', 'row2', 'cf:b', 'value2'
put 'test', 'row3', 'cf:c', 'value3'
put 'test', 'row4', 'cf:d', 'value4'
scan 'test'
get 'test', 'row1'
disable 'test'
enable 'test'
Example 10. Directing HBase Shell to Execute the Commands

Pass the path to the command file as the only argument to the hbase shell command. Each command is executed and its output is shown. If you do not include the exit command in your script, you are returned to the HBase shell prompt. There is no way to programmatically check each individual command for success or failure. Also, though you see the output for each command, the commands themselves are not echoed to the screen so it can be difficult to line up the command with its output.

将路径传递到命令文件,作为hbase shell命令的惟一参数。执行每个命令并显示其输出。如果在脚本中不包含exit命令,则返回到HBase shell提示符。对于成功或失败,没有办法以编程方式检查每个单独的命令。另外,虽然您可以看到每个命令的输出,但是命令本身并没有响应到屏幕上,因此很难将命令与输出连接起来。

$ ./hbase shell ./sample_commands.txt
0 row(s) in 3.4170 seconds

TABLE
test
1 row(s) in 0.0590 seconds

0 row(s) in 0.1540 seconds

0 row(s) in 0.0080 seconds

0 row(s) in 0.0060 seconds

0 row(s) in 0.0060 seconds

ROW                   COLUMN+CELL
 row1                 column=cf:a, timestamp=1407130286968, value=value1
 row2                 column=cf:b, timestamp=1407130286997, value=value2
 row3                 column=cf:c, timestamp=1407130287007, value=value3
 row4                 column=cf:d, timestamp=1407130287015, value=value4
4 row(s) in 0.0420 seconds

COLUMN                CELL
 cf:a                 timestamp=1407130286968, value=value1
1 row(s) in 0.0110 seconds

0 row(s) in 1.5630 seconds

0 row(s) in 0.4360 seconds

18. Passing VM Options to the Shell

18岁。将VM选项传递给Shell。

You can pass VM options to the HBase Shell using the HBASE_SHELL_OPTS environment variable. You can set this in your environment, for instance by editing ~/.bashrc, or set it as part of the command to launch HBase Shell. The following example sets several garbage-collection-related variables, just for the lifetime of the VM running the HBase Shell. The command should be run all on a single line, but is broken by the \ character, for readability.

您可以使用HBASE_SHELL_OPTS环境变量将VM选项传递给HBase Shell。您可以在您的环境中设置这个,例如通过编辑~/。bashrc,或将其设置为启动HBase Shell的命令的一部分。下面的示例设置了几个垃圾收集相关的变量,仅用于运行HBase Shell的VM的生命周期。该命令应该在一行上运行,但是被\字符破坏,为了可读性。

$ HBASE_SHELL_OPTS="-verbose:gc -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps \
  -XX:+PrintGCDetails -Xloggc:$HBASE_HOME/logs/gc-hbase.log" ./bin/hbase shell

19. Shell Tricks

19所示。壳牌的技巧

19.1. Table variables

19.1。表变量

HBase 0.95 adds shell commands that provide jruby-style object-oriented references for tables. Previously, all of the shell commands that act upon a table had a procedural style that always took the name of the table as an argument. HBase 0.95 introduces the ability to assign a table to a jruby variable. The table reference can be used to perform data read/write operations such as puts, scans, and gets, as well as admin functionality such as disabling, dropping, and describing tables.

HBase 0.95添加了shell命令,它为表提供了jruby风格的面向对象引用。以前,在表上执行的所有shell命令都有一个过程式的样式,它总是以表的名称作为参数。HBase 0.95引入了将表分配给jruby变量的能力。表引用可用于执行数据读写操作,如put、扫描和获得良好的管理功能,如禁用、删除、描述表。

For example, previously you would always specify a table name:

例如,以前您总是指定一个表名:

hbase(main):000:0> create 't', 'f'
0 row(s) in 1.0970 seconds
hbase(main):001:0> put 't', 'rold', 'f', 'v'
0 row(s) in 0.0080 seconds

hbase(main):002:0> scan 't'
ROW                                COLUMN+CELL
 rold                              column=f:, timestamp=1378473207660, value=v
1 row(s) in 0.0130 seconds

hbase(main):003:0> describe 't'
DESCRIPTION                                                                           ENABLED
 't', {NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_ true
 SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2
 147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false
 ', BLOCKCACHE => 'true'}
1 row(s) in 1.4430 seconds

hbase(main):004:0> disable 't'
0 row(s) in 14.8700 seconds

hbase(main):005:0> drop 't'
0 row(s) in 23.1670 seconds

hbase(main):006:0>

Now you can assign the table to a variable and use the results in jruby shell code.

现在可以将表分配给一个变量,并使用jruby shell代码中的结果。

hbase(main):007 > t = create 't', 'f'
0 row(s) in 1.0970 seconds

=> Hbase::Table - t
hbase(main):008 > t.put 'r', 'f', 'v'
0 row(s) in 0.0640 seconds
hbase(main):009 > t.scan
ROW                           COLUMN+CELL
 r                            column=f:, timestamp=1331865816290, value=v
1 row(s) in 0.0110 seconds
hbase(main):010:0> t.describe
DESCRIPTION                                                                           ENABLED
 't', {NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_ true
 SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2
 147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false
 ', BLOCKCACHE => 'true'}
1 row(s) in 0.0210 seconds
hbase(main):038:0> t.disable
0 row(s) in 6.2350 seconds
hbase(main):039:0> t.drop
0 row(s) in 0.2340 seconds

If the table has already been created, you can assign a Table to a variable by using the get_table method:

如果已经创建了表,您可以使用get_table方法将一个表分配给一个变量:

hbase(main):011 > create 't','f'
0 row(s) in 1.2500 seconds

=> Hbase::Table - t
hbase(main):012:0> tab = get_table 't'
0 row(s) in 0.0010 seconds

=> Hbase::Table - t
hbase(main):013:0> tab.put 'r1', 'f', 'v'
0 row(s) in 0.0100 seconds
hbase(main):014:0> tab.scan
ROW                                COLUMN+CELL
 r1                                column=f:, timestamp=1378473876949, value=v
1 row(s) in 0.0240 seconds
hbase(main):015:0>

The list functionality has also been extended so that it returns a list of table names as strings. You can then use jruby to script table operations based on these names. The list_snapshots command also acts similarly.

列表功能也被扩展,因此它返回一个表名称列表作为字符串。然后,您可以根据这些名称使用jruby编写脚本表操作。list_snapshot命令也类似。

hbase(main):016 > tables = list('t.*')
TABLE
t
1 row(s) in 0.1040 seconds

=> #<#<Class:0x7677ce29>:0x21d377a4>
hbase(main):017:0> tables.map { |t| disable t ; drop  t}
0 row(s) in 2.2510 seconds

=> [nil]
hbase(main):018:0>

19.2. irbrc

19.2。irbrc

Create an .irbrc file for yourself in your home directory. Add customizations. A useful one is command history, so commands are saved across Shell invocations:

在您的主目录中为自己创建一个.irbrc文件。添加自定义。一个有用的命令是命令历史,所以命令可以通过Shell调用保存:

$ more .irbrc
require 'irb/ext/save-history'
IRB.conf[:SAVE_HISTORY] = 100
IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"

See the ruby documentation of .irbrc to learn about other possible configurations.

请参阅.irbrc的ruby文档了解其他可能的配置。

19.3. LOG data to timestamp

19.3。日志数据的时间戳

To convert the date '08/08/16 20:56:29' from an hbase log into a timestamp, do:

要将日期“08/08/16 20:56:29”从hbase日志转换为时间戳,请执行:

hbase(main):021:0> import java.text.SimpleDateFormat
hbase(main):022:0> import java.text.ParsePosition
hbase(main):023:0> SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("08/08/16 20:56:29", ParsePosition.new(0)).getTime() => 1218920189000

To go the other direction:

走向另一个方向:

hbase(main):021:0> import java.util.Date
hbase(main):022:0> Date.new(1218920189000).toString() => "Sat Aug 16 20:56:29 UTC 2008"

To output in a format that is exactly like that of the HBase log format will take a little messing with SimpleDateFormat.

要以与HBase日志格式完全相同的格式输出,将会使用SimpleDateFormat来进行一些干扰。

19.4. Query Shell Configuration

19.4。查询外壳配置

hbase(main):001:0> @shell.hbase.configuration.get("hbase.rpc.timeout")
=> "60000"

To set a config in the shell:

在shell中设置一个配置:

hbase(main):005:0> @shell.hbase.configuration.setInt("hbase.rpc.timeout", 61010)
hbase(main):006:0> @shell.hbase.configuration.get("hbase.rpc.timeout")
=> "61010"

19.5. Pre-splitting tables with the HBase Shell

19.5。与HBase Shell的预分解表。

You can use a variety of options to pre-split tables when creating them via the HBase Shell create command.

在通过HBase Shell创建命令时,可以使用各种选项来预分解表。

The simplest approach is to specify an array of split points when creating the table. Note that when specifying string literals as split points, these will create split points based on the underlying byte representation of the string. So when specifying a split point of '10', we are actually specifying the byte split point '\x31\x30'.

最简单的方法是在创建表时指定分割点的数组。注意,当将字符串文本指定为分割点时,它们将基于字符串的底层字节表示创建分叉点。因此,当指定“10”的分叉点时,我们实际上指定了字节分割点“\x31\30”。

The split points will define n+1 regions where n is the number of split points. The lowest region will contain all keys from the lowest possible key up to but not including the first split point key. The next region will contain keys from the first split point up to, but not including the next split point key. This will continue for all split points up to the last. The last region will be defined from the last split point up to the maximum possible key.

分割点将定义n+1个区域,其中n为分裂点的个数。最低的区域将包含所有键,从最低的可能的关键到但不包括第一个分裂点的关键。下一个区域将包含从第一个分裂点到,但不包括下一个拆分点键的键。这将会持续到最后。最后一个区域将从最后一个分割点定义到最大可能的密钥。

hbase>create 't1','f',SPLITS => ['10','20','30']

In the above example, the table 't1' will be created with column family 'f', pre-split to four regions. Note that the first region will contain all keys from '\x00' up to, but not including, the first split point '\x31\x30' (as '\x31' is the ASCII code for '1' and '\x30' for '0').

在上面的例子中,表“t1”将用列族“f”创建,并预先划分为四个区域。注意,第一个区域将包含“\x00”到“\x30”的所有键(“\x31”是“1”的ASCII码)。

You can pass the split points in a file using following variation. In this example, the splits are read from a file corresponding to the local path on the local filesystem. Each line in the file specifies a split point key.

您可以使用以下变体在文件中传递分割点。在本例中,将从与本地文件系统上的本地路径对应的文件中读取分割。文件中的每一行都指定了一个拆分点键。

hbase>create 't14','f',SPLITS_FILE=>'splits.txt'

The other options are to automatically compute splits based on a desired number of regions and a splitting algorithm. HBase supplies algorithms for splitting the key range based on uniform splits or based on hexadecimal keys, but you can provide your own splitting algorithm to subdivide the key range.

其他选项是根据所需的区域数量和分割算法自动计算分割。HBase提供了基于均匀分割或基于十六进制键来分割密钥范围的算法,但是您可以提供自己的分裂算法来细分密钥范围。

# create table with four regions based on random bytes keys
hbase>create 't2','f1', { NUMREGIONS => 4 , SPLITALGO => 'UniformSplit' }

# create table with five regions based on hex keys
hbase>create 't3','f1', { NUMREGIONS => 5, SPLITALGO => 'HexStringSplit' }

As the HBase Shell is effectively a Ruby environment, you can use simple Ruby scripts to compute splits algorithmically.

由于HBase Shell实际上是一个Ruby环境,所以您可以使用简单的Ruby脚本来计算分割算法。

# generate splits for long (Ruby fixnum) key range from start to end key
hbase(main):070:0> def gen_splits(start_key,end_key,num_regions)
hbase(main):071:1>   results=[]
hbase(main):072:1>   range=end_key-start_key
hbase(main):073:1>   incr=(range/num_regions).floor
hbase(main):074:1>   for i in 1 .. num_regions-1
hbase(main):075:2>     results.push([i*incr+start_key].pack("N"))
hbase(main):076:2>   end
hbase(main):077:1>   return results
hbase(main):078:1> end
hbase(main):079:0>
hbase(main):080:0> splits=gen_splits(1,2000000,10)
=> ["\000\003\r@", "\000\006\032\177", "\000\t'\276", "\000\f4\375", "\000\017B<", "\000\022O{", "\000\025\\\272", "\000\030i\371", "\000\ew8"]
hbase(main):081:0> create 'test_splits','f',SPLITS=>splits
0 row(s) in 0.2670 seconds

=> Hbase::Table - test_splits

Note that the HBase Shell command truncate effectively drops and recreates the table with default options which will discard any pre-splitting. If you need to truncate a pre-split table, you must drop and recreate the table explicitly to re-specify custom split options.

注意,HBase Shell命令截断了有效的删除,并使用默认选项重新创建表,该选项将丢弃任何预分解。如果您需要截断一个预分割表,您必须删除并重新创建表,以重新指定自定义的分割选项。
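
Pre-splitting can also be done from the Java client when you create a table programmatically. The following is a minimal sketch, assuming the HBaseAdmin API of this era; the table name 't1_java' is only illustrative, and the split keys mirror the earlier shell example.

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
byte[][] splits = new byte[][] {
  Bytes.toBytes("10"), Bytes.toBytes("20"), Bytes.toBytes("30")
};
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1_java"));
desc.addFamily(new HColumnDescriptor("f"));
admin.createTable(desc, splits);  // three split points => four regions, as in the shell example above
admin.close();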

19.6. Debug

19.6。调试

19.6.1. Shell debug switch

19.6.1。壳牌调试开关

You can set a debug switch in the shell to see more output — e.g. more of the stack trace on exception — when you run a command:

您可以在shell中设置一个调试开关以查看更多的输出。当您运行一个命令时,更多的堆栈跟踪是异常的:

hbase> debug <RETURN>

19.6.2. DEBUG log level

19.6.2。调试日志级别

To enable DEBUG level logging in the shell, launch it with the -d option.

要在shell中启用调试级别的日志记录,可以使用-d选项启动它。

$ ./bin/hbase shell -d

19.7. Commands

19.7。命令

19.7.1. count

19.7.1。数

The count command returns the number of rows in a table. It’s quite fast when configured with the right CACHE setting.

Count命令返回表中的行数。配置正确的缓存时速度非常快。

hbase> count '<tablename>', CACHE => 1000

The above count fetches 1000 rows at a time. Set CACHE lower if your rows are big. Default is to fetch one row at a time.

以上计数一次取1000行。如果行很大,则设置缓存更低。默认是一次取一行。

Data Model

数据模型

In HBase, data is stored in tables, which have rows and columns. This is a terminology overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.

在HBase中,数据存储在表中,表中有行和列。这是一个与关系数据库(RDBMSs)重叠的术语,但这不是一个有用的类比。相反,可以将HBase表看作多维映射。

HBase Data Model Terminology
Table

An HBase table consists of multiple rows.

一个HBase表由多个行组成。

Row

A row in HBase consists of a row key and one or more columns with values associated with them. Rows are sorted alphabetically by the row key as they are stored. For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other. A common row key pattern is a website domain. If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain.

HBase中的一行包含一行键和一个或多个带有与之关联的值的列。行按字母顺序排序,按行键存储。因此,行键的设计非常重要。其目标是存储数据,以使相关的行彼此相邻。公共行键模式是一个网站域。如果您的行键是域,则应该将它们存储在反向(org.apache)中。www,表示。邮件,org.apache.jira)。这样,所有的Apache域都在表中彼此相邻,而不是基于子域的第一个字母展开。

Column

A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.

HBase中的一个列由一个列族和一个列限定符组成,它由一个:(冒号)字符分隔。

Column Family

Column families physically colocate a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.

列家庭在物理上对一组列和它们的值进行物理colocate,通常是出于性能方面的原因。每个列家族都有一组存储属性,比如它的值是否应该缓存在内存中,它的数据是如何被压缩的,或者它的行键是编码的,等等。表中的每一行都有相同的列家族,尽管给定的行可能不会存储在给定列家族中的任何东西。

Column Qualifier

A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, a column qualifier might be content:html, and another might be content:pdf. Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.

将列限定符添加到列家族中,以提供给定数据块的索引。给定一个列的家庭内容,一个列限定符可能是内容:html,另一个可能是内容:pdf。虽然列族在表创建时是固定的,但是列限定符是可变的,在行之间可能有很大的不同。

Cell

A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value’s version.

单元格是行、列族和列限定符的组合,包含一个值和一个时间戳,它表示值的版本。

Timestamp

A timestamp is written alongside each value, and is the identifier for a given version of a value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.

时间戳是在每个值旁边写的,是一个给定版本的值的标识符。默认情况下,时间戳表示数据写入时区域服务器上的时间,但您可以在将数据放入单元格时指定不同的时间戳值。

20. Conceptual View

20.概念视图

You can read a very understandable explanation of the HBase data model in the blog post Understanding HBase and BigTable by Jim R. Wilson. Another good explanation is available in the PDF Introduction to Basic Schema Design by Amandeep Khurana.

您可以通过Jim R. Wilson的博客文章了解HBase和BigTable中HBase数据模型的一个非常容易理解的解释。另一个很好的解释是,Amandeep Khurana的基本模式设计的PDF介绍。

It may help to read different perspectives to get a solid understanding of HBase schema design. The linked articles cover the same ground as the information in this section.

它可能有助于阅读不同的透视图以获得对HBase模式设计的坚实理解。链接的文章与本节中的信息相同。

The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people. In this example, for the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). This example contains 5 versions of the row with the row key com.cnn.www, and one version of the row with the row key com.example.www. The contents:html column qualifier contains the entire HTML of a given website. Qualifiers of the anchor column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link. The people column family represents people associated with the site.

下面的示例是BigTable文件第2页上的一个稍微修改过的表单。有一个名为webtable的表,它包含两行(com.cn .www和com.example.www)和三个列的名为contents、anchor和people的列。在本例中,对于第一行(com.cn .www),锚包含两列(anchor:cssnsi.com, anchor:my.look.ca),内容包含一个列(内容:html)。这个示例包含5个版本的行和行键com.cn .www,以及一行与行键com.example.www的一个版本。内容:html列限定符包含给定网站的整个html。锚列族的限定符每个包含外部站点,该站点链接到由行表示的站点,以及它在链接的锚中使用的文本。people栏目组代表与网站相关的人。

Column Names

By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents and the html qualifier. The colon character (:) delimits the column family from the column family qualifier.

按惯例,列名称由其列族前缀和限定符组成。例如,列内容:html是由列家族内容和html限定符组成的。冒号(:)将列族从列族限定符中分离出来。

Table 5. Table webtable
Row Key             Time Stamp   ColumnFamily contents       ColumnFamily anchor             ColumnFamily people
"com.cnn.www"       t9                                       anchor:cnnsi.com = "CNN"
"com.cnn.www"       t8                                       anchor:my.look.ca = "CNN.com"
"com.cnn.www"       t6           contents:html = "<html>…"
"com.cnn.www"       t5           contents:html = "<html>…"
"com.cnn.www"       t3           contents:html = "<html>…"
"com.example.www"   t5           contents:html = "<html>…"                                   people:author = "John Doe"

Cells in this table that appear to be empty do not take space, or in fact exist, in HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to look at data in HBase, or even the most accurate. The following represents the same information as a multi-dimensional map. This is only a mock-up for illustrative purposes and may not be strictly accurate.

这个表中显示为空的单元格不占用空间,或者实际上存在于HBase中。这就是为什么HBase是“稀疏的”。表格视图并不是查看HBase中数据的唯一方法,甚至是最准确的数据。下面是与多维映射相同的信息。这只是为说明目的而做的模型,而且可能不是严格的精确。

{
  "com.cnn.www": {
    contents: {
      t6: contents:html: "<html>..."
      t5: contents:html: "<html>..."
      t3: contents:html: "<html>..."
    }
    anchor: {
      t9: anchor:cnnsi.com = "CNN"
      t8: anchor:my.look.ca = "CNN.com"
    }
    people: {}
  }
  "com.example.www": {
    contents: {
      t5: contents:html: "<html>..."
    }
    anchor: {}
    people: {
      t5: people:author: "John Doe"
    }
  }
}

21. Physical View

21。物理视图

Although at a conceptual level tables may be viewed as a sparse set of rows, they are physically stored by column family. A new column qualifier (column_family:column_qualifier) can be added to an existing column family at any time.

尽管在概念级别的表中可以看作是一组稀疏的行,但它们实际上是由列族存储的。一个新的列限定符(column_family:column_qualifier)可以在任何时候添加到现有的列家族中。

Table 6. ColumnFamily anchor
Row Key          Time Stamp   Column Family anchor
"com.cnn.www"    t9           anchor:cnnsi.com = "CNN"
"com.cnn.www"    t8           anchor:my.look.ca = "CNN.com"

Table 7. ColumnFamily contents
Row Key          Time Stamp   ColumnFamily contents:
"com.cnn.www"    t6           contents:html = "<html>…"
"com.cnn.www"    t5           contents:html = "<html>…"
"com.cnn.www"    t3           contents:html = "<html>…"

The empty cells shown in the conceptual view are not stored at all. Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned. Given multiple versions, the most recent is also the first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of contents:html from timestamp t6, the value of anchor:cnnsi.com from timestamp t9, the value of anchor:my.look.ca from timestamp t8.

在概念视图中显示的空单元根本没有存储。因此,对内容的值的请求:在时间戳t8上的html列将返回无值。类似地,对锚的请求:my.look。时间戳t9的ca值不会返回任何值。但是,如果没有提供时间戳,则返回特定列的最新值。给定多个版本,最近的版本也是第一个发现的,因为时间戳是按降序存储的。因此,如果没有指定时间戳,则请求行com.cn .cn .www中的所有列的值:内容的值:时间戳t6的html值,锚的值:时间戳t9的cnnsi.com,锚的值:my.look。从时间戳t8 ca。
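
A hedged Java sketch of the same behavior against the example webtable (the table instance and the t8 timestamp variable are stand-ins for real values):

// No timestamp supplied: the newest stored version of contents:html comes back (t6 above).
Get latestGet = new Get(Bytes.toBytes("com.cnn.www"));
latestGet.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
Result latest = table.get(latestGet);

// Explicit timestamp t8: no contents:html cell was written at t8, so the result is empty.
Get atT8 = new Get(Bytes.toBytes("com.cnn.www"));
atT8.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
atT8.setTimeStamp(t8);
Result empty = table.get(atT8);  // empty.isEmpty() would be true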

For more information about the internals of how Apache HBase stores data, see regions.arch.

有关Apache HBase存储数据的内部情况的更多信息,请参见区域。arch。

22. Namespace

22。名称空间

A namespace is a logical grouping of tables analogous to a database in relation database systems. This abstraction lays the groundwork for upcoming multi-tenancy related features:

名称空间是表的逻辑分组,类似于数据库系统中的数据库。这种抽象为即将到来的多租户相关特性奠定了基础:

  • Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions, tables) a namespace can consume.

    配额管理(HBASE-8410)限制了名称空间可以使用的资源数量(即区域、表)。

  • Namespace Security Administration (HBASE-9206) - Provide another level of security administration for tenants.

    名称空间安全管理(HBASE-9206)——为租户提供另一个级别的安全管理。

  • Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset of RegionServers thus guaranteeing a coarse level of isolation.

    区域服务器组(HBASE-6721) -一个名称空间/表可以被固定在区域服务器的子集上,从而保证了一个粗糙的隔离级别。

22.1. Namespace management

22.1。命名空间管理

A namespace can be created, removed or altered. Namespace membership is determined during table creation by specifying a fully-qualified table name of the form:

可以创建、删除或修改名称空间。在表创建期间,通过指定表单的完全限定表名来确定名称空间成员:

<table namespace>:<table qualifier>
Example 11. Examples
#Create a namespace
create_namespace 'my_ns'
#create my_table in my_ns namespace
create 'my_ns:my_table', 'fam'
#drop namespace
drop_namespace 'my_ns'
#alter namespace
alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}

22.2. Predefined namespaces

22.2。预定义的名称空间

There are two predefined special namespaces:

有两个预定义的特殊名称空间:

  • hbase - system namespace, used to contain HBase internal tables

    hbase -系统名称空间,用于包含hbase内部表。

  • default - tables with no explicit specified namespace will automatically fall into this namespace

    没有显式指定名称空间的默认表将自动归入这个名称空间。

Example 12. Examples
#namespace=foo and table qualifier=bar
create 'foo:bar', 'fam'

#namespace=default and table qualifier=bar
create 'bar', 'fam'

23. Table

23。表

Tables are declared up front at schema definition time.

表在模式定义时间内被声明。

24. Row

24。行

Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest order appearing first in a table. The empty byte array is used to denote both the start and end of a tables' namespace.

行键是未解释的字节。行是按字母顺序排序的,顺序是表中出现的最低顺序。空字节数组用于表示表名称空间的开始和结束。

25. Column Family

25。列族

Columns in Apache HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time whereas columns do not need to be defined at schema time but can be conjured on the fly while the table is up and running.

Apache HBase中的列被分组为列族。列家族的所有列成员都具有相同的前缀。例如,列课程:历史和课程:数学是课程列家庭的成员。冒号(:)将列族从列族限定符中分离出来。列族前缀必须由可打印字符组成。符合条件的尾部,列家庭限定符,可以由任意的字节组成。列族必须在模式定义时间内声明,而列不需要在模式时间内定义,但是可以在表启动和运行时动态地转换。

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

物理上,所有列家族成员都存储在文件系统中。由于在列家族级别上完成了调优和存储规范,因此建议所有列家庭成员都具有相同的通用访问模式和大小特性。
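
As a concrete sketch of "declared up front" versus "conjured on the fly", the following assumes the HBaseAdmin/HColumnDescriptor API of this era; the table name, row key, and qualifiers are only illustrative. The courses family must exist before the table is used, but the history and math qualifiers are created simply by writing to them.

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("myTable"));
desc.addFamily(new HColumnDescriptor("courses").setMaxVersions(3));  // column families are fixed at creation
admin.createTable(desc);
admin.close();

Table table = ...  // instantiate a Table instance for 'myTable'
Put put = new Put(Bytes.toBytes("student1"));
put.add(Bytes.toBytes("courses"), Bytes.toBytes("history"), Bytes.toBytes("A"));  // new qualifier, no schema change needed
put.add(Bytes.toBytes("courses"), Bytes.toBytes("math"), Bytes.toBytes("B"));     // another qualifier, same family
table.put(put);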

26. Cells

26岁。细胞

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.

{行,列,版本}元组精确地指定了HBase中的单元格。单元格内容是未解释的字节。

27. Data Model Operations

27。数据模型操作

The four primary data model operations are Get, Put, Scan, and Delete. Operations are applied via Table instances.

四个主要的数据模型操作是Get、Put、Scan和Delete。操作通过表实例应用。
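
The examples in the following sections assume a Table instance is already at hand. Here is a minimal sketch, assuming the Connection/ConnectionFactory client API (older clients used HConnectionManager and HTableInterface instead) and a table named "test" for illustration:

Configuration conf = HBaseConfiguration.create();                  // picks up hbase-site.xml from the classpath
Connection connection = ConnectionFactory.createConnection(conf);  // heavyweight; share and reuse
Table table = connection.getTable(TableName.valueOf("test"));      // lightweight; obtain per use
try {
  // Get, Put, Scan, and Delete operations are applied via this Table instance
} finally {
  table.close();
  connection.close();
}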

27.1. Get

27.1。得到

Get returns attributes for a specified row. Gets are executed via Table.get.

获取指定行的返回属性。get通过表执行。

27.2. Put

27.2。把

Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Puts are executed via Table.put (non-writeBuffer) or Table.batch (non-writeBuffer).

将新行添加到表中(如果键是新的),或者可以更新现有的行(如果键已经存在)。put是通过表执行的。把(non-writeBuffer)或表。批处理(non-writeBuffer)

27.3. Scans

27.3。扫描

Scan allows iteration over multiple rows for specified attributes.

扫描允许对指定属性的多行进行迭代。

The following is an example of a Scan on a Table instance. Assume that a table is populated with rows with keys "row1", "row2", "row3", and then another set of rows with the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan instance to return the rows beginning with "row".

下面是对表实例进行扫描的示例。假设一个表中包含有键“row1”、“row2”、“row3”,然后是另一组带有“abc1”、“abc2”和“abc3”的行。下面的示例演示如何设置扫描实例以返回以“row”开头的行。

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...

Table table = ...      // instantiate a Table instance

Scan scan = new Scan();
scan.addColumn(CF, ATTR);
scan.setRowPrefixFilter(Bytes.toBytes("row"));
ResultScanner rs = table.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}

Note that generally the easiest way to specify a specific stop point for a scan is by using the InclusiveStopFilter class.

请注意,通常为扫描指定特定的停止点的最简单方法是使用InclusiveStopFilter类。
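
For example, a minimal sketch reusing the CF/ATTR constants and table instance from the example above, scanning from "row1" up to and including "row5" (the row keys are illustrative):

Scan scan = new Scan();
scan.addColumn(CF, ATTR);
scan.setStartRow(Bytes.toBytes("row1"));                         // first row to return (inclusive)
scan.setFilter(new InclusiveStopFilter(Bytes.toBytes("row5")));  // stop row, itself included in the results
ResultScanner rs = table.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();
}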

27.4. Delete

27.4。删除

Delete removes a row from a table. Deletes are executed via Table.delete.

从表中删除一行。删除是通过表执行的。
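
A minimal sketch of a whole-row delete, reusing the table instance from the earlier examples (the row key is illustrative):

Delete delete = new Delete(Bytes.toBytes("row1"));  // with no further qualifiers, the entire row is deleted
table.delete(delete);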

HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.

HBase不修改数据,因此删除是通过创建称为tombstone的新标记来处理的。这些墓碑,连同死去的价值观,都被清理干净了。

See version.delete for more information on deleting versions of columns, and see compaction for more information on compactions.

请参阅版本。删除更多关于删除列版本的信息,并查看compaction以获得关于compaction的更多信息。

28. Versions

28。版本

A {row, column, version} tuple exactly specifies a cell in HBase. It’s possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

{行,列,版本}元组精确地指定了HBase中的单元格。在行和列相同的情况下,可能有一个无限数量的单元格,但单元地址仅在其版本维度上有所不同。

While rows and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.

虽然行和列键表示为字节,但使用长整数指定版本。通常,这段时间包含一些时间实例,例如java.util.Date.getTime()或System.currentTimeMillis()所返回的时间实例,即:在当前时间和午夜、1970年1月1日和1970年1月1日之间,以毫秒计算的差异。

The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.

HBase版本维度存储在递减顺序中,因此当从存储文件读取时,首先会发现最近的值。

There is a lot of confusion over the semantics of cell versions, in HBase. In particular:

在HBase中,对单元格的语义有很多混淆。特别是:

  • If multiple writes to a cell have the same version, only the last written is fetchable.

    如果多个写入到一个单元格具有相同的版本,则只有最后一个写的是fetchable。

  • It is OK to write cells in a non-increasing version order.

    在不增加的版本中编写单元格是可以的。

Below we describe how the version dimension in HBase currently works. See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.

下面我们将介绍HBase中的版本维度是如何工作的。有关HBase版本的讨论,请参见HBase -2406。在HBase中的弯曲时间可以在HBase中对版本或时间维度进行良好的读取。它比这里提供的版本更详细。在撰写本文时,限制在文章中提到的现有时间戳中覆盖的值在HBase中不再有效。这一节基本上是布鲁诺·杜蒙的这篇文章的梗概。

28.1. Specifying the Number of Versions to Store

28.1。指定要存储的版本的数量。

The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an alter command, via HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number of versions kept was 3, but in 0.96 and newer it has been changed to 1.

为给定列存储的最大版本数是列模式的一部分,并且是在表创建中指定的,或者通过hcolumndescriptor.default_version通过alter命令指定。在HBase 0.96之前,保留的版本的默认数量是3,但是在0.96和更新的版本中被更改为1。

Example 13. Modify the Maximum Number of Versions for a Column Family

This example uses HBase Shell to keep a maximum of 5 versions of all columns in column family f1. You could also use HColumnDescriptor.

这个示例使用HBase Shell在列家族f1中保留最多5个版本的所有列。也可以使用HColumnDescriptor。

hbase> alter 't1', NAME => 'f1', VERSIONS => 5
Example 14. Modify the Minimum Number of Versions for a Column Family

You can also specify the minimum number of versions to store per column family. By default, this is set to 0, which means the feature is disabled. The following example sets the minimum number of versions on all columns in column family f1 to 2, via HBase Shell. You could also use HColumnDescriptor.

您还可以指定每个列家族存储的版本的最小数量。默认情况下,这个值设置为0,这意味着该特性是禁用的。下面的示例通过HBase Shell将列族f1到2的所有列的最小版本数设置为2。也可以使用HColumnDescriptor。

hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2
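
The same changes can be sketched programmatically with the HBaseAdmin/HColumnDescriptor API of this era. Note that supplying a fresh HColumnDescriptor replaces the family descriptor, so any other non-default settings on 'f1' would need to be re-specified; the disable/enable steps are shown because online schema alter is off by default.

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HColumnDescriptor f1 = new HColumnDescriptor("f1");
f1.setMaxVersions(5);  // keep at most 5 versions per cell
f1.setMinVersions(2);  // keep at least 2 versions, even when TTL would otherwise expire them
admin.disableTable(TableName.valueOf("t1"));
admin.modifyColumn(TableName.valueOf("t1"), f1);
admin.enableTable(TableName.valueOf("t1"));
admin.close();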

Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting hbase.column.max.version in hbase-site.xml. See hbase.column.max.version.

从HBase 0.98.2开始,通过设置HBase . columnmax,您可以为所有新建列保留的最大版本数指定一个全局缺省值。在hbase-site.xml版本。看到hbase.column.max.version。

28.2. Versions and HBase Operations

28.2。版本和HBase操作

In this section we look at the behavior of the version dimension for each of the core HBase operations.

在本节中,我们将查看每个核心HBase操作的版本维度的行为。

28.2.1. Get/Scan

28.2.1。Get /扫描

Gets are implemented on top of Scans. The below discussion of Get applies equally to Scans.

获取是在扫描之上实现的。下面的讨论同样适用于扫描。

By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:

默认情况下,即如果您指定没有显式的版本,当执行get时,其版本具有最大的值的单元格会返回(这可能是最新的一个,也可能不是最近的一个)。默认行为可以通过以下方式进行修改:

  • to return more than one version, see Get.setMaxVersions()

    要返回多个版本,请参阅Get.setMaxVersions()

  • to return versions other than the latest, see Get.setTimeRange()

    要返回最新的版本,请参阅Get.setTimeRange()

    To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1, as shown in the sketch below.

    要检索小于或等于给定值的最新版本,因此在某个时间点上给出记录的“最新”状态,只需使用从0到所需版本的范围,并将最大版本设置为1。
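
A hedged sketch of that point-in-time read, reusing the CF/ATTR constants from the neighboring examples; desiredTimestamp is an illustrative variable, and note that the upper bound of setTimeRange is exclusive:

Get get = new Get(Bytes.toBytes("row1"));
get.setTimeRange(0, desiredTimestamp + 1);  // versions in [0, desiredTimestamp], since the maximum is exclusive
get.setMaxVersions(1);                      // return only the newest version inside that range
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR);            // the value "as of" desiredTimestamp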

28.2.2. Default Get Example

28.2.2。默认有例子

The following Get will only retrieve the current version of the row.

下面的Get将只检索该行的当前版本。

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(Bytes.toBytes("row1"));
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR);  // returns current version of value

28.2.3. Versioned Get Example

28.2.3。版本化得到的例子

The following Get will return the last 3 versions of the row.

下面的Get将返回该行的最后3个版本。

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);  // will return last 3 versions of row
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR);  // returns current version of value
List<KeyValue> kv = r.getColumn(CF, ATTR);  // returns all versions of this column

28.2.4. Put

28.2.4。把

Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server’s currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

在某个时间戳中,执行put总是会创建一个新版本的单元格。默认情况下,系统使用服务器的currentTimeMillis,但是您可以在每列的级别上指定您自己的版本(=长整数)。这意味着您可以在过去或将来分配一个时间,或者将长值用于非时间目的。

To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you want to overwrite.

要覆盖现有的值,请执行与要覆盖的单元格相同的行、列和版本。

Implicit Version Example
隐式版本的例子

The following Put will be implicitly versioned by HBase with the current time.

下面的Put将由HBase在当前时间内隐式地版本化。

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Put put = new Put(Bytes.toBytes(row));
put.add(CF, ATTR, Bytes.toBytes( data));
table.put(put);
Explicit Version Example
明确的版本的例子

The following Put has the version timestamp explicitly set.

下面的Put是显式设置的版本时间戳。

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Put put = new Put( Bytes.toBytes(row));
long explicitTimeInMs = 555;  // just an example
put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));
table.put(put);

Caution: the version timestamp is used internally by HBase for things like time-to-live calculations. It’s usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or have the timestamp as a part of the row key, or both.

警告:版本时间戳是由HBase在内部使用的,用于计算实时计算。通常最好不要自己设置这个时间戳。更喜欢使用行的单独的时间戳属性,或者将时间戳作为行键的一部分,或者两者都使用。

28.2.5. Delete

28.2.5。删除

There are three different types of internal delete markers. See Lars Hofhansl’s blog for a discussion of his attempt at adding another, Scanning in HBase: Prefix Delete Marker. A sketch of how each type is issued through the client API follows the list.

有三种不同类型的内部删除标记。请参阅Lars Hofhansl的博客,讨论他的尝试添加另一个,在HBase中扫描:前缀删除标记。

  • Delete: for a specific version of a column.

    删除:针对某一列的特定版本。

  • Delete column: for all versions of a column.

    删除列:对于所有版本的列。

  • Delete family: for all columns of a particular ColumnFamily

    删除家庭:适用于特定列的所有列。
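As an illustration only, the three marker types map onto the client-side Delete API roughly as follows. This sketch uses the pre-1.0 method names that match the other examples in this chapter (later clients expose the same operations as addColumn/addColumns/addFamily), and in practice you would normally issue only one of them per Delete:

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumn(CF, ATTR, 555L);  // Delete: one specific version (timestamp 555) of a column
// delete.deleteColumns(CF, ATTR);    // Delete column: all versions of a column
// delete.deleteFamily(CF);           // Delete family: all columns of the ColumnFamily
table.delete(delete);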

When deleting an entire row, HBase will internally create a tombstone for each ColumnFamily (i.e., not each individual column).

当删除整个行时,HBase将在内部为每个ColumnFamily创建一个tombstone(即:,不是每一栏。

Deletes work by creating tombstone markers. For example, let’s suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is delete all cells where the version is less than or equal to this version. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values. When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.

通过创建墓碑标记来删除工作。例如,假设我们想删除一行。为此,您可以指定一个版本,或者默认使用currentTimeMillis。这意味着删除所有版本小于或等于这个版本的单元格。HBase永远不会修改数据,例如,删除不会立即删除(或标记为已删除)存储文件中对应于删除条件的条目。相反,所谓的墓碑是写出来的,它会掩盖被删除的值。当HBase做一个主要的压缩时,tombstone会被处理,以实际移除死值,连同墓碑本身。如果在删除行时指定的版本大于行中任何值的版本,则可以考虑删除完整行。

For an informative discussion on how deletes and versioning interact, see the thread Put w/timestamp → Deleteall → Put w/ timestamp fails up on the user mailing list.

的信息讨论如何删除和版本进行交互,查看线程把w /时间戳→Deleteall→把w /时间戳失败用户邮件列表。

Also see keyvalue for more information on the internal KeyValue format.

还可以看到关于内部keyvalue格式的更多信息的keyvalue。

Delete markers are purged during the next major compaction of the store, unless the KEEP_DELETED_CELLS option is set in the column family (See Keeping Deleted Cells). To keep the deletes for a configurable amount of time, you can set the delete TTL via the hbase.hstore.time.to.purge.deletes property in hbase-site.xml. If hbase.hstore.time.to.purge.deletes is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction. Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker’s timestamp plus the value of hbase.hstore.time.to.purge.deletes, in milliseconds.

删除标记在该存储的下一个主要压缩过程中被清除,除非在列家族中设置KEEP_DELETED_CELLS选项(参见保留删除的单元格)。为了保持删除的时间,你可以通过hbase.hstore.h .time.to.purge.delete来设置删除TTL。如果hbase.hstore.time.to.purge.deletes没有设置或设置为0,所有删除标记,包括将来的时间戳,都将在接下来的主要压缩过程中被清除。否则,在未来的时间戳中会保留一个带有时间戳的删除标记,直到标记的时间戳和hbase. hstore.c . hstore.o .purge.delete以毫秒为单位表示的时间戳后出现的主压缩。

This behavior represents a fix for an unexpected change that was introduced in HBase 0.94, and was fixed in HBASE-10118. The change has been backported to HBase 0.94 and newer branches.

28.3. Current Limitations

28.3。当前的限制

28.3.1. Deletes mask Puts

28.3.1。删除面具了

Deletes mask puts, even puts that happened after the delete was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. It will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do delete and put immediately after each other, and there is some chance they happen within the same millisecond.

删除掩码设置,甚至在输入删除后发生。看到hbase - 2256。记住,一个删除写了一个墓碑,它只会在接下来的主要压实运行之后消失。假设你做删除一切⇐t后你做一个新的时间戳⇐t .这把,即使它发生在删除后,将删除蒙面的墓碑。执行put不会失败,但是当你做一个get时,你会注意到put没有效果。它将在主压缩运行后重新开始工作。如果您使用总是递增的新版本,这些问题不应该成为问题。但是,即使你不关心时间,它们也会发生:只需要立即删除和放置,并且在相同的毫秒内就有可能发生。

28.3.2. Major compactions change query results

28.3.2。主要的压缩会改变查询结果。

…​create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore…​ (See Garbage Collection in Bending time in HBase.)

在t1、t2和t3中创建三个单元版本,最大版本设置为2。所以当得到所有的版本时,只有t2和t3的值会被返回。但是如果你在t2或t3中删除这个版本,t1的那个将会再次出现。很明显,一旦一个主要的压缩运行,这样的行为就不再是这样了…(在HBase的弯曲时间里看到垃圾收集)。

29. Sort Order

29。排序顺序

All data model operations in HBase return data in sorted order. First by row, then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted in reverse, so newest records are returned first).

所有数据模型操作HBase返回数据的排序顺序。首先是行,然后是ColumnFamily,然后是列限定符,最后是时间戳(以反向排序,所以最新的记录是先返回的)。

30. Column Metadata

30.列元数据

There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily. Thus, while HBase can support not only a wide number of columns per row, but a heterogeneous set of columns between rows as well, it is your responsibility to keep track of the column names.

在ColumnFamily的内部KeyValue实例之外,不存在列元数据存储。因此,虽然HBase不仅可以支持每行中大量的列,而且还可以支持行之间的异构列,但您有责任跟踪列名。

The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows. For more information about how HBase stores data internally, see keyvalue.

获得一个列的完整的列的唯一方法是处理所有的行。有关HBase在内部如何存储数据的更多信息,请参见keyvalue。

31. Joins

31日。连接

Whether HBase supports joins is a common question on the dist-list, and there is a simple answer: it doesn’t, at least not in the way that RDBMS' support them (e.g., with equi-joins or outer-joins in SQL). As has been illustrated in this chapter, the read data model operations in HBase are Get and Scan.

HBase是否支持连接在列表中是一个常见的问题,并且有一个简单的答案:它没有,至少在RDBMS支持它们的方式上(例如,在SQL中使用equijoin或outer连接)。如本章所述,HBase中的读取数据模型操作是Get和Scan。

However, that doesn’t mean that equivalent join functionality can’t be supported in your application, but you have to do it yourself. The two primary strategies are either to denormalize the data upon writing to HBase, or to maintain lookup tables and do the join between HBase tables in your application or MapReduce code (and as RDBMS' demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs. hash-joins). So which is the best approach? It depends on what you are trying to do, and as such there isn’t a single answer that works for every use case.

但是,这并不意味着在您的应用程序中不能支持等价的连接功能,但是您必须自己完成它。两个主要策略是denormalizing数据写入HBase,或查找表和HBase表之间的连接应用程序或MapReduce代码(如RDBMS的演示,有几种策略取决于表的大小,例如,嵌套循环比散列连接)。那么哪种方法才是最好的呢?这取决于你想要做什么,因此,对于每个用例都没有一个有效的答案。

32. ACID

32。酸

See ACID Semantics. Lars Hofhansl has also written a note on ACID in HBase.

看到酸语义。Lars Hofhansl也在HBase上写了关于酸的注释。

HBase and Schema Design

HBase和模式设计

A good introduction on the strength and weaknesses modelling on the various non-rdbms datastores is to be found in Ian Varley’s Master thesis, No Relation: The Mixed Blessings of Non-Relational Databases. It is a little dated now but a good background read if you have a moment on how HBase schema modeling differs from how it is done in an RDBMS. Also, read keyvalue for how HBase stores data internally, and the section on schema.casestudies.

在Ian Varley的硕士论文中,我们可以很好地介绍各种非关系型数据库的优点和缺点,没有关系:非关系数据库的混合的好处。它现在有点过时了,但是如果您有一个关于HBase模式建模与在RDBMS中是如何不同的时间,那么它将是一个很好的背景。另外,请阅读keyvalue,以了解HBase如何在内部存储数据,以及schema.casestudies的部分。

The documentation on the Cloud Bigtable website, Designing Your Schema, is pertinent and nicely done and lessons learned there equally apply here in HBase land; just divide any quoted values by ~10 to get what works for HBase: e.g. where it says individual values can be ~10MBs in size, HBase can do similar — perhaps best to go smaller if you can — and where it says a maximum of 100 column families in Cloud Bigtable, think ~10 when modeling on HBase.

云Bigtable网站上的文档,设计您的模式,是恰当的,很好地完成了,并且在HBase中也同样适用于这里的经验教训;只是任何引用值除以~ 10为HBase工作:例如,它说可以~ 10个人值大小mbs,HBase最好做类似的——也许可以更小的如果你可以和它说最多100列家庭云Bigtable,认为~ 10 HBase建模时。

See also Robert Yokota’s HBase Application Archetypes (an update on work done by other HBasers), for a helpful categorization of use cases that do well on top of the HBase model.

请参阅Robert Yokota的HBase应用程序原型(其他HBasers所做的工作更新),以帮助对在HBase模型之上做得更好的用例进行分类。

33. Schema Creation

33。创建模式

HBase schemas can be created or updated using the The Apache HBase Shell or by using Admin in the Java API.

可以使用Apache HBase Shell或在Java API中使用Admin来创建或更新HBase模式。

Tables must be disabled when making ColumnFamily modifications, for example:

在制作ColumnFamily的修改时必须禁用表,例如:

Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Admin admin = connection.getAdmin();
TableName table = TableName.valueOf("myTable");

admin.disableTable(table);

HColumnDescriptor cf1 = ...;
admin.addColumn(table, cf1);      // adding new ColumnFamily
HColumnDescriptor cf2 = ...;
admin.modifyColumn(table, cf2);   // modifying existing ColumnFamily

admin.enableTable(table);

See client dependencies for more information about configuring client connections.

有关配置客户端连接的更多信息,请参见客户机依赖关系。

Online schema changes are supported in the 0.92.x codebase, but the 0.90.x codebase requires the table to be disabled.

33.1. Schema Updates

33.1。模式更新

When changes are made to either Tables or ColumnFamilies (e.g. region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.

当对表或columnfamily(例如区域大小、块大小)进行更改时,这些更改将在下一次发生重大压缩时生效,并重新编写存储文件。

See store for more information on StoreFiles.

有关存储文件的更多信息,请参见存储。

34. Table Schema Rules Of Thumb

34。表模式的经验法则。

There are many different data sets, with different access patterns and service-level expectations. Therefore, these rules of thumb are only an overview. Read the rest of this chapter to get more details after you have gone through this list.

有许多不同的数据集,有不同的访问模式和服务级别的期望。因此,这些经验法则只是一个概述。阅读本章的其余部分,以获得更多的细节。

  • Aim to have regions sized between 10 and 50 GB.

    目标区域大小在10到50 GB之间。

  • Aim to have cells no larger than 10 MB, or 50 MB if you use mob. Otherwise, consider storing your cell data in HDFS and store a pointer to the data in HBase.

    如果您使用mob,目标是没有大于10 MB的单元,或者50 MB。否则,考虑将您的单元数据存储在HDFS中,并在HBase中存储一个指向数据的指针。

  • A typical schema has between 1 and 3 column families per table. HBase tables should not be designed to mimic RDBMS tables.

    一个典型的模式在每个表中有1到3个列。HBase表不应该被设计成模拟RDBMS表。

  • Around 50-100 regions is a good number for a table with 1 or 2 column families. Remember that a region is a contiguous segment of a column family.

    大约50-100个区域对于有1或2个列族的表来说是一个很好的数字。记住,一个区域是一个列族的连续部分。

  • Keep your column family names as short as possible. The column family names are stored for every value (ignoring prefix encoding). They should not be self-documenting and descriptive like in a typical RDBMS.

    尽量缩短你的列姓。列家族名存储为每个值(忽略前缀编码)。它们不应该像典型的RDBMS那样自我记录和描述。

  • If you are storing time-based machine data or logging information, and the row key is based on device ID or service ID plus time, you can end up with a pattern where older data regions never have additional writes beyond a certain age. In this type of situation, you end up with a small number of active regions and a large number of older regions which have no new writes. For these situations, you can tolerate a larger number of regions because your resource consumption is driven by the active regions only.

    如果您正在存储基于时间的机器数据或日志信息,而行键是基于设备ID或服务ID加上时间的,那么您就可以使用一种模式,在这个模式中,较老的数据区域在一定的年龄之外不会有额外的写入。在这种情况下,你会得到少数活跃的区域和大量没有新写的旧区域。对于这些情况,您可以容忍更多的区域,因为您的资源消耗仅由活动区域驱动。

  • If only one column family is busy with writes, only that column family accumulates memory. Be aware of write patterns when allocating resources.

    如果只有一个列家庭忙于编写,那么只有这个列家庭容纳内存。在分配资源时要注意编写模式。

RegionServer Sizing Rules of Thumb

区域服务器的大小规则。

Lars Hofhansl wrote a great blog post about RegionServer memory sizing. The upshot is that you probably need more memory than you think you need. He goes into the impact of region size, memstore size, HDFS replication factor, and other things to check.

Lars Hofhansl写了一篇关于区域服务器内存大小的博文。结果是,你可能需要比你认为你需要的更多的记忆。他深入研究了区域大小、memstore大小、HDFS复制因子以及其他需要检查的东西。

Personally I would place the maximum disk space per machine that can be served exclusively with HBase around 6T, unless you have a very read-heavy workload. In that case the Java heap should be 32GB (20G regions, 128M memstores, the rest defaults).

我个人认为,除非你有非常繁重的工作,否则每台机器的最大磁盘空间只能在6T左右。在这种情况下,Java堆应该是32GB (20G区域,128M memstores,其余的缺省值)。

— Lars Hofhansl
http://hadoop-hbase.blogspot.com/2013/01/hbase-region-server-memory-sizing.html

35. On the number of column families

35。关于列族的数量。

HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing and compaction interaction can make for a bunch of needless i/o (To be addressed by changing flushing and compaction to work on a per column family basis). For more information on compactions, see Compaction.

HBase目前在两三个列的家庭中都不太好,所以在您的模式中保持列族的数量很低。目前,冲洗和压实是在每个区域的基础上进行的,因此,如果一个列族携带大量的数据带来了冲洗,即使他们携带的数据量很小,相邻的家庭也会被刷新。当许多列家族存在时,刷新和压实交互可以使一堆不必要的i/o(通过更改刷新和压缩以在每个列的家庭基础上工作)。有关Compaction的更多信息,请参见Compaction。

Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.

如果您可以在模式中使用一个列族,那么就尝试使用它。在数据访问通常为列范围的情况下,只引入第二和第三列家族;也就是说,你查询一个列的家庭或另一个,但通常不是两个都在同一时间。

35.1. Cardinality of ColumnFamilies

35.1。基数的ColumnFamilies

Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.

在单个表中存在多个ColumnFamilies时,请注意基数(即:的行数)。如果ColumnFamilyA有100万行,而ColumnFamilyB有10亿行,那么ColumnFamilyA的数据很可能会分布在许多区域(和区域服务器)。这使得对ColumnFamilyA的大规模扫描效率降低。

36. Rowkey Design

36。Rowkey设计

36.1. Hotspotting

36.1。热点

Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability. This can also have adverse effects on other regions hosted by the same region server as that host is unable to service the requested load. It is important to design data access patterns such that the cluster is fully and evenly utilized.

HBase中的行由行键按字母顺序排序。该设计优化了扫描,允许您存储相关的行,或将一起阅读的行,彼此相邻。然而,设计糟糕的行键是热点的常见来源。当大量的客户端流量定向到一个节点(或仅几个节点)时,就会出现“热点识别”。此流量可以表示读、写或其他操作。流量超过了负责托管该区域的单一机器,导致性能下降,并可能导致区域不可用。这也会对同一区域服务器承载的其他区域产生不利影响,因为主机无法服务请求的负载。设计数据访问模式是很重要的,这样集群就可以得到充分和均匀的利用。

To prevent hotspotting on writes, design your row keys such that rows that truly do need to be in the same region are, but in the bigger picture, data is being written to multiple regions across the cluster, rather than one at a time. Some common techniques for avoiding hotspotting are described below, along with some of their advantages and drawbacks.

为了防止在编写时出现hotspotting,请设计行键,以便在同一区域中真正需要的行是相同的,但是在更大的情况下,数据将被写到集群中的多个区域,而不是一次一个。下面将介绍一些避免热识别的常用技术,以及它们的一些优点和缺点。

Salting

Salting in this sense has nothing to do with cryptography, but refers to adding random data to the start of a row key. In this case, salting refers to adding a randomly-assigned prefix to the row key to cause it to sort differently than it otherwise would. The number of possible prefixes correspond to the number of regions you want to spread the data across. Salting can be helpful if you have a few "hot" row key patterns which come up over and over amongst other more evenly-distributed rows. Consider the following example, which shows that salting can spread write load across multiple RegionServers, and illustrates some of the negative implications for reads.

在这个意义上,Salting与加密无关,而是指将随机数据添加到行键的开头。在这种情况下,salting指的是在行键中添加一个随机分配的前缀,以使它以不同于其他方式的方式排序。可能的前缀的数量与您希望将数据传播的区域数量相对应。如果您有几个“热”行键模式,这些模式在其他更均匀分布的行中反复出现,那么Salting可能会有帮助。请考虑下面的示例,它显示了salting可以跨多个区域服务器传播写负载,并演示了一些对读取的负面影响。

Example 15. Salting Example

Suppose you have the following list of row keys, and your table is split such that there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b' is another. In this table, all rows starting with 'f' are in the same region. This example focuses on rows with keys like the following:

假设您有下面的行键列表,并且您的表被拆分,这样每个字母表的每个字母都有一个区域。前缀“a”是一个区域,前缀“b”是另一个区域。在该表中,以“f”开头的所有行位于同一区域。这个例子关注的是带键的行:

foo0001
foo0002
foo0003
foo0004

Now, imagine that you would like to spread these across four different regions. You decide to use four different salts: a, b, c, and d. In this scenario, each of these letter prefixes will be on a different region. After applying the salts, you have the following rowkeys instead. Since you can now write to four separate regions, you theoretically have four times the throughput when writing that you would have if all the writes were going to the same region.

现在,想象一下,你想要把它们分散到四个不同的区域。您决定使用四种不同的盐:a、b、c和d。在这种情况下,每个字母前缀将位于不同的区域。在应用了这些盐之后,您将得到以下的rowkeys。既然你现在可以写四个不同的区域,理论上你写的时候你会有4倍的吞吐量,如果所有的写都是在同一个区域。

a-foo0003
b-foo0001
c-foo0004
d-foo0002

Then, if you add another row, it will randomly be assigned one of the four possible salt values and end up near one of the existing rows.

然后,如果您添加另一行,它将随机分配四种可能的盐值之一,并在现有的行中结束。

a-foo0003
b-foo0001
c-foo0003
c-foo0004
d-foo0002

Since this assignment will be random, you will need to do more work if you want to retrieve the rows in lexicographic order. In this way, salting attempts to increase throughput on writes, but has a cost during reads.

由于这个任务是随机的,如果你想要在字典顺序中检索行,你需要做更多的工作。通过这种方式,salting试图增加写操作的吞吐量,但是在读取过程中有成本。
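A minimal sketch of the write-side salting described above (this is not part of the HBase API; the salt values and the "-" separator are arbitrary choices for this example):

import java.util.Random;

public class SaltedKeys {
  // One salt per region you want writes spread across (four, as in the example above).
  private static final String[] SALTS = { "a", "b", "c", "d" };
  private static final Random RANDOM = new Random();

  // "foo0003" becomes, for example, "c-foo0003"; the original key is preserved after the separator.
  public static String salt(String rowKey) {
    return SALTS[RANDOM.nextInt(SALTS.length)] + "-" + rowKey;
  }
}

On the read side, retrieving a full lexicographic range then requires one Scan per salt prefix, with the results merged by the client.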

Hashing

Instead of a random assignment, you could use a one-way hash that would cause a given row to always be "salted" with the same prefix, in a way that would spread the load across the RegionServers, but allow for predictability during reads. Using a deterministic hash allows the client to reconstruct the complete rowkey and use a Get operation to retrieve that row as normal.

您可以使用单向散列,而不是随机分配,这样可以使给定的行始终以相同的前缀“加盐”,这样可以在区域服务器上分散负载,但是在读取期间允许可预测性。使用确定性哈希允许客户端重构完整的rowkey,并使用Get操作来恢复正常的行。

Example 16. Hashing Example
Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key foo0003 to always, and predictably, receive the a prefix. Then, to retrieve that row, you would already know the key. You could also optimize things so that certain pairs of keys were always in the same region, for instance.
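
A sketch of such a deterministic prefix, assuming (as in the salting example) four target prefixes and MD5 as the one-way hash; none of this is prescribed by HBase, it simply illustrates that the same key always lands in the same bucket:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedKeys {
  // Deterministically map a row key to one of the prefixes a, b, c or d.
  public static String hashPrefix(String rowKey) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey.getBytes());
    char prefix = (char) ('a' + ((digest[0] & 0xFF) % 4));
    return prefix + "-" + rowKey;   // "foo0003" always receives the same prefix
  }
}

Because the prefix is recomputable from the key, a client that knows foo0003 can rebuild the complete rowkey and issue an ordinary Get.
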
Reversing the Key

A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys, but sacrifices row ordering properties.

防止热斑的第三个常见的技巧是,反转固定宽度或数字行键,使最常发生变化的部分(最不重要的数字)是第一个。这有效地随机化了行键,但牺牲了行排序属性。
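A purely illustrative sketch of reversing a fixed-width string key:

// The fastest-changing (least significant) characters come first after reversal,
// e.g. "20180714093000" becomes "00039041708102".
public static String reverseKey(String fixedWidthKey) {
  return new StringBuilder(fixedWidthKey).reverse().toString();
}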

See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, the article on Salted Tables from the Phoenix project, and the discussion in the comments of HBASE-11682 for more information about avoiding hotspotting.

请参阅https://community.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion -on-design -hbase- Tables,并在“凤凰”项目中对盐表进行讨论,并在HBASE-11682的评论中进行讨论,以获得更多关于避免热点的信息。

36.2. Monotonically Increasing Row Keys/Timeseries Data

36.2。单调递增的行键/Timeseries数据。

In the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table’s regions (and thus, a single node), then moving onto the next region, etc. With monotonically increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: monotonically increasing values are bad. The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records so they are not in sorted order, but in general it’s best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.

汤姆的HBase章白的书Hadoop:明确的指南(O ' reilly)有一个优化注意看了一个现象,一个导入过程走在同步与所有客户共同冲击表的地区之一(因此,单个节点),然后移动到下一个区域,与单调递增的行键(即等等。。(使用时间戳),这将发生。看一看IKai Lan的漫画,为什么单调递增的行键在bigtable类的数据存储中是有问题的:单调递增的值是不好的。通过对输入记录进行随机化,可以减轻单个区域上单调递增的键所带来的堆积,但一般来说,最好避免使用时间戳或序列(例如,1、2、3)作为行键。

If you do need to upload time series data into HBase, you should study OpenTSDB as a successful example. It has a page describing the schema it uses in HBase. The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key. However, the difference is that the timestamp is not in the lead position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types. Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.

如果您确实需要将时间序列数据上传到HBase中,那么您应该将OpenTSDB作为一个成功的例子。它有一个描述它在HBase中使用的模式的页面。OpenTSDB中的关键格式是有效的[metric_type][event_timestamp],它会在第一眼看上去与之前的建议相矛盾,即不使用时间戳作为键。但是,不同之处在于时间戳不在键的主要位置,而设计假设是有几十个或数百个(或更多)不同的度量类型。因此,即使输入数据的连续流与度量类型混合,也会分布在表中各个区域的位置上。

See schema.casestudies for some rowkey design examples.

看到模式。对一些rowkey设计示例的案例研究。

36.3. Try to minimize row and column sizes

36.3。尽量减少行和列的大小。

In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it’ll be accompanied by its row, column name, and timestamp - always. If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios. One such is the case described by Marc Limotte at the tail of HBASE-3551 (recommended!). Therein, the indices that are kept on HBase storefiles (StoreFile (HFile)) to facilitate random access may end up occupying large chunks of the HBase allotted RAM because the cell value coordinates are large. Marc in the above cited comment suggests upping the block size so entries in the store file index happen at a larger interval, or modifying the table schema so it makes for smaller rows and column names. Compression will also make for larger indices. See the thread a question storefileIndexSize up on the user mailing list.

在HBase中,值总是与它们的坐标相匹配;当一个单元格值通过该系统时,它将伴随它的行、列名称和时间戳——始终。如果您的行和列名很大,特别是与单元格值的大小相比,那么您可能会遇到一些有趣的情况。其中一个例子是Marc Limotte在HBASE-3551(推荐!)的尾巴上描述的。其中,保存在HBase storefiles (StoreFile (HFile))上的索引可以方便随机访问,最终可能占用大量的HBase分配RAM,因为单元值坐标很大。在上面引用的注释中,Mark建议增加块大小,这样,存储文件索引中的条目就会发生在更大的间隔中,或者修改表模式,这样它就可以为较小的行和列名进行修改。压缩也将为更大的指标。在用户邮件列表上查看一个问题storefileIndexSize。

Most of the time small inefficiencies don’t matter all that much. Unfortunately, this is a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated several billion times in your data.

大多数时候,小的低效率并不重要。不幸的是,这是他们的一个例子。无论对ColumnFamilies、属性和rowkeys选择何种模式,它们都可以在您的数据中重复几十亿次。

See keyvalue for more information on HBase stores data internally to see why this is important.

有关HBase存储数据的更多信息,请参见keyvalue,以了解其重要性。

36.3.1. Column Families

36.3.1。列的家庭

Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).

尽量使列族名尽可能小,最好是一个字符(例如:数据/默认的“d”)。

See KeyValue for more information on HBase stores data internally to see why this is important.

有关HBase存储数据的更多信息,请参见KeyValue,以了解其重要性。

36.3.2. Attributes

36.3.2。属性

Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.

尽管详细的属性名称(例如,“myVeryImportantAttribute”)更容易阅读,但更喜欢较短的属性名称(例如,“via”)来存储在HBase中。

See keyvalue for more information on HBase stores data internally to see why this is important.

有关HBase存储数据的更多信息,请参见keyvalue,以了解其重要性。

36.3.3. Rowkey Length

36.3.3。Rowkey长度

Keep them as short as is reasonable such that they can still be useful for required data access (e.g. Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs when designing rowkeys.

让它们尽可能短,这样它们仍然可以用于需要的数据访问(例如,Get和Scan)。一个对数据访问无用的短键并不比一个更长的键更好的获取/扫描属性好。在设计行键时,需要权衡。

36.3.4. Byte Patterns

36.3.4。字节模式

A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String — presuming a byte per character — you need nearly 3x the bytes.

一个长是8个字节。在这8个字节中,您可以将未签名的数字存储为18,446,744,073,709,551,615。如果将这个数字存储为字符串,假设每个字符为一个字节,那么您需要的字节数接近3倍。

Not convinced? Below is some sample code that you can run on your own.

不相信吗?下面是一些您可以自己运行的示例代码。

// long
//
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8

String s = String.valueOf(l);
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10

// hash
//
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26

Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example, this is what you will see in the shell when you increment a value:

不幸的是,使用一种类型的二进制表示将使您的数据更难在代码之外读取。例如,当您增加一个值时,您将在shell中看到:

hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1

hbase(main):002:0> get 't', 'r'
COLUMN                                        CELL
 f:q                                          timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds

The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will happen to your row keys inside the region names. It can be okay if you know what’s being stored, but it might also be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.

shell为打印字符串做出了最大的努力,并且它决定只打印十六进制。同样的情况也会发生在区域名称的行键中。如果您知道存储了什么,那么它可能是可以的,但是如果可以将任意数据放入相同的单元中,那么它也可能是不可读的。这是主要的权衡。

36.4. Reverse Timestamps

36.4。反向时间戳

Reverse Scan API

HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See Scan.setReversed() for more information.

HBASE-4811实现了一个API,可以在一个表中反向扫描一个表或一个范围,从而减少对前向或反向扫描优化模式的需求。该特性在HBase 0.98和以后可用。要了解更多信息,请参阅scan.setre()。

A common problem in database processing is quickly finding the most recent version of a value. A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of any key, e.g. [key][reverse_timestamp].

数据库处理中的一个常见问题是快速找到最新版本的值。使用反向时间戳作为键的一部分的技术可以极大地帮助解决这个问题的特殊情况。在汤姆·怀特的书《Hadoop:权威指南》(O 'Reilly)的HBase章节中也有发现,该技术包括附加(Long)。MAX_VALUE - timestamp)到任何键的末尾,例如[key][reverse_timestamp]。

The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record. Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.

通过对[key]进行扫描并获得第一个记录,可以找到表中[key]的最近值。由于HBase键是按排序顺序排列的,所以这一键在任何老的行键之前(键)之前排序,因此是第一个键。
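A sketch of the technique, following the conventions of the earlier Put examples (CF, ATTR, data and table as defined there; bounding the Scan with a PrefixFilter is just one of several possible choices):

long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
byte[] rowkey = Bytes.add(Bytes.toBytes("key"), Bytes.toBytes(reverseTs));
Put put = new Put(rowkey);
put.add(CF, ATTR, Bytes.toBytes(data));
table.put(put);

// Because newer entries for "key" sort first, the first Result of a prefix Scan
// is the most recently written value for "key".
Scan scan = new Scan(Bytes.toBytes("key"));
scan.setFilter(new PrefixFilter(Bytes.toBytes("key")));
ResultScanner scanner = table.getScanner(scan);
Result latest = scanner.next();
scanner.close();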

This technique would be used instead of using Number of Versions where the intent is to hold onto all versions "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.

这种技术将会被使用,而不是使用许多版本,其中的目的是“永远”(或很长时间)保存所有版本,同时通过使用相同的扫描技术快速获得对任何其他版本的访问。

36.5. Rowkeys and ColumnFamilies

36.5。Rowkeys和ColumnFamilies

Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.

Rowkeys被限定在ColumnFamilies中。因此,相同的行键可以存在于一个没有冲突的表中。

36.6. Immutability of Rowkeys

36.6。不变性的Rowkeys

Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you’ve inserted a lot of data).

Rowkeys不能改变。它们在表中“更改”的惟一方法是,如果行被删除,然后重新插入。这在HBase列表中是一个相当常见的问题,因此在第一次(以及/或插入大量数据之前)获得rowkeys是值得的。

36.7. Relationship Between RowKeys and Region Splits

36.7。RowKeys与区域分割的关系。

If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split (which is the split strategy used when creating regions in Admin.createTable(byte[] startKey, byte[] endKey, numRegions)) for 10 regions will generate the following splits…​

如果您预先分割了您的表,那么理解您的rowkey将如何分布到整个区域边界是非常重要的。作为一个重要的例子,考虑使用可显示的十六进制字符作为键的主要位置的例子(例如,“0000000000000000”到“ffffffffffffffffffff”)。通过字节来运行这些键。split(在管理中创建区域时使用的拆分策略)。10个区域的createTable(byte[] startKey, byte[] endKey, numRegions)将产生以下的分割…

48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48                                // 0
54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10                 // 6
61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68                 // =
68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126  // D
75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72                                // K
82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14                                // R
88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44                 // X
95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102                // _
102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102                // f

(note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', everything is great, right? Not so fast.

(注意:标题字节被列在右侧作为注释。)考虑到第一个分割是一个“0”而最后一个分割是一个“f”,一切都很好,对吧?没有那么快。

The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer to an ASCII Table. '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., and not relying on the built-in split method) is required.

问题是,所有的数据都将堆积在前两个区域和最后一个区域,从而造成一个“块状”(可能是“热”)区域问题。要理解原因,请参考ASCII表。“0”是字节48,而“f”是字节102,但是字节值(字节58到96)之间有一个巨大的空白,因为唯一的值是[0-9]和[a-f],所以它永远不会出现在这个密钥空间中。因此,中间区域将永远不会被使用。要使用这个示例keyspace进行预分解工作,可以定义分割(也就是)。,并且不依赖内置的分割方法)是必需的。

Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen with any keyspace. Know your data.

第1课:预分解表通常是最佳实践,但您需要以这样一种方式预分解它们,使所有区域都可以在keyspace中访问。这个例子演示了一个hex密钥空间的问题,同样的问题也可能发生在任何密钥空间中。知道你的数据。

Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split tables as long as all the created regions are accessible in the keyspace.

第2课:虽然通常不可取,但是使用hex键(更一般地说,可显示数据)仍然可以使用预分割表,只要在keyspace中可以访问所有创建的区域。

To conclude this example, the following is an example of how appropriate splits can be pre-created for hex-keys:.

为了完成这个示例,下面是一个示例,说明如何为hex键预先创建适当的分割:。

public static boolean createTable(Admin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable( table, splits );
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for(int i=0; i < numRegions-1;i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}

37. Number of Versions

37岁。数量的版本

37.1. Maximum Number of Versions

37.1。最大数量的版本

The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1. This is an important parameter because as described in Data Model section HBase does not overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions. The number of max versions may need to be increased or decreased depending on application needs.

每个列家族通过HColumnDescriptor配置存储的行版本的最大数量。max版本的默认值是1。这是一个重要的参数,因为正如数据模型部分HBase所描述的那样,它不会覆盖行值,而是按时间(和限定符)存储不同的值。在主要的压缩过程中删除多余的版本。根据应用程序的需要,可能需要增加或减少max版本的数量。

It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you because this will greatly increase StoreFile size.

不建议将max版本的数量设置为非常高的级别(例如,数百或更多),除非这些旧值对您非常重要,因为这会大大增加StoreFile的大小。

37.2. Minimum Number of Versions

37.2。最小数量的版本

Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via HColumnDescriptor. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, but keep at least M versions around" (where M is the value for minimum number of row versions, M<N). This parameter should only be set when time-to-live is enabled for a column family and must be less than the number of row versions.

与最大行版本数一样,通过HColumnDescriptor将每个列家庭配置的行版本的最小数量。最小版本的默认值是0,这意味着该特性是禁用的。最小数量的行版本参数是与生存时间参数一起使用,可以结合行版本参数允许的数量配置如“保持最后T分钟的数据,最多N版本,但至少保持M版本”(M的值为最小数量的行版本,M < N)。此参数只在对列族启用时才设置,并且必须小于行版本的数量。
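For example, the "keep the last T minutes of data, at most N versions, but at least M versions" configuration described above might be expressed like this (a sketch for a hypothetical column family f1; the numbers are arbitrary):

HColumnDescriptor cf = new HColumnDescriptor("f1");
cf.setTimeToLive(5 * 60);   // T: expire cells older than five minutes (value is in seconds)
cf.setMaxVersions(10);      // N: never keep more than ten versions of a cell
cf.setMinVersions(2);       // M: but always retain at least the two newest versions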

38. Supported Datatypes

38。支持的数据类型

HBase supports a "bytes-in/bytes-out" interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes.

HBase通过Put和Result支持“bytes-in/ byts -out”接口,因此可以将任何可以转换成字节数组的内容存储为一个值。输入可以是字符串、数字、复杂对象,甚至是图像,只要它们能以字节的形式呈现。

There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.

对于值的大小有实际的限制(例如,在HBase中存储10-50MB的对象可能会要求太多);在邮件列表中搜索关于这个主题的对话。HBase中的所有行都符合数据模型,包括版本控制。在设计时要考虑到这一点,也要考虑到ColumnFamily的块大小。

38.1. Counters

38.1。计数器

One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers). See Increment in Table.

一个值得特别提及的支持数据类型是“计数器”。,也就是对数字进行原子增量的能力。看到增量表。

Synchronization on counters are done on the RegionServer, not in the client.

计数器上的同步是在区域服务器上完成的,而不是在客户机上。
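A sketch of a counter update, in the style of the other client examples in this chapter (CF, ATTR and table as defined there):

Increment increment = new Increment(Bytes.toBytes("row1"));
increment.addColumn(CF, ATTR, 1L);              // atomically add 1 on the RegionServer
Result result = table.increment(increment);
long counterValue = Bytes.toLong(result.getValue(CF, ATTR));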

39. Joins

39岁。连接

If you have multiple tables, don’t forget to factor in the potential for Joins into the schema design.

如果您有多个表,请不要忘记考虑加入模式设计的可能性。

40. Time To Live (TTL)

40。生存时间(TTL)

ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC.

ColumnFamilies可以在秒内设置TTL长度,HBase在到达过期时间后自动删除行。这适用于行的所有版本——甚至是当前版本。在UTC中指定了在HBase中编码的TTL时间。

Store files which contain only expired rows are deleted on minor compaction. Setting hbase.store.delete.expired.storefile to false disables this feature. Setting the minimum number of versions to other than 0 also disables this.

只在较小的压缩过程中删除包含过期行的存储文件。设置hbase.store.delete.expired.storefile对该特性的错误禁用。将最小数量的版本设置为0也可以禁用此功能。

See HColumnDescriptor for more information.

有关更多信息,请参见HColumnDescriptor。

Recent versions of HBase also support setting time to live on a per cell basis. See HBASE-10560 for more information. Cell TTLs are submitted as an attribute on mutation requests (Appends, Increments, Puts, etc.) using Mutation#setTTL. If the TTL attribute is set, it will be applied to all cells updated on the server by the operation. There are two notable differences between cell TTL handling and ColumnFamily TTLs:

HBase的最新版本也支持在每个单元的基础上设置时间。更多信息见HBASE-10560。使用突变#setTTL,将单元TTLs作为一个属性提交给突变请求(附加、增量、放置等)。如果设置了TTL属性,它将被应用到服务器上通过操作更新的所有单元格。TTL处理与ColumnFamily TTLs有两个显著的区别:

  • Cell TTLs are expressed in units of milliseconds instead of seconds.

    单元TTLs以毫秒为单位表示,而不是以秒为单位。

  • A cell TTL cannot extend the effective lifetime of a cell beyond a ColumnFamily level TTL setting.

    一个细胞TTLs不能将一个细胞的有效寿命延长到一个ColumnFamily level TTL设置之外。
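For example, a cell-level TTL might be attached to a Put as in the sketch below (same conventions as the earlier Put examples; the ten-minute value is arbitrary and, as noted above, expressed in milliseconds):

Put put = new Put(Bytes.toBytes("row1"));
put.add(CF, ATTR, Bytes.toBytes(data));
put.setTTL(10 * 60 * 1000L);   // cell TTL of ten minutes, in milliseconds
table.put(put);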

41. Keeping Deleted Cells

41岁。保持删除细胞

By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time range before the delete marker was placed.

默认情况下,删除标记可以追溯到时间的开始。因此,获取或扫描操作将不会看到被删除的单元格(行或列),即使在获取或扫描操作指示删除标记放置之前的时间范围。

ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be retrieved, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells. This allows for point-in-time queries even in the presence of deletes.

ColumnFamilies可以选择保留删除的单元格。在这种情况下,仍然可以检索被删除的单元格,只要这些操作指定一个时间范围,在任何将影响单元格的删除时间戳之前结束。即使在删除的情况下,这也允许进行时间点查询。

Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells. A new "raw" scan option returns all deleted rows and the delete markers.

已删除的单元仍受TTL的约束,并且永远不会有超过“最大版本”被删除的单元格。一个新的“原始”扫描选项返回所有被删除的行和删除标记。

Example 17. Change the Value of KEEP_DELETED_CELLS Using HBase Shell
hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true
Example 18. Change the Value of KEEP_DELETED_CELLS Using the API
...
HColumnDescriptor.setKeepDeletedCells(true);
...

Let us illustrate the basic effect of setting the KEEP_DELETED_CELLS attribute on a table.

让我们说明在表中设置keep_deleted_cell属性的基本效果。

First, without:

首先,没有:

create 'test', {NAME=>'e', VERSIONS=>2147483647}
put 'test', 'r1', 'e:c1', 'value', 10
put 'test', 'r1', 'e:c1', 'value', 12
put 'test', 'r1', 'e:c1', 'value', 14
delete 'test', 'r1', 'e:c1',  11

hbase(main):017:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
 r1                                              column=e:c1, timestamp=11, type=DeleteColumn
 r1                                              column=e:c1, timestamp=10, value=value
1 row(s) in 0.0120 seconds

hbase(main):018:0> flush 'test'
0 row(s) in 0.0350 seconds

hbase(main):019:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
 r1                                              column=e:c1, timestamp=11, type=DeleteColumn
1 row(s) in 0.0120 seconds

hbase(main):020:0> major_compact 'test'
0 row(s) in 0.0260 seconds

hbase(main):021:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                              COLUMN+CELL
 r1                                              column=e:c1, timestamp=14, value=value
 r1                                              column=e:c1, timestamp=12, value=value
1 row(s) in 0.0120 seconds

Notice how delete cells are let go.

注意删除单元格是如何被释放的。

Now let’s run the same test only with KEEP_DELETED_CELLS set on the table (you can do table or per-column-family):

现在,让我们只在表上设置KEEP_DELETED_CELLS(您可以做表或每个列的家庭)来运行同一个测试:

hbase(main):005:0> create 'test', {NAME=>'e', VERSIONS=>2147483647, KEEP_DELETED_CELLS => true}
0 row(s) in 0.2160 seconds

=> Hbase::Table - test
hbase(main):006:0> put 'test', 'r1', 'e:c1', 'value', 10
0 row(s) in 0.1070 seconds

hbase(main):007:0> put 'test', 'r1', 'e:c1', 'value', 12
0 row(s) in 0.0140 seconds

hbase(main):008:0> put 'test', 'r1', 'e:c1', 'value', 14
0 row(s) in 0.0160 seconds

hbase(main):009:0> delete 'test', 'r1', 'e:c1',  11
0 row(s) in 0.0290 seconds

hbase(main):010:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0550 seconds

hbase(main):011:0> flush 'test'
0 row(s) in 0.2780 seconds

hbase(main):012:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0620 seconds

hbase(main):013:0> major_compact 'test'
0 row(s) in 0.0530 seconds

hbase(main):014:0> scan 'test', {RAW=>true, VERSIONS=>1000}
ROW                                                                                          COLUMN+CELL
 r1                                                                                          column=e:c1, timestamp=14, value=value
 r1                                                                                          column=e:c1, timestamp=12, value=value
 r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 r1                                                                                          column=e:c1, timestamp=10, value=value
1 row(s) in 0.0650 seconds

KEEP_DELETED_CELLS is to avoid removing Cells from HBase when the only reason to remove them is the delete marker. So with KEEP_DELETED_CELLS enabled, deleted cells would still get removed if you write more versions than the configured maximum, or if you have a TTL and cells exceed the configured timeout, etc.

KEEP_DELETED_CELLS是为了避免移除HBase中的单元格,因为删除它们的唯一原因是删除标记。因此,如果您编写的版本比配置的max多,或者您有一个TTL和单元格超过了配置的超时,那么使用KEEP_DELETED_CELLS将会被删除。

42. Secondary Indexes and Alternate Query Paths

42。二级索引和备选查询路径。

This section could also be titled "what if my table rowkey looks like this but I also want to query my table like that." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.

这个部分也可以命名为“如果我的表rowkey看起来像这样,但是我也想这样查询我的表。”在列表列表中,一个常见的例子是“用户时间戳”格式的行键,但是在特定的时间范围内,对用户的活动有报告要求。因此,用户选择很容易,因为它处于关键位置,但时间不是。

There is no single answer on the best way to handle this because it depends on…​

解决这个问题的最好方法没有单一的答案,因为这取决于……

  • Number of users

    用户数量

  • Data size and data arrival rate

    数据大小和数据到达率。

  • Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)

    报告需求的灵活性(例如,完全特别的日期选择和预先配置的范围)

  • Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)

    期望的执行速度(例如,90秒可能对某些特定的报告来说是合理的,而对于其他人来说可能太长了)

and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution. Common techniques are in sub-sections below. This is a comprehensive, but not exhaustive, list of approaches.

解决方案也会受到集群规模的影响,以及需要处理多少处理能力。常见的技术在下面的小节中。这是一个全面的,但不是详尽的方法列表。

It should not be a surprise that secondary indexes require additional cluster space and processing. This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update. RDBMS products are more advanced in this regard to handle alternative index management out of the box. However, HBase scales better at larger data volumes, so this is a feature trade-off.

次要索引需要额外的集群空间和处理,这并不奇怪。这正是在RDBMS中发生的情况,因为创建替代索引的行为需要空间和处理周期来更新。在这方面,RDBMS产品在处理替代索引管理方面更为先进。但是,HBase在更大的数据量上更有效,因此这是一个特性权衡。

Pay attention to Apache HBase Performance Tuning when implementing any of these approaches.

在实现任何这些方法时,请注意Apache HBase性能调优。

Additionally, see the David Butler response in this dist-list thread HBase, mail # user - Stargate+hbase

另外,在这个列表列表线程HBase、mail # user - Stargate+ HBase中,可以看到David Butler的响应。

42.1. Filter Query

42.1。过滤查询

Depending on the case, it may be appropriate to use Client Request Filters. In this case, no secondary index is created. However, don’t try a full-scan on a large table like this from an application (i.e., single-threaded client).

根据情况不同,使用客户机请求筛选器可能是合适的。在这种情况下,没有创建第二个索引。但是,不要在一个应用程序(也就是)中对一个大型表进行全面扫描。单线程客户)。

42.2. Periodic-Update Secondary Index

42.2。Periodic-Update二级索引

A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.

可以在另一个表中创建第二个索引,该表通过MapReduce作业定期更新。该作业可以在一天内执行,但取决于负载策略,它仍然可能与主数据表不同步。

See mapreduce.example.readwrite for more information.

看到mapreduce.example。读写的更多信息。

42.3. Dual-Write Secondary Index

42.3。Dual-Write二级索引

Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see secondary.indexes.periodic).

另一个策略是在将数据发布到集群时构建次要索引(例如,写入数据表,写入到索引表)。如果这是在数据表已经存在之后采取的方法,那么使用MapReduce作业的辅助索引将需要bootstrapping(请参阅secondary.index .定期)。

42.4. Summary Tables

42.4。汇总表

Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.

在时间范围很广(例如,一年的报告)和数据大量的地方,汇总表是一种常见的方法。这些将通过MapReduce作业生成另一个表。

See mapreduce.example.summary for more information.

看到mapreduce.example。摘要以获取更多信息。

42.5. Coprocessor Secondary Index

42.5。协处理器的二级指标

Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, see coprocessors

协处理器的作用类似于RDBMS触发器。这些加在0。92里。有关更多信息,请参见协处理器。

43. Constraints

43。约束

HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised usage for Constraints is in enforcing business rules for attributes in the table (e.g. make sure values are in the range 1-10). Constraints could also be used to enforce referential integrity, but this is strongly discouraged as it will dramatically decrease the write throughput of the tables where integrity checking is enabled. Extensive documentation on using Constraints can be found at Constraint since version 0.94.

HBase目前支持传统(SQL)数据库用语的“约束”。约束的建议用法是为表中的属性执行业务规则(例如,确保值在1-10的范围内)。约束还可以用于强制引用完整性,但是这是非常不理想的,因为它将极大地减少启用完整性检查的表的写吞吐量。关于使用约束的大量文档可以在版本0.94的约束下找到。

44. Schema Design Case Studies

44岁。模式设计案例研究

The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements.

下面将描述一些典型的基于HBase的数据输入用例,以及如何处理rowkey设计和构建。注意:这只是一个潜在的方法的说明,而不是一个详尽的列表。了解您的数据,了解您的处理需求。

It is highly recommended that you read the rest of the HBase and Schema Design first, before reading these case studies.

在阅读这些案例研究之前,强烈建议您先阅读HBase和模式设计的其余部分。

The following case studies are described:

以下是个案研究:

  • Log Data / Timeseries Data

    日志数据/ Timeseries数据。

  • Log Data / Timeseries on Steroids

    使用类固醇的日志数据/ Timeseries。

  • Customer/Order

    客户/订单

  • Tall/Wide/Middle Schema Design

    高/宽/中间模式设计

  • List Data

    列表数据

44.1. Case Study - Log Data and Timeseries Data

44.1。案例研究-日志数据和Timeseries数据。

Assume that the following data elements are being collected.

假设正在收集以下数据元素。

  • Hostname

    主机名

  • Timestamp

    时间戳

  • Log event

    日志事件

  • Value/message

    价值/消息

We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event - but what specifically?

我们可以将它们存储在一个名为LOG_DATA的HBase表中,但是rowkey是什么呢?从这些属性中,rowkey将是主机名、时间戳和日志事件的组合——但是具体是什么呢?

44.1.1. Timestamp In The Rowkey Lead Position

44.1.1。在Rowkey Lead位置上的时间戳。

The rowkey [timestamp][hostname][log-event] suffers from the monotonically increasing rowkey problem described in Monotonically Increasing Row Keys/Timeseries Data.

rowkey [timestamp][hostname][logevent]在单调递增的行键/Timeseries数据中所描述的单调递增的rowkey问题受到了影响。

There is another pattern frequently mentioned in the dist-lists about "bucketing" timestamps, by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Attention must be paid to the number of buckets, because this will require the same number of scans to return results.

在列表中经常提到的另一种模式是在时间戳上执行一个mod操作。如果时间导向扫描很重要,这可能是一个有用的方法。必须注意bucket的数量,因为这需要相同数量的扫描才能返回结果。

long bucket = timestamp % numBuckets;

to construct:

构造:

[bucket][timestamp][hostname][log-event]

As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket. 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.

如上所述,要为特定的时间范围选择数据,需要对每个bucket执行扫描。例如,100个bucket将在keyspace中提供一个广泛的分布,但是需要100次扫描才能获得单个时间戳的数据,因此需要进行权衡。
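A sketch of bucketed key construction and of the read-side fan-out. The hostname, logEvent, timestamp, startTime and endTime variables are illustrative assumptions, and Bytes.add is simply used to concatenate the key components:

int numBuckets = 100;
long bucket = timestamp % numBuckets;
// [bucket][timestamp][hostname][log-event]
byte[] rowkey = Bytes.add(Bytes.toBytes(bucket),
                          Bytes.toBytes(timestamp),
                          Bytes.add(Bytes.toBytes(hostname), Bytes.toBytes(logEvent)));

// Reading a time range back requires one Scan per bucket, merged on the client.
for (long b = 0; b < numBuckets; b++) {
  Scan scan = new Scan(Bytes.add(Bytes.toBytes(b), Bytes.toBytes(startTime)),
                       Bytes.add(Bytes.toBytes(b), Bytes.toBytes(endTime)));
  // ... run the scan for this bucket and merge its results with the others
}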

44.1.2. Host In The Rowkey Lead Position

44.1.2。主机在Rowkey领先位置。

The rowkey [hostname][log-event][timestamp] is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace. This approach would be useful if scanning by hostname was a priority.

如果有大量的主机来传播写操作并读取整个密钥空间,那么rowkey [hostname][日志事件][时间戳]是一个候选对象。如果以主机名进行扫描是优先级,那么这种方法将非常有用。

44.1.3. Timestamp, or Reverse Timestamp?

44.1.3。时间戳,或反向时间戳?

If the most important access path is to pull most recent events, then storing the timestamps as reverse-timestamps (e.g., timestamp = Long.MAX_VALUE – timestamp) will create the property of being able to do a Scan on [hostname][log-event] to obtain the most recently captured events.

如果最重要的访问路径是拖拽最近的事件,那么将时间戳存储为反向时间戳(例如,timestamp = Long)。MAX_VALUE - timestamp)将创建能够对[主机名][log-event]进行扫描以获取最近捕获的事件的属性。

Neither approach is wrong, it just depends on what is most appropriate for the situation.

这两种方法都不是错误的,它只取决于最适合的情况。

Reverse Scan API

HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See Scan.setReversed() for more information.

HBASE-4811实现了一个API,可以在一个表中反向扫描一个表或一个范围,从而减少对前向或反向扫描优化模式的需求。该特性在HBase 0.98和以后可用。要了解更多信息,请参阅scan.setre()。

44.1.4. Variable Length or Fixed Length Rowkeys?

44.1.4。可变长度或固定长度的行键?

It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is a and the event type is e1 then the resulting rowkey would be quite small. However, what if the ingested hostname is myserver1.mycompany.com and the event type is com.package1.subpackage2.subsubpackage3.ImportantService?

重要的是要记住,在HBase的每一列上都要加盖rowkeys。如果主机名是a,事件类型是e1,那么生成的rowkey将非常小。但是,如果接收的主机名是myserver1.mycompany.com,事件类型是com.package1.subpackage2. subpackage3. importantservice ?

It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:

在rowkey中使用一些替换可能是有意义的。至少有两种方法:散列和数值。在Rowkey Lead Position示例中的主机名中,它可能是这样的:

Composite Rowkey With Hashes:

复合Rowkey散列:

  • [MD5 hash of hostname] = 16 bytes

    [MD5散列的主机名]= 16字节。

  • [MD5 hash of event-type] = 16 bytes

    (事件类型的MD5哈希)= 16字节。

  • [timestamp] = 8 bytes

    (时间戳)= 8个字节

Composite Rowkey With Numeric Substitution:

数值替换的复合行键:

For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES. The rowkey of LOG_TYPES would be:

对于这种方法,除了LOG_DATA(称为LOG_TYPES)之外,还需要另一个查找表。LOG_TYPES的行键为:

  • [type] (e.g., byte indicating hostname vs. event-type)

    [类型](例如,字节指示主机名与事件类型)

  • [bytes] variable length bytes for raw hostname or event-type.

    [字节]用于原始主机名或事件类型的可变长度字节。

A column for this rowkey could be a long with an assigned number, which could be obtained by using an HBase counter.

这个行键的列可以是一个指定的编号,可以通过使用HBase计数器来获得。


So the resulting composite rowkey would be:

因此,合成的复合rowkey将是:

  • [substituted long for hostname] = 8 bytes

    [取代长为主机名]= 8字节。

  • [substituted long for event type] = 8 bytes

    [用long表示事件类型]= 8字节。

  • [timestamp] = 8 bytes

    (时间戳)= 8个字节

In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.

在散列或数值替代方法中,主机名和事件类型的原始值可以存储为列。
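
A minimal sketch of the hashed composite key (16 + 16 + 8 bytes) using plain java.security.MessageDigest; the column family d and the variable names are illustrative assumptions, not part of the guide:

MessageDigest md = MessageDigest.getInstance("MD5");         // declares NoSuchAlgorithmException
byte[] hostnameHash = md.digest(Bytes.toBytes(hostname));    // 16 bytes
byte[] eventTypeHash = md.digest(Bytes.toBytes(eventType));  // 16 bytes

// [MD5(hostname)][MD5(event-type)][timestamp] = 16 + 16 + 8 bytes, fixed length
byte[] rowkey = Bytes.add(hostnameHash, eventTypeHash, Bytes.toBytes(timestamp));

// Store the raw values as columns so they remain readable despite the hashed key.
Put put = new Put(rowkey);
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("hostname"), Bytes.toBytes(hostname));
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("event"), Bytes.toBytes(eventType));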

44.2. Case Study - Log Data and Timeseries Data on Steroids

44.2。案例研究-日志数据和关于类固醇的Timeseries数据。

This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.

这实际上就是OpenTSDB方法。OpenTSDB所做的是在特定的时间周期内将数据重写成列。有关详细说明,请参见:http://opentsdb.net/schema.html,以及从HBaseCon2012获得的OpenTSDB的经验。

But this is how the general concept works: data is ingested, for example, in this manner…​

但这是一般概念的工作方式:数据被摄入,例如,以这种方式……

[hostname][log-event][timestamp1]
[hostname][log-event][timestamp2]
[hostname][log-event][timestamp3]

with separate rowkeys for each detailed event, but is re-written like this…​

对于每个详细的事件,使用单独的行键,但是重新编写如下…

[hostname][log-event][timerange]

and each of the above events is converted into a column stored with a time-offset relative to the beginning of the timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.

上面的每一个事件都被转换成存储有时间偏移量的列(例如,每5分钟)。这显然是一种非常高级的处理技术,但是HBase使这成为可能。
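
To make the re-write step concrete, here is a hedged sketch assuming a 5-minute time range per row and a column family named t (both illustrative choices, not OpenTSDB’s actual layout):

long rangeMillis = 5 * 60 * 1000L;
long rangeBase = (timestamp / rangeMillis) * rangeMillis;    // start of the 5-minute window
int offset = (int) (timestamp - rangeBase);                  // position of the event inside the window

// One row per [hostname][log-event][timerange]; one column per event, qualified by its offset.
byte[] rowkey = Bytes.add(
    Bytes.toBytes(hostname),
    Bytes.toBytes(logEvent),
    Bytes.toBytes(rangeBase));

Put put = new Put(rowkey);
put.addColumn(Bytes.toBytes("t"), Bytes.toBytes(offset), Bytes.toBytes(message));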

44.3. Case Study - Customer/Order

44.3。案例研究——客户/订单

Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.

假设HBase用于存储客户和订单信息。有两种核心记录类型被摄入:客户记录类型和订单记录类型。

The Customer record type would include all the things that you’d typically expect:

客户记录类型将包括您通常期望的所有内容:

  • Customer number

    客户编号

  • Customer name

    客户名称

  • Address (e.g., city, state, zip)

    地址(如城市、州、邮编)

  • Phone numbers, etc.

    电话号码等。

The Order record type would include things like:

订单记录类型包括以下内容:

  • Customer number

    客户编号

  • Order number

    订单号

  • Sales date

    销售日期

  • A series of nested objects for shipping locations and line-items (see Order Object Design for details)

    用于配送位置和行项目的一系列嵌套对象(详见Order对象设计)

Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as:

假设客户编号和销售订单的组合惟一地标识一个订单,这两个属性将组成rowkey,特别是组合键,例如:

[customer number][order number]

for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?

订单表。但是,还有更多的设计决策要做:原始值是行键的最佳选择吗?

The same design questions that we faced in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?)? As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:

在日志数据用例中,同样的设计问题将我们摆在这里。客户编号的关键空间是什么,格式是什么(例如,数字?)字母数字?)由于在HBase中使用固定长度的键,以及可以支持在keyspace中合理扩展的键,类似的选项出现了:

Composite Rowkey With Hashes:

复合Rowkey散列:

  • [MD5 of customer number] = 16 bytes

    (客户编号的MD5) = 16字节。

  • [MD5 of order number] = 16 bytes

    [MD5的订单号]= 16字节。

Composite Numeric/Hash Combo Rowkey:

复合数字/散列组合Rowkey:

  • [substituted long for customer number] = 8 bytes

    [取代顾客编号]= 8字节。

  • [MD5 of order number] = 16 bytes

    [MD5的订单号]= 16字节。

44.3.1. Single Table? Multiple Tables?

44.3.1。单表吗?多个表吗?

A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).

传统的设计方法将为客户和销售提供单独的表。另一种方法是将多个记录类型打包到单个表中(例如,CUSTOMER++)。

Customer Record Type Rowkey:

客户记录类型Rowkey:

  • [customer-id]

    (客户id)

  • [type] = type indicating `1' for customer record type

    [type] =类型指示' 1'为客户记录类型。

Order Record Type Rowkey:

Rowkey顺序记录类型:

  • [customer-id]

    (客户id)

  • [type] = type indicating `2' for order record type

    [type] =类型指示' 2'的订单记录类型。

  • [order]

    (订单)

The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it’s not as easy to scan for a particular record-type.

这种特殊的客户++方法的优点是通过客户id组织了许多不同的记录类型(例如,一次扫描可以让您了解客户的一切)。缺点是不容易扫描到特定的记录类型。
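
As a rough sketch (the type markers, family, and byte-array variables are assumptions for illustration), the two key shapes and a single prefix Scan over one customer could look like this, using Scan.setRowPrefixFilter (HBase 1.0+):

byte[] CUSTOMER_TYPE = new byte[] { 1 };   // `1' for customer record type
byte[] ORDER_TYPE    = new byte[] { 2 };   // `2' for order record type

byte[] customerRow = Bytes.add(customerId, CUSTOMER_TYPE);
byte[] orderRow    = Bytes.add(customerId, ORDER_TYPE, orderId);

// A single scan over the [customer-id] prefix returns the customer record and all of its orders.
Scan scan = new Scan();
scan.setRowPrefixFilter(customerId);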

44.3.2. Order Object Design

44.3.2。订单对象设计

Now we need to address how to model the Order object. Assume that the class structure is as follows:

现在我们需要讨论如何建模Order对象。假设类结构如下:

Order

(an Order can have multiple ShippingLocations)

(一个订单可以有多个shippinglocation。

LineItem

(a ShippingLocation can have multiple LineItems)

(ShippingLocation可以有多个LineItems。

There are multiple options for storing this data.

存储这些数据有多个选项。

Completely Normalized
完全归一化

With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.

使用这种方法,将会有单独的订单、SHIPPING_LOCATION和LINE_ITEM表。

The ORDER table’s rowkey was described above: schema.casestudies.custorder

上面描述了ORDER表的rowkey: schema.casestudies.custorder。

The SHIPPING_LOCATION’s composite rowkey would be something like this:

SHIPPING_LOCATION的复合行键是这样的:

  • [order-rowkey]

    (order-rowkey)

  • [shipping location number] (e.g., 1st location, 2nd, etc.)

    [船舶位置编号](例如,第1位,第2号,等等)

The LINE_ITEM table’s composite rowkey would be something like this:

LINE_ITEM表的复合行键是这样的:

  • [order-rowkey]

    (order-rowkey)

  • [shipping location number] (e.g., 1st location, 2nd, etc.)

    [船舶位置编号](例如,第1位,第2号,等等)

  • [line item number] (e.g., 1st lineitem, 2nd, etc.)

    [行项目编号](如第1行、第2条等)

Such a normalized model is likely to be the approach with an RDBMS, but that’s not your only option with HBase. The cons of such an approach are that to retrieve information about any Order, you will need:

这种规范化模型很可能是使用RDBMS的方法,但这不是HBase的唯一选择。这种方法的缺点是检索关于任何订单的信息,您将需要:

  • Get on the ORDER table for the Order

    上订单的订单。

  • Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances

    为了获得ShippingLocation实例,请在SHIPPING_LOCATION表上进行扫描。

  • Scan on the LINE_ITEM for each ShippingLocation

    扫描每个ShippingLocation的LINE_ITEM。

Granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you’re just more aware of this fact.

当然,这是RDBMS在覆盖范围内所做的事情,但是由于HBase中没有连接,您只是更了解这个事实。

Single Table With Record Types
单台与记录类型。

With this approach, there would exist a single table ORDER that would contain all of the record types.

使用这种方法,将会有一个包含的单个表顺序。

The Order rowkey was described above: schema.casestudies.custorder

上面描述了Order rowkey: schema.casestudies.custorder。

  • [order-rowkey]

    (order-rowkey)

  • [ORDER record type]

    (订单记录类型)

The ShippingLocation composite rowkey would be something like this:

ShippingLocation组合rowkey是这样的:

  • [order-rowkey]

    (order-rowkey)

  • [SHIPPING record type]

    (航运记录类型)

  • [shipping location number] (e.g., 1st location, 2nd, etc.)

    [船舶位置编号](例如,第1位,第2号,等等)

The LineItem composite rowkey would be something like this:

LineItem组合rowkey是这样的:

  • [order-rowkey]

    (order-rowkey)

  • [LINE record type]

    (线记录类型)

  • [shipping location number] (e.g., 1st location, 2nd, etc.)

    [船舶位置编号](例如,第1位,第2号,等等)

  • [line item number] (e.g., 1st lineitem, 2nd, etc.)

    [行项目编号](如第1行、第2条等)

Denormalized
规范化的

A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.

使用记录类型方法的单个表的一个变体是对一些对象层次结构进行非规范化和扁平化,例如将ShippingLocation属性折叠到每个LineItem实例上。

The LineItem composite rowkey would be something like this:

LineItem组合rowkey是这样的:

  • [order-rowkey]

    (order-rowkey)

  • [LINE record type]

    (线记录类型)

  • [line item number] (e.g., 1st lineitem, 2nd, etc.; care must be taken that these are unique across the entire order)

    [line项目编号](例如,第1条lineitem,第2条,等等,必须注意在整个订单中有唯一的)

and the LineItem columns would be something like this:

LineItem列是这样的

  • itemNumber

    itemNumber)

  • quantity

    数量

  • price

    价格

  • shipToLine1 (denormalized from ShippingLocation)

    从ShippingLocation shipToLine1(规范化)

  • shipToLine2 (denormalized from ShippingLocation)

    从ShippingLocation shipToLine2(规范化)

  • shipToCity (denormalized from ShippingLocation)

    从ShippingLocation shipToCity(规范化)

  • shipToState (denormalized from ShippingLocation)

    从ShippingLocation shipToState(规范化)

  • shipToZip (denormalized from ShippingLocation)

    从ShippingLocation shipToZip(规范化)

The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.

这种方法的优点包括一个不那么复杂的对象层次结构,但是其中一个缺点是,如果这些信息发生变化,更新变得更加复杂。

Object BLOB
对象的团

With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table’s rowkey was described above: schema.casestudies.custorder, and a single column called "order" would contain an object that could be deserialized into a container holding the Order, its ShippingLocations, and its LineItems.

通过这种方法,整个Order对象图以一种或另一种方式被视为BLOB。例如,上面描述了ORDER表的rowkey: schema.casestudies。custorder和一个名为“order”的单一列将包含一个可以被反序列化的对象,该对象包含一个容器顺序、shippinglocation和LineItems。

There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes such that older persisted structures can still be read back out of HBase.

这里有许多选项:JSON、XML、Java序列化、Avro、Hadoop Writables等,它们都是相同方法的变体:将对象图编码为字节数组。应该注意这种方法,以确保在对象模型更改时向后兼容,这样旧的持久化结构仍然可以从HBase中读取。

Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.

优点是能够以最小的I / O管理复杂对象图(例如,单个HBase得到每个订单在这个例子中),但缺点包括序列化的上述警告向后兼容性,语言依赖性的序列化(例如,Java序列化仅适用于Java客户端),这个事实你必须反序列化整个对象得到任何信息在BLOB,和困难这样的框架蜂巢处理自定义对象。
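
As a rough illustration of the BLOB approach (plain Java serialization is used here only because it needs no extra libraries; Avro or protobuf would usually be a better fit for compatibility), assuming an Order class that implements Serializable, a column family d, and an existing Table handle:

// write: serialize the whole object graph into a single "order" column
ByteArrayOutputStream bos = new ByteArrayOutputStream();
try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
  oos.writeObject(order);    // the Order carries its ShippingLocations and LineItems
}
Put put = new Put(orderRowkey);
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("order"), bos.toByteArray());
table.put(put);

// read: a single Get, but the entire graph must be deserialized even for one field
Result result = table.get(new Get(orderRowkey));
byte[] blob = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("order"));
try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(blob))) {
  Order restored = (Order) ois.readObject();   // fails if the persisted model has drifted
}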

44.4. Case Study - "Tall/Wide/Middle" Schema Design Smackdown

44.4。案例研究——“高/宽/中”模式设计的Smackdown。

This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.

本节将描述在列表列表中出现的其他模式设计问题,特别是关于高和宽的表。这些是通用的指导方针,而不是法律——每个应用程序都必须考虑自己的需求。

44.4.1. Rows vs. Versions

44.4.1。行和版本

A common question is whether one should prefer rows or HBase’s built-in versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where the number is significantly above the HBase default of 1 max version). The rows-approach would require storing a timestamp in some portion of the rowkey so that successive updates do not overwrite one another.

一个常见的问题是,是否应该选择行或HBase的构建版本。上下文通常是要保留的行的“很多”版本(例如,它明显高于HBase默认的1 max版本)。行方法需要在行键的某个部分存储时间戳,这样它们就不会在每次连续更新时覆盖它们。

Preference: Rows (generally speaking).

偏好:行(一般来说)。

44.4.2. Rows vs. Columns

44.4.2。行与列

Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.

另一个常见的问题是,是否应该选择行或列。上下文通常是在一些宽表的极端情况下,例如有1行具有100万属性,或1百万行,每个列有1列。

Preference: Rows (generally speaking). To be clear, this guideline applies to the extremely wide cases, not to the standard use-case where one needs to store a few dozen or a few hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns."

偏好:行(一般来说)。要说明的是,这个指导方针是在非常广泛的情况下,而不是在标准的用例中,其中一个需要存储几十个或数百个列。但是在这两个选项之间也有一条中间路径,那就是“行作为列”。

44.4.3. Rows as Columns

44.4.3。行,列

The middle path between Rows and Columns is to pack data that would otherwise be separate rows into columns, for certain rows. OpenTSDB is the best example of this case, where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see schema.casestudies.log-steroids.

行与列之间的中间路径是打包数据,这些数据将单独列成列,用于某些行。OpenTSDB是这种情况下最好的例子,其中一行表示一个已定义的时间范围,然后将离散事件作为列处理。这种方法通常比较复杂,可能需要重新编写数据的额外复杂性,但是具有I/O效率的优势。有关此方法的概述,请参见schema.casestudies.log-类固醇。

44.5. Case Study - List Data

44.5。案例研究-列表数据。

The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.

下面是关于一个相当常见的问题的用户列表的交换:如何处理Apache HBase中的每个用户列表数据。

  • QUESTION

    问题*

We’re looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is to store the majority of the data in a key, so we could have something like:

我们正在研究如何在HBase中存储大量的(每个用户)列表数据,并且我们正在尝试找出最合理的访问模式。一种选择是将大部分数据存储在一个密钥中,这样我们就可以拥有如下内容:

<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using:

我们的另一个选择是完全使用:

<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...

where each row would contain multiple values. So in one case reading the first thirty values would be:

其中每一行都包含多个值。因此,在一个案例中,阅读前30个值是:

scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}

And in the second case it would be

第二种情况是。

get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)

一般的使用模式是只读取这些列表的前30个值,而不经常访问更深入到列表中。一些用户会⇐30总值在这些列表,和一些用户数百万(即幂律分布)

The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?

单值格式似乎需要在HBase上占用更多空间,但可以提供一些改进的检索/分页灵活性。有什么显著的性能优势可以通过获取和扫描的页面进行分页吗?

My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we’ll always need the same page size. I’ve ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we’d need to update all subsequent rows).

我最初的理解是,如果我们的分页大小未知(并适当地设置了缓存),那么做一个扫描应该更快,但是如果我们总是需要相同的页面大小,那就应该更快。我最终听到不同的人告诉我关于绩效的相反的事情。我假设页面大小是相对一致的,所以对于大多数用例,我们可以保证我们只希望在固定页长度的情况下只需要一页数据。我还假设我们会有不频繁的更新,但是可能会插入到这些列表的中间(这意味着我们需要更新所有后续的行)。

Thanks for help / suggestions / follow-up questions.

谢谢你的帮助/建议/后续问题。

  • ANSWER

    答案*

If I understand you correctly, you’re ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:

如果我正确地理解了您,您最终将尝试在表单“user, valueid, value”中存储三元组,对吗?例如,类似:

"user123, firstname, Paul",
"user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).

(但是用户名是固定宽度的,而valueids是固定宽度的)。

And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?

并且,您的访问模式是沿着:“对于用户X,列出接下来的30个值,从valueid Y开始”。是这样吗?这些值应该按照valueid的顺序返回吗?

The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you’re really sure it is needed.

tl;dr版本是,您应该使用每个用户的一行+值,而不是自己构建一个复杂的内部行分页方案,除非您真的确定它是必需的。

Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you’re giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn’t sound like you need that. Doing it this way is generally recommended (see here https://hbase.apache.org/book.html#schema.smackdown).

你的两种选择反映了人们在设计HBase模式时遇到的一个常见问题:我应该“高”还是“宽”?您的第一个模式是“tall”:每一行表示一个用户的一个值,因此每个用户的表中有许多行;行键是user + valueid,并且有(大概)一个表示“值”的列限定符。如果您想按行键(因此我上面的问题,关于这些id是否正确排序)进行扫描,这是很好的。您可以在任何用户+valueid上开始扫描,阅读下一个30,然后完成。你放弃的是在所有行中为一个用户提供事务保证的能力,但是听起来不像你需要的那样。这样做通常是推荐的(参见这里的https://hbase.apache.org/book.html#schema.smackdown)。

Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I’m guessing you jumped to the "paginated" version because you’re assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you’re not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn’t be fundamentally worse. The client has methods that allow you to get specific slices of columns.

第二个选项是“宽”:在一行中存储一串值,使用不同的限定符(修饰符是valueid)。简单的方法是将一个用户的所有值存储在一行中。我猜你跳到了“分页”的版本,因为你假设在一行中存储数百万列将不利于性能,这可能是也可能不是真的;只要你不想在一个单一的请求中做太多的事情,或者做一些像扫描和返回一行中的所有单元的事情,它就不会变得更糟。客户端有允许您获得特定的列的方法。

Note that neither case fundamentally uses more disk space than the other; you’re just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George’s excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).

注意,这两种情况都没有从根本上使用更多的磁盘空间;您只是将标识信息的一部分“转移”到左边(在行键中,在选项1中)或右边(在选项2中的列限定符中)。在覆盖下,每个键/值仍然存储整个行键和列姓。(如果这有点让人困惑的话,那就花一个小时,看看Lars George关于理解HBase模式设计的优秀视频:http://www.youtube.com/watch? v_hloh_pgrlk)。

A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don’t have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)

手动分页的版本有很多复杂的地方,如您所注意到的,如要跟踪每一页中有多少东西,如果插入新值,重新洗牌,等等。这似乎要复杂得多。它可能有一些轻微的速度优势(或缺点!)在极高的吞吐量,而且唯一的方法,真正知道那将是尝试它。如果您没有时间构建这两种方法并进行比较,那么我的建议将从最简单的选项(每个用户+值的一行)开始。开始简单重复!:)
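
A minimal sketch of the recommended "tall" option in client code, assuming fixed-width user and value ids (byte arrays), a column family v, and an existing Table handle; names are illustrative. One row per user+valueid, with the first 30 values coming back from a single short Scan:

// write: one row per user+value; the cell itself can stay empty
Put put = new Put(Bytes.add(fixedWidthUserId, fixedWidthValueId));
put.addColumn(Bytes.toBytes("v"), Bytes.toBytes(""), HConstants.EMPTY_BYTE_ARRAY);
table.put(put);

// read: the first 30 values for one user
Scan scan = new Scan();
scan.setRowPrefixFilter(fixedWidthUserId);   // all rows for this user, sorted by valueid
scan.setCaching(30);
int read = 0;
try (ResultScanner scanner = table.getScanner(scan)) {
  for (Result r : scanner) {
    // the valueid is the tail of the row key: r.getRow()
    if (++read >= 30) {
      break;
    }
  }
}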

45. Operational and Performance Configuration Options

45岁。操作和性能配置选项。

45.1. Tune HBase Server RPC Handling

45.1。调优HBase服务器RPC处理。

  • Set hbase.regionserver.handler.count (in hbase-site.xml) to cores x spindles for concurrency.

    设置hbase.regionserver.handler。计数(在hbase-site.xml中)到核心x轴的并发性。

  • Optionally, split the call queues into separate read and write queues for differentiated service. The parameter hbase.ipc.server.callqueue.handler.factor specifies the number of call queues:

    可选地,将调用队列分割为单独的读和写队列,以区别服务。参数hbase.ipc.server.callqueue.handler。因子指定调用队列的数量:

    • 0 means a single shared queue

      0表示单个共享队列。

    • 1 means one queue for each handler.

      1表示每个处理程序的一个队列。

    • A value between 0 and 1 allocates the number of queues proportionally to the number of handlers. For instance, a value of .5 shares one queue between each two handlers.

      在0到1之间的一个值将队列的数量按比例分配给处理程序的数量。例如,.5的值在每个处理程序之间共享一个队列。

  • Use hbase.ipc.server.callqueue.read.ratio (hbase.ipc.server.callqueue.read.share in 0.98) to split the call queues into read and write queues:

    使用hbase.ipc.server.callqueue.read。(hbase.ipc.server.callqueue.read比例。将调用队列拆分为读和写队列:

    • 0.5 means there will be the same number of read and write queues

      0.5表示将会有相同数量的读和写队列。

    • < 0.5 for more read than write

      < 0.5 for more read than write。

    • > 0.5 for more write than read

      比读更多的写>。5。

  • Set hbase.ipc.server.callqueue.scan.ratio (HBase 1.0+) to split read call queues into small-read and long-read queues:

    设置hbase.ipc.server.callqueue.scan。比率(HBase 1.0+)将读调用队列分成小读和长读队列:

    • 0.5 means that there will be the same number of short-read and long-read queues

      0.5表示将会有相同数量的短读和长读队列。

    • < 0.5 for more short-read

      < 0.5用于更短的阅读。

    • > 0.5 for more long-read

      > 0.5用于更长的阅读。

45.2. Disable Nagle for RPC

45.2。禁用对RPC纳格尔

Disable Nagle’s algorithm. Delayed ACKs can add up to ~200ms to RPC round trip time. Set the following parameters:

纳格尔禁用的算法。延迟的ack可以加到~200ms到RPC往返时间。设置以下参数:

  • In Hadoop’s core-site.xml:

    在Hadoop的core-site.xml:

    • ipc.server.tcpnodelay = true

      ipc.server。tcpnodelay = true

    • ipc.client.tcpnodelay = true

      ipc.client。tcpnodelay = true

  • In HBase’s hbase-site.xml:

    在HBase的hbase-site.xml:

    • hbase.ipc.client.tcpnodelay = true

      hbase.ipc.client。tcpnodelay = true

    • hbase.ipc.server.tcpnodelay = true

      hbase.ipc.server。tcpnodelay = true

45.3. Limit Server Failure Impact

45.3。限制服务器故障影响

Detect regionserver failure as fast as reasonable. Set the following parameters:

尽可能快地检测区域服务器故障。设置以下参数:

  • In hbase-site.xml, set zookeeper.session.timeout to 30 seconds or less to bound failure detection (20-30 seconds is a good start).

    在hbase-site。xml,zookeeper.session设置。超时到30秒或更少的绑定故障检测(20-30秒是一个好的开始)。

  • Detect and avoid unhealthy or failed HDFS DataNodes: in hdfs-site.xml and hbase-site.xml, set the following parameters:

    检测和避免不健康的或失败的HDFS DataNodes:在HDFS站点。xml和hbase-site。xml,设置以下参数:

    • dfs.namenode.avoid.read.stale.datanode = true

      dfs.namenode.avoid.read.stale.datanode = true

    • dfs.namenode.avoid.write.stale.datanode = true

      dfs.namenode.avoid.write.stale.datanode = true

45.4. Optimize on the Server Side for Low Latency

45.4。优化服务器端的低延迟。

  • Skip the network for local blocks. In hbase-site.xml, set the following parameters:

    跳过本地块的网络。在hbase-site。xml,设置以下参数:

    • dfs.client.read.shortcircuit = true

      dfs.client.read。短路= true

    • dfs.client.read.shortcircuit.buffer.size = 131072 (Important to avoid OOME)

      dfs.client.read.shortcircuit.buffer。大小= 131072(避免OOME重要)

  • Ensure data locality. In hbase-site.xml, set hbase.hstore.min.locality.to.skip.major.compact = 0.7 (Meaning that 0.7 <= n <= 1)

    确保数据本地化。在hbase-site。xml,设置hbase.hstore.min. to.skip. compact = 0.7(意思是0.7 <= n <= 1)

  • Make sure DataNodes have enough handlers for block transfers. In hdfs-site.xml, set the following parameters:

    确保DataNodes有足够的处理块传输的处理程序。在hdfs-site。xml,设置以下参数:

    • dfs.datanode.max.xcievers >= 8192

      dfs.datanode.max。xcievers > = 8192

    • dfs.datanode.handler.count = number of spindles

      dfs.datanode.handler。锭数=锭数。

45.5. JVM Tuning

45.5。JVM调优

45.5.1. Tune JVM GC for low collection latencies

45.5.1。为低收集延迟调优JVM GC。

  • Use the CMS collector: -XX:+UseConcMarkSweepGC

    使用CMS收集器:-XX:+UseConcMarkSweepGC。

  • Keep eden space as small as possible to minimize average collection time. Example:

    保持eden空间尽可能小,以最小化平均收集时间。例子:

    -XX:CMSInitiatingOccupancyFraction=70
  • Optimize for low collection latency rather than throughput: -Xmn512m

    优化低收集延迟而不是吞吐量:-Xmn512m。

  • Collect eden in parallel: -XX:+UseParNewGC

    并行收集eden: -XX:+UseParNewGC。

  • Avoid collection under pressure: -XX:+UseCMSInitiatingOccupancyOnly

    避免在压力下收集:-XX:+UseCMSInitiatingOccupancyOnly。

  • Limit per request scanner result sizing so everything fits into survivor space but doesn’t tenure. In hbase-site.xml, set hbase.client.scanner.max.result.size to 1/8th of eden space (with -Xmn512m this is ~51MB)

    限制每个请求扫描结果的大小,所以所有的东西都适合于幸存者空间,但是没有使用期限。在hbase-site。xml,hbase.client.scanner.max.result设置。大小到eden空间的1/8(使用-Xmn512m这是~51MB)

  • Set max.result.size x handler.count less than survivor space

    设置max.result。尺寸x处理器。小于幸存者空间。

45.5.2. OS-Level Tuning

45.5.2。操作系统调优

  • Turn transparent huge pages (THP) off:

    打开透明的大页面(THP):

    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
  • Set vm.swappiness = 0

    设置虚拟机。swappiness = 0

  • Set vm.min_free_kbytes to at least 1GB (8GB on larger memory systems)

    设置虚拟机。min_free_kbytes至少为1GB(大内存系统的8GB)

  • Disable NUMA zone reclaim with vm.zone_reclaim_mode = 0

    使用vm禁用NUMA区回收。zone_reclaim_mode = 0

46. Special Cases

46岁。特殊情况

46.1. For applications where failing quickly is better than waiting

46.1。对于快速失败的应用程序,要比等待更好。

  • In hbase-site.xml on the client side, set the following parameters:

    在hbase-site。xml在客户端,设置以下参数:

    • Set hbase.client.pause = 1000

      设置hbase.client。暂停= 1000

    • Set hbase.client.retries.number = 3

      设置hbase.client.retries。数量= 3

    • If you want to ride over splits and region moves, increase hbase.client.retries.number substantially (>= 20)

      如果你想跨越分裂和区域移动,增加hbase.client. retry。数量大幅(> = 20)

    • Set the RecoverableZookeeper retry count: zookeeper.recovery.retry = 1 (no retry)

      设置可恢复动物管理员重试计数:zookeeper.recovery。重试= 1(不重试)

  • In hbase-site.xml on the server side, set the Zookeeper session timeout for detecting server failures: zookeeper.session.timeout <= 30 seconds (20-30 is good).

    在hbase-site。服务器端上的xml设置了检测服务器故障的Zookeeper会话超时:zookeeper.session。超时⇐30秒(20 - 30是好的)。
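
The client-side values above can also be set programmatically on the client’s Configuration; a minimal sketch, using the property names exactly as listed:

Configuration conf = HBaseConfiguration.create();   // also picks up hbase-site.xml if present
conf.setInt("hbase.client.pause", 1000);
conf.setInt("hbase.client.retries.number", 3);      // raise substantially (>= 20) to ride over splits and region moves
conf.setInt("zookeeper.recovery.retry", 1);         // RecoverableZookeeper: no retry
Connection connection = ConnectionFactory.createConnection(conf);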

46.2. For applications that can tolerate slightly out of date information

46.2。适用于那些可以稍微超出日期信息的应用程序。

HBase timeline consistency (HBASE-10070): With read replicas enabled, read-only copies of regions (replicas) are distributed over the cluster. One RegionServer services the default or primary replica, which is the only replica that can service writes. Other RegionServers serve the secondary replicas, follow the primary RegionServer, and only see committed updates. The secondary replicas are read-only, but can serve reads immediately while the primary is failing over, cutting read availability blips from seconds to milliseconds. Phoenix supports timeline consistency as of 4.4.0. Tips:

HBase时间轴一致性(HBase -10070)具有读取的副本,而区域(副本)的只读副本分布在集群上。一个区域服务器服务默认或主副本,这是唯一可以服务的副本。其他区域服务器服务于次要副本,跟随主区域服务器,只看到提交的更新。二级副本是只读的,但是在主服务器失败时可以立即执行读取操作,从秒到毫秒将读取可用性blips。Phoenix支持时间轴一致性为4.4.0的提示:

  • Deploy HBase 1.0.0 or later.

    部署HBase 1.0.0或更高版本。

  • Enable timeline consistent replicas on the server side.

    在服务器端启用时间轴一致的副本。

  • Use one of the following methods to set timeline consistency:

    使用以下方法设置时间轴一致性:

    • Use ALTER SESSION SET CONSISTENCY = 'TIMELINE’

      使用ALTER SESSION SET一致性= 'TIMELINE '

    • Set the connection property Consistency to timeline in the JDBC connect string

      将连接属性的一致性设置为JDBC连接字符串中的时间线。

46.3. More Information

46.3。更多的信息

See the Performance section perf.schema for more information about operational and performance schema design options, such as Bloom Filters, Table-configured regionsizes, compression, and blocksizes.

查看性能部分perf。关于操作和性能模式设计选项的更多信息的模式,如Bloom filter、表配置的区域大小、压缩和块大小。

HBase and MapReduce

HBase和MapReduce

Apache MapReduce is a software framework used to analyze large amounts of data. It is provided by Apache Hadoop. MapReduce itself is out of the scope of this document. A good place to get started with MapReduce is https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html. MapReduce version 2 (MR2) is now part of YARN.

Apache MapReduce是一个用于分析大量数据的软件框架。它由Apache Hadoop提供。MapReduce本身超出了这个文档的范围。从MapReduce开始的一个好地方是https://hadoop.apache.org/docs/r2.6.0/hadoop- MapReduce -client/hadoop- MapReduce -客户- core/mapreducetutories.html。MapReduce版本2 (MR2)现在是纱线的一部分。

This chapter discusses specific configuration steps you need to take to use MapReduce on data within HBase. In addition, it discusses other interactions and issues between HBase and MapReduce jobs. Finally, it discusses Cascading, an alternative API for MapReduce.

本章讨论了在HBase中使用MapReduce数据时需要采取的具体配置步骤。此外,还讨论了HBase与MapReduce作业之间的其他交互和问题。最后,讨论了MapReduce的一个替代API级联。

mapred and mapreduce

There are two mapreduce packages in HBase, as in MapReduce itself: org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce. The former uses the old-style API and the latter the new mode. The latter has more facilities, though you can usually find an equivalent in the older package. Pick the package that goes with your MapReduce deploy. When in doubt or starting over, pick org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to o.a.h.h.mapreduce but replace with o.a.h.h.mapred if that is what you are using.

在HBase中有两个mapreduce包,如mapreduce本身:org.apache.hadoop.hbase。mapred org.apache.hadoop.hbase.mapreduce。前者采用旧式API,后者采用新模式。后者有更多的功能,尽管您通常可以在旧的包中找到等价的。选择与MapReduce部署相关的包。当有疑问或重新开始时,选择org.apache.hadoop.hbase.mapreduce。在下面的笔记中,我们提到了o.a.h.h.。mapreduce但是替换o。h。h。如果你使用的是mapred。

47. HBase, MapReduce, and the CLASSPATH

47岁。HBase、MapReduce和类路径。

By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.

默认情况下,部署到MapReduce集群的MapReduce作业不能访问HBASE_CONF_DIR或HBase类下的HBase配置。

To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add hbase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.

为了给MapReduce工作提供所需的访问权限,您可以添加hbase-site。xml_to _$HADOOP_HOME/conf,并将HBase jar添加到$HADOOP_HOME/lib目录。然后,您需要在集群中复制这些更改。或者你可以编辑$HADOOP_HOME/conf/hadoop-env。将hbase依赖项添加到hadoop - classpath变量中。这两种方法都不推荐使用,因为它会使用HBase引用污染您的Hadoop安装。它还要求您在Hadoop使用HBase数据之前重新启动Hadoop集群。

The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.

推荐的方法是让HBase添加它的依赖项jar,并使用hadoop - classpath或-libjars。

Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The dependencies only need to be available on the local CLASSPATH, and from here they’ll be picked up and bundled into the fat job jar deployed to the MapReduce cluster. A basic trick just passes the full hbase classpath (all hbase and dependent jars as well as configurations) to the mapreduce job runner, letting the hbase utility pick out from the full-on classpath what it needs and add it to the MapReduce job configuration (see the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done).

自从HBase 0.90。HBase将其依赖项jar添加到作业配置本身。依赖项只需要在本地类路径上可用,从这里它们将被打包到部署到MapReduce集群的fat job jar中。基本技巧只是通过完整的hbase类路径有依赖关系的jar——hbase和mapreduce工作以及配置——跑步让hbase效用挑选从全面类路径中需要将它们添加到mapreduce任务配置(见源代码在TableMapReduceUtil # addDependencyJars(org.apache.hadoop.mapreduce.Job)这是如何实现的)。

The following example runs the bundled HBase RowCounter MapReduce job against a table named usertable. It sets into HADOOP_CLASSPATH the jars hbase needs to run in a MapReduce context (including configuration files such as hbase-site.xml). Be sure to use the correct version of the HBase JAR for your system; replace the VERSION string in the below command line with the version of your local hbase install. The backticks (` symbols) cause the shell to execute the sub-commands, setting the output of hbase classpath into HADOOP_CLASSPATH. This example assumes you use a BASH-compatible shell.

下面的示例针对一个名为usertable的表运行了绑定的HBase RowCounter MapReduce作业。它设置到hadoop - classpath中,jar hbase需要在MapReduce上下文中运行(包括配置文件,如hbase-site.xml)。确保您的系统使用了正确的HBase JAR版本;将版本字符串替换为以下命令行w/本地hbase安装版本。backticks(符号)导致shell执行子命令,将hbase类路径的输出设置为hadoop - classpath。本例假设您使用的是bash兼容的shell。

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-mapreduce-VERSION.jar \
  org.apache.hadoop.hbase.mapreduce.RowCounter usertable

The above command will launch a row counting mapreduce job against the hbase cluster that is pointed to by your local configuration on a cluster that the hadoop configs are pointing to.

上面的命令将在hadoop configs指向的集群上的本地配置中,启动一个针对hbase集群的行计数mapreduce作业。

The main class of the hbase-mapreduce.jar is a Driver that lists a few basic mapreduce tasks that ship with hbase. For example, presuming your install is hbase 2.0.0-SNAPSHOT:

主要用于hbase- apreduce。jar是一个驱动程序,列出了一些基本的mapreduce任务。例如,假设您的安装是hbase 2.0.0快照:

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-mapreduce-2.0.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
  CellCounter: Count cells in HBase table.
  WALPlayer: Replay WAL files.
  completebulkload: Complete a bulk data load.
  copytable: Export a table from local cluster to peer cluster.
  export: Write table data to HDFS.
  exportsnapshot: Export the specific snapshot to a given FileSystem.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table.
  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed after being appended to the log.

You can use the above listed shortnames for mapreduce jobs as in the below re-run of the row counter job (again, presuming your install is hbase 2.0.0-SNAPSHOT):

您可以使用上面列出的短名称来进行mapreduce作业,就像下面重新运行的行计数器作业一样(同样,假设您的安装是hbase 2.0.0快照):

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` \
  ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/lib/hbase-mapreduce-2.0.0-SNAPSHOT.jar \
  rowcounter usertable

You might find the more selective hbase mapredcp tool output of interest; it lists the minimum set of jars needed to run a basic mapreduce job against an hbase install. It does not include configuration. You’ll probably need to add these if you want your MapReduce job to find the target cluster. You’ll probably have to also add pointers to extra jars once you start to do anything of substance. Just specify the extras by passing the system property -Dtmpjars when you run hbase mapredcp.

您可能会发现更有选择的hbase mapredcp工具输出感兴趣;它列出了在hbase安装基础上运行基本mapreduce作业所需的最小jar集。它不包括配置。如果希望MapReduce任务找到目标集群,您可能需要添加这些内容。当你开始做任何实质性的事情时,你可能还需要添加指向额外jar的指针。当您运行hbase mapredcp时,只需通过传递系统propery -Dtmpjars来指定额外的功能。

For jobs that do not package their dependencies or call TableMapReduceUtil#addDependencyJars, the following command structure is necessary:

对于不打包其依赖项或调用TableMapReduceUtil#addDependencyJars的作业,需要以下命令结构:

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(${HBASE_HOME}/bin/hbase mapredcp | tr ':' ',') ...

The example may not work if you are running HBase from its build directory rather than an installed location. You may see an error like the following:

如果您是从构建目录而不是安装位置运行HBase,那么这个示例可能不会起作用。您可能会看到如下错误:

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper

If this occurs, try modifying the command as follows, so that it uses the HBase JARs from the target/ directory within the build environment.

如果发生这种情况,请尝试修改以下命令,以便在构建环境中使用目标/目录中的HBase jar。

$ HADOOP_CLASSPATH=${HBASE_BUILD_HOME}/hbase-mapreduce/target/hbase-mapreduce-VERSION-SNAPSHOT.jar:`${HBASE_BUILD_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_BUILD_HOME}/hbase-mapreduce/target/hbase-mapreduce-VERSION-SNAPSHOT.jar rowcounter usertable
Notice to MapReduce users of HBase between 0.96.1 and 0.98.4

Some MapReduce jobs that use HBase fail to launch. The symptom is an exception similar to the following:

一些使用HBase的MapReduce作业无法启动。该症状与以下情况类似:

Exception in thread "main" java.lang.IllegalAccessError: class
    com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
    com.google.protobuf.LiteralByteString
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at
    org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
    at
    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
    at
    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
    at
    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
    at
    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
    at
    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
...

This is caused by an optimization introduced in HBASE-9867 that inadvertently introduced a classloader dependency.

这是由HBASE-9867引入的优化导致的,它无意中引入了类加载器的依赖关系。

This affects both jobs using the -libjars option and "fat jar," those which package their runtime dependencies in a nested lib folder.

这将使用-libjar选项和“fat jar”来影响两个作业,它们将运行时依赖包打包在一个嵌套的lib文件夹中。

In order to satisfy the new classloader requirements, hbase-protocol.jar must be included in Hadoop’s classpath. See HBase, MapReduce, and the CLASSPATH for current recommendations for resolving classpath errors. The following is included for historical purposes.

为了满足新的类加载器的要求,hbase-protocol。jar必须包含在Hadoop的类路径中。有关解决类路径错误的当前建议,请参见HBase、MapReduce和类路径。以下内容为历史目的。

This can be resolved system-wide by including a reference to the hbase-protocol.jar in Hadoop’s lib directory, via a symlink or by copying the jar into the new location.

这可以通过包括对hbase协议的引用来解决整个系统。jar在Hadoop的lib目录中,通过一个符号链接或将jar复制到新的位置。

This can also be achieved on a per-job launch basis by including it in the HADOOP_CLASSPATH environment variable at job submission time. When launching jobs that package their dependencies, all three of the following job launching commands satisfy this requirement:

这也可以在每个工作的发布基础上实现,包括在作业提交时间的hadoop - classpath环境变量中。当启动包其依赖项的作业时,以下三个工作启动命令都满足以下要求:

$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass

For jars that do not package their dependencies, the following command structure is necessary:

对于不打包其依赖项的jar,需要以下命令结构:

$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...

See also HBASE-10304 for further discussion of this issue.

请参阅HBASE-10304,进一步讨论这个问题。

48. MapReduce Scan Caching

48。MapReduce扫描缓存

TableMapReduceUtil now restores the option to set scanner caching (the number of rows which are cached before returning the result to the client) on the Scan object that is passed in. This functionality was lost due to a bug in HBase 0.95 (HBASE-11558), which is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is as follows:

TableMapReduceUtil现在重新存储了在传入的扫描对象上设置扫描缓存(在将结果返回给客户机之前缓存的行数)的选项。该功能由于HBase 0.95 (HBase -11558)中的bug而丢失,该缺陷在HBase 0.98.5和0.96.3中固定。选择扫描仪缓存的优先顺序如下:

  1. Caching settings which are set on the scan object.

    在扫描对象上设置的缓存设置。

  2. Caching settings which are specified via the configuration option hbase.client.scanner.caching, which can either be set manually in hbase-site.xml or via the helper method TableMapReduceUtil.setScannerCaching().

    通过配置选项hbase.client.scanner指定的缓存设置。缓存,可以在hbase站点中手动设置。xml或通过助手方法tablemapreduceutil.setscannercache()。

  3. The default value HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING, which is set to 100.

    默认值HConstants。default_hbase_client_scanner_cache设置为100。

Optimizing the caching settings is a balance between the time the client waits for a result and the number of sets of results the client needs to receive. If the caching setting is too large, the client could end up waiting for a long time or the request could even time out. If the setting is too small, the scan needs to return results in several pieces. If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the bucket.

优化缓存设置是客户端等待结果的时间和客户端需要接收的结果集的数量之间的平衡。如果缓存设置太大,客户端可能会等待很长时间,或者请求甚至超时。如果设置太小,扫描需要返回几个片段的结果。如果你把扫描看作是一个铲,一个更大的缓存设置类似于一个更大的铲子,一个较小的缓存设置相当于更多的铲雪来填满桶。

The list of priorities mentioned above allows you to set a reasonable default, and override it for specific operations.

上面提到的优先级列表允许您设置合理的默认值,并为特定操作覆盖它。
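
For example (a sketch using the classes named above, with tableName and MyMapper as placeholders in the style of the examples later in this chapter), a job-wide default can be set through TableMapReduceUtil.setScannerCaching() and still be overridden by the Scan object itself:

Job job = new Job(config, "ExampleScanCaching");
TableMapReduceUtil.setScannerCaching(job, 200);   // job-wide default, i.e. hbase.client.scanner.caching

Scan scan = new Scan();
scan.setCaching(500);   // highest priority: the setting on the Scan object wins for this job
TableMapReduceUtil.initTableMapperJob(tableName, scan, MyMapper.class, null, null, job);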

See the API documentation for Scan for more details.

有关详细信息,请参阅API文档。

49. Bundled HBase MapReduce Jobs

49。捆绑HBase MapReduce工作

The HBase JAR also serves as a Driver for some bundled MapReduce jobs. To learn about the bundled MapReduce jobs, run the following command.

HBase JAR还充当了一些绑定MapReduce作业的驱动程序。要了解绑定的MapReduce作业,请运行以下命令。

$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar
An example program must be given as the first argument.
Valid program names are:
  copytable: Export a table from local cluster to peer cluster
  completebulkload: Complete a bulk data load.
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table

Each of the valid program names are bundled MapReduce jobs. To run one of the jobs, model your command after the following example.

每个有效的程序名称都被绑定在MapReduce作业上。要运行其中一个作业,请在下面的示例中为您的命令建模。

$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar rowcounter myTable

50. HBase as a MapReduce Job Data Source and Data Sink

50。HBase作为MapReduce作业数据源和数据接收器。

HBase can be used as a data source, TableInputFormat, and data sink, TableOutputFormat or MultiTableOutputFormat, for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to subclass TableMapper and/or TableReducer. See the do-nothing pass-through classes IdentityTableMapper and IdentityTableReducer for basic usage. For a more involved example, see RowCounter or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.

HBase可作为数据源、TableInputFormat和数据接收器、TableOutputFormat或MultiTableOutputFormat,用于MapReduce作业。写MapReduce任务读或写HBase,建议子类化TableMapper和/或TableReducer。请参阅“不做”的传递类标识符和标识符类的基本用法。对于更复杂的示例,请参见RowCounter或查看org.apache.hadoop.hbase.mapreduce。TestTableMapReduce单元测试。

If you run MapReduce jobs that use HBase as source or sink, you need to specify the source and sink table and column names in your configuration.

如果使用HBase作为源或接收器运行MapReduce作业,则需要在配置中指定源和sink表和列名称。

When you read from HBase, the TableInputFormat requests the list of regions from HBase and makes a map task per region, or mapreduce.job.maps tasks, whichever is smaller. If your job only has two maps, raise mapreduce.job.maps to a number greater than the number of regions. Maps will run on the adjacent TaskTracker/NodeManager if you are running a TaskTracker/NodeManager and RegionServer per node. When writing to HBase, it may make sense to avoid the Reduce step and write back into HBase from within your map. This approach works when your job does not need the sort and collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is no point double-sorting (and shuffling data around your MapReduce cluster) unless you need to. If you do not need the Reduce, your map might emit counts of records processed for reporting at the end of the job, or set the number of Reduces to zero and use TableOutputFormat. If running the Reduce step makes sense in your case, you should typically use multiple reducers so that load is spread across the HBase cluster.

当您从HBase中读取时,TableInputFormat请求HBase中的区域列表,并生成一个map,它要么是map-per-region,要么是mapreduce.job。地图地图,以较小的为准。如果你的工作只有两张地图,那就增加mapreduce。映射到大于区域数目的数字。如果您正在运行一个TaskTracer/NodeManager和每个节点的区域服务器,那么地图将在邻近的任务跟踪器/NodeManager上运行。当写入HBase时,避免减少步骤并从映射中返回到HBase可能是有意义的。当您的作业不需要MapReduce在地图上释放的数据的排序和排序时,这种方法就会起作用。在插入时,HBase“排序”,因此,除非您需要,否则没有必要对您的MapReduce集群进行重复排序(和调整数据)。如果您不需要Reduce,那么您的映射可能会在作业结束时发出处理报告的记录计数,或者将Reduce的数量设置为0,并使用TableOutputFormat。如果在您的情况下运行Reduce步骤是有意义的,那么您应该使用多个减速器,这样负载就会分散到HBase集群中。

A new HBase partitioner, the HRegionPartitioner, can run as many reducers as the number of existing regions. The HRegionPartitioner is suitable when your table is large and your upload will not greatly alter the number of existing regions upon completion. Otherwise use the default partitioner.

一个新的HBase分区者,h分区的参与者,可以在现有区域的数量上运行。当您的表很大,并且您的上载不会在完成时大大改变现有区域的数量时,hpartiator是合适的。否则使用默认的分区。
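
A hedged sketch of wiring in the HRegionPartitioner, via the TableMapReduceUtil overload that accepts a partitioner class (the other arguments are as in the summary example later in this chapter):

TableMapReduceUtil.initTableReducerJob(
  targetTable,               // output table
  MyTableReducer.class,      // reducer class
  job,
  HRegionPartitioner.class); // route each reducer's output to the region that will host it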

51. Writing HFiles Directly During Bulk Import

51。在批量导入期间直接编写HFiles。

If you are importing into a new table, you can bypass the HBase API and write your content directly to the filesystem, formatted into HBase data files (HFiles). Your import will run faster, perhaps an order of magnitude faster. For more on how this mechanism works, see Bulk Loading.

如果您正在导入一个新表,您可以绕过HBase API并将内容直接写到文件系统中,格式化为HBase数据文件(HFiles)。您的导入将会运行得更快,速度可能会更快。有关这个机制如何工作的更多信息,请参阅批量加载。
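
A minimal sketch of configuring such a job (HBase 1.0+ client API assumed; the table name, job class, and output path are placeholders). HFileOutputFormat2.configureIncrementalLoad() sets the output format, the appropriate sort reducer, and a partitioner that matches the table’s current region boundaries; the generated HFiles are then handed to the completebulkload tool:

Connection conn = ConnectionFactory.createConnection(config);
Table table = conn.getTable(TableName.valueOf("mytable"));
RegionLocator regionLocator = conn.getRegionLocator(TableName.valueOf("mytable"));

Job job = new Job(config, "ExampleBulkImport");
job.setJarByClass(MyBulkImportJob.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);                 // the mapper emits (rowkey, Put) pairs
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-output"));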

52. RowCounter Example

52岁。rowcount例子

The included RowCounter MapReduce job uses TableInputFormat and does a count of all rows in the specified table. To run it, use the following command:

包含的RowCounter MapReduce作业使用TableInputFormat,并对指定表中的所有行进行计数。要运行它,请使用以下命令:

$ ./bin/hadoop jar hbase-X.X.X.jar

This will invoke the HBase MapReduce Driver class. Select rowcounter from the choice of jobs offered. This will print rowcounter usage advice to standard output. Specify the tablename, column to count, and output directory. If you have classpath errors, see HBase, MapReduce, and the CLASSPATH.

这将调用HBase MapReduce驱动程序类。从提供的工作选择中选择rowcounter。这将打印rowcounter使用建议到标准输出。指定tablename、列计数和输出目录。如果您有类路径错误,请参见HBase、MapReduce和类路径。

53. Map-Task Splitting

53岁。地图任务分解

53.1. The Default HBase MapReduce Splitter

53.1。默认的HBase MapReduce Splitter。

When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.

当使用TableInputFormat在MapReduce作业中为HBase表提供源时,它的splitter将为表的每个区域生成一个映射任务。因此,如果表中有100个区域,那么无论在扫描中选择多少个列族,都将有100个映射任务。

53.2. Custom Splitters

53.2。自定义分割

For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.

对于那些有兴趣实现自定义拆分的人,请参见TableInputFormatBase中的方法get。这就是映射任务分配的逻辑所在。
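
As a hedged starting point, a custom splitter is just a subclass that overrides getSplits(); the sketch below simply delegates to the default per-region splits and marks where custom assignment logic would go:

public class MyTableInputFormat extends TableInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    // default behaviour: one split (and therefore one map task) per region
    List<InputSplit> splits = super.getSplits(context);
    // custom logic could merge, subdivide, or re-order the splits here
    return splits;
  }
}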

54. HBase MapReduce Examples

54。HBase MapReduce的例子

54.1. HBase MapReduce Read Example

54.1。HBase MapReduce读例子

The following is an example of using HBase as a MapReduce source in read-only manner. Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper. The job would be defined as follows…​

下面是以只读方式使用HBase作为MapReduce源的示例。具体来说,有一个Mapper实例,但没有减速器,并且没有任何东西从映射器中发出。这项工作的定义如下……

Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class);     // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs
...

TableMapReduceUtil.initTableMapperJob(
  tableName,        // input HBase table name
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper
  null,             // mapper output key
  null,             // mapper output value
  job);
job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

…​and the mapper instance would extend TableMapper…​

并且mapper实例将扩展TableMapper…

public static class MyMapper extends TableMapper<Text, Text> {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
    // process data for the row from the Result instance.
   }
}

54.2. HBase MapReduce Read/Write Example

54.2。HBase MapReduce读/写的例子

The following is an example of using HBase both as a source and as a sink with MapReduce. This example will simply copy data from one table to another.

下面是一个使用HBase作为源和使用MapReduce的接收器的示例。本例将简单地将数据从一个表复制到另一个表。

Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleReadWrite");
job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,      // input table
  scan,             // Scan instance to control CF and attribute selection
  MyMapper.class,   // mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,      // output table
  null,             // reducer class
  job);
job.setNumReduceTasks(0);

boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}

An explanation is required of what TableMapReduceUtil is doing, especially with the reducer. TableOutputFormat is being used as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key to ImmutableBytesWritable and reducer value to Writable. These could be set by the programmer on the job and conf, but TableMapReduceUtil tries to make things easier.

需要解释的是,TableMapReduceUtil在做什么,特别是在减速器上。TableOutputFormat被用作outputFormat类,并且在配置上设置了几个参数(例如,TableOutputFormat. output_table),以及将减缩输出键设置为ImmutableBytesWritable和reducer值可写。这些可以由程序员在作业和conf上设置,但是TableMapReduceUtil试图让事情变得更简单。

The following is the example mapper, which will create a Put matching the input Result and emit it. Note: this is what the CopyTable utility does.

下面的示例是mapper,它将创建一个Put并匹配输入结果并发出它。注意:这是CopyTable实用程序所做的。

public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put>  {

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // this example is just copying the data from the source table...
      context.write(row, resultToPut(row,value));
    }

    private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
      Put put = new Put(key.get());
      for (KeyValue kv : result.raw()) {
        put.add(kv);
      }
      return put;
    }
}

There isn’t actually a reducer step, so TableOutputFormat takes care of sending the Put to the target table.

实际上并没有减少的步骤,所以TableOutputFormat负责将Put发送到目标表。

This is just an example, developers could choose not to use TableOutputFormat and connect to the target table themselves.

这只是一个示例,开发人员可以选择不使用TableOutputFormat并连接到目标表本身。

54.3. HBase MapReduce Read/Write Example With Multi-Table Output

54.3。HBase MapReduce读/写例,具有多表输出。

TODO: example for MultiTableOutputFormat.

待办事项:MultiTableOutputFormat的示例。

54.4. HBase MapReduce Summary to HBase Example

54.4。HBase MapReduce概要到HBase示例。

The following example uses HBase as a MapReduce source and sink with a summarization step. This example will count the number of distinct instances of a value in a table and write those summarized counts in another table.

下面的示例使用HBase作为MapReduce源,并使用一个总结步骤。此示例将计算表中某个值的不同实例数,并将这些汇总计数写入另一个表中。

Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummary");
job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
TableMapReduceUtil.initTableReducerJob(
  targetTable,        // output table
  MyTableReducer.class,    // reducer class
  job);
job.setNumReduceTasks(1);   // at least one, adjust as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

In this example mapper a column with a String-value is chosen as the value to summarize upon. This value is used as the key to emit from the mapper, and an IntWritable represents an instance counter.

在这个示例中,选择一个带有字符串值的列作为总结的值。这个值用作从映射器发出的键,而IntWritable则表示实例计数器。

public static class MyMapper extends TableMapper<Text, IntWritable>  {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();

  private final IntWritable ONE = new IntWritable(1);
  private Text text = new Text();

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    text.set(val);     // we can only emit Writables...
    context.write(text, ONE);
  }
}

In the reducer, the "ones" are counted (just like in any other MR example that does this), and then a Put is emitted.

public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable>  {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(CF, COUNT, Bytes.toBytes(i));

    context.write(null, put);
  }
}

54.5. HBase MapReduce Summary to File Example

This is very similar to the summary example above, with the exception that it uses HBase as the MapReduce source but HDFS as the sink. The differences are in the job setup and in the reducer. The mapper remains the same.

Configuration config = HBaseConfiguration.create();
Job job = new Job(config,"ExampleSummaryToFile");
job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
// set other scan attrs

TableMapReduceUtil.initTableMapperJob(
  sourceTable,        // input table
  scan,               // Scan instance to control CF and attribute selection
  MyMapper.class,     // mapper class
  Text.class,         // mapper output key
  IntWritable.class,  // mapper output value
  job);
job.setReducerClass(MyReducer.class);    // reducer class
job.setNumReduceTasks(1);    // at least one, adjust as required
FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required

boolean b = job.waitForCompletion(true);
if (!b) {
  throw new IOException("error with job!");
}

As stated above, the previous Mapper can run unchanged with this example. As for the Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting Puts.

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>  {

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int i = 0;
    for (IntWritable val : values) {
      i += val.get();
    }
    context.write(key, new IntWritable(i));
  }
}

54.6. HBase MapReduce Summary to HBase Without Reducer

It is also possible to perform summaries without a reducer - if you use HBase as the reducer.

An HBase target table would need to exist for the job summary. The Table method incrementColumnValue would be used to atomically increment values. From a performance perspective, it might make sense to keep a Map of values with their counts to be incremented for each map-task, and make one update per key during the cleanup method of the mapper. However, your mileage may vary depending on the number of rows to be processed and the number of unique keys. A sketch is shown below.

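A minimal sketch of such a mapper follows. It assumes a hypothetical summary table named "summaryTable" and reuses the "cf:attr1" column from the earlier summary example; adapt the names and the flush strategy to your own job.

public static class MySummaryDirectMapper extends TableMapper<NullWritable, NullWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  public static final byte[] COUNT = "count".getBytes();

  private Connection connection;
  private Table summaryTable;
  // per-task buffer of value -> count, flushed once in cleanup
  private final Map<String, Long> counts = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    // "summaryTable" is a placeholder for the actual target table name
    summaryTable = connection.getTable(TableName.valueOf("summaryTable"));
  }

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1L);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // one atomic increment per distinct value seen by this map task
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      summaryTable.incrementColumnValue(Bytes.toBytes(entry.getKey()), CF, COUNT, entry.getValue());
    }
    summaryTable.close();
    connection.close();
  }
}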

In the end, the summary results are in HBase.

54.7. HBase MapReduce Summary to RDBMS

Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer. The setup method can connect to an RDBMS (the connection information can be passed via custom parameters in the context) and the cleanup method can close the connection.

It is critical to understand that the number of reducers for the job affects the summarization implementation, and you’ll have to design this into your reducer. Specifically, whether it is designed to run as a singleton (one reducer) or as multiple reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more reducers that are assigned to the job, the more simultaneous connections to the RDBMS will be created - this will scale, but only to a point.

public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable>  {

  private Connection c = null;

  public void setup(Context context) {
    // create DB connection...
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // do summarization
    // in this example the keys are Text, but this is just an example
  }

  public void cleanup(Context context) {
    // close db connection
  }

}
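
For illustration, the setup and cleanup stubs above might be filled in with plain JDBC along the following lines, assuming c is the java.sql.Connection declared above. The summary.jdbc.* configuration keys are hypothetical names for whatever custom parameters you choose to pass in via the job Configuration.

public void setup(Context context) throws IOException {
  Configuration conf = context.getConfiguration();
  try {
    // connection information passed in via custom job parameters (hypothetical key names)
    c = DriverManager.getConnection(
        conf.get("summary.jdbc.url"),
        conf.get("summary.jdbc.user"),
        conf.get("summary.jdbc.password"));
  } catch (SQLException e) {
    throw new IOException("Could not connect to the RDBMS", e);
  }
}

public void cleanup(Context context) throws IOException {
  try {
    if (c != null) {
      c.close();   // close db connection
    }
  } catch (SQLException e) {
    throw new IOException("Could not close the RDBMS connection", e);
  }
}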

In the end, the summary results are written to your RDBMS table(s).

55. Accessing Other HBase Tables in a MapReduce Job

Although the framework currently allows one HBase table as input to a MapReduce job, other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating a Table instance in the setup method of the Mapper.

public class MyMapper extends TableMapper<Text, LongWritable> {
  private Connection connection;
  private Table myOtherTable;

  public void setup(Context context) throws IOException {
    // Create a Connection to the cluster (or reuse the Connection from the existing table)
    // and keep a reference to the lookup table
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
  }

  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
    // process Result...
    // use 'myOtherTable' for lookups
  }
}

56. Speculative Execution

It is generally advisable to turn off speculative execution for MapReduce jobs that use HBase as a source. This can either be done on a per-Job basis through properties, or on the entire cluster. Especially for longer running jobs, speculative execution will create duplicate map-tasks which will double-write your data to HBase; this is probably not what you want.

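For example, to turn it off for a single job you can set the standard Hadoop properties on the job’s Configuration before submitting it (the MRv2 property names are shown; older MRv1 clusters use the mapred.*.tasks.speculative.execution equivalents, and the job name below is arbitrary):

Configuration conf = HBaseConfiguration.create();
// disable speculative execution of map and reduce tasks for this job only
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);
Job job = Job.getInstance(conf, "MyHBaseSourcedJob");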

See spec.ex for more information.

57. Cascading

Cascading is an alternative API for MapReduce, which actually uses MapReduce, but allows you to write your MapReduce code in a simplified way.

The following example shows a Cascading Flow which "sinks" data into an HBase cluster. The same hBaseTap API could be used to "source" data as well.

// read data from the default filesystem
// emits two fields: "offset" and "line"
Tap source = new Hfs( new TextLine(), inputFileLhs );

// store data in an HBase cluster
// accepts fields "num", "lower", and "upper"
// will automatically scope incoming fields to their proper familyname, "left" or "right"
Fields keyFields = new Fields( "num" );
String[] familyNames = {"left", "right"};
Fields[] valueFields = new Fields[] {new Fields( "lower" ), new Fields( "upper" ) };
Tap hBaseTap = new HBaseTap( "multitable", new HBaseScheme( keyFields, familyNames, valueFields ), SinkMode.REPLACE );

// a simple pipe assembly to parse the input into fields
// a real app would likely chain multiple Pipes together for more complex processing
Pipe parsePipe = new Each( "insert", new Fields( "line" ), new RegexSplitter( new Fields( "num", "lower", "upper" ), " " ) );

// "plan" a cluster executable Flow
// this connects the source Tap and hBaseTap (the sink Tap) to the parsePipe
Flow parseFlow = new FlowConnector( properties ).connect( source, hBaseTap, parsePipe );

// start the flow, and block until complete
parseFlow.complete();

// open an iterator on the HBase table we stuffed data into
TupleEntryIterator iterator = parseFlow.openSink();

while (iterator.hasNext()) {
  // print out each tuple from HBase
  System.out.println( "iterator.next() = " + iterator.next() );
}

iterator.close();

Securing Apache HBase

Reporting Security Bugs
To protect existing HBase installations from exploitation, please do not use JIRA to report security-related bugs. Instead, send your report to the mailing list private@apache.org, which allows anyone to send messages, but restricts who can read them. Someone on that list will contact you to follow up on your report.

HBase adheres to the Apache Software Foundation’s policy on reported vulnerabilities, available at http://apache.org/security/.

If you wish to send an encrypted report, you can use the GPG details provided for the general ASF security list. This will likely increase the response time to your report.

HBase provides mechanisms to secure various components and aspects of HBase and how it relates to the rest of the Hadoop infrastructure, as well as clients and resources outside Hadoop.

58. Using Secure HTTP (HTTPS) for the Web UI

A default HBase install uses insecure HTTP connections for Web UIs for the master and region servers. To enable secure HTTP (HTTPS) connections instead, set hbase.ssl.enabled to true in hbase-site.xml. This does not change the port used by the Web UI. To change the port for the web UI for a given HBase component, configure that port’s setting in hbase-site.xml. These settings are:

  • hbase.master.info.port

  • hbase.regionserver.info.port

If you enable secure HTTP, clients should avoid the non-secure HTTP connection and connect to HBase using the https:// URL. Clients using the http:// URL will receive an HTTP response of 200, but will not receive any data. The following exception is logged:

javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?

This is because the same port is used for HTTP and HTTPS.

HBase uses Jetty for the Web UI. Without modifying Jetty itself, it does not seem possible to configure Jetty to redirect one port to another on the same host. See Nick Dimiduk’s contribution on this Stack Overflow thread for more information. If you know how to fix this without opening a second port for HTTPS, patches are appreciated.

59. Using SPNEGO for Kerberos authentication with Web UIs

Kerberos authentication for HBase Web UIs can be enabled by configuring SPNEGO via the hbase.security.authentication.ui property in hbase-site.xml. Enabling this authentication requires that HBase also be configured to use Kerberos authentication for RPCs (e.g. hbase.security.authentication = kerberos).

<property>
  <name>hbase.security.authentication.ui</name>
  <value>kerberos</value>
  <description>Controls what kind of authentication should be used for the HBase web UIs.</description>
</property>
<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
  <description>The authentication mechanism HBase uses for RPCs. Must be kerberos for SPNEGO authentication of the web UIs to work.</description>
</property>

A number of properties exist to configure SPNEGO authentication for the web server:

<property>
  <name>hbase.security.authentication.spnego.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
  <description>Required for SPNEGO, the Kerberos principal to use for SPNEGO authentication by the
  web server. The _HOST keyword will be automatically substituted with the node's
  hostname.</description>
</property>
<property>
  <name>hbase.security.authentication.spnego.kerberos.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
  <description>Required for SPNEGO, the Kerberos keytab file to use for SPNEGO authentication by the
  web server.</description>
</property>
<property>
  <name>hbase.security.authentication.spnego.kerberos.name.rules</name>
  <value></value>
  <description>Optional, Hadoop-style `auth_to_local` rules which will be parsed and used in the
  handling of Kerberos principals</description>
</property>
<property>
  <name>hbase.security.authentication.signature.secret.file</name>
  <value></value>
  <description>Optional, a file whose contents will be used as a secret to sign the HTTP cookies
  as a part of the SPNEGO authentication handshake. If this is not provided, Java's `Random` library
  will be used for the secret.</description>
</property>

60. Secure Client Access to Apache HBase

Newer releases of Apache HBase (>= 0.92) support optional SASL authentication of clients. See also Matteo Bertozzi’s article on Understanding User Authentication and Authorization in Apache HBase.

This describes how to set up Apache HBase and clients for connection to secure HBase resources.

60.1. Prerequisites

Hadoop Authentication Configuration

To run HBase RPC with strong authentication, you must set hbase.security.authentication to kerberos. In this case, you must also set hadoop.security.authentication to kerberos in core-site.xml. Otherwise, you would be using strong authentication for HBase but not for the underlying HDFS, which would cancel out any benefit.

Kerberos KDC

You need to have a working Kerberos KDC.

60.2. Server-side Configuration for Secure Operation

First, refer to security.prerequisites and ensure that your underlying HDFS configuration is secure.

Add the following to the hbase-site.xml file on every server machine in the cluster:

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.token.TokenProvider</value>
</property>

A full shutdown and restart of HBase service is required when deploying these configuration changes.

60.3. Client-side Configuration for Secure Operation

First, refer to Prerequisites and ensure that your underlying HDFS configuration is secure.

Add the following to the hbase-site.xml file on every client:

<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>

The client environment must be logged in to Kerberos from KDC or keytab via the kinit command before communication with the HBase cluster will be possible.

Be advised that if the hbase.security.authentication in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.

Once HBase is configured for secure RPC it is possible to optionally configure encrypted communication. To do so, add the following to the hbase-site.xml file on every client:

<property>
  <name>hbase.rpc.protection</name>
  <value>privacy</value>
</property>

This configuration property can also be set on a per-connection basis. Set it in the Configuration supplied to Table:

Configuration conf = HBaseConfiguration.create();
conf.set("hbase.rpc.protection", "privacy");
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf(tablename))) {
  // ... do your stuff
}

Expect a ~10% performance penalty for encrypted communication.

60.4. Client-side Configuration for Secure Operation - Thrift Gateway

Add the following to the hbase-site.xml file for every Thrift gateway:

<property>
  <name>hbase.thrift.keytab.file</name>
  <value>/etc/hbase/conf/hbase.keytab</value>
</property>
<property>
  <name>hbase.thrift.kerberos.principal</name>
  <value>$USER/_HOST@HADOOP.LOCALDOMAIN</value>
  <!-- TODO: This may need to be HTTP/_HOST@<REALM> and _HOST may not work. You may have to put the concrete full hostname. -->
</property>
<!-- Add these if you need to configure a different DNS interface from the default -->
<property>
  <name>hbase.thrift.dns.interface</name>
  <value>default</value>
</property>
<property>
  <name>hbase.thrift.dns.nameserver</name>
  <value>default</value>
</property>

Substitute the appropriate credential and keytab for $USER and $KEYTAB respectively.

In order to use the Thrift API principal to interact with HBase, it is also necessary to add the hbase.thrift.kerberos.principal to the acl table. For example, to give the Thrift API principal, thrift_server, administrative access, a command such as this one will suffice:

grant 'thrift_server', 'RWCA'

For more information about ACLs, please see the Access Control Labels (ACLs) section.

The Thrift gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the Thrift gateway itself. All client access via the Thrift gateway will use the Thrift gateway’s credential and have its privilege.

60.5. Configure the Thrift Gateway to Authenticate on Behalf of the Client

Client-side Configuration for Secure Operation - Thrift Gateway describes how to authenticate a Thrift client to HBase using a fixed user. As an alternative, you can configure the Thrift gateway to authenticate to HBase on the client’s behalf, and to access HBase using a proxy user. This was implemented in HBASE-11349 for Thrift 1, and HBASE-11474 for Thrift 2.

Limitations with Thrift Framed Transport

If you use framed transport, you cannot yet take advantage of this feature, because SASL does not work with Thrift framed transport at this time.

To enable it, do the following.

  1. Be sure Thrift is running in secure mode, by following the procedure described in Client-side Configuration for Secure Operation - Thrift Gateway.

  2. Be sure that HBase is configured to allow proxy users, as described in REST Gateway Impersonation Configuration.

  3. In hbase-site.xml for each cluster node running a Thrift gateway, set the property hbase.thrift.security.qop to one of the following three values:

    • privacy - authentication, integrity, and confidentiality checking.

    • integrity - authentication and integrity checking

    • authentication - authentication checking only

  4. Restart the Thrift gateway processes for the changes to take effect. If a node is running Thrift, the output of the jps command will list a ThriftServer process. To stop Thrift on a node, run the command bin/hbase-daemon.sh stop thrift. To start Thrift on a node, run the command bin/hbase-daemon.sh start thrift.

60.6. Configure the Thrift Gateway to Use the doAs Feature

Configure the Thrift Gateway to Authenticate on Behalf of the Client describes how to configure the Thrift gateway to authenticate to HBase on the client’s behalf, and to access HBase using a proxy user. The limitation of this approach is that after the client is initialized with a particular set of credentials, it cannot change these credentials during the session. The doAs feature provides a flexible way to impersonate multiple principals using the same client. This feature was implemented in HBASE-12640 for Thrift 1, but is currently not available for Thrift 2.

To enable the doAs feature, add the following to the hbase-site.xml file for every Thrift gateway:

<property>
  <name>hbase.regionserver.thrift.http</name>
  <value>true</value>
</property>
<property>
  <name>hbase.thrift.support.proxyuser</name>
  <value>true</value>
</property>

To allow proxy users when using doAs impersonation, add the following to the hbase-site.xml file for every HBase node:

<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.proxyuser.$USER.groups</name>
  <value>$GROUPS</value>
</property>
<property>
  <name>hadoop.proxyuser.$USER.hosts</name>
  <value>$HOSTS</value>
</property>

Take a look at the demo client to get an overall idea of how to use this feature in your client.

60.7. Client-side Configuration for Secure Operation - REST Gateway

Add the following to the hbase-site.xml file for every REST gateway:

<property>
  <name>hbase.rest.keytab.file</name>
  <value>$KEYTAB</value>
</property>
<property>
  <name>hbase.rest.kerberos.principal</name>
  <value>$USER/_HOST@HADOOP.LOCALDOMAIN</value>
</property>

Substitute the appropriate credential and keytab for $USER and $KEYTAB respectively.

The REST gateway will authenticate with HBase using the supplied credential.

In order to use the REST API principal to interact with HBase, it is also necessary to add the hbase.rest.kerberos.principal to the acl table. For example, to give the REST API principal, rest_server, administrative access, a command such as this one will suffice:

grant 'rest_server', 'RWCA'

For more information about ACLs, please see the Access Control Labels (ACLs) section.

HBase REST gateway supports SPNEGO HTTP authentication for client access to the gateway. To enable REST gateway Kerberos authentication for client access, add the following to the hbase-site.xml file for every REST gateway.

<property>
  <name>hbase.rest.support.proxyuser</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rest.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.rest.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>hbase.rest.authentication.kerberos.keytab</name>
  <value>$KEYTAB</value>
</property>
<!-- Add these if you need to configure a different DNS interface from the default -->
<property>
  <name>hbase.rest.dns.interface</name>
  <value>default</value>
</property>
<property>
  <name>hbase.rest.dns.nameserver</name>
  <value>default</value>
</property>

Substitute the keytab for HTTP for $KEYTAB.

将HTTP的keytab替换为$ keytab。

The HBase REST gateway supports different values for 'hbase.rest.authentication.type': simple and kerberos. You can also implement a custom authentication by implementing Hadoop's AuthenticationHandler, then specify the full class name as the 'hbase.rest.authentication.type' value. For more information, refer to SPNEGO HTTP authentication.

60.8. REST Gateway Impersonation Configuration

By default, the REST gateway doesn’t support impersonation. It accesses HBase on behalf of clients as the user configured in the previous section. To the HBase server, all requests are from the REST gateway user, and the actual users are unknown. You can turn on impersonation support. With impersonation, the REST gateway user is a proxy user, and the HBase server knows the actual/real user of each request, so it can apply proper authorizations.

To turn on REST gateway impersonation, you need to configure the HBase servers (masters and region servers) to allow proxy users, and configure the REST gateway to enable impersonation.

To allow proxy users, add the following to the hbase-site.xml file for every HBase server:

<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.proxyuser.$USER.groups</name>
  <value>$GROUPS</value>
</property>
<property>
  <name>hadoop.proxyuser.$USER.hosts</name>
  <value>$HOSTS</value>
</property>

Substitute the REST gateway proxy user for $USER, the allowed group list for $GROUPS, and the allowed host list for $HOSTS.

To enable REST gateway impersonation, add the following to the hbase-site.xml file for every REST gateway.

<property>
  <name>hbase.rest.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.rest.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
  <name>hbase.rest.authentication.kerberos.keytab</name>
  <value>$KEYTAB</value>
</property>

Substitute the keytab for HTTP for $KEYTAB.

61. Simple User Access to Apache HBase

Newer releases of Apache HBase (>= 0.92) support optional SASL authentication of clients. See also Matteo Bertozzi’s article on Understanding User Authentication and Authorization in Apache HBase.

This describes how to set up Apache HBase and clients for simple user access to HBase resources.

61.1. Simple versus Secure Access

The following section shows how to set up simple user access. Simple user access is not a secure method of operating HBase. This method is used to prevent users from making mistakes. It can be used to mimic the Access Control on a development system without having to set up Kerberos.

This method is not used to prevent malicious or hacking attempts. To make HBase secure against these types of attacks, you must configure HBase for secure operation. Refer to the section Secure Client Access to Apache HBase and complete all of the steps described there.

61.2. Prerequisites

None

61.3. Server-side Configuration for Simple User Access Operation

Add the following to the hbase-site.xml file on every server machine in the cluster:

<property>
  <name>hbase.security.authentication</name>
  <value>simple</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>

For 0.94, add the following to the hbase-site.xml file on every server machine in the cluster:

<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>

A full shutdown and restart of HBase service is required when deploying these configuration changes.

61.4. Client-side Configuration for Simple User Access Operation

Add the following to the hbase-site.xml file on every client:

<property>
  <name>hbase.security.authentication</name>
  <value>simple</value>
</property>

For 0.94, add the following to the hbase-site.xml file on every client:

<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>

Be advised that if the hbase.security.authentication in the client- and server-side site files do not match, the client will not be able to communicate with the cluster.

61.4.1. Client-side Configuration for Simple User Access Operation - Thrift Gateway

The Thrift gateway user will need access. For example, to give the Thrift API user, thrift_server, administrative access, a command such as this one will suffice:

grant 'thrift_server', 'RWCA'

For more information about ACLs, please see the Access Control Labels (ACLs) section.

The Thrift gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the Thrift gateway itself. All client access via the Thrift gateway will use the Thrift gateway’s credential and have its privilege.

61.4.2. Client-side Configuration for Simple User Access Operation - REST Gateway

The REST gateway will authenticate with HBase using the supplied credential. No authentication will be performed by the REST gateway itself. All client access via the REST gateway will use the REST gateway’s credential and have its privilege.

The REST gateway user will need access. For example, to give the REST API user, rest_server, administrative access, a command such as this one will suffice:

grant '