There are two options for bulk loading data into HBase with Spark. There is the basic bulk load functionality, which works for cases where your rows have millions of columns and cases where your columns are not consolidated and partitioned before the map side of the Spark bulk load process.
There is also a thin-record bulk load option with Spark. This second option is designed for tables that have fewer than 10k columns per row. Its advantage is higher throughput and less overall load on the Spark shuffle operation.
Both implementations work more or less like the MapReduce bulk load process: a partitioner partitions the rowkeys based on the region splits, and the row keys are sent to the reducers in order, so that HFiles can be written out directly from the reduce phase.
In Spark terms, the bulk load is implemented around a Spark repartitionAndSortWithinPartitions followed by a Spark foreachPartition; a minimal sketch of that pattern is shown below.
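The following is only an illustrative, self-contained sketch of that repartitionAndSortWithinPartitions plus foreachPartition pattern, not the hbase-spark implementation itself: it uses a plain HashPartitioner and simply prints the sorted pairs, whereas the real bulk load partitions by HBase region start keys and writes each sorted partition out as an HFile.

import org.apache.spark.{HashPartitioner, SparkContext}

val sc = new SparkContext("local", "repartition-sort-sketch")

// (rowKey, value) pairs arriving in no particular order
val pairs = sc.parallelize(Seq(("3", "c"), ("1", "a"), ("2", "b"), ("1", "a2")))

pairs
  .repartitionAndSortWithinPartitions(new HashPartitioner(2)) // partition, then sort keys within each partition
  .foreachPartition { it =>
    // In the real bulk load, each sorted partition would be written straight out as an HFile.
    it.foreach { case (k, v) => println(s"$k -> $v") }
  }

sc.stop()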
First, let's look at an example that uses the basic bulk load functionality.
Bulk Load Example
The following example shows bulk loading with Spark:
val sc = new SparkContext("local", "test")
val config = new HBaseConfiguration()

val hbaseContext = new HBaseContext(sc, config)

val stagingFolder = ... // staging directory where the generated HFiles will be written
val rdd = sc.parallelize(Array(
      (Bytes.toBytes("1"),
        (Bytes.toBytes(columnFamily1), Bytes.toBytes("a"), Bytes.toBytes("foo1"))),
      (Bytes.toBytes("3"),
        (Bytes.toBytes(columnFamily1), Bytes.toBytes("b"), Bytes.toBytes("foo2.b"))), ...

// Map each record to a (KeyFamilyQualifier, value) pair and write HFiles to the staging folder
rdd.hbaseBulkLoad(TableName.valueOf(tableName),
  t => {
    val rowKey = t._1
    val family: Array[Byte] = t._2(0)._1
    val qualifier = t._2(0)._2
    val value = t._2(0)._3

    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, family, qualifier)

    Seq((keyFamilyQualifier, value)).iterator
  },
  stagingFolder.getPath)

// Hand the staged HFiles off to HBase
val load = new LoadIncrementalHFiles(config)
load.doBulkLoad(new Path(stagingFolder.getPath),
  conn.getAdmin, table, conn.getRegionLocator(TableName.valueOf(tableName)))
The hbaseBulkLoad function takes three required parameters:
1. The name of the table we intend to bulk load to.
2. A function that converts a record in the RDD to a tuple key-value pair, with the tuple key being a KeyFamilyQualifer object and the value being the cell value. The KeyFamilyQualifer object holds the RowKey, Column Family, and Column Qualifier. The shuffle partitions on the RowKey but sorts by all three values.
3. The temporary path for the HFiles to be written out to.
Following the Spark bulk load command, use the HBase LoadIncrementalHFiles object to load the newly created HFiles into HBase.
Additional Parameters for Bulk Loading with Spark
You can set the following attributes with additional parameter options on hbaseBulkLoad:
- the maximum file size of the HFiles
- a flag to exclude the generated HFiles from compactions
- column family settings for compression, bloom type, block size, and data block encoding
An example using the additional parameters:
val sc = new SparkContext("local", "test")
val config = new HBaseConfiguration()

val hbaseContext = new HBaseContext(sc, config)

val stagingFolder = ... // staging directory where the generated HFiles will be written
val rdd = sc.parallelize(Array(
      (Bytes.toBytes("1"),
        (Bytes.toBytes(columnFamily1), Bytes.toBytes("a"), Bytes.toBytes("foo1"))),
      (Bytes.toBytes("3"),
        (Bytes.toBytes(columnFamily1), Bytes.toBytes("b"), Bytes.toBytes("foo2.b"))), ...

// Per-column-family HFile write options: compression, bloom type, block size, data block encoding
val familyHBaseWriterOptions = new java.util.HashMap[Array[Byte], FamilyHFileWriteOptions]
val f1Options = new FamilyHFileWriteOptions("GZ", "ROW", 128, "PREFIX")
familyHBaseWriterOptions.put(Bytes.toBytes("columnFamily1"), f1Options)

rdd.hbaseBulkLoad(TableName.valueOf(tableName),
  t => {
    val rowKey = t._1
    val family: Array[Byte] = t._2(0)._1
    val qualifier = t._2(0)._2
    val value = t._2(0)._3

    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, family, qualifier)

    Seq((keyFamilyQualifier, value)).iterator
  },
  stagingFolder.getPath,
  familyHBaseWriterOptions,
  compactionExclude = false,
  HConstants.DEFAULT_MAX_FILE_SIZE)

val load = new LoadIncrementalHFiles(config)
load.doBulkLoad(new Path(stagingFolder.getPath),
  conn.getAdmin, table, conn.getRegionLocator(TableName.valueOf(tableName)))
Now let's look at how to call the thin-record bulk load implementation.
An example using thin-record bulk load:
val sc = new SparkContext("local", "test")
val config = new HBaseConfiguration()

val hbaseContext = new HBaseContext(sc, config)

val stagingFolder = ... // staging directory where the generated HFiles will be written
val rdd = sc.parallelize(Array(
      ("1",
        (Bytes.toBytes(columnFamily1), Bytes.toBytes("a"), Bytes.toBytes("foo1"))),
      ("3",
        (Bytes.toBytes(columnFamily1), Bytes.toBytes("b"), Bytes.toBytes("foo2.b"))), ...

rdd.hbaseBulkLoadThinRows(hbaseContext,
  TableName.valueOf(tableName),
  t => {
    val rowKey = t._1

    // Collect every (family, qualifier, value) of this row into a single object
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family: Array[Byte] = f._1
      val qualifier = f._2
      val value: Array[Byte] = f._3

      familyQualifiersValues += (family, qualifier, value)
    })
    (new ByteArrayWrapper(Bytes.toBytes(rowKey)), familyQualifiersValues)
  },
  stagingFolder.getPath,
  new java.util.HashMap[Array[Byte], FamilyHFileWriteOptions],
  compactionExclude = false,
  20)

val load = new LoadIncrementalHFiles(config)
load.doBulkLoad(new Path(stagingFolder.getPath),
  conn.getAdmin, table, conn.getRegionLocator(TableName.valueOf(tableName)))
請(qǐng)注意, 在使用精簡(jiǎn)記錄批量加載時(shí), 函數(shù)會(huì)返回一個(gè)元組,其中第一個(gè)值為行鍵,第二個(gè)值為FamiliesQualifiersValues對(duì)象,它將包含所有列族此行的所有值。