How can I automate the process of verifying that my query executed successfully while
running hive -e "query"?
This is assuming you are using the terminal (a shell or shell script, for that matter).
In bash:
var=$(hive -e "query")
echo $?
$? gives you the return status of the last executed command (our Hive query). It is
non-zero in case of syntax errors and other failures, and zero on success. $? can be
used in your conditional checks to decide the further course of action.
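As a minimal self-contained sketch of the pattern (a placeholder function stands in for the real hive binary here, so the control flow can be seen end to end):

```shell
# Placeholder for: hive -e "SELECT count(*) FROM mytable"
# Swap in the real hive invocation in practice.
hive_cmd() { true; }

hive_cmd
status=$?
if [ "$status" -eq 0 ]; then
    echo "query succeeded"
else
    echo "query failed with status $status"
fi
```

The same check works unchanged when hive_cmd is replaced by the actual hive -e invocation.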
How can I load data on HDFS into Hive in a way that copies the files into the Hive
warehouse directory instead of moving?
From the question, it seems the data is already present in HDFS, so instead of
loading the data you can create an "EXTERNAL" table specifying the
location.
create external table table_name (
id int,
name string
)
location '/my/location/hdfs';
What are the problems that engineers have run into using Hive?
There are no row-level update statements in Hive QL. You basically have to
overwrite all the files associated with a partition/table every time you want to
do an update. While this doesn't seem so bad at first, imagine if you had to
perform this "update" overwrite on terabytes of data frequently.
There is hope though! An integration of Hive and HBase is in the works.
Although user variable substitution is supported, a query cannot overwrite user
variables, which means you cannot do running sums or top N per group. You
can only do equality joins, which means you cannot easily do ranking.
second. I want only that particular language; I don't want to display the rest of
them. How do I write the query?
select COUNT(*)
FROM <table>
WHERE time = <seconds desired>
and language = <language>;
Create tables partitioned by date, and load data into the corresponding partition
every day. Archive data that is older than 180 days.
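A minimal sketch of that layout in HiveQL (table, column, and path names are all illustrative, not from the source):

```sql
-- Hypothetical date-partitioned table; names are illustrative.
CREATE TABLE events (
  id INT,
  payload STRING
)
PARTITIONED BY (dt STRING);

-- Each day's load goes into its own partition:
LOAD DATA INPATH '/staging/events/2013-01-05'
INTO TABLE events PARTITION (dt = '2013-01-05');

-- Partitions older than 180 days can be archived elsewhere and then dropped:
ALTER TABLE events DROP PARTITION (dt = '2012-07-01');
```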
What is the best studio software/tool to run HIVE SQL/HQL queries by a
data analyst?
Hive command line. Forces me to think through the query properly before I run it.
The best one I have used is not available to the public. It is Facebook's internal UI
for Hive called HiPal.
The thing that sets HiPal apart from all the other tools is that it takes into account
the fact that tools like Hive are batch oriented: your query is going to run for
minutes or hours. The problem with tools like Hue and others is that they don't "set
your query aside" while it runs, allowing you to write and run other queries at the
same time. Any tool querying Hive, Impala, etc. should have the functionality to
write and run multiple queries concurrently for the same user in the same session.
Written 5 Nov.
How does Hive store tables and is that format accessible to Pig?
I have been reading about Pig and in the paper it is said to work directly on raw
files. How does Hive
Metadata about (Hive) tables is stored in HCatalog, which itself stores the data in an
RDBMS like MySQL. Pig (and other tools) can access these table definitions too. The
data itself can be stored in HDFS or S3, for example.
Note that Hive does a best-effort schema-on-read, i.e. when a query is executed the
data is read from the storage location and deserialised based on the metadata.
That gives you the flexibility to add or remove data directly on the file system
without Hive which is useful when you land data from external systems.
Hive also has two types of table definition: the 'normal' one, which means that
dropping a table removes both the metadata and the data on the filesystem, and an
'external' table definition, which means Hive only manages the metadata and
dropping the table leaves the data on the filesystem untouched.
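As a sketch of the two definitions side by side (names are illustrative):

```sql
-- Managed ('normal') table: DROP TABLE removes metadata AND the files.
CREATE TABLE managed_t (id INT);

-- External table: DROP TABLE removes only the metadata; the files stay put.
CREATE EXTERNAL TABLE external_t (id INT)
LOCATION '/data/external_t';
```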
What kind of tools are you using to design your database schema in Hive?
We have tools like ER/Studio (by Embarcadero) to define relationships between
tables; do we have something similar for big data shops as well?
ER/Studio works for designing Hive schema as well. Of course, we may not be able
to specify certain table specs like external vs internal tables, location, partitions(?)
etc.
Written 30 Dec.
Hive kicking off MapReduce jobs for tasks that don't really require a MapReduce
job.
Hive supports creation of external tables only one directory level deep. It won't
read or allow sub-directories.
The biggest of all is convincing people on how it is different from traditional rdbms
and why certain things can't be achieved in hive.
The single most annoying thing that ALWAYS kills me is that when extracting data
from Hive, empty string ('') and NULL are represented differently. NULL extracts as
'\N' which ends up getting loaded into my target systems. And I always seem to
forget this so I then have to go back, make a code change, and then reload the
whole thing.
We cannot delete or insert selective rows in Hive. For example, this query is not
supported:
delete from tablename where column1 = 'value';
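As mentioned above, the usual workaround is to rewrite the whole table (or partition) without the unwanted rows. A sketch, with illustrative names:

```sql
-- "Delete" rows by overwriting the table with everything except them.
INSERT OVERWRITE TABLE tablename
SELECT * FROM tablename WHERE column1 <> 'value';
```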
Hive is great for batch processing on big data, not so good at interactive query.
Pig alone doesn't understand the partitioning scheme you've setup in Hive.
But let's assume that you really want to use Pig to add data to a table to be queried
by Hive. This is a pretty straightforward process.
1. You would have Pig write to a directory structure with a specific naming
standard (e.g. /home/user/warehouse/x/date=20130105/).
2. Then you CREATE EXTERNAL TABLE x (id INT, name STRING) PARTITIONED BY
(date STRING) LOCATION '/home/user/warehouse/x';
3. Then you tell Hive to recover all partitions. See LanguageManual DDL for
specific syntax.
4. Then you query the hive table x (e.g. select Id, name from x;)
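Sketched in HiveQL (the partition-recovery statement is MSCK REPAIR TABLE in recent Hive versions; names follow the example above):

```sql
-- Step 2: external table over the directories Pig wrote.
CREATE EXTERNAL TABLE x (id INT, name STRING)
PARTITIONED BY (`date` STRING)
LOCATION '/home/user/warehouse/x';

-- Step 3: have Hive discover the partitions already on disk.
MSCK REPAIR TABLE x;

-- Step 4: query as usual.
SELECT id, name FROM x;
```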
What is the best way to take the back up of hive partitioned table in to a disk?
I have a running table which is partitioned by date. I want to back it up to a file so
that, in case the HDFS blocks where the table data is stored get corrupted or go
missing, I can restore it from the backup file.
It would be best if you give an indication of how big your partition is.
HDFS is considered fairly reliable from a technical failure point of view, but to
archive data you may want to copy it offline, with copyToLocal, or copy the data to
another cluster or another location on the cluster. For this, you can use DistCp.
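A rough sketch of both options (the warehouse path, partition name, and cluster hostnames are assumptions, not from the source; the hadoop commands are left commented since they need a live cluster):

```shell
# Hypothetical warehouse path and partition; adjust to your layout.
TABLE_DIR=/user/hive/warehouse/mytable
PART="dt=2013-01-05"
SRC="$TABLE_DIR/$PART"

# Option 1: copy the partition off the cluster to local disk:
#   hadoop fs -copyToLocal "$SRC" /backup/mytable/
# Option 2: replicate it to another cluster with DistCp:
#   hadoop distcp "hdfs://src-nn:8020$SRC" "hdfs://backup-nn:8020$SRC"
echo "would back up $SRC"
```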
The first issue addressed was the Datanode startup time, primarily for
snapshotting during -upgrade processing. HDFS-1445 reduced this time from
about 10 minutes per volume to approximately 30 seconds per volume. Other potential
improvements are noted in HDFS-1443, but they are an order of magnitude
smaller.
Column Stores: What are the advantages of Trevni over RCFile format?
You may want to look at ORCFile that is being developed as a successor to RCFile in
the context of Hive. The Hive ORCFile Jira (Create a new Optimized Row Columnar
file format for Hive) has detailed notes on the benefits of the new format (and some
comparisons with Trevni as well) - and I think those would roughly answer this
question as well. Once it stabilizes, ORCFile will most likely become the de facto
replacement for RCFile - so the comparison between these two might be a more
useful one.
The biggest advantage is developer productivity, though this can come at the
expense of execution speed (mostly latency) and efficiency (high throughput via
brute force).
First I'll point out that HiveQL is SQL, or at least a variant of SQL. And since no
database vendor follows the SQL standard perfectly, they're all variants as far as I'm
concerned. The Tableau connector for Hive supports all of the same important
functionality as any of the other connectors we offer for SQL databases.
Hive does have benefits over other SQL systems implemented in databases. Hive
has several interesting UDF packages and makes it easy to contribute new UDFs.
You can explicitly control the map and reduce transform stages, and even route
them through simple scripts written quickly in languages like Python or Perl. You can
also work with many different serialization formats, making it easy to ingest nearly
any kind of data. Finally, you can connect your Hive queries to several Hadoop
packages for statistical processing, including Apache Mahout, RHipe, RHive or
RHadoop. For these reasons Hive can improve developer productivity when working
with challenging data formats or complex analytical tasks.
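For instance, routing rows through an external script looks roughly like this (the script name and output columns are hypothetical; TRANSFORM ... USING is Hive's streaming mechanism):

```sql
-- Ship a (hypothetical) cleanup script to the cluster and pipe rows through it.
ADD FILE my_cleaner.py;

SELECT TRANSFORM (id, name)
USING 'python my_cleaner.py'
AS (id, cleaned_name)
FROM mytable;
```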
When it comes to system performance, Hive has several downsides. First, the batch
nature of Map / Reduce makes Hive perform poorly when you need low-latency
execution for simple queries. I don't have concrete numbers from a real
measurement study, but my sense is that you need data which is at least 10s of
millions of records or data which occupies at least 10s of GB on a modest-sized
cluster before Hive will start outperforming classic databases like PostgreSQL. The
other downside of Hive is that the query compiler and optimizer are still very young.
Not only does the compiler frequently choose mediocre execution plans when joins
or filtering are involved, but there are a tremendous number of tuning knobs,
indexing strategies, and physical layout strategies with bucketing and partitioning,
which all have unclear interactions with each other and which make it very hard to
design your table structures for best performance with a general class of queries.
While the community is working vigorously to address these issues, they are still
missing a lot of functionality commonly present in database systems for supporting
high-performance analytical queries. Other systems like Apache Drill may side-step
some of these issues altogether. It's not clear how much you will gain from Hive
over your existing SQL system in terms of raw performance, but your productivity
may improve. In the meantime keep an eye on the rapid progress the Hadoop
community is making in improving the whole stack.
Apache Hive: Is there a way to query only certain nodes within a Hadoop/Hive
cluster?
Say I have 3 nodes: 192.168.1.100, 192.168.1.101 and 192.168.1.102.
Is there a way to pull data back from only 2 of the nodes instead of all 3?
I am not sure why you want to do this manually. But to get the same effect, I
suggest you properly partition your data. When the data is ingested HDFS will
take care of distributing it appropriately across the different nodes. When you need
to query data that is properly partitioned then Hive will take care of ensuring that
only the data nodes with the appropriate data will be queried.
How big is your Hive Meta Store? How many Tables do you have in your
Meta Store?
I don't think there is any limitation at the moment. I have seen instances of more
than 300 tables, though not all of them have huge data; 1 or 2 have more than 500 TB.
I have managed a metastore with thousands of tables, and some tables have
hundreds of thousands of partitions each. At that scale the (MySQL-backed)
metastore is fairly large, ~20 GB, and you may start needing a decent database
machine to ensure smooth operation.
I have the following hive query: select count(distinct id) as total from mytable;
which automatically spawns: 1408 Mappers 1 Reducer
I need to manually set the number of reducers and I have tried the following:
set mapred.reduce.tasks=50
set hive.exec.reducers.max=50
but none of these settings seem to be honored. The query takes forever to run. Is
there a way to manually set the reducers or maybe rewrite the query so it can result
in more reducers? Thanks!
3 Answers
When writing a query in Hive like this:
SELECT COUNT(DISTINCT id) ....
you can:
1. set mapred.reduce.tasks=50
2. rewrite the query as follows:
SELECT COUNT(*) FROM ( SELECT DISTINCT id FROM ... ) t;
This will result in 2 map+reduce jobs instead of one, but the performance gain will
be substantial.
answered Jan 7 '12 at 14:58 by Wojtek
Thanks a lot. This worked! magicalo Jan 9 '12 at 15:22
cool. how come the hive compiler doesn't do this optimization
(turning it into 2 MR jobs) by itself automatically? ihadanny Apr 26
'13 at 21:22
There are situations where turning this into 2 MR jobs isn't an
optimization. For instance, if id is already close to unique and the
table is stored in a columnar file format (like RCFILE), then 1 MR job
would certainly be better. Since situations like that aren't outlandish,
I imagine that's why no one has built this optimization into Hive.
Daniel Koverman May 16 '13 at 19:59
You could set the number of reducers spawned per node in the conf/mapred-site.xml config
file. See here: http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html.
In particular, you need to set this property:
mapred.tasktracker.reduce.tasks.maximum
Note that this is applicable to all jobs. If you want to set it for a specific query, it is
better to use set mapred.reduce.tasks. brain storm Aug 6 '14 at 17:32
greater than one. Looking at job settings, something has set mapred.reduce.tasks, I
presume Hive. How does it choose that number?
Note: here are some messages while running a Hive job that should be a clue:
...
Number of reduce tasks not specified. Estimated from input data size:
500
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
...
asked Apr 24 '13 at 22:27 by dfrankow
Good question. Specifically, when does Hive choose to do "Number of
reduce tasks determined at compile time" and when does it choose to
estimate from the input data size? ihadanny Apr 25 '13 at 14:39
added that in the answer below Joydeep Sen Sarma Apr 26 '13 at
1:14
1 Answer
The default of 1 may be for a vanilla Hadoop install; Hive overrides it.
In open source hive (and EMR likely)
# reducers = (# bytes of input to mappers)
/ (hive.exec.reducers.bytes.per.reducer)
If you know exactly the number of reducers you want, you can set
mapred.reduce.tasks, and this will override all heuristics. (By default this is set to
-1, indicating Hive should use its heuristics.)
In some cases - say 'select count(1) from T' - Hive will set the number of reducers to
1, irrespective of the size of the input data. These are called 'full aggregates' - and if the
only thing that the query does is full aggregates - then the compiler knows that the
data from the mappers is going to be reduced to trivial amount and there's no point
running multiple reducers.
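The heuristic above can be sketched numerically (the input size and the 1 GB per-reducer figure are assumed values for illustration, not Hive's actual defaults on any given cluster):

```shell
# Assumed inputs: 500 blocks of 64 MiB of mapper input, and a
# hive.exec.reducers.bytes.per.reducer setting of 1 GiB.
input_bytes=$((500 * 64 * 1024 * 1024))
bytes_per_reducer=$((1024 * 1024 * 1024))

# Ceiling division, since Hive rounds the reducer count up.
reducers=$(( (input_bytes + bytes_per_reducer - 1) / bytes_per_reducer ))
echo "$reducers"   # prints 32
```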
Query:
$ bin/hive -e "insert overwrite table pokes select a.* from invites
a where a.ds='2008-08-15';"
Question:
answered by alexlod
One more option is to try the WebHCat API from a browser or the command line,
using utilities like curl. Here's the WebHCat API to delete a Hive job.
Also note that the link says:
The job is not immediately deleted, therefore the information returned may not reflect
deletion, as in our example. Use GET jobs/:jobid to monitor the job and confirm that
it is eventually deleted.
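A hedged sketch of the call with curl (the hostname, port, user, and job id below are assumptions; DELETE jobs/:jobid is WebHCat's job-deletion endpoint, and the request itself is commented out since it needs a live cluster):

```shell
# Hypothetical WebHCat endpoint and job id.
WEBHCAT="http://webhcat-host:50111/templeton/v1"
JOBID="job_201312091733_0003"
URL="$WEBHCAT/jobs/$JOBID?user.name=hive"

# Issue the delete:
#   curl -s -X DELETE "$URL"
# Then poll GET jobs/:jobid until the job is actually gone.
echo "$URL"
```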
select
emp.deptno, emp.ename, emp.empno, emp.job, emp.mgr,
emp.hiredate, emp.sal, emp.comm, dept.dname, dept.loc
from emp
join dept on emp.deptno = dept.deptno;
asked by user2702383
2 Answers
And your problem could be related to data skewness (i.e. some keys are very dense).
answered Sep 11 '13 at 10:16 by Ashish
A skewed join will send a disproportionately large number of values to only one
reducer, and you'll get the long tail of a 99%-complete job syndrome, so you may be
experiencing this. Looking at the job logs (IO especially) will reveal if this is the
culprit.
In such cases you can use Skewed Join Optimization, which in turn relies on List
Bucketing. You will have to determine which key values (deptno) are heavily skewed
and declare them accordingly in the DDL:
alter table emp skewed by
(deptno) on ('<skewedvalue>');
Read the linked article for details, go over the comments and changes of HIVE-3086.