Steven's Blog

A dream land of peace!

Hive Insert Using Select *

I want to insert into a table with values from another table. and I want to use the follwing statements,

1
insert into table table_a select * from table_b;

Both table_a and table_b have the followig 3 fields, “name”, “age”, and the “load_day” field which is used to partition.

If using the above syntax, we will get errors saying table has just 2 fields and table_b has 3.

So one right way to insert the data is:

1
insert into table table_a(partition="2016-08-01") select name, age from table_b where load_day = "2016-08-01"

What if there are many fields, thus we will need to list them one by one after select which is very tedious.

Here is the solution,

1
2
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table_a partition(load_day) select * from table_b;

Copy and Paste Spaces Separated Text Into Excel

最近经常需要把空格作为delimiter的文本粘贴到excel里。之前的做法是

1
cat aa.log | awk '{print $1}' | pbcopy

来一列一列地copy到excel里。甚是繁琐。 excel不会用的啊。

今天发现了原来还可以这么来操作。

  1. 把整个文件的内容粘贴到excel里去
  2. 在excel中选中文件内容, 然后点击 Data->Text to Columns, 然后选择合适的delimiter去把文本内容给搞到excel中去。

Bingo!

Weird Things Happen When Scripts Executing With . And Passing Args With Getopts

I have met the following scripts, say aa.sh

1
2
3
4
5
6
7
8
9
10
11
12
echo "OPTIND for aa.sh before calling is: " $OPTIND
while getopts "t:c:" opt
do
    case "$opt" in
        t) time="${OPTARG}";;
        c) count="${OPTARG}";;
    esac
done

echo "time for aa.sh is $time"
echo "count for aa.sh is $count"
echo "OPTIND for aa.sh after calling is: " $OPTIND`sh

and bb.sh

1
2
3
4
5
6
7
8
9
10
11
12
13
14
echo "OPTIND for bb.sh before calling is: " $OPTIND
echo "we are setting the OPTIND to 1"
while getopts "t:c:" opt
do
    case "$opt" in
        t) time="${OPTARG}";;
        c) count="${OPTARG}";;
    esac
done

echo "time for bb.sh is $time"
echo "count for bb.sh is $count"

echo "OPTIND for bb.sh after calling is: " $OPTIND

and call.sh

1
2
. ./aa.sh -t 2016-07-15 -c 4
. ./bb.sh -a 2016-07-20 -c 0

and now execute the script, it will give out the followig result,

1
2
3
4
5
6
7
8
OPTIND for aa.sh before calling is:  1
time for aa.sh is 2016-07-15
count for aa.sh is 4
OPTIND for aa.sh after calling is:  5
OPTIND for bb.sh before calling is:  5
=time for bb.sh is 2016-07-15
count for bb.sh is 4
OPTIND for bb.sh after calling is:  5

Which is definitely not what we want.

We can now change bb.sh into the following to test it again.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
echo "OPTIND for bb.sh before calling is: " $OPTIND

OPTIND=1

echo "we are setting the OPTIND to 1"
while getopts "t:c:" opt
do
    case "$opt" in
        t) time="${OPTARG}";;
        c) count="${OPTARG}";;
    esac
done

echo "time for bb.sh is $time"
echo "count for bb.sh is $count"

echo "OPTIND for bb.sh after calling is: " $OPTIND

and now “sh call.sh” will give out the following result,

1
2
3
4
5
6
7
8
9
OPTIND for aa.sh before calling is:  1
time for aa.sh is 2016-07-15
count for aa.sh is 4
OPTIND for aa.sh after calling is:  5
OPTIND for bb.sh before calling is:  5
we are setting the OPTIND to 1
time for bb.sh is 2016-07-20
count for bb.sh is 0
OPTIND for bb.sh after calling is:  5

What if we set the OPTIND to 3 now?

1
2
3
4
5
6
7
8
9
OPTIND for aa.sh before calling is:  1
time for aa.sh is 2016-07-15
count for aa.sh is 4
OPTIND for aa.sh after calling is:  5
OPTIND for bb.sh before calling is:  5
we are setting the OPTIND to 3
time for bb.sh is 2016-07-15
count for bb.sh is 0
OPTIND for bb.sh after calling is:  5

Now the count value is changed to 0, however time is still not changed.

How to understand this? First let us look at the definitions of $OPTIND and getopts

1
2
A getopts construct usually comes packaged in a while loop, which processes the options and
arguments one at a time, then increments the implicit $OPTIND variable to point to the next.

In a while loop contaning getopts, getopts will use $OPTIND to find the arguments. if we call the script using “dot” which is the same as using “source”, $OPTIND will be global and available to the next script using getopts, and the next script will then can not find the right argument using $OPTIND now.

The solution here is, we can reset $OPTIND to 1 or we can call the script using “sh” rather than “.”

怎么去除windows编辑过的文件中的^M字符

We can use the following two methods to get rid of the “^M” character in a file which was edited on windows using vim.

1.

1
:%s/\r//g

2.

1
:%s/ctrl-v ctrl-m//g

The second one means pressing ctrl-v, then ctrl-m first.

Regexp_extract Usage in Impala

I need to use the regexp_extract function to extract certain parts of a string recently when I am doing big data analysis.

http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_string_functions.html, this link shows us how to do, but sadly regex in impala is a little different from those in perl or python, so I have to look through the page and try again.

I keep a note here for regexp_extract for my later own usage.

Say I have a string

1
%2C%22hidisp%22%3A27%2C%22quietMode

and I want to extract the number followd by “hidisp” and ‘”:’, which is number ‘27’ here.

As describe in the above document,

1
Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular expression escape character in any regular expressions that you submit through impala-shell

So if we want to represent the numbers here, we have use ‘\d’ rather than just ‘\d’ which is a standard in other programming languages.

regexp_extract usage is in the following format,

1
regexp_extract(string subject, string pattern, int index)
  1. group 0 matches the full pattern string, including the portion outside any () group, so
1
select regexp_extract('%2C%22hidisp%22%3A27%2C%22quietMode', 'hidisp%22%3A(\\d+)', 0);

this will give out:

1
hidisp%22%3A27
  1. group 1 matches just the contents inside the first () group in the pattern string:
1
select regexp_extract('%2C%22hidisp%22%3A27%2C%22quietMode', 'hidisp%22%3A(\\d+)', 1);

will give out

1
27

And for the support of non-greedy matches using .*?, take the following string as an example,

1
%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode

How to extract the first match number for “reboot”, we can get the result in the following ways

a. without any “.*?” around the string pattern ‘reboot%22%3A(\d+)’, which is the easiest way I think

1
select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', 'reboot%22%3A(\\d+)', 1);

or

b. append “.*?” right after the string pattern ‘reboot%22%3A(\d+)’,

1
select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', 'reboot%22%3A(\\d+).*?', 1);

or

c. surround the string pattern ‘reboot%22%3A(\d+)’ with ‘.*?’ at both sides

1
select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', '.*?reboot%22%3A(\\d+).*?', 1);

So after the tree ways to get the leftmost match, we can easily guess how to get the rifht-most match.

1
select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', '.*?reboot%22%3A(\\d+)', 1);

That is just by appending the string pattern with ‘.*?’ at its left side.

Tmux: Error While Loading Shared Libraries: libevent-2.0.so.5

I have kept a blog on tmux installation which is about missing module libevent, and the post is on:

1
http://isunix.github.io/blog/2014/12/24/libevent-not-found-error-while-install-tmux/

Recently even after I install the livevent module, I still get the following errors after installing tmux and then running tmux:

1
 ./tmux: error while loading shared libraries: libevent-2.0.so.5: cannot open shared object file: No such file or

I installed the libevent lib in $HOME/local/lib, and tmux says it can not find the library, Weird. With the help of one of my colleague, this problem is solved.

1
2
3
4
5
export DIR="$HOME/local"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DIR/lib
./configure --prefix=$DIR CFLAGS="-I$DIR/include" LDFLAGS="-L$DIR/lib"
make
make install

As we can see, the point here is setting the “LD_LIBRARY_PATH” variable.

Configure Hive on Mac

花了很长的时间在hive的安装和配置上, 先把有用的链接记下来, 等有空了再详细的写下安装过程。

1
2
3
4
5
6
http://stackoverflow.com/questions/5016505/mysql-grant-all-privileges-on-database
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
http://stackoverflow.com/questions/27099898/java-net-urisyntaxexception-when-starting-hive
https://noobergeek.wordpress.com/2013/11/09/simplest-way-to-install-and-configure-hive-for-mac-osx-lion/
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://amodernstory.com/2015/03/29/installing-hive-on-mac/

Update All Python Modules Through Pip

I want to update all the installed python modules through pip. Here below is one recipe I found.

1
2
3
4
5
import pip
from subprocess import call

for dist in pip.get_installed_distributions():
    call("pip install --upgrade " + dist.project_name, shell=True)

Sort an Object by One Key Using Js

I have something like the following,

1
2
3
var sophos = {name: "aa", time:"2016/03/02 01:12"};
var mcafee = {name: "bb", time:"2016/03/03 01:12"};
var trend = {name: "cc", time:"2016/03/04 01:12"};

and I want to find out the one with the earliest time, in the above code, it is “2016/03/02 01:12”, and the corresponding item is “sophos”, here is the realization is js code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
var sophos = {name: "aa", time:"2016/03/02 01:12"};
var mcafee = {name: "bb", time:"2016/03/03 01:12"};
var trend = {name: "cc", time:"2016/03/04 01:12"};

var obj = [sophos, mcafee, trend];

function compare(a, b){
    if (a.time < b.time){
        return -1;
    }
    else if (a.time > b.time){
        return 1;
    }
    else {
        return 0;
    }
}

var result = obj.sort(compare);
console.log(result);

Sort Lines by Length

I have a file containing quite many lines of strings and I want to sort them by line length. Here is how.

1
cat $file | awk '{ print length($0) " " $0; }' | sort -r -n | cut -d ' ' -f 2- | less