Steven's Blog

A dream land of peace!

Resources on Shell Scripts and Linux Tools

Some possibly useful links about Linux and shell.

Steven Sun’s Blog

You may find the following pages in Steven Sun’s Blog useful:

  1. Some Useful Shell Scripts
  2. Basics of Linux Screen
  3. Update All Python Modules Through Pip
  4. Sort Lines by Length
  5. Wonderful Post on Stdin
  6. Wget Several Files Using Bash Cmd
  7. Useful Awk Commands
  8. Fun Shell Tricks
  9. Sed One-liners
  10. More Color Themes for Pycharm and Webstorm
  11. Grep Lines in One File and Not in the Other
  12. More Perl One-liners
  13. Use Python Request to Download Movies
  14. Find the Common Lines in Multiple Files
  15. sort|uniq vs uniq|sort
  16. Trap Interrupts in Shell Scripts
  17. Installing From Multiple Requirements Files
  18. Useful Bash Scripts
  19. rename a Tmux Session
  20. Notes to Be Taken While Using Ssh-keygen
  21. Some Useful Linux Commands
  22. Introduction to Nohup

Introduction to Grep:

Introduction to Awk:

Introduction to Sed:

Introduction to Find:

Introduction to Regex:

Regex test tool:

Hive Insert Using Select *

I want to insert values from one table into another table, using the following statement:

insert into table table_a select * from table_b;

Both table_a and table_b have the following 3 fields: “name”, “age”, and “load_day”, which is used as the partition column.

With the above syntax, we get an error saying table_a has just 2 columns while the query on table_b returns 3, because the partition column load_day does not count as a regular column of the insert target.

So one correct way to insert the data is to name the partition explicitly:

insert into table table_a partition(load_day="2016-08-01") select name, age from table_b where load_day = "2016-08-01";

But what if there are many fields? Then we would need to list them one by one after select, which is very tedious.

Here is the solution, using dynamic partitioning:

set hive.exec.dynamic.partition.mode=nonstrict;
insert into table table_a partition(load_day) select * from table_b;

Copy and Paste Spaces Separated Text Into Excel

Recently I have often needed to paste space-delimited text into Excel. My previous approach was

cat aa.log | awk '{print $1}' | pbcopy

to copy the content into Excel one column at a time, which is quite tedious (I am no Excel expert).

Today I found a much better way:

  1. Paste the entire file content into Excel.
  2. Select the pasted content, click Data -> Text to Columns, and choose the appropriate delimiter to split the text into columns.

Bingo!
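Alternatively, here is a small sketch of my own (not from the original workflow): convert the runs of spaces to tabs before copying, since Excel splits pasted text on tabs automatically. The sample input below is made up; on macOS you would pipe the result into pbcopy.

```shell
# Reassigning $1 makes awk rebuild the record with OFS, turning any run of
# whitespace between fields into a single tab.
printf 'name age city\nbob 30 paris\n' | awk -v OFS='\t' '{$1=$1}1'
# For a real file: awk -v OFS='\t' '{$1=$1}1' aa.log | pbcopy
```

After pasting tab-separated text, Excel puts each field in its own column without needing the Text to Columns step.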

Notes on Reading the Paper by Matei Zaharia et al.

This post keeps notes on the paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Matei Zaharia et al.

1.

Spark exposes RDDs through a language-integrated API where each dataset is represented as an object and transformations are invoked using methods on these objects.

2.

Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter). They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system)

3.

Spark computes RDDs lazily the first time they are used in an action, so that it can pipeline transformations

4.

In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.

5.

RDDs themselves are statically typed objects parametrized by an element type. For example, RDD[Int] is an RDD of integers.

6.

Transformations are lazy operations that define a new RDD, while actions launch a computation to return a value to the program or write data to external storage.

7.

map is a one-to-one mapping, while flatMap maps each input value to one or more outputs (similar to the map in MapReduce).

8.

Operations such as groupByKey, reduceByKey and sort automatically result in a hash or range partitioned RDD

9.

We found it both sufficient and useful to classify dependencies into two types: narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD, and wide dependencies, where multiple child partitions may depend on it. For example, map leads to a narrow dependency, while join leads to wide dependencies (unless the parents are hash-partitioned).

Weird Things Happen When Executing Scripts With “.” and Passing Args With Getopts

I ran into some weird behavior with the following scripts. First, aa.sh:

echo "OPTIND for aa.sh before calling is: " $OPTIND
while getopts "t:c:" opt
do
    case "$opt" in
        t) time="${OPTARG}";;
        c) count="${OPTARG}";;
    esac
done

echo "time for aa.sh is $time"
echo "count for aa.sh is $count"
echo "OPTIND for aa.sh after calling is: " $OPTIND

and bb.sh

echo "OPTIND for bb.sh before calling is: " $OPTIND
while getopts "t:c:" opt
do
    case "$opt" in
        t) time="${OPTARG}";;
        c) count="${OPTARG}";;
    esac
done

echo "time for bb.sh is $time"
echo "count for bb.sh is $count"

echo "OPTIND for bb.sh after calling is: " $OPTIND

and call.sh

. ./aa.sh -t 2016-07-15 -c 4
. ./bb.sh -t 2016-07-20 -c 0

Now running “sh call.sh” gives the following result:

OPTIND for aa.sh before calling is:  1
time for aa.sh is 2016-07-15
count for aa.sh is 4
OPTIND for aa.sh after calling is:  5
OPTIND for bb.sh before calling is:  5
time for bb.sh is 2016-07-15
count for bb.sh is 4
OPTIND for bb.sh after calling is:  5

This is definitely not what we want.

We can now change bb.sh into the following to test it again.

echo "OPTIND for bb.sh before calling is: " $OPTIND

OPTIND=1

echo "we are setting the OPTIND to 1"
while getopts "t:c:" opt
do
    case "$opt" in
        t) time="${OPTARG}";;
        c) count="${OPTARG}";;
    esac
done

echo "time for bb.sh is $time"
echo "count for bb.sh is $count"

echo "OPTIND for bb.sh after calling is: " $OPTIND

and now “sh call.sh” gives the following result:

OPTIND for aa.sh before calling is:  1
time for aa.sh is 2016-07-15
count for aa.sh is 4
OPTIND for aa.sh after calling is:  5
OPTIND for bb.sh before calling is:  5
we are setting the OPTIND to 1
time for bb.sh is 2016-07-20
count for bb.sh is 0
OPTIND for bb.sh after calling is:  5

What if we set the OPTIND to 3 now?

OPTIND for aa.sh before calling is:  1
time for aa.sh is 2016-07-15
count for aa.sh is 4
OPTIND for aa.sh after calling is:  5
OPTIND for bb.sh before calling is:  5
we are setting the OPTIND to 3
time for bb.sh is 2016-07-15
count for bb.sh is 0
OPTIND for bb.sh after calling is:  5

Now the count value is correctly set to 0; however, time is still unchanged.

How do we make sense of this? First, let us look at how $OPTIND and getopts are defined:

A getopts construct usually comes packaged in a while loop, which processes the options and
arguments one at a time, then increments the implicit $OPTIND variable to point to the next.

In a while loop containing getopts, getopts uses $OPTIND to locate the next argument to process. If we call a script using “.” (which is the same as using “source”), the script runs in the current shell, so $OPTIND stays global and is inherited by the next sourced script; that script then starts parsing at the wrong position.

The solution: reset $OPTIND to 1 at the top of each script, or run the scripts with “sh” (in a child process) rather than sourcing them with “.”.
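Another option, sketched here as an idea of my own rather than something from the scripts above: do the parsing inside a function and declare OPTIND local, so every call starts parsing from index 1 no matter what a previously sourced script left behind (parse_args and its variable names are hypothetical):

```shell
# Hypothetical helper: "local OPTIND=1" gives each invocation a fresh index,
# so repeated calls inside the same (sourced) shell parse their own args.
parse_args() {
    local OPTIND=1 opt time count
    while getopts "t:c:" opt; do
        case "$opt" in
            t) time="${OPTARG}";;
            c) count="${OPTARG}";;
        esac
    done
    echo "time=$time count=$count"
}

parse_args -t 2016-07-15 -c 4
parse_args -t 2016-07-20 -c 0
```

Because time and count are declared local as well, the caller’s variables are left untouched.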

Some Useful Shell Scripts

Here is a collection of shell commands I have used that might be useful later.

1. Compare two directories and see the differences between them.

diff -urNa test test1

2. Find all files with “old” in their filename, then remove them.

find ./ -name "*old*" -exec rm -i {} \;

3. Count the lines containing “Successfully” in files last modified on Jul 20.

ll | grep "Jul 20" | awk '{print $NF}' | xargs -I , grep -i "Successfully" , | wc -l

4. Substitute a string in multiple files and back up the original files with a “~” suffix.

find . | xargs grep "good" -sl | uniq | xargs perl -pi~ -e 's/good/bad/g'

5. Substitute a string in multiple files and back up the originals with a “.bak” extension.

find . | xargs grep "bad" -sl | uniq | xargs perl -i.bak -p -e 's/bad/excellent/g'

After checking that everything is OK, we can remove those “.bak” files:

find ./ -name "*bak*" -exec rm -i {} \;

6. Grep lines before and after a certain matching line.

grep -C 4 '. $kpi_shell $load_date' orig_cron_kpi.sh  | grep "\.sh" | less

7. Grep lines that are in one file but not in the other.

grep -F -x -v -f bb.txt aa.txt | grep "\.sh" | less
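A quick self-contained demonstration of item 7; the file names and contents here are made up for illustration:

```shell
printf 'a.sh\nb.sh\nc.txt\n' > aa.txt   # candidate lines
printf 'b.sh\n' > bb.txt                # lines to filter out
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
grep -F -x -v -f bb.txt aa.txt
```

Only the lines of aa.txt that do not appear verbatim in bb.txt survive.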

How to Remove the ^M Characters From a File Edited on Windows

We can use the following two vim methods to get rid of the “^M” characters in a file that was edited on Windows.

1.

:%s/\r//g

2.

:%s/ctrl-v ctrl-m//g

For the second one, type ctrl-v and then ctrl-m to enter the literal ^M character.
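The same cleanup can be done outside vim from the shell. This is a sketch of common equivalents (dos2unix, where installed, does the same job), using a made-up sample file:

```shell
# Create a sample DOS-format file (lines end with \r\n).
printf 'hello\r\nworld\r\n' > dosfile.txt
# Delete every carriage return; ^M is the CR character.
tr -d '\r' < dosfile.txt > unixfile.txt
# GNU sed alternative, editing in place and removing CR only at line ends:
# sed -i 's/\r$//' dosfile.txt
```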

Regexp_extract Usage in Impala

Recently, while doing big-data analysis, I needed the regexp_extract function to extract certain parts of a string.

http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_string_functions.html shows how to do this, but sadly regexes in Impala are a little different from those in Perl or Python, so I had to read through the page and experiment.

I keep a note here on regexp_extract for my own later use.

Say I have a string

%2C%22hidisp%22%3A27%2C%22quietMode

and I want to extract the number that follows “hidisp%22%3A”, which is ‘27’ here.

As described in the above document,

Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular expression escape character in any regular expressions that you submit through impala-shell

So to match digits here, we have to write ‘\\d’ rather than just ‘\d’, which is the standard in other programming languages.

regexp_extract usage has the following format:

regexp_extract(string subject, string pattern, int index)

  1. Group 0 matches the full pattern string, including the portion outside any () group, so

select regexp_extract('%2C%22hidisp%22%3A27%2C%22quietMode', 'hidisp%22%3A(\\d+)', 0);

will give out:

hidisp%22%3A27

  2. Group 1 matches just the contents inside the first () group in the pattern string:

select regexp_extract('%2C%22hidisp%22%3A27%2C%22quietMode', 'hidisp%22%3A(\\d+)', 1);

will give out:

27

As for support of non-greedy matching with .*?, take the following string as an example:

%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode

How do we extract the first matched number for “reboot”? We can get it in the following ways.

a. Without any “.*?” around the pattern ‘reboot%22%3A(\\d+)’, which I think is the easiest way:

select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', 'reboot%22%3A(\\d+)', 1);

or

b. Append “.*?” right after the pattern ‘reboot%22%3A(\\d+)’:

select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', 'reboot%22%3A(\\d+).*?', 1);

or

c. Surround the pattern ‘reboot%22%3A(\\d+)’ with “.*?” on both sides:

select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', '.*?reboot%22%3A(\\d+).*?', 1);

Given these three ways to get the leftmost match, we can easily guess how to get the rightmost match.

select regexp_extract('%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode', '.*?reboot%22%3A(\\d+)', 1);

That is, by prepending “.*?” to the left side of the pattern only.
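The leftmost-versus-rightmost idea can also be played with using ordinary shell tools rather than Impala. This sketch of mine uses grep and sed; note that in sed it is a greedy .* prefix that selects the last occurrence:

```shell
s='%2C%22reboot%22%3A27%2C%22quietMode%2C%22reboot%22%3A12%2C%22quietMode'
# Leftmost match: grep -o prints matches in order, so keep the first one.
printf '%s\n' "$s" | grep -o 'reboot%22%3A[0-9]*' | head -n 1 | grep -Eo '[0-9]+$'   # 27
# Rightmost match: the greedy .* skips past all earlier occurrences.
printf '%s\n' "$s" | sed -E 's/.*reboot%22%3A([0-9]+).*/\1/'   # 12
```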

Tmux: Error While Loading Shared Libraries: libevent-2.0.so.5

I have written a post on a libevent-not-found error while installing tmux:

http://isunix.github.io/blog/2014/12/24/libevent-not-found-error-while-install-tmux/
http://isunix.github.io/blog/2014/12/24/libevent-not-found-error-while-install-tmux/

Recently, even after installing the libevent library, I still got the following error when running a freshly installed tmux:

./tmux: error while loading shared libraries: libevent-2.0.so.5: cannot open shared object file: No such file or directory

I installed the libevent library in $HOME/local/lib, yet tmux said it could not find it. Weird. With the help of one of my colleagues, the problem was solved by rebuilding tmux as follows:

export DIR="$HOME/local"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DIR/lib
./configure --prefix=$DIR CFLAGS="-I$DIR/include" LDFLAGS="-L$DIR/lib"
make
make install

As we can see, the point here is setting the “LD_LIBRARY_PATH” variable.
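For diagnosing this kind of problem, here is a sketch (the paths follow the assumption above that libevent lives under $HOME/local/lib): on Linux, ldd reports how a binary’s shared libraries resolve, and LD_LIBRARY_PATH extends the loader’s search path for the current session.

```shell
# Show how the loader resolves libevent for the tmux binary, if one is here;
# missing libraries are reported as "not found". Guarded so it is a no-op
# when no ./tmux binary exists.
if [ -x ./tmux ]; then ldd ./tmux | grep libevent; fi
# Prepend the private prefix, keeping any existing value of the variable.
export LD_LIBRARY_PATH="$HOME/local/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```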

Configure Hive on Mac

I spent a long time installing and configuring Hive. Let me note down the useful links first; I will write up the installation steps in detail when I have time.

http://stackoverflow.com/questions/5016505/mysql-grant-all-privileges-on-database
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
http://stackoverflow.com/questions/27099898/java-net-urisyntaxexception-when-starting-hive
https://noobergeek.wordpress.com/2013/11/09/simplest-way-to-install-and-configure-hive-for-mac-osx-lion/
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://amodernstory.com/2015/03/29/installing-hive-on-mac/