Tuesday, July 19, 2011

Find and remove duplicates in file | Sort data

Commands to be used
cut
sort
uniq

Step 1:
The CUT command is used to select the desired data from a file. Let's say the data in my file students.txt is as follows. We need to find duplicates in the second field of this file.

#vi students.txt
101 101 Mike
102 102 Ryan
103 103 Dev
104 102 Steve
105 100 Bill

I can use the CUT command to select the second field by executing the following command
#cut -d ' ' -f2 students.txt > secondField.txt

# vi secondField.txt
101
102
103
102
100

-d flag: Delimiter; here we are using space as delimiter
-f flag: Field Number
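As a side note (my own example, not part of the original walkthrough), the -f flag can also take a list or a range of fields, so cut can grab several columns in one pass:

```shell
# Hypothetical one-off demo, same format as students.txt above
printf '101 101 Mike\n102 102 Ryan\n' | cut -d ' ' -f2,3
# 101 Mike
# 102 Ryan
printf '101 101 Mike\n102 102 Ryan\n' | cut -d ' ' -f1-2
# 101 101
# 102 102
```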

Step 2:
Now that I have my desired second field, I can issue the SORT command to sort the data.

#sort -n secondField.txt > sortedData.txt

#vi sortedData.txt
100
101
102
102
103

-n flag: Sort numerically
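Worth noting (not part of the original steps): sort also has a -u flag that sorts and removes duplicates in one shot, so when you only need the unique values, Steps 2 and 3 can be collapsed:

```shell
# Same values as secondField.txt above
printf '101\n102\n103\n102\n100\n' | sort -nu
# 100
# 101
# 102
# 103
```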

Step 3:
Finally, we can use the UNIQ command to find the duplicate or unique entries.

Display only unique data
#uniq -uc sortedData.txt
1 100
1 101
1 103

Display only duplicated data
#uniq -dc sortedData.txt
2 102

Display all data without repetition
#uniq sortedData.txt
100
101
102
103


-u flag: unique data
-d flag: duplicate data
-c flag: show the count
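The three steps above can also be chained into a single pipeline, avoiding the temporary files. A sketch (the printf line just recreates the sample data from Step 1 so the example is self-contained):

```shell
# Recreate the sample students.txt from Step 1
printf '101 101 Mike\n102 102 Ryan\n103 103 Dev\n104 102 Steve\n105 100 Bill\n' > students.txt

# Duplicated values in field 2, with their counts
cut -d ' ' -f2 students.txt | sort -n | uniq -dc

# Only the non-repeated values (100, 101, 103 for this data)
cut -d ' ' -f2 students.txt | sort -n | uniq -u
```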

Applications:


1. To find and remove duplicate data in Asterisk's voicemail.conf

2. To sort sip peers information
Collect sip peers information
#asterisk -rx "sip show peers" > sippeers

Collect the extension/username (Column 1) of peers
#less sippeers|cut -d' ' -f1|cut -d'/' -f1> file1

Collect the IP address (Column 2) information. We have to use awk here because AWK treats a run of multiple delimiters as a single delimiter, and in the sippeers file multiple spaces separate column 1 and column 2.
#awk -F" " '{print $2}' sippeers>file2

Count the number of lines in file1 and file2. Make sure that both have the same number of lines.
#wc -l file*

Put the collected Column 1 (Username/Extension) and Column 2 (IP Addresses) in one file, i.e. file3
#paste file1 file2 > file3
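For reference, here is paste on its own with tiny stand-in files (hypothetical data, not real sip peers): it joins the inputs line by line, separated by a tab by default.

```shell
# Stand-in extension and IP lists
printf '1001\n1002\n' > file1
printf '10.0.0.1\n10.0.0.2\n' > file2

# Each extension paired with its IP, tab-separated
paste file1 file2
```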

Sort the data in file3. By default, the sort command sorts lines lexicographically from the beginning of each line, which here means sorting by the first column (the extension).
#sort file3

The -k switch can be used to sort by a specific column.
e.g. #sort -k 2 file3 --> This will sort according to the second column
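A quick sketch of -k with stand-in data (not real sippeers output): the key starts at field 2, so the lines come out ordered by IP address rather than by extension.

```shell
# Hypothetical extension/IP pairs, deliberately out of order by IP
printf '1003 10.0.0.2\n1001 10.0.0.9\n1002 10.0.0.1\n' > file3
sort -k 2 file3
# 1002 10.0.0.1
# 1003 10.0.0.2
# 1001 10.0.0.9
```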

Final Script:

#!/bin/bash
#Author: erdevendra@gmail.com
#Script to sort the SIP peers registered to the server and write them to the file sippeers
#This script is used to find which extension is in use and which is not
#Extension in use will have IP Address attached to it

#pull sip users from asterisk
/usr/sbin/asterisk -rx "sip show peers" > sippeers
#Filter users extensions
/usr/bin/less sippeers|cut -d' ' -f1|cut -d'/' -f1>file1
#Filter IP addresses
/usr/bin/awk -F" " '{print $2}' sippeers >file2
#Put users extensions and IP addresses together in a file
/usr/bin/paste file1 file2>file3
/bin/rm sippeers
#sort the file by extensions
/usr/bin/sort file3>sippeers
#remove temporary files
/bin/rm file1 file2 file3
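The temporary files can also be avoided entirely with a single awk pass. This is my own variant, a sketch only: it assumes, like the script above, that "sip show peers" prints the extension/username in column 1 and the IP address in column 2.

```shell
#!/bin/bash
# Sketch: same result as the script above, without file1/file2/file3.
# split() cuts the "extension/username" field at the "/" characters;
# a[1] is the part before the first "/".
/usr/sbin/asterisk -rx "sip show peers" |
  awk '{ split($1, a, "/"); print a[1], $2 }' |
  sort > sippeers
```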


3. To find the total number of IP addresses leased by a DHCP server. [ Note: the DHCP server seems to keep the same IP address multiple times in the dhcpd.leases database. ]

#less /var/lib/dhcp/db/dhcpd.leases|grep 10.219|awk -F" " '{print $2}'|awk -F"." '{print $3 $4}'|sort -n|uniq|wc -l

In Linux, DHCP server stores dhcp leases at /var/lib/dhcp/db/dhcpd.leases
In my example, I am filtering IP addresses for the 10.219.x.x network using 'grep 10.219'
awk -F" " '{print $2}' --> This filters out the IP addresses only
awk -F"." '{print $3 $4}' --> This filters out the 3rd and 4th octets of the IP address, e.g. 10.219.2.230 will return 2230 (i.e. 2.230)
sort -n --> This will sort the data in ascending order
uniq --> This will remove the repetition of data
wc -l --> This will return the total number of lines, which in turn is the total number of unique IP addresses already assigned by the DHCP server
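One caveat with concatenating the octets (my own observation, not from the original): '{print $3 $4}' maps both 10.219.2.230 and 10.219.22.30 to "2230", so two different addresses can be counted as one. A variant that avoids the ambiguity is to keep the whole address and let sort -u deduplicate. A sketch against a simulated leases fragment:

```shell
# Simulated dhcpd.leases fragment (hypothetical): 10.219.2.230 and
# 10.219.22.30 would both become "2230" under {print $3 $4}.
printf 'lease 10.219.2.230 {\nlease 10.219.2.230 {\nlease 10.219.22.30 {\n' > leases.demo

# Keep the full address; sort -u deduplicates whole lines.
grep 10.219 leases.demo | awk '{print $2}' | sort -u | wc -l
# counts 2 distinct addresses, where the octet-concatenation
# version would count only 1
```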


For more: http://www.liamdelahunty.com/tips/linux_remove_duplicate_lines_with_uniq.php

Very good explanation with examples:
http://www.techrepublic.com/article/lesser-known-linux-commands-join-paste-and-sort/5031653