Home  >  120 Linux  >  120.001 Linux - File tricks

Linux file tricks

sed tricks, grep tricks came in handy all the time when understanding/ingesting/cleaning data

I use sed, wc, grep, wget, cut to navigate very large text files. These are some of the commands I’ve used:

Print one line of a file (-n is silent flag, important otherwise it prints all lines up to the specified line): sed -n 12345p file_name.csv

Print multiple lines: sed -n 12345,12349p file_name.csv

Print one line of a file into a temp text file sed -n 12345p file_name.csv > temp.csv

Replace all instances of a string in a file (-i flag is for “in place”) sed -i 's/original/new/g' file_name.csv

Delete a specific line of a file sed -i 12345d file_name.csv

Get total number of lines in a file: wc -l file_name.csv

Get total count of a pattern in a file: grep -c 'pattern' file_name.csv

Print out lines matching one or multiple patterns: grep -e 'pattern1' -e 'pattern2' file_name.csv

Awesome wget command to download all files in a folder: wget --user=tgiggy --ask-password -r --no-parent https://mft.adt.com/bv/ * -r is recursive * –no-parent says not to download anything from the parent directory

Delete a column from csv When you need to remove a column (delete a column) directly from a csv file, use the cut command: https://unix.stackexchange.com/questions/34646/is-there-a-command-line-spell-to-drop-a-column-in-a-csv-file This came in really handy when working on LegalZoom raw text data. Raw text data always has problems, very hard to clean. cut -d'|' --complement -f 19 UserOrder_b.csv > UserOrder_c.csv

Note: manually manipulating text files before ingesting into the database is probably a bad idea

It would be better to write a script for Spark or some other data tool to bring data into the database. This way, if the data needs to be ingested again, we get an updated data set, or we were wrong about some mod applied, it can be reliably done again from scratch, simply by running the script.

Mac

Copy the contents of a text file into the clipboard pbcopy < file


Graph: