anebz 40e3256f4b missing cs semester data wrangling		2020-03-31 20:24:36 +02:00

The Missing Semester of your CS education

1. The shell

Create new file:

touch hello.txt

Write to new file / delete everything and write new:

echo 'my_string' > hello.txt

Append to file

echo 'my_string' >> hello.txt

To chain two commands, use the pipe |: the output of the first command becomes the input of the second.
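
A minimal sketch tying the redirections and the pipe together. It works in a throwaway directory created with mktemp, and the strings 'first' and 'second' are made up for illustration.

```shell
# Work in a throwaway directory so no real file is overwritten.
workdir=$(mktemp -d)

touch "$workdir/hello.txt"              # create an empty file
echo 'first' > "$workdir/hello.txt"     # > replaces the file's contents
echo 'second' >> "$workdir/hello.txt"   # >> appends a new line

# Pipe: cat's output becomes wc's input, which counts the lines.
line_count=$(cat "$workdir/hello.txt" | wc -l)
echo "hello.txt has $line_count lines"
```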

Write multiline string:

echo -e "Hello \nWorld \n" >> greetings.txt
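
Not every shell's built-in echo understands -e, so printf is a common alternative where portability matters. A small sketch, writing to a temporary directory rather than a real greetings.txt:

```shell
workdir=$(mktemp -d)

# printf interprets \n itself, so no -e flag is needed.
printf 'Hello\nWorld\n' >> "$workdir/greetings.txt"

# Count the lines that were written.
greeting_lines=$(wc -l < "$workdir/greetings.txt")
```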

Given a file with content:

#!/bin/sh
curl --head --silent https://missing.csail.mit.edu

Make it executable and execute it

chmod +x my_file
./my_file

Stack Overflow info:

Generally, using ./filename.sh specifies a file in the current directory and using filename.sh specifies a file in the current directory or any directory of PATH. The first usage removes any uncertainty as to which file is accessed. In this case, you are attempting to execute the script with bash or another interpreter (by virtue of assumed #!/bin/bash as first line in your script) just by entering the filename. This usage requires the directory is specified. Alternatively, you can try bash filename.sh which seems to work with unspecified directory.
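
The two ways of running a script can be compared with a small sketch. The script here is a hypothetical stand-in that only prints a message, so it needs no network access:

```shell
workdir=$(mktemp -d)

# Hypothetical two-line script: a shebang plus one echo.
printf '#!/bin/sh\necho "hello from my_file"\n' > "$workdir/my_file"
chmod +x "$workdir/my_file"

cd "$workdir"
direct_output=$(./my_file)    # uses the shebang; needs the executable bit and a path
interp_output=$(sh my_file)   # hands the file to sh explicitly; no ./ needed
```

Both calls run the same code. The first relies on chmod +x and the shebang line, while the second names the interpreter directly.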

To find in file:

grep my_string file.txt

Run a script, filter its output for a string, and write the matching lines to a file:

./semester | grep 'my_string' > last-modified.txt

Use grep -o to display only the matched part of each line, and grep -P for Perl-compatible regular expressions. See Stack Overflow.
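
A short sketch of grep -o on made-up input. It uses -E (extended regular expressions) instead of -P, because -P is specific to GNU grep:

```shell
workdir=$(mktemp -d)

# Made-up lines standing in for a script's output.
printf 'id=42 name=foo\nno match here\nid=7 name=bar\n' \
  | grep -oE 'id=[0-9]+' > "$workdir/ids.txt"

# ids.txt now holds only the matched fragments, one per line.
cat "$workdir/ids.txt"
```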

2. Shell tools and scripting

2.1. Shell scripting

Refer to many files at the same time with brace expansion and globs:

cp /path/to/project/{foo,bar,baz}.sh /newpath
mv *{.py,.sh} folder

2.2. Shell tools

# Find all directories named src
find . -name src -type d
# Find all python files that have a folder named test in their path
find . -path '**/test/**/*.py' -type f
# Find all files modified in the last day
find . -mtime -1
# Find all .tar.gz archives with size in range 500k to 10M
find . -size +500k -size -10M -name '*.tar.gz'

# Delete all files with .tmp extension
find . -name '*.tmp' -exec rm {} \;
# Find all PNG files and convert them to JPG (output is named e.g. image.png.jpg)
find . -name '*.png' -exec convert {} {}.jpg \;
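
To try a couple of these safely, here is a sketch that builds a small throwaway tree first. The project layout and file names are hypothetical:

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/src" "$workdir/project/test"
touch "$workdir/project/src/main.py" \
      "$workdir/project/test/test_main.py" \
      "$workdir/project/scratch.tmp"

cd "$workdir"
src_dirs=$(find . -name src -type d)        # directories named src
find . -name '*.tmp' -exec rm {} \;         # delete every .tmp file
tmp_left=$(find . -name '*.tmp' | wc -l)    # count the .tmp files that remain
```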

find locates files by name and other attributes; to search the content inside files, use grep. Useful flags include -C for printing context lines around each match, -v for inverting the match (printing the lines that don't match the pattern), and -R for recursing into directories.
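
A sketch of the three grep flags on hypothetical log files created in a temporary directory:

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/logs/old"
printf 'start\nerror: disk full\nend\n' > "$workdir/logs/app.log"
printf 'error: timeout\n' > "$workdir/logs/old/app.log"

# -v: keep only the lines that do NOT match.
non_errors=$(grep -v 'error' "$workdir/logs/app.log")

# -C 1: print each match with one line of context before and after.
with_context=$(grep -C 1 'error' "$workdir/logs/app.log")

# -R: search every file under the directory; count the matching lines.
recursive_hits=$(grep -R 'error' "$workdir/logs" | wc -l)
```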

Alternatives like rg expand the capabilities of grep -R by ignoring .git files, etc.

# Find all python files where I used the requests library
rg -t py 'import requests'
# Find all files (including hidden files) without a shebang line
rg -uu --files-without-match '^#!'
# Find all matches of foo and print the following 5 lines
rg foo -A 5
# Print statistics of matches (# of matched lines and files )
rg --stats PATTERN

Finding shell commands

# find shell commands where you used apt
history | grep apt

Another trick is Ctrl+R: type a substring to find matching shell commands, and press Ctrl+R again to cycle through the matches. Tools for history-based autosuggestions: zsh.

Navigating directories

You can use fasd to autojump to common directories that you visit often.

To list all files in a directory, use ls -halt. The flags mean:

-h: human-readable sizes
-a: show hidden files
-l: print details
-t: sort by modification date

3. Editors

Vim has multiple operating modes.

Normal: for moving around a file and making edits
Insert: for inserting text
Replace: for replacing text
Visual (plain, line, or block) mode: for selecting blocks of text
Command-line: for running a command

The letter x in insert mode just inserts the character 'x', but in normal mode, it deletes the character under the cursor. In visual mode, it deletes the selection.

In its default configuration, Vim shows the current mode in the bottom left. The initial/default mode is normal mode. Press <ESC> to switch from any mode back to normal mode. From normal mode, press i to enter insert mode, R for replace mode, and v for visual mode.

For example, press i, type some text, then press <ESC> to go back to normal mode.

Vim can have many tabs and buffers. Initially it has a single tab, containing a single window.

3.1. Command line

To enter Command line mode, type :.

:q quit (close window)
:q! quit, discard all changes
:w save (“write”)
:w <file_name> saves to file_name
:r <file_name> read from file_name, insert its content below the cursor line
:wq save and quit
:e {name of file} open file for editing
:ls show open buffers
:help {topic} open help
- :help :w opens help for the :w command
- :help w opens help for the w movement
:!<command> to execute external commands, like :!ls

3.2. Movements

Basic movement: hjkl (left, down, up, right)
Words: w (next word), b (beginning of word), e (end of word)
Visual: press v and move the cursor to select text (V selects whole lines), then press : to run a command on the selection, or d to delete it, for example
Lines: 0 (beginning of line), ^ (first non-blank character), $ (end of line)
Screen: H (top of screen), M (middle of screen), L (bottom of screen)
Scroll: Ctrl-u (up), Ctrl-d (down)
File: gg (beginning of file), G (end of file)
Line numbers: {line_number}G. To see which line you are on, press Ctrl-G
- Ctrl-O jumps back to the previous cursor position, Ctrl-I jumps forward again
Misc: % (corresponding item)
- Cursor on ( and press %, you go to the matching )
Find: f{character}, t{character}, F{character}, T{character}
- find/to forward/backward {character} on the current line
- , / ; for navigating matches
Search: /{regex}, n / N for navigating matches

3.3. Edits

i enter insert mode
- but for manipulating/deleting text, want to use something more than backspace
o / O insert line below / above
A append to line (in the end of line)
d{motion} delete {motion}
- e.g. dw is delete word, d$ is delete to end of line, d0 is delete to beginning of line
dd delete whole line. 2dd, delete this and next line
rx to replace the character at the cursor by x. ra, deletes current character and writes a.
c{motion} change {motion}
- e.g. cw is change word
- like d{motion} followed by i
x delete character (equal to dl)
s substitute character (equal to xi)
- :s/old/new changes the first occurrence of old to new on the current line
- :s/old/new/g changes all occurrences of old to new throughout the line
visual mode + manipulation
- select text, d to delete it or c to change it
u to undo, U to undo all changes on the line, Ctrl-R to redo
y to copy / “yank” (some other commands like d also copy)
p to paste after the cursor (or below the current line when a whole line was yanked)
Lots more to learn: e.g. ~ flips the case of a character

3.4. Counts

3w move 3 words forward
5j move 5 lines down
7dw delete 7 words

3.5. Demo

To learn vim: vimtutor.

sudo apt install vim
vimtutor

4. Data wrangling

Data wrangling: take data in one format and turn it into a different format.

Obtain the log of the server, but only ssh stuff (that's why | grep sshd)

ssh myserver journalctl | grep sshd

As written, this transfers the whole log to my computer and filters it locally. To run the filtering on the server instead, quote the entire pipeline, and add a second filter for "Disconnected from" entries. less gives us a “pager” that allows us to scroll up and down through the long output.

We can stick the current filtered logs into a file so that we don’t have to access the network while developing.

ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' > ssh.log
less ssh.log

Filter through regular expressions. sed is a “stream editor” that builds on top of the old ed editor. In it, you basically give short commands for how to modify the file, rather than manipulate its contents directly (although you can do that too). There are tons of commands, but one of the most common ones is s: substitution. For example, we can write:

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed 's/.*Disconnected from //'
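
The substitution can be tried without a server. The log line below is a hypothetical example of what sshd might write, with a made-up host, user, and IP address:

```shell
# Hypothetical sshd log line (host, user, and address are invented).
log_line='Jan 17 03:13:00 myhost sshd[2631]: Disconnected from invalid user admin 203.0.113.5 port 55920 [preauth]'

# s/pattern/replacement/: delete everything up to and including "Disconnected from ".
stripped=$(printf '%s\n' "$log_line" | sed 's/.*Disconnected from //')
echo "$stripped"
```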

sort orders the lines, and uniq -c collapses consecutive identical lines into one, prefixed with the number of occurrences. A second sort then orders the results by that count.

sort -n will sort in numeric (instead of lexicographic) order. -k1,1 means “sort by only the first whitespace-separated column”. The ,n part says “sort until the nth field”, where the default is the end of the line.

If we wanted the least common ones, we could use head instead of tail. There’s also sort -r, which sorts in reverse order.
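
A sketch of the counting steps on a made-up list of usernames, so each stage can be followed by hand:

```shell
# Hypothetical usernames, one per line, as the sed step might produce them.
users='root
admin
root
guest
root
admin'

# sort groups identical names, uniq -c counts each group,
# sort -nk1,1 orders by that count, tail -n2 keeps the two most common,
# and awk drops the count column.
top_two=$(printf '%s\n' "$users" | sort | uniq -c | sort -nk1,1 | tail -n2 | awk '{ print $2 }')
echo "$top_two"
```

Here root appears three times and admin twice, so they are the two names that survive tail -n2, least frequent first.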

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10
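
To see what the longer sed expression extracts, here is a sketch on two hypothetical log lines (hosts, users, and addresses are invented). The second capture group, \2, is the username:

```shell
usernames=$(printf '%s\n' \
  'Jan 17 03:13:00 myhost sshd[2631]: Disconnected from invalid user admin 203.0.113.5 port 55920 [preauth]' \
  'Jan 17 03:14:10 myhost sshd[2640]: Disconnected from authenticating user root 203.0.113.9 port 40022 [preauth]' \
  | sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/')

# Each line is reduced to just the username.
echo "$usernames"
```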

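awk is another tool: a small programming language that is good at processing text streams column by column. See the course webpage for details.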
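
As a small sketch of awk's column handling, the following prints the names that occur exactly once. The input is hypothetical uniq -c style output (a count, then a name):

```shell
# Made-up "count name" lines, as uniq -c might produce.
counted='   3 root
   1 guest
   2 admin'

# $1 is the first whitespace-separated field (the count), $2 the second (the name).
singles=$(printf '%s\n' "$counted" | awk '$1 == 1 { print $2 }')
echo "$singles"
```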

If you’re fetching HTML data, pup might be helpful. For JSON data, try jq.
