Data Science, Engineering, Machine Learning

Datasets

Data.gov - A directory of government data downloads
/r/datasets - A subreddit that has hundreds of interesting data sets
Awesome datasets - A list of data sets hosted on GitHub
rs.io - A great blog post with hundreds of interesting data sets

Shell

Glob Patterns and Wildcards: *, ?, [a-z]

Command Line with bash

Bash Basics	Syntax	Arguments
Current time and date	date
Calendar	cal
Diff side by side	diff -y file1 file2	-q report only if differ, -y side by side
Execute from history	history, !num, !!
Clear screen	clear
Close terminal	exit
Print working directory	pwd
List the contents	ls \| la \| ll	-A all, -h human-readable sizes, -l long list, -p add / to dir
Change directory	cd	~, .., -
Make directory	mkdir
Remove empty directory	rmdir
Copy	cp	-i interactive, -r recursive
Remove	rm	-i interactive, -r recursive
Move	mv	-i interactive
Find	find	[location] -name ['filename'] -iname ['icasename']
Username	whoami
User info	id	-un
Groups	groups
Change mode (permissions)	chmod	[ugoa][+-=][rwx] files \| 777
Show file permissions	stat
Run as superuser	sudo	-u username
Change owner	chown	[new_owner][:new_group] file...

Text Processing	Syntax	Arguments
Python	python	-c "print(42)"
Command type	type	-p path, -t type
List aliases	compgen	-a
Create alias	alias d=date
Delete alias	unalias d
Locate a command	which
Manual	man
First manual line	whatis
Manual built-in	help
Text page reader	less	-S truncate
Print head of files	head	-n lines
Print tail of files	tail	-n lines
Word count (lines, words, bytes)	wc	-c bytes, -m chars, -l lines, -w words
Print as table	column	-s separator, -t table
Shuffle lines	shuf	-n head
Determine file type	file
Concatenate and print files	cat \| tac
Sort and print files	sort	-r reverse, -u unique, -t separator, -k range, -g numeric
Print columns of files	cut	-d separator, -f range
Regex finder	grep	-E extended, -h no-filename, -n show-line, -i ignore-case, -v non-matching
Print to screen	echo
Print formatted to screen	printf
Create a file	touch
Translate (replace symbols)	tr

Command Line Flow	Syntax
Redirect stdout (overwrite, append)	>, >>
Redirect stderr (overwrite, append)	2>, 2>>
Redirect out and err	> file 2>&1
Redirect stdin	<
Pipe left output to right input	\|
Drop output	> /dev/null
Current process descriptors	/proc/$$/fd

Command Line Shortcuts (Hotkeys)

Command	PowerShell	bash
Interrupt	CTRL + C	CTRL + C
EOF	CTRL + D	CTRL + D
Clear	CTRL + L	CTRL + L
Clear input	ESC	CTRL + U ESC + Backspace
Commander info	CTRL + Q	CTRL + X, I
Commander extract	SHIFT + F2

Docker

Docker CLI

Command	Syntax
List running containers (-a for all)	docker ps
Show running container stats (CPU and Memory)	docker stats
Show daemon disk space usage	docker system df
Show container processes	docker top
Show container logs	docker logs [-ft]
Show container modified files	docker diff
Show container mapped ports	docker port
Start a stopped container	docker start
Stop a running container	docker stop
Kill a running container	docker kill
Remove specific container (-f for running)	docker rm
Remove stopped containers (-f to skip confirmation)	docker container prune
Create an image out of container	docker commit
List available images	docker images
Download an image	docker pull
Remove specific image	docker rmi
Remove dangling images (-a for unused)	docker image prune
Create an image out of Dockerfile	docker build .
List all volumes	docker volume ls
Open a shell in a container	docker exec -it <CONTAINER> /bin/sh or bash
Create and run a container	docker run [--name CONTAINER] [-p HOST:CONTAINER -P] [-v LOCAL:CONTAINER] [-d] [--rm] IMAGE
Copy files between container and host	docker cp <CONTAINER:SOURCE TARGET>
Rename a container	docker rename

Docker-Compose CLI

Command	Syntax	Notes
List running services	docker-compose ps	Recommended
Build	docker-compose build <service(s)>	Recommended
Build, re(create), start and attach	docker-compose [-f docker-compose.dev.yml] up -d <service(s)>	Recommended
Start existing container	docker-compose start <service(s)>	Not recreated and env vars are not updated
Stop and remove ALL containers	docker-compose down	Specific services cannot be specified
Stop running containers without removing	docker-compose stop <service(s)>	Recommended
Stop and start containers	docker-compose restart <service(s)>	Build context (including env vars) is not updated
Force to stop	docker-compose kill <service(s)>	SIGKILL: use only if won't stop
Remove stopped containers	docker-compose rm -fsv <service(s)>	Recommended (force, stop the container first, volumes)
Execute command	docker-compose exec <service> <command>	Recommended

API

HTTP Request Status Codes

Status Code	Description
200	OK
201	Created (typical for POST)
204	No Content (typical for DELETE)
301	Moved Permanently (redirect)
400	Bad request
401	Not authenticated
403	Forbidden
404	Not found

API Querying with Python

API Basics	Syntax
Import module	import requests
GET Request	requests.get(url, params={}, headers={})
POST Request	requests.post(url, json=payload)
PUT/PATCH Request	requests.put(url, json=payload) \| requests.patch(url, json=payload)
DELETE Request	requests.delete(url)
Status	response.status_code
Content as string/bytes	response.text \| response.content
Response body parsed as JSON	response.json()
Content-Type	response.headers['content-type']

Scraping Basics	Syntax
Import module	from bs4 import BeautifulSoup
Initialize the parser	parser = BeautifulSoup(response_content, 'html.parser')
Get the body tag	parser.body
Get the inside text of a tag	parser.head.title.text
Find specific tags	parser.body.find_all('p', id='i', class_='c')
Find all tags by selectors	parser.body.select('.c')

Regex

Regexr.com

Character classes
.	any character except newline
\w \d \s	word, digit, whitespace
\W \D \S	not word, digit, whitespace
[abc]	any of a, b, or c
[^abc]	not a, b, or c
[a-g]	character between a & g

Anchors
^abc$	start / end of the string
\b \B	word, not-word boundary

Escaped characters
\. \* \\	escaped special characters
\t \n \r	tab, linefeed, carriage return

Groups & Lookaround
(abc)	capture group
(?P<name>abc)	named capture group
\1	backreference to group #1
(?:abc)	non-capturing group
(?=abc)	positive lookahead (is followed by abc)
(?!abc)	negative lookahead (is not followed by abc)
(?<=abc)	positive lookbehind (is preceded by abc)
(?<!abc)	negative lookbehind (is not preceded by abc)

Quantifiers & Alternation
a* a+ a?	0 or more, 1 or more, 0 or 1
a{5} a{2,}	exactly five, two or more
a{1,3}	between one & three
a+? a{2,}?	match as few as possible
ab\|cd	match ab or cd

Regex with Python

Description	Syntax
Python module	import re \| re.search(pattern, string) \| re.findall(pattern, string)
Regex pattern check	s.str.contains(r'', na=False, flags=re.IGNORECASE) \| IGNORECASE = I
Regex pattern extract	s.str.extract(r'', expand=True, flags=re.I) \| expand=True returns a DataFrame
Regex pattern replace	s.str.replace(r'', replacement, regex=True, flags=re.I)
Regex all patterns extract	s.str.extractall(r'')
Raw string (backslashes are not treated as escapes)	r''
Escape	\
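
The regex tables above can be exercised with the standard `re` module. The sketch below is only an illustration: the sample log line and the variable names are made up for this example.

```python
import re

# Hypothetical sample input, invented for this illustration.
log_line = "2024-01-15 ERROR disk usage at 91% on /dev/sda1"

# Named capture groups: pull the date parts out by name.
match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", log_line)
date_parts = match.groupdict()

# Positive lookahead: digits only when they are followed by a percent sign.
percent = re.findall(r"\d+(?=%)", log_line)

# Positive lookbehind: the non-space run that is preceded by "on ".
device = re.search(r"(?<=on )\S+", log_line).group()

# Greedy vs. lazy quantifier on the same input.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)   # one long match
lazy = re.findall(r"<b>.*?</b>", html)    # two short matches

print(date_parts, percent, device, greedy, lazy)
```

The lazy form `.*?` stops at the first closing tag, which is usually what is wanted when matching delimited fragments.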

DAX and MS BI

Python

List Comprehensions (article in Russian)

Jupyter Shortcuts (Hotkeys)

Command Mode (press Esc to enable)	Edit Mode (press Enter to enable)
F: find and replace	Tab: code completion or indent
Ctrl-Shift-P: open the command palette	Shift-Tab: tooltip
Enter: enter edit mode	Ctrl-]: indent
Shift-Enter: run cell, select below	Ctrl-[: dedent
Ctrl-Enter: run selected cells	Ctrl-A: select all
Alt-Enter: run cell, insert below	Ctrl-Z: undo
Y: to code	Ctrl-Shift-Z: redo
M: to markdown	Ctrl-Y: redo
R: to raw	Ctrl-Home: go to cell start
1: to heading 1	Ctrl-Up: go to cell start
2: to heading 2	Ctrl-End: go to cell end
3: to heading 3	Ctrl-Down: go to cell end
4: to heading 4	Ctrl-Left: go one word left
5: to heading 5	Ctrl-Right: go one word right
6: to heading 6	Ctrl-Backspace: delete word before
K: select cell above	Ctrl-Delete: delete word after
Up: select cell above	Ctrl-M: command mode
Down: select cell below	Ctrl-Shift-P: open the command palette
J: select cell below	Esc: command mode
Shift-K: extend selected cells above	Shift-Enter: run cell, select below
Shift-Up: extend selected cells above	Ctrl-Enter: run selected cells
Shift-Down: extend selected cells below	Alt-Enter: run cell, insert below
Shift-J: extend selected cells below	Ctrl-Shift-Minus: split cell
A: insert cell above	Ctrl-S: Save and Checkpoint
B: insert cell below	Down: move cursor down
X: cut selected cells	Up: move cursor up
C: copy selected cells
Shift-V: paste cells above
V: paste cells below
Z: undo cell deletion
D,D: delete selected cells
Shift-M: merge selected cells, or current cell with cell below if only one cell selected
Ctrl-S: Save and Checkpoint
S: Save and Checkpoint
L: toggle line numbers
O: toggle output of selected cells
Shift-O: toggle output scrolling of selected cells
H: show keyboard shortcuts
I,I: interrupt kernel
0,0: restart the kernel (with dialog)
Esc: close the pager
Q: close the pager
Shift-Space: scroll notebook up
Space: scroll notebook down

Python Basics

Import Modules	Syntax
Importing a whole module	import csv
Importing a whole module with an alias	import csv as c
Importing a single definition	from csv import reader
Importing multiple definitions	from csv import reader, writer
Importing all definitions	from csv import *
Reimport a module	import importlib \| importlib.reload(module)

String Basics	Syntax
Replace substring within a string	<string>.replace(old, new)
Convert to title case (capitalize the first letter of every word)	<string>.title()
Check a string for the existence of a substring	if <substring> in <string>
Split a string into a list of strings	<string>.split(separator)
Slice characters from a string by position	<string>[:5]

String functions: capitalize, count, startswith, endswith, find, format, lower, upper, lstrip, rstrip, strip, replace, split, swapcase, title, zfill;

String interpolation	Syntax
Insert values into a string in order	"{} {}".format(value, value)
Insert values into a string by position	"{0} {1}".format(value, value)
Insert values into a string by name	"{name}".format(name="value")
Format specification for precision of two decimal places	"{:.2f}".format(float)
Order for format specification when using precision and comma separator	"{:,.2f}".format(float)
Python 3.6+ f-string interpolation	f"Hello {variable}"
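
A short sketch of the interpolation forms listed above, side by side. The values `price` and `name` are invented for the example.

```python
# Hypothetical values, used only to show the format specifications.
price = 1234567.891
name = "widget"

by_order = "{} costs {}".format(name, price)
by_position = "{1} is the price of {0}".format(name, price)
by_name = "{item}: {cost:.2f}".format(item=name, cost=price)

# The comma separator must come before the precision in the spec.
with_commas = "{:,.2f}".format(price)

# f-strings accept the same format specification after the colon.
f_string = f"{name.title()}: {price:,.2f}"

print(by_order, by_position, by_name, with_commas, f_string, sep="\n")
```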

Dates and Times Basics	Syntax
Import module	import datetime as dt
Instantiating dt.datetime	dt.datetime(year, month, day)
Creating dt.datetime from a string	dt.datetime.strptime("day/month/year", "%d/%m/%Y")
Converting dt.datetime to a string	dt_object.strftime("%d/%m/%Y")
Instantiating a dt.time	dt.time(hour=int, minute=int, second=int, microsecond=int)
Retrieving a part of a date	dt_object.day
Retrieving a date	dt_object.date()
Instantiating a dt.timedelta	dt.timedelta(weeks=3)

Dates and Times Math	Type
datetime - datetime	timedelta
datetime - timedelta	datetime
datetime + timedelta	datetime
timedelta + timedelta	timedelta
timedelta - timedelta	timedelta

Format	Description
%d	Day of the month as a zero-padded number
%A	Day of the week as a word
%m	Month as a zero-padded number
%Y	Year as four-digit number
%y	Year as two-digit number with zero-padding
%B	Month as a word
%H	Hour in 24 hour time as zero-padded number
%p	a.m. or p.m.
%I	Hour in 12 hour time as zero-padded number
%M	Minute as a zero-padded number
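
The date tables above combine as follows. This is a minimal sketch with arbitrary example dates; the day and month names produced by `%A` and `%B` depend on the active locale.

```python
import datetime as dt

# Parse a string into a datetime using the format codes above.
start = dt.datetime.strptime("24/12/1984 18:30", "%d/%m/%Y %H:%M")
end = dt.datetime(1985, 1, 14, 9, 0)

# datetime - datetime gives a timedelta.
elapsed = end - start

# datetime + timedelta gives a datetime.
later = start + dt.timedelta(weeks=3)

# Convert back to a string; %A, %B and %p are locale-dependent.
label = start.strftime("%A, %d %B %Y, %I:%M %p")

print(elapsed, later, label)
```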

JSON Basics	Syntax
Import module	import json
JSON string to Object	json.loads('json')
JSON file to Object	json.load(open('path'))
Object to JSON string	json.dumps(obj, sort_keys=True, indent=4)
Dictionary keys	obj.keys()
Delete key	del obj[key]
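
A minimal round trip through the `json` calls above. The JSON string is a made-up example.

```python
import json

# Hypothetical payload, invented for this illustration.
raw = '{"name": "sensor-1", "readings": [1.5, 2.0], "debug": true}'

obj = json.loads(raw)          # JSON string -> dict
del obj["debug"]               # delete a key
keys = sorted(obj.keys())

# dict -> pretty-printed JSON string with sorted keys.
pretty = json.dumps(obj, sort_keys=True, indent=4)

# Parsing the dumped string gives back an equal object.
round_trip = json.loads(pretty)

print(pretty)
```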

List Comprehensions and Lambdas	Syntax
Ranges (integers only)	range(min, max, interval)
List comprehension	[i * 10 for i in [0,1,2,3,4,5] if i > 0]
Functions on Objects	min\|max\|sorted(obj, key=function, reverse=True) \| one argument function extracts scalar value
Lambda function	f = lambda x, y: x * y
Ternary operator	return <val> if <expression> else None
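
The entries above can be sketched together. The `people` list and the helper `age_or_none` are hypothetical names chosen for this example.

```python
# Hypothetical data: (name, age) pairs.
people = [("Ann", 31), ("Bob", 25), ("Cid", 40)]

# List comprehension with a filter, over a range of integers.
scaled = [i * 10 for i in range(0, 6) if i > 0]

# key= takes a one-argument function that extracts the value to compare.
oldest = max(people, key=lambda person: person[1])
by_age_desc = sorted(people, key=lambda person: person[1], reverse=True)

# A lambda bound to a name behaves like a small function.
multiply = lambda x, y: x * y


def age_or_none(person):
    # Ternary operator: value if condition else fallback.
    return person[1] if person[1] >= 30 else None


ages = [age_or_none(p) for p in people]
print(scaled, oldest, by_age_desc, multiply(3, 4), ages)
```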

Object-Oriented

class MyClass():
	def __init__(self, param_1):
		self.attribute_1 = param_1
	def add_20(self):
		self.attribute_1 += 20

mc = MyClass(10)
mc.add_20()
print(mc.attribute_1)

asyncio

asyncio	Syntax
Import module	import asyncio
Run a coroutine in a new event loop	asyncio.run(<coroutine>)
Make an async function	async def func():
await coroutine	await func()
Awaitable sleep	await asyncio.sleep(n)
Start an awaitable task (future)	asyncio.create_task(<coroutine>)
Gather multiple coroutines	await asyncio.gather(*coroutines)
Asynchronous Queue	asyncio.Queue()
Asynchronous Iterable	async for <item> in <async_iterable>
Get Event Loop	asyncio.get_event_loop()
Start Event Loop	try: loop.run_until_complete(main())
Close Event Loop	finally: loop.close()
Currently pending tasks	asyncio.all_tasks()
Create Task Group	async with asyncio.TaskGroup() as tg
Add Task to the Group	tg.create_task(func())
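
A minimal sketch of how the asyncio calls above fit together. The coroutine `fetch` and its delays are hypothetical stand-ins for real I/O.

```python
import asyncio


async def fetch(label, delay):
    # asyncio.sleep stands in for a real awaitable I/O call.
    await asyncio.sleep(delay)
    return f"{label} done"


async def main():
    # create_task schedules the coroutine to run in the background.
    task = asyncio.create_task(fetch("task", 0.02))

    # gather takes coroutines as separate arguments and returns
    # their results in argument order, not completion order.
    gathered = await asyncio.gather(fetch("a", 0.03), fetch("b", 0.01))

    return gathered + [await task]


results = asyncio.run(main())
print(results)
```

Because the three coroutines run concurrently, the total wait is close to the longest single delay rather than the sum of all delays.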

venv

Description	Syntax
Create	python -m venv venv \| py -m venv venv (Windows)
Activate	source venv/bin/activate \| venv\Scripts\activate (Windows)
Deactivate	deactivate
Dependencies	pip install -r requirements.txt

alembic

Description	Syntax
Init	alembic init alembic
Create migration	alembic revision --autogenerate [-m "message"]
Migrate upgrade	alembic upgrade head
Migrate downgrade	alembic downgrade -1
History	alembic history

PyQt5

Description	Syntax
Import modules	from PyQt5.QtWidgets import QApplication, QWidget, QMainWindow
Create an instance (one per app)	app = QApplication(sys.argv)
Start the event loop	app.exec_()
Create window (no parent = window)	window = QWidget()
Create main window	window = QMainWindow()
Show window (hidden by default)	window.show()

NumPy Arrays

NumPy Selecting	Syntax
Import module	import numpy as np
Convert a list of lists into a ndarray	np.array(list(csv.reader(open(file, "r"))))
Selecting a row from an ndarray	ndarr[1]
Selecting multiple rows from an ndarray	ndarr[1:]
Selecting a specific item from an ndarray	ndarr[1,1]
Selecting multiple columns	ndarr[:,1:3] \| ndarr[:, [1,2]]
Selecting a 2D slice	ndarr[1:4,:3]

NumPy Boolean Indexing	Syntax
Reading in a CSV file	np.genfromtxt('.csv', delimiter=',', skip_header=1)
Creating a Boolean array from filtering criteria	np.array([2,4,6,8]) < 5
Boolean filtering for 1D ndarray	a = np.array([2,4,6,8]) \| a[a < 5]
Boolean filtering for 2D ndarray	ndarr[ndarr[:,12] > 50]
Assigning values in a 2D ndarray using indices	ndarr[1,1] = 1 \| ndarr[:,0] = 1 \| ndarr[:,7] = ndarr[:,7].mean()
Assigning values using Boolean arrays	ndarr[ndarr[:,5] == 2, 15] = 1

NumPy 1D Statistics	Syntax
Vectorized math	+ - * /
Functions	ndarray .min() .max() .mean() .sum()

NumPy Utils	Syntax
Advanced ranges	np.arange(min, max, interval)
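
The selection and Boolean indexing entries above can be sketched on a small array. The numbers are arbitrary and the example assumes NumPy is installed.

```python
import numpy as np

# Small made-up 2D array: 4 rows, 3 columns.
ndarr = np.array([[1, 10, 100],
                  [2, 20, 200],
                  [3, 30, 300],
                  [4, 40, 400]])

second_row = ndarr[1]            # one row
item = ndarr[1, 1]               # one item
two_cols = ndarr[:, 1:3].copy()  # columns 1 and 2 (copied, not a view)

# Boolean array from a filtering criterion on column 1.
mask = ndarr[:, 1] > 15
filtered = ndarr[mask]           # Boolean indexing returns a copy

# Assign using a Boolean row selector and a column index.
ndarr[ndarr[:, 0] == 2, 2] = 0

col_means = ndarr.mean(axis=0)
print(filtered, col_means, sep="\n")
```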

Pandas Transform and Clean

Tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table
Imputation: The technical name for filling in a missing value with a replacement value

Pandas Information Basics	Syntax
Import module	import pandas as pd
Reading a file into a dataframe	pd.read_csv('.csv', index_col=0, parse_dates=['col'], encoding='')
Reading a JSON into a dataframe	pd.read_json()
Exporting data	df.to_csv('.csv', index=False)
Dataframe object info	df.info(memory_usage='deep')
Describing a dataframe/series object	df.describe(include='all') \| s.describe()
Returning a dataframe/series data types	df.dtypes \| s.dtype
Returning or setting column names	df.columns
Returning the dimensions of a dataframe	df.shape
Create dataframe/series	pd.DataFrame({'col': []}, columns=cols) \| pd.Series([])
Series to dataframe/list	s.to_frame('col') \| s.tolist()

Pandas Select Operations	Syntax
Selecting the first n rows	df.head(5)
Selecting random n rows	df.sample(5, random_state=1)
Selecting a single column	df['col']
Selecting multiple columns	df[['col', 'col2']]
Shorthand Convention for columns	df['col'] \| df[['col', 'col2']]
Shorthand Convention for rows	df['row':'row3']
Selecting rows by label	df.loc[<row_labels>, [column_labels]]
Selecting rows by index	df.iloc[<row_index>, [column_index]]

Pandas Missing Values Handling	Syntax
Unique value counts for a dataframe/series	s.unique() \| s.value_counts(dropna=False, bins=3, normalize=True)
Selecting null and non-null values	s.isnull() \| s.notnull()
Selecting null rows	df.isnull().any(axis=1)
Renaming an existing column	df.rename(columns={'src_name': 'dest_name'}, inplace=True)
Dropping an existing column	df.drop(columns=['col'], inplace=True) \| rows: df.drop(index=['row'])
Dropping missing values	df.dropna(axis=0, thresh=min_non_null_values, inplace=True)
Show duplicated rows	df.duplicated(cols)
Drop duplicated rows	df.drop_duplicates(cols)
Fill missing values	df.fillna(value) \| s.fillna(value)
Reset index column	df.reset_index(drop=True, inplace=True)
Renaming an index	df.rename_axis(None, axis=0)
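
A small sketch of the missing-value operations above, assuming pandas and NumPy are installed. The frame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps and one duplicated row.
df = pd.DataFrame({"city": ["Oslo", "Rome", "Rome", None],
                   "temp": [3.0, np.nan, np.nan, 21.0]})

# Rows that contain at least one null value.
null_rows = df[df.isnull().any(axis=1)]

# Duplicate detection treats missing values as equal to each other.
deduped = df.drop_duplicates()

# Imputation: fill each column with its own replacement value.
filled = df.fillna({"city": "unknown", "temp": df["temp"].mean()})

# Keep only complete rows and rebuild the index.
complete = df.dropna(axis=0).reset_index(drop=True)

print(filled)
```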

Pandas Boolean Masks Operations	Syntax
In operator	df['col'].isin(['val1', 'val2'])
Between method	df['col'].between(val1, val2)
Updating values using Boolean filtering	df.loc[df['col'] == 0, 'col'] = np.nan
Updating values using a Mapping dict	s.map({ 'src_name': 'dest_name' })
Updating values using mask method	s.mask(bool_mask, new_values)

Pandas Sort and Convert Basics	Syntax
Sorting by index column	df.sort_index(ascending=False)
Sorting by column values	df.sort_values(by='col', ascending=False)
Converting column to datetime	pd.to_datetime(series, errors="coerce")
Converting column to numeric	pd.to_numeric(series, errors="coerce")
Converting column to float/int	s.astype(float/int)
Stack multiple columns into one	df.stack(dropna=False)

Pandas Vectorized Accessors	Syntax
Multi-dimensional numpy array	df.values
Access datetime values in series	s.dt \| s.dt.year
Replace substring	s.str.replace('"', '')
Extracting values from strings (first word)	s.str.split().str[0]

Pandas Aggregation Methods	Syntax
Grouping	df.groupby('col')
Group indexing	gr['col']
Select group data	gr.get_group('value')
Groups and indexes	gr.groups
Aggregations	gr.size() \| mean, sum, count, min, max
Multiple aggregations	gr.agg(functions_list)
Aggregate	df.pivot_table(index=gr_cols, columns=gr_cols, values=val_cols, aggfunc=functions, margins=True)

Pandas Transforming Data	Syntax
Apply function for each row in Series	s.map(func/dict)
Apply function - Series: for each row, DataFrame: for each column	s.apply(func, args) \| df.apply(func, args, axis=0)
Apply function to every cell in the DataFrame	df.applymap(func) \| df.map(func) in pandas 2.1+
Unpivot	df.melt(id_vars=cols, value_vars=cols) \| pd.melt()
Pivot	df.pivot(index=cols, columns=cols, values=cols)
List-like to a row (Pandas 0.25, ignore_index since 1.1)	df.explode(column, ignore_index=True)

Pandas Combining DataFrames

Union
- pd.concat(df_list, axis, ignore_index=True)
- df.append() # shortcut, removed in pandas 2.0 (use pd.concat)
Join
- pd.merge(left=df1, right=df2, how='inner', on='col', left_on='col1', right_on='col2', left_index=True, right_index=True, suffixes=('_x', '_y'))
- df.merge() # shortcut
- df.join() # using indexes
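
A minimal sketch of a union and two joins, assuming pandas is installed. The `orders` and `customers` frames are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
customers = pd.DataFrame({"id": [10, 20, 40], "name": ["Ann", "Bob", "Dee"]})

# Inner join: only keys present on both sides survive.
inner = pd.merge(left=orders, right=customers, how="inner",
                 left_on="customer_id", right_on="id")

# Left join: every order is kept, unmatched names become NaN.
left = orders.merge(customers, how="left",
                    left_on="customer_id", right_on="id")

# Union of rows: stack frames and rebuild the index.
more_orders = pd.DataFrame({"order_id": [4], "customer_id": [10]})
stacked = pd.concat([orders, more_orders], axis=0, ignore_index=True)

print(inner, left, stacked, sep="\n\n")
```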

Data Visualization

Find The Graphic You Need
The Data Visualisation Catalogue
Plot With Pandas
Chartjunk, Data-ink ratio: effective data visualization
Tableau Color Blind 10
Kernel density estimation (KDE): better histogram
Small multiple: series of similar graphs or charts using the same scale and axes
Matplotlib styles

Pandas

df.plot(x='col', y='col', kind='scatter')
- c='color', color='color'
- figsize=(,), ax=ax1, grid=True
- label='', legend=True, title=''
- xlim=(,), xticks=[]
- rot=30, alpha=1
- autopct='%.1f%%' # % string formatting: .1 precision, f fixed point, %% literal percent sign
- secondary_y=False, marker='o'
df.plot.bar(x='col', y='col')
df.plot.kde()
df.hist(bins=, range=(,), histtype='step')
df.boxplot(column='col', by='col')
df.<graph>()

from pandas.plotting import scatter_matrix
scatter_matrix(df[cols], figsize=(,))

Seaborn

Seaborn Basics	Syntax
Import module	import seaborn as sns
Set background style	sns.set_style('darkgrid' \| 'whitegrid' \| 'dark' \| 'white' \| 'ticks')
Remove spines	sns.despine(left=True, bottom=True)
Histogram with KDE	sns.histplot(y_values, kde=True) \| sns.distplot(y_values) is deprecated since seaborn 0.11
Kernel Density Plot	sns.kdeplot(y_values, fill=True) \| formerly shade=True
Countplot (Clustered Bar)	sns.countplot(x='x_col', hue='y_col', data=df, order=[], hue_order=[])
Strip Plot (Narrow Scatter)	sns.stripplot(x='x_col', y='y_col', data=df, jitter=True)
Box Plot (And Whisker)	sns.boxplot(x='x_col', y='y_col', data=df, whis=4, orient='vertical', width=.15)
Heatmap	sns.heatmap(data, cmap='Blues', cbar=False, annot=False, yticklabels=False)
Small Multiple	g = sns.FacetGrid(df, col='col1', row='col2', hue='col3', height=height) \| formerly size=
Fill Small Multiple	g.map(sns.kdeplot, 'y_col', fill=True)

Matplotlib

Matplotlib Basics	Syntax
Import module	import matplotlib.pyplot as plt
Jupyter inline	%matplotlib inline
Set plot style	plt.style.use('fivethirtyeight')
Plot style list	plt.style.available
Show plot	plt.show()
Save plot	plt.savefig('file') \| fig.savefig('file')
Create figure	fig = plt.figure(figsize=(width_in, height_in)) \| size in inches
Add plot to the figure	ax = fig.add_subplot(nrows, ncols, plot_number)
Create figure and subplots	fig, axes = plt.subplots(nrows, ncols, figsize=(width, height))
Disable spines	ax.spines['side'].set_visible(False) \| right, bottom, top, left

Matplotlib Charts	Syntax
Line chart	plt.plot(x_values, y_values, c='color', label='', linewidth=3)
Bar plot	plt.bar(bar_positions, bar_heights, [bar_width])
Horizontal bar plot	plt.barh(bar_positions, bar_widths, [bar_height])
Scatter plot	plt.scatter(x, y)
Histogram	plt.hist(y_values, bins=int, range=(min, max))
Box plot	plt.boxplot(values)

Matplotlib Plot and Axis	plt	ax	Arguments
Set Title	title	set_title	'text'
Add Legend	legend	legend	'text', loc='upper left', fontsize=12
Set Axis Labels	xlabel	set_xlabel	'text', size=12
Ticks and Their Labels	xticks	set_xticks, ax.set_xticklabels	[ticks], [labels], rotation=90, size=12
Batch Tick Parameters	tick_params	tick_params	bottom=False, top=False, left=False, right=False, labelbottom=False, labelsize=12
Set Axis Limit Range	xlim	set_xlim	min, max
Add H/V Lines	axhline	axhline	y, label='', c='color', alpha=1
Add Text	text	text	x, y, 'text'

SQL

Data Query Language (DQL)
- Data Definition Language (DDL): CREATE, ALTER, and DROP
- Data Control Language (DCL): GRANT, REVOKE
- Data Manipulation Language (DML): SELECT, INSERT, UPDATE, DELETE

sqlstyle.guide
trino.io: Fast distributed SQL query engine for big data analytics

DML Operations

select <column_name, ...,  *>
from <table_name>
where <condition>
group by <column_name, ...>
having <condition>
order by <column_name | column_number, ...> [desc]
limit 10;

insert into <table_name> [(
	<column_name>, ...
)] values (
	<value_1>, ...
), ...;

update <table_name>
set <column_name> = <new_value>, ...
where <condition>;

delete from <table_name> where <condition>;

Description	Syntax
Execution order	from -> where -> group by -> having -> select -> order by -> limit
String concatenation (SQLite)	\|\|
Unique	distinct
Aggregation functions	count, sum, avg, min, max, len
Rounding results	round(<column, value>, <n>)
Casting types	cast(<column, value> as <type>)
Case conversion	lower(), upper()
Conditional logic	case when <expression> then <value1> [...] [else <value2>] end as <name>
IN Operator	<column, value> in (<values>)
Joining data	<inner, left, right, full (outer), cross> join <table_name> on <condition>
Combining rows	union [all], intersect, except
Null operations	is [not] null
Like pattern	like '[pattern] [%]'
Named subquery	with <name> as (<query>) [...]
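
The clause order above can be tried against an in-memory SQLite database from Python. This is a sketch only: the `sales` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, created only for this illustration.
conn.execute("create table sales (region text, amount real)")
conn.executemany(
    "insert into sales values (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 30.0),
     ("south", 5.0), ("west", 80.0)],
)

# where filters rows before grouping; having filters the groups after.
query = """
select region, count(*) as n, round(sum(amount), 1) as total
from sales
where amount > 10
group by region
having sum(amount) > 40
order by total desc
limit 10;
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

The 5.0 row is removed by `where` before aggregation, and the remaining `south` group is then dropped by `having` because its sum is 30.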

DDL Operations

create table <table_name> (
	<column_name> <column_type> [primary key]
	,primary key (<column_name>, ...)
	,foreign key (<column_name>) references <table_name>(<column_name>)
);

alter table <table_name>
	add column <column_name> <column_type>;

Description	Syntax
SQLite column types	text, integer, real, numeric, blob
Creating a view	create view <name> as <query>
Removing an object	drop <view, table> [if exists] <name>

Normalization

First normal form: the values in each column of a table must be atomic
Second normal form: every non candidate-key attribute must depend on the whole candidate key, not just part of it
Third normal form: eliminating the transitive functional dependencies

SQL with Python

#!conda install -yc conda-forge ipython-sql

%%capture
%load_ext sql
%sql sqlite:///sqlite_file.db

%%sql
<query>

SQLite

Description	Syntax
Import module	import sqlite3
Connect to database	conn = sqlite3.connect(path)
Close the connection	conn.close()
Create a cursor	cursor = conn.cursor()
Run the query	cursor.execute(sql_query)
Return one row	cursor.fetchone()
Return n rows	cursor.fetchmany(n)
Return the full results	cursor.fetchall()
No cursor shortcut	conn.execute(sql_query).fetchall()
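
A minimal sketch of the cursor calls above using an in-memory database. The `users` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical table, created only for this illustration.
cursor.execute("create table users (id integer primary key, name text)")
cursor.executemany("insert into users (name) values (?)",
                   [("ann",), ("bob",), ("cid",), ("dee",)])
conn.commit()

# The fetch methods consume the same result set in order.
cursor.execute("select id, name from users order by id")
first = cursor.fetchone()        # one row as a tuple
next_two = cursor.fetchmany(2)   # list of the next two rows
rest = cursor.fetchall()         # whatever is left

# Shortcut without an explicit cursor.
shortcut = conn.execute("select count(*) from users").fetchall()
conn.close()

print(first, next_two, rest, shortcut)
```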

SQLite Shell

Description	Syntax
Open database	sqlite3 <dbname.db>
Enable column headers	.headers on
Enable column output	.mode column
Help	.help
Tables list	.tables
Run in shell	.shell <command>
Quit	.quit
View the schema for a table	.schema <table_name>

SQLAlchemy

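import pandas as pd
from sqlalchemy import create_engine, text

# LOGIN, PASS, URL and DB are placeholders for your own connection settings
engine = create_engine(f'mysql://{LOGIN}:{PASS}@{URL}/{DB}?charset=utf8')
with engine.begin() as conn:
    # SQLAlchemy 2.0 requires text() for raw SQL strings
    result = conn.execute(text('''
        select value
        from table
        '''))
    # fetch the rows while the connection is still open
    df = pd.DataFrame(result.all(), columns=list(result.keys()))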

SQL Server Snippets

Date and Time Conversions Using SQL Server

select convert(varchar(8), getdate(), 112) as [DateKey]
select cast([YYYYMMDD] as datetime) as [DateTime]

Business

Communication

Fuzzy language is vague language, and it is common in the workplace
Seek clarification: What is the reason behind the request? What is the right question to ask?
Proxy: a variable that stands in place of another variable (which is typically hard to get)
Price dumping: occurs when manufacturers export a product to another country at a price below the normal price with an injurious effect (could be illegal)
Line organization: most requests come directly from your manager
Functional organization: requests can come from all over the company
Majority rule: a decision rule that selects alternatives with more than half the votes
Prototyping has several advantages like easier estimation, profitability decision, changes and goals flexibility
Supply Chain Management (SCM): the management of the flow of goods and services

Metrics

Measurements that help management keep track of the overall health of the business:

Metrics are observed across time
Metrics are calculated separately at specific points in time
Metrics are understood in a chronological context

A good metric should have the following characteristics:

Accurate: do not create anything wrongly measured
Simple and intelligible: easy to read for anyone
Easy to drill down into: are we doing good or bad and why
Actionable: ability to change things according to the measure
Dynamic: metrics need to change over reasonable periods of time
Standardized: everyone should see the same thing with no inconsistency
Business Oriented: should be relevant for the business

Examples:

Gross Domestic Product (GDP)
Inflation
Unemployment Rate
Revenue
Conversion Rate (CR)
Average Order Value (AOV): reduce payback period and increase Return on Investment (ROI) in retail
Net Promoter Score (NPS): quantifies customer satisfaction
- % Promoters - % Detractors = (# Promoters - # Detractors) / # Total
Churn Rate: the share of customers who cease to be customers over a period (subscription-based)
- # Churned Customers / # Total Customers
- The more customers you lose, the smaller the pool of potential customers becomes
- Current customers are more likely to buy the more expensive products than new customers (subscription based)
- Churn rate informs how happy customers are with your product
- Happy customers provide free advertising
- Retaining existing customers is more profitable than acquiring new customers
  - Applying and Evaluating Models to Predict Customer Attrition Using Data Mining Techniques
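
The NPS and churn formulas above can be sketched in a few lines. The score thresholds (9 to 10 for promoters, 0 to 6 for detractors) are the usual NPS convention and are an assumption here, since the text does not state them. The survey scores and customer counts are made up.

```python
def net_promoter_score(scores):
    # Assumed convention: 9-10 are promoters, 0-6 are detractors.
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    # % Promoters - % Detractors = (# Promoters - # Detractors) / # Total
    return 100 * (promoters - detractors) / len(scores)


def churn_rate(churned_customers, total_customers):
    # Churned Customers / # Total Customers
    return churned_customers / total_customers


# Hypothetical survey answers on a 0-10 scale.
survey = [10, 9, 8, 7, 6, 3, 10, 9, 5, 10]
nps = net_promoter_score(survey)
churn = churn_rate(25, 500)

print(nps, churn)
```

With five promoters and three detractors out of ten answers, the score is 20; losing 25 of 500 customers gives a churn rate of 5%.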

Probability and Statistics

Introducing KaTeX: The fastest math typesetting library for the web
Granularity: the level of detail at which data is stored
Rule of thumb: a principle based on practical experience rather than theory

Sampling

Population: the set of all individuals relevant to a particular statistical question
Sample: a smaller group selected from a population
Parameter: a population metric
Statistic: a sample metric
Sampling error: difference between the metrics of a population and the metrics of a sample
- sampling error = parameter - statistic
Representativeness: every individual in the population has an equal chance to be selected, leading to smaller sampling error

Sampling methods:

Simple random sampling (SRS): a sampling method using random numbers to select a few sample units # df.sample()
Stratified sampling: organize (stratify) data into different groups (strata), and then sample randomly from each group
- Maximize the variability between strata (different groups)
- Minimize the variability within each stratum
- The stratification criterion should be strongly correlated with the property you're trying to measure
Cluster sampling: picking only a few of the individual data sources (clusters)
Descriptive statistics: describing a sample or a population by measuring and visualizing its properties
Inferential statistics: using a sample to draw conclusions (inferring) about a population
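
Simple random sampling, stratified sampling and the sampling error can be sketched with the standard library alone. The population below is synthetic and the field names `position` and `points` are hypothetical.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so that reruns draw the same sample

# Synthetic population of 100 individuals in two strata.
population = [{"position": "G" if i % 2 else "C", "points": i}
              for i in range(1, 101)]

# Parameter: a population metric.
parameter = mean(p["points"] for p in population)

# Simple random sampling: every individual has an equal chance.
srs = random.sample(population, 10)

# Statistic: the same metric measured on the sample.
statistic = mean(p["points"] for p in srs)
sampling_error = parameter - statistic

# Stratified sampling: group by stratum, then sample each group.
strata = {}
for person in population:
    strata.setdefault(person["position"], []).append(person)
stratified = [p for group in strata.values()
              for p in random.sample(group, 5)]

print(parameter, statistic, sampling_error, len(stratified))
```

Sampling five individuals per stratum guarantees that both groups are represented equally, which simple random sampling does not.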

Variables

A variable is a property with varying values. Variables can be divided into two categories:

Quantitative variable: describes how much there is of something
- We can tell the size or direction of the difference
- e.g. height, age (date), points, experience
Qualitative variable (Categorical): describes what or how
- We cannot tell the size and direction of the difference
- e.g. name, position, place, college

A scale of measurement is the system of rules that defines how each variable is measured:

The Nominal scale: measuring qualitative variables only
An Ordinal scale: measuring quantitative variables only
- We can tell the direction of the difference
- We cannot tell the size of the difference (intervals between ranks could differ)
- We should be careful when calculating averages for ordinal variables (different results with shifted encoding systems)
Interval and Ratio scales: measuring quantitative variables only
- Preserves the order between values and has well-defined intervals using real numbers
- On a Ratio scale, the zero point means "no quantity", while on an Interval scale it indicates the presence of a quantity
- Using a Ratio scale we can measure the difference in terms of ratios (division)
- Discrete variable: there is no possible intermediate value between any two adjacent values
- Continuous variable: contains an infinity of values between any two values

Frequency Distributions

Frequency Distribution Table shows how frequencies are distributed
Grouped Frequency Distribution Tables: each group (interval) is called a class interval
- s.value_counts(bins=intervals)
- pd.interval_range(start=0, end=100, freq=10)
- there should be a good balance between information and comprehensibility

Types of Frequencies:

Absolute frequencies: absolute counts # s.value_counts()
Relative frequencies: proportions and percentages # s.value_counts(normalize=True)

Percentiles and Quartiles:

Percentile rank of a score is the percentage of scores in its distribution that are less than it
Percentile and percentile rank are related terms: the percentile rank is a percentage, while the percentile is the score that corresponds to a given percentile rank
- from scipy.stats import percentileofscore
- percentileofscore(a=series, score=value, kind='weak')
Quartiles: the three percentiles, 25th (lower quartile), the 50th (middle quartile), and the 75th (upper quartile), that divide the distribution in four equal parts # s.describe(percentiles=[])
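
The percentile rank definition above ("the percentage of scores that are less than it") can be written directly in plain Python. The function name and the scores are hypothetical. Note that this strict definition differs from `kind='weak'` in `percentileofscore`, which also counts scores equal to the given one.

```python
def percentile_rank(values, score):
    # Percentage of values strictly below the given score.
    below = sum(1 for v in values if v < score)
    return 100 * below / len(values)


# Hypothetical test scores.
scores = [55, 60, 60, 70, 80, 85, 90, 95]

rank_of_80 = percentile_rank(scores, 80)   # 4 of 8 scores are below 80
rank_of_60 = percentile_rank(scores, 60)   # ties are not counted
rank_of_min = percentile_rank(scores, 55)  # nothing is below the minimum

print(rank_of_80, rank_of_60, rank_of_min)
```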

Types of Distributions:

Skewed Distributions
- Left skewed (negatively skewed): the tail points in the direction of negative numbers
- Right skewed (positively skewed): the tail points in the direction of positive numbers
Symmetrical Distributions
- Normal distribution (Gaussian distribution): the values pile up in the middle and gradually decrease toward both ends
- Uniform distribution: the values are distributed uniformly

Visualizing Distributions:

Nominal and Ordinal variables are commonly visualized with a bar plot or a pie chart (gives a better sense of the relative frequencies)
The most commonly used graph for visualizing distributions is the histogram
A smoothed histogram that displays densities (probabilities) instead of frequencies is called a Kernel Density Estimate (KDE) plot
When we need to compare multiple (> 4) distributions, it is better to use strip plot or box plot
- Quartiles
  - Lower quartile index: $Qi_1 = (n+1) * 0.25$
  - Upper quartile index: $Qi_3 = (n+1) * 0.75$
  - Interquartile range: $\text{IQR} = \text{upper quartile} - \text{lower quartile}$
- Outliers are values in the distribution that are much larger or much smaller than the rest of the values
  - Lower bound: $\min = Q_1 - 1.5* \text{IQR}$
  - Upper bound: $\max = Q_3 + 1.5 * \text{IQR}$
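
A sketch of the IQR outlier rule above, with invented `prices`. One assumption to keep in mind: `Series.quantile` uses linear interpolation by default, so its quartiles can differ slightly from the $(n+1) * 0.25$ index rule.

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value.
prices = pd.Series([100, 110, 120, 125, 130, 135, 140, 150, 160, 400])

q1 = prices.quantile(0.25)  # lower quartile
q3 = prices.quantile(0.75)  # upper quartile
iqr = q3 - q1               # interquartile range

# Values outside these bounds are flagged as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = prices[(prices < lower_bound) | (prices > upper_bound)]
print(lower_bound, upper_bound, outliers.tolist())
```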

Averages and Variability

Arithmetic Mean μ (Parameter): total sum divided by total number of values (the total distances of the values below and above the mean are equal) $\dfrac{1}{N}(\sum_{i=1}^N x_i)$
Sample Mean x̄ (Statistic): there are three possible scenarios: overestimation, underestimation, equal estimation (when x̄>μ or x̄<μ, sampling error occurs)
Sampling Error: $μ - x̄$
Sample Representativity: the more representative a sample is, the closer x̄ will be to μ
Sample Size: the larger the sample, the more chances we have to get a representative sample and less sampling error
Unbiased Estimator: a statistic that is on average equal to the parameter it estimates
- The sample mean x̄ is an unbiased estimator of μ: this holds for any distribution of real numbers when averaging over all possible samples of the same size
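
A simulation sketch of sampling error and of the sample mean behaving as an unbiased estimator. The population (integers 1..1000), the sample size of 50 and the 2000 repetitions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..1000, so the parameter mu is known.
population = np.arange(1, 1001)
mu = population.mean()

# One sample of size 50 and its sampling error (parameter minus statistic).
sample = rng.choice(population, size=50, replace=False)
sampling_error = mu - sample.mean()

# Repeat the sampling many times: individual sample means miss mu,
# but their average lands close to it.
sample_means = np.array([
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(2000)
])
mean_of_sample_means = sample_means.mean()

print(sampling_error, mean_of_sample_means)
```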
Weighted Mean: takes into account the different weights $\dfrac{\sum_{i=1}^{N} x_i w_i}{\sum_{i=1}^{N} w_i}$
- np.average(houses_per_year['Mean Price'], weights=houses_per_year['Houses Sold'])
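
A sketch of the weighted mean formula next to the `np.average` call. The `houses_per_year` table here is a stand-in with invented numbers; only its column names come from the note above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-year summary, invented for illustration.
houses_per_year = pd.DataFrame({
    "Year": [2006, 2007, 2008],
    "Mean Price": [200000.0, 180000.0, 150000.0],
    "Houses Sold": [100, 300, 600],
})

# Weighted mean: each yearly mean price is weighted by the houses sold that year.
weighted = np.average(
    houses_per_year["Mean Price"], weights=houses_per_year["Houses Sold"]
)

# The same result written out as sum(x_i * w_i) / sum(w_i).
manual = (
    (houses_per_year["Mean Price"] * houses_per_year["Houses Sold"]).sum()
    / houses_per_year["Houses Sold"].sum()
)

# The plain mean of the yearly means ignores how many houses each year covers.
naive = houses_per_year["Mean Price"].mean()

print(weighted, manual, naive)
```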
Open-Ended Distribution: distribution with open boundary, for example "10 or more / 10+"
Median: the middle value in a sorted distribution ($Q_2$), resistant to outliers (robust statistics) # s.median()
Mode: the most frequent value in the distribution # s.mode()
- The best option for discrete values, because it gives you the whole number
- The distribution could be unimodal, bimodal or even multimodal (in case of more than one mode)
Measures of Variability: describe the spread (dispersion) of a distribution, e.g. the range (max - min) or the average distance from the mean # s.std()
- $\text{mean absolute deviation} = \dfrac{\sum_{i=1}^{N} |x_i - \mu|}{N}$
- $\text{mean squared deviation (variance)} = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- $\text{standard deviation} = \sqrt{\text{variance}}$
- Bessel's correction: divide by n-1 instead of n when computing a sample's variance or standard deviation, because dividing by n tends to underestimate the population value # np.std(list, ddof=1)
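
A worked sketch of the three formulas above plus Bessel's correction, on a small invented array. Worth noting: `np.std` divides by n by default (`ddof=0`), while pandas `s.std()` divides by n-1 by default (`ddof=1`).

```python
import numpy as np

# Hypothetical values, invented for illustration.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()

# Mean absolute deviation: average absolute distance from the mean.
mad = np.abs(values - mu).mean()

# Variance: average squared distance from the mean (divides by N).
variance = ((values - mu) ** 2).mean()

# Standard deviation: square root of the variance.
std = np.sqrt(variance)

# Sample standard deviation with Bessel's correction (divides by n - 1).
sample_std = np.std(values, ddof=1)

print(mu, mad, variance, std, sample_std)
```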

	Can be used for	Can't be used for	Ideal for
Mean	Interval or Ratio Continuous Ordinal	Nominal Non-numeric Ordinal For different weights use weighted mean	Summarizing numerical distributions with each value in the distribution
Median	Interval or Ratio Numeric Ordinal	Nominal Non-numeric Ordinal	Summarizing numerical distributions with outliers Open-ended distributions
Mode	Interval or Ratio Nominal or Ordinal	Uniform distributions Continuous Ordinal	Nominal or Non-numeric Ordinal Discrete values

	Value	Reporting to non-technical audiences
Mean	1.04	The average house has 1.04 kitchens
Median	1	The average house has one kitchen
Mode	1	The typical house has one kitchen

Machine Learning

PyTorch for Deep Learning - Full Course / Tutorial

Correlation: strength and direction of the linear relationship between two attributes, in the range [-1, 1]
- Pearson correlation coefficient: df.corr()
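
A tiny sketch of `df.corr()`, which computes Pearson coefficients by default. The columns are invented so that the extremes +1 and -1 are easy to see.

```python
import pandas as pd

# Hypothetical data: score rises linearly with hours, errors fall linearly.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 54, 56, 58, 60],
    "errors": [10, 8, 6, 4, 2],
})

# Pairwise Pearson correlation matrix between all numeric columns.
corr = df.corr()
print(corr)
```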
Weighted sum model (WSM): the best-known and simplest multi-criteria decision analysis (MCDA) method
- $A_i^{\text{WSM-score}} = \sum_{j=1}^{n} w_j a_{ij}, \text{ for } i = 1, 2, 3, \dots, m$
Min-Max Feature scaling (Normalization): rescales values to [0, 1] so that attributes measured on different scales can be compared in a meaningful way
- $x' = \frac {x-\min(x)} {\max(x) - \min(x)}$
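
A sketch that combines the two formulas above: min-max scale each criterion to [0, 1], then compute a weighted sum per alternative. The alternatives, criteria and weights are invented, and every criterion is assumed to be "higher is better".

```python
import pandas as pd

# Hypothetical alternatives (rows) scored on three criteria (columns).
alternatives = pd.DataFrame(
    {
        "speed": [120.0, 150.0, 180.0],
        "comfort": [7.0, 9.0, 5.0],
        "range_km": [400.0, 300.0, 500.0],
    },
    index=["A", "B", "C"],
)

# Hypothetical criterion weights w_j (chosen to sum to 1).
weights = pd.Series({"speed": 0.5, "comfort": 0.3, "range_km": 0.2})

# Min-max scaling per column: x' = (x - min) / (max - min).
normalized = (alternatives - alternatives.min()) / (
    alternatives.max() - alternatives.min()
)

# WSM score per alternative: sum over criteria of w_j * a_ij.
wsm_score = (normalized * weights).sum(axis=1)
best = wsm_score.idxmax()

print(wsm_score)
print(best)
```

Scaling first keeps a criterion with large raw numbers (here `range_km`) from dominating the sum regardless of its weight.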

Computer Vision

OpenCV Course - Full Tutorial with Python

Blender

Blender Beginner Donut Tutorial

Modelling

Touchpad = Rotate
Shift + Touchpad = Move
Ctrl + Touchpad = Zoom

Drag Corner = Join / New Window
Tab = Object / Edit

Shift + A = New Object
Shift + D = Duplicate
Shift + Tab (Ctrl) = Snap (Face Project + Individual Elements)

Ctrl + Tab = Mode Wheel
Ctrl + A = Apply Stuff (scale, rotation)

~ = View Wheel
N = Properties
X = Delete
O = Proportional Editing (wheel)
Ctrl + P = Parent Selected
P = Separate
Select -> Select Random
L = Select Linked
Ctrl + L = Link Stuff (Shared Material)
Ctrl + R = Loop Cut
Ctrl + B = Bevel Edge
H = Hide
Alt + H = Unhide All
Alt + G = Reset Location
Alt + Z = X-Ray
Alt + Click = Edge Loop

G = Grab
R = Rotate
S = Scale
Alt + S = Directional Scale
x/y/z = Choose Axis

// Modifiers

Subdivision - create extra faces
Solidify - add a volume
Shrinkwrap - wrap around inner mesh
Displace - randomize stuff using textures

// Sculpting

F = Radius
Shift + F = Strength

X = Draw
Ctrl + X = Reverse Draw

G = Grab
I = Inflate
Shift + S = Smooth Details

Brushes -> Stroke -> Airbrush = Click and Hold

// Animation

I = Insert Keyframe

FFmpeg

ffplay - play video
ffprobe - get metadata
ffmpeg -f concat -i vids.txt output.mp4 (concatenate the files listed in vids.txt; -f concat must come before -i)
ffmpeg -i input.mp4 output.mp4
-c:v [mpeg4, libx264, libx265] -c:a [mp3, aac]
-c copy (skip re-encoding)
-b:v 7500k
-r 30
-s 1920x1080
-ss 00:00:00 -to 00:00:00
-ac 2 (downmix to stereo, do not use -c:a copy)
-map_chapters -1 (remove chapters)
-map 0 (keep all streams from the first input, including every audio track)
-sn (remove subtitles)

dv1x3r/data-notes
