- Data Science, Engineering, Machine Learning
- Data.gov - A directory of government data downloads
- /r/datasets - A subreddit that has hundreds of interesting data sets
- Awesome datasets - A list of data sets hosted on GitHub
- rs.io - A great blog post with hundreds of interesting data sets
- Glob Patterns and Wildcards: *, ?, [a-z]
Bash Basics | Syntax | Arguments |
---|---|---|
Current time and date | date | |
Calendar | cal | |
Diff side to side | diff -y file1 file2 | -q report only if differ, -y side by side |
Execute from history | history, !num, !! | |
Clear screen | clear | |
Close terminal | exit | |
Print working directory | pwd | |
List the contents | ls, la, ll | -A all, -h human-readable sizes, -l long list, -p add / to dir |
Change directory | cd | ~, .., - |
Make directory | mkdir | |
Remove empty directory | rmdir | |
Copy | cp | -i interactive, -r recursive |
Remove | rm | -i interactive, -r recursive |
Move | mv | -i interactive |
Find | find | [location] -name ['filename'] -iname ['icasename'] |
Username | whoami | |
User info | id | -un |
Groups | groups | |
Change mode (permissions) | chmod | [ugoa][+-=][rwx] files, or an octal mode such as 777 |
Show file permissions | stat | |
Run as superuser | sudo | -u username |
Change owner | chown | [new_owner][:new_group] file... |
Text Processing | Syntax | Arguments |
---|---|---|
Python | python | -c "print(42)" |
Command type | type | -p path, -t type |
List aliases | compgen | -a |
Create alias | alias d=date | |
Delete alias | unalias d | |
Locate a command | which | |
Manual | man | |
First manual line | whatis | |
Manual built-in | help | |
Text page reader | less | -S truncate |
Print head of files | head | -n lines |
Print tail of files | tail | -n lines |
Word count (lines, words, bytes) | wc | -c bytes, -m chars, -l lines, -w words |
Print as table | column | -s separator, -t table |
Shuffle lines | shuf | -n head |
Determine file type | file | |
Concatenate and print files | cat | tac | |
Sort and print files | sort | -r reverse, -u unique, -t separator, -k range, -g numeric |
Print columns of files | cut | -d separator, -f range |
Regex finder | grep | -E extended, -h no-filename, -n show-line, -i ignore-case, -v non-matching |
Print to screen | echo | |
Print formatted to screen | printf | |
Create a file | touch | |
Translate (replace symbols) | tr |
Command Line Flow | Syntax |
---|---|
Redirect stdout (overwrite, append) | >, >> |
Redirect stderr (overwrite, append) | 2>, 2>> |
Redirect out and err | > file 2>&1 |
Redirect stdin | < |
Pipe left output to right input | \| |
Drop output | > /dev/null |
Current process descriptors | /proc/$$/fd |
Command | PowerShell | bash |
---|---|---|
Interrupt | CTRL + C | CTRL + C |
EOF | CTRL + D | CTRL + D |
Clear | CTRL + L | CTRL + L |
Clear input | ESC | CTRL + U ESC + Backspace |
Commander info | CTRL + Q | CTRL + X, I |
Commander extract | SHIFT + F2 |
Command | Syntax | Notes |
---|---|---|
List running services | docker-compose ps | Recommended |
Build | docker-compose build <service(s)> | Recommended |
Build, re(create), start and attach | docker-compose [-f docker-compose.dev.yml] up -d <service(s)> | Recommended |
Start existing container | docker-compose start <service(s)> | Not recreated and env vars are not updated |
Stop and remove ALL containers | docker-compose down | Specific services cannot be specified |
Stop running containers without removing | docker-compose stop <service(s)> | Recommended |
Stop and start containers | docker-compose restart <service(s)> | Build context (including env vars) is not updated |
Force to stop | docker-compose kill <service(s)> | SIGKILL: use only if won't stop |
Remove stopped containers | docker-compose rm -fsv <service(s)> | Recommended (force, stop the container first, volumes) |
Execute command | docker-compose exec <service> <command> | Recommended |
- What is OAuth? Definition and How it Works
- Scraping Client Side Rendered Data with Python and Selenium
Status Code | Description |
---|---|
200 | OK |
201 | POST OK |
204 | DELETE OK |
301 | Redirect |
400 | Bad request |
401 | Not authenticated |
403 | Forbidden |
404 | Not found |
API Basics | Syntax |
---|---|
Import module | import requests |
GET Request | requests.get(url, params={}, headers={}) |
POST Request | requests.post(url, json=payload) |
PUT/PATCH Request | requests.patch(url, json=payload) |
DELETE Request | requests.delete(url) |
Status | response.status_code |
Content as string / bytes | response.text / response.content |
Request/Response in JSON | response.json() |
Content-Type | response.headers['content-type'] |
Scraping Basics | Syntax |
---|---|
Import module | from bs4 import BeautifulSoup |
Initialize the parser | parser = BeautifulSoup(response_content, 'html.parser') |
Get the body tag | parser.body |
Get the inside text of a tag | parser.head.title.text |
Find specific tags | parser.body.find_all('p', id='i', class_='c') |
Find all tags by selectors | parser.body.select('.c') |
Character classes | |
---|---|
. | any character except newline |
\w \d \s | word, digit, whitespace |
\W \D \S | not word, digit, whitespace |
[abc] | any of a, b, or c |
[^abc] | not a, b, or c |
[a-g] | character between a & g |
Anchors | |
---|---|
^abc$ | start / end of the string |
\b \B | word, not-word boundary |
Escaped characters | |
---|---|
\. \* \\ | escaped special characters |
\t \n \r | tab, linefeed, carriage return |
Groups & Lookaround | |
---|---|
(abc) | capture group |
(?P<name>abc) | named capture group |
\1 | backreference to group #1 |
(?:abc) | non-capturing group |
(?=abc) | positive lookahead (is followed by abc) |
(?!abc) | negative lookahead (is not followed by abc) |
(?<=abc) | positive lookbehind (is preceded by abc) |
(?<!abc) | negative lookbehind (is not preceded by abc) |
Quantifiers & Alternation | |
---|---|
a* a+ a? | 0 or more, 1 or more, 0 or 1 |
a{5} a{2,} | exactly five, two or more |
a{1,3} | between one & three |
a+? a{2,}? | match as few as possible |
ab|cd | match ab or cd |
Description | Syntax |
---|---|
Python module | import re | re.search(pattern, string) | re.findall(pattern, string) |
Regex pattern check | s.str.contains(r'', na=False, flags=re.IGNORECASE) | IGNORECASE = I |
Regex pattern extract | s.str.extract(r'', expand=True, flags) | expand returns df |
Regex pattern replace | s.str.replace(r'', replace, flags) |
Regex all patterns extract | s.str.extractall(r'') |
Raw expression (prevents \) | r'' |
Escape | \ |
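
The tables above can be exercised with Python's built-in `re` module. The following is a minimal sketch on made-up input strings (the order text and the HTML snippet are assumptions for illustration). It shows named capture groups, a positive lookbehind, and the difference between greedy and lazy quantifiers.

```python
import re

text = "Order #123 shipped on 2024-05-17; order #456 ships on 2024-06-01."

# Named capture groups: pull year, month and day out of the first date
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
first_date = date_pattern.search(text)
print(first_date.group("year"), first_date.group("month"), first_date.group("day"))

# Positive lookbehind: digits preceded by '#', without including '#' in the match
order_ids = re.findall(r"(?<=#)\d+", text)
print(order_ids)

# Greedy vs. lazy quantifier on the same input
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", html)   # one match spanning the whole string
lazy = re.findall(r"<.+?>", html)    # one match per tag
print(greedy)
print(lazy)
```
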
Command Mode (press Esc to enable) | Edit Mode (press Enter to enable) |
---|---|
F: find and replace | Tab: code completion or indent |
Ctrl-Shift-P: open the command palette | Shift-Tab: tooltip |
Enter: enter edit mode | Ctrl-]: indent |
Shift-Enter: run cell, select below | Ctrl-[: dedent |
Ctrl-Enter: run selected cells | Ctrl-A: select all |
Alt-Enter: run cell, insert below | Ctrl-Z: undo |
Y: to code | Ctrl-Shift-Z: redo |
M: to markdown | Ctrl-Y: redo |
R: to raw | Ctrl-Home: go to cell start |
1: to heading 1 | Ctrl-Up: go to cell start |
2: to heading 2 | Ctrl-End: go to cell end |
3: to heading 3 | Ctrl-Down: go to cell end |
4: to heading 4 | Ctrl-Left: go one word left |
5: to heading 5 | Ctrl-Right: go one word right |
6: to heading 6 | Ctrl-Backspace: delete word before |
K: select cell above | Ctrl-Delete: delete word after |
Up: select cell above | Ctrl-M: command mode |
Down: select cell below | Ctrl-Shift-P: open the command palette |
J: select cell below | Esc: command mode |
Shift-K: extend selected cells above | Shift-Enter: run cell, select below |
Shift-Up: extend selected cells above | Ctrl-Enter: run selected cells |
Shift-Down: extend selected cells below | Alt-Enter: run cell, insert below |
Shift-J: extend selected cells below | Ctrl-Shift-Minus: split cell |
A: insert cell above | Ctrl-S: Save and Checkpoint |
B: insert cell below | Down: move cursor down |
X: cut selected cells | Up: move cursor up |
C: copy selected cells | |
Shift-V: paste cells above | |
V: paste cells below | |
Z: undo cell deletion | |
D,D: delete selected cells | |
Shift-M: merge selected cells, or current cell with cell below if only one cell selected | |
Ctrl-S: Save and Checkpoint | |
S: Save and Checkpoint | |
L: toggle line numbers | |
O: toggle output of selected cells | |
Shift-O: toggle output scrolling of selected cells | |
H: show keyboard shortcuts | |
I,I: interrupt kernel | |
0,0: restart the kernel (with dialog) | |
Esc: close the pager | |
Q: close the pager | |
Shift-Space: scroll notebook up | |
Space: scroll notebook down |
Import Modules | Syntax |
---|---|
Importing a whole module | import csv |
Importing a whole module with an alias | import csv as c |
Importing a single definition | from csv import reader |
Importing multiple definitions | from csv import reader, writer |
Importing all definitions | from csv import * |
Reimport a module | pd = importlib.reload(pandas) |
String Basics | Syntax |
---|---|
Replace substring within a string | <string>.replace(substring, string) |
Convert to title case (capitalize the first letter of every word) | <string>.title() |
Check a string for the existence of a substring | if <substring> in <string> |
Split a string into a list of strings | <string>.split(separator) |
Slice characters from a string by position | <string>[:5] |
- String functions: capitalize, count, startswith, endswith, find, format, lower, upper, lstrip, rstrip, strip, replace, split, swapcase, title, zfill
String interpolation | Syntax |
---|---|
Insert values into a string in order | "{} {}".format(value, value) |
Insert values into a string by position | "{0} {1}".format(value, value) |
Insert values into a string by name | "{name}".format(name="value") |
Format specification for precision of two decimal places | "{:.2f}".format(float) |
Order for format specification when using precision and comma separator | "{:,.2f}".format(float) |
Python 3.6 String Interpolation | f"Hello {variable}" |
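
As a quick illustration of the interpolation rules above, here is a small sketch with hypothetical values (`price` and `name` are made up). It compares `str.format` placeholders with an f-string that uses the same format specification.

```python
price = 1234567.891
name = "widget"

# Positional and named placeholders with str.format
by_order = "{} costs {}".format(name, price)
by_position = "{1} is the price of {0}".format(name, price)
by_name = "{item}: {cost}".format(item=name, cost=price)

# Format specification: comma as thousands separator, then two decimals
formatted = "{:,.2f}".format(price)

# The same specification inside an f-string (Python 3.6+)
f_formatted = f"{name.title()}: {price:,.2f}"

print(by_order)
print(by_position)
print(by_name)
print(formatted)
print(f_formatted)
```
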
Dates and Times Basics | Syntax |
---|---|
Import module | import datetime as dt |
Instantiating dt.datetime | dt.datetime(year, month, day) |
Creating dt.datetime from a string | dt.datetime.strptime("day/month/year", "%d/%m/%Y") |
Converting dt.datetime to a string | dt_object.strftime("%d/%m/%Y") |
Instantiating a dt.time | dt.time(hour=int, minute=int, second=int, microsecond=int) |
Retrieving a part of a date | dt_object.day |
Retrieving a date | dt_object.date() |
Instantiating a dt.timedelta | dt.timedelta(weeks=3) |
Dates and Times Math | Type |
---|---|
datetime - datetime | timedelta |
datetime - timedelta | datetime |
datetime + timedelta | datetime |
timedelta + timedelta | timedelta |
timedelta - timedelta | timedelta |
Format | Description |
---|---|
%d | Day of the month as a zero-padded number |
%A | Day of the week as a word |
%m | Month as a zero-padded number |
%Y | Year as four-digit number |
%y | Year as two-digit number with zero-padding |
%B | Month as a word |
%H | Hour in 24 hour time as zero-padded number |
%p | a.m. or p.m. |
%I | Hour in 12 hour time as zero-padded number |
%M | Minute as a zero-padded number |
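
The parsing, formatting, and arithmetic rules above fit together as in the following sketch. The date string is an arbitrary example, not taken from any data set.

```python
import datetime as dt

# Parse a string into a datetime using the directives from the table
start = dt.datetime.strptime("24/12/2023 18:30", "%d/%m/%Y %H:%M")

# datetime + timedelta -> datetime
end = start + dt.timedelta(weeks=3)

# datetime - datetime -> timedelta
elapsed = end - start

print(end.strftime("%A, %d %B %Y"))   # weekday and month names depend on the locale
print(end.date(), end.day, elapsed.days)
```
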
JSON Basics | Syntax |
---|---|
Import module | import json |
JSON string to Object | json.loads('json') |
JSON file to Object | json.load(open('path')) |
Object to JSON string | json.dumps(obj, sort_keys=True, indent=4) |
Dictionary keys | obj.keys() |
Delete key | del obj[key] |
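
A minimal round trip through the `json` calls listed above might look like this. The `raw` string is a made-up example.

```python
import json

raw = '{"name": "Ada", "skills": ["python", "sql"], "temp": null}'

obj = json.loads(raw)          # JSON string -> dict (null becomes None)
del obj["temp"]                # delete a key
obj["skills"].append("bash")

# dict -> JSON string with stable key order and indentation
pretty = json.dumps(obj, sort_keys=True, indent=4)
print(sorted(obj.keys()))
print(pretty)

# Parsing the dumped string gives back an equal object
restored = json.loads(pretty)
```
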
List Comprehensions and Lambdas | Syntax |
---|---|
Ranges (integers only) | range(min, max, interval) |
List comprehension | [i * 10 for i in [0,1,2,3,4,5] if i > 0] |
Functions on Objects | min|max|sorted(obj, key=function, reverse=True) | one argument function extracts scalar value |
Lambda function | f = lambda x, y: x * y |
Ternary operator | return <val> if <expression> else None |
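
The sketch below combines the entries above on a hypothetical list of `(name, age)` tuples. It shows that `key=` receives a one-argument function and that `reverse=True` applies to `sorted`.

```python
people = [("Ada", 36), ("Linus", 28), ("Grace", 45)]   # hypothetical sample data

# List comprehension with a filter
names_over_30 = [name for name, age in people if age > 30]

# key= takes a one-argument function that extracts the value to compare
oldest = max(people, key=lambda person: person[1])
by_age_desc = sorted(people, key=lambda person: person[1], reverse=True)

# Conditional (ternary) expression inside a function
def label(age):
    return "senior" if age >= 40 else None

print(names_over_30, oldest, by_age_desc, label(45), label(20))
```
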
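class MyClass():
    # The constructor stores the argument as an instance attribute
    def __init__(self, param_1):
        self.attribute_1 = param_1

    # A method that mutates the instance state
    def add_20(self):
        self.attribute_1 += 20

mc = MyClass(10)
mc.add_20()
print(mc.attribute_1)  # 30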
- asyncio for beginners
- asyncio on Real Python
- Task docs
- Stream docs
- Use both Multitasking and Asyncio
asyncio | Syntax |
---|---|
Import module | import asyncio |
Run a coroutine on a new event loop | asyncio.run(<coroutine>) |
Make an async function | async def func(): |
await coroutine | await func() |
Awaitable sleep | asyncio.sleep(n) |
Start an awaitable task (future) | asyncio.create_task(<coroutine>) |
Gather multiple coroutines | asyncio.gather(*coroutines) |
Asynchronous Queue | asyncio.Queue() |
Asynchronous Iterable | async for item in <async_iterable>: |
Get Event Loop | asyncio.get_event_loop() |
Start Event Loop | try: loop.run_until_complete(main()) |
Close Event Loop | finally: loop.close() |
Currently pending tasks | asyncio.all_tasks() |
Create Task Group | async with asyncio.TaskGroup() as tg |
Add Task to the Group | tg.create_task(func()) |
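
A minimal sketch of how these calls fit together, assuming a hypothetical `fetch` coroutine in which `asyncio.sleep` stands in for real I/O. Note that `asyncio.gather` takes awaitables as separate positional arguments and returns results in argument order, regardless of which one finishes first.

```python
import asyncio

async def fetch(name, delay):
    # asyncio.sleep stands in for real I/O such as a network call
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # create_task schedules the coroutine on the running loop right away
    background = asyncio.create_task(fetch("background", 0.02))
    # gather runs both coroutines concurrently and keeps argument order
    results = await asyncio.gather(fetch("a", 0.03), fetch("b", 0.01))
    return results + [await background]

results = asyncio.run(main())
print(results)
```
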
Description | Syntax |
---|---|
Create | python -m venv venv (py -m venv venv on Windows) |
Activate | source venv/bin/activate (venv\Scripts\activate on Windows) |
Deactivate | deactivate |
Dependencies | pip install -r requirements.txt |
Description | Syntax |
---|---|
Init | alembic init alembic |
Create migration | alembic revision --autogenerate [-m "message"] |
Migrate upgrade | alembic upgrade head |
Migrate downgrade | alembic downgrade -1 |
History | alembic history |
Description | Syntax |
---|---|
Import modules | from PyQt5.QtWidgets import QApplication, QWidget, QMainWindow |
Create an instance (one per app) | app = QApplication(sys.argv) |
Start the event loop | app.exec_() |
Create window (no parent = window) | window = QWidget() |
Create main window | window = QMainWindow() |
Show window (hidden by default) | window.show() |
NumPy Selecting | Syntax |
---|---|
Import module | import numpy as np |
Convert a list of lists into a ndarray | np.array(list(csv.reader(open(file, "r")))) |
Selecting a row from an ndarray | ndarr[1] |
Selecting multiple rows from an ndarray | ndarr[1:] |
Selecting a specific item from an ndarray | ndarr[1,1] |
Selecting multiple columns | ndarr[:,1:3] | ndarr[:, [1,2]] |
Selecting a 2D slice | ndarr[1:4,:3] |
NumPy Boolean Indexing | Syntax |
---|---|
Reading in a CSV file | np.genfromtxt('.csv', delimiter=',', skip_header=1) |
Creating a Boolean array from filtering criteria | np.array([2,4,6,8]) < 5 |
Boolean filtering for 1D ndarray | a = np.array([2,4,6,8]) | a[a < 5] |
Boolean filtering for 2D ndarray | ndarr[ndarr[:,12] > 50] |
Assigning values in a 2D ndarray using indices | ndarr[1,1] = 1 | ndarr[:,0] = 1 | ndarr[:,7] = ndarr[:,7].mean() |
Assigning values using Boolean arrays | ndarr[ndarr[:,5] == 2, 15] = 1 |
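
To make the Boolean indexing rows concrete, here is a small sketch on a hypothetical 2D array (the column meanings are assumptions). It assumes NumPy is installed. Filtering with a mask returns a copy, so the later assignment does not change `long_trips`.

```python
import numpy as np

# Hypothetical data: columns are [trip_id, distance_km, fare]
trips = np.array([
    [1, 2.5, 10.0],
    [2, 12.0, 40.0],
    [3, 7.0, 25.0],
    [4, 15.0, 55.0],
])

long_trip = trips[:, 1] > 10       # Boolean array, one value per row
long_trips = trips[long_trip]      # copy of the rows where the mask is True

# Assign to column 2 only for the rows selected by the mask
trips[long_trip, 2] = 0.0

print(long_trip)
print(long_trips)
print(trips[:, 2].sum())
```
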
NumPy 1D Statistics | Syntax |
---|---|
Vectorized math | + - * / |
Functions | ndarray .min() .max() .mean() .sum() |
NumPy Utils | Syntax |
---|---|
Advanced ranges | np.arange(min, max, interval) |
- Tidy data: each variable is a column, each observation is a row, and each type of observational unit is a table
- Imputation: The technical name for filling in a missing value with a replacement value
Pandas Information Basics | Syntax |
---|---|
Import module | import pandas as pd |
Reading a file into a dataframe | pd.read_csv('.csv', index_col=0, parse_dates=['col'], encoding='') |
Reading a JSON into a dataframe | pd.read_json() |
Exporting data | df.to_csv('.csv', index=False) |
Dataframe object info | df.info(memory_usage='deep') |
Describing a dataframe/series object | df.describe(include='all') | s.describe() |
Returning a dataframe/series data types | df.dtypes | s.dtype |
Returning or setting column names | df.columns |
Returning the dimensions of a dataframe | df.shape |
Create dataframe/series | pd.DataFrame({'col': []}, columns=cols) | pd.Series([]) |
Series to dataframe/list | s.to_frame('col') | s.tolist() |
Pandas Select Operations | Syntax |
---|---|
Selecting the first n rows | df.head(5) |
Selecting random n rows | df.sample(5, random_state=1) |
Selecting a single column | df['col'] |
Selecting multiple columns | df[['col', 'col2']] |
Shorthand Convention for columns | df['col'] | df[['col', 'col2']] |
Shorthand Convention for rows | df['row':'row3'] |
Selecting rows by label | df.loc[<row_labels>, [column_labels]] |
Selecting rows by index | df.iloc[<row_index>, [column_index]] |
Pandas Missing Values Handling | Syntax |
---|---|
Unique value counts for a dataframe/series | s.unique() | s.value_counts(dropna=False, bins=3, normalize=True) |
Selecting null and non-null values | s.isnull() | s.notnull() |
Selecting null rows | df.isnull().any(axis=1) |
Renaming an existing column | df.rename(columns={'src_name': 'dest_name'}, inplace=True) |
Dropping existing rows or columns | df.drop(index=['row'], columns=['col'], inplace=True) |
Dropping missing values | df.dropna(axis=0, thresh=number_of_records, inplace=True) |
Show duplicated rows | df.duplicated(cols) |
Drop duplicated rows | df.drop_duplicates(cols) |
Fill missing values | df.fillna(value) | s.fillna(value) |
Reset index column | df.reset_index(drop=True, inplace=True) |
Renaming an index | df.rename_axis(None, axis=0) |
Pandas Boolean Masks Operations | Syntax |
---|---|
In operator | df['col'].isin(['val1', 'val2']) |
Between method | df['col'].between(val1, val2) |
Updating values using Boolean filtering | df.loc[df['col'] == 0, 'col'] = np.nan |
Updating values using a Mapping dict | s.map({ 'src_name': 'dest_name' }) |
Updating values using mask method | s.mask(bool_mask, new_values) |
Pandas Sort and Convert Basics | Syntax |
---|---|
Sorting by index column | df.sort_index(ascending=False) |
Sorting by column values | df.sort_values(by='col', ascending=False) |
Converting column to datetime | pd.to_datetime(series, errors="coerce") |
Converting column to numeric | pd.to_numeric(series, errors="coerce") |
Converting column to float/int | s.astype(float/int) |
Stack multiple columns into one | df.stack(dropna=False) |
Pandas Vectorized Accessors | Syntax |
---|---|
Multi-dimensional numpy array | df.values |
Access datetime values in series | s.dt | s.dt.year |
Replace substring | s.str.replace('"', '') |
Extracting values from strings (first word) | s.str.split().str[0] |
Pandas Aggregation Methods | Syntax |
---|---|
Grouping | df.groupby('col') |
Group indexing | gr['col'] |
Select group data | gr.get_group('value') |
Groups and indexes | gr.groups |
Aggregations | gr.size() | mean, sum, count, min, max |
Multiple aggregations | gr.agg(functions_list) |
Aggregate | df.pivot_table(index=gr_cols, columns=gr_cols, values=val_cols, aggfunc=functions, margins=True) |
Pandas Transforming Data | Syntax |
---|---|
Apply function for each row in Series | s.map(func/dict) |
Apply function - Series: for each row, DataFrame: for each column | s.apply(func, args) | df.apply(func, args, axis=0) |
Apply function to every cell in the DataFrame | df.applymap(func) (df.map(func) in pandas 2.1+) |
Unpivot | df.melt(id_vars=cols, value_vars=cols) | pd.melt() |
Pivot | df.pivot(index=cols, columns=cols, values=cols) |
List-like to a row (Pandas 0.25) | df.explode(column, ignore_index=True) |
Pandas Combining DataFrames:
- Union
  - pd.concat(df_list, axis, ignore_index=True)
  - df.append() # shortcut, removed in pandas 2.0 (use pd.concat)
- Join
  - pd.merge(left=df1, right=df2, how='inner', on='col', left_on='col1', right_on='col2', left_index=True, right_index=True, suffixes=('_x', '_y'))
  - df.merge() # shortcut
  - df.join() # using indexes
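
The union and join operations can be tried on two tiny, hypothetical tables (all column names and values below are made up). It assumes pandas is installed. The key columns have different names here, so the sketch uses `left_on` / `right_on` instead of `on`.

```python
import pandas as pd

# Hypothetical tables
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"cust": [1, 1, 3, 4], "amount": [20, 35, 10, 99]})

# Union: stack the rows of frames that share the same columns
more_customers = pd.DataFrame({"customer_id": [4], "name": ["Dee"]})
all_customers = pd.concat([customers, more_customers], ignore_index=True)

# Inner join keeps only customers that have at least one order
inner = pd.merge(left=all_customers, right=orders, how="inner",
                 left_on="customer_id", right_on="cust")

# Left join keeps every customer; missing orders show up as NaN
left = all_customers.merge(orders, how="left",
                           left_on="customer_id", right_on="cust")

print(inner[["name", "amount"]])
print(left["amount"].isnull().sum())   # number of customers without orders
```
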
- Find The Graphic You Need
- The Data Visualisation Catalogue
- Plot With Pandas
- Chartjunk, Data-ink ratio: effective data visualization
- Tableau Color Blind 10
- Kernel density estimation (KDE): better histogram
- Small multiple: series of similar graphs or charts using the same scale and axes
- Matplotlib styles
- df.plot(x='col', y='col', kind='scatter')
- c='color', color='color'
- figsize=(,), ax=ax1, grid=True
- label='', legend=True, title=''
- xlim=(,), xticks=[]
- rot=30, alpha=1
- autopct='%.1f%%' # -% String Formatting -.1 precision -f fixed point -% perc -% symbol
- secondary_y=False, marker='o'
- df.plot.bar(x='col', y='col')
- df.plot.kde()
- df.hist(bins=, range=(,), histtype='step')
- df.boxplot(column='col', by='col')
- df.plot.<graph>()
- from pandas.plotting import scatter_matrix
- scatter_matrix(df[cols], figsize=(,))
Seaborn Basics | Syntax |
---|---|
Import module | import seaborn as sns |
Set background style | sns.set_style('darkgrid' | 'whitegrid' | 'dark' | 'white' | 'ticks') |
Remove spines | sns.despine(left=True, bottom=True) |
Histogram with KDE | sns.histplot(y_values, kde=True) |
Kernel Density Plot | sns.kdeplot(y_values, fill=True) |
Countplot (Clustered Bar) | sns.countplot(x='x_col', hue='y_col', data=df, order=[], hue_order=[]) |
Strip Plot (Narrow Scatter) | sns.stripplot(x='x_col', y='y_col', data=df, jitter=True) |
Box Plot (And Whisker) | sns.boxplot(x='x_col', y='y_col', data=df, whis=4, orient='vertical', width=.15) |
Heatmap | sns.heatmap(data, cmap='Blues', cbar=False, annot=False, yticklabels=False) |
Small Multiple | g = sns.FacetGrid(df, col='col1', row='col2', hue='col3', height=height) |
Fill Small Multiple | g.map(sns.kdeplot, 'y_col', fill=True) |
Matplotlib Basics | Syntax |
---|---|
Import module | import matplotlib.pyplot as plt |
Jupyter inline | %matplotlib inline |
Set plot style | plt.style.use('fivethirtyeight') |
Plot style list | plt.style.available |
Show plot | plt.show() |
Save plot | plt.savefig('file') | fig.savefig('file') |
Create figure | fig = plt.figure(figsize=(width_inches, height_inches)) |
Add plot to the figure | ax = fig.add_subplot(nrows, ncols, plot_number) |
Create figure and subplots | fig, axes = plt.subplots(nrows, ncols, figsize=(width, height)) |
Disable spines | ax.spines['side'].set_visible(False) | right, bottom, top, left |
Matplotlib Charts | Syntax |
---|---|
Line chart | plt.plot(x_values, y_values, c='color', label='', linewidth=3) |
Bar plot | plt.bar(bar_positions, bar_heights, [bar_width]) |
Horizontal bar plot | plt.barh(bar_positions, bar_widths, [bar_height]) |
Scatter plot | plt.scatter(x, y) |
Histogram | plt.hist(y_values, bins=int, range=(min, max)) |
Box plot | plt.boxplot(values) |
Matplotlib Plot and Axis | plt | ax | Arguments |
---|---|---|---|
Set Title | title | set_title | 'text' |
Add Legend | legend | legend | 'text', loc='upper left', fontsize=12 |
Set Axis Labels | xlabel | set_xlabel | 'text', size=12 |
Ticks and Their Labels | xticks | set_xticks, ax.set_xticklabels | [ticks], [labels], rotation=90, size=12 |
Batch Tick Parameters | tick_params | tick_params | bottom=False, top=False, left=False, right=False, labelbottom=False, labelsize=12 |
Set Axis Limit Range | xlim | set_xlim | min, max |
Add H/V Lines | axhline | axhline | y, label='', c='color', alpha=1 |
Add Text | text | text | x, y, 'text' |
- Data Query Language (DQL)
- Data Definition Language (DDL): CREATE, ALTER, and DROP
- Data Control Language (DCL): GRANT, REVOKE
- Data Manipulation Language (DML): SELECT, INSERT, UPDATE, DELETE
- sqlstyle.guide
- trino.io: Fast distributed SQL query engine for big data analytics
select <column_name, ..., *>
from <table_name>
where <condition>
group by <column_name, ...>
having <condition>
order by <column_name | column_number, ...> [desc]
limit 10;
insert into <table_name> [(
<column_name>, ...
)] values (
<value_1>, ...
), ...;
update <table_name>
set <column_name> = <new_value>, ...
where <condition>;
delete from <table_name> where <condition>;
Description | Syntax |
---|---|
Execution order | from -> where -> group by -> having -> select -> order by -> limit |
String concatenation (SQLite) | || |
Unique | distinct |
Aggregation functions | count, sum, avg, min, max, len |
Rounding results | round(<column, value>, <n>) |
Casting types | cast(<column, value> as <type>) |
Case conversion | lower(), upper() |
Conditional logic | case when <expression> then <value1> [...] [else <value2>] end as <name> |
IN Operator | <column, value> in (<values>) |
Joining data | <inner, left, right, full (outer), cross> join on <condition> |
Combining rows | union [all], intersect, except |
Null operations | is [not] null |
Like pattern | like '[pattern] [%]' |
Named subquery | with <name> as (<query>) [...] |
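
The execution order above (where before group by, having after aggregation) can be checked with an in-memory SQLite database. This is a sketch only: the `sales` table, its rows, and the thresholds are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales (region text, amount real);
    insert into sales values
        ('north', 120.0), ('north', 80.0),
        ('south', 30.0), ('south', 20.0),
        ('west', 500.0);
""")

query = """
    select region,
           round(sum(amount), 1) as total,
           case when sum(amount) >= 300 then 'high' else 'low' end as tier
    from sales
    where amount > 25            -- filters rows before grouping
    group by region
    having sum(amount) >= 100    -- filters groups after aggregation
    order by total desc
    limit 10;
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

The `south` group survives the `where` filter with one row (30.0) but is then removed by `having`, because its total is below 100.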
create table <table_name> (
<column_name> <column_type> [primary key]
,primary key (<column_name>, ...)
,foreign key (<column_name>) references <table_name>(<column_name>)
);
alter table <table_name>
add column <column_name> <column_type>;
Description | Syntax |
---|---|
SQLite column types | text, integer, real, numeric, blob |
Creating a view | create view <name> as <query> |
Removing an object | drop <view, table> [if exists] <name> |
- A Simple Guide to Five Normal Forms in Relational Database Theory
- Database Normalization on Wikipedia
- First normal form: the values in each column of a table must be atomic
- Second normal form: every non candidate-key attribute must depend on the whole candidate key, not just part of it
- Third normal form: eliminating the transitive functional dependencies
#!conda install -yc conda-forge ipython-sql
%%capture
%load_ext sql
%sql sqlite:///sqlite_file.db
%%sql
<query>
Description | Syntax |
---|---|
Import module | import sqlite3 |
Connect to database | conn = sqlite3.connect(path) |
Close the connection | conn.close() |
Create a cursor | cursor = conn.cursor() |
Run the query | cursor.execute(sql_query) |
Return one row | cursor.fetchone() |
Return n rows | cursor.fetchmany(n) |
Return the full results | cursor.fetchall() |
No cursor shortcut | conn.execute(sql_query).fetchall() |
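
A short sketch of the cursor workflow, using an in-memory database and a hypothetical `users` table. It shows how `fetchone`, `fetchmany`, and `fetchall` consume the same result set one after another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a file path works the same way
cursor = conn.cursor()
cursor.execute("create table users (id integer primary key, name text)")
# '?' placeholders let sqlite3 bind the values safely
cursor.executemany("insert into users (name) values (?)",
                   [("ann",), ("bob",), ("cy",), ("dee",)])
conn.commit()

cursor.execute("select id, name from users order by id")
first = cursor.fetchone()        # one row as a tuple
next_two = cursor.fetchmany(2)   # a list with up to two rows
rest = cursor.fetchall()         # whatever is left

# Shortcut without an explicit cursor
count = conn.execute("select count(*) from users").fetchall()
conn.close()
print(first, next_two, rest, count)
```
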
Description | Syntax |
---|---|
Open database | sqlite3 <dbname.db> |
Enable column headers | .headers on |
Enable column output | .mode column |
Help | .help |
Tables list | .tables |
Run in shell | .shell <command> |
Quit | .quit |
View the schema for a table | .schema <table_name> |
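import pandas as pd
from sqlalchemy import create_engine, text

# LOGIN, PASS, URL and DB are placeholders that must be defined beforehand
engine = create_engine(f'mysql://{LOGIN}:{PASS}@{URL}/{DB}?charset=utf8')
with engine.begin() as conn:
    # SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
    result = conn.execute(text('''
        select value
        from table
    '''))
    # Keep the column names returned by the query
    df = pd.DataFrame(result.all(), columns=list(result.keys()))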
select convert(varchar(8), getdate(), 112) as [DateKey]
select cast([YYYYMMDD] as datetime) as [DateTime]
- Fuzzy language is vague language, and it is common in the workplace
- Seek clarification: What is the reason behind the request? What is the right question to ask?
- Proxy: a variable that stands in place of another variable (which is typically hard to get)
- Price dumping: occurs when manufacturers export a product to another country at a price below the normal price with an injurious effect (could be illegal)
- Line organization: most requests come directly from your manager
- Functional organization: requests can come from all over the company
- Majority rule: a decision rule that selects alternatives with more than half the votes
- Prototyping has several advantages, such as easier estimation, an earlier profitability decision, and flexibility in changes and goals
- Supply Chain Management (SCM): the management of the flow of goods and services
Measurements that help management keep track of the overall health of the business:
- Metrics are observed across time
- Metrics are calculated separately at specific points in time
- Metrics are understood in a chronological context
A good metric should have the following characteristics:
- Accurate: do not create anything wrongly measured
- Simple and intelligible: easy to read for anyone
- Easy to drill down into: are we doing well or badly, and why
- Actionable: ability to change things according to the measure
- Dynamic: metrics need to change over reasonable periods of time
- Standardized: everyone should see the same thing with no inconsistency
- Business Oriented: should be relevant for the business
Examples:
- Gross Domestic Product (GDP)
- Inflation
- Unemployment Rate
- Revenue
- Conversion Rate (CR)
- Average Order Value (AOV): reduce payback period and increase Return on Investment (ROI) in retail
- Net Promoter Score (NPS): quantifies customer satisfaction
- % Promoters - % Detractors = (# Promoters - # Detractors) / # Total
- Churn Rate: the share of customers who cease to be customers in a given period (subscription-based)
- # Churned Customers / # Total Customers
- The more customers you lose, the smaller the pool of potential customers becomes
- Current customers are more likely to buy the more expensive products than new customers (subscription based)
- Churn rate informs how happy customers are with your product
- Happy customers provide free advertising
- Retaining existing customers is more profitable than acquiring new customers
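
The NPS and churn rate formulas above can be sketched in a few lines. The cut-offs used here (9-10 for promoters, 0-6 for detractors on a 0-10 survey) are the commonly used NPS convention and are not stated in the list above. The survey answers and customer counts are hypothetical.

```python
def net_promoter_score(scores):
    """NPS from 0-10 answers: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

def churn_rate(churned_customers, total_customers):
    """Share of customers lost during the period."""
    return churned_customers / total_customers

survey = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]   # hypothetical survey answers
nps = net_promoter_score(survey)
churn = churn_rate(churned_customers=25, total_customers=500)
print(nps, churn)
```
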
- Introducing KaTeX: The fastest math typesetting library for the web
- Granularity: the level of detail at which data is stored
- Rule of thumb: principle based on practical experience rather than theory
- Population: the set of all individuals relevant to a particular statistical question
- Sample: a smaller group selected from a population
- Parameter: a population metric
- Statistic: a sample metric
- Sampling error: difference between the metrics of a population and the metrics of a sample
- sampling error = parameter - statistic
- Representativeness: every individual in the population has an equal chance to be selected, leading to smaller sampling error
- Simple random sampling (SRS): a sampling method using random numbers to select a few sample units
# df.sample()
- Stratified sampling: organize (stratify) data into different groups (strata), and then sample randomly from each group
- Maximize the variability between strata (different groups)
- Minimize the variability within each stratum
- The stratification criterion should be strongly correlated with the property you're trying to measure
- Cluster sampling: picking only a few of the individual data sources (clusters)
- Descriptive statistics: describing a sample or a population by measuring and visualizing stuff
- Inferential statistics: using a sample to draw conclusions (inferring) about a population
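
The sampling ideas above can be sketched with the standard library alone. The population below is synthetic (two hypothetical groups with very different means), and the seed is fixed only to make the run repeatable. The sketch computes the sampling error as parameter minus statistic for a simple random sample and for a proportionally stratified sample.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: (group, value) pairs with very different group means
population = [("junior", random.gauss(40, 5)) for _ in range(800)] + \
             [("senior", random.gauss(90, 5)) for _ in range(200)]
parameter = mean(value for _, value in population)

# Simple random sampling: every individual has the same chance of selection
srs = random.sample(population, 50)
srs_error = parameter - mean(value for _, value in srs)

# Stratified sampling: sample each stratum in proportion to its size
strata = {}
for group, value in population:
    strata.setdefault(group, []).append(value)
stratified = []
for group, values in strata.items():
    k = round(50 * len(values) / len(population))
    stratified.extend(random.sample(values, k))
stratified_error = parameter - mean(stratified)

print(f"SRS sampling error:        {srs_error:+.2f}")
print(f"Stratified sampling error: {stratified_error:+.2f}")
```
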
A variable is a property with varying values. Variables can be divided into two categories:
- Quantitative variable: describes how much there is of something
- We can tell the size or direction of the difference
- e.g. height, age (date), points, experience
- Qualitative variable (Categorical): describes what or how
- We cannot tell the size and direction of the difference
- e.g. name, position, place, college
A scale of measurement is the system of rules that defines how each variable is measured:
- The Nominal scale: measuring qualitative variables only
- An Ordinal scale: measuring quantitative variables only
- We can tell the direction of the difference
- We cannot tell the size of the difference (intervals between ranks could differ)
- We should be careful when calculating averages for ordinal variables (different results with shifted encoding systems)
- Interval and Ratio scales: measuring quantitative variables only
- Preserves the order between values and has well-defined intervals using real numbers
- On a Ratio scale, the zero point means "no quantity", while on an Interval scale it indicates the presence of a quantity
- Using a Ratio scale we can measure the difference in terms of ratios (division)
- Discrete variable: there is no possible intermediate value between any two adjacent values
- Continuous variable: contains an infinity of values between any two values
- Frequency Distribution Table shows how frequencies are distributed
- Grouped Frequency Distribution Tables: each group (interval) is called a class interval
s.value_counts(bins=intervals)
pd.interval_range(start=0, end=100, freq=10)
- there should be a good balance between information and comprehensibility
Types of Frequencies:
- Absolute frequencies: absolute counts
# s.value_counts()
- Relative frequencies: proportions and percentages
# s.value_counts(normalize=True)
Percentiles and Quartiles:
- Percentile rank of a score is the percentage of scores in its distribution that are less than it
- Percentile and percentile rank are related terms: a percentile is a score, while a percentile rank is measured in percentages
from scipy.stats import percentileofscore
percentileofscore(a=series, score=value, kind='weak')
- Quartiles: the three percentiles, 25th (lower quartile), the 50th (middle quartile), and the 75th (upper quartile), that divide the distribution in four equal parts
# s.describe(percentiles=[])
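As a rough sketch of these definitions, the snippet below computes a percentile rank by hand (the percentage of values strictly below a score) and the three quartiles with `np.percentile`. The helper name `percentile_rank` and the data are assumptions made for illustration.

```python
import numpy as np

# Hypothetical distribution (illustrative data only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

def percentile_rank(data, score):
    """Percentage of values in `data` that are strictly less than `score`."""
    return 100.0 * np.mean(data < score)

rank_of_7 = percentile_rank(values, 7)

# 25th, 50th and 75th percentiles (NumPy interpolates linearly by default)
q1, q2, q3 = np.percentile(values, [25, 50, 75])
print(rank_of_7, q1, q2, q3)
```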
Types of Distributions:
- Skewed Distributions
- Left skewed (negatively skewed): the tail points in the direction of negative numbers
- Right skewed (positively skewed): the tail points in the direction of positive numbers
- Symmetrical Distributions
- Normal distribution (Gaussian distribution): the values pile up in the middle and gradually decrease toward both ends
- Uniform distribution: the values are distributed uniformly
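One way to check the direction of skew numerically is the sign of the sample skewness. The sketch below uses pandas' `Series.skew()`, which is not mentioned in these notes and is used here as an assumption; the data is made up.

```python
import pandas as pd

# Hypothetical data: a long tail toward large values (right skewed)
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])

# Mirroring the values flips the tail toward negative numbers (left skewed)
left_skewed = -right_skewed

# Evenly spaced values are symmetrical around their mean
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_skewed.skew())  # positive for a right-skewed distribution
print(left_skewed.skew())   # negative for a left-skewed distribution
print(symmetric.skew())     # approximately zero for a symmetrical one
```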
Visualizing Distributions:
- Nominal and Ordinal variables are commonly visualized using a bar plot or a pie chart (the pie chart gives a better sense of the relative frequencies)
- The most commonly used graph for visualizing distributions is the histogram
- A smoothed histogram that displays densities (probabilities) instead of frequencies is called a Kernel Density Estimate (KDE) plot
- When we need to compare multiple (> 4) distributions, it is better to use strip plot or box plot
- Quartiles
  - Lower quartile index: $Qi_1 = (n+1) * 0.25$
  - Upper quartile index: $Qi_3 = (n+1) * 0.75$
  - Interquartile range: $\text{IQR} = \text{upper quartile} - \text{lower quartile}$
- Outliers are values in the distribution that are much larger or much lower than the rest of the values
  - Lower bound: $\min = Q_1 - 1.5 * \text{IQR}$
  - Upper bound: $\max = Q_3 + 1.5 * \text{IQR}$
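The outlier bounds above can be sketched as follows. The data is made up, and `np.percentile` is used for the quartiles; its default linear interpolation can give slightly different quartile values than the $(n+1)$ index formula.

```python
import numpy as np

# Hypothetical distribution with one suspiciously large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Values outside [lower_bound, upper_bound] are flagged as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(lower_bound, upper_bound, outliers)
```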
- Arithmetic Mean μ (Parameter): total sum divided by total number of values (the total distances of the values below and above the mean are the same)
  $\mu = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
  - Sample Mean x̄ (Statistic): there are three possible scenarios: overestimation, underestimation, equal estimation (when x̄ > μ or x̄ < μ, sampling error occurs)
- Sampling Error: $\mu - \bar{x}$
  - Sample Representativity: the more representative a sample is, the closer x̄ will be to μ
  - Sample Size: the larger the sample, the better the chances of getting a representative sample and a smaller sampling error
- Unbiased Estimator: a statistic that is on average equal to the parameter it estimates
  - The sample mean x̄ is an unbiased estimator of μ; this is true for any distribution of real numbers when the samples have equal size
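To illustrate sampling error and unbiasedness, the sketch below enumerates every possible sample of size 3 from a tiny made-up population. Individual sample means over- or underestimate μ, but their average equals μ. The population and the sample size are assumptions chosen to keep the enumeration small.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population (illustrative data only)
population = [3, 7, 2, 8, 5]
mu = mean(population)

# Every possible sample of size 3, drawn without replacement
sample_means = [mean(sample) for sample in combinations(population, 3)]

# Sampling error of each sample: mu - x_bar
sampling_errors = [mu - x_bar for x_bar in sample_means]

# Averaging the sample means over all samples recovers mu
mean_of_sample_means = mean(sample_means)
print(mu, mean_of_sample_means)
```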
- Weighted Mean: takes into account the different weights
  $\dfrac{\sum_{i=1}^{N} x_i w_i}{\sum_{i=1}^{N} w_i}$
  np.average(houses_per_year['Mean Price'], weights=houses_per_year['Houses Sold'])
- Open-Ended Distribution: distribution with open boundary, for example "10 or more / 10+"
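A small sketch of the weighted mean, comparing `np.average` with the formula written out by hand. The prices and sales counts are made-up numbers standing in for columns like `Mean Price` and `Houses Sold`.

```python
import numpy as np

# Hypothetical yearly data: mean price per year and houses sold per year
mean_prices = np.array([100_000, 200_000])
houses_sold = np.array([1, 3])

# Weighted mean via NumPy and via the formula sum(x_i * w_i) / sum(w_i)
weighted = np.average(mean_prices, weights=houses_sold)
manual = (mean_prices * houses_sold).sum() / houses_sold.sum()

# The plain mean ignores how many houses each yearly price represents
plain = mean_prices.mean()
print(weighted, manual, plain)
```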
- Median: the middle value in a sorted distribution ($Q_2$), resistant to outliers (robust statistic)
  # s.median()
- Mode: the most frequent value in the distribution
  # s.mode()
  - The best option for discrete values, because it always returns a whole number that occurs in the data
  - The distribution could be unimodal, bimodal or even multimodal (in case of more than one mode)
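The sketch below shows why the median is called robust and how more than one mode can appear. It uses only the standard library `statistics` module, and the data is made up for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical distribution, with and without an extreme value
values = [1, 2, 2, 3, 4]
with_outlier = values + [100]

# The mean jumps when the outlier is added; the median barely moves
print(mean(values), mean(with_outlier))
print(median(values), median(with_outlier))

# multimode returns every most-frequent value, so it handles bimodal data
modes_unimodal = multimode(values)
modes_bimodal = multimode([1, 1, 2, 2, 3])
print(modes_unimodal, modes_bimodal)
```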
- Range of Distribution: measures the variability of a distribution (average distance, dispersion)
  # s.std() (pandas divides by n-1 by default)
  $\text{mean absolute deviation} = \dfrac{\sum_{i=1}^{N} |x_i - \mu|}{N}$
  $\text{mean squared deviation (variance)} = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
  $\text{standard deviation} = \sqrt{\text{variance}}$
  - Bessel's correction: divide by n-1 instead of n when working with a sample, because dividing by n tends to underestimate the population variance
  # np.std(values, ddof=1)
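The measures of variability above can be computed step by step. This sketch uses made-up values and compares the population formulas (divide by N) with Bessel's correction (`ddof=1`).

```python
import numpy as np

# Hypothetical distribution (illustrative data only)
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mu = values.mean()

# Mean absolute deviation and variance, both dividing by N
mean_abs_dev = np.abs(values - mu).mean()
variance = ((values - mu) ** 2).mean()
population_std = np.sqrt(variance)

# Bessel's correction: divide by n - 1 instead of n
sample_std = np.std(values, ddof=1)
print(mean_abs_dev, variance, population_std, sample_std)
```

The corrected value is always at least as large as the uncorrected one, since the same sum of squares is divided by a smaller number.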
Measure | Can be used for | Can't be used for | Ideal for |
---|---|---|---|
Mean | Interval or Ratio; Continuous Ordinal | Nominal; Non-numeric Ordinal; For different weights use weighted mean | Summarizing numerical distributions with each value in the distribution |
Median | Interval or Ratio; Numeric Ordinal | Nominal; Non-numeric Ordinal | Summarizing numerical distributions with outliers; Open-ended distributions |
Mode | Interval or Ratio; Nominal or Ordinal | Uniform distributions; Continuous Ordinal | Nominal or Non-numeric Ordinal; Discrete values |

Measure | Value | Reporting to non-technical audiences |
---|---|---|
Mean | 1.04 | The average house has 1.04 kitchens |
Median | 1 | The average house has one kitchen |
Mode | 1 | The typical house has one kitchen |
- Correlation: measures the relationship between attributes, in the range [-1, 1]
  - Pearson correlation coefficient:
    df.corr()
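As a sketch of what the Pearson coefficient measures, the snippet below computes it by hand as the sum of products of deviations divided by the product of the deviation norms, and checks the result against `np.corrcoef`. The two variables are made-up data.

```python
import numpy as np

# Hypothetical paired observations (illustrative data only)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Deviations from the mean of each variable
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations, scaled to the range [-1, 1]
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy returns a correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```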
- Weighted sum model (WSM): the best known and simplest multi-criteria decision analysis (MCDA) method
  $A_i^{\text{WSM-score}} = \sum_{j=1}^{n} w_j a_{ij}, \text{ for } i = 1, 2, 3, \dots, m$
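The WSM score is a matrix-vector product: each row holds one alternative's values $a_{ij}$ and the vector holds the weights $w_j$. In the sketch below the weights and scores are made up, and all criteria are assumed to share the same unit with higher values being better.

```python
import numpy as np

# Hypothetical weights for n = 3 criteria (they sum to 1 here)
weights = np.array([0.5, 0.3, 0.2])

# Hypothetical scores a_ij: m = 3 alternatives (rows) by n = 3 criteria (columns)
scores = np.array([
    [7, 8, 9],
    [9, 6, 5],
    [6, 9, 8],
])

# WSM score of alternative i is sum_j w_j * a_ij
wsm_scores = scores @ weights

# Index of the alternative with the highest score
best = int(np.argmax(wsm_scores))
print(wsm_scores, best)
```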
- Min-Max Feature scaling (Normalization): rescales values to [0, 1] so that different scales can be compared in a meaningful way
  $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
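A minimal sketch of min-max scaling on a made-up array. It assumes the values are not all identical, since otherwise the denominator would be zero.

```python
import numpy as np

# Hypothetical feature on an arbitrary scale (illustrative data only)
x = np.array([10.0, 20.0, 30.0, 50.0])

# x' = (x - min(x)) / (max(x) - min(x)); assumes max(x) != min(x)
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```

After scaling, the smallest value maps to 0 and the largest to 1, so features measured in different units can be compared directly.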
Modelling
Touchpad = Rotate
Shift + Touchpad = Move
Ctrl + Touchpad = Zoom
Drag Corner = Join / New Window
Tab = Object / Edit
Shift + A = New Object
Shift + D = Duplicate
Shift + Tab (Ctrl) = Snap (Face Project + Individual Elements)
Ctrl + Tab = Mode Wheel
Ctrl + A = Apply Stuff (scale, rotation)
~ = View Wheel
N = Properties
X = Delete
O = Proportional Editing (wheel)
Ctrl + P = Parent Selected
P = Separate
Select -> Select Random
L = Select Linked
Ctrl + L = Link Stuff (Shared Material)
Ctrl + R = Loop Cut
Ctrl + B = Bevel Edge
H = Hide
Alt + H = Unhide All
Alt + G = Reset Location
Alt + Z = X-Ray
Alt + Click = Edge Loop
G = Grab
R = Rotate
S = Scale
Alt + S = Directional Scale
x/y/z = Choose Axis
// Modifiers
Subdivision - create extra faces
Solidify - add a volume
Shrinkwrap - wrap around inner mesh
Displace - randomize stuff using textures
// Sculpting
F = Radius
Shift + F = Strength
X = Draw
Ctrl + X = Reverse Draw
G = Grab
I = Inflate
Shift + S = Smooth Details
Brushes -> Stroke -> Airbrush = Click and Hold
// Animation
I = Insert Keyframe
ffplay - play video
ffprobe - get metadata
ffmpeg -f concat -i vids.txt output.mp4 (-f concat must come before -i; vids.txt lists the files to join)
ffmpeg -i input.mp4 output.mp4
-c:v [mpeg4, libx264, libx265] -c:a [mp3, aac]
-c copy (skip re-encoding)
-b:v 7500k
-r 30
-s 1920x1080
-ss 00:00:00 -to 00:00:00
-ac 2 (downmix to stereo, do not use -c:a copy)
-map_chapters -1 (remove chapters)
-map 0 (keep all streams from the first input, including all audio tracks)
-sn (remove subtitles)
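When the same options are reused often, it can help to assemble the command in a script. The sketch below builds an ffmpeg argument list from the flags listed above without running it. The function `build_ffmpeg_command` and its parameters are hypothetical names introduced for illustration.

```python
import shlex

def build_ffmpeg_command(src, dst, video_codec="libx264", audio_codec="aac",
                         video_bitrate=None, fps=None, size=None,
                         start=None, end=None):
    """Hypothetical helper: assemble an ffmpeg argument list (does not run it)."""
    cmd = ["ffmpeg", "-i", src, "-c:v", video_codec, "-c:a", audio_codec]
    if video_bitrate:
        cmd += ["-b:v", video_bitrate]   # target video bitrate, e.g. 7500k
    if fps:
        cmd += ["-r", str(fps)]          # output frame rate
    if size:
        cmd += ["-s", size]              # output resolution, e.g. 1920x1080
    if start:
        cmd += ["-ss", start]            # start of the cut
    if end:
        cmd += ["-to", end]              # end of the cut
    cmd.append(dst)                      # the output file goes last
    return cmd

command = build_ffmpeg_command("input.mp4", "output.mp4", video_bitrate="7500k",
                               fps=30, size="1920x1080",
                               start="00:00:10", end="00:00:20")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed.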