Statistics
Basic Definitions
If you're new to probability as a concept, start with some basics that can be explained in simple visuals. Keep in mind that there are two main types of data: quantitative and qualitative.
Data Visualization
There are quite a few options out there for seeing data with your eyes, since humans are bad at understanding numbers intuitively. We all have our favorite plot types:
- Bar Chart
- Box and Whisker Plots
- Dotplots
- Histogram
- Line Graph
- Pie Chart
- Scatter Plot
- Stemplots
Data Exploration
There are a lot of numerical ways to get a feel for data in addition to visual methods. Consider 1, 2, or 3 different ways of finding an "average". Find out the span your data covers, but remember to keep extremes in check! Also keep an eye on how far away data points usually are from the center.
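As a rough sketch of those numerical summaries, here is a small example using only Python's standard library. The data values are made up for illustration, and I'm assuming the three "averages" are the mean, median, and mode.

```python
import statistics

# Made-up values for illustration; 40 is a deliberate extreme value.
data = [2, 3, 3, 5, 7, 8, 9, 40]

# Three ways of finding an "average"
mean = statistics.mean(data)      # 9.625, pulled upward by the extreme value
median = statistics.median(data)  # 6.0, barely affected by it
mode = statistics.mode(data)      # 3, the most common value

# The span the data covers, with and without the extremes
data_range = max(data) - min(data)            # 38
q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
iqr = q3 - q1                                 # spread of the middle half

# How far points typically sit from the center
stdev = statistics.stdev(data)  # sample standard deviation

print(mean, median, mode, data_range, iqr, stdev)
```

Notice how the single extreme value drags the mean and the range around, while the median and the interquartile range barely move. That's the "keep extremes in check" idea in practice.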
Permutations and Combinations
These concepts are the hardest for my mind to wrap itself around. I can't figure out how counting works, apparently.
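One thing that might help: a tiny sketch that counts both ways, by formula and by brute force, so the two can be compared. The five letters and the choice of 3 are arbitrary examples.

```python
import math
from itertools import combinations, permutations

items = "ABCDE"  # arbitrary example items
n, k = len(items), 3

# Permutations: order matters, so ABC and CBA count separately.
n_perm = math.perm(n, k)  # 5 * 4 * 3 = 60

# Combinations: order doesn't matter, so ABC and CBA are the same pick.
n_comb = math.comb(n, k)  # 60 / 3! = 10

# Brute-force check: list every arrangement and count them.
assert len(list(permutations(items, k))) == n_perm
assert len(list(combinations(items, k))) == n_comb

# Each combination can be arranged k! ways, which links the two counts.
assert n_perm == n_comb * math.factorial(k)

print(n_perm, n_comb)
```

The last check is the part worth staring at: every unordered pick of 3 letters shows up 3! = 6 times in the ordered list, so dividing the permutation count by 6 gives the combination count.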
Distributions
There are some commonalities found among distributions, but here are a few of the many distributions out there that I have notes on:
- Distribution of any Continuous Data
- Distribution of any Discrete Data
- Normal Distribution
- Binomial Distribution
- Geometric Distribution
- Poisson Distribution
- Chi Squared Distribution
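For the three discrete distributions in that list, the probability mass functions are short enough to write out by hand. This is a minimal sketch using the textbook formulas; the function names are my own, and the geometric version assumes the "number of trials until the first success" convention.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success chance p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Example: chance of exactly 2 heads in 4 fair coin flips.
two_heads = binomial_pmf(2, 4, 0.5)

# Sanity check: probabilities over every possible outcome sum to 1.
total = sum(binomial_pmf(k, 4, 0.5) for k in range(5))

print(two_heads, total)
```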
Experimentation (or Applying Theory to Life)
In the real world, we run experiments to answer questions about the world around us using data. Other times, we're asked to answer a question about existing data.
Most of the time, we need to select our data from the population or we're trying to estimate the population from the data we have.
Then we have to test our question, and our final answer is always an estimate (we can't ever know anything for certain, but we can be really really sure).
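To make the "always an estimate" point concrete, here is a sketch that draws a random sample from a simulated population and reports an estimate with a margin around it. The population is fake (generated on the spot), and the 1.96 multiplier assumes a normal approximation for a roughly 95% interval.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated population; in real life we usually can't see all of this.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Select our data from the population with a simple random sample.
sample = random.sample(population, 100)

# Point estimate of the population mean, plus its standard error.
estimate = statistics.mean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval (normal approximation).
interval = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(f"estimate = {estimate:.2f}")
print(f"interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```

The interval is the "really really sure" part: it doesn't promise the true mean is inside, only that intervals built this way capture it about 95% of the time.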
Predicting the Future (under construction 🏗️)
When we want to determine the patterns in data, there are a ton of methods available. To start, prime yourself on what correlation and regression mean, and afterwards get your machete - we're heading into the weeds.
Here's a flowchart to help select which models to use based on your data.
Parametric Models
These models make assumptions about the functional form of the relationship between the predictors and the response.
- Linear Regression is a form of supervised learning usually for a quantitative response variable. It comes in two flavors: simple and multiple. Keep these considerations in mind.
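As a minimal sketch of the simple (one predictor) flavor, here is ordinary least squares written out by hand. The function name and the toy data are my own; a real analysis would more likely reach for a library.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = how x and y vary together, divided by how much x varies.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy data that lies exactly on y = 1 + 2x, so the fit is easy to check.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_linear_regression(x, y)
print(intercept, slope)
```

Multiple regression follows the same idea with more predictors, but the arithmetic is matrix-based and not worth doing by hand.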
Non-Parametric Models
These models don't make assumptions about the functional form of the relationship between the predictors and the response.
- The Bayes Classifier is the unattainable gold standard for classification models
- K-Nearest Neighbors
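K-Nearest Neighbors is simple enough to sketch in a few lines: find the k closest training points and let them vote. The function name, the Euclidean distance choice, and the toy points below are all assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two well-separated clusters.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2)))  # a
print(knn_predict(train_X, train_y, (6, 5)))  # b
```

There is no fitting step at all, which is what makes it non-parametric: the "model" is just the training data plus a choice of k.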
Classification Models
- Generative Models for Classification
- Logistic Regression is useful when linear regression isn't, namely for classification problems.
- Linear Discriminant Analysis for p = 1
- Linear Discriminant Analysis for p > 1
- Quadratic Discriminant Analysis
- Naive Bayes
- Generalized Linear Models for when the response isn't quantitative or qualitative (think counts)
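To show why logistic regression suits classification, here is a sketch of its core piece: the logistic function, which squeezes a linear predictor into a probability between 0 and 1. The coefficients below are hypothetical values picked for illustration, not fitted from data.

```python
import math

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, NOT estimated from any data.
B0, B1 = -3.0, 1.0

def predict_proba(x):
    """Probability of the positive class for a single predictor x."""
    return logistic(B0 + B1 * x)

def predict_class(x, threshold=0.5):
    """Turn the probability into a yes/no label."""
    return 1 if predict_proba(x) >= threshold else 0

for x in (1, 3, 5):
    print(x, round(predict_proba(x), 3), predict_class(x))
```

A straight regression line would happily predict values below 0 or above 1, which makes no sense as a probability. The logistic curve avoids that.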
Comparing Models
Linear Regression vs K-Nearest Neighbors
A Comparison of Classification Methods
LDA vs QDA
Assessing Models
Remember to assess your models with your test data!
Resampling Methods
Use these to repeatedly draw samples from a training set and refit a model on each sample to get more information about the fitted model.
model assessment: the process of evaluating a model's performance
model selection: the process of selecting the proper level of flexibility for a model
- Cross Validation: Use it to estimate the test error rate with training data by holding out a subset of training observations.
- Bootstrap
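Here is a rough sketch of the cross validation idea in its k-fold form: split the training data into k folds, hold each one out in turn, fit on the rest, and average the held-out errors. The "model" is deliberately trivial (it predicts the training mean) so the resampling logic stays visible; the helper names and data are made up.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_mse(y, k=5):
    """Estimate the test MSE of a mean-only model with k-fold CV."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        # Fit on everything except the held-out fold.
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = sum(train) / len(train)
        # Score only on the observations the model never saw.
        mse = sum((y[i] - prediction) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)

# Made-up response values for illustration.
y = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
cv_mse = cross_val_mse(y, k=5)
print(cv_mse)
```

Swapping the mean-only model for something real only changes the two lines that fit and predict. The fold bookkeeping stays the same.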
References
- How To Make Math Equations - my personal note on using LaTeX