Data Science Terms: The Definitive List You Need to Know

Demand for data scientists is on the rise. In the past five years, job postings in the field have increased by 300 percent as companies look to leverage the rapid influx of relevant information and turn big data into better action.

But there's a disconnect. According to the Harvard Business Review, while advanced analytics tools have the capability to deliver enhanced insight, "efforts fall short in the last mile, when it comes time to explain the stuff to decision makers."

The results speak for themselves: More than 80 percent of data science projects never make it to production.

To bridge the gap and reap the benefit of analytics at scale, you need to have a basic understanding of what data scientists mean when they pitch a new project or discuss current concerns.

It's time for a crash course — here's the definitive list of data science terms you need to know.


Analyst — Data handlers on the front lines. They collect and process data for statistical analysis.

Data Scientist — Specialized data analysts with the skills to create new algorithms and uncover hidden data correlations. Probably don't wear lab coats.

Data Engineer — Responsible for building massive data reservoirs and testing data architectures.

Business Analytics — The umbrella term for gaining actionable insight with analytics. Includes skills, technologies and applications.

Dashboard — Big data visual aid. Displays critical information and can be customized by analysts.

Machine Learning — Algorithms that find patterns in data and use these patterns to "learn" over time. No, they're not coming for your job.

Model — [ML] An algorithm with some initially unknown parameters that it estimates from existing data in order to make predictions.

Accuracy — Ratio of correct predictions to total predictions for a machine learning model. Higher accuracy is better.
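
To make this concrete, here is a minimal Python sketch that treats accuracy as the share of predictions that turn out to be correct. The function name and the renewal labels are made up for illustration.

```python
def accuracy(actual, predicted):
    """Share of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels: did each customer renew (1) or not (0)?
actual = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(actual, predicted)
print(score)  # 0.75, because 6 of the 8 predictions match
```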

Confidence — Level of certainty around data insights. A combination of accuracy and reliability of the data source.

Prediction — Potential outcome based on data analysis and modeling.

Correlation — How variables relate. They may increase together, decrease together or go opposite ways. Does not mean one causes the other.
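
One common way to put a number on this is the Pearson correlation coefficient, which runs from -1 (variables move in opposite directions) to +1 (they move together). The sketch below is illustrative only, and the ad spend, sales and returns figures are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical monthly figures
ad_spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 55]   # rises as ad spend rises
returns = [9, 7, 6, 4, 2]      # falls as ad spend rises

r_up = pearson(ad_spend, sales)      # close to +1
r_down = pearson(ad_spend, returns)  # close to -1
print(round(r_up, 3), round(r_down, 3))
```

A value near +1 or -1 only says the two series move in step. It says nothing about whether one causes the other.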

Python — An open source programming language that's easy to learn and widely used. No fangs.

R — An open source programming language used for statistical and graphical computing.

SPSS — Software for text and statistical analysis using machine learning.

SAS (Statistical Analysis System) — A software suite capable of mining, altering, and managing data from different sources.

Database — Where you keep your data. Organized for easy access, management, and updating.

MIS (Management Information System) — A combination of software, hardware, data, and people management tools to help streamline analytics and data storage.


Descriptive Analytics — A summary of historical data. Answers the question "what happened?"

Predictive Analytics — Analyzing current data through mining and statistical modeling to make future predictions. No crystal balls needed.

Prescriptive Analytics — Using data to determine the best action for a specific case, or a "prescription" for best results. Use as directed =).

Deep Learning — Subset of machine learning that uses neural networks with many layers. The resulting models are often so complex that humans can't easily see how they work. Possibly the start of Skynet.

Big Data — Datasets so large that "normal" SQL databases can't handle them.

Regression — Technique used to evaluate how changing one (or more) variables impacts another variable.
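
The simplest case is a straight-line fit with one input variable, using ordinary least squares. The sketch below assumes that setup. The helper name and the training-hours data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours of training vs. units produced per shift
hours = [1, 2, 3, 4]
units = [5, 8, 11, 14]

slope, intercept = fit_line(hours, units)
# The slope estimates how much output changes per extra hour of training
print(slope, intercept)
```

With these numbers each extra hour of training is associated with three more units per shift.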

ETL (Extract Transform Load) — A combination of three database functions in one tool: Extracting data from a tool or database, transforming that data into the correct form, and then loading the data into another database.
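
The three steps can be sketched in a few lines of Python using only the standard library. Everything here is a stand-in: the CSV text plays the role of the source system, and an in-memory SQLite database plays the role of the destination. The table and column names are invented.

```python
import csv
import io
import sqlite3

# Stand-in for a messy export from a source system
raw_csv = "name,amount\n alice ,10.50\nBOB,3\n"

# Extract: read the raw rows
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: tidy up names and convert amounts to numbers
clean = [(r["name"].strip().title(), float(r["amount"])) for r in rows]

# Load: write the cleaned rows into the destination database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(clean, total)
```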

Decision Tree — Questions applied to data — typically "yes or no" — by machine learning algorithms. Similar to a flow chart, but with bigger datasets.

Neural Network — A machine learning framework designed to mimic the human brain and learn by incorporating new data.

Clustering — Machine learning method that creates groups of data points that are similar to one another and different from those in other groups.
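
One widely used clustering method is k-means, which repeatedly assigns each point to its nearest group center and then moves each center to the average of its group. Below is a minimal one-dimensional sketch. The customer ages and starting centers are made up, and real tools handle many dimensions and choose starting points for you.

```python
def kmeans_1d(values, centers, rounds=10):
    """Tiny k-means for single numbers; returns final centers and groups."""
    groups = []
    for _ in range(rounds):
        # Assignment step: put each value in the group of its nearest center
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the average of its group
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical customer ages and two starting guesses for the centers
ages = [18, 21, 22, 45, 47, 52]
centers, groups = kmeans_1d(ages, centers=[20, 50])
print(groups)  # [[18, 21, 22], [45, 47, 52]]
```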

Recommender — Recommender systems offer relevant suggestions to users based on data relevance and similarity. Think Netflix.

Forecasting — A subset of prediction that uses specific time points — both historic and current — to determine the likelihood of specific outcomes.

On-Premise — Analytics tools and solutions running locally and hosted on your own servers.

Cloud Computing — Big data applications and services delivered via offsite servers "in the cloud." Often takes the form of software-as-a-service (SaaS).

NLP (Natural Language Processing) — Algorithms that improve machine understanding of human language, both written and spoken. Think chatbots.

Feature Engineering — Refines data into predictor variables to help improve machine learning.

Automatic Feature Engineering — The automatic generation of refined data for analytics.


MapReduce — A programming model and software framework used to write applications that process large datasets in parallel across many machines.
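
The classic illustration is a word count. The sketch below runs on a single machine, so it only shows the shape of the idea: a map step that emits key/value pairs, a grouping step, and a reduce step that combines the values for each key. The sample documents are invented.

```python
from collections import defaultdict
from functools import reduce

documents = ["big data big insight", "data drives insight"]

# Map: emit a (word, 1) pair for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values for each key into a single total
word_counts = {
    word: reduce(lambda a, b: a + b, counts)
    for word, counts in grouped.items()
}
print(word_counts)  # {'big': 2, 'data': 2, 'insight': 2, 'drives': 1}
```

In a real cluster the map and reduce steps run on many machines at once, which is where the speedup comes from.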

Hadoop — An open source software framework that allows large datasets to be stored and processed across clusters of computers.

NoSQL (Not Only SQL) — Databases that allow free-form information storage.

Classification Model — Frameworks that analyze and classify data inputs and attempt to predict outcomes.

Regression Model — Framework used to predict the value of one variable based on the behavior of another.

Supervised/Unsupervised Learning — Supervised learning occurs when you are trying to predict something that has a known target, such as tomorrow's weather. Unsupervised learning occurs when you do not have a known or guaranteed target, such as grouping customers together based on demographic data.

Feature Selection — Automatically or manually selecting the machine learning features that benefit your desired outcome and removing those that are not important.

PCA (Principal Component Analysis) — A statistical process that converts the original variables into a smaller set of new, uncorrelated variables that capture most of the information in the original data.

Dimensionality Reduction — Process to reduce the number of random variables and improve machine learning outcomes, typically through feature selection or PCA.

Version Control — The process of tracking changes made to files over time, such as software code, data or data analysis rules. Lets you see what changed, when, and by whom.

Git — A free, open source version control system.

AUC (Area Under Curve) — The area under the ROC curve, a graph that plots a classification model's true positive rate against its false positive rate. Values range from 0 to 1, and 0.5 is no better than guessing. The higher the value of the AUC, the better your classification model.
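
Assuming the curve in question is the ROC curve, AUC has a handy interpretation: it is the chance that the model scores a randomly chosen positive case higher than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The labels and scores are hypothetical.

```python
def auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical fraud labels (1 = fraud) and model scores
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_value = auc(labels, scores)
print(round(auc_value, 3))  # 0.889, since 8 of the 9 pairs are ranked correctly
```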

Logloss — The negative average of the logarithms of the probabilities a model assigned to the correct outcomes. The lower this value, the better your model.
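
A short sketch of the usual formula for a yes/no model: take the probability the model gave to the outcome that actually happened, take the negative logarithm, and average across all cases. The labels and probabilities below are made up to compare a confident model with a hesitant one.

```python
from math import log

def log_loss(labels, probs):
    """Average negative log of the probability given to the true outcome."""
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = p if y == 1 else 1 - p  # probability assigned to what happened
        total += -log(p_true)
    return total / len(labels)

# Hypothetical outcomes and predicted probabilities of a "yes"
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # right, and sure about it
unsure = [0.6, 0.4, 0.5, 0.5]     # right more often than not, but hedging

loss_confident = log_loss(labels, confident)
loss_unsure = log_loss(labels, unsure)
print(round(loss_confident, 3), round(loss_unsure, 3))
```

The confident model gets the lower score. A model that is confident and wrong is punished heavily, which is what makes this metric useful.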

Multi-table AFE — The automatic combination of multiple tables while creating new features.

Bonus Round

Word2Vec — This technique turns words into "vectors," or lists of numbers. Words whose vectors are close together are more strongly associated, allowing improved word recognition. For example, the relationship between king and man mirrors the one between queen and woman, so king - man + woman lands close to queen.
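
The word arithmetic can be sketched with toy numbers. The two-number vectors below are invented by hand for illustration (roughly "royalty" and "masculinity"). Real Word2Vec vectors are learned from large amounts of text and have hundreds of dimensions.

```python
from math import sqrt

# Hand-made toy vectors, NOT real Word2Vec output
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.2, 0.5],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# king - man + woman, computed one dimension at a time
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Find the closest remaining word, leaving out the three input words
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
closest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(closest)  # queen
```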

GAN (Generative Adversarial Network) — GANs pit two neural networks against each other to improve differentiation between fake and real data. The generator creates data and the discriminator evaluates it. Over time, the generator learns to create more authentic-looking data as the discriminator learns to distinguish more granular differences. This is the basis for deepfakes.

Singularity — A hypothetical super-intelligence that surpasses human cognitive and analytical ability. Able to fully analyze all data everywhere. Will absolutely take your job, and is Skynet.

Armed with our definitive list of data science terms, you're now better prepared to understand analyst proposals, evaluate their potential benefit — and recognize where current analytics tools may be lacking.

Keyence is here to help: Get better data analysis and drive improved decision-making with Ki, an AI-powered automated analytics software. Let's talk!
