Introduction to Libraries: Pandas

Learn the basics of Pandas for data analysis.


Pandas is a Python library that provides fast, easy-to-use data structures and data analysis tools. It is a standard choice for anyone working with tabular data in Python.

What is Pandas?

Pandas lets you load, clean, transform, and analyze data efficiently. It excels at handling structured (tabular) data, which makes it well suited to tasks like:

  • Data cleaning and preprocessing
  • Exploratory data analysis (EDA)
  • Data wrangling
  • Statistical analysis
  • Time series analysis

Basics of Pandas

This section covers the fundamental concepts and functionalities of Pandas. We'll focus on the two core data structures: Series and DataFrames.

1. Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). Think of it as a single column in a spreadsheet.

Creating a Series:

import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s) 

This will output:

0    10
1    20
2    30
3    40
4    50
dtype: int64 

Notice the index (0, 1, 2, 3, 4) on the left. You can customize the index:

import pandas as pd

# Creating a Series with a custom index
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(data, index=index)
print(s) 

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64 
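
A Series can also be built from a dictionary, in which case the keys become the index labels. The following is a minimal sketch; the names `prices` and `names` and their values are made up for illustration. It also shows that the dtype is inferred from the data.

```python
import pandas as pd

# Keys of the dict become the index, values become the data
prices = pd.Series({'apple': 1.5, 'banana': 0.25, 'cherry': 3.0})

# A Series of strings gets a non-numeric dtype (the exact name depends on the pandas version)
names = pd.Series(['Alice', 'Bob'])

print(prices.index.tolist())  # ['apple', 'banana', 'cherry']
print(prices.dtype)           # float64
print(names.dtype)
```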

Accessing Data in a Series:

import pandas as pd

data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(data, index=index)

# Accessing data by label (index)
print(s['b'])  # Output: 20

# Accessing data by position (use .iloc; plain s[1] on a non-integer index is deprecated)
print(s.iloc[1])   # Output: 20
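
Beyond single values, a Series supports slicing and boolean filtering. The sketch below reuses the same example data and shows one way to do it. Note that label slices with `.loc` include both endpoints, while position slices with `.iloc` exclude the stop position.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b':'d']     # label slice includes both endpoints: b, c, d
by_position = s.iloc[1:3]     # position slice excludes the stop: positions 1 and 2
large = s[s > 25]             # boolean mask keeps values greater than 25

print(by_label.tolist())      # [20, 30, 40]
print(by_position.tolist())   # [20, 30]
print(large.tolist())         # [30, 40, 50]
```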

2. DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is like a spreadsheet or SQL table, or a dict of Series objects.

Creating a DataFrame:

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)
print(df) 

This will output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris 
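
To illustrate the "dict of Series" view of a DataFrame, here is a small sketch with made-up data (the `ages` and `cities` Series are hypothetical). Each Series becomes a column, and rows are aligned on the index labels; a label missing from one Series produces a missing value in that column.

```python
import pandas as pd

ages = pd.Series([25, 30], index=['alice', 'bob'])
cities = pd.Series(['New York', 'Paris'], index=['alice', 'charlie'])

# Each Series becomes a column; rows are the union of the index labels
df = pd.DataFrame({'Age': ages, 'City': cities})
print(df)

# Selecting a column gives back a Series
print(type(df['Age']))
```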

Accessing Data in a DataFrame:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)

# Accessing a column
print(df['Name'])

# Accessing a row using .loc (label-based)
print(df.loc[0])

# Accessing a specific cell using .loc
print(df.loc[0, 'Name'])  # Output: Alice

# Accessing a row using .iloc (integer-based)
print(df.iloc[0])

# Accessing a specific cell using .iloc
print(df.iloc[0, 0])  # Output: Alice
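
The same accessors extend to selecting several columns or filtering rows by a condition. The following sketch reuses the example DataFrame; the threshold of 26 is arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris']
})

subset = df[['Name', 'City']]            # a list of column names returns a DataFrame
older = df[df['Age'] > 26]               # boolean mask keeps rows where Age > 26
names = df.loc[df['Age'] > 26, 'Name']   # filter rows and pick one column at once

print(subset.shape)     # (3, 2)
print(len(older))       # 2
print(names.tolist())   # ['Bob', 'Charlie']
```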

Key DataFrame Operations:

  • df.head(): Returns the first n rows of the DataFrame (5 by default).
  • df.tail(): Returns the last n rows of the DataFrame (5 by default).
  • df.info(): Prints a concise summary of the DataFrame, including column data types, non-null counts, and memory usage.
  • df.describe(): Returns descriptive statistics (count, mean, standard deviation, min, quartiles, and max) for numerical columns; the 50% row is the median.
  • df.shape: Returns the dimensions (rows, columns) of the DataFrame.
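
As a quick illustration, the operations listed above can be tried on the small example DataFrame from earlier. This is only a sketch; on three rows the statistics are not meaningful, but the calls work the same way on larger data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 28],
    'City': ['New York', 'London', 'Paris']
})

print(df.shape)        # (3, 3)
print(df.head(2))      # first two rows
print(df.tail(1))      # last row
df.info()              # prints the summary itself and returns None

stats = df.describe()  # by default only numeric columns are summarized
print(stats.loc['50%', 'Age'])  # median age: 28.0
```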

This is just a basic introduction to Pandas. There's much more to explore, including data cleaning, manipulation, merging, grouping, and visualization. Continue practicing and exploring the Pandas documentation to deepen your understanding.