Pandas is a powerful data manipulation and analysis library for Python. It is widely used for data wrangling, cleaning, and analysis due to its intuitive data structures and easy-to-use functions.
Before using Pandas, install it with pip:
pip install pandas
To use Pandas, you need to import it into your Python script or Jupyter Notebook:
import pandas as pd
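Pandas primarily uses two data structures: the Series, a one-dimensional labeled array, and the DataFrame, a two-dimensional labeled table whose columns can hold different data types.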
A Series can be created from a list, a NumPy array, or a dictionary.
import pandas as pd
# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
# Creating a Series from a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
series = pd.Series(data)
print(series)
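The NumPy array case works the same way as the list case. The following is a minimal sketch; the values and the index labels 'x', 'y', 'z' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Creating a Series from a NumPy array, with explicit (illustrative) index labels
values = np.array([10, 20, 30])
series_from_array = pd.Series(values, index=['x', 'y', 'z'])
print(series_from_array)

# Values can be looked up by label, just as with the dictionary-based Series
print(series_from_array['y'])
```

If no index is passed, Pandas falls back to the default integer index 0, 1, 2, as in the list example.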
A DataFrame can be created from a dictionary, a list of dictionaries, or a NumPy array.
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)
# Creating a DataFrame from a list of dictionaries
data = [
{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
{'Name': 'Bob', 'Age': 30, 'City': 'San Francisco'},
{'Name': 'Charlie', 'Age': 35, 'City': 'Los Angeles'}
]
df = pd.DataFrame(data)
print(df)
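A DataFrame can also be built from a two-dimensional NumPy array. This sketch assumes a small 3x2 array and the hypothetical column names 'A' and 'B'; an array carries no column names of its own, so they are supplied separately.

```python
import numpy as np
import pandas as pd

# A 3x2 array: each row becomes one record, each column one field
array = np.array([[1, 2], [3, 4], [5, 6]])

# Column names are illustrative; without them Pandas labels the columns 0, 1, ...
df_from_array = pd.DataFrame(array, columns=['A', 'B'])
print(df_from_array)
```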
print(df.head()) # Default is 5
print(df.tail(2)) # View last 2 rows
df.info() # Prints a summary itself and returns None, so no print() is needed
print(df.describe())
print(df['Name']) # Select single column
print(df[['Name', 'City']]) # Select multiple columns
print(df.loc[0]) # Select the row with index label 0
print(df.loc[0:1]) # Select rows with labels 0 through 1 (the end label is included)
print(df.iloc[0]) # Select first row by position
print(df.iloc[0:2]) # Select first two rows by position
print(df[df['Age'] > 25]) # Select rows where Age > 25
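The difference between label-based loc and position-based iloc is easier to see when the index is not the default 0, 1, 2. The following is a minimal sketch; the people frame and its string labels are illustrative and not part of the examples above.

```python
import pandas as pd

# Illustrative frame with string index labels instead of the default integers
people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c'],
)

by_label = people.loc['a':'b']   # Label slice: the end label is included
by_position = people.iloc[0:1]   # Position slice: the end position is excluded

# Combine several conditions with & (and) or | (or), each wrapped in parentheses
filtered = people[(people['Age'] > 25) & (people['Name'] != 'Charlie')]

print(by_label)
print(by_position)
print(filtered)
```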
df['Country'] = ['USA', 'USA', 'USA']
print(df)
df['Age'] = df['Age'] + 1
print(df)
df = df.drop('Country', axis=1)
print(df)
print(df.isnull())
print(df.isnull().sum())
df = df.dropna() # Drop every row that contains a missing value
print(df)
df = df.fillna(0) # Alternatively, replace missing values with 0
print(df)
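The DataFrame used so far has no missing values, so the calls above leave it unchanged. As a rough illustration, this sketch uses a hypothetical scores frame with one missing entry to show what each call does.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing score
scores = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [90.0, np.nan, 75.0],
})

# Count the missing values in each column
missing_per_column = scores.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains a missing value
dropped = scores.dropna()
print(dropped)

# Option 2: keep all rows and fill the gap, here with 0 for the Score column only
filled = scores.fillna({'Score': 0})
print(filled)
```

Dropping and filling are alternatives: which one is appropriate depends on whether a placeholder value such as 0 would distort later calculations.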
Grouping data is useful for aggregating information based on certain criteria.
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'Year': [2020, 2020, 2020, 2021, 2021, 2021],
'Sales': [250, 300, 400, 200, 350, 300]
}
df = pd.DataFrame(data)
grouped = df.groupby('Name')['Sales'].sum() # Total sales per person; summing the Year column would be meaningless
print(grouped)
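A group can be summarized with more than one statistic at a time. The sketch below reuses the sales data from above (under the name sales, chosen here for clarity) and assumes the sum, mean, and maximum are the statistics of interest.

```python
import pandas as pd

sales = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Year': [2020, 2020, 2020, 2021, 2021, 2021],
    'Sales': [250, 300, 400, 200, 350, 300],
})

# Several aggregations of the Sales column for each person
summary = sales.groupby('Name')['Sales'].agg(['sum', 'mean', 'max'])
print(summary)

# Grouping by a different column: total sales per year
by_year = sales.groupby('Year')['Sales'].sum()
print(by_year)
```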
Merging allows you to combine two DataFrames based on a common column or index.
data1 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
data2 = {
'Name': ['Alice', 'Bob', 'David'],
'Salary': [50000, 60000, 70000]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged = pd.merge(df1, df2, on='Name', how='inner')
print(merged)
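The inner join above keeps only the names present in both DataFrames, so Charlie and David are dropped. As a sketch of the alternatives, the how parameter also accepts 'left', 'right', and 'outer'; the example below reuses the same two frames.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'], 'Salary': [50000, 60000, 70000]})

# Left join: keep every row of df1; Charlie has no salary, so it becomes NaN
left = pd.merge(df1, df2, on='Name', how='left')
print(left)

# Outer join: keep rows from both sides; indicator=True adds a _merge column
# showing whether each row came from the left frame, the right frame, or both
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(outer)
```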
Pandas can read data from various file formats including CSV, Excel, and SQL databases.
# Reading from a CSV file
df = pd.read_csv('data.csv')
# Reading from an Excel file (requires an Excel engine such as openpyxl to be installed)
df = pd.read_excel('data.xlsx')
# Reading from a SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)
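The snippets above assume that data.csv, data.xlsx, and database.db already exist. To try the reading functions without any files, the following sketch reads CSV text from an in-memory buffer and queries an in-memory SQLite database; the people table and its contents are made up for illustration.

```python
import io
import sqlite3

import pandas as pd

# read_csv accepts any file-like object, so a string buffer can stand in for a file
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Build a throwaway in-memory database with a hypothetical people table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (Name TEXT, Age INTEGER)')
conn.executemany('INSERT INTO people VALUES (?, ?)', [('Alice', 25), ('Bob', 30)])

# Pass query values through params instead of formatting them into the SQL string
df_sql = pd.read_sql_query('SELECT * FROM people WHERE Age > ?', conn, params=(25,))
print(df_sql)

conn.close()
```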
Pandas can write data to various file formats as well.
# Writing to a CSV file
df.to_csv('output.csv', index=False)
# Writing to an Excel file (requires an Excel engine such as openpyxl to be installed)
df.to_excel('output.xlsx', index=False)
# Writing to a SQL database
df.to_sql('table_name', conn, if_exists='replace', index=False)
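One way to check that a write worked is to read the data back. This is a minimal round-trip sketch that avoids touching the disk by using an in-memory buffer and an in-memory SQLite database; the people table name and the sample data are illustrative.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write CSV text into a buffer, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(csv_round_trip)

# Write to a SQL table, then query it back
conn = sqlite3.connect(':memory:')
df.to_sql('people', conn, if_exists='replace', index=False)
sql_round_trip = pd.read_sql_query('SELECT * FROM people', conn)
print(sql_round_trip)

conn.close()
```

Passing index=False keeps the row index out of the output; without it, the index is written as an extra column and reappears when the data is read back.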
This guide provides an overview of using Pandas for data manipulation and analysis: creating Series and DataFrames, selecting and modifying data, handling missing values, grouping, merging, and reading and writing files. With these operations you can handle and analyze datasets efficiently in Python. Practice with different datasets and explore the Pandas documentation for more functionality and use cases. Happy coding!