Welcome to one of the most useful libraries in Python! I'm Brahma, a passionate software developer documenting my learning journey through a series of blog posts. Stay tuned!
Introduction
Pandas, unlike the ones in the cover picture, can be complicated to understand and use at times. So I'm going to share the top 20 must-know snippets that will let you call yourself a LinkedIn Pandas expert. Jokes aside, let's dive in.
Creating a Pandas DataFrame
The first step before performing any operation on data is importing it. Of course!
The if-you-don't-have-data method:
Sounds funny? Because it is!
import numpy as np
import pandas as pd

# Generate a small demo DataFrame from a NumPy array
data = pd.DataFrame(np.arange(10).reshape(5, 2),
                    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                    columns=['Column1', 'Column2'])
data.head()
So, if Kaggle is alien territory for you, you can use the above method to generate a demo dataset. I wouldn't advise it, but it's okay if you don't have a dataset.
The all-customised method:
Using dictionaries and lists to create a DataFrame.

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

# Creating a DataFrame from a list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
No idea why you would do that, though.
The famous method:
Importing data from a CSV file.
import pandas as pd

# Reading the CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df)

# Make the first column the index column
df = pd.read_csv('data.csv', index_col=0)
print(df)
The jsonified method:
As the name suggests, it's importing from a JSON file.
import pandas as pd

# Read a JSON file into a DataFrame
df = pd.read_json('data.json')

# Display the DataFrame
print(df)
Accessing the Data
The loc method:
df.loc[row_label, column_label]
The iloc method (my fav):
df.iloc[row_index, column_index]
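To make the difference concrete, here is a minimal sketch (the DataFrame, names, and labels are made up for illustration): loc selects by label, iloc by integer position.

```python
import pandas as pd

# Sample DataFrame with string labels as the index
df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']},
    index=['Alice', 'Bob', 'Charlie']
)

# loc selects by label
print(df.loc['Bob', 'Age'])      # 30

# iloc selects by integer position; row 1, column 0 is the same cell
print(df.iloc[1, 0])             # 30

# Both accept slices; note that loc slices include the end label
print(df.loc['Alice':'Bob', 'Age'])
```

Note that df.loc['Alice':'Bob'] returns both rows, because label slices are inclusive of the end point, unlike positional slices.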
Inspecting the Data
head & tail:

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']
}
df = pd.DataFrame(data)

# View the first few rows
print("First few rows:")
print(df.head())

# View the last few rows
print("\nLast few rows:")
print(df.tail())
This is probably the first thing anyone does on receiving a dataset.
The Summary: info and describe
# Get information about the DataFrame
# (df.info() prints directly and returns None, so don't wrap it in print)
print("DataFrame info:")
df.info()

# Get summary statistics for numerical columns
print("\nSummary statistics:")
print(df.describe())
Miscellaneous:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']
}
df = pd.DataFrame(data)

# Get the values of the DataFrame as a NumPy array
print("DataFrame values:")
print(df.values)

# Get the counts of unique values
print("Value counts:")
print(df['Name'].value_counts())

# Get the unique values in the 'City' column
print("Unique cities:")
print(df['City'].unique())

# Inspect data types of columns
print("Data types of columns:")
print(df.dtypes)
Cleaning the Data
The Null Detector:
import pandas as pd

# Create a sample DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Emily'],
    'Age': [25, None, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', None]
}
df = pd.DataFrame(data)

# Detect missing values
print("Missing values:")
print(df.isnull())
The Null Remover:
# Drop rows with any missing values
print("Drop rows with any missing values:")
print(df.dropna())

# Drop columns with any missing values
print("Drop columns with any missing values:")
print(df.dropna(axis=1))
The Null Filler:
# Fill missing values with a specified value
print("Fill missing values with 0:")
print(df.fillna(0))

# Forward fill missing values (fillna(method='ffill') is deprecated)
print("Forward fill missing values:")
print(df.ffill())
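A common variant worth knowing is filling a numeric column with its own mean instead of a constant. A minimal sketch, assuming data like this section's sample:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Emily'],
    'Age': [25, None, 35, 40, 45],
})

# Fill missing ages with the mean of the non-missing values
# (mean of 25, 35, 40, 45 is 36.25)
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
```

This keeps the column's overall average unchanged, which is often preferable to filling with 0.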
The Convertor:
Convert data types of columns.
# Convert the 'Age' column to integers
print("Convert 'Age' column to integers:")
df['Age'] = df['Age'].fillna(0)  # Fill missing values first
print(df['Age'].astype(int))
The Doglapan Detector:
Doglapan, a.k.a. duplicates.
import pandas as pd

# Create a sample DataFrame with duplicates
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Emily'],
    'Age': [25, 30, 35, 25, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Miami']
}
df = pd.DataFrame(data)

# Detect duplicate rows
print("Duplicate rows:")
print(df.duplicated())
The Doglapan Remover:
# Drop duplicate rows
print("Drop duplicate rows:")
print(df.drop_duplicates())
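In practice you often want to deduplicate on specific columns rather than whole rows; drop_duplicates takes subset and keep parameters for that. A quick sketch with the same made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Emily'],
    'Age': [25, 30, 35, 25, 45],
})

# Deduplicate on Name only, keeping the last occurrence of each name
deduped = df.drop_duplicates(subset=['Name'], keep='last')
print(deduped)
```

With keep='last', the first Alice row (index 0) is dropped and the later one (index 3) survives; keep='first' is the default.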
Manipulating the Data
The filter:
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']
}
df = pd.DataFrame(data)

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print("Filtered DataFrame:\n", filtered_df)
The sorted method:
# Sort by Age in ascending order
sorted_df = df.sort_values(by='Age')
print("DataFrame sorted by Age:\n", sorted_df)

# Sort by index in descending order
sorted_index_df = df.sort_index(ascending=False)
print("\nDataFrame sorted by index:\n", sorted_index_df)
Adding & Removing Columns:
# Add a new column
df['Salary'] = [70000, 80000, 90000, 100000, 110000]
print("DataFrame with new column:\n", df)

# Remove a column
df = df.drop(columns=['City'])
print("\nDataFrame after removing column:\n", df)
The Aggregator:
# Aggregation using sum
sum_df = df.groupby('Name').sum()
print("Sum Aggregation:\n", sum_df)

# Aggregation using mean
mean_df = df.groupby('Name').mean()
print("\nMean Aggregation:\n", mean_df)

# Aggregation using count
count_df = df.groupby('Name').count()
print("\nCount Aggregation:\n", count_df)
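Beyond sum, mean, and count, groupby pairs with agg to compute several statistics in one call. A minimal sketch with made-up salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Salary': [70000, 80000, 90000, 100000],
})

# Multiple aggregations per group in one call
stats = df.groupby('Name')['Salary'].agg(['min', 'max', 'mean'])
print(stats)
```

The result is one row per group with a column per aggregation, which beats calling sum, mean, and count separately.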
Transforming the Data
apply and map:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

# Applying a function to each element of a column
df['A_squared'] = df['A'].apply(lambda x: x ** 2)

# Mapping values of a column to new values
df['B_mapped'] = df['B'].map({5: 'Five', 6: 'Six', 7: 'Seven', 8: 'Eight'})

print(df)
The Vectors:
# Vectorized addition of two columns
df['A_plus_B'] = df['A'] + df['B']
print(df)
Conclusion
So, that's all folks. Those were the 20 most important Pandas snippets (sounds weird, I know).
Keep coding, keep learning, and enjoy the endless possibilities that Python has to offer!
Leave a like and some lovely critiques in the comments.
Signing off!!