# Pandas Basics

Here are some basic concepts and tips for using Pandas, taken from [7] to [10].

```py
import pandas as pd
from sklearn.datasets import load_iris

# Load a dataset (CSV file)
titanic_df = pd.read_csv('titanic.csv')

# Build a DataFrame from the scikit-learn iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Export the DataFrame to a CSV file
df.to_csv('output_dataset.csv', index=False)

# Display basic information
df.info()

# Display the first few rows
titanic_df.head()

# Generate descriptive statistics
df.describe()
titanic_df.describe(include='all')
```

The `describe()` method provides an overview of key statistics such as mean, standard deviation, and quartiles for numerical columns. Adding `include='all'` also summarizes qualitative (string/object) columns.

```py
# Count missing values per column
titanic_df.isnull().sum()

# Fill missing values with a specific value (here, the column mean)
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

# Filter data based on a condition
titanic_df.loc[titanic_df['Age'] > 30]

# Sort data by a specific column
titanic_df_sorted = titanic_df.sort_values(by='Fare')
```

```py
# The 'Date', 'Price', and 'Category' columns below are placeholders;
# df is assumed to be a DataFrame that contains them.

# Convert a column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Extract month from the 'Date' column
df['Month'] = df['Date'].dt.month

# Remove duplicate rows based on selected columns
df_no_duplicates = titanic_df.drop_duplicates(subset=['PassengerId'])

# Rename columns for clarity
titanic_df = titanic_df.rename(columns={'SibSp': 'sibling_spouse'})

# Convert the 'Price' column to numeric (unparseable values become NaN)
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

# Apply a custom function to a column
df['Discounted_Price'] = df['Price'].apply(lambda x: x * 0.9)

# Convert a column to categorical type
df['Category'] = df['Category'].astype('category')
```

```py
# Group data by a categorical variable and calculate the mean
titanic_df.groupby('Sex')['Survived'].mean()

# Create a new column from existing columns
# ('SibSp' was renamed to 'sibling_spouse' above)
titanic_df['total_relative'] = titanic_df['sibling_spouse'] + titanic_df['Parch']

# Merge two DataFrames on a common column
# (df1 and df2 are placeholders that share an 'ID' column)
merged_df = pd.merge(df1, df2, on='ID')

# Pivot the data to reshape it
titanic_df_pivot = titanic_df.pivot_table(index='Survived', columns='Sex', values='Age', aggfunc='mean')
```

## Convert to Best Data Types Automatically

When we load data into a Pandas DataFrame, Pandas automatically assigns a data type to each column, which usually means `int`, `float`, or `object` [1]. We can instead make Pandas infer the best data type for each column [1].

We will use the Pandas `convert_dtypes()` function to convert the columns to the best data types automatically. Another big advantage of `convert_dtypes()` is that it supports the newer Pandas missing-value marker `pd.NA`.

```py
import pandas as pd

# Check the installed version
print(pd.__version__)

data_url = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
df = pd.read_csv(data_url)

# Inspect the dtypes assigned on load (info() prints its report directly)
df.info()
print(df.dtypes)

# Convert to the best available dtypes and inspect again
df = df.convert_dtypes()
print(df.dtypes)
```

By default, `convert_dtypes()` attempts to convert a Series (or each Series in a DataFrame) to dtypes that support `pd.NA`. The options `convert_string`, `convert_integer`, and `convert_boolean` make it possible to turn off individual conversions to `StringDtype`, the integer extension types, or `BooleanDtype`, respectively.

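As a minimal sketch of these options, the example below uses a small made-up DataFrame (the column names and values are hypothetical and are not part of the gapminder data). It first converts every column, then repeats the call with `convert_string=False` so that only the string conversion is skipped.

```py
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "name": ["Ada", "Grace", None],
    "visits": [1.0, 2.0, None],      # whole numbers stored as floats because of the gap
    "active": [True, False, None],
})
print(df.dtypes)

# Convert every column to a dtype that supports pd.NA
converted = df.convert_dtypes()
print(converted.dtypes)

# Skip only the string conversion; the other columns are still converted
no_strings = df.convert_dtypes(convert_string=False)
print(no_strings.dtypes)
```

Comparing the three printed dtype listings shows which columns each option affects. The missing entries remain missing after conversion; they are simply represented as `pd.NA` in the converted columns.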
## Dates

[11 Essential Tricks To Demystify Dates in Pandas](https://towardsdatascience.com/11-essential-tricks-to-demystify-dates-in-pandas-8644ec591cf1)

[Dealing With Dates in Pandas](https://towardsdatascience.com/dealing-with-dates-in-pandas-6-common-operations-you-should-know-1ea6057c6f4f)

## Iteration

[How To Loop Through Pandas Rows](https://cmdlinetips.com/2018/12/how-to-loop-through-pandas-rows-or-how-to-iterate-over-pandas-rows/amp/)

## String

[String Operations on Pandas DataFrame](https://blog.devgenius.io/string-operations-on-pandas-dataframe-88af220439d1)

## Indexes

[How To Convert a Column to Row Name/Index in Pandas](https://cmdlinetips.com/2018/09/how-to-convert-a-column-to-row-name-index-in-pandas/amp/)

[8 Quick Tips on Manipulating Index with Pandas](https://towardsdatascience.com/8-quick-tips-on-manipulating-index-with-pandas-c10ef9d1b44f)

## Functions

[apply() vs map() vs applymap() in Pandas](https://towardsdatascience.com/apply-vs-map-vs-applymap-pandas-529acdf6d744)

[How to Combine Data in Pandas](https://towardsdatascience.com/how-to-combine-data-in-pandas-5-functions-you-should-know-651ac71a94d6)

## Aggregate

[6 Lesser-Known Pandas Aggregate Functions](https://towardsdatascience.com/6-lesser-known-pandas-aggregate-functions-c9831b366f21)

[Pandas Groupby and Sum](https://cmdlinetips.com/2020/07/pandas-groupby-and-sum/amp/)

## Pivot

[5 Minute Guide to Pandas Pivot Tables](https://towardsdatascience.com/5-minute-guide-to-pandas-pivot-tables-df2d02786886)

## References

[1]: [20 Pandas Codes To Elevate Your Data Analysis Skills](https://medium.com/codex/20-pandas-codes-to-elevate-your-data-analysis-skills-b62671682190)

[2]: [Practical Pandas Tricks - Part 1: Import and Create DataFrame](https://towardsdatascience.com/introduction-to-pandas-part-1-import-and-create-dataframe-e53326b6e2b1)

[3]: [4 Must-Know Parameters in Python Pandas](https://towardsdatascience.com/4-must-know-parameters-in-python-pandas-6a4e36f6ddaf)

[4]: [How To Change Column Type in Pandas DataFrames](https://towardsdatascience.com/how-to-change-column-type-in-pandas-dataframes-d2a5548888f8)