If you are here, I assume you already know what pandas is and the use cases where it can make your life easier. So I will introduce you directly to some basic functions so that you can start working with it.

1. Install pandas with pip. Run the command below in your terminal and it will install pandas for you in no time.

pip3 install pandas

2. Read a CSV file. Add the code below to your Python file.

import pandas as pd 
data = pd.read_csv("/file_path/file_name.csv")

3. Print the first five rows of the CSV file you just loaded into a pandas data frame.

print(data.head())

4. Print the names of the columns in your CSV file.

print(data.columns.values)

5. Rename a column of the pandas data frame.

data.rename(columns={'SRVC_Account__c':'Account ID'}, inplace=True)

In the line of code above, I changed the column name from ‘SRVC_Account__c’ to ‘Account ID’.

6. Merge two CSV files on the basis of one column. For example, I have two CSV files that share a common column, account number, and I want to join the data from both files into one data frame. You can think of it as a join in SQL.

merged_dataset = pd.merge(data_1, data_2, on=['Account ID'], how='left')

data_1 is the first data frame and data_2 is the second. ‘Account ID’ is the key on which we want to join. The ‘how’ parameter sets the type of join to perform: ‘left’, ‘right’, ‘outer’, or ‘inner’. The default is ‘inner’, so if we don’t pass the ‘how’ parameter, an inner join is performed.
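To see how the ‘how’ parameter changes the result, here is a minimal runnable sketch with made-up frames (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: account 'A3' exists only in the first frame
data_1 = pd.DataFrame({'Account ID': ['A1', 'A2', 'A3'],
                       'region': ['US', 'EU', 'APAC']})
data_2 = pd.DataFrame({'Account ID': ['A1', 'A2'],
                       'revenue': [100, 200]})

# Left join keeps every row of data_1; unmatched rows get NaN revenue
left = pd.merge(data_1, data_2, on=['Account ID'], how='left')
# Inner join (the default) keeps only accounts present in both frames
inner = pd.merge(data_1, data_2, on=['Account ID'])
```

Here the left join returns three rows (with NaN revenue for ‘A3’) while the inner join returns only two.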

7. Drop columns from the data frame.

new_data = data.drop(['Case ID', 'Churn Date'], axis=1)

Here we delete two columns from the data frame.

8. Change the values of a pandas data frame column based on other columns.

import numpy as np

df = pd.read_csv("test.csv")
df['output'] = np.where((df['id_1'] == 1) | (df['id_2'] == 1) | (df['id_3'] == 1), 1, 0)

Here we set the output column to 1 when any of the id_1, id_2, or id_3 columns is 1, and to 0 otherwise.
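A small runnable sketch of the same pattern, using hypothetical flag columns:

```python
import numpy as np
import pandas as pd

# Hypothetical flag columns: output is 1 when any of the three ids equals 1
df = pd.DataFrame({'id_1': [1, 0, 0],
                   'id_2': [0, 0, 1],
                   'id_3': [0, 0, 0]})
df['output'] = np.where((df['id_1'] == 1) | (df['id_2'] == 1) | (df['id_3'] == 1), 1, 0)
```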

9. Count the values in a column of the pandas data frame.

dataset['IsWon'].value_counts()

The line above prints all the distinct values present in the IsWon column and their frequency in that column:
False 252
True 68
Name: IsWon, dtype: int64

10. Merge two columns in pandas and store the result in another column.

df["unique_id"] = df["Opportunity_Id"].map(str) + "-" + df["End_Customer_Key"].map(str)

11. Get the unique values in a column.

dataset['education'].unique()

12. List all the columns in the dataset that have null values.

titanic_train_file.columns[titanic_train_file.isnull().any()].tolist()
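A runnable sketch of the same idea, with a hypothetical frame where only two columns contain NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where only 'Age' and 'Cabin' contain NaN
titanic_train_file = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Age': [22.0, np.nan, 38.0],
    'Cabin': [np.nan, 'C85', np.nan],
})
cols_with_null = titanic_train_file.columns[titanic_train_file.isnull().any()].tolist()
```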

13. Filter the dataset on the basis of one column.

titanic_train_file.loc[titanic_train_file['Survived'] == 1]

or

dataset1 = dataset[(dataset['IsClosed'] == True)]

14. Group by and aggregate on the pandas dataframe.

titanic_train_file.loc[titanic_train_file['Survived'] == 1].groupby(['Sex', 'Survived']).agg({'Survived': 'count'})
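A simplified, self-contained sketch of the group-by pattern, using made-up Titanic-style rows and grouping on one column:

```python
import pandas as pd

# Hypothetical Titanic-style rows: count survivors per sex
titanic_train_file = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [1, 1, 1, 0],
})
survivors = (titanic_train_file.loc[titanic_train_file['Survived'] == 1]
             .groupby('Sex')['Survived']
             .count())
```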

15. Number of rows in the pandas dataframe:

row_count = titanic_train_file.shape[0]

16. Convert a True/False column into 0/1.

dataset1['HasOpenActivity'] = np.where(dataset1['HasOpenActivity'] == True, 1, dataset1['HasOpenActivity'])
dataset1['HasOpenActivity'] = np.where(dataset1['HasOpenActivity'] != 1, 0, dataset1['HasOpenActivity'])
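If the column is strictly boolean with no missing values, a shorter equivalent (a sketch under that assumption) is astype(int):

```python
import pandas as pd

# Hypothetical boolean column; assumes there are no NaN values
dataset1 = pd.DataFrame({'HasOpenActivity': [True, False, True]})
dataset1['HasOpenActivity'] = dataset1['HasOpenActivity'].astype(int)
```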

17. Log transformation of a column of the pandas dataframe:

dataset1["Amount"] = dataset1["Amount"].apply(np.log)

18. Difference between two dates in terms of days.

# Convert both columns to datetime, subtract them, and take the number of days.
dataset1['CloseDate'] = pd.to_datetime(dataset1['CloseDate'])
dataset1['CreatedDate'] = pd.to_datetime(dataset1['CreatedDate'])
dataset1['no_of_days'] = (dataset1['CloseDate'] - dataset1['CreatedDate']).dt.days
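A runnable sketch with hypothetical dates:

```python
import pandas as pd

# Hypothetical creation/close dates
dataset1 = pd.DataFrame({
    'CreatedDate': ['2021-01-01', '2021-02-10'],
    'CloseDate': ['2021-01-11', '2021-03-12'],
})
dataset1['CloseDate'] = pd.to_datetime(dataset1['CloseDate'])
dataset1['CreatedDate'] = pd.to_datetime(dataset1['CreatedDate'])
dataset1['no_of_days'] = (dataset1['CloseDate'] - dataset1['CreatedDate']).dt.days
```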

19. Get the quarter number from a date in a pandas dataframe.

dataset1['CreatedDate'] = pd.to_datetime(dataset1['CreatedDate'])
dataset1['created_quarter'] = dataset1['CreatedDate'].dt.quarter

20. Set a limit on the number of rows and columns displayed in pandas. By default, pandas doesn’t show all rows and columns when there are many of them.

pd.set_option('display.max_rows', 200) 
pd.set_option('display.max_columns', 200)

21. Replace NaN with an empty string in pandas.

dataset2 = dataset2.replace(np.nan, '', regex=True)
# Similarly, remove hyphens from a string column:
df['AC_NBR'] = df['AC_NBR'].str.replace('-', '')

22. Replace infinity values with 0 in pandas dataframe.

data["followrVsFriend"] = data["followrVsFriend"].replace([np.inf, -np.inf], 0)

23. Append two pandas dataframes.

# DataFrame.append was removed in pandas 2.0; use pd.concat instead
data = pd.concat([positive, negative])

24. Select a slice of records from the pandas dataframe.

negative = negative[2:3105]

25. Filter data from the pandas dataframe.

positive = data[data['y'].isin([1])]
# you will get only the rows where column y has the value 1.

26. Get n random records from a pandas dataframe.

data = data.sample(n=3)
# you will get 3 random records from the dataframe.

27. Shuffle the records in a pandas dataframe.

data = data.sample(frac=1)

28. Add a new column to a pandas dataframe:

# note: this stores the literal string 'null'; use np.nan for a real missing value
dataset_part1['updated_stage_name'] = 'null'

29. Sort a pandas dataframe:

# sort the dataframe on two columns, OpportunityId and CreatedDate
df = df.sort_values(by=['OpportunityId','CreatedDate'])

30. Iterate over a dataframe in pandas:

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

31. Show more columns and rows in the pandas display output:

pd.set_option('display.max_rows', 500) 
pd.set_option('display.max_columns', 500) 
pd.set_option('display.width', 1000)

32. Get a value from a pandas DataFrame, not the index and object type.

df[df.Letters=='C'].Letters.item()

Use item() at the end to get only the value, not the index and dtype. Note that item() raises an error if the filter matches more than one row.
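A self-contained sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame; the filter must match exactly one row for item() to work
df = pd.DataFrame({'Letters': ['A', 'B', 'C'], 'Numbers': [1, 2, 3]})
value = df[df.Letters == 'C'].Numbers.item()
```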

33. Count the number of null (NaN) values in a column in pandas.

dataset['BillableFees'].isna().sum()

34. Check whether two columns are the same in pandas.

dataset1['col1'].equals(dataset1['col2'])

It returns True if the columns are identical, and False otherwise.

35. Get the maximum length of the strings in a pandas column.

import numpy as np
measurer = np.vectorize(len)
res1 = measurer(MBSR01_df.part_identifier.values.astype(str)).max(axis=0)
res1

36. Add a column to store the row number.

AW01_df['ID'] = np.arange(len(AW01_df))

37. Basic preprocessing of the column names: convert to lower case, strip whitespace, and remove spaces and special characters.

# Clean column names: lower-case them, strip whitespace, and remove spaces and special characters
df.columns = [c.lower() for c in df.columns.values.tolist()]
df.columns = [c.strip() for c in df.columns.values.tolist()]
df.columns = [c.replace(" ", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace("/", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace("(", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace(")", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace(".", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace("…", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace("'", "") for c in df.columns.values.tolist()] 
df.columns = [c.replace("#", "number") for c in df.columns.values.tolist()] 
df.columns = [c.replace("  ", "") for c in df.columns.values.tolist()]
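The eleven passes above can be collapsed into one chained expression. This is a sketch that should be equivalent for the characters listed; the sample column names are made up:

```python
import pandas as pd

# Hypothetical messy column names
df = pd.DataFrame(columns=[' First Name ', 'Amount (USD)', 'Ticket#'])

# Strip, lower-case, map '#' to 'number', then drop spaces and special characters
df.columns = (df.columns.str.strip()
              .str.lower()
              .str.replace('#', 'number', regex=False)
              .str.replace(r"[ /().…']", '', regex=True))
```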

38. Store a pandas dataframe in a CSV file.

# index=False won't write the index column
df.to_csv('../data/csv/df_updated.csv',index=False)

39. Convert a float column into int type and fill null values with -1.

df['column_name'] = df['column_name'].fillna(-1).astype(int)
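A runnable sketch with a hypothetical column:

```python
import pandas as pd

# Hypothetical float column containing a missing value
df = pd.DataFrame({'column_name': [1.0, None, 3.0]})
df['column_name'] = df['column_name'].fillna(-1).astype(int)
```

If you would rather keep real missing values instead of a -1 sentinel, pandas’ nullable integer dtype, astype('Int64'), is an alternative.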

40. Drop rows that have null values.

# Drop rows which have null value
dataset = dataset.dropna()
# Drop rows having Null value for OpportunityKey column
oppty_hash_df_wc = oppty_hash_df_wc.dropna(subset=['OpportunityKey'])

41. Add an interaction term to the model.

import statsmodels.api as sm
from patsy import dmatrices

Y, X = dmatrices('is_won ~ x1 + x2*x3 + x4 + x5',
                 data=dataset, return_type='dataframe')
# X will contain x1, x2, x3, the interaction x2:x3, x4, x5 (plus an intercept),
# because x2*x3 expands to x2 + x3 + x2:x3 in the formula syntax

42. Strip spaces from the strings in a column.

df['column1'] = df['column1'].str.strip()

43. Check whether a data frame column contains a specific substring.

# select all the rows where column1 contains the substring '.0'
df[df['column1'].str.contains(r'\.0')]

44. Pad strings with zeros at the start to make them a fixed length.

# pad column1 with leading zeros up to a width of 15 characters
df['column1'] = df['column1'].str.zfill(15)

45. Concatenate two columns and create a new column.

# Create Part Number and Serial Number Combinations
MBSR01_df["key"] = MBSR01_df["Rcvd_PN"] + "_" + MBSR01_df["RCVD_SN"]

46. Split a string column on a character and keep the first part only.

MBSR01_df["part_group_clean"] = MBSR01_df['Part_Group'].str.split('-').str[0]

47. Copy one dataframe into another dataframe.

boeing_allrem = boeing.copy()

48. If the same record appears multiple times, keep the latest record and drop the duplicates.

boeing = boeing.sort_values(['Rmvl_Record_ID', 'file_date'], ascending=[False, False])
boeing.drop_duplicates(subset ="Rmvl_Record_ID", keep = "first", inplace = True)
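A self-contained sketch of the same pattern with made-up records:

```python
import pandas as pd

# Hypothetical removal records: ID 1 appears twice; keep the newest file_date
boeing = pd.DataFrame({
    'Rmvl_Record_ID': [1, 1, 2],
    'file_date': ['2021-01-01', '2021-06-01', '2021-03-01'],
})
boeing = boeing.sort_values(['Rmvl_Record_ID', 'file_date'], ascending=[False, False])
boeing.drop_duplicates(subset='Rmvl_Record_ID', keep='first', inplace=True)
```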

49. Get a substring based on character positions.

oppty_hash_df_wc['creation_month'] = oppty_hash_df_wc['DateCreatedKey'].str[4:6]

50. Convert a string field into a numeric field.

df = df[pd.to_numeric(df.OpportunityKey, errors='coerce').notnull()]
df.OpportunityKey = df.OpportunityKey.astype(int)
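A runnable sketch with a hypothetical key column (the .copy() avoids pandas' chained-assignment warning on the filtered frame):

```python
import pandas as pd

# Hypothetical key column where one value is not numeric
df = pd.DataFrame({'OpportunityKey': ['101', 'abc', '305']})
df = df[pd.to_numeric(df.OpportunityKey, errors='coerce').notnull()].copy()
df['OpportunityKey'] = df['OpportunityKey'].astype(int)
```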

51. Get the length of a string column in pandas.

phone_df['phone_date_length'] = phone_df.CreatedOnDateKey.str.len()

52. Join two dataframes on opportunity ID and count the number of phone calls that satisfy a date condition: count only the phone calls made between the opportunity creation date and the application date.

"""
     Count the number of phone calls made for each opportunity between the lead creation date and the applied date.
     If an opportunity never reached the applied stage, count all of its phone calls.
     1. Find all the phone calls for the opportunities present, then filter out the calls made after the applied date.
     2. Count the phone call records belonging to each opportunity.
 """
df = pd.merge(df, phone_df_wc, on='OpportunityKey', how='inner')
df['CreatedOnDateKey'] = df['CreatedOnDateKey'].apply(str)
df['CreatedOnDateKey_formatted'] = pd.to_datetime(df['CreatedOnDateKey'])
df['days_diff'] = df['DateAppliedKey_formatted'].sub(df['CreatedOnDateKey_formatted'], axis=0)
df['days_diff_int'] = df['days_diff'] / np.timedelta64(1, 'D')
df = df[df['days_diff_int'] > 0]
df_grouped = df.groupby('OpportunityKey')
df_grouped_count = df_grouped['OpportunityKey'].agg(['count'])
df_grouped_count = df_grouped_count.reset_index(level=['OpportunityKey'])