https://www.youtube.com/watch?v=Z9ekw2Ou3s0
df.describe?  # in Jupyter, pops up the help
df.corr??     # in Jupyter, brings up the source code
df.memory_usage(deep=True).sum()
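A minimal sketch of what `deep=True` changes (toy frame, names made up): for object columns, pandas measures the actual Python strings instead of just counting 8-byte pointers, so the deep number is larger and more honest.

```python
import pandas as pd

df = pd.DataFrame({'name': pd.Series(['alice', 'bob', 'carol'], dtype='object'),
                   'score': [1.0, 2.0, 3.0]})

shallow = df.memory_usage().sum()        # counts only array/pointer sizes
deep = df.memory_usage(deep=True).sum()  # also measures the string objects

print(shallow, deep)  # deep is larger for object columns
```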
New syntax idea for using Pandas (for chaining):
(df
.select_dtypes('float64')
.describe()
)
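A self-contained sketch of that chain on a toy frame (column names here are made up): each step returns a new DataFrame, so the expression reads top to bottom.

```python
import pandas as pd

df = pd.DataFrame({'impressions': [100.0, 250.0, 80.0],
                   'engagements': [5.0, 12.0, 3.0],
                   'text': ['a', 'b', 'c']})

summary = (df
    .select_dtypes('float64')  # keep only the float columns
    .describe()                # summary statistics per column
)
print(summary)
```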
impressions and engagements are float columns. We can convert them to ints with:
(df
.assign(impressions=df.impressions.astype(int),
engagements=df.engagements.astype(int)
)
)
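A runnable sketch of the conversion on a toy frame. Note that `df.impressions` in the snippet above refers back to the original frame; inside longer chains, a `lambda df_: ...` keeps the expression bound to the intermediate result instead.

```python
import pandas as pd

df = pd.DataFrame({'impressions': [100.0, 250.0],
                   'engagements': [5.0, 12.0]})

converted = (df
    .assign(impressions=lambda df_: df_.impressions.astype(int),
            engagements=lambda df_: df_.engagements.astype(int))
)
print(converted.dtypes)
```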
To fix column names with spaces:
(df
.rename(columns=lambda col_name: col_name.replace(' ', '_'))
)
To filter columns whose name contains the string 'promoted':
df.filter(regex=r'promoted')
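A quick demonstration on a toy frame (column names are made up): `.filter(regex=...)` keeps every column whose name matches the pattern.

```python
import pandas as pd

df = pd.DataFrame({'impressions': [1],
                   'promoted_impressions': [2],
                   'promoted_clicks': [3]})

promoted = df.filter(regex=r'promoted')  # keep matching columns only
print(promoted.columns.tolist())
```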
It is also possible to define custom functions and apply them in a chain via .pipe():
def drop_col(df_, pattern):
    return df_.drop(columns=[c for c in df_.columns if pattern in c])
(df
.rename(columns=lambda col_name: col_name.replace(' ', '_'))
.pipe(drop_col, pattern='promoted')
.drop(columns=['Tweet_id', 'permalink_clicks', 'app_opens', 'app_installs', 'email_tweet', 'dial_phone'])
)
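A self-contained version of that pipeline on a toy frame (the real notes drop many more columns): `.pipe(f, ...)` calls `f` with the intermediate DataFrame as its first argument.

```python
import pandas as pd

def drop_col(df_, pattern):
    """Drop every column whose name contains `pattern`."""
    return df_.drop(columns=[c for c in df_.columns if pattern in c])

df = pd.DataFrame({'likes': [1], 'promoted likes': [2], 'Tweet id': [3]})

cleaned = (df
    .rename(columns=lambda col_name: col_name.replace(' ', '_'))
    .pipe(drop_col, pattern='promoted')  # frame is passed as first arg
)
print(cleaned.columns.tolist())
```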
The same could be done inline with .pipe(lambda df_: df_.drop(columns=[c for c in df_.columns if 'promoted' in c]))
Using double-star (**) unpacking and lambdas for some deeper magic:
(df
.rename(columns=lambda col_name: col_name.replace(' ', '_'))
.pipe(lambda df_: df_.drop(columns=[c for c in df_.columns if 'promoted' in c]))
.drop(columns=['Tweet_id', 'permalink_clicks', 'app_opens', 'app_installs', 'email_tweet', 'dial_phone'])
.assign(impressions=df.impressions.astype('uint32'),
engagements=df.engagements.astype('uint16'),
**{c:lambda df_, c=c:df_[c].astype('uint8') for c in ['replies', 'hashtag_clicks', 'follows']}, # less than 255
**{c:lambda df_, c=c:df_[c].astype('uint16') for c in ['retweets', 'likes', 'user_profile_clicks', 'url_clicks',
'detail_expands', 'media_views', 'media_engagements']} # less than 65,535
)
.describe()
)
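The `**{...}` trick above deserves a small standalone sketch (toy columns, made-up names). The `c=c` default is essential: `.assign` calls the lambdas later, and without freezing `c` as a default argument, every lambda would see the loop variable's final value.

```python
import pandas as pd

df = pd.DataFrame({'replies': [1.0, 2.0],
                   'follows': [3.0, 4.0],
                   'retweets': [500.0, 900.0]})

small_cols = ['replies', 'follows']  # values fit in uint8 (< 256)

shrunk = df.assign(
    # c=c freezes the current loop value; .assign calls each lambda
    # with the intermediate frame when the chain is evaluated
    **{c: lambda df_, c=c: df_[c].astype('uint8') for c in small_cols},
    retweets=lambda df_: df_.retweets.astype('uint16'),
)
print(shrunk.dtypes)
```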
To store a constant text column as a category (this is an .assign keyword argument):
Tweet_permalink=lambda df_: pd.Series('https://twitter.com/__mharrison__/status/',
                                      dtype='category', index=df_.index),
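In context, a runnable sketch on a toy frame: since the column holds a single repeated string, the 'category' dtype stores it once instead of per row.

```python
import pandas as pd

df = pd.DataFrame({'likes': [1, 2, 3]})

with_link = df.assign(
    Tweet_permalink=lambda df_: pd.Series(
        'https://twitter.com/__mharrison__/status/',
        dtype='category', index=df_.index)
)
print(with_link.Tweet_permalink.dtype)
```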
If we have time column data like this:
0   2020-01-02 03:44:00+00:00
1   2020-01-02 03:52:00+00:00
2   2020-01-02 05:56:00+00:00
3   2020-01-03 01:41:00+00:00
4   2020-01-03 02:16:00+00:00
we can do this:
.assign(
time=lambda df_: df_.time.dt.tz_convert('America/Denver')
)
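A self-contained sketch of the timezone conversion (two sample timestamps from the data above): `tz_convert` requires the column to already be timezone-aware, which the `+00:00` offsets guarantee here.

```python
import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime(
    ['2020-01-02 03:44:00+00:00', '2020-01-02 03:52:00+00:00'])})

local = df.assign(
    time=lambda df_: df_.time.dt.tz_convert('America/Denver')
)
print(local.time.iloc[0])  # Denver is UTC-7 in January
```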