Matt Harrison’s course on YouTube

https://www.youtube.com/watch?v=Z9ekw2Ou3s0


df.describe?  # in Jupyter, pops up the help (docstring)

df.corr??  # in Jupyter, brings up the source code

df.memory_usage(deep=True).sum()  # total memory in bytes; deep=True also counts the contents of object (string) columns

A different syntax style for pandas: wrap the expression in parentheses and chain one method per line:

(df
 .select_dtypes('float64')
 .describe()
)
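
As a self-contained illustration of the chained style, here is a minimal sketch on a made-up frame. The column names and values are invented for the example and are not the course data.

```python
import pandas as pd

# Toy data; the columns are hypothetical stand-ins for the real export.
df = pd.DataFrame({
    "impressions": [100.0, 250.0, 40.0],
    "engagements": [3.0, 10.0, 1.0],
    "text": ["a", "b", "c"],
})

summary = (df
 .select_dtypes('float64')   # keep only the float columns
 .describe()                 # summary statistics for those columns
)
print(summary)
```

The parentheses are what allow each method call to sit on its own line, so steps can be commented out one at a time while exploring.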

The impressions and engagements columns hold floats. We can convert them to integers with:

(df
    .assign(impressions=df.impressions.astype(int),
            engagements=df.engagements.astype(int)    
    )
)
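
One caveat worth checking before this conversion: astype(int) raises an error if the column contains missing values. A hedged sketch on toy data, where fillna(0) is just one assumed way of handling them (whether 0 is the right replacement depends on the data) and the lambda form keeps the step chain-friendly:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value; not the course data.
df = pd.DataFrame({
    "impressions": [100.0, 250.0, np.nan],
    "engagements": [3.0, 10.0, 1.0],
})

converted = (df
    .assign(impressions=lambda df_: df_.impressions.fillna(0).astype(int),  # assumption: NaN means 0
            engagements=lambda df_: df_.engagements.astype(int)
    )
)
print(converted.dtypes)
```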

To fix column names with spaces:

(df
    .rename(columns=lambda col_name: col_name.replace(' ', '_'))
)

To select the columns whose names contain the string 'promoted':

df.filter(regex=r'promoted')
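
A small sketch of what filter returns, on an invented frame (the column names are assumptions for the example). Note that filter matches on column labels, not on the values in the rows.

```python
import pandas as pd

# Toy frame; column names are made up for illustration.
df = pd.DataFrame({
    "impressions": [100, 250],
    "promoted_impressions": [0, 0],
    "promoted_likes": [0, 0],
    "likes": [5, 9],
})

# Keep only columns whose label matches the regular expression.
promoted = df.filter(regex=r'promoted')
print(list(promoted.columns))  # ['promoted_impressions', 'promoted_likes']
```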

It is also possible to define custom functions and use them in a chain via pipe():

def drop_col(df_, pattern):
    return df_.drop(columns=[c for c in df_.columns if pattern in c])

(df
    .rename(columns=lambda col_name: col_name.replace(' ', '_'))
    .pipe(drop_col, pattern='promoted')
    .drop(columns=['Tweet_id', 'permalink_clicks', 'app_opens', 'app_installs', 'email_tweet', 'dial_phone'])
)

The same can be done inline with a lambda:

    .pipe(lambda df_: df_.drop(columns=[c for c in df_.columns if 'promoted' in c]))
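
To check that the named helper and the inline lambda really give the same result, here is a runnable sketch on a tiny invented frame (the three columns are placeholders, not the full export):

```python
import pandas as pd

def drop_col(df_, pattern):
    """Drop every column whose name contains `pattern`."""
    return df_.drop(columns=[c for c in df_.columns if pattern in c])

# Toy frame with spaces in the column names, as in the raw export.
df = pd.DataFrame({
    "Tweet id": [1, 2],
    "promoted likes": [0, 0],
    "likes": [5, 9],
})

with_helper = (df
    .rename(columns=lambda col_name: col_name.replace(' ', '_'))
    .pipe(drop_col, pattern='promoted')   # extra arguments are passed through to drop_col
)

with_lambda = (df
    .rename(columns=lambda col_name: col_name.replace(' ', '_'))
    .pipe(lambda df_: df_.drop(columns=[c for c in df_.columns if 'promoted' in c]))
)

print(list(with_helper.columns))  # ['Tweet_id', 'likes']
```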

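Fancy: using double-star (**) dictionary unpacking and lambdas, several columns can be converted in a single assign. The c=c default argument fixes the column name at the moment each lambda is created: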

(df
 .rename(columns=lambda col_name: col_name.replace(' ', '_'))
 .pipe(lambda df_: df_.drop(columns=[c for c in df_.columns if 'promoted' in c]))
 .drop(columns=['Tweet_id', 'permalink_clicks', 'app_opens', 'app_installs', 'email_tweet', 'dial_phone'])
 .assign(impressions=lambda df_: df_.impressions.astype('uint32'),
         engagements=lambda df_: df_.engagements.astype('uint16'),
         **{c: lambda df_, c=c: df_[c].astype('uint8') for c in ['replies', 'hashtag_clicks', 'follows']},  # max is below 255
         **{c: lambda df_, c=c: df_[c].astype('uint16') for c in ['retweets', 'likes', 'user_profile_clicks', 'url_clicks',
                                          'detail_expands', 'media_views', 'media_engagements']}  # max is below 65,535
        )
 .describe()
)
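
Why the c=c matters: a lambda created in a comprehension looks up the loop variable only when it is called, so without the default argument every lambda would see the last column name. A minimal sketch on toy data (two invented columns) that shows the difference:

```python
import pandas as pd

# Toy data; column names borrowed from the notes, values invented.
df = pd.DataFrame({"replies": [1, 2], "follows": [3, 4]})
cols = ["replies", "follows"]

# Without c=c, each lambda reads `c` at call time, when it is already 'follows'.
late = {c: (lambda df_: df_[c].astype('uint8')) for c in cols}

# With c=c, each lambda keeps the column name from its own iteration.
bound = {c: (lambda df_, c=c: df_[c].astype('uint8')) for c in cols}

wrong = df.assign(**late)    # both columns end up as copies of 'follows'
right = df.assign(**bound)   # each column is converted from itself

print(wrong.replies.tolist())  # [3, 4]
print(right.replies.tolist())  # [1, 2]
```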

To store a repeated text column as a categorical (this fragment goes inside .assign(...)):

         Tweet_permalink=lambda df_: pd.Series('https://twitter.com/__mharrison__/status/', dtype='category', 
                                               index=df_.index),
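
The point of the category dtype here is memory: a value repeated on every row is stored once, plus a small integer code per row. A rough sketch to see the effect, using a made-up URL and astype('category') on an existing column rather than the constructor form above:

```python
import pandas as pd

# 1000 rows holding the same (hypothetical) base URL.
df = pd.DataFrame({"base_url": ["https://example.com/status/"] * 1000})

as_text = df.base_url.memory_usage(deep=True)
as_category = df.base_url.astype('category').memory_usage(deep=True)

print(as_text, as_category)  # the categorical version should be much smaller
```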

If we have a timezone-aware time column (here in UTC) like this:

0   2020-01-02 03:44:00+00:00
1   2020-01-02 03:52:00+00:00
2   2020-01-02 05:56:00+00:00
3   2020-01-03 01:41:00+00:00
4   2020-01-03 02:16:00+00:00

we can convert it to a local time zone:

.assign(
       time=lambda df_: df_.time.dt.tz_convert('America/Denver')
)
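
A runnable sketch of the conversion, using the first two timestamps from the sample above. tz_convert only works on timezone-aware data; a naive column would need dt.tz_localize first. Passing utc=True to to_datetime is an assumption about how the column was parsed.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-02 03:44:00+00:00",
                            "2020-01-02 03:52:00+00:00"], utc=True),
})

local = (df
    .assign(
        time=lambda df_: df_.time.dt.tz_convert('America/Denver')
    )
)
print(local.time.iloc[0])  # 2020-01-01 20:44:00-07:00
```

The instants are unchanged; only the wall-clock representation moves (Denver is UTC-7 in January).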