
DATA MANIPULATION OVERVIEW

Angela Yu


PART 1: Data Cleaning

In this section, we discuss data cleaning using two datasets: sample Twitter data called "data.csv" and personal information data called "person_info.csv".


Download the sample Twitter data here:

In this data, there are 5 rows (observations) and 9 columns (variables), namely:

  • id: int, user id number (we changed this to 1-5 to protect privacy)

  • from_user: str, Twitter user name (we randomly added "oscr" to the real user names to protect privacy)

  • text: str, content of the tweet (text beginning with "RT @..." means the tweet is a retweet)

  • created_at: str/datetime, time the original post was created (not the retweet)

  • time: str/datetime, time the tweet was created (covers both retweets and original posts)

  • user_lang: str, abbreviation for the user's language (two values appear here: "en" for English and "es" for Spanish)

  • user_followers_count: int, number of Twitter accounts that follow this account

  • user_freinds_count: int, number of Twitter accounts followed by this account (manipulated)

  • user_location: str, geographic location (manipulated)
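As a rough illustration of how a file with these columns could be loaded and inspected in Python, here is a minimal sketch using only the standard library. The column names follow the list above, but every value in `SAMPLE_CSV` is invented, the timestamp format is an assumption, and the helper names `load_rows` and `is_retweet` are hypothetical. The linked Python pages may take a different approach.

```python
import csv
import io

# Hypothetical stand-in for data.csv: the header follows the column list above,
# but every value below is invented for illustration (including the date format).
SAMPLE_CSV = """id,from_user,text,created_at,time,user_lang,user_followers_count,user_freinds_count,user_location
1,oscr_alice,RT @someone: hello world,2020-01-01 10:00:00,2020-01-02 09:30:00,en,120,80,Los Angeles
2,oscr_bob,Buenos dias,2020-01-03 08:15:00,2020-01-03 08:15:00,es,45,60,Madrid
"""


def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def is_retweet(text):
    """Treat a tweet whose text begins with 'RT @' as a retweet."""
    return text.startswith("RT @")


rows = load_rows(io.StringIO(SAMPLE_CSV))
columns = list(rows[0].keys())

# Report the number of observations and variables, then flag retweets.
print(f"{len(rows)} rows, {len(columns)} columns")
for row in rows:
    print(row["id"], row["user_lang"], "retweet" if is_retweet(row["text"]) else "original")
```

To try this on the downloaded file instead of the inline sample, the same function can be called as `load_rows(open("data.csv", newline="", encoding="utf-8"))`, assuming the file is UTF-8 encoded and has a header row.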


Download the personal information data here:



  • Columns 1-10 are self-explanatory

  • car_1: the license plate

  • gpa: the students' college ID

  • year: the students' year in college

  • class_of: the students' class in college (corresponding to year)

  • online_signiture: simulated student signatures

Some missing and irregular values have been introduced into the data. They are discussed in later sections.


For data cleaning, we want to accomplish the following tasks:


How to Check & Clean Missing Values / NA? - Using Python | R | Excel

How to Check & Convert Data Types? - Using Python | R | Tableau | Excel

How to Check & Clean Outliers? - Using Python | R

Please click on the links above to see the solution to each question in different programming languages or software.
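To give a feel for the three tasks before following the links, here is a minimal Python sketch using only the standard library. The records are invented, the set of strings treated as "missing" is an assumption, and the 1.5 × IQR rule is just one common convention for flagging outliers, not necessarily the method used on the linked pages. The helper names `to_int_or_none` and `iqr_bounds` are hypothetical.

```python
import statistics

# Hypothetical records standing in for rows read from a CSV file.
# CSV readers return every field as a string; all values here are invented.
records = [
    {"id": "1", "user_followers_count": "120"},
    {"id": "2", "user_followers_count": ""},
    {"id": "3", "user_followers_count": "95"},
    {"id": "4", "user_followers_count": "NA"},
    {"id": "5", "user_followers_count": "110"},
    {"id": "6", "user_followers_count": "130"},
    {"id": "7", "user_followers_count": "105"},
    {"id": "8", "user_followers_count": "98000"},
]

# Assumed spellings of "missing"; adjust to whatever the real file uses.
MISSING_MARKERS = {"", "NA", "N/A", "NaN", "null"}


def to_int_or_none(value):
    """Convert a string to int, returning None for missing or malformed values."""
    if value is None or value.strip() in MISSING_MARKERS:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def iqr_bounds(values):
    """Return (low, high) limits using the 1.5 * IQR convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread


# Task 1 and 2: convert the data type and record which rows are missing.
counts = {r["id"]: to_int_or_none(r["user_followers_count"]) for r in records}
missing_ids = [key for key, value in counts.items() if value is None]

# Task 3: flag values outside the IQR limits, ignoring missing entries.
present = [value for value in counts.values() if value is not None]
low, high = iqr_bounds(present)
outlier_ids = [
    key for key, value in counts.items()
    if value is not None and not (low <= value <= high)
]

print("missing:", missing_ids)
print("outliers:", outlier_ids)
```

With very small samples the quartiles themselves are pulled around by extreme values, so the flagged rows are best treated as candidates for review rather than automatic deletions.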


PART 2: Data Wrangling

We use the "person_info" data described above for this section. To handle the data types common in digital humanities and social science research, we focus on:


Date time data manipulation

String manipulation

Checking duplicates (Excel only)

Please check the pages for each of the languages/software for more information.
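As a small taste of the first two topics in Python, here is a minimal sketch using only the standard library. The `car_1` column comes from the list in Part 1, but the `signup_date` column, the two date formats, and all values are assumptions made only for this example. The helper names `parse_date` and `normalize_plate` are hypothetical.

```python
from datetime import datetime

# Hypothetical records: `car_1` is a real column name from the data description,
# while `signup_date`, its formats, and all values are invented for illustration.
people = [
    {"car_1": " abc-1234 ", "signup_date": "2019-09-01"},
    {"car_1": "XYZ 9876", "signup_date": "09/15/2020"},
]

# Assumed formats; a real file may need a different list.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each assumed format in turn; return a date, or None if none match."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def normalize_plate(text):
    """Uppercase the plate and keep only letters and digits."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


cleaned = [
    {
        "car_1": normalize_plate(person["car_1"]),
        "signup_date": parse_date(person["signup_date"]),
    }
    for person in people
]

# Once strings are real dates, arithmetic such as a day difference is possible.
days_between = (cleaned[1]["signup_date"] - cleaned[0]["signup_date"]).days

for person in cleaned:
    print(person["car_1"], person["signup_date"].isoformat())
print("days between sign-ups:", days_between)
```

Parsing into real date objects first, rather than comparing the raw strings, is what makes sorting and date arithmetic reliable when a column mixes formats.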

