DATA MANIPULATION OVERVIEW

Angela Yu

Oct 12, 2019 (2 min read)

PART 1: Data Cleaning

In this section, we discuss data cleaning using either the sample Twitter data, called "data.csv", or the personal information data, "person_info.csv".

Download the sample Twitter data here:

In this data, there are 5 rows (observations) and 9 columns (variables), namely:

id: int, user id number (we changed this to 1-5 to protect privacy)
from_user: str, Twitter user name (we randomly added "oscr" to the real user names to protect privacy)
text: str, content of the tweet (text that begins with "RT @..." means the tweet is a retweet)
created_at: str/datetime, time the original post was created (not the retweet)
time: str/datetime, time this tweet was created (covers both retweets and original posts)
user_lang: str, abbreviation for the user's language (two values appear here: "en" for English and "es" for Spanish)
user_followers_count: int, number of Twitter accounts that follow this account
user_freinds_count: int, number of Twitter accounts followed by this account (manipulated)
user_location: str, geographic location (manipulated)
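As a first look at data laid out this way, the sketch below reads a small CSV with the nine columns listed above, counts rows and columns, and flags retweets by the "RT @" prefix. It uses only the Python standard library. The two sample rows are invented for illustration and are not taken from data.csv, and the helper names `load_rows` and `is_retweet` are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for data.csv: the column names come from the list
# above, but every value in these rows is invented for illustration.
SAMPLE_CSV = """id,from_user,text,created_at,time,user_lang,user_followers_count,user_freinds_count,user_location
1,alice_oscr,RT @bob: hello world,2019-10-01 08:00:00,2019-10-01 09:30:00,en,120,80,Boston
2,carlos_oscr,buenos dias,2019-10-02 10:15:00,2019-10-02 10:15:00,es,45,60,Madrid
"""


def load_rows(source):
    """Read CSV text from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def is_retweet(row):
    """Treat a tweet as a retweet when its text starts with 'RT @'."""
    return row["text"].startswith("RT @")


rows = load_rows(io.StringIO(SAMPLE_CSV))
columns = list(rows[0].keys())
retweet_flags = [is_retweet(row) for row in rows]

print(len(rows), "rows,", len(columns), "columns")
print("retweets:", retweet_flags)
```

To try this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an open file handle, for example `open("data.csv", newline="", encoding="utf-8")`, assuming the file is saved in the working directory.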

Download the personal information data here:

Columns 1-10 are self-explanatory
car_1: the license plate
gpa: the students' college ID
year: the students' year in college
class_of: the students' class in college (corresponding to year)
online_signiture: a simulated signature for each student

Some missing and irregular values have been introduced into the data. They are discussed in later sections.

For data cleaning, we want to accomplish the following tasks:

How to Check & Clean Missing Values / NA? - Using Python | R | Excel

How to Check & Convert Data Types? - Using Python | R | Tableau | Excel

How to Check & Clean Outliers? - Using Python | R

Please click on the links above to see the solution to each question in different programming languages or software.
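For orientation, the three tasks can be sketched in plain Python on a single numeric column. This is a minimal illustration under stated assumptions, not the solution given on the linked pages: the sample values, the set of strings treated as missing, and the 1.5 x IQR outlier rule are all choices made here for the example.

```python
import statistics

# Hypothetical values for a column such as user_followers_count as read from
# a CSV: everything arrives as text, and "" / "NA" stand in for missing data.
raw_counts = ["120", "45", "", "NA", "60", "75", "90", "5000"]

# Assumed markers for missing values; adjust to match the actual file.
MISSING_MARKERS = {"", "NA", "NaN", "null"}


def is_missing(value):
    """Task 1: decide whether a raw cell counts as missing."""
    return value is None or value.strip() in MISSING_MARKERS


def to_int(value):
    """Task 2: convert text to int, mapping missing cells to None."""
    return None if is_missing(value) else int(value)


def iqr_outliers(values):
    """Task 3: flag values more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


missing_count = sum(is_missing(v) for v in raw_counts)
counts = [to_int(v) for v in raw_counts]
clean = [c for c in counts if c is not None]
outliers = iqr_outliers(clean)

print("missing:", missing_count)
print("outliers:", outliers)
```

Dropping the missing cells before the outlier check is one option among several; filling them with a default or a column median may suit other analyses better.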

PART 2: Data Wrangling

We use the "person_info" data for this section, as discussed above. Because certain data types are especially common in digital humanities and social science research, we focus on:

Date time data manipulation

String manipulation

Checking duplicates (Excel only)

Please check the pages for each of the languages/software for more information.
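As a rough idea of what the first two topics involve, here is a short standard-library Python sketch that parses two timestamps, computes the gap between them, and tidies a name string. The sample values, the timestamp format, and the helper names `parse_timestamp` and `normalize_name` are assumptions for illustration; the actual files may store dates and names differently.

```python
from datetime import datetime

# Assumed timestamp layout; check the real data before relying on it.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical values, invented for this example.
original_post = "2019-10-01 08:00:00"
retweet_post = " 2019-10-01 09:30:00 "
messy_name = "  jane   DOE "


def parse_timestamp(text, fmt=TIMESTAMP_FORMAT):
    """Date time manipulation: turn text into a datetime object."""
    return datetime.strptime(text.strip(), fmt)


def normalize_name(name):
    """String manipulation: collapse extra spaces and fix capitalization."""
    return " ".join(name.split()).title()


delay = parse_timestamp(retweet_post) - parse_timestamp(original_post)
delay_minutes = delay.total_seconds() / 60
clean_name = normalize_name(messy_name)

print("delay in minutes:", delay_minutes)
print("clean name:", clean_name)
```

Once values are parsed into `datetime` objects, subtraction, sorting, and extracting parts such as the year or weekday all work directly, which is the main reason to convert early.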
