PART 1: Data Cleaning
In this section, we discuss data cleaning using either the sample Twitter data, called "data.csv", or the personal information data, "person_info.csv".
Download the sample Twitter data here:
In this data, there are 5 rows (observations) and 9 columns (variables), namely:
id: int, user id number (we changed this to 1-5 to protect privacy)
from_user: str, the user's Twitter name (we randomly added "oscr" to the real user names to protect privacy)
text: str, content of the tweet (text beginning with "RT @..." means the tweet is a retweet)
created_at: str/datetime, time the original post was created (for a retweet, this is the time of the original post, not of the retweet)
time: str/datetime, time the tweet was created (covers both retweets and original posts)
user_lang: str, abbreviation for the user's language (there are two values here: "en" for English and "es" for Spanish)
user_followers_count: int, number of Twitter accounts that follow this account
user_freinds_count: int, number of Twitter accounts that are followed by this account (manipulated)
user_location: str, geographic location (manipulated)
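As a quick sanity check on the layout described above, here is a minimal Python sketch that reads tweet data with these nine columns, counts rows and columns, and flags retweets by the "RT @" prefix. The inline sample rows are invented stand-ins for "data.csv" (the real file has 5 rows), and the helper names `load_rows` and `is_retweet` are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for data.csv: the column names come from the list
# above, but every value in these rows is invented for illustration.
SAMPLE = """id,from_user,text,created_at,time,user_lang,user_followers_count,user_freinds_count,user_location
1,oscr_alice,RT @oscr_bob: hello world,2013-01-01 10:00:00,2013-01-01 10:05:00,en,120,80,Chicago
2,oscr_bob,hello world,2013-01-01 10:00:00,2013-01-01 10:00:00,en,45,60,
3,oscr_carla,buenos dias,2013-01-02 08:30:00,2013-01-02 08:30:00,es,300,150,Madrid
"""


def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def is_retweet(row):
    """Treat a tweet whose text begins with 'RT @' as a retweet."""
    return row["text"].startswith("RT @")


rows = load_rows(io.StringIO(SAMPLE))
n_rows = len(rows)        # number of observations
n_cols = len(rows[0])     # number of variables
retweet_ids = [int(r["id"]) for r in rows if is_retweet(r)]
print(n_rows, n_cols, retweet_ids)
```

To try this on the real file, replace `io.StringIO(SAMPLE)` with `open("data.csv", newline="")`, assuming the file sits in the working directory.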
Download the personal information data here:
Columns 1-10 are self-explanatory
car_1: the license plate
gpa: the students' college ID
year: the students' year in college
class_of: the students' class in college (corresponding to year)
online_signiture: simulated students' signature
As you can see above, some missing or irregular values have been introduced. They are discussed in later sessions.
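A first step with such values is often simply counting them per column. The sketch below is one possible way to do that in plain Python, not the tutorial's own solution: the sample rows are invented, only a few of the "person_info.csv" columns are shown, and the set of strings treated as missing (`MISSING_TOKENS`) is an assumption.

```python
import csv
import io

# Hypothetical excerpt of person_info.csv: column names are taken from the
# description above, all values are invented.
SAMPLE = """car_1,year,class_of,online_signiture
ABC-1234,2,2016,jdoe
,,2015,
XYZ-9876,N/A,2014,asmith
"""

# Assumed markers for a missing value; adjust to whatever the data uses.
MISSING_TOKENS = {"", "na", "n/a", "null"}


def count_missing(rows):
    """Return {column: number of rows whose value looks missing}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value.strip().lower() in MISSING_TOKENS:
                counts[column] = counts.get(column, 0) + 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows)
print(missing)
```

Columns that never have a missing value are left out of the result, so an empty dictionary means nothing was flagged.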
For data cleaning, we want to accomplish the following tasks:
Please click on the links above to see the solution to each question in different programming languages or software.
PART 2: Data Wrangling
We use the "person_info" data for this section, as discussed above. To deal with data types common in digital humanities and social science research, we focus on:
Date time data manipulation
String manipulation
Checking duplicates (Excel only)
Please check the pages for each of the languages/software for more information.
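To give a flavor of these tasks, here is a minimal Python sketch covering date parsing, string normalization, and a duplicate check. It is an illustration under stated assumptions: the field names `name` and `birthday`, the date format, and the records themselves are hypothetical and not taken from "person_info.csv", and the duplicate check is a Python analogue of a task the tutorial itself covers only in Excel.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: field names and values are invented for illustration.
records = [
    {"name": "  jane DOE ", "birthday": "1995-03-14"},
    {"name": "John Smith", "birthday": "1994-11-02"},
    {"name": "Jane Doe", "birthday": "1995-03-14"},
]


def clean_name(raw):
    """Trim and collapse whitespace, then title-case the name."""
    return " ".join(raw.split()).title()


def parse_date(raw, fmt="%Y-%m-%d"):
    """Parse a date string; the ISO-style format is an assumption."""
    return datetime.strptime(raw, fmt).date()


# String and date manipulation: normalize each record into a comparable tuple.
cleaned = [(clean_name(r["name"]), parse_date(r["birthday"])) for r in records]
birth_years = [birthday.year for _, birthday in cleaned]

# Duplicate check: any normalized record that appears more than once.
duplicates = [rec for rec, n in Counter(cleaned).items() if n > 1]
print(birth_years, duplicates)
```

Normalizing before comparing matters here: the first and third records differ as raw strings and only show up as duplicates once whitespace and capitalization are cleaned.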