Data Preparation

(Lab 1) Restaurant Recommendations Data Prep

The data that is collected by survey responses of Data Analysis & Design students will be downloaded and prepared. The data-set is prepared by:

  1. Rename the columns.
  2. Break apart the lists in the Food Types Field.
    • We need to create a new calculated field to break up the Food Types.
    • Click the down triangle again and ‘Create Calculated Field’
    • Click the arrow to the left if you can’t see the options.
    • We’ll make a new column of just the dominant food type. Use this formula: SPLIT([Type], ‘,’, 1)
    • Now rename this column
  3. Get first names only and standardize the orthography We’ve made the assumption here is that there cannot be spaces in a first name. This is a bad practice and can be avoided by asking better questions i.e., for First Name and Last Name.
    • We need to create a new calculated field to break apart the names. We’ll make this field all sentence case.
    • Click the down triangle again and ‘Create Calculated Field’
    • Click the arrow to the left if you can’t see the options.
    • We’ll make a new column of just the first name (we assume). Use this formula: SPLIT([Name], ‘ ‘, 1)
    • Now rename this column. **If you need to rename multiple columns, follow this tutorial.
  4. Transform those addresses into latitude and longitude.
    • On the second page of the workbook is a key [LatLon]. Download this as a csv file.
    • Select ‘Add’ next to ‘Connections’ and select ‘Text File’ (csv’s are read as text files – NOT Excel files – because of the way the files are formatted)
    • You will be prompted to do a join. Select a Right Join. Even if a borough is not represented (i.e., sometimes Staten Island gets forgotten even though we don’t want to forget it in our visualization)
    • Close the dialog box.
      • We actually can do this in Tableau, but it is typically not great. There are a variety of geocoders that work much better census gps texas A&M. Unfortunately, the way we asked this question will not allow us to use an online geocoder. Again, it goes back to asking questions in a way that gets us the answers we want. One thing we can do is visualize how many of these restaurants are in each borough.
  5. Typos

By default, Google Sheets connects in ‘Live’ mode, meaning that the data is constantly being updated. If you found a typo (maybe someone misspelled a word, then you have to fix it in the original, not here). Since we are all using the same dataset, there may not be any typos left for you to fix. Some ways to solve this include what we did for the Borough Names. We were interested in data about NYC, so we offered our respondents a shortlist of acceptable responses. You can use checkboxes and similar formats to constrain the answers you get.
You can also extract the data, which gives you a snapshot copy of the data in your workspace that will not be updated when new entries are found. It’s really a matter of what your goals are for your project.

The tutorial was written by Michelle McSweeney, Ph.D. for Introduction to Data Visualization, a course in the M.A. in Digital Humanities at the Graduate Center at CUNY.