HANDS-ON WRAP-UP

The following statements help you check that you have identified the key takeaways from the hands-on exercises done in the lab sessions.

  • The core data types used by the Pandas library for manipulating data are DataFrame and Series.
  • A DataFrame is a tabular representation in terms of
    • columns, each identified by a name and possibly of a different type;
    • rows, the actual records/tuples containing a value for every column of the table.
    • It is a 2-dimensional labeled data structure. You can think of it like a spreadsheet.
    • Be sure that you know how to define a DataFrame by hand and by loading data from a CSV file.
  • Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
    • The axis labels are collectively referred to as the index
    • Example:
      In [1]: import numpy as np
      In [2]: import pandas as pd
      In [3]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
      In [4]: s
      Out[4]: 
      a    0.4691
      b   -0.2829
      c   -1.5091
      d   -1.1356
      e    1.2121
      dtype: float64
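
The two ways of defining a DataFrame mentioned above can be sketched as follows. This is a minimal illustration, not the lab's own code: the column names and values are made up, and an in-memory string stands in for a real CSV file.

```python
import io

import pandas as pd

# Define a DataFrame by hand: each dict key becomes a named column,
# and the columns may have different types.
d = pd.DataFrame({
    "name": ["Ana", "Luis", "Mei"],   # strings
    "age": [23, 35, 41],              # integers
    "height": [1.62, 1.80, 1.70],     # floats
})

# Load a DataFrame from CSV. io.StringIO stands in for a file on disk;
# with a real file you would pass its path to pd.read_csv instead.
csv_text = "name,age,height\nAna,23,1.62\nLuis,35,1.80\nMei,41,1.70\n"
d_csv = pd.read_csv(io.StringIO(csv_text))

print(d.shape)    # (number of rows, number of columns)
print(d.dtypes)   # one type per column
print(d_csv)
```

Picking a single column, for instance d["age"], gives back a Series whose index is the row labels of the DataFrame.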
      

Exploring the content of DataFrames: three types of operations can be applied to DataFrames. Given a DataFrame d:

  • Selection: an operator of this type returns a DataFrame containing a subset of the rows of d according to a specific criterion (e.g., d.tail(), d.head(), d[0:10]).
  • Projection: an operator of this type returns all the rows of d but only the values of the columns specified by the operator. A single column name, as in d['column'], yields a Series; a list of names, as in d[['col1', 'col2']], yields a DataFrame.
  • Filter: an operator of this type picks rows from a DataFrame according to some criterion. For example, d.iloc[loc] selects a row by its integer location and returns it as a Series.
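
The three kinds of exploration operators can be tried on a small example. The DataFrame below and its column names are invented for illustration; the boolean-mask filter on the last line is one common way of expressing "rows that satisfy a criterion" and is shown here as an assumption about what the lab covered.

```python
import pandas as pd

d = pd.DataFrame({
    "city": ["Lyon", "Paris", "Nice", "Lille"],
    "temp": [21.5, 19.0, 25.0, 17.5],
})

# Selection: a subset of the rows, still a DataFrame.
first_two = d.head(2)
last_two = d.tail(2)
middle = d[1:3]            # rows at positions 1 and 2

# Projection: all rows, chosen columns only.
temps = d["temp"]          # one column name -> Series
temps_df = d[["temp"]]     # list of names   -> DataFrame

# Filter: one row by integer position (a Series) ...
row = d.iloc[2]
# ... or the rows satisfying a condition (a DataFrame).
warm = d[d["temp"] > 20]

print(type(temps).__name__, type(temps_df).__name__, type(row).__name__)
print(warm)
```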

Manipulating DataFrames: three types of operations can be applied to manipulate DataFrames. Given a DataFrame d:

  • Add or drop rows from a DataFrame d.
  • Modify the value of one or several columns in one or several rows of a DataFrame d, specifying the new value either as a constant or as the result of an operation (e.g., a mathematical operation).
  • Add or drop a column to/from a DataFrame d. Note that this modifies the structure of the DataFrame d.
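
A short sketch of the three manipulation operations, using made-up data. Rows are added here with pd.concat, which is one of several possible ways and not necessarily the one used in the lab.

```python
import pandas as pd

d = pd.DataFrame({"item": ["pen", "book"], "price": [2.0, 10.0]})

# Add a row: concatenate with a one-row DataFrame.
new_row = pd.DataFrame({"item": ["bag"], "price": [30.0]})
d = pd.concat([d, new_row], ignore_index=True)

# Drop a row by its index label.
d = d.drop(index=0)

# Modify values: a constant for selected rows, then an operation on a column.
d.loc[d["item"] == "bag", "price"] = 25.0
d["price"] = d["price"] * 1.2

# Add a column, then drop it; both change the structure of d.
d["in_stock"] = True
d = d.drop(columns=["in_stock"])

print(d)
```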

Once you understand the usefulness of both data structures for processing data collections and know their properties, you are ready to start getting acquainted with the content of data collections through statistics and unsupervised learning.

  • Given a CSV file, what are the steps to be performed for getting the data ready to be quantitatively analysed?
  • Which statistical tools are used for understanding the content of a data collection? What are we looking for when applying these tools?
  • What types of questions can be answered about a data collection by applying descriptive statistics? Which statistical measures are used for answering such types of questions? How are these questions actually expressed in Python using the appropriate libraries?
  • Which statistical concepts can take part in a data cleaning process? Recall that data cleaning deals with missing, null, and default values, and with values that differ too much from the others in a given set (outliers). Be sure that you identify Python code that implements this cleaning process.
  • Why is observing the distribution of the data important for understanding the content of a data collection, and what kind of analysis can be performed by observing the distribution of data values? Given a DataFrame, what are the steps to apply the notion of distribution to understand its content? Be sure that you identify Python code that implements this process.
  • In which cases does a Data Scientist need to normalise a histogram? Be sure that you identify Python code that implements this process.
  • What are the steps to plot a histogram using Python? What about plotting points in a two-dimensional space?
  • What types of questions can be answered by K-Means that cannot be answered by applying statistical tools?
  • Given a data collection stored in a CSV file, what are the steps to prepare it for applying K-Means? Be sure that you identify Python code that implements this process.
  • What are the parameters used by the K-Means function and what do they represent?
  • Which are the measures used for evaluating a K-Means result? Be sure that you identify Python code that implements this process.
  • How can you use plotting in the process of applying K-Means for obtaining knowledge out of a data collection? Be sure that you identify Python code that implements this process.
  • How can you compute centroids using the libraries used in the lab? Be sure that you identify Python code that implements this process.
  • Once you have a clustering result you have a model of a phenomenon represented by the data collection used as input. How can you use this model if new data arrive?
  • Be sure that you can propose cases in which you can apply K-Means.
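
As a reminder of what the steps behind several of these questions look like, here is a minimal end-to-end sketch: load a CSV, clean it, summarise it, compute a normalised histogram, run K-Means, evaluate the result, and classify a new point. Everything in it is an assumption made for illustration: the data is invented, the helper names assign and kmeans are hypothetical, and K-Means is written in plain NumPy so that each step is visible. It is not the library function used in the lab.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """x,y
1.0,1.1
1.2,0.9
0.8,1.0
,1.0
5.0,5.2
5.1,4.9
4.9,5.0
"""
d = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop the record with a missing value.
d = d.dropna().reset_index(drop=True)

# Descriptive statistics per column: count, mean, std, min, quartiles, max.
summary = d.describe()

# Normalised histogram: density=True rescales the counts so that the
# bar areas sum to 1.
density, edges = np.histogram(d["x"], bins=4, density=True)

# Standardise the columns so that no feature dominates the distances.
mean, std = d.mean(), d.std()
X = ((d - mean) / std).to_numpy()


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=20, seed=0):
    """Plain NumPy K-Means (Lloyd's algorithm), for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialise with k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)


centroids, labels = kmeans(X, k=2)

# Evaluation: within-cluster sum of squared distances (lower is tighter).
inertia = ((X - centroids[labels]) ** 2).sum()

# New data: apply the same standardisation, then pick the nearest centroid.
new = pd.DataFrame({"x": [4.8], "y": [5.1]})
new_label = assign(((new - mean) / std).to_numpy(), centroids)[0]

print(summary)
print("labels:", labels, "inertia:", round(inertia, 3), "new point ->", new_label)
```

If the lab relied on scikit-learn (an assumption), the same ideas map onto sklearn.cluster.KMeans: the n_clusters parameter plays the role of k, cluster_centers_ holds the centroids, inertia_ is the within-cluster sum of squares, and predict assigns new data to a cluster. For the plots, matplotlib's plt.hist(d["x"], density=True) draws the normalised histogram and plt.scatter draws points in two dimensions.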