Note

This notebook can be downloaded here: 00_Introduction_Panda.ipynb

Introduction to the Pandas module¶

Code author: Emile Roux emile.roux@univ-smb.fr

Scope¶

This notebook presents some key functions for working with tabular data using the pandas module (https://pandas.pydata.org/).

Many examples and a lot of documentation on this module are available online:

http://pandas.pydata.org/pandas-docs/stable/10min.html

http://www.python-simple.com/python-pandas/panda-intro.php

# Setup
%load_ext autoreload
%matplotlib nbagg
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

Load data and create a dataframe from a csv file¶

More explanation can be found here: https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/

df = pd.read_csv('./_DATA/Note_csv.csv',delimiter=";")
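If a csv file contains accented characters and was not saved as UTF-8, names can come out garbled after loading. The sketch below shows the encoding argument of read_csv on a small temporary file. The file content and the Latin-1 encoding are assumptions made for the illustration, not facts about Note_csv.csv.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a csv file saved with a Latin-1 (ISO-8859-1) encoding.
csv_text = "section;groupe;name;ET;CC\nMM;A;félix;13.0;14.5\n"
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
    tmp.write(csv_text.encode("latin-1"))
    path = tmp.name

# Passing the encoding explicitly lets pandas decode the accented characters.
df_demo = pd.read_csv(path, delimiter=";", encoding="latin-1")
os.remove(path)

print(df_demo)
```

The right value for encoding depends on how the file was saved; "utf-8" and "latin-1" are two common cases.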

Display the dataframe¶

# return the beginning of the dataframe
df.head()

	section	groupe	name	ET	CC
0	MM	A	ami	14.5	11.75
1	MM	A	joyce	8.5	11.50
2	MM	C	lola	9.5	13.25
3	MM	B	irma	7.5	6.00
4	IAI	D	florence	14.5	13.25

# return the end of the dataframe
df.tail()

	section	groupe	name	ET	CC
90	MM	A	james	13.75	12.75
91	IAI	D	richard	15.25	7.00
92	MM	A	caprice	18.25	15.00
93	IAI	D	al	12.50	9.75
94	MM	B	constance	3.00	7.00

Selecting data in a dataframe¶

# get the row with index label 2
df.loc[2]

section       MM
groupe         C
name        lola
ET           9.5
CC         13.25
Name: 2, dtype: object

# get name from index 2
df.name[2]

'lola'

# Slicing also works
df.name[2:6]

      lola
      irma
  florence
        vi
Name: name, dtype: object
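The selections above rely on the index labels, which here happen to equal the row positions. As a minimal sketch on a small made-up dataframe (the data and the index values 10 to 12 are assumptions for illustration), the difference between label-based selection with loc and position-based selection with iloc looks like this:

```python
import pandas as pd

# Small hypothetical dataframe whose index labels do not start at 0.
demo = pd.DataFrame(
    {"name": ["ami", "joyce", "lola"], "ET": [14.5, 8.5, 9.5]},
    index=[10, 11, 12],
)

by_label = demo.loc[11, "name"]         # row whose index label is 11
by_position = demo["name"].iloc[1]      # second row, whatever its label
label_slice = demo.loc[10:11, "name"]   # label slices include both ends
position_slice = demo["name"].iloc[0:1] # position slices exclude the end

print(by_label, by_position, len(label_slice), len(position_slice))
```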

Get one column of the dataframe¶

df.name

           ami
         joyce
          lola
          irma
      florence
            vi
         brian
    antoinette
          fred
        gaston
       samuel
       arnaud
        annie
    roosevelt
        sarah
        simon
        louis
           an
      jacques
      charles
       sigrid
        lasse
         king
        marco
      patrick
          liv
        diane
         bill
      jessica
       gilles
         ...
      jeannot
      fernand
         lise
       ursula
         dona
    dominique
       platon
        eugen
        pedro
          bob
      marquis
      jérémie
         karl
     lucienne
      timothé
         avis
         mari
         rose
       porter
     philippe
          vin
     jeunesse
     victoire
       joseph
        félix
        james
      richard
      caprice
           al
    constance
Name: name, Length: 95, dtype: object
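Attribute access such as df.name returns a single column. A short sketch on made-up data (the values are illustrative only) shows the equivalent bracket syntax, which also allows selecting several columns at once:

```python
import pandas as pd

# Hypothetical two-row dataframe with the same column names as the notebook.
demo = pd.DataFrame(
    {"name": ["ami", "joyce"], "ET": [14.5, 8.5], "CC": [11.75, 11.5]}
)

one_column = demo["name"]            # a Series, same as demo.name
two_columns = demo[["name", "ET"]]   # a list of names returns a DataFrame

print(type(one_column).__name__, type(two_columns).__name__)
```

The bracket form is the safer choice when a column name contains spaces or clashes with an existing DataFrame attribute or method.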

Get the number of students in each groupe¶

df.groupe.value_counts()

B    25
A    24
D    23
C    23
Name: groupe, dtype: int64

Get the proportion of students in each groupe¶

df.groupe.value_counts(normalize=True)

B    0.263158
A    0.252632
D    0.242105
C    0.242105
Name: groupe, dtype: float64
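Note that value_counts orders its result by decreasing count, not by label. A minimal sketch on a made-up series (an assumption for illustration) shows how sort_index restores the alphabetical order, which matters when the values are later paired with a fixed list of labels:

```python
import pandas as pd

# Hypothetical group labels.
groups = pd.Series(["B", "A", "B", "C", "B", "A"])

by_count = groups.value_counts()               # most frequent label first
by_label = groups.value_counts().sort_index()  # alphabetical order of labels

print(list(by_count.index))   # ['B', 'A', 'C']
print(list(by_label.index))   # ['A', 'B', 'C']
```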

Display the proportion of students in each groupe¶

*Using the plot function of pandas:*

The visualization options of pandas are described here: http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html

fig = plt.figure()
# sort_index() orders the slices A, B, C, D; the index is used for the labels, so each label matches its value
df.groupe.value_counts(normalize=True).sort_index().plot.pie(colors=['r', 'g', 'b', 'y'], autopct='%.1f')
plt.show()

*Using the plot function of matplotlib:*

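counts = df.groupe.value_counts(normalize=True).sort_index()  # sorted so values and labels line up
val = counts.values
explode = (0.1, 0, 0, 0)  # pull out the first slice (groupe A)
labels = list(counts.index)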
fig1, ax1 = plt.subplots()
ax1.pie(val, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

Get the list of students in groupe A¶

df[df.groupe=="A"]

	section	groupe	name	ET	CC
0	MM	A	ami	14.50	11.75
1	MM	A	joyce	8.50	11.50
23	MM	A	marco	12.50	13.00
27	MM	A	bill	11.00	12.75
28	MM	A	jessica	16.50	12.50
37	MM	A	denis	13.25	16.00
38	MM	A	jenny	12.75	17.50
40	MM	A	christian	12.50	12.50
43	MM	A	rita	13.75	8.50
44	MM	A	orlando	14.00	15.25
48	MM	A	chant	4.50	9.00
50	MM	A	val	15.00	11.25
53	MM	A	ana	15.00	13.50
59	MM	A	clarisse	12.50	13.50
63	MM	A	isabelle	14.00	7.50
65	MM	A	jeannot	14.75	14.00
66	MM	A	fernand	8.00	10.00
75	MM	A	marquis	8.50	13.00
85	MM	A	vin	11.00	13.00
86	MM	A	jeunesse	12.00	10.50
87	MM	A	victoire	11.75	12.00
89	MM	A	félix	13.00	14.50
90	MM	A	james	13.75	12.75
92	MM	A	caprice	18.25	15.00
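The expression df.groupe=="A" builds a boolean mask that keeps only the rows where it is True. As a sketch on made-up data (the rows and the threshold of 10 are assumptions for illustration), masks can be combined with & and |, or built with isin:

```python
import pandas as pd

# Hypothetical rows using the same column names as the notebook.
demo = pd.DataFrame(
    {
        "groupe": ["A", "A", "B", "C"],
        "name": ["ami", "joyce", "irma", "lola"],
        "ET": [14.5, 8.5, 7.5, 9.5],
    }
)

# Each condition needs its own parentheses when combined with & or |.
high_et_in_a = demo[(demo["groupe"] == "A") & (demo["ET"] >= 10)]

# isin() keeps the rows whose value belongs to a list of candidates.
a_or_b = demo[demo["groupe"].isin(["A", "B"])]

print(high_et_in_a)
print(a_or_b)
```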

Make calculations on data¶

df.ET.mean() # the mean of the ET grade over all students

11.043010752688172

df.ET[df.groupe=="A"].mean() # the mean of the ET grade over the students in groupe A

12.552083333333334

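# compute the mean of each grade for each groupe
# (selecting the numeric columns avoids an error on text columns in recent pandas versions)
df.groupby('groupe')[['ET', 'CC']].mean()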

	ET	CC
groupe
A	12.552083	12.531250
B	9.720000	10.093750
C	10.630435	11.913043
D	11.345238	9.076087

# compute the mean of each grade for each section
df.groupby('section')[['ET', 'CC']].mean()

	ET	CC
section
IAI	10.804688	9.786765
MM	11.168033	11.550000
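A groupby can return several statistics at once through agg. The following is a minimal sketch on made-up grades (the data are an assumption for illustration):

```python
import pandas as pd

# Hypothetical grades for two groups.
demo = pd.DataFrame(
    {
        "groupe": ["A", "A", "B", "B"],
        "ET": [14.5, 8.5, 7.5, 9.5],
        "CC": [11.75, 11.5, 6.0, 13.25],
    }
)

# One row per groupe, one column per statistic of the ET grade.
summary = demo.groupby("groupe")["ET"].agg(["mean", "min", "max", "count"])

print(summary)
```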

Display the grades with a histogram plot¶

# CC grades
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()

# ET grades
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()

fig = plt.figure()
df.plot.hist(alpha=.5, bins=np.arange(1,20))
plt.show()
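The bins argument used in these histograms is a list of bin edges. The sketch below reproduces the binning with np.histogram on made-up grades (an assumption for illustration), which makes it visible that np.arange(1,20) defines 18 bins of width 1 and that values outside the interval from 1 to 19 are not counted:

```python
import numpy as np

# Hypothetical grades; 19.5 lies outside the bin range.
marks = np.array([1.0, 6.0, 6.5, 11.75, 13.25, 18.0, 19.5])

edges = np.arange(1, 20)  # edges 1, 2, ..., 19, i.e. 18 bins of width 1
counts, _ = np.histogram(marks, bins=edges)

print(len(counts))   # number of bins
print(counts.sum())  # number of grades that fall inside the bins
```

If grades below 1 or above 19 can occur, wider edges such as np.arange(0, 22) would include them.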

Let’s compute the mean of both grades¶

We first need to add a new column to the dataframe¶

df["FinalNote"] = 0.0 # add a column filled with 0.0

df.head()

	section	groupe	name	ET	CC	FinalNote
0	MM	A	ami	14.5	11.75	0.0
1	MM	A	joyce	8.5	11.50	0.0
2	MM	C	lola	9.5	13.25	0.0
3	MM	B	irma	7.5	6.00	0.0
4	IAI	D	florence	14.5	13.25	0.0

Let’s compute the mean¶

df["FinalNote"] = df[["ET", "CC"]].mean(axis=1)
# axis=1 computes the mean across columns, giving one value per row
# only ET and CC are selected, so the FinalNote column itself (filled with 0.0) does not enter the mean

df.head()

	section	groupe	name	ET	CC	FinalNote
0	MM	A	ami	14.5	11.75	13.125
1	MM	A	joyce	8.5	11.50	10.000
2	MM	C	lola	9.5	13.25	11.375
3	MM	B	irma	7.5	6.00	6.750
4	IAI	D	florence	14.5	13.25	13.875
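A row-wise mean is best restricted to the grade columns, so that no other numeric column enters the average. The sketch below, on made-up data with one missing grade (an assumption for illustration), also shows that mean skips missing values by default:

```python
import numpy as np
import pandas as pd

# Hypothetical grades; the third student has no ET grade.
demo = pd.DataFrame(
    {
        "name": ["ami", "joyce", "lola"],
        "ET": [14.5, 8.5, np.nan],
        "CC": [11.75, 11.5, 13.25],
    }
)

# Mean over the two grade columns only, one value per row.
# Missing values are skipped (skipna=True is the default).
demo["FinalNote"] = demo[["ET", "CC"]].mean(axis=1)

print(demo)
```

With skipna=False, the third row would get NaN instead of the single available grade.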

fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()

What is the overall mean?¶

df.FinalNote.mean()

10.812762277994366