Note
This notebook can be downloaded here: 00_Introduction_Panda.ipynb
Introduction to the Pandas module¶
Code author: Emile Roux emile.roux@univ-smb.fr
RISE Slideshow
Scope¶
This notebook gives some key functions to work with data base using the panda module (https://pandas.pydata.org/)
The web gives you a lot of exemples and documentations on this module:
http://pandas.pydata.org/pandas-docs/stable/10min.html
http://www.python-simple.com/python-pandas/panda-intro.php
#Setup
%load_ext autoreload
%matplotlib nbagg
%autoreload 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
Load data and creat a dataframe from csv file¶
More explaination can be found here : https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/
df = pd.read_csv('./_DATA/Note_csv.csv',delimiter=";")
Display the dataframe¶
# return the beginning of the dataframe
df.head()
section | groupe | name | ET | CC | |
---|---|---|---|---|---|
0 | MM | A | ami | 14.5 | 11.75 |
1 | MM | A | joyce | 8.5 | 11.50 |
2 | MM | C | lola | 9.5 | 13.25 |
3 | MM | B | irma | 7.5 | 6.00 |
4 | IAI | D | florence | 14.5 | 13.25 |
# return the end of the dataframe
df.tail()
section | groupe | name | ET | CC | |
---|---|---|---|---|---|
90 | MM | A | james | 13.75 | 12.75 |
91 | IAI | D | richard | 15.25 | 7.00 |
92 | MM | A | caprice | 18.25 | 15.00 |
93 | IAI | D | al | 12.50 | 9.75 |
94 | MM | B | constance | 3.00 | 7.00 |
Selecting data in a dataframe¶
# get data from index 2
df.loc[2]
section MM
groupe C
name lola
ET 9.5
CC 13.25
Name: 2, dtype: object
# get name from index 2
df.name[2]
'lola'
# Sliccing is also working
df.name[2:6]
2 lola
3 irma
4 florence
5 vi
Name: name, dtype: object
Get one of row of the dataframe¶
df.name
0 ami
1 joyce
2 lola
3 irma
4 florence
5 vi
6 brian
7 antoinette
8 fred
9 gaston
10 samuel
11 arnaud
12 annie
13 roosevelt
14 sarah
15 simon
16 louis
17 an
18 jacques
19 charles
20 sigrid
21 lasse
22 king
23 marco
24 patrick
25 liv
26 diane
27 bill
28 jessica
29 gilles
...
65 jeannot
66 fernand
67 lise
68 ursula
69 dona
70 dominique
71 platon
72 eugen
73 pedro
74 bob
75 marquis
76 j곩mie
77 karl
78 lucienne
79 timothꥻ4.75
80 avis
81 mari
82 rose
83 porter
84 philippe
85 vin
86 jeunesse
87 victoire
88 joseph
89 fꭩx
90 james
91 richard
92 caprice
93 al
94 constance
Name: name, Length: 95, dtype: object
Get the number of student in groupe A and B¶
df.groupe.value_counts()
B 25
A 24
D 23
C 23
Name: groupe, dtype: int64
Get the proportion of student between groupe A and B¶
df.groupe.value_counts(normalize=True)
B 0.263158
A 0.252632
D 0.242105
C 0.242105
Name: groupe, dtype: float64
Display the proportion of student between groupe A and B¶
*Using the plot function of panda:*
visualization optin of pandas can be found here : http://pandas.pydata.org/pandas-docs/version/0.18/visualization.html
fig = plt.figure()
df.groupe.value_counts(normalize=True).plot.pie(labels=['A', 'B', 'C', 'D'], colors= ['r', 'g', 'b', 'y'], autopct='%.1f')
plt.show()
<IPython.core.display.Javascript object>
*Using the plot function of matplotlib:*
val = df.groupe.value_counts(normalize=True).values
explode = (0.1, 0, 0, 0)
labels = 'A', 'B', 'C', 'D'
fig1, ax1 = plt.subplots()
ax1.pie(val, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
<IPython.core.display.Javascript object>
Get student list who are in groupe A¶
df[df.groupe=="A"]
section | groupe | name | ET | CC | |
---|---|---|---|---|---|
0 | MM | A | ami | 14.50 | 11.75 |
1 | MM | A | joyce | 8.50 | 11.50 |
23 | MM | A | marco | 12.50 | 13.00 |
27 | MM | A | bill | 11.00 | 12.75 |
28 | MM | A | jessica | 16.50 | 12.50 |
37 | MM | A | denis | 13.25 | 16.00 |
38 | MM | A | jenny | 12.75 | 17.50 |
40 | MM | A | christian | 12.50 | 12.50 |
43 | MM | A | rita | 13.75 | 8.50 |
44 | MM | A | orlando | 14.00 | 15.25 |
48 | MM | A | chant | 4.50 | 9.00 |
50 | MM | A | val | 15.00 | 11.25 |
53 | MM | A | ana | 15.00 | 13.50 |
59 | MM | A | clarisse | 12.50 | 13.50 |
63 | MM | A | isabelle | 14.00 | 7.50 |
65 | MM | A | jeannot | 14.75 | 14.00 |
66 | MM | A | fernand | 8.00 | 10.00 |
75 | MM | A | marquis | 8.50 | 13.00 |
85 | MM | A | vin | 11.00 | 13.00 |
86 | MM | A | jeunesse | 12.00 | 10.50 |
87 | MM | A | victoire | 11.75 | 12.00 |
89 | MM | A | fꭩx | 13.00 | 14.50 |
90 | MM | A | james | 13.75 | 12.75 |
92 | MM | A | caprice | 18.25 | 15.00 |
Make calulation on data¶
df.ET.mean() # the mean of ET note over all student
11.043010752688172
df.ET[df.groupe=="A"].mean() # the mean of note1 over student from A groupe
12.552083333333334
df.groupby(['groupe']).mean() # compte the mean of each note for each groupe
ET | CC | |
---|---|---|
groupe | ||
A | 12.552083 | 12.531250 |
B | 9.720000 | 10.093750 |
C | 10.630435 | 11.913043 |
D | 11.345238 | 9.076087 |
df.groupby(['section']).mean() # compte the mean of each note for each section
ET | CC | |
---|---|---|
section | ||
IAI | 10.804688 | 9.786765 |
MM | 11.168033 | 11.550000 |
Display the notes with a histogram plot¶
# CC notes
fig = plt.figure()
df.CC.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()
<IPython.core.display.Javascript object>
# ET notes
fig = plt.figure()
df.ET.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()
<IPython.core.display.Javascript object>
fig = plt.figure()
df.plot.hist(alpha=.5, bins=np.arange(1,20))
plt.show()
<IPython.core.display.Javascript object>
<IPython.core.display.Javascript object>
Let’s compute the mean of both notes¶
We need first to add a new row to a data frame¶
df["FinalNote"] = 0.0 # add row filled with 0.0
df.head()
section | groupe | name | ET | CC | FinalNote | |
---|---|---|---|---|---|---|
0 | MM | A | ami | 14.5 | 11.75 | 0.0 |
1 | MM | A | joyce | 8.5 | 11.50 | 0.0 |
2 | MM | C | lola | 9.5 | 13.25 | 0.0 |
3 | MM | B | irma | 7.5 | 6.00 | 0.0 |
4 | IAI | D | florence | 14.5 | 13.25 | 0.0 |
Let’s compute the mean¶
df["FinalNote"]=df.mean(axis=1)
# the axis option alows comptuting the mean over lines or rows
df.head()
section | groupe | name | ET | CC | FinalNote | |
---|---|---|---|---|---|---|
0 | MM | A | ami | 14.5 | 11.75 | 8.750000 |
1 | MM | A | joyce | 8.5 | 11.50 | 6.666667 |
2 | MM | C | lola | 9.5 | 13.25 | 7.583333 |
3 | MM | B | irma | 7.5 | 6.00 | 4.500000 |
4 | IAI | D | florence | 14.5 | 13.25 | 9.250000 |
fig = plt.figure()
df.FinalNote.plot.hist(alpha=0.5, bins=np.arange(1,20))
plt.show()
<IPython.core.display.Javascript object>