[23 Python Special Lecture] Lecture 2. Data Frames and Analysis Basics (Code Search)

2024-01-01 23 min read

Lecture 2. Data Frames and Analysis Basics

Part 2 (Chapters 04-05)
pp. 074-130 (57 pages)

# Set the graph resolution
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.dpi': 100})
%config InlineBackend.figure_format = 'retina'

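04. Into the World of Data Frames

04-1. Understanding Data Frames (pp. 77-80)

04-2. Creating Data Frames (pp. 81-84)

[Do it! Practice] Creating a Data Frame by Entering Data (p. 81)

import pandas as pd

# Build a data frame from data in dictionary form ('{}')
## -> For dictionaries, see '17-5. Dictionaries' in the textbook (pp. 418-421)
# DataFrame -> both D and F are uppercase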

df = pd.DataFrame({'name'    : ['김지훈', '이유진', '박동현', '김민지'], 
                   'english' : [90, 80, 60, 70], 
                   'math'    : [50, 60, 100, 20]})
df

	name	english	math
0	김지훈	90	50
1	이유진	80	60
2	박동현	60	100
3	김민지	70	20
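To make the mapping concrete, the sketch below rebuilds the same data frame and inspects it: each dictionary key becomes a column name, and each list becomes that column's values. The variable names `scores`, `column_names`, `n_rows`, and `n_cols` are ours, not the textbook's.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values.
scores = pd.DataFrame({'name': ['김지훈', '이유진', '박동현', '김민지'],
                       'english': [90, 80, 60, 70],
                       'math': [50, 60, 100, 20]})

column_names = list(scores.columns)   # column names, in insertion order
n_rows, n_cols = scores.shape         # (number of rows, number of columns)
print(column_names, n_rows, n_cols)
```

All lists must have the same length; otherwise pandas raises a ValueError.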

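[Do it! Practice] Analyzing with a Data Frame (p. 82)

Extracting the values of a specific variable

# Print the English scores: the english variable (column) of df, where df is two-dimensional data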
df['english'] 

0    90
1    80
2    60
3    70
Name: english, dtype: int64

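Computing a sum from a variable's values

# Sum of the English scores
sum(df['english'])

# Sum of the math scores
sum(df['math'])

Computing a mean from a variable's values

# Mean of the English scores
sum(df['english']) / 4

75.0

# Mean of the math scores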
sum(df['math']) / 4

57.5

[Individual Practice] Try It Yourself (p. 84)

Q1

Create a data frame from the contents of the following table and print it.

제품	가격	판매량
사과	1800	24
딸기	1500	38
수박	3000	13

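# A1: use the variable fruit
# (제품 = product, 가격 = price, 판매량 = sales volume; 사과 = apple, 딸기 = strawberry, 수박 = watermelon)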
fruit = pd.DataFrame({'제품'   : ['사과', '딸기', '수박'],
                      '가격'   : [1800, 1500, 3000],
                      '판매량' : [24, 38, 13]})
fruit

	제품	가격	판매량
0	사과	1800	24
1	딸기	1500	38
2	수박	3000	13

Q2

Using the data frame created above, compute the average price and the average sales volume of the fruits.

# A2: average price: use the variable fruit_average_price
fruit_average_price = sum(fruit['가격']) / 3
fruit_average_price

2100.0

# Average sales volume: use the variable fruit_average_quantity
fruit_average_quantity = sum(fruit['판매량']) / 3
fruit_average_quantity

25.0

04-3. Using External Data (pp. 85-92)

[Do it! Practice] Loading an Excel File (p. 85)

# Load the file excel_exam.xlsx and store it in the variable df_exam
df_exam = pd.read_excel('excel_exam.xlsx')
df_exam

	id	nclass	math	english	science
0	1	1	50	98	50
1	2	1	60	97	60
2	3	1	45	86	78
3	4	1	30	98	58
4	5	2	25	80	65
5	6	2	50	89	98
6	7	2	80	90	45
7	8	2	90	78	25
8	9	3	20	98	15
9	10	3	50	98	45
10	11	3	65	65	65
11	12	3	45	85	32
12	13	4	46	98	65
13	14	4	48	87	12
14	15	4	75	56	78
15	16	4	58	98	65
16	17	5	65	68	98
17	18	5	80	78	90
18	19	5	89	68	87
19	20	5	78	83	58

# Mean of the English scores
sum(df_exam['english']) / 20

84.9

# Mean of the science scores
sum(df_exam['science']) / 20

59.45

# Use 1 of len(): the number of elements in a list

x = [1, 2, 3, 4, 5]
len(x)

# Use 2 of len(): the number of rows in a data frame

df = pd.DataFrame({'a' : [1,2,3],
                   'b' : [4,5,6]})
df
# len(df)

	a	b
0	1	4
1	2	5
2	3	6

# Number of rows in df_exam
len(df_exam)

# Mean of the English scores (using len)
sum(df_exam['english']) / len(df_exam)

84.9

# Mean of the science scores (using len)
sum(df_exam['science']) / len(df_exam)

59.45
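Dividing by len() avoids hard-coding the number of rows. As a quick check on made-up data (the four scores below are hypothetical, not the contents of excel_exam.xlsx), the same value comes out of a column's built-in mean() method:

```python
import pandas as pd

# Hypothetical scores; not the contents of excel_exam.xlsx.
df_small = pd.DataFrame({'english': [98, 97, 86, 98]})

mean_by_len = sum(df_small['english']) / len(df_small)   # sum divided by the row count
mean_by_method = df_small['english'].mean()              # pandas computes the same mean

print(mean_by_len, mean_by_method)
```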

What if the first row of the Excel file is not the variable names?

# Load excel_exam_novar.xlsx into the variable df_exam_novar
df_exam_novar = pd.read_excel('excel_exam_novar.xlsx')
df_exam_novar

	1	1.1	50	98	50.1
0	2	1	60	97	60
1	3	2	25	80	65
2	4	2	50	89	98
3	5	3	20	98	15
4	6	3	50	98	45
5	7	4	46	98	65
6	8	4	48	87	12

# Code for loading a file whose first row is not variable names: header=None -- store the result in df_exam_novar
df_exam_novar = pd.read_excel('excel_exam_novar.xlsx', header = None)
df_exam_novar

	0	1	2	3	4
0	1	1	50	98	50
1	2	1	60	97	60
2	3	2	25	80	65
3	4	2	50	89	98
4	5	3	20	98	15
5	6	3	50	98	45
6	7	4	46	98	65
7	8	4	48	87	12
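With header=None the columns are numbered 0 to 4, which is hard to read. One way to attach names at load time is the names parameter. The sketch below uses read_csv on an in-memory string so that it runs without the Excel file; read_excel accepts header and names in the same way. The column names mirror exam.csv and are an assumption about what the unnamed columns mean.

```python
import io
import pandas as pd

# In-memory stand-in for a file whose first row is data, not variable names.
raw = io.StringIO("1,1,50,98,50\n2,1,60,97,60\n3,2,25,80,65\n")

# header=None: do not treat the first row as column names.
# names=...: supply the column names ourselves (assumed to match exam.csv).
df_named = pd.read_csv(raw, header=None,
                       names=['id', 'nclass', 'math', 'english', 'science'])
print(df_named)
```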

What if the Excel file has multiple sheets?

To load only the data in a specific sheet, 'Sheet2':

# Load the file excel_exam.xlsx into the variable df_exam,
# naming the sheet with the sheet_name parameter
df_exam = pd.read_excel('excel_exam.xlsx', sheet_name = 'Sheet2')
df_exam

	id	history
0	1	95
1	2	100
2	3	99

[Do it! Practice] Loading a CSV File (p. 90)

# Load the CSV file 'exam.csv' into the variable df_csv_exam
df_csv_exam = pd.read_csv('exam.csv')
df_csv_exam

	id	nclass	math	english	science
0	1	1	50	98	50
1	2	1	60	97	60
2	3	1	45	86	78
3	4	1	30	98	58
4	5	2	25	80	65
5	6	2	50	89	98
6	7	2	80	90	45
7	8	2	90	78	25
8	9	3	20	98	15
9	10	3	50	98	45
10	11	3	65	65	65
11	12	3	45	85	32
12	13	4	46	98	65
13	14	4	48	87	12
14	15	4	75	56	78
15	16	4	58	98	65
16	17	5	65	68	98
17	18	5	80	78	90
18	19	5	89	68	87
19	20	5	78	83	58

[Do it! Practice] Saving a Data Frame as a CSV File (p. 91)

# Create a data frame to save

df_midterm = pd.DataFrame({'english': [90, 80, 60, 70],
                           'math'   : [50, 60, 100, 20],
                           'nclass' : [1, 1, 2, 2]})
df_midterm

	english	math	nclass
0	90	50	1
1	80	60	1
2	60	100	2
3	70	20	2

# Save df_midterm under the file name output_newdata.csv (index=True includes the index numbers, index=False excludes them)
df_midterm.to_csv('output_newdata.csv', index=False)

# Reload: read output_newdata.csv and store it in the variable reload
reload = pd.read_csv('output_newdata.csv')
reload

	english	math	nclass
0	90	50	1
1	80	60	1
2	60	100	2
3	70	20	2
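The effect of index=False can be seen by saving both ways and reading the result back. The sketch below writes to in-memory text instead of files; the variable names other than df_midterm are ours.

```python
import io
import pandas as pd

df_midterm = pd.DataFrame({'english': [90, 80, 60, 70],
                           'math': [50, 60, 100, 20],
                           'nclass': [1, 1, 2, 2]})

# Save without the index, then read back: the columns are unchanged.
text_without_index = df_midterm.to_csv(index=False)
reload_without = pd.read_csv(io.StringIO(text_without_index))

# Save with the index (the default), then read back: the saved index
# comes back as an extra, unnamed first column.
text_with_index = df_midterm.to_csv(index=True)
reload_with = pd.read_csv(io.StringIO(text_with_index))

print(list(reload_without.columns))
print(list(reload_with.columns))
```

Without a file name, to_csv() returns the CSV text as a string, which is what makes the in-memory round trip possible.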

05. Data Analysis Basics

05-1. Getting to Know the Data (pp. 99-106)

head(), tail(), shape, info(), describe()

# Load exam.csv and store it in the variable exam
import pandas as pd

exam = pd.read_csv('exam.csv')
exam

	id	nclass	math	english	science
0	1	1	50	98	50
1	2	1	60	97	60
2	3	1	45	86	78
3	4	1	30	98	58
4	5	2	25	80	65
5	6	2	50	89	98
6	7	2	80	90	45
7	8	2	90	78	25
8	9	3	20	98	15
9	10	3	50	98	45
10	11	3	65	65	65
11	12	3	45	85	32
12	13	4	46	98	65
13	14	4	48	87	12
14	15	4	75	56	78
15	16	4	58	98	65
16	17	5	65	68	98
17	18	5	80	78	90
18	19	5	89	68	87
19	20	5	78	83	58

# Print the first 5 rows
exam.head()

	id	nclass	math	english	science
0	1	1	50	98	50
1	2	1	60	97	60
2	3	1	45	86	78
3	4	1	30	98	58
4	5	2	25	80	65

# Print the first 10 rows
exam.head(10)

	id	nclass	math	english	science
0	1	1	50	98	50
1	2	1	60	97	60
2	3	1	45	86	78
3	4	1	30	98	58
4	5	2	25	80	65
5	6	2	50	89	98
6	7	2	80	90	45
7	8	2	90	78	25
8	9	3	20	98	15
9	10	3	50	98	45

# Print the last 5 rows
exam.tail()

	id	nclass	math	english	science
15	16	4	58	98	65
16	17	5	65	68	98
17	18	5	80	78	90
18	19	5	89	68	87
19	20	5	78	83	58

# Print the last 10 rows
exam.tail(10)

	id	nclass	math	english	science
10	11	3	65	65	65
11	12	3	45	85	32
12	13	4	46	98	65
13	14	4	48	87	12
14	15	4	75	56	78
15	16	4	58	98	65
16	17	5	65	68	98
17	18	5	80	78	90
18	19	5	89	68	87
19	20	5	78	83	58

# Print the number of rows and columns
exam.shape

(20, 5)

# Print the variable attributes
exam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       20 non-null     int64
 1   nclass   20 non-null     int64
 2   math     20 non-null     int64
 3   english  20 non-null     int64
 4   science  20 non-null     int64
dtypes: int64(5)
memory usage: 932.0 bytes

# Print summary statistics
exam.describe()

	id	nclass	math	english	science
count	20.00000	20.000000	20.000000	20.000000	20.000000
mean	10.50000	3.000000	57.450000	84.900000	59.450000
std	5.91608	1.450953	20.299015	12.875517	25.292968
min	1.00000	1.000000	20.000000	56.000000	12.000000
25%	5.75000	2.000000	45.750000	78.000000	45.000000
50%	10.50000	3.000000	54.000000	86.500000	62.500000
75%	15.25000	4.000000	75.750000	98.000000	78.000000
max	20.00000	5.000000	90.000000	98.000000	98.000000
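Each row of describe() can also be obtained on its own, which helps when reading the table. A minimal sketch on four made-up math scores (not the exam.csv data), chosen small enough to check by hand:

```python
import pandas as pd

# Hypothetical scores, small enough to verify manually.
math = pd.Series([30, 45, 50, 60], name='math')

summary = math.describe()

# The same values, computed one at a time.
count = math.count()           # number of non-missing values
mean = math.mean()             # arithmetic mean
median = math.quantile(0.5)    # the 50% row is the median
lowest, highest = math.min(), math.max()

print(summary)
```

The std row is the sample standard deviation (pandas divides by n - 1 by default), so it differs slightly from a population standard deviation computed by hand.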

# Load the mpg.csv data and store it in the variable mpg
'''
cf) mpg: the mpg (miles per gallon) data set was released by the US Environmental Protection Agency and contains fuel economy information for 234 car models sold in the United States between 1999 and 2008.
https://m.blog.naver.com/eunha4685/221496862666
'''
mpg = pd.read_csv('mpg.csv')
mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

# Print the first 5 rows of mpg
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

# Print the last 5 rows of mpg
mpg.tail()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

# Print the number of rows and columns of mpg
mpg.shape

(234, 11)

# Print the variable attributes of mpg
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  category      234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.2+ KB

# Print summary statistics of mpg (numeric variables only)
mpg.describe()

	displ	year	cyl	cty	hwy
count	234.000000	234.000000	234.000000	234.000000	234.000000
mean	3.471795	2003.500000	5.888889	16.858974	23.440171
std	1.291959	4.509646	1.611534	4.255946	5.954643
min	1.600000	1999.000000	4.000000	9.000000	12.000000
25%	2.400000	1999.000000	4.000000	14.000000	18.000000
50%	3.300000	2003.500000	6.000000	17.000000	24.000000
75%	4.600000	2008.000000	8.000000	19.000000	27.000000
max	7.000000	2008.000000	8.000000	35.000000	44.000000

# Print summary statistics of mpg (string variables only, using include=)
mpg.describe(include = 'object')

	manufacturer	model	trans	drv	fl	category
count	234	234	234	234	234	234
unique	15	38	10	3	5	7
top	dodge	caravan 2wd	auto(l4)	f	r	suv
freq	37	11	83	106	168	62
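For string variables, describe() reports count, unique, top, and freq instead of means and quartiles. The sketch below reproduces those numbers by hand on a made-up column; the values are hypothetical, loosely modeled on drv, and the object dtype is set explicitly as an assumption so that include='object' selects the column whatever the default string dtype of the installed pandas version is.

```python
import pandas as pd

# Hypothetical drive-train codes, stored with an explicit object dtype.
cars = pd.DataFrame({'drv': pd.Series(['f', 'f', '4', 'r', 'f'], dtype=object)})

text_summary = cars.describe(include='object')

counts = cars['drv'].value_counts()
n_unique = cars['drv'].nunique()   # number of distinct values -> 'unique' row
top_value = counts.index[0]        # most frequent value       -> 'top' row
top_freq = counts.iloc[0]          # its frequency             -> 'freq' row

print(text_summary)
```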

# Print summary statistics of mpg (both string and numeric variables, using include=)
mpg.describe(include = 'all')

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category
count	234	234	234.000000	234.000000	234.000000	234	234	234.000000	234.000000	234	234
unique	15	38	NaN	NaN	NaN	10	3	NaN	NaN	5	7
top	dodge	caravan 2wd	NaN	NaN	NaN	auto(l4)	f	NaN	NaN	r	suv
freq	37	11	NaN	NaN	NaN	83	106	NaN	NaN	168	62
mean	NaN	NaN	3.471795	2003.500000	5.888889	NaN	NaN	16.858974	23.440171	NaN	NaN
std	NaN	NaN	1.291959	4.509646	1.611534	NaN	NaN	4.255946	5.954643	NaN	NaN
min	NaN	NaN	1.600000	1999.000000	4.000000	NaN	NaN	9.000000	12.000000	NaN	NaN
25%	NaN	NaN	2.400000	1999.000000	4.000000	NaN	NaN	14.000000	18.000000	NaN	NaN
50%	NaN	NaN	3.300000	2003.500000	6.000000	NaN	NaN	17.000000	24.000000	NaN	NaN
75%	NaN	NaN	4.600000	2008.000000	8.000000	NaN	NaN	19.000000	27.000000	NaN	NaN
max	NaN	NaN	7.000000	2008.000000	8.000000	NaN	NaN	35.000000	44.000000	NaN	NaN

05-2. Renaming Variables (pp. 113-115)

Purpose: change variable names to words that are easy to understand, which makes the data easier to work with
df.rename()

# Create a data frame

df_raw = pd.DataFrame({'var1':[1, 2, 1],
                       'var2':[2, 3, 2],})
df_raw

	var1	var2
0	1	2
1	2	3
2	1	2

# Make a copy before modifying the data
df_new = df_raw.copy()
df_new

	var1	var2
0	1	2
1	2	3
2	1	2

# Rename the variable var2 to v2 and store the result in df_new: function rename(), parameter columns
df_new = df_new.rename(columns = {'var2' : 'v2'})
df_new

	var1	v2
0	1	2
1	2	3
2	1	2
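Two details of rename() are easy to miss: it returns a new data frame rather than changing the original, and by default it silently ignores names that do not exist. A small sketch follows; the misspelled name 'var3' is deliberate, and df_typo and typo_detected are names of our own.

```python
import pandas as pd

df_raw = pd.DataFrame({'var1': [1, 2, 1],
                       'var2': [2, 3, 2]})

# rename() returns a new data frame; df_raw keeps its original names.
df_new = df_raw.rename(columns={'var2': 'v2'})

# A name that does not exist is ignored by default (errors='ignore').
df_typo = df_raw.rename(columns={'var3': 'v3'})

# With errors='raise', the same typo is reported as a KeyError.
try:
    df_raw.rename(columns={'var3': 'v3'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(df_raw.columns), list(df_new.columns), typo_detected)
```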

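[Individual Practice] Try It Yourself (p. 115)

The variable names in the mpg data are abbreviations of longer words: cty means city fuel economy and hwy means highway fuel economy. We want to change them to words that are easier to understand.

Q1: Load the mpg data and make a copy.

# Store it in the variable mpg_new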
mpg_new = mpg.copy()
mpg_new.head(3)

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact

Q2: Using the copy, rename cty to city and hwy to highway.

# Store the result back in the variable mpg_new
mpg_new = mpg_new.rename(columns = {'cty' : 'city', 
                                    'hwy' : 'highway'})
mpg_new.head(3)

	manufacturer	model	displ	year	cyl	trans	drv	city	highway	fl	category
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact

05-3. Creating Derived Variables (pp. 116-128)

# Create a data frame

import pandas as pd

df = pd.DataFrame({'var1' : [4, 3, 8], 
                   'var2' : [2, 6, 1]})
df

	var1	var2
0	4	2
1	3	6
2	8	1

# Add the values of var1 and var2 and store the result in the variable var_sum
df['var_sum'] = df['var1'] + df['var2']
df

	var1	var2	var_sum
0	4	2	6
1	3	6	9
2	8	1	9

# Compute the mean of var1 and var2 and store it in the variable var_mean
df['var_mean'] = (df['var1'] + df['var2']) / 2
df

	var1	var2	var_sum	var_mean
0	4	2	6	3.0
1	3	6	9	4.5
2	8	1	9	4.5
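The same derived variables can also be written with assign(), which returns a new data frame and leaves the original untouched. This is an alternative to the bracket assignment above, not the textbook's method; the name df_derived is ours.

```python
import pandas as pd

df = pd.DataFrame({'var1': [4, 3, 8],
                   'var2': [2, 6, 1]})

# assign() adds the columns to a copy; df itself is not modified.
df_derived = df.assign(var_sum=df['var1'] + df['var2'],
                       var_mean=(df['var1'] + df['var2']) / 2)

print(df_derived)
```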

[Do it! Practice] Creating a Combined Fuel Economy Variable for mpg (p. 117)

# Load the mpg.csv data and store it in the variable mpg
mpg = pd.read_csv('mpg.csv')
mpg.head(3)

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact

# Add cty and hwy and divide by 2 to get a fuel economy variable that combines both road types; store it in the variable total
mpg['total'] = (mpg['cty'] + mpg['hwy']) / 2

# Print the first 5 rows of mpg
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category	total
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	23.5
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	25.0
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	25.5
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	25.5
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	21.0

# Mean of the combined fuel economy variable: using sum() and len()
sum(mpg['total']) / len(mpg)

20.14957264957265

# Mean of the combined fuel economy variable: using mean()
mpg['total'].mean()

20.14957264957265

[Do it! Practice] Creating a Derived Variable with a Conditional (p. 118)

1. Choosing a threshold

# Print summary statistics of total, the combined fuel economy variable of mpg
mpg['total'].describe()

count    234.000000
mean      20.149573
std        5.050290
min       10.500000
25%       15.500000
50%       20.500000
75%       23.500000
max       39.500000
Name: total, dtype: float64

# Examine the distribution of the cars' combined fuel economy with a histogram: using df.plot.hist()
mpg['total'].plot.hist()

<Axes: ylabel='Frequency'>

2. Creating a pass/fail variable

import numpy as np

# Assign pass if total is 20 or higher and fail otherwise -> store in the variable test
mpg['test'] = np.where(mpg['total'] >= 20, 'pass', 'fail')

# Print the first 5 rows of mpg
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category	total	test
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	23.5	pass
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	25.0	pass
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	25.5	pass
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	25.5	pass
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	21.0	pass
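np.where(condition, a, b) checks the condition row by row and returns a where it is true and b where it is false. A minimal sketch on three made-up totals, including the boundary value 20 (the data frame cars is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical combined fuel economy values; 20.0 sits exactly on the threshold.
cars = pd.DataFrame({'total': [23.5, 19.5, 20.0]})

is_pass = cars['total'] >= 20          # boolean Series: True, False, True
cars['test'] = np.where(is_pass, 'pass', 'fail')

print(cars)
```

Because the comparison is >=, a car with a total of exactly 20 counts as pass.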

3. Checking the number of passing cars with a frequency table

# Using value_counts() on the test column
mpg['test'].value_counts()

test
pass    128
fail    106
Name: count, dtype: int64
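Counts are often easier to compare as proportions, and value_counts() accepts normalize=True for this. A sketch on made-up labels (not the mpg result):

```python
import pandas as pd

# Hypothetical pass/fail labels.
test = pd.Series(['pass', 'fail', 'pass', 'pass'], name='test')

counts = test.value_counts()                 # frequencies, most frequent first
shares = test.value_counts(normalize=True)   # proportions that sum to 1

print(counts)
print(shares)
```

Applied to the frequency table above, 128 of 234 cars (about 55%) fall under pass.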

4. Showing the frequencies in a bar graph

# Assign the pass/fail frequency table to the variable count_test
count_test = mpg['test'].value_counts()
count_test

test
pass    128
fail    106
Name: count, dtype: int64

# Draw a bar graph of the pass/fail frequencies: using plot.bar()
count_test.plot.bar()

<Axes: xlabel='test'>

# Make the axis labels horizontal
count_test.plot.bar(rot = 0)

<Axes: xlabel='test'>

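[Do it! Practice] Using Nested Conditionals (p. 123)

1. Creating a fuel economy grade variable

# Assign grades A, B, C based on total and store them in the variable grade: A for 30 or higher, B for 20 or higher, C for below 20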
mpg['grade'] = np.where(mpg['total'] >= 30, 'A', 
               np.where(mpg['total'] >= 20, 'B', 'C'))
# Print the first 5 rows of mpg
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category	total	test	grade
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	23.5	pass	B
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	25.0	pass	B
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	25.5	pass	B
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	25.5	pass	B
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	21.0	pass	B

2. Examining the fuel economy grades with a frequency table and a bar graph

# Build the grade frequency table and store it in the variable count_grade
count_grade = mpg['grade'].value_counts()
count_grade

grade
B    118
C    106
A     10
Name: count, dtype: int64

# Draw a bar graph of the grade frequencies
count_grade.plot.bar(rot=0)

<Axes: xlabel='grade'>

Sorting the bars alphabetically in the graph

# Sort the grade frequency table alphabetically and store it in the variable count_grade
count_grade = mpg['grade'].value_counts().sort_index()
count_grade

grade
A     10
B    118
C    106
Name: count, dtype: int64

# Draw the count_grade stored above as a bar graph
count_grade.plot.bar(rot=0)

<Axes: xlabel='grade'>

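Creating as many categories as needed

Creating a grade variable with A, B, C, D

# Nest np.where() twice (assigning grades A, B, C, D based on total)
# and store the result in the variable grade2: A for 30 or higher, B for 25 or higher, C for 20 or higher, D for below 20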
mpg['grade2'] = np.where(mpg['total'] >= 30, 'A', 
                np.where(mpg['total'] >= 25, 'B',
                np.where(mpg['total'] >= 20, 'C', 'D')))
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category	total	test	grade	grade2
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	23.5	pass	B	C
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	25.0	pass	B	B
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	25.5	pass	B	B
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	25.5	pass	B	B
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	21.0	pass	B	C

# Sort the grade frequency table alphabetically and store it in the variable grade_2
grade_2 = mpg['grade2'].value_counts().sort_index()
grade_2

grade2
A     10
B     33
C     85
D    106
Name: count, dtype: int64

# Draw the grade_2 stored above as a bar graph
grade_2.plot.bar(rot=0);
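Nesting gets harder to read as categories are added. One alternative, not used in the textbook, is np.select(), which takes a list of conditions and a matching list of values and uses the first condition that holds. A sketch on made-up totals (the data frame cars and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical combined fuel economy values, one per grade.
cars = pd.DataFrame({'total': [31.0, 25.0, 23.5, 19.0]})

# Conditions are checked in order; the first true one decides the grade.
conditions = [cars['total'] >= 30,
              cars['total'] >= 25,
              cars['total'] >= 20]
grades = ['A', 'B', 'C']

# default is used for rows where no condition is true.
cars['grade2'] = np.select(conditions, grades, default='D')

# The nested np.where() form gives the same result.
nested = np.where(cars['total'] >= 30, 'A',
         np.where(cars['total'] >= 25, 'B',
         np.where(cars['total'] >= 20, 'C', 'D')))

print(cars)
```

With np.select(), adding a fifth grade means adding one condition and one label rather than another level of nesting.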

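[Do it! Practice] Creating a Variable from Rows That Match a List (p. 128)

# Create a derived variable that assigns a value when any one of several conditions holds: use "|" (or)
# When passing several conditions to np.where(), each condition must be wrapped in parentheses.
# If category is compact, subcompact, or 2seater -> small; otherwise large. Store the result in the derived variable size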
mpg['size'] = np.where((mpg['category'] == 'compact')|(mpg['category'] == 'subcompact')|(mpg['category'] == '2seater'),
                       'small', 'large')
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category	total	test	grade	grade2	size
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	23.5	pass	B	C	small
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	25.0	pass	B	B	small
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	25.5	pass	B	B	small
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	25.5	pass	B	B	small
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	21.0	pass	B	C	small

# Print the frequency table of the derived variable
mpg['size'].value_counts()

size
large    147
small     87
Name: count, dtype: int64

# Simplify the code that creates the size derived variable by using isin()
mpg['size'] = np.where(mpg['category'].isin(['compact', 'subcompact', '2seater']), 'small', 'large')
mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	category	total	test	grade	grade2	size
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	23.5	pass	B	C	small
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	25.0	pass	B	B	small
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	25.5	pass	B	B	small
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	25.5	pass	B	B	small
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	21.0	pass	B	C	small

# Print the frequency table of the derived variable
mpg['size'].value_counts()

size
large    147
small     87
Name: count, dtype: int64
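isin() returns True for each row whose value appears in the list, so it produces the same mask as the chain of == comparisons joined with |. A sketch on made-up categories (the rows below are hypothetical, and small_types, mask_or, and mask_isin are names of our own):

```python
import numpy as np
import pandas as pd

# Hypothetical vehicle categories.
cars = pd.DataFrame({'category': ['compact', 'suv', '2seater', 'midsize', 'subcompact']})

small_types = ['compact', 'subcompact', '2seater']

# Chain of comparisons joined with | (each comparison in parentheses).
mask_or = ((cars['category'] == 'compact')
           | (cars['category'] == 'subcompact')
           | (cars['category'] == '2seater'))

# The same mask written with isin().
mask_isin = cars['category'].isin(small_types)

cars['size'] = np.where(mask_isin, 'small', 'large')

print(cars)
```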

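[Analysis Challenge]

midwest.csv contains demographic information for 437 areas in the East North Central States of the US. Use midwest.csv to solve the data analysis problems below.
midwest data source: bit.ly/easypy_52

Problem 1

Load midwest.csv into the variable midwest and get to know the characteristics of the data.

# Load the data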
midwest = pd.read_csv('midwest.csv')
midwest.head(3)

	PID	county	state	area	poptotal	popdensity	popwhite	popblack	popamerindian	popasian	...	percollege	percprof	poppovertyknown	percpovertyknown	percbelowpoverty	percchildbelowpovert	percadultpoverty	percelderlypoverty	category
0	561	ADAMS	IL	0.052	66090	1270.961540	63917	1702	98	249	...	19.631392	4.355859	63628	96.274777	13.151443	18.011717	11.009776	12.443812	AAR
1	562	ALEXANDER	IL	0.014	10626	759.000000	7054	3496	19	48	...	11.243308	2.870315	10529	99.087145	32.244278	45.826514	27.385647	25.228976	LHR
2	563	BOND	IL	0.022	14991	681.409091	14477	429	35	16	...	17.033819	4.488572	14235	94.956974	12.068844	14.036061	10.852090	12.697410	AAR

3 rows × 28 columns

# Number of rows and columns
midwest.shape

(437, 28)

# Print the variable attributes
midwest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437 entries, 0 to 436
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PID                   437 non-null    int64  
 1   county                437 non-null    object 
 2   state                 437 non-null    object 
 3   area                  437 non-null    float64
 4   poptotal              437 non-null    int64  
 5   popdensity            437 non-null    float64
 6   popwhite              437 non-null    int64  
 7   popblack              437 non-null    int64  
 8   popamerindian         437 non-null    int64  
 9   popasian              437 non-null    int64  
 10  popother              437 non-null    int64  
 11  percwhite             437 non-null    float64
 12  percblack             437 non-null    float64
 13  percamerindan         437 non-null    float64
 14  percasian             437 non-null    float64
 15  percother             437 non-null    float64
 16  popadults             437 non-null    int64  
 17  perchsd               437 non-null    float64
 18  percollege            437 non-null    float64
 19  percprof              437 non-null    float64
 20  poppovertyknown       437 non-null    int64  
 21  percpovertyknown      437 non-null    float64
 22  percbelowpoverty      437 non-null    float64
 23  percchildbelowpovert  437 non-null    float64
 24  percadultpoverty      437 non-null    float64
 25  percelderlypoverty    437 non-null    float64
 26  inmetro               437 non-null    int64  
 27  category              437 non-null    object 
dtypes: float64(15), int64(10), object(3)
memory usage: 95.7+ KB

# Print summary statistics
midwest.describe()

	PID	area	poptotal	popdensity	popwhite	popblack	popamerindian	popasian	popother	percwhite	...	perchsd	percollege	percprof	poppovertyknown	percpovertyknown	percbelowpoverty	percchildbelowpovert	percadultpoverty	percelderlypoverty	inmetro
count	437.000000	437.000000	4.370000e+02	437.000000	4.370000e+02	4.370000e+02	437.000000	437.000000	437.000000	437.000000	...	437.000000	437.000000	437.000000	4.370000e+02	437.000000	437.000000	437.000000	437.000000	437.000000	437.000000
mean	1437.338673	0.033169	9.613030e+04	3097.742985	8.183992e+04	1.102388e+04	343.109840	1310.464531	1612.931350	95.558441	...	73.965546	18.272736	4.447259	9.364228e+04	97.110267	12.510505	16.447464	10.918798	11.389043	0.343249
std	876.390266	0.014679	2.981705e+05	7664.751786	2.001966e+05	7.895827e+04	868.926751	9518.394189	18526.540699	7.087358	...	5.843177	6.261908	2.408427	2.932351e+05	2.749863	5.150155	7.228634	5.109166	3.661259	0.475338
min	561.000000	0.005000	1.701000e+03	85.050000	4.160000e+02	0.000000e+00	4.000000	0.000000	0.000000	10.694087	...	46.912261	7.336108	0.520291	1.696000e+03	80.902441	2.180168	1.918955	1.938504	3.547067	0.000000
25%	670.000000	0.024000	1.884000e+04	622.407407	1.863000e+04	2.900000e+01	44.000000	35.000000	20.000000	94.886032	...	71.325329	14.113725	2.997957	1.836400e+04	96.894572	9.198715	11.624088	7.668009	8.911763	0.000000
50%	1221.000000	0.030000	3.532400e+04	1156.208330	3.447100e+04	2.010000e+02	94.000000	102.000000	66.000000	98.032742	...	74.246891	16.797562	3.814239	3.378800e+04	98.169562	11.822313	15.270164	10.007610	10.869119	0.000000
75%	2059.000000	0.038000	7.565100e+04	2330.000000	7.296800e+04	1.291000e+03	288.000000	401.000000	345.000000	99.074935	...	77.195345	20.549893	4.949324	7.284000e+04	98.598636	15.133226	20.351878	13.182182	13.412162	1.000000
max	3052.000000	0.110000	5.105067e+06	88018.396600	3.204947e+06	1.317147e+06	10289.000000	188565.000000	384119.000000	99.822821	...	88.898674	48.078510	20.791321	5.023523e+06	99.860384	48.691099	64.308477	43.312464	31.161972	1.000000

8 rows × 25 columns

Problem 2

Rename the variable poptotal (total population) to total and the variable popasian (Asian population) to asian.

# Copy midwest and store the copy in the variable midwest_new
midwest_new = midwest.copy()
midwest_new.head(3)

	PID	county	state	area	poptotal	popdensity	popwhite	popblack	popamerindian	popasian	...	percollege	percprof	poppovertyknown	percpovertyknown	percbelowpoverty	percchildbelowpovert	percadultpoverty	percelderlypoverty	category
0	561	ADAMS	IL	0.052	66090	1270.961540	63917	1702	98	249	...	19.631392	4.355859	63628	96.274777	13.151443	18.011717	11.009776	12.443812	AAR
1	562	ALEXANDER	IL	0.014	10626	759.000000	7054	3496	19	48	...	11.243308	2.870315	10529	99.087145	32.244278	45.826514	27.385647	25.228976	LHR
2	563	BOND	IL	0.022	14991	681.409091	14477	429	35	16	...	17.033819	4.488572	14235	94.956974	12.068844	14.036061	10.852090	12.697410	AAR

3 rows × 28 columns

# Rename poptotal to total and popasian to asian, then store the result back in the variable midwest_new
midwest_new = midwest_new.rename(columns = {'poptotal' : 'total', 
                                            'popasian' : 'asian'})
midwest_new.head(3)

	PID	county	state	area	total	popdensity	popwhite	popblack	popamerindian	asian	...	percollege	percprof	poppovertyknown	percpovertyknown	percbelowpoverty	percchildbelowpovert	percadultpoverty	percelderlypoverty	category
0	561	ADAMS	IL	0.052	66090	1270.961540	63917	1702	98	249	...	19.631392	4.355859	63628	96.274777	13.151443	18.011717	11.009776	12.443812	AAR
1	562	ALEXANDER	IL	0.014	10626	759.000000	7054	3496	19	48	...	11.243308	2.870315	10529	99.087145	32.244278	45.826514	27.385647	25.228976	LHR
2	563	BOND	IL	0.022	14991	681.409091	14477	429	35	16	...	17.033819	4.488572	14235	94.956974	12.068844	14.036061	10.852090	12.697410	AAR

3 rows × 28 columns

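Problem 3

Using the total and asian variables, add a derived variable for the 'percentage of the Asian population relative to the total population', then draw a histogram to examine its distribution.

# Store the Asian population percentage in the derived variable asian_percent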
midwest_new['asian_percent'] = midwest_new['asian'] / midwest_new['total'] * 100
midwest_new.head(3)

	PID	county	state	area	total	popdensity	popwhite	popblack	popamerindian	asian	...	percprof	poppovertyknown	percpovertyknown	percbelowpoverty	percchildbelowpovert	percadultpoverty	percelderlypoverty	category	asian_percent
0	561	ADAMS	IL	0.052	66090	1270.961540	63917	1702	98	249	...	4.355859	63628	96.274777	13.151443	18.011717	11.009776	12.443812	AAR	0.376759
1	562	ALEXANDER	IL	0.014	10626	759.000000	7054	3496	19	48	...	2.870315	10529	99.087145	32.244278	45.826514	27.385647	25.228976	LHR	0.451722
2	563	BOND	IL	0.022	14991	681.409091	14477	429	35	16	...	4.488572	14235	94.956974	12.068844	14.036061	10.852090	12.697410	AAR	0.106731

3 rows × 29 columns

# Examine the distribution of the Asian population percentage with a histogram
midwest_new['asian_percent'].plot.hist();

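Problem 4

Compute the overall mean of the Asian population percentage, then create a derived variable that assigns 'large' when the value exceeds the mean and 'small' otherwise.

# Compute the overall mean of the Asian population percentage and store it in a variable named average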
average = midwest_new['asian_percent'].mean()
average

0.4872461834357345

# Assign large when the Asian population percentage exceeds the overall mean and small otherwise; store the result in the derived variable group
midwest_new['group'] = np.where(midwest_new['asian_percent'] > average, 'large', 'small')
midwest_new.head(3)

	PID	county	state	area	total	popdensity	popwhite	popblack	popamerindian	asian	...	percpovertyknown	percbelowpoverty	percchildbelowpovert	percadultpoverty	percelderlypoverty	category	asian_percent	average	group
0	561	ADAMS	IL	0.052	66090	1270.961540	63917	1702	98	249	...	96.274777	13.151443	18.011717	11.009776	12.443812	AAR	0.376759	0.487246	small
1	562	ALEXANDER	IL	0.014	10626	759.000000	7054	3496	19	48	...	99.087145	32.244278	45.826514	27.385647	25.228976	LHR	0.451722	0.487246	small
2	563	BOND	IL	0.022	14991	681.409091	14477	429	35	16	...	94.956974	12.068844	14.036061	10.852090	12.697410	AAR	0.106731	0.487246	small

3 rows × 31 columns

Problem 5

Build a frequency table and a frequency bar graph to check how many areas fall under 'large' and 'small'.

# Store the frequency table in the variable count_group
count_group = midwest_new['group'].value_counts()
count_group

group
small    318
large    119
Name: count, dtype: int64

# Draw the frequency bar graph
count_group.plot.bar(rot=0);
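Problems 2 to 5 form one pipeline: rename, compute a percentage, compare with the mean, and count. The sketch below runs the same steps on three made-up areas so that each intermediate value can be checked by hand; the numbers are hypothetical and are not taken from midwest.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical areas; not rows of midwest.csv.
areas = pd.DataFrame({'poptotal': [1000, 2000, 4000],
                      'popasian': [10, 10, 100]})

# Problem 2: rename the variables.
areas = areas.rename(columns={'poptotal': 'total', 'popasian': 'asian'})

# Problem 3: percentage of the Asian population (about 1.0, 0.5, 2.5).
areas['asian_percent'] = areas['asian'] / areas['total'] * 100

# Problem 4: compare each percentage with the overall mean (about 1.33).
average = areas['asian_percent'].mean()
areas['group'] = np.where(areas['asian_percent'] > average, 'large', 'small')

# Problem 5: frequency table of the groups.
count_group = areas['group'].value_counts()
print(count_group)
```

Only the third area exceeds the mean, so the table shows two small areas and one large one.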

The End of Note
