분석하고싶은코코
Jeju Traffic Volume Prediction (1): EDA
EDA for measuring Jeju traffic volume
Executing every cell made the notebook too large to upload, so only the key cells were run.
In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import warnings
from collections import Counter
train = pd.read_csv('/Users/seokholee/Downloads/open/train.csv')
test = pd.read_csv('/Users/seokholee/Downloads/open/test.csv')
submission = pd.read_csv('/Users/seokholee/Downloads/open/sample_submission.csv')
In [2]:
pd.set_option("display.max_seq_items", None)
In [3]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4701217 entries, 0 to 4701216
Data columns (total 24 columns):
 #   Column                 Dtype
---  ------                 -----
 0   id                     object
 1   base_date              int64
 2   day_of_week            object
 3   base_hour              int64
 4   road_in_use            int64
 5   lane_count             int64
 6   road_rating            int64
 7   road_name              object
 8   multi_linked           int64
 9   connect_code           int64
 10  maximum_speed_limit    float64
 11  vehicle_restricted     float64
 12  weight_restricted      float64
 13  height_restricted      float64
 14  road_type              int64
 15  start_node_name        object
 16  start_latitude         float64
 17  start_longitude        float64
 18  start_turn_restricted  object
 19  end_node_name          object
 20  end_latitude           float64
 21  end_longitude          float64
 22  end_turn_restricted    object
 23  target                 float64
dtypes: float64(9), int64(8), object(7)
memory usage: 860.8+ MB
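The `info()` output above reports more than 860 MB for the training frame. If memory becomes a constraint, one option is to shrink the dtypes. The following is a minimal sketch under assumptions: the helper name `shrink_dtypes`, the `max_categories` threshold, and the toy frame are all hypothetical and not part of the original notebook. Note that downcasting floats to `float32` reduces precision, which may matter for the latitude and longitude columns.

```python
import pandas as pd

def shrink_dtypes(df: pd.DataFrame, max_categories: int = 50) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # int64 -> smallest integer type that can hold the values
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 (loses precision beyond ~7 significant digits)
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_string_dtype(out[col]) and out[col].nunique() <= max_categories:
            # low-cardinality text columns are stored once per distinct value
            out[col] = out[col].astype("category")
    return out

# Toy frame with a few columns named like the real data (values are illustrative).
demo = pd.DataFrame({
    "base_hour": [17, 21, 7, 13],
    "maximum_speed_limit": [60.0, 50.0, 80.0, 60.0],
    "day_of_week": ["목", "목", "일", "금"],
})
small = shrink_dtypes(demo)
print(small.dtypes)
```

On the real data the same call would be `train = shrink_dtypes(train)`; comparing `train.memory_usage(deep=True).sum()` before and after shows how much is saved.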
In [4]:
train.isnull().sum()
Out[4]:
id                       0
base_date                0
day_of_week              0
base_hour                0
road_in_use              0
lane_count               0
road_rating              0
road_name                0
multi_linked             0
connect_code             0
maximum_speed_limit      0
vehicle_restricted       0
weight_restricted        0
height_restricted        0
road_type                0
start_node_name          0
start_latitude           0
start_longitude          0
start_turn_restricted    0
end_node_name            0
end_latitude             0
end_longitude            0
end_turn_restricted      0
target                   0
dtype: int64
In [5]:
train.head(10)
Out[5]:
| | id | base_date | day_of_week | base_hour | road_in_use | lane_count | road_rating | road_name | multi_linked | connect_code | ... | road_type | start_node_name | start_latitude | start_longitude | start_turn_restricted | end_node_name | end_latitude | end_longitude | end_turn_restricted | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TRAIN_0000000 | 20220623 | 목 | 17 | 0 | 1 | 106 | 지방도1112호선 | 0 | 0 | ... | 3 | 제3교래교 | 33.427747 | 126.662612 | 없음 | 제3교래교 | 33.427749 | 126.662335 | 없음 | 52.0 |
| 1 | TRAIN_0000001 | 20220728 | 목 | 21 | 0 | 2 | 103 | 일반국도11호선 | 0 | 0 | ... | 0 | 광양사거리 | 33.500730 | 126.529107 | 있음 | KAL사거리 | 33.504811 | 126.526240 | 없음 | 30.0 |
| 2 | TRAIN_0000002 | 20211010 | 일 | 7 | 0 | 2 | 103 | 일반국도16호선 | 0 | 0 | ... | 0 | 창고천교 | 33.279145 | 126.368598 | 없음 | 상창육교 | 33.280072 | 126.362147 | 없음 | 61.0 |
| 3 | TRAIN_0000003 | 20220311 | 금 | 13 | 0 | 2 | 107 | 태평로 | 0 | 0 | ... | 0 | 남양리조트 | 33.246081 | 126.567204 | 없음 | 서현주택 | 33.245565 | 126.566228 | 없음 | 20.0 |
| 4 | TRAIN_0000004 | 20211005 | 화 | 8 | 0 | 2 | 103 | 일반국도12호선 | 0 | 0 | ... | 0 | 애월샷시 | 33.462214 | 126.326551 | 없음 | 애월입구 | 33.462677 | 126.330152 | 없음 | 38.0 |
| 5 | TRAIN_0000005 | 20210913 | 월 | 7 | 0 | 2 | 107 | 경찰로 | 0 | 0 | ... | 0 | 시청입구2 | 33.249949 | 126.505664 | 없음 | 서호2차현대맨션203동 | 33.252183 | 126.506069 | 없음 | 28.0 |
| 6 | TRAIN_0000006 | 20220106 | 목 | 0 | 0 | 2 | 107 | - | 0 | 0 | ... | 0 | 가동 | 33.418412 | 126.268029 | 없음 | 나동 | 33.414175 | 126.269378 | 없음 | 39.0 |
| 7 | TRAIN_0000007 | 20211213 | 월 | 16 | 0 | 2 | 107 | 외도천교 | 0 | 0 | ... | 3 | 외도천교 | 33.482392 | 126.441622 | 없음 | 외도천교 | 33.482332 | 126.442266 | 없음 | 28.0 |
| 8 | TRAIN_0000008 | 20211004 | 월 | 15 | 0 | 2 | 107 | 경찰로 | 0 | 0 | ... | 0 | 신성교회 | 33.253074 | 126.506393 | 없음 | 서호2차현대맨션203동 | 33.252183 | 126.506069 | 없음 | 14.0 |
| 9 | TRAIN_0000009 | 20211208 | 수 | 2 | 0 | 1 | 103 | 일반국도16호선 | 0 | 0 | ... | 0 | 양수장 | 33.361717 | 126.766958 | 없음 | 제2가시교 | 33.364336 | 126.769409 | 없음 | 52.0 |
10 rows × 24 columns
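In the preview above, `base_date` is stored as an integer such as 20220623, so numeric operations treat it as one large number rather than a date. A minimal sketch of parsing it into a real datetime, assuming the `YYYYMMDD` layout seen in the sample rows; the derived column names `date`, `month`, and `weekday` are hypothetical:

```python
import pandas as pd

# Three base_date values copied from the preview rows above.
demo = pd.DataFrame({"base_date": [20220623, 20220728, 20211010]})

# Convert the integer to text first so the format string can be applied.
demo["date"] = pd.to_datetime(demo["base_date"].astype(str), format="%Y%m%d")
demo["month"] = demo["date"].dt.month
demo["weekday"] = demo["date"].dt.dayofweek  # Monday=0 ... Sunday=6
print(demo)
```

The derived `weekday` values can be cross-checked against the `day_of_week` column that the dataset already provides.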
In [6]:
for col in list(train):
    remainders = train.drop_duplicates([col]).shape[0]
    if remainders != train.shape[0]:
        print(f'Column with duplicate values : {col}')
'''
DataFrame.duplicated : returns a boolean Series marking each duplicated row as True/False
DataFrame.drop_duplicates : returns a DataFrame with the duplicate rows removed
'''
Column with duplicate values : base_date
Column with duplicate values : day_of_week
Column with duplicate values : base_hour
Column with duplicate values : road_in_use
Column with duplicate values : lane_count
Column with duplicate values : road_rating
Column with duplicate values : road_name
Column with duplicate values : multi_linked
Column with duplicate values : connect_code
Column with duplicate values : maximum_speed_limit
Column with duplicate values : vehicle_restricted
Column with duplicate values : weight_restricted
Column with duplicate values : height_restricted
Column with duplicate values : road_type
Column with duplicate values : start_node_name
Column with duplicate values : start_latitude
Column with duplicate values : start_longitude
Column with duplicate values : start_turn_restricted
Column with duplicate values : end_node_name
Column with duplicate values : end_latitude
Column with duplicate values : end_longitude
Column with duplicate values : end_turn_restricted
Column with duplicate values : target
Out[6]:
'\nDataFrame.duplicated : returns a boolean Series marking each duplicated row as True/False\nDataFrame.drop_duplicates : returns a DataFrame with the duplicate rows removed\n'
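The note in the cell above contrasts a boolean duplicate mask with a de-duplicated frame. The sketch below shows both on a toy frame (the values are made up for illustration) and also uses `nunique()` as a cheaper way to ask the same question as the loop, since it avoids building a de-duplicated copy of the frame for every column. One caveat: `nunique()` ignores missing values by default, which is harmless here because the null check earlier found none.

```python
import pandas as pd

demo = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "base_hour": [7, 7, 13, 21],
})

# duplicated(): True for every repeat after the first occurrence.
dup_mask = demo.duplicated(subset=["base_hour"])

# drop_duplicates(): keeps only the first row of each repeated value.
deduped = demo.drop_duplicates(subset=["base_hour"])

# A column contains repeats when it has fewer distinct values than rows.
unique_counts = demo.nunique()
cols_with_repeats = [c for c in demo.columns if unique_counts[c] < len(demo)]

print(dup_mask.tolist())
print(len(deduped))
print(cols_with_repeats)
```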
In [7]:
train.corr()
Out[7]:
| | base_date | base_hour | road_in_use | lane_count | road_rating | multi_linked | connect_code | maximum_speed_limit | vehicle_restricted | weight_restricted | height_restricted | road_type | start_latitude | start_longitude | end_latitude | end_longitude | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| base_date | 1.000000 | -0.008645 | -0.001800 | 0.011463 | 0.018547 | 0.000832 | -0.010633 | -0.018713 | NaN | -0.011030 | NaN | -0.004599 | -0.016818 | -0.004954 | -0.016786 | -0.004972 | -0.033997 |
| base_hour | -0.008645 | 1.000000 | -0.001188 | -0.029194 | 0.031658 | 0.005711 | -0.002649 | -0.036756 | NaN | -0.003231 | NaN | -0.007880 | -0.021599 | -0.011478 | -0.021597 | -0.011489 | -0.159407 |
| road_in_use | -0.001800 | -0.001188 | 1.000000 | 0.008773 | -0.033396 | -0.000806 | -0.001880 | -0.003815 | NaN | -0.014873 | NaN | -0.018760 | -0.027831 | 0.018197 | -0.028571 | 0.018275 | 0.026095 |
| lane_count | 0.011463 | -0.029194 | 0.008773 | 1.000000 | -0.095717 | -0.026555 | -0.029290 | 0.384002 | NaN | -0.177224 | NaN | -0.050715 | 0.182674 | -0.094806 | 0.182330 | -0.094732 | -0.144256 |
| road_rating | 0.018547 | 0.031658 | -0.033396 | -0.095717 | 1.000000 | 0.024218 | -0.054160 | -0.327474 | NaN | -0.118630 | NaN | -0.125618 | -0.204793 | 0.007401 | -0.204843 | 0.007386 | -0.261693 |
| multi_linked | 0.000832 | 0.005711 | -0.000806 | -0.026555 | 0.024218 | 1.000000 | -0.001111 | -0.020245 | NaN | -0.008790 | NaN | 0.042977 | -0.014906 | 0.026895 | -0.014907 | 0.026896 | -0.008408 |
| connect_code | -0.010633 | -0.002649 | -0.001880 | -0.029290 | -0.054160 | -0.001111 | 1.000000 | -0.015190 | NaN | -0.020491 | NaN | -0.025846 | 0.036623 | -0.045695 | 0.037163 | -0.044853 | 0.048348 |
| maximum_speed_limit | -0.018713 | -0.036756 | -0.003815 | 0.384002 | -0.327474 | -0.020245 | -0.015190 | 1.000000 | NaN | 0.085080 | NaN | 0.059511 | 0.253147 | -0.033018 | 0.252958 | -0.032907 | 0.425715 |
| vehicle_restricted | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| weight_restricted | -0.011030 | -0.003231 | -0.014873 | -0.177224 | -0.118630 | -0.008790 | -0.020491 | 0.085080 | NaN | 1.000000 | NaN | 0.792803 | -0.128291 | 0.034926 | -0.128305 | 0.034915 | 0.294092 |
| height_restricted | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| road_type | -0.004599 | -0.007880 | -0.018760 | -0.050715 | -0.125618 | 0.042977 | -0.025846 | 0.059511 | NaN | 0.792803 | NaN | 1.000000 | -0.043420 | 0.033684 | -0.043430 | 0.033664 | 0.200840 |
| start_latitude | -0.016818 | -0.021599 | -0.027831 | 0.182674 | -0.204793 | -0.014906 | 0.036623 | 0.253147 | NaN | -0.128291 | NaN | -0.043420 | 1.000000 | 0.127042 | 0.999106 | 0.127005 | 0.036280 |
| start_longitude | -0.004954 | -0.011478 | 0.018197 | -0.094806 | 0.007401 | 0.026895 | -0.045695 | -0.033018 | NaN | 0.034926 | NaN | 0.033684 | 0.127042 | 1.000000 | 0.126900 | 0.999219 | -0.001168 |
| end_latitude | -0.016786 | -0.021597 | -0.028571 | 0.182330 | -0.204843 | -0.014907 | 0.037163 | 0.252958 | NaN | -0.128305 | NaN | -0.043430 | 0.999106 | 0.126900 | 1.000000 | 0.127098 | 0.036139 |
| end_longitude | -0.004972 | -0.011489 | 0.018275 | -0.094732 | 0.007386 | 0.026896 | -0.044853 | -0.032907 | NaN | 0.034915 | NaN | 0.033664 | 0.127005 | 0.999219 | 0.127098 | 1.000000 | -0.001000 |
| target | -0.033997 | -0.159407 | 0.026095 | -0.144256 | -0.261693 | -0.008408 | 0.048348 | 0.425715 | NaN | 0.294092 | NaN | 0.200840 | 0.036280 | -0.001168 | 0.036139 | -0.001000 | 1.000000 |
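The all-NaN rows and columns for `vehicle_restricted` and `height_restricted` in the matrix above are what Pearson correlation produces for a column with zero variance, which is consistent with those columns holding a single value. A minimal sketch, on a made-up toy frame, of detecting such constant columns and dropping them before computing correlations. Selecting numeric columns explicitly is an assumption made here so the snippet behaves the same across pandas versions; recent versions no longer skip text columns silently in `corr()`.

```python
import pandas as pd

demo = pd.DataFrame({
    "lane_count": [1, 2, 2, 3],
    "vehicle_restricted": [0.0, 0.0, 0.0, 0.0],  # constant column
    "target": [52.0, 30.0, 61.0, 20.0],
    "road_name": ["a", "b", "c", "d"],           # text column, excluded below
})

# Keep only numeric columns before calling corr().
numeric = demo.select_dtypes(include="number")

# A column with a single distinct value has zero variance.
constant_cols = [c for c in numeric.columns if numeric[c].nunique() <= 1]

corr_all = numeric.corr()                               # contains NaN for the constant column
corr_clean = numeric.drop(columns=constant_cols).corr() # NaN-free
print(constant_cols)
print(corr_clean)
```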
In [8]:
sns.heatmap(train.corr())
Out[8]:
<AxesSubplot:>
Helper functions to make plotting easier
In [9]:
def print_mode(df, col):
    # Use collections.Counter to find the most common values
    cnt = Counter(df[col])
    list_cnt = cnt.most_common(3)
    for idx, value in enumerate(list_cnt):
        print(f'{col} mode : rank {idx+1} : {value[0]} & {value[-1]} rows')

def print_statics(df, col):
    # Local names avoid shadowing the built-in max() and min()
    col_max = df[col].max()
    col_min = df[col].min()
    col_mean = df[col].mean()
    col_median = df[col].median()
    print(f'{col} MAX : {col_max}')
    print(f'{col} MIN : {col_min}')
    print(f'{col} MEAN : {col_mean}')
    print(f'{col} MEDIAN : {col_median}')
    print_mode(df, col)
In [10]:
def identify_hist(df, col):
    # Plot the frame that was passed in, not the global train
    sns.histplot(data=df[col], kde=True)
    print_statics(df, col)

def identify_count(df, col):
    print(df[col].value_counts())
    sns.countplot(data=df, x=col)
    plt.show()
In [11]:
def value_hist(df, col, target='target'):
    for value in df[col].unique():
        cond = (df[col] == value)
        cond_df = df.loc[cond]
        print(f'{value} count : {cond_df.shape[0]}')
        print_statics(cond_df, target)
        fig, ax = plt.subplots(ncols=2, figsize=(13, 6))
        # Use the target argument instead of a hard-coded column name
        sns.histplot(data=cond_df, x=target, ax=ax[0])
        ax[0].set_title(f"{col}'s {value} histogram")
        ax[0].set_xticks(range(0, int(df[target].max() + 1), 20))
        sns.boxplot(data=cond_df, x=target, ax=ax[1])
        ax[1].set_title(f"{col}'s {value} boxplot")
        plt.show()

value_hist(train, 'day_of_week')
목 count : 674070
target MAX : 95.0
target MIN : 1.0
target MEAN : 42.76834453395048
target MEDIAN : 43.0
target mode : rank 1 : 48.0 & 16247 rows
target mode : rank 2 : 49.0 & 15861 rows
target mode : rank 3 : 50.0 & 15737 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 47785 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 47785 missing from current font.
  font.set_text(s, 0, flags=flags)
일 count : 673632
target MAX : 103.0
target MIN : 1.0
target MEAN : 43.17929967697496
target MEDIAN : 43.0
target mode : rank 1 : 48.0 & 16700 rows
target mode : rank 2 : 49.0 & 16071 rows
target mode : rank 3 : 47.0 & 15885 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 51068 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 51068 missing from current font.
  font.set_text(s, 0, flags=flags)
금 count : 684024
target MAX : 112.0
target MIN : 1.0
target MEAN : 42.450327766277205
target MEDIAN : 42.0
target mode : rank 1 : 49.0 & 15937 rows
target mode : rank 2 : 48.0 & 15819 rows
target mode : rank 3 : 54.0 & 15703 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 44552 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 44552 missing from current font.
  font.set_text(s, 0, flags=flags)
화 count : 662498
target MAX : 113.0
target MIN : 1.0
target MEAN : 42.699197582483265
target MEDIAN : 43.0
target mode : rank 1 : 48.0 & 15599 rows
target mode : rank 2 : 52.0 & 15574 rows
target mode : rank 3 : 49.0 & 15337 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 54868 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 54868 missing from current font.
  font.set_text(s, 0, flags=flags)
월 count : 661643
target MAX : 95.0
target MIN : 1.0
target MEAN : 42.76136526797684
target MEDIAN : 43.0
target mode : rank 1 : 48.0 & 15644 rows
target mode : rank 2 : 49.0 & 15435 rows
target mode : rank 3 : 50.0 & 15371 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 50900 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 50900 missing from current font.
  font.set_text(s, 0, flags=flags)
수 count : 675583
target MAX : 96.0
target MIN : 1.0
target MEAN : 42.77037166417746
target MEDIAN : 43.0
target mode : rank 1 : 50.0 & 16005 rows
target mode : rank 2 : 49.0 & 15848 rows
target mode : rank 3 : 48.0 & 15806 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 49688 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 49688 missing from current font.
  font.set_text(s, 0, flags=flags)
토 count : 669767
target MAX : 90.0
target MIN : 1.0
target MEAN : 42.89411690931324
target MEDIAN : 43.0
target mode : rank 1 : 48.0 & 16229 rows
target mode : rank 2 : 49.0 & 15755 rows
target mode : rank 3 : 47.0 & 15361 rows
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:240: RuntimeWarning: Glyph 53664 missing from current font.
  font.set_text(s, 0.0, flags=flags)
/Users/seokholee/opt/anaconda3/lib/python3.9/site-packages/matplotlib/backends/backend_agg.py:203: RuntimeWarning: Glyph 53664 missing from current font.
  font.set_text(s, 0, flags=flags)
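The `Glyph ... missing from current font` warnings above appear because the plot titles contain Hangul day names and the active matplotlib font has no glyphs for them. One possible fix, sketched here under assumptions, is to switch to an installed font that covers Hangul. The candidate names below are common choices on macOS, Windows, and Linux but are only guesses about what is installed; the helper `pick_hangul_font` is hypothetical.

```python
from matplotlib import font_manager
import matplotlib.pyplot as plt

def pick_hangul_font(candidates=("AppleGothic", "Malgun Gothic", "NanumGothic")):
    """Hypothetical helper: return the first candidate font that is installed, else None."""
    installed = {entry.name for entry in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return None

font = pick_hangul_font()
if font is not None:
    # Use the Hangul-capable font for all subsequent figures.
    plt.rcParams["font.family"] = font
# Many CJK fonts lack the Unicode minus sign, so fall back to the ASCII hyphen.
plt.rcParams["axes.unicode_minus"] = False
print(f"Using font: {font}")
```

If none of the candidates is installed, `font` stays `None` and the warnings will persist until a suitable font is added to the system.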
In [12]:
identify_count(train, 'base_hour')
15    214541
13    214297
14    214182
12    211833
19    209870
11    208515
16    208420
17    208377
18    207500
10    206316
9     205327
20    205059
21    203585
8     201875
22    200629
7     199061
6     189418
23    184229
1     182353
5     181128
2     169322
4     165284
3     155938
0     154158
Name: base_hour, dtype: int64
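`value_counts()` orders the hours by frequency, which makes the daily pattern hard to read. A small sketch, using made-up toy values, of listing the counts in clock order with `sort_index()` and pairing them with the mean `target` per hour:

```python
import pandas as pd

demo = pd.DataFrame({
    "base_hour": [15, 3, 15, 0, 3, 15],
    "target": [40.0, 55.0, 42.0, 60.0, 57.0, 38.0],
})

# Row counts ordered by hour instead of by frequency.
by_hour = demo["base_hour"].value_counts().sort_index()

# Count and mean target per hour in one table.
hourly = demo.groupby("base_hour")["target"].agg(["count", "mean"])

print(by_hour)
print(hourly)
```

On the real data, replacing `demo` with `train` gives one row per hour from 0 to 23.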
In [ ]:
identify_count(train,'road_in_use')
In [ ]:
value_hist(train,'road_in_use')
In [ ]:
identify_count(train, 'lane_count')
In [ ]:
value_hist(train, 'lane_count')
In [ ]:
identify_count(train, 'road_rating')
In [ ]:
value_hist(train, 'road_rating')
In [ ]:
identify_count(train, 'multi_linked')
In [ ]:
value_hist(train, 'multi_linked')
In [ ]:
identify_count(train, 'connect_code')
In [ ]:
value_hist(train, 'connect_code')
In [ ]:
identify_count(train, 'maximum_speed_limit')
In [ ]:
value_hist(train, 'maximum_speed_limit')
In [ ]:
identify_count(train, 'weight_restricted')
In [ ]:
value_hist(train, 'weight_restricted')
In [ ]:
identify_count(train, 'height_restricted')
In [ ]:
value_hist(train, 'height_restricted')
In [ ]:
identify_count(train, 'road_type')
In [ ]:
value_hist(train, 'road_type')
In [ ]:
identify_count(train, 'start_turn_restricted')
In [ ]:
value_hist(train, 'start_turn_restricted')
In [ ]:
identify_count(train, 'end_turn_restricted')
In [ ]:
value_hist(train, 'end_turn_restricted')
In [ ]:
identify_count(train, 'vehicle_restricted')
In [ ]:
value_hist(train, 'vehicle_restricted')
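The cells above repeat the same two calls for every column. When only the numbers are needed and not the plots, a compact alternative is a single `groupby` summary per column. This is a sketch, not the notebook's method: the helper `summarize_by` and the toy frame are hypothetical.

```python
import pandas as pd

def summarize_by(df, cols, target="target"):
    """Hypothetical helper: per-category statistics of target for each column in cols."""
    stats = ["count", "min", "max", "mean", "median"]
    return {col: df.groupby(col)[target].agg(stats) for col in cols}

# Toy frame with column names borrowed from the dataset (values are illustrative).
demo = pd.DataFrame({
    "road_type": [0, 0, 3, 3],
    "lane_count": [1, 2, 2, 2],
    "target": [30.0, 50.0, 20.0, 40.0],
})

summaries = summarize_by(demo, ["road_type", "lane_count"])
for col, table in summaries.items():
    print(f"--- {col} ---")
    print(table)
```

On the real data the call would look like `summarize_by(train, ['road_in_use', 'lane_count', 'road_rating'])`, extended with whichever columns are of interest.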