While looking around for a data set to use, I stumbled upon this Kaggle entry on metal bands. Being a metal head since middle school, it piqued my interest like no other data set I've seen so far. After downloading the two data sets from here:
Data | Link | Stats | Date Accessed |
---|---|---|---|
Encyclopedia Metallum | Metal-Archives | 119474 Bands | 12/5/17 |
Wikipedia | Population Data | 240 Countries | 12/15/17 |
When I used to look up bands, I used the Encyclopedia Metallum, and the database had just about every obscure band
you could find. Since they let users contribute to the database (the acceptance process is heavily monitored),
small bands can put themselves into the database. Instead of using the Kaggle data, I decided to scrape the Metal
Archives myself since I felt that 5,000 bands was lacking for an analysis of the entire world. For population data,
I opted for Wikipedia since most of their population information was based on a real-time
world population clock.
import time
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
import html5lib
import json
import pandas as pd
import numpy as np
import csv
import re
To scrape the countries page of Encyclopedia Metallum, we need to make an HTTP request to the URL. If the request succeeds and a connection is created, the response should be:
<Response [200]>
encyclopaedia_metallum_country_url = 'https://www.metal-archives.com/browse/country'
countries_request = requests.get(encyclopaedia_metallum_country_url)
print(countries_request)
To convert the request response into usable HTML tags easily, BeautifulSoup
is the way to go. Make sure you have these imported:
from bs4 import BeautifulSoup
import html5lib
The html5lib library helps parse the response as HTML, thus allowing the use of .find_all(tag). Your goal is to look at the HTML, figure out which tags have the information you want, and specifically target those tags to filter out as much as possible.
Two common ways to narrow down tags:

1) Filter by the presence of an href or class attribute. For tags like:

<a href="https://www.metal-archives.com/lists/AF">Afghanistan</a>
<div class="countryCol">

use:

.find_all('a', href = True)
.find_all('div', class_ = True)

2) Specify attribute values with dictionaries. For a tag like:

<div class="clear loading">

use:

.find_all('div', {'class': 'clear loading'})
soup = BeautifulSoup(countries_request.content, 'html5lib')
countries_soup = soup.find_all('div',{'class': 'countryCol'})
countries_soup
As you can see from above, all the countries are in this tag format:
<a href="https://www.metal-archives.com/lists/(COUNTRY ID)">(COUNTRY)</a>
I simply extracted the countries by looping through every country tag and using a regular expression's (regex) .groups()
capability. Don't forget to import the regex library before use. Regexes are pretty useful; take a look here for the official documentation.
import re
re.compile(r'<a href="https://www.metal-archives.com/lists/(.+)">(.+)</a>')
With parentheses around the parts I want to extract, .groups() will return a tuple in the form of (COUNTRY ID, COUNTRY).
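To see what .groups() returns, here's a tiny standalone check using the Afghanistan tag shown earlier:

```python
import re

# One of the tags scraped from the countries page
tag = '<a href="https://www.metal-archives.com/lists/AF">Afghanistan</a>'
country_regex = re.compile(r'<a href="https://www.metal-archives.com/lists/(.+)">(.+)</a>')
print(country_regex.match(tag).groups())  # → ('AF', 'Afghanistan')
```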
# Create country dictionary where the country ID is the key and the country name is the value
all_countries = {}
country_regex = re.compile(r'<a href="https://www.metal-archives.com/lists/(.+)">(.+)</a>')
for col in countries_soup:
    for a_href in col.find_all('a'):  # Further filtering tags down to only country tags
        tag = str(a_href).strip()
        country_id, country = country_regex.match(tag).groups()  # (ID, country)
        all_countries[country_id] = country
all_countries
This part is somewhat complex because of the way the Encyclopedia Metallum is built. When you click on a country, you will see all the bands for that specific country. But if you BeautifulSoup the HTML code, you won't see any of the listed bands, because the webpage is rendered with JavaScript, as you can see from the inspect-element view I captured through Internet Explorer (IE).

1) Source for deriving the solutions to #1 and #2: jonchar on GitHub wrote his own scraper for the Encyclopedia Metallum.
2) The solution to #3 was from this Stack Overflow post, and links to HTTP headers are here.

In the inspect-element view you can see the JavaScript command to the AJAX database (this is where all their data on bands is stored). To specify which country you want to see, you simply put in the country's ID, which we extracted earlier.
var grid = createGrid("#bandListCountry", 500, 'browse/ajax-country/c/(COUNTRY ID)/json/1/', {aoColumns: [ null, null, null, { bSortable: false, sWidth: '80px'} ] });
Instead of doing an HTTP request like usual, we have to add the AJAX call to the end of the URL. The AJAX database doesn't return HTML; it gives you data in JSON form, essentially Python dictionary form. Make sure you have the JSON library imported.
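As a quick sketch of that conversion, here's json.loads on a trimmed-down, made-up response with the same shape (the iTotalRecords and aaData field names appear in the real responses used below; the row contents here are invented):

```python
import json

# A made-up response in the same shape the AJAX endpoint returns
response_text = '{"iTotalRecords": 702, "aaData": [["Band tag", "Black Metal", "Oslo", "Status tag"]]}'
page_content = json.loads(response_text)
print(page_content['iTotalRecords'])   # 702
print(page_content['aaData'][0][1])    # Black Metal
```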
import json
When a country has more than 500 bands, there will be more than one page, so we use the params
part of HTTP requests and give it a payload. Look at jonchar's README.md for a detailed explanation of the payload. Basically, to get the first page, you want:
payload = {'sEcho': 0, 'iDisplayStart': 0, 'iDisplayLength': 500}
For every subsequent page, you increase the start offset by 500. You want to do this until you've reached the last page.
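The paging arithmetic can be sketched as a small helper; the page_payloads function and the 1234-record total here are made up for illustration:

```python
# Sketch: one payload per page of 500 records, for a hypothetical
# country with 1234 bands total
def page_payloads(total_records, page_size=500):
    return [{'sEcho': 0, 'iDisplayStart': start, 'iDisplayLength': page_size}
            for start in range(0, total_records, page_size)]

for p in page_payloads(1234):
    print(p['iDisplayStart'])  # 0, 500, 1000
```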
Since we are querying one country at a time (there are around 140 countries listed) and one page at a time, we'll make close to 200 requests to Encyclopedia Metallum. To prevent the website from rejecting our HTTP requests, we have to pretend to be a browser accessing the page. The best way is to send a User-Agent along with the request. With the header below, I'm pretending to be Firefox:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
# Helper function that returns all the bands listed for
# a specific country and the specific page requested
def get_page_query(country, CID, start, end):
    # Full AJAX database call components
    URL = 'http://www.metal-archives.com'
    AJAX_REQUEST_BEGIN = '/browse/ajax-country/c/'
    AJAX_REQUEST_END = '/json/1/'
    payload = {'sEcho': 0,
               'iDisplayStart': start,
               'iDisplayLength': end}
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
    # Request with payload and headers
    r = requests.get(URL + AJAX_REQUEST_BEGIN + CID + AJAX_REQUEST_END,
                     params = payload,
                     headers = headers)
    # Converting response information into a Python dictionary
    page_content = json.loads(r.text)
    total = page_content['iTotalRecords']
    r.close()
    return {'finished_query': total <= end, 'data': page_content['aaData'], 'total': total}
This helper function uses the helper function above. It extracts the information we want and writes it into the TSV file passed in. This part mainly does the parsing, whereas the one above deals with the HTTP request. Essentially, a single call will write all the bands for the requested country into the designated TSV file.
# Helper function that uses the other helper function to get
# all the bands listed for a specific country
def get_bands_by_country(opened_file, writer, country, CID):
    start = 0
    end = 0
    query_complete = False
    # Regexes
    name_ahref_tag = re.compile("<a href='(.*)'>(.*)</a>")  # Name
    status_spanclass_tag = re.compile('<span class="(.*)">(.*)</span>')  # Status
    website_regex = re.compile('https://www.metal-archives.com/bands/(.*)/(.*)')  # Website and ID
    # Until the last page of the country is reached, keep querying
    while not query_complete:
        start = end  # iDisplayStart is 0-based, so the next page starts where the last one ended
        end = end + 500
        query = get_page_query(country, CID, start, end)
        query_complete = query['finished_query']
        query_data = query['data']  # Contains genres and locations
        for band in query_data:
            name_match = name_ahref_tag.match(band[0]).groups()
            status_match = status_spanclass_tag.match(band[3])
            name = name_match[1]
            website = name_match[0]
            ID = website_regex.match(website).groups()[1]
            genres = band[1]
            location = band[2]
            status = status_match.groups()[1]
            band_info = [name, ID,
                         country, CID, location,
                         genres, status, website]
            writer.writerow(band_info)  # Write into the passed-in TSV file
    print(country + ' (' + CID + '): SUCCESS [' + str(query['total']) + ']')
This part of the code executes all the helper functions above to write the information into a tab-separated values (TSV) file. Because the genre strings contain commas, I decided to use TSV to prevent problems with reading the file later on. To do this, you need to import the csv library:
import csv
To make this a TSV instead of a CSV, just add the
delimiter = '\t'
argument to your CSV writer. You can actually make a file with any separator you want with this feature.
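A minimal sketch of the delimiter option, writing one made-up row to an in-memory buffer instead of a file:

```python
import csv
import io

# Write to an in-memory buffer so nothing touches disk
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(['band_name', 'genres'])
writer.writerow(['Example Band', 'Progressive Death Metal, Progressive Rock'])
print(buf.getvalue())
# The commas inside the genre string never collide with the tab delimiter
```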
If your computer is fast enough and you're patient enough, you can watch the progress via the print statements, which show the country and the number of bands queried.
# Creates a new TSV file and uses the two helper functions
# above to write the data to the file
with open('encyclopedia_metallum_data.tsv', 'w') as tsvfile:
    writer = csv.writer(tsvfile, delimiter = '\t')
    headings = ['band_name', 'band_id',
                'country', 'CID', 'location',
                'genres', 'status',
                'website']
    writer.writerow(headings)
    # Goes through every country retrieved earlier and calls the helper function
    for CID, country in all_countries.items():
        get_bands_by_country(tsvfile, writer, country, CID)
After all that hard work, let's reap the reward. Using Pandas, we can easily read in a CSV, TSV, or any value-separated file as a DataFrame. Make sure you've imported the library:
import pandas as pd
with open('encyclopedia_metallum_data.tsv', 'r') as MA_tsv:
    metal_data = pd.read_csv(MA_tsv, sep = '\t')
metal_data[['CID']] = metal_data[['CID']].replace(np.nan, 'NA')  # pandas reads the string 'NA' as NaN by default, so restore it
metal_data
I'm sorry, we're still scraping data. This is it, I promise, then we get to the fun stuff. Here we extract population data from Wikipedia. Just like the part where we scraped the countries page, we use regexes and BeautifulSoup.
Here's one row in the population data table for China:
<tr>
<td>1</td>
<td style="text-align:left;"><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span> <a href="/wiki/China" title="China">China</a><sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[Note 2]</a></sup></td>
<td>1,388,060,000</td>
<td>December 14, 2017</td>
<td>18.3%</td>
<td style="text-align:left;"><a class="external text" href="http://worldpopulationclock.info/china" rel="nofollow">Official population clock</a></td>
</tr>
The country name is hiding in this tag:
<a href="/wiki/(COUNTRY LINK)" title="(COUNTRY)">(COUNTRY)</a>
We do the same thing we did earlier when extracting countries from Encyclopedia Metallum.
The population is in a <td> tag, so we use:
.find_all('td', style = False)
and that will give you this in a list:
<td>1</td>
<td>1,388,060,000</td>
<td>December 14, 2017</td>
<td>18.3%</td>
We simply get the 2nd element and parse the population using regexes.
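Here's that parse in isolation, using the population cell for China shown above:

```python
import re

# Same pattern used in the scraping code below
population_regex = re.compile(r'<td>([0-9,]+)</td>')
match = population_regex.match('<td>1,388,060,000</td>')
population = int(match.groups()[0].replace(',', ''))
print(population)  # 1388060000
```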
# HTTP Request
WORLD_POPULATION_URL = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
r = requests.get(WORLD_POPULATION_URL)
# BeautifulSoup & HTML
soup = BeautifulSoup(r.content, 'html5lib')
population_soup = soup.find('table',{'class': 'wikitable sortable'})
world_population = population_soup.find_all('tr')
world_population = world_population[1:] # Remove header row
# Regexes
population_regex = re.compile('<td>([0-9,]+)</td>')
country_regex = re.compile('<a href="/wiki/.*" title="(.*)">.*</a>')
country_redirect_regex = re.compile('<a class="mw-redirect" href="/wiki/.*" title="(.*)">.*</a>')
with open('population_data_2017.csv', 'w') as population_csv:
    writer = csv.writer(population_csv)
    writer.writerow(['country', 'population'])  # Data header
    # Parse tags
    for row in world_population:
        country = row.find('a', href = True, class_ = False)
        # Territory of another country; the name is in a redirect link
        redirect = row.find('a', {'class': 'mw-redirect'})
        population_of_country = row.find_all('td', style = False)[1]
        c = country_regex.match(str(country).replace('\n', ' '))
        rd = country_redirect_regex.match(str(redirect))
        p = population_regex.match(str(population_of_country))
        if redirect is not None:
            ctry = rd.groups()[0]
        elif c is not None:
            ctry = c.groups()[0]
        if p is not None:
            pop = p.groups()[0].replace(',', '')
            writer.writerow([ctry, pop])
It's a thing of beauty, congratulations! You are now well versed in web scraping with Python.
with open('population_data_2017.csv', 'r') as population_csv:
    population_data = pd.read_csv(population_csv)
population_data.set_index('country', inplace = True)
population_data
!pip install folium
import folium
import matplotlib.pyplot as plt
import seaborn as sns
Our metal bands data has a genre column, but right now it isn't in a form where we can do any genre-specific analysis. We need to determine the big major metal genres and mark whether each band belongs to any of these major groups. Encyclopedia Metallum has us covered with the list above. Metal genres usually have an umbrella term, and anything attached to them is a subgenre, like Melodic Death Metal for example. To do this task, we're going to use regexes again and use the major umbrella terms as the keywords. Unlike regular regex matching, you want to use:
regex = re.compile(genre_str)
regex.search(band_genre)
.search() goes through the entire string instead of matching from the beginning only, very similar to how most find tools search for keywords. We also want to keep track of the number of genre types each band is classified as. This will allow us to see whether a band's genre is mostly pure or mixed with other genres.
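A quick illustration of the difference, using the Melodic Death Metal example from above ('Death' is the umbrella keyword):

```python
import re

# 'Death' is an umbrella term; 'Melodic Death Metal' is one of its subgenres
death_regex = re.compile('Death')
print(death_regex.match('Melodic Death Metal'))   # None, match() only anchors at the start
print(death_regex.search('Melodic Death Metal'))  # a match object, search() scans the whole string
```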
If you're interested in learning more about metal genres, you can check out Patrick Galbraith and Nick Grant's Interactive Map of Metal. The map allows you to see the evolution of metal music.
genre_regex_str = ['Black', 'Death', 'Doom|Stoner|Sludge',
'Electronic|Industrial', 'Experimental|Avant-garde',
'Folk|Viking|Pagan', 'Gothic', 'Grindcore|Goregrind',
'Groove', 'Hard Rock', 'Heavy', 'Metalcore|Deathcore',
'Power', 'Progressive', 'Speed', 'Symphonic', 'Thrash']
genre_regexes = []
genre_count = {}
# Create genre keyword regexes
for genre_str in genre_regex_str:
    genre_count[genre_str] = []
    genre_regexes.append(re.compile(genre_str))
# For every band, determine if it belongs to any of the major metal genre types
# Also count how many genre types each band belongs to
genre_mix = []
for _, band in metal_data.iterrows():
    mix = 0
    for g in range(len(genre_regexes)):
        genre_group = genre_regex_str[g]
        if genre_regexes[g].search(band['genres']) is not None:
            genre_count[genre_group].append(True)
            mix = mix + 1
        else:
            genre_count[genre_group].append(False)
    genre_mix.append(mix)
# Drop columns not relevant to genre analysis
genre_data = metal_data.copy(deep = True)
genre_data = genre_data.drop(['band_id', 'location', 'website'], axis = 1)
genre_data.insert(5, 'genre groups', genre_mix)
# Insert each genre column into the new DataFrame
for idx in range(len(genre_regex_str)):
    genre = genre_regex_str[idx]
    genre_data.insert((idx + 6), genre, genre_count[genre])
genre_data
After the DataFrame transformation, we can easily do band counts for each genre. Use .value_counts() on the column you want to tally instead of going through the entire DataFrame with a for loop. The DataFrame is a nice presentation of the data, especially if you sort using:
.sort_values(column_name, ascending = False)
That way we can easily see which genres are the most popular. However, there are some powerful graphing capabilities we can use to visualize the data instead of plain numbers.
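As a tiny illustration of the sort, with made-up counts:

```python
import pandas as pd

# A made-up genre tally, just to show the sort
toy = pd.DataFrame({'band count': [10, 25, 5]}, index=['Black', 'Death', 'Doom'])
print(toy.sort_values('band count', ascending=False))
# Death (25) comes out on top
```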
genre_total_data = genre_data.copy(deep = True)
genre_count = []
genre_categories = list(genre_total_data.columns)[6:]
for category in genre_categories:
    counts = genre_total_data[category].value_counts()
    if True in counts:
        genre_count.append(counts[True])
    else:
        genre_count.append(0)
g = {'genre': genre_categories, 'band count': genre_count}
genre_total_df = pd.DataFrame(data = g, index = g['genre'], columns = ['band count'])
genre_total_df.sort_values('band count', ascending = False)
For beautiful and easy-to-use plotting, import matplotlib and seaborn. You can use matplotlib
by itself, but seaborn
is much more flexible and makes fine-tuning how a graph looks much easier.
import matplotlib.pyplot as plt
import seaborn as sns
If you want to look at color options for plots, here are some useful links for matplotlib and seaborn.
Now we can plot the DataFrame we created above and answer one of our analysis questions.
Black, Death, Thrash, Heavy, and Doom/Stoner/Sludge are the top 5 genre types in metal music. This is not surprising if you're familiar with metal history: lower-count genres tend to be newer genres, whereas the top 5 are ancestors to many of the other genre types and subgenres. It's pretty evident if you go look at the Map of Metal mentioned earlier.
plt.title('Genre Classification Counts For Metal Bands')  # Set graph title
genre_plot = sns.barplot(x = list(range(0, len(genre_total_df))),
                         y = genre_total_df['band count'],
                         palette = sns.color_palette("cubehelix", len(genre_total_df)))
genre_plot.set(xlabel = 'Genre Classification',
               ylabel = 'Band Count',
               xticklabels = list(genre_total_df.index))  # Set x-axis labels since we plotted against numeric positions
plt.xticks(rotation = 90)  # Makes x-axis labels vertical
genre_plot.set_facecolor('white')  # Changes the background to white
plt.show()
So we know the top overall metal genres, but let's look at what genres dominate each country. Instead of using .value_counts(), let's use pivot_table from Pandas. The pivot_table allows us to tally the bands with a given genre type in a specific country.
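Here's pivot_table on a toy frame with the same shape as genre_data (the countries and boolean genre columns below are made up):

```python
import numpy as np
import pandas as pd

# One row per band, one boolean column per genre keyword
toy = pd.DataFrame({'country': ['Norway', 'Norway', 'Sweden'],
                    'Black': [True, True, False],
                    'Death': [False, True, True]})
counts = pd.pivot_table(data=toy, index='country',
                        values=['Black', 'Death'], aggfunc=np.sum)
print(counts)
# Norway: Black 2, Death 1; Sweden: Black 0, Death 1
```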
genre_by_country = pd.pivot_table(data = genre_data,
                                  index = ['country', 'CID'],  # Combines all rows with the same values for these columns
                                  values = genre_regex_str,  # The columns the aggregate function should apply to
                                  aggfunc = np.sum)  # Specify the aggregate function; the default is mean
genre_by_country.head()
Let's plot the DataFrame from above to see the results. Since I'll be plotting 17 genre plots, let's make it a function so that we don't have to repeat the code 17 times.
def genre_bargraph(df, title, ylabel, genre, palette):
    plt.figure(figsize = (26, 3))  # Change size of plot in inches
    plt.title(title)
    ax = sns.barplot(x = list(df.index.codes[0]),  # MultiIndex .labels was renamed .codes in newer pandas
                     y = df[genre],
                     palette = palette)
    ax.set(xlabel = 'Country',
           ylabel = ylabel,
           xticklabels = list(df.index.levels[0]))
    plt.xticks(rotation = 90)
    ax.set_facecolor('white')
Fifteen of the 17 plots show that the United States has a lot of bands in every genre. However, this does not help us answer the question of which genres dominate each country, so we have to take the band counts and divide by the total number of bands in that country.
genre_lst = list(genre_by_country.columns)
l = len(genre_by_country)
titles = ['Black Metal','Death Metal','Doom Metal','Electronic/Industrial Metal',
'Experimental/Avant-Garde Metal','Folk/Viking/Pagan Metal','Gothic Metal',
'Grindcore/Goregrind','Groove Metal','Hard Rock','Heavy Metal','Metalcore/Deathcore',
'Power Metal','Progressive Metal','Speed Metal','Symphonic Metal','Thrash Metal']
for idx in range(len(genre_lst)):
    genre_bargraph(genre_by_country,
                   'Band Count of ' + titles[idx] + ' by Country',
                   'Band Count',
                   genre_lst[idx],
                   sns.cubehelix_palette(l, start = 0.5, rot = -0.75))
plt.show()
First, we need to calculate how many bands are in each country.
band_count = pd.DataFrame(metal_data['country'].value_counts())
band_count = band_count.sort_index()
band_count.columns = ['band count']
band_count.head()
Now we can insert that into our DataFrame from above, and let's use NumPy to our advantage to calculate genre percentages. If you have two NumPy arrays of the same length, you can directly use:
a = np.array(dataA, dtype = np.float64)
b = np.array(dataB, dtype = np.float64)
list(a / b)
We do this for every genre and we're done!
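With concrete (made-up) numbers, the elementwise division looks like this:

```python
import numpy as np

# Made-up genre counts and country totals, divided elementwise
genre_counts = np.array([10, 20], dtype=np.float64)
band_totals = np.array([100, 80], dtype=np.float64)
print(list(genre_counts / band_totals))  # [0.1, 0.25]
```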
genre_percentage_by_country = genre_by_country.copy(deep = True)
genre_percentage_by_country.insert(0, 'band count', list(band_count['band count']))  # Insert total band counts into the DataFrame
lst = list(genre_percentage_by_country.columns)[1:]
for genre in lst:
    genre_arr = np.array(genre_percentage_by_country[genre], dtype = np.float64)
    total_arr = np.array(genre_percentage_by_country['band count'], dtype = np.float64)
    genre_percentage_by_country[genre] = list(genre_arr / total_arr)
genre_percentage_by_country.head()
Let's regraph so that you can see which genres are really prevalent in each country.
The answer is actually in all 17 of our graphs!
Let's look at one as an example:
If the bars for a country are high, then that genre is very popular in that specific country.
*You can double click on the graphs to enlarge them.
genre_lst = list(genre_percentage_by_country.columns)[1:]
l = len(genre_percentage_by_country)
for idx in range(len(genre_lst)):
    genre_bargraph(genre_percentage_by_country,
                   'Percentage of ' + titles[idx] + ' by Country',
                   'Percentage',
                   genre_lst[idx],
                   sns.cubehelix_palette(l, start = 1.5, rot = -0.75, dark = 0, light = 0.95))
plt.show()
Looking at the graph we made, more than 70,000 bands are purely one genre type while around 40,000 bands are a mix of 2 different genre types. Around 61% are pure, around 34% are a mix of exactly 2, and around 39% are a mix of 2 or more genre types. I'm actually surprised, since most bands are usually listed with so many genre tags.
purity_counts = pd.DataFrame(genre_data['genre groups'].value_counts())
purity_counts
plt.title('Mixed Metal Genre Counts')  # Set graph title
genre_plot = sns.barplot(x = list(purity_counts.index),
                         y = purity_counts['genre groups'],
                         palette = sns.color_palette("cubehelix", len(purity_counts) * 2))
genre_plot.set(xlabel = 'Number of Genre Types', ylabel = 'Band Count')  # Set x-axis and y-axis label names
genre_plot.set_facecolor('white')  # Changes the background to white
plt.show()
c = list(band_count.index)
overall_status_counts = metal_data['status'].value_counts()  # Tally of each band status across the whole data set
status_types = list(overall_status_counts.index)
status_count_data = {}
status_count_data['Total'] = list(band_count['band count'])
status_count_data['Active'] = []
status_count_data['Split-up'] = []
status_count_data['Changed name'] = []
status_count_data['On hold'] = []
status_count_data['Unknown'] = []
status_count_data['Disputed'] = []
for idx in range(len(c)):
    status_counts = metal_data[metal_data['country'] == c[idx]]['status'].value_counts()
    for status in status_types:
        if status in status_counts:
            status_count_data[status].append(status_counts[status])
        else:
            status_count_data[status].append(0)
status_data = pd.DataFrame(data = status_count_data, index = c, columns = ['Total'] + status_types)
status_data.head()
We need to insert the population data we scraped earlier into the new DataFrame, so we have to merge the two data sets together. Since some countries have different names in the two data sets, I created an alias dictionary to solve the problem.
alias = {'Korea, South': 'South Korea',
'Macedonia (FYROM)': 'Republic of Macedonia',
'Georgia': 'Georgia (country)',
'Ireland': 'Republic of Ireland'}
pop_dataframe_countries = list(population_data.index)
countries = list(status_data.index)
population_info = []
for country in countries:
    if country in alias:
        country = alias[country]
    if country in pop_dataframe_countries:
        country_population = population_data.loc[country]['population']
        population_info.append(country_population)
    else:
        population_info.append(0)
status_data.insert(0, 'Population (2017)', population_info)
status_data = status_data[status_data['Population (2017)'] != 0] # Remove countries with no population match
status_data.head()
I wanted to calculate 3 different per capitas: one overall, one for currently active bands, and one for bands that have split up. This will tell us which countries have the most metal music activity in proportion to population.
The equation for calculating Per Capita (# of bands per person):
$\text{Per Capita} = \dfrac{\text{band count}}{\text{population}} \times \text{(scalar)}$
The scalar lets you control the proportion; if you want the original definition of per capita, the scalar is 1, i.e. one per person in the population. I'm going to do per every ten thousand individuals, so my scalar is 10,000.
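Plugging made-up numbers into the equation (3,000 bands, a population of 5.5 million, scalar 10,000):

```python
# Per capita sketch with invented numbers
band_count = 3000
population = 5500000
per_capita = band_count / population * 10000  # bands per 10,000 people
print(round(per_capita, 2))  # 5.45
```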
total_arr = np.array(list(status_data['Total']), dtype = np.float64)
active_arr = np.array(list(status_data['Active']), dtype = np.float64)
split_arr = np.array(list(status_data['Split-up']), dtype = np.float64)
population_arr = np.array(list(status_data['Population (2017)']), dtype = np.float64)
status_data.insert(2, 'Total Per Capita (10,000)', list((total_arr / population_arr) * 10000))
status_data.insert(4, 'Active Per Capita (10,000)', list((active_arr / population_arr) * 10000))
status_data.insert(6, 'Split-up Per Capita (10,000)', list((split_arr / population_arr) * 10000))
status_data.head()
Here, we're going to use folium to make an interactive choropleth map; it'll let us see which country is the most metal better than a bar graph can. I was inspired by this Metal Injection blog post.
!pip install folium
import folium
The GeoJSON file of the world is located here.
This isn't exactly difficult if your data matches the GeoJSON file exactly, but in our case the file has information for a lot more countries than there are in the DataFrame. The subsequent 2 cells contain code that modifies the DataFrame to include all the countries in the GeoJSON file; for countries that were not in our DataFrame, we have to give them filler values.
choropleth_data = status_data.copy(deep = True)
alias = {'Serbia': 'Republic of Serbia',
'Macedonia (FYROM)': 'Macedonia',
'Korea, South': 'South Korea',
'United States': 'United States of America'}
# Going through the GeoJSON file to extract all the countries
with open('countries.geo.json', 'r') as country_boundaries_geojson:
    country_boundaries = json.load(country_boundaries_geojson)
cb = [feature['properties']['ADMIN'] for feature in country_boundaries['features']]
l = list(choropleth_data.index)
geojson_country_name = []
# Making a column of the country names from the GeoJSON file.
# In order to make the map, all country names in the DataFrame must
# match the names in the GeoJSON file.
for name in l:
    if name in alias:
        geojson_country_name.append(alias[name])
    elif name in cb:
        geojson_country_name.append(name)
choropleth_data.insert(0, 'geojson_country_name', geojson_country_name)
choropleth_data = choropleth_data.drop(['Unknown', 'Changed name', 'On hold', 'Disputed'], axis = 1)
choropleth_data.head()
# Collect all the countries from the GeoJSON file that were not matched
no_match = [name for name in cb if name not in geojson_country_name]
# Creating a DataFrame for the non-matched countries
# and giving them filler values
filler = [-1] * len(no_match)
filler_data = {}
choropleth_col = list(choropleth_data.columns)
for h in choropleth_col:
    if h == 'country':
        filler_data[h] = [''] * len(no_match)
    elif h == 'CID':
        filler_data[h] = ['--'] * len(no_match)
    elif h == 'geojson_country_name':
        filler_data[h] = no_match
    else:
        filler_data[h] = filler
# Combine the original DataFrame with the non-match DataFrame
geojson_no_matches = pd.DataFrame(data = filler_data, columns = choropleth_col)
choropleth_data = pd.concat([choropleth_data, geojson_no_matches])  # DataFrame.append was removed in newer pandas
choropleth_data
folium has a built-in choropleth function. Because I have to graph 3 of them, I made a function and added them all in as layers. Because the map is so taxing on the computer, saving it as an HTML file is a good option if your computer can't handle the large amount of graphics.
View the full map by downloading it here
With the map layers, we answer our 2nd question.
Finland and those Scandinavians! Finland is at the top of all 3 maps, which shows that metal music activity in Finland is extremely high even though it is not one of the most populous countries in the world.
Here are some resources for choropleth maps:
# Function to add a choropleth overlay to the base map
def build_choropleth(base_map, name, field, threshold, color):
    layer_name = (name + ' Metal Bands Per Capita (10,000)').strip()
    base_map.choropleth(name = layer_name,
                        geo_data = 'countries.geo.json',  # GeoJSON file containing the country boundary lines
                        data = choropleth_data,  # DataFrame to plot from
                        columns = ['geojson_country_name', field + ' Per Capita (10,000)'],  # Which columns to plot
                        key_on = 'feature.properties.ADMIN',  # Which property in the GeoJSON file to match the column data with
                        threshold_scale = threshold,  # Adjusts the legend color scale
                        fill_color = color, fill_opacity = 0.5, line_opacity = 0.5,  # Choropleth color overlay options
                        legend_name = layer_name,
                        reset = True, highlight = True)
base_map = folium.Map(location = [0,0], zoom_start = 1.5)
build_choropleth(base_map, '', 'Total', [0,1,2,5,6,7], 'BuPu')
build_choropleth(base_map, 'Active', 'Active', [-1,0,1,2,3,4], 'BuGn')
build_choropleth(base_map, 'Split-up', 'Split-up', [-1.0, -0.1, 0.4, 0.9, 1.4, 1.9], 'YlOrRd')
folium.LayerControl().add_to(base_map)
base_map.save('metalband_choropleth_map.html')  # Save as HTML because most computers can't handle loading the map live
If you've stuck with me until the end, then congratulations! We did some incredible things in Python 3 and answered these metal music questions:
In comparison with the Kaggle notebook, we had pretty similar results even though the data set here is 20 times larger. Our most metal country was Finland, but for the Kaggle analysis it was the Faroe Islands, which is actually a group of islands off the Scandinavian peninsula. A lot of the genre analysis was similar as well.
Thanks for reading and following along!