Introduction

In this section, I use the database built in Section 1.7 to define a player_profiler function. The function processes the data on each player's performances and compiles three profiles per player, one for each of the playable races of StarCraft 2. The section then compiles this function into the profiler module.

Exportable Members

build_player_race_profiles

Querying the database

Once the ingestion process is done, the next step is to turn the replay data in the database into player profiles. To build these profiles, I need to separate the replays by player and then by race. To accomplish the former, I need to extract a list of usernames from the database.

The following code shows how to extract a list of player usernames by looping through the replays collection created in the ingest process.

Tip: I ignore usernames that follow the pattern 'A.I. number (level)' because they refer to the game's A.I. opponents. Similarly, I ignore the name 'Player 2' and names composed only of the repeated letter 'l' (known as barcode names). This filter is necessary because players use these two patterns to hide their identity by blending in with other players who use the same username, so these names cannot be used to separate players.

import re

import pandas as pd

# Load database
working_db = set_up_db()

# Define username patterns to ignore
ai_pat = re.compile(r'^A\.I\. [\d] [(][\w\s]*[)]$')
barcode_pat = re.compile(r'^l+$')

# Iterate through the records in the `replays` collection to count the
# matches played under each valid username.
players_match_count = dict()
for rec in working_db['replays'].find():
    for player in rec['players']:
        username = player['username']
        # Both patterns are anchored, so `match` is enough to test them.
        if not (ai_pat.match(username)
                or barcode_pat.match(username)
                or username == 'Player 2'):
            players_match_count.setdefault(username, 0)
            players_match_count[username] += 1

# I will ignore players that only have one record in the database.
{name: count for name, count in players_match_count.items() if count >= 2}

{'HDEspino': 149,
 'DaveyC': 2,
 'Xnorms': 2,
 'Shah': 3,
 'Razer': 2,
 'gae': 2,
 'SenorCat': 2,
 'Worawit': 2,
 'aria': 2,
 'xiiaoyao': 2}
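
The filtering rule can also be isolated in a small predicate, which makes it easy to check against sample names. The following is a minimal sketch: the helper name is_trackable_username and the sample usernames are assumptions made for illustration and are not part of the profiler module.

```python
import re

# Same patterns as in the loop above.
AI_PAT = re.compile(r'^A\.I\. [\d] [(][\w\s]*[)]$')
BARCODE_PAT = re.compile(r'^l+$')


def is_trackable_username(username: str) -> bool:
    """Return True if the name can be used to tell players apart."""
    return not (AI_PAT.match(username)
                or BARCODE_PAT.match(username)
                or username == 'Player 2')


# Illustrative names only; they do not come from the test database.
samples = ['HDEspino', 'A.I. 1 (Very Easy)', 'llllllll', 'Player 2']
trackable = [name for name in samples if is_trackable_username(name)]
print(trackable)
```

Keeping the rule in one function means that the counting loop and any later query would share the same definition of a valid username.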

Of these players, I will focus only on HDEspino, given that this player has a substantial number of replays in the test database.

In any case, once I have a list of usernames in a database, I can extract all the replays related to a given player with simple queries to the database.

For example, the following queries count the replays where HDEspino played as Protoss, first as player one and then as player two.

print(len([rpl for rpl 
           in working_db['replays'].find({'players.0.username':'HDEspino',
                                          'players.0.race':'Protoss'})]))
print(len([rpl for rpl 
           in working_db['replays'].find({'players.1.username':'HDEspino',
                                          'players.1.race':'Protoss'})]))

91
39
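
When the player slot does not matter, the two filters can be folded into a single one. This is a minimal sketch, assuming the document layout used above (a players array whose items carry username and race fields); the helper name player_race_query is hypothetical. It relies on MongoDB's $elemMatch operator, which matches documents where a single array element satisfies all the listed conditions.

```python
def player_race_query(username: str, race: str) -> dict:
    """Build a filter that matches the player in any slot of `players`."""
    return {'players': {'$elemMatch': {'username': username, 'race': race}}}


query = player_race_query('HDEspino', 'Protoss')
print(query)

# With a live connection, the filter could be passed to pymongo, e.g.:
# working_db['replays'].count_documents(query)
```

Note that the slot-specific queries are still needed later, because the indicators collection is filtered by player_id.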

Building the profile

Based on this list, I will build the Protoss profile for this player to illustrate what this process would entail.

First, I will query the system to identify the replays where the user was one of the players and was playing as Protoss. Then, I use that information to build a DataFrame containing all of the indicators for the player's performances in these replays.

# Query `replays` and build a list of replays the user played as
# Protoss and Player 1. 
player_1_protoss = [rpl['replay_name'] for rpl 
                   in working_db['replays'].
                      find({'players.0.username':'HDEspino', 
                            'players.0.race':'Protoss'},
                            {'replay_name':1, 'players':1})]

# Based on the list query `indicators` to get the performance scores of 
# Player 1 in each replay of the previous list.
working_repls = {}
for rpl in player_1_protoss:
    for cur in working_db['indicators'].find({'replay_name':rpl, 
                                              'player_id': 1}, 
                                             {'_id':0, 'replay_name':0,
                                              'player_username':0,
                                              'player_id': 0}):
        working_repls[rpl] = cur
        
len(working_repls)

91

# Repeat the process above but focused on the replays where the player
# played as Player 2.

player_2_protoss = [rpl['replay_name'] for rpl 
                   in working_db['replays'].
                      find({'players.1.username':'HDEspino', 
                            'players.1.race':'Protoss'},
                            {'replay_name':1, 'players':1})]

for rpl in player_2_protoss:
    for cur in working_db['indicators'].find({'replay_name':rpl, 
                                              'player_id': 2}, 
                                             {'_id':0, 'replay_name':0,
                                              'player_username':0,
                                              'player_id': 0}):
        working_repls[rpl] = cur
   
working_df = (pd.DataFrame(working_repls.values(), 
                           index=working_repls.keys()).reset_index()
                                                      .drop('index', axis=1))
working_df.info(memory_usage=False, show_counts=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Columns: 389 entries, unspent_minerals_avg_whole to late_started_zealot
dtypes: float64(97), int64(284), object(8)

After extracting all replays for a given player and race, I group them into a DataFrame. In the sample case, the DataFrame has 130 entries and 389 columns. These columns hold the indicators that inventory_replays stores in the indicators collection.
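
The step from a dictionary of per-replay records to a DataFrame can be illustrated on toy data. This is a sketch rather than the code used in this section: the replay names and indicator values are made up, and DataFrame.from_dict with orient='index' is shown as one alternative way to build the same kind of table.

```python
import pandas as pd

# Made-up stand-in for `working_repls`: replay name -> indicator record.
toy_repls = {
    'replay_a': {'unspent_minerals_avg_whole': 900.0, 'late_started_zealot': 2},
    'replay_b': {'unspent_minerals_avg_whole': 1100.0, 'late_started_zealot': 4},
}

# One row per replay and one column per indicator; the replay names
# become the index.
toy_df = pd.DataFrame.from_dict(toy_repls, orient='index')

# Dropping the replay names leaves a plain RangeIndex.
toy_df = toy_df.reset_index(drop=True)
print(toy_df.shape)
```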

More importantly, the columns store three types of data: 97 store decimals (type float64), 284 store integers (type int64), and 8 store other value types. In this case, the other values are categorical strings that record the players' first and second preferred special abilities, as I show in the code below.

categorical_columns = working_df.dtypes[working_df.dtypes == object]

cat_features = working_df[[x for x in categorical_columns.index]]

# I only include 4 of the 8 columns for space.
pref_abil_df = working_df[['first_whole_pref_sab',
 'second_whole_pref_sab',
 'first_mid_pref_sab',
 'second_mid_pref_sab']]

# print(pref_abil_df.tail(5).to_markdown())

	first_whole_pref_sab	second_whole_pref_sab	first_mid_pref_sab	second_mid_pref_sab
125	ChronoBoostEnergyCost	None	None	None
126	ChronoBoostEnergyCost	UnloadTargetWarpPrism	ChronoBoostEnergyCost	UnloadTargetWarpPrism
127	ForceField	ChronoBoostEnergyCost	ForceField	GuardianShield
128	ChronoBoostEnergyCost	ForceField	ChronoBoostEnergyCost	ForceField
129	ChronoBoostEnergyCost	None	None	None

I can process these categories using the value_counts function to get the most common preferred ability. Next, I define get_top_of_category to extract the most used attribute in a column.
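
The definition of get_top_of_category is not reproduced in this section. The following is a minimal sketch of what such a helper could look like, assuming it returns the most frequent non-missing value of a column and None when the column has no values at all. The actual implementation may differ, and the sample values are made up.

```python
import pandas as pd


def get_top_of_category(column: pd.Series):
    """Return the most frequent non-missing value, or None if there is none."""
    # value_counts sorts by frequency and drops missing values by default.
    counts = column.value_counts()
    return counts.index[0] if not counts.empty else None


abilities = pd.Series(['ChronoBoostEnergyCost', 'ForceField',
                       'ChronoBoostEnergyCost', None])
print(get_top_of_category(abilities))
```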

Note: Once I move on to clustering, I will need to turn this data into a numerical representation. For example, since these are nominal categories, I could convert the data into a binary (one-hot) matrix.
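
As an illustration of the one-hot conversion mentioned in the note, here is a minimal sketch that uses pandas' get_dummies on made-up rows. The choice of get_dummies is an assumption; the clustering step may end up using a different encoder.

```python
import pandas as pd

# Made-up sample of one categorical indicator.
toy = pd.DataFrame({
    'first_whole_pref_sab': ['ChronoBoostEnergyCost',
                             'ForceField',
                             'ChronoBoostEnergyCost'],
})

# Each distinct ability becomes its own 0/1 column.
one_hot = pd.get_dummies(toy, columns=['first_whole_pref_sab'], dtype=int)
print(one_hot)
```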

get_top_of_category(pref_abil_df.first_whole_pref_sab)

'ChronoBoostEnergyCost'

cate_profile = cat_features.apply(get_top_of_category, axis=0)
cate_profile

first_whole_pref_sab     ChronoBoostEnergyCost
second_whole_pref_sab           GuardianShield
first_early_pref_sab     ChronoBoostEnergyCost
second_early_pref_sab                     None
first_mid_pref_sab       ChronoBoostEnergyCost
second_mid_pref_sab                       None
first_late_pref_sab                       None
second_late_pref_sab                      None
dtype: object

Meanwhile, I will simply average each of the other columns to get a single value per indicator for the player's profile.

non_cat_columns = working_df.dtypes[working_df.dtypes != object]

non_cat_features = working_df[[x for x in non_cat_columns.index]]
non_cat_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Columns: 381 entries, unspent_minerals_avg_whole to late_started_zealot
dtypes: float64(97), int64(284)
memory usage: 387.1 KB

non_cate_profile = non_cat_features.mean()
non_cate_profile

unspent_minerals_avg_whole    1068.323130
unspent_minerals_avg_early     184.596550
unspent_minerals_avg_mid       651.947283
unspent_minerals_avg_late     2307.746755
unspent_vespene_avg_whole      531.042321
                                 ...     
late_started_stalker             5.123077
late_started_tempest             0.584615
late_started_voidray             5.069231
late_started_warpprism           0.123077
late_started_zealot              3.215385
Length: 381, dtype: float64
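
The same split between categorical and numeric columns can also be written with DataFrame.select_dtypes. The following is a short sketch on made-up data, offered as an alternative rather than as the code used above.

```python
import pandas as pd

# Made-up stand-in for `working_df`.
toy_df = pd.DataFrame({
    'unspent_minerals_avg_whole': [900.0, 1100.0],
    'late_started_zealot': [2, 4],
    'first_whole_pref_sab': ['ForceField', 'ForceField'],
})

# Keep only the numeric columns, then average each one.
numeric_profile = toy_df.select_dtypes(include='number').mean()
print(numeric_profile)
```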

Once these two sets of values are defined, I can join them in a single profile.

Note: When merging the two sets, I add a 'player_profile' column to both. It identifies the profile and serves as the shared key for the merge.

profile_name = 'player_profile'
left = pd.DataFrame(non_cate_profile.to_dict(), index=[0])
left.insert(0, profile_name, 'HDEspino_protoss')
right = pd.DataFrame(cate_profile.to_dict(), index=[0])
right.insert(0, profile_name, 'HDEspino_protoss')

full_profile = left.merge(right, how='inner', on=profile_name)
full_profile.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 0 to 0
Columns: 390 entries, player_profile to second_late_pref_sab
dtypes: float64(381), object(9)
memory usage: 3.1+ KB
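
Because both partial profiles describe a single row, a similar one-row table can also be assembled without a merge. The following is a minimal sketch with made-up values; the helper name combine_profiles is hypothetical and is not part of the profiler module.

```python
import pandas as pd


def combine_profiles(profile_id, numeric_profile, categorical_profile,
                     profile_name='player_profile'):
    """Build a one-row DataFrame: identifier, numeric values, categories."""
    row = {profile_name: profile_id,
           **numeric_profile.to_dict(),
           **categorical_profile.to_dict()}
    return pd.DataFrame([row])


# Made-up partial profiles.
numeric = pd.Series({'unspent_minerals_avg_whole': 1000.0,
                     'late_started_zealot': 3.0})
categorical = pd.Series({'first_whole_pref_sab': 'ForceField',
                         'second_whole_pref_sab': None})

full = combine_profiles('HDEspino_protoss', numeric, categorical)
print(full.columns.tolist())
```

The merge used above has the advantage of making the shared key explicit, which matters once several profiles are combined at the same time.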

The following table shows the first ten and last ten indicators in the profile along with their values.

Indicator	Value
player_profile	HDEspino_protoss
unspent_minerals_avg_whole	1068.32313026423
unspent_minerals_avg_early	184.59654999017428
unspent_minerals_avg_mid	651.9472831545554
unspent_minerals_avg_late	2307.7467547558495
unspent_vespene_avg_whole	531.0423213983343
unspent_vespene_avg_early	109.9563382651504
unspent_vespene_avg_mid	504.7169891278007
unspent_vespene_avg_late	1075.1499164552638
unspent_resources_avg_whole	1599.365451662564
late_started_warpprism	0.12307692307692308
late_started_zealot	3.2153846153846155
first_whole_pref_sab	ChronoBoostEnergyCost
second_whole_pref_sab	GuardianShield
first_early_pref_sab	ChronoBoostEnergyCost
second_early_pref_sab	None
first_mid_pref_sab	ChronoBoostEnergyCost
second_mid_pref_sab	None
first_late_pref_sab	None
second_late_pref_sab	None

Exportable function

Here, I define build_player_race_profiles as a function that converts all replays in a database into a set of player profiles. The function uses four helper functions:

Once I run the function, there is one record in each of the profile collections: HDEspino's profile for each race.

build_player_race_profiles()

Accessing: TEST_library
1 users found in database
Generating Player Profiles
Created the following profiles
Protoss: 1
Zerg: 1
Terran: 1

print(working_db['Protoss_Profiles'].estimated_document_count())
print(working_db['Terran_Profiles'].estimated_document_count())
print(working_db['Zerg_Profiles'].estimated_document_count())

1
1
1

10 - Player Profiler

