# Building a Powerful Cryptocurrency Portfolio Using a Multi-Factor Strategy: Data Preprocessing

Foreword

In the previous article, we published the first part of this series, “[Constructing a Strong Cryptocurrency Portfolio Using Multi-Factor Strategies](/?s=Constructing+a+Strong+Cryptocurrency+Portfolio+Using+Multi-Factor+Strategies)”, which laid the theoretical foundation. This article is the second part: Data Preprocessing.

Before computing factor data and testing the effectiveness of individual factors, the relevant data must be preprocessed. Preprocessing covers handling duplicates, outliers/missing values/extreme values, standardization, and data frequency.

I. Duplicates

Data-related definitions:

– Key: a unique index. E.g., for a dataset with data for all tokens across different dates, the key is “token_id/contract_address – date”.

– Value: the object indexed by the key.

Detecting duplicates first requires understanding what the data “should” look like. Typically, data takes one of the following forms:

– Time-series data. The key is “time”. E.g., price data for a single token over 5 years.

– Cross-sectional data. The key is “individual”. E.g., price data for all tokens in the crypto market on November 1st, 2023.

– Panel data. The key is the combination “individual – time”. E.g., price data for all tokens from January 1st, 2019 to November 1st, 2023.

Principle: once the key of the data is determined, you know at what level the data should not contain duplicates.

Methods to check:

– `pd.DataFrame.duplicated(subset=[key1, key2, …])`

  – Check the number of duplicates: `pd.DataFrame.duplicated(subset=[key1, key2, …]).sum()`

  – Sample the duplicate records with `df[df.duplicated(subset=[…])].sample()`, then use `df.loc` to select all duplicate records corresponding to that key.

– `pd.merge(df1, df2, on=[key1, key2, …], indicator=True, validate='1:1')`

  – In this horizontal merge, the `indicator` parameter generates a `_merge` field; `dfm['_merge'].value_counts()` shows how many records came from each source after merging.

  – The `validate` parameter verifies whether the keys in the merged datasets match the expected relationship: one-to-one (`'1:1'`), one-to-many (`'1:m'`), or many-to-many (`'m:m'`, which effectively requires no validation). If the expectation is violated, the merge raises an error and stops execution.
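A minimal sketch of these checks on a hypothetical panel dataset (the `token_id`, `date`, and `sector` columns are illustrative; `validate="m:1"` is used here because the metadata table is a many-to-one lookup):

```python
import pandas as pd

# Hypothetical panel data; the key is "token_id - date"
df = pd.DataFrame({
    "token_id": ["BTC", "BTC", "ETH", "ETH", "ETH"],
    "date": ["2023-11-01", "2023-11-01", "2023-11-01",
             "2023-11-02", "2023-11-02"],
    "price": [35000.0, 35000.0, 1800.0, 1850.0, 1850.0],
})

# Number of rows whose key is duplicated (first occurrences not counted)
n_dupes = df.duplicated(subset=["token_id", "date"]).sum()

# Sample one duplicated row, then pull all rows sharing its key via .loc
dup_row = df[df.duplicated(subset=["token_id", "date"])].sample(1)
key = (dup_row["token_id"].iloc[0], dup_row["date"].iloc[0])
all_for_key = df.set_index(["token_id", "date"]).loc[[key]]

# Deduplicate, then merge with a metadata table; validate="m:1" raises
# MergeError if token_id is not unique on the right-hand side
clean = df.drop_duplicates(subset=["token_id", "date"])
meta = pd.DataFrame({"token_id": ["BTC", "ETH"], "sector": ["L1", "L1"]})
merged = clean.merge(meta, on="token_id", indicator=True, validate="m:1")
```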

II. Outliers/Missing Values/Extreme Values

Common causes of outliers:

– Extreme situations. For example, a token with a price of $0.000001 or a market cap of only $500,000, where even a tiny change can produce a return of several times.

– Data characteristics. For example, if token price data starts on January 1st, 2020, the return for January 1st, 2020 cannot be computed because there is no closing price for the previous day.

– Data errors. Data providers inevitably make mistakes, such as recording $12 per token as $1.2 per token.

Principles for handling outliers and missing values:

– Delete. Outliers that cannot be reasonably corrected or adjusted can be deleted.

– Replace. Usually applied to extreme values, e.g., winsorizing or taking the logarithm (less common).

– Fill. Missing values can be filled in a reasonable way. Common methods include the mean (or a moving average), interpolation, filling with 0 (`df.fillna(0)`), forward filling (`df.ffill()`), or backward filling (`df.bfill()`), but the assumption underlying each fill should be considered.

Be cautious with backward filling in machine learning, as it carries a risk of look-ahead bias.
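A short sketch of these fill strategies on a toy price series (the numbers are illustrative; note how each method embeds a different assumption about the missing data):

```python
import pandas as pd
import numpy as np

prices = pd.Series([10.0, np.nan, 12.0, np.nan, np.nan, 15.0])

zero_filled = prices.fillna(0)              # assumes "missing" means zero
ffilled = prices.ffill()                    # carries the last known price forward
bfilled = prices.bfill()                    # uses a FUTURE price: look-ahead risk
interp = prices.interpolate()               # linear interpolation between knowns
mean_filled = prices.fillna(prices.mean())  # global mean of observed values
```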

Methods for handling extreme values:

1. Percentile method.

Sort the data in ascending order and replace values beyond the chosen lower and upper percentiles with the values at those thresholds. This method is relatively crude and may be unsuitable for datasets with abundant historical data, since forcibly clipping a fixed proportion of the data can cause some loss of information.

2. 3σ / three standard deviations method.

The standard deviation σ reflects the dispersion of the data distribution, i.e., volatility. The range μ ± 3σ is used to identify and replace outliers in the dataset: approximately 99.73% of the data falls within this range. The method requires the factor data to follow a normal distribution, i.e., X ∼ N(μ, σ²), where

μ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ,  σ² = (1/n) ∑ᵢ₌₁ⁿ (Xᵢ − μ)²,

and the reasonable range of factor values is [μ − 3σ, μ + 3σ]. Factor values outside this range are clipped to the nearest boundary.

The limitation of this method is that data commonly used in quantitative finance, such as stock or token prices, often exhibits a peaked, heavy-tailed distribution that violates the normality assumption. In that case, the 3σ method may mistakenly identify a large amount of data as outliers.

3. Median Absolute Deviation (MAD) method.

This method is based on the median and absolute deviations, making it less sensitive to extreme values or outliers and more robust than methods based on the mean and standard deviation.

The median absolute deviation is MAD = median(|Xᵢ − X_median|).

The reasonable range of factor values is [X_median − n×MAD, X_median + n×MAD]; factor values outside this range are clipped to the nearest boundary.

```python
# Handling extreme values of factor data
import numpy as np

class Extreme(object):
    def __init__(self, ini_data):
        self.ini_data = ini_data

    def three_sigma(self, n=3):
        # Clip values outside [mean - n*std, mean + n*std]
        mean = self.ini_data.mean()
        std = self.ini_data.std()
        low = mean - n * std
        high = mean + n * std
        return np.clip(self.ini_data, low, high)

    def mad(self, n=3):
        # Clip values outside [median - n*MAD, median + n*MAD]
        median = self.ini_data.median()
        mad_median = abs(self.ini_data - median).median()
        high = median + n * mad_median
        low = median - n * mad_median
        return np.clip(self.ini_data, low, high)

    def quantile(self, l=0.025, h=0.975):
        # Clip values outside the [l, h] quantile range
        low = self.ini_data.quantile(l)
        high = self.ini_data.quantile(h)
        return np.clip(self.ini_data, low, high)
```
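As a concrete check of the MAD rule, the clipping step can be applied directly to a toy factor series with one obvious outlier (illustrative numbers only):

```python
import numpy as np
import pandas as pd

factor = pd.Series([1.0, 1.2, 0.9, 1.1, 1.0, 50.0])

# MAD = median of absolute deviations from the median
median = factor.median()
mad = (factor - median).abs().median()

# Clip to [median - 3*MAD, median + 3*MAD]; the 50.0 outlier is
# pulled back to the upper boundary
low, high = median - 3 * mad, median + 3 * mad
clipped = np.clip(factor, low, high)
```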

III. Standardization

1. Z-score standardization

z = (X − μ) / σ

– Assumption: X ∼ N(μ, σ²).

– Sensitive to outliers, since it uses the standard deviation.

2. Min-Max scaling

X_scaled = (X − min(X)) / (max(X) − min(X))

Transforms each factor into the range 0 to 1, allowing comparison of data with different scales or ranges. It does not change the internal distribution of the data, nor does it make the values sum to 1.

– Sensitive to outliers, since it depends on the extreme values.

– Puts data on a uniform scale, convenient for comparing different dimensions.

3. Rank scaling

Converts data features into their ranks and maps those ranks to scores between 0 and 1, typically their percentile within the dataset.

– Since ranks are unaffected by outliers, this method is not sensitive to outliers.

– Absolute distances between data points are not preserved; only their relative ranking is.

NormRankᵢ = (Rankₓᵢ − min(Rankₓ)) / (max(Rankₓ) − min(Rankₓ)) = Rankₓᵢ / N

where min(Rankₓ) = 0 and N is the total number of data points in the interval.

```python
# Standardizing factor data

class Scale(object):
    def __init__(self, ini_data, date):
        self.ini_data = ini_data
        self.date = date

    def zscore(self):
        # (x - mean) / std
        mean = self.ini_data.mean()
        std = self.ini_data.std()
        return self.ini_data.sub(mean).div(std)

    def maxmin(self):
        # (x - min) / (max - min), mapped into [0, 1]
        min_val = self.ini_data.min()
        max_val = self.ini_data.max()
        return self.ini_data.sub(min_val).div(max_val - min_val)

    def normRank(self):
        # Rank the data; method='min' gives tied values the same (lowest)
        # rank instead of the average rank
        ranks = self.ini_data.rank(method='min')
        return ranks.div(ranks.max())
```
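The three transforms can be checked directly on a small Series (illustrative data; these one-liners mirror the methods above):

```python
import pandas as pd

factor = pd.Series([2.0, 4.0, 6.0, 8.0])

# Z-score: zero mean, unit standard deviation
zscore = (factor - factor.mean()) / factor.std()

# Min-Max: mapped into [0, 1]
maxmin = (factor - factor.min()) / (factor.max() - factor.min())

# Rank scaling: percentile scores in (0, 1]
ranks = factor.rank(method='min')
rank_scaled = ranks / ranks.max()
```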

IV. Data Frequency

Sometimes, the data we obtain is not in the frequency we need for analysis. For example, if the analysis is conducted on a monthly basis, but the original data has a daily frequency, then “downsampling” is needed to aggregate the data into monthly frequency.

Downsampling

Aggregates a set of rows into a single row, e.g., daily data into monthly data. The characteristics of each aggregated indicator must be considered; common operations include:

– First value / last value

– Mean / median

– Standard deviation

Upsampling

Refers to breaking down one row of data into multiple rows of data, such as using annual data for monthly analysis. In this case, simple repetition is usually sufficient, but sometimes it is necessary to distribute the annual data proportionally among the months.
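Both directions can be sketched with pandas `resample` (toy daily series; the choice of aggregations per indicator is illustrative):

```python
import pandas as pd
import numpy as np

# Toy daily price data for two months: 100, 101, ..., 158
idx = pd.date_range("2023-01-01", "2023-02-28", freq="D")
daily = pd.Series(np.linspace(100, 158, len(idx)), index=idx)

# Downsampling: daily -> monthly, picking an aggregation per indicator
monthly = daily.resample("MS").agg(["first", "last", "mean", "std"])

# Upsampling: monthly last price -> daily, repeated forward
monthly_last = daily.resample("MS").last()
daily_again = monthly_last.resample("D").ffill()
```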

Falcon ([falcon.lucida.fund](/?s=https://falcon.lucida.fund)) is a next-generation Web3 investment infrastructure based on multi-factor models, helping users “select,” “buy,” “manage,” and “sell” crypto assets. Falcon was incubated by Lucida in June 2022.

For more information, visit [linktr.ee/lucida_and_falcon](/?s=https://linktr.ee/lucida_and_falcon).

We will continue to update this series; if you have any questions or suggestions, please [contact us!](/cdn-cgi/l/email-protection#4b2f253d0b27223d2e65282426)
