Building a powerful cryptocurrency investment portfolio using a multi-factor strategy Data preprocessing chapter

Building a powerful cryptocurrency investment portfolio using a multi-factor strategy Data preprocessing chapter

Building a powerful cryptocurrency investment portfolio using a multi-factor strategy Data preprocessing chapter
Maximizing Cryptocurrency Investments A Comprehensive Guide to Data Preprocessing and Multi-Factor Strategy for Building a Strong Portfolio
In the previous article, we published the first part of the series titled “
[Constructing a Strong Cryptocurrency Portfolio Using Multi-Factor Strategies](/?s=Constructing+a+Strong+Cryptocurrency+Portfolio+Using+Multi-Factor+Strategies)“, which is the theoretical foundation. This article is the second part – Data Preprocessing.
Before/after computing factor data and testing the effectiveness of individual factors, it is necessary to preprocess the relevant data. The specific data preprocessing involves handling duplicates, outliers/missing values/extreme values, normalization, and data frequency.
1. Duplicates
Data Related Definitions:

Key: Represents a unique index. E.g., For a dataset with data for all tokens across different dates, the key is “token_id/contract_address – date”.

Value: The object indexed by the key is referred to as the “value”.
Detecting duplicates first requires understanding what the data “should” look like. Typically, data takes the following forms:

Time series data. Key is “time”. E.g., Price data for an individual token over 5 years.

Cross-sectional data. Key is “individual”. E.g., Price data for all tokens in the crypto market on November 1st, 2023.

Panel data. Key is a combination of “individual-time”. E.g., Price data for all tokens from January 1st, 2019 to November 1st, 2023.
Principle: Once the index (key) of the data is determined, it is possible to know at what level the data should not have duplicates.
Methods to Check:

pd.DataFrame.duplicated(subset=[key1, key2, …])

Check the number of duplicates:
pd.DataFrame.duplicated(subset=[key1, key2, …]).sum()

Sample the duplicate records:
df[df.duplicated(subset=[…])].sample(), and then use
df.locto select all duplicate records corresponding to that index.

pd.merge(df1, df2, on=[key1, key2, …], indicator=True, validate=’1:1′)

In the function for horizontal merging, add the
indicatorparameter, which will generate the
_mergefield. Using
dfm[‘_merge’].value_counts()will allow you to check the number of records from different sources after merging.

By including the
validateparameter, you can verify whether the indexes in the merged dataset match the expected ones (
1 to 1,
1 to many, or
many to many, with the last one actually not requiring validation). If it doesn’t match the expectation, the merging process will throw an error and stop execution.

2. Outliers/Missing Values/Extreme Values
Common Causes of Outliers:

Extreme situations. For example, a token with a price of $0.000001 or a market cap of only $500,000, where even a small change can result in a return rate of several times.

Data characteristics. For example, if token price data starts from January 1st, 2020, it is impossible to calculate the return rate data for January 1st, 2020, because there is no closing price for the previous day.

Data errors. Data providers are inevitably prone to mistakes, such as recording $12 per token as $1.2 per token.
Principles for handling outliers and missing values:

Delete. For outliers that cannot be reasonably corrected or adjusted, deletion can be considered.

Replace. This is usually used for extreme values, such as Winsorizing or taking the logarithm (less commonly used).

Fill. Missing values can also be filled in a reasonable way. Common methods include mean (or moving average), interpolation, filling with 0
df.fillna(0), forward filling
df.fillna(‘ffill’), or backward filling
df.fillna(‘bfill’), but the assumptions underlying the filling should be considered.
Be cautious when using backward filling in machine learning, as it carries a risk of look-ahead bias.
Methods for handling extreme values:
1. Percentile method.
By arranging the data in ascending order, replace the data that exceeds the minimum and maximum proportions with critical data. This method is relatively rough and may not be suitable for datasets with abundant historical data, as forcefully deleting a fixed proportion of data may result in a certain loss.
2. 3σ / Three standard deviations method.
The standard deviation σ factor reflects the dispersion of data distribution, i.e., volatility. Use μ±3×σ range to identify and replace outliers in the dataset, where approximately 99.73% of the data falls within this range. This method requires the factor data to follow a normal distribution, i.e., X∼N(μ,σ2).
Where μ=∑ⁿᵢ₌₁⋅Xi/N, σ²=∑ⁿᵢ₌₁=(xi-μ)²/n, and the reasonable range of factor values is [μ−3×σ,μ+3×σ].
Make the following adjustments to all factors within the data range:
The limitation of this method is that the commonly used data in the quantitative field, such as stock prices or token prices, often exhibit a peaked heavy-tailed distribution, which does not adhere to the assumption of a normal distribution. In this case, using the 3σ method may mistakenly identify a large amount of data as outliers.
3. Median Absolute Deviation (MAD) method.
This method is based on the median and absolute deviation, making the processed data less sensitive to extreme values or outliers. It is more robust compared to methods based on mean and standard deviation.
The median MAD value for absolute deviation is MAD=median( ∑ⁿᵢ₌₁(Xi – Xmedian) ).
The reasonable range of factor values is [ Xmedian-n×MAD, Xmedian + n×MAD]. Make the following adjustments to all factors within the data range:
# Handling Extreme Values of Factor Data
class Extreme(object):
def __init__(s, ini_data):
s.ini_data = ini_data
def three_sigma(s,n=3):
mean = s.ini_data.mean()
std = s.ini_data.std()
low = mean – n*std
high = mean + n*std
return np.clip(s.ini_data,low,high)
def mad(s, n=3):
median = s.ini_data.median()
mad_median = abs(s.ini_data – median).median()
high = median + n * mad_median
low = median – n * mad_median
return np.clip(s.ini_data, low, high)
def quantile(s,l = 0.025, h = 0.975):
low = s.ini_data.quantile(l)
high = s.ini_data.quantile(h)
return np.clip(s.ini_data, low, high)
III. Standardization
1. Z-score Standardization

Assumption: X ~ N(μ,σ)

This method is sensitive to outliers as it uses standard deviation.
2. Min-Max Scaling
Transforms each factor data into data ranging from 0 to 1, allowing for comparison of data with different scales or ranges. It does not change the internal distribution of the data or sum it up to 1.

Due to consideration of extreme values, it is sensitive to outliers.

Uniform measurement, convenient for comparing data in different dimensions.
3. Rank Scaling
Converts data features into their rankings and converts these rankings into scores between 0 and 1, typically their percentiles in the dataset.*

As rankings are not affected by outliers, this method is not sensitive to outliers.

Does not maintain the absolute distances between data points, but converts them into relative rankings.
where min(Rankₓ)=0, N is the total number of data points in the interval.
# Standardizing Factor Data
class Scale(object):
def __init__(s, ini_data,date):
s.ini_data = ini_data = date
def zscore(s):
mean = s.ini_data.mean()
std = s.ini_data.std()
return s.ini_data.sub(mean).div(std)
def maxmin(s):
min = s.ini_data.min()
max = s.ini_data.max()
return s.ini_data.sub(min).div(max – min)
def normRank(s):
# Ranks the specified columns, method=’min’ means that same values have the same rank instead of average ranking
ranks = s.ini_data.rank(method=’min’)
return ranks.div(ranks.max())
IV. Data Frequency
Sometimes, the data we obtain is not in the frequency we need for analysis. For example, if the analysis is conducted on a monthly basis, but the original data has a daily frequency, then “downsampling” is needed to aggregate the data into monthly frequency.
Refers to aggregating the data in a set into a single row of data, such as aggregating daily data into monthly data. At this time, the characteristics of each aggregated indicator need to be considered, and the usual operations include:

First value/Last value


Standard deviation
Refers to breaking down one row of data into multiple rows of data, such as using annual data for monthly analysis. In this case, simple repetition is usually sufficient, but sometimes it is necessary to distribute the annual data proportionally among the months.
Falcon (
(/?s= is the next-generation Web3 investment infrastructure based on multifactor models, helping users “select,” “buy,” “manage,” and “sell” crypto assets. Falcon was incubated by Lucida in June 2022.
For more information, visit
We will continue to update
Blocking; if you have any questions or suggestions, please [contact us!](/cdn-cgi/l/email-protection#4b2f253d0b27223d2e65282426)
Was this article helpful?
93 out of 132 found this helpful

Read More…

Leave a Reply

Your email address will not be published. Required fields are marked *

nine + 12 =