Python to STATA Cheat Sheet
Python has become the go-to language in the field of Data Science. However, other programming languages or statistical software are still being actively used depending on the particular industry or field you are immersed in. One best example is the field of Economics Research. Many economists and post-doctoral employees use this program called STATA for data wrangling, running regressions and plotting graphs. Most econometrics classes ask students to use STATA instead of Python for class projects and assignments. How is STATA different from Python or any other programming languages commonly used for Data Science?
I would say the characteristic of STATA that differentiates it the most from Python is that it is not a programming language. It’s more of a “software” that enables economists who don’t know how to code in any programming language to more intuitively deal with datasets to perform economics research. For example, as opposed to Python, a object oriented programming language, STATA does not allow for this “assignment” of data into “objects” or “variables”. The closest alternative STATA has is the “temporary files” which can be run using the tempfile command. As such, due to this fundamental difference and also some major syntax disparities, I myself had a pretty steep learning curve for getting used to STATA.
Furthermore, I found it very difficult to do “googling” to find additional information on how to perform specific tasks for STATA. Unlike Python which has myriads of e-books, blog posts, communities (e.g. stackoverflow) and tutorials online for various things you would like to do with your data, I had a hard time with STATA in that regard.
The purpose of this post is to provide a cheat sheet of how the commonly used Python commands and operations you might have used for Data Science translate into STATA so that you don’t have to spend hours searching for information on statalist (similar to stackoverflow but for STATA; does not have the most useful info or user-friendly interface) for something you could have so easily done in Python.
Reading and Saving Files
Reading .dta(stata) file
Stata) use file.dtaPython) pd.read_stata("file.dta")
Using .dta file with only specific columns
Stata) use var1 var2 using file.dta
Python) pd.read_stata("file.dta", columns=['var1','var2'])
Read in .xlsx or .csv files
Stata) import excel using excel_file.xlsx
Python) pd.read_excel('excel_file.xlsx')Stata) import delimited using csvfile.csv
Python) pd.read_csv("csvfile.csv")
Save as .dta file to certain directory
Stata)save file.dta, replacePython) df.to_stata("file.dta") ORdf.to_pickle("file.dta") (for Python-native file type)
The parameter “replace” above in the STATA command means you are replacing the pre-existing file.dta file with the current version that you are saving.
Save as .csv or excel file to certain directory
Stata) outsheet using csvfile, comma
Python) df.to_csv("csvfile.csv")Stata)export excel using excelfilePython)df.to_excel("excelfile.xlsx")
Description and Summary Statistics
Breakdown of values and their counts
Stata) tab var
==> count, percentage and cumulative percentagePython) df[‘var’].value_counts( )
==> just count breakdown
STATA allows abbreviated commands. For example, the above command “tab” will work with just “ta”.
Describe
Stata)describePython) df.info( ) OR df['var'].dtype
Total Number of Observations
Stata) countPython) df.shape[0] OR len(df).
==> Here df.shape returns a tuple with the length and width of the DataFrame.
If you combine the “count” command in STATA with the “if statement”, then you can count the number of observations that satisfy certain conditions.
e.g. count if name_last == “kim” (number of observations whose last name is kim)
Describe specific variable
Stata) summ varsumm var if <condition>Python) df['var'].describe( )df[<condition>][‘var’].describe( ) OR
df.loc[condition, ‘var’].describe()
Basic Data Wrangling and Operations (e.g. creation or dropping of new variable)
Sort observations using var or multiple vars
Stata) sort var1 var2Python) df.sort_values(['var1', 'var2'])
Rename variable names
Stata) rename var1 var2
==> rename var1 to var2rename (var1 var3) (var2 var4)
==> rename var1 to var3 and rename var3 to var4Python) df.rename(columns = {‘var1’:‘var2’}, inplace=True)
Keep or drop observations based on conditions
Stata) keep if <condition>drop if <condition>
Python)df = df[<condition>]df = df[~(<condition>)]
Keep or drop columns/variables
Stata) keep var1 var2…drop var1 var2…Python) df = df[[‘var1’, ‘var2’]]del df[‘var1’] OR df = df.drop(labels= [‘var1’,’var2’], axis=1)
Create new variables and replace
Stata) gen newvar = <expression>gen newvar = <expression> if <condition>replace newvar = <expression> if <condition>Python)df[‘newvar’] = <expression>
STATA also has this “egen” command which does the same thing as “gen” but is used for statistical or aggregate functions as you can see below:
Stata) egen newvar = max(var)egen newvar = mean(var)Python) newvar = df[‘var’].max()newvar = df[‘var’].mean()
Other Operations
Checking if certain values are included in a variable
Stata) inlist(var, “val1”, “val2”)Python) df[‘var’].isin([“val1”, “val2”])
Checking if values in a column/variable are within a specific range
Stata) inrange(<var>, ‘val1’, ‘val2’)Python) df[‘var’].between(‘val1’, ‘val2’)
**inclusive = False / True
Preserve & Restore Operations
Stata)
preserve
keep if missing(dob)
drop dob
save “mystata/dob_missing.dta”, replace
restorePython) Assign modified/wrangled dataframe in another variable
Preserve, Restore operation is something unique to STATA. This operation is useful because STATA does not allow for assignment of current data to an object or variable in memory (as I have mentioned in the beginning of the post of how STATA is not an object oriented programming language but rather a software like SPSS). In the above example, say you want to save a list of observations whose date of birth is missing. You keep only the observations with missing date of births and save it separately. Now you want to go back to the original FULL DATA to do some other operations. Unfortunately, the original full data is non-present and only the modified data is in the current memory. preserve-restore operation comes in handy in such cases.
try & except equivalent in STATA
Stata)
capture {
<some stata commands>
}
if _rc==0 {
display "your calculation is wrong"
}Python)
try:
2 + 2 == 5
except:
print("your calculation is wrong")
String Operations
Split operations
Stata) split n, p(“,“) gen(n_)
==> split the values in variable “n” on comma and make them into separate new columns named n_1, n_2…Python)df[‘n’].str.split( pat= “,”)
Removing Blanks
Stata)Say s is a stringstritrim(s): remove blanks
stritrim(s) returns s with all consecutive, internal blanks collapsed to one blank
strltrim(s) returns s with leading blanks removed
strrtrim(s) returns s with trailing blanks removed
strtrim(s) returns s with leading and trailing blanks removed.e.g. replace name_last = strtrim(stritrim(lower(name_last)))Python)df['last_name'] = df['last_name'].str.strip( )OR df['last_name'] = df['last_name'].str.replace(" ","")
Regular Expression
Stata)
gen last_name = "kim" if regexm(n, “kim([0-9]+)$”)
==> apply regex “kim([0-9]+)$” on variable/column "n" and if it satisfies that regex, create a new variable named last_name and fill in with "kim"Python)
https://docs.python.org/3/library/re.html
Parsing subset of strings
Stata)gen dob_yr = substr(DOB, -4, . ) # parse 4th from last to last stringsubstr(raw_name, 1, 3) # parse 3 characters starting from 1st indexed character from values in raw_name columnPython)df[‘dob_yr’] = df.DOB.apply(lambda x: x[-4: ])
Substituting strings with some other value/string
Stata) subinstr(var, “x”, “”, . ) #replace “x” with empty string(“”) for all values in column varPython) df[‘var’] = df.var.replace({“x”: “”})
Merge and Join Operations
append
Stata) use df1.dta
append using df2.dtaPython) df_joint = df1.append(df2)
merge
In STATA, there are four types of merges — 1:1, 1:m, m:1, and m:m.
Stata) use df1.dta
merge 1:1 <vars> using "df2.dta"Python) df_joint = pd.merge(df1, df2, on=<vars>)
One thing to keep in mind is that when you are doing 1:1 or 1:m in STATA, you need to make sure that the master data (whatever data you are using currently; FYI “using” data is the data that is being merged or appended onto this master data) is uniquely defined in every observation across these variables specified in <vars>. Otherwise, the system will complain to you with an error that says “observations in master not uniquely defined by var1 var2….” in red font.
STATA merging also has this parameter called “keep” which allows you to control which observation are kept and dropped during the merge process. It is basically the same as the “how” parameter in Python merging.
Python | Statahow='left' | keep(1, 3)
==> Keeps all observations in the "left" DFhow='right' | keep(2, 3)
==> Keeps all observations in the "right" DFhow='inner' | keep(3)
Keeps observations that are in both DFhow='outer' | keep(1 2 3)
==> Keeps all observations.
For more detailed information on merging in STATA, refer to this documentation: https://www.stata.com/manuals/dmerge.pd
Econometrics
T-test
Stata) ttest <var>, by(<var>)Python)
from scipy.stats import ttest_ind
ttest_ind(<array1>, <array2>)
Simple Linear Regression under certain conditions
Stata) reg <yvar> <xvar> if <condition>, rPython)
import econtools.metrics as mt
results = mt.reg(df[<condition>], <yvar>, <xvar>, robust=True)
A lot of people ask me which programming language or software is the best for data science. There is no straightforward answer. In fact, time allowing, being versatile in several programming languages and programs (though in different levels of proficiency) is highly beneficial because each language / program is best suited for different data science tasks. For instance, if you want to build simple but pretty visualizations, R’s ggplot might be the way to go instead of Python’s matplotlib which requires more coding and syntax to create the same plots. Similarly, having STATA in your toolbox allows you to maybe perform certain tasks more efficiently and faster than when you use Python to finish the same tasks. In that light, I hope this article was of help to those who would like to have both Python and STATA in their tool kits, ready to use any of them that is more appropriate to use for a certain task.
References
[1] G. Rodríguez, Stata Tutorial, Germán Rodríguez Personal Website
[2] D. M. Sullivan, Stata to Python Equivalents, D. M. Sullivan Personal Website
[3] Pandas Comparison with Stata, Pandas Documentation Page
I can be reached at the following :)
Website [ https://seungjun-data-science.github.io/ ]
LinkedIn [ https://www.linkedin.com/in/kimseungjun/ ]
Github [ https://github.com/Seungjun-Data-Science ]
Kaggle [ https://www.kaggle.com/juminator ]
Youtube [ https://www.youtube.com/user/juny1226 ]
Instagram [ https://www.instagram.com/juny.